Excluded some notebooks from lengthy execution

statisticalbiotechnology · Nov 19, 2024 · 2a0582f
1 parent 27aee15
commit 2a0582f

Showing 9 changed files with 1,759 additions and 434 deletions.
diff --git a/dsbook/_config.yml b/dsbook/_config.yml
@@ -11,7 +11,11 @@ copyright: "2024"
 execute:
   execute_notebooks: auto
   timeout: 100
-
+  exclude_patterns:
+    - dsbook/network/gsea.ipynb
+    - dsbook/statistics/testing.ipynb
+    - dsbook/statistics/qvalue.ipynb
+    - dsbook/unsupervised/VAEofCarcinomas.ipynb
 
 # Define the name of the latex output file for PDF builds
 latex:

diff --git a/dsbook/_toc.yml b/dsbook/_toc.yml
@@ -37,3 +37,4 @@ parts:
     numbered: True
     chapters:
     - file: network/pathway.ipynb
+    - file: network/gsea.ipynb
diff --git a/dsbook/network/gsea.ipynb b/dsbook/network/gsea.ipynb
diff --git a/dsbook/network/pathway.ipynb b/dsbook/network/pathway.ipynb
@@ -2,19 +2,14 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "3c37aa1f",
+   "id": "165eca95",
    "metadata": {},
    "source": [
     "# Pathway Analysis\n",
     "\n",
-    "## Introduction"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "90b68e9f",
-   "metadata": {},
-   "source": [
+    "## Introduction\n",
+    "\n",
+    "\n",
     "A typical output from a high-throughput experiment is a list of genes, transcripts, or proteins. Given this list, one might want to identify common functions among these analytes. Here, pathway analysis has become a go-to method for associating functions with experimental findings. Such analysis provides essential insights into the complex relationships among biological molecules and how they contribute to specific cellular functions. A pathway represents a set of biochemical reactions and interactions that take place within a cell or organism, involving metabolites, genes, and proteins. Understanding these pathways allows researchers to link molecular changes to phenotypic outcomes, such as disease states or responses to treatment.\n",
     "\n",
     "## Pathway Databases\n",
@@ -69,10 +64,68 @@
     "\n",
     "### Enrichment Score and Null Distribution\n",
     "\n",
-    "GSEA works by calculating an **enrichment score (ES)**, which measures how often genes from the gene set of interest appear in the ranked list. Starting at the top of the ranked gene list, an enrichment score is computed by walking down the list, increasing when a gene is in the gene set and decreasing otherwise.\n",
+    "GSEA works by calculating an **enrichment score (ES)**, which measures how often genes from the gene set of interest appear in the ranked list. Starting at the top of the ranked gene list, an enrichment score is computed by walking down the list, increasing when a gene is in the gene set and decreasing otherwise. That is, it reflects how many genes from the set have been encountered compared with what you would expect if they were uniformly distributed among the genes.\n",
     "\n",
     "To assess the statistical significance of the observed enrichment score, GSEA uses a **null distribution** obtained through permutation. The ranked gene list is shuffled many times to generate a background distribution of ES values, which can then be used to calculate the p-value for the observed enrichment score.\n",
     "\n",
+    "Here is an illustration of the enrichment score. We generate a normally distributed dataset of 30 samples covering 100 background genes. We also include 15 genes from the same pathway, which we simulate as \"regulated\", i.e., with an additional random offset between the \"Healthy\" and the \"Sick\" samples. GSEA ranks the genes and displays the positions of the pathway genes as black lines among the genes not in the pathway, which are shown as white lines. If the black lines were evenly distributed, the enrichment of the pathway genes would be close to zero. However, we devised the test so that the black lines lie more to the left of the distribution, which results in an increased enrichment score."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fe327d05",
+   "metadata": {
+    "tags": [
+     "hide-input"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "import gseapy as gp\n",
+    "\n",
+    "# Seed for reproducibility\n",
+    "np.random.seed(42)\n",
+    "\n",
+    "n_genes_in_pathway = 15\n",
+    "n_genes_in_background = 100\n",
+    "n_samples_per_group = 15\n",
+    "\n",
+    "pathway_genes = { f\"PathwayGene{i}\" for i in range(1, n_genes_in_pathway + 1 ) }\n",
+    "pathway_db = {\"my_pathway\" : pathway_genes }\n",
+    "genes = list(pathway_genes) + [f\"Gene{i}\" for i in range(1, n_genes_in_background + 1 )]\n",
+    "samples = [f\"Sample{j}\" for j in range(1, 2*n_samples_per_group + 1)]\n",
+    "fake_data = pd.DataFrame(np.random.normal(0, 1, size=(len(genes), len(samples))), index=genes, columns=samples)\n",
+    "\n",
+    "for gene in pathway_genes:\n",
+    "    if gene in fake_data.index:\n",
+    "        # Give pathway genes higher values in the first half of the samples (label-based selection with .loc)\n",
+    "        fake_data.loc[gene, fake_data.columns[:n_samples_per_group]] += np.random.normal(0.5, 0.5, size=n_samples_per_group)\n",
+    "\n",
+    "labels = [\"Healthy\"]*n_samples_per_group + [\"Sick\"]*n_samples_per_group\n",
+    "\n",
+    "gs = gp.GSEA(data=fake_data, \n",
+    "                 gene_sets=pathway_db, \n",
+    "                 classes=labels, # cls=class_vector\n",
+    "                 permutation_type='phenotype', # null from permutations of class labels\n",
+    "                 permutation_num=2000, # reduce number to speed up test\n",
+    "                 outdir=None,  # do not write output to disk\n",
+    "                 no_plot=True, # Skip plotting\n",
+    "                 method='signal_to_noise',\n",
+    "                 threads=4, # Number of allowed parallel processes\n",
+    "                 seed=42,\n",
+    "                 format='png',)\n",
+    "gs.run()\n",
+    "gs.plot(\"my_pathway\", show_ranking=False, legend_kws={'loc': (1.05, 0)}, )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0e90b7ba",
+   "metadata": {},
+   "source": [
     "### Kolmogorov-Smirnov (KS) Test\n",
     "\n",
     "The **Kolmogorov-Smirnov (KS) test** is used in GSEA to calculate the enrichment score. The KS test is a non-parametric test that measures the maximum deviation between the observed cumulative distribution of the gene set and the expected distribution under the null hypothesis. In GSEA, the enrichment score is effectively the maximum deviation encountered as we move down the ranked list, capturing whether genes from the pathway are found disproportionately at the top or bottom of the list. This score is then compared against the null distribution to determine statistical significance.\n",
@@ -86,10 +139,10 @@
   }
  ],
  "metadata": {
-  "jupytext": {
-   "cell_metadata_filter": "-all",
-   "main_language": "python",
-   "notebook_metadata_filter": "-all"
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
   }
  },
  "nbformat": 4,

diff --git a/dsbook/network/pathway.md b/dsbook/network/pathway.md
@@ -1,3 +1,15 @@
+---
+kernelspec:
+  display_name: Python 3
+  language: python
+  name: python3
+jupytext:
+  formats: md:myst
+  text_representation:
+    extension: .md
+    format_name: myst
+---
+
 # Pathway Analysis
 
 ## Introduction
@@ -57,10 +69,56 @@ In GSEA, gene sets corresponding to known biological pathways are tested for the
 
 ### Enrichment Score and Null Distribution
 
-GSEA works by calculating an **enrichment score (ES)**, which measures how often genes from the gene set of interest appear in the ranked list. Starting at the top of the ranked gene list, an enrichment score is computed by walking down the list, increasing when a gene is in the gene set and decreasing otherwise.
+GSEA works by calculating an **enrichment score (ES)**, which measures how often genes from the gene set of interest appear in the ranked list. Starting at the top of the ranked gene list, an enrichment score is computed by walking down the list, increasing when a gene is in the gene set and decreasing otherwise. I.e. it reflects how many genes encountered as compared to what you would expect if they where uniformly distributed among the genes.
 
 To assess the statistical significance of the observed enrichment score, GSEA uses a **null distribution** obtained through permutation. The ranked gene list is shuffled many times to generate a background distribution of ES values, which can then be used to calculate the p-value for the observed enrichment score.
 
+Here is an illustration of the enrichment score. We generate a normally distributed dataset of 30 samples covering 100 background genes. We also include 15 genes from the same pathway, which we simulate as "regulated", i.e., with an additional random offset between the "Healthy" and the "Sick" samples. GSEA ranks the genes and displays the positions of the pathway genes as black lines among the genes not in the pathway, which are shown as white lines. If the black lines were evenly distributed, the enrichment of the pathway genes would be close to zero. However, we devised the test so that the black lines lie more to the left of the distribution, which results in an increased enrichment score for the genes at the top of the ranking. Annoyingly, the enrichment plot appears twice in the output below.
+
+```{code-cell} ipython3
+:tags: [hide-input]
+import numpy as np
+import pandas as pd
+import gseapy as gp
+
+# Seed for reproducibility
+np.random.seed(42)
+
+n_genes_in_pathway = 15
+n_genes_in_background = 100
+n_samples_per_group = 15
+
+pathway_genes = { f"PathwayGene{i}" for i in range(1, n_genes_in_pathway + 1 ) }
+pathway_db = {"my_pathway" : pathway_genes }
+genes = list(pathway_genes) + [f"Gene{i}" for i in range(1, n_genes_in_background + 1 )]
+samples = [f"Sample{j}" for j in range(1, 2*n_samples_per_group + 1)]
+fake_data = pd.DataFrame(np.random.normal(0, 1, size=(len(genes), len(samples))), index=genes, columns=samples)
+
+for gene in pathway_genes:
+    if gene in fake_data.index:
+        # Give pathway genes higher values in the first half of the samples (label-based selection with .loc)
+        fake_data.loc[gene, fake_data.columns[:n_samples_per_group]] += np.random.normal(0.5, 0.5, size=n_samples_per_group)
+
+labels = ["Healthy"]*n_samples_per_group + ["Sick"]*n_samples_per_group
+
+gs = gp.GSEA(data=fake_data, 
+                 gene_sets=pathway_db, 
+                 classes=labels, # cls=class_vector
+                 permutation_type='phenotype', # null from permutations of class labels
+                 permutation_num=2000, # reduce number to speed up test
+                 outdir=None,  # do not write output to disk
+                 no_plot=True, # Skip plotting
+                 method='signal_to_noise',
+                 threads=4, # Number of allowed parallel processes
+                 seed=42,
+                 format='png',)
+gs.run()
+gs.plot("my_pathway", show_ranking=False)
+gs.res2d
+```
+
+For a more detailed explanation of the enrichment score, see the original paper, [Subramanian et al.](https://www.pnas.org/doi/10.1073/pnas.0506580102).
+
 ### Kolmogorov-Smirnov (KS) Test
 
 The **Kolmogorov-Smirnov (KS) test** is used in GSEA to calculate the enrichment score. The KS test is a non-parametric test that measures the maximum deviation between the observed cumulative distribution of the gene set and the expected distribution under the null hypothesis. In GSEA, the enrichment score is effectively the maximum deviation encountered as we move down the ranked list, capturing whether genes from the pathway are found disproportionately at the top or bottom of the list. This score is then compared against the null distribution to determine statistical significance.
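
Note on the added text: the running-sum description and the permutation null can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code GSEApy runs. It uses the unweighted (KS-like) step sizes, whereas GSEA as published also weights hits by the ranking metric, and it builds the null by shuffling gene positions, whereas the notebook cell in this commit permutes phenotype labels (`permutation_type='phenotype'`). The function names `enrichment_score` and `permutation_p_value` are hypothetical.

```python
import random
from itertools import accumulate


def enrichment_score(in_set):
    """Unweighted running-sum ES for a ranked list of booleans (True = gene in set)."""
    n_hit = sum(in_set)
    n_miss = len(in_set) - n_hit
    # Step up by 1/n_hit on a pathway gene, down by 1/n_miss otherwise,
    # so the running sum starts at zero and returns to zero at the end.
    steps = [1.0 / n_hit if hit else -1.0 / n_miss for hit in in_set]
    running = list(accumulate(steps))
    # The ES is the running-sum value that deviates most from zero.
    return max(running, key=abs)


def permutation_p_value(in_set, n_perm=1000, seed=0):
    """Two-sided p-value from a null built by shuffling gene positions."""
    rng = random.Random(seed)
    observed = enrichment_score(in_set)
    shuffled = list(in_set)
    n_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(enrichment_score(shuffled)) >= abs(observed):
            n_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (n_extreme + 1) / (n_perm + 1)


# Toy ranking (an assumption for illustration): 5 pathway genes
# at the very top of a list of 25 genes.
in_set = [True] * 5 + [False] * 20
es, p = permutation_p_value(in_set)
print(f"ES = {es:.2f}, p = {p:.4f}")
```

Shuffling the membership vector is equivalent to shuffling the ranked gene list while keeping the gene set fixed, which keeps the sketch short.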

diff --git a/dsbook/statistics/multiple.ipynb b/dsbook/statistics/multiple.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "5243e4a6",
+   "id": "05ac3adf",
    "metadata": {},
    "source": [
     "# Multiple Hypothesis Corrections\n",
@@ -135,7 +135,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "88e76665",
+   "id": "4a4669ee",
    "metadata": {
     "tags": [
      "hide-input"
@@ -187,7 +187,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "111dc413",
+   "id": "82bc3085",
    "metadata": {},
    "source": [
     "In the plot above, we simulate $p$ values for a large number of hypotheses (1,000 in this case). Half of these hypotheses represent **nulls** (meaning there is no effect), while the other half represent **alternative hypotheses** (meaning there is a true effect).\n",

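Note on the simulation described in `multiple.ipynb`: the notebook's own plotting code is not part of this diff, so the following is only a sketch of how a half-null, half-alternative set of $p$ values could be generated with the standard library. The one-sided z-test and the effect size of 2.5 are assumptions chosen for illustration, and `simulate_p_values` is a hypothetical name.

```python
import math
import random


def simulate_p_values(n_null=500, n_alt=500, effect=2.5, seed=42):
    """Simulate one-sided z-test p values for null and alternative hypotheses."""
    rng = random.Random(seed)

    def one_sided_p(z):
        # Upper-tail probability of a standard normal at z.
        return 0.5 * math.erfc(z / math.sqrt(2))

    # Under the null, z ~ N(0, 1), so the p values are uniform on [0, 1].
    null_p = [one_sided_p(rng.gauss(0, 1)) for _ in range(n_null)]
    # Under the alternative, z is shifted by the (assumed) effect size,
    # which concentrates the p values near zero.
    alt_p = [one_sided_p(rng.gauss(effect, 1)) for _ in range(n_alt)]
    return null_p, alt_p


null_p, alt_p = simulate_p_values()
threshold = 0.05
false_pos = sum(p < threshold for p in null_p)
true_pos = sum(p < threshold for p in alt_p)
print(f"Below {threshold}: {false_pos} nulls, {true_pos} alternatives")
```

Counting how many nulls fall below the threshold makes the multiple-testing problem concrete: with hundreds of true nulls, an uncorrected threshold admits false positives by chance alone.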
diff --git a/dsbook/statistics/qvalue.ipynb b/dsbook/statistics/qvalue.ipynb