Excluded some notebooks from lengthy execution
percolator committed Nov 19, 2024
1 parent 27aee15 commit 2a0582f
Showing 9 changed files with 1,759 additions and 434 deletions.
6 changes: 5 additions & 1 deletion dsbook/_config.yml
@@ -11,7 +11,11 @@ copyright: "2024"
execute:
execute_notebooks: auto
timeout: 100

exclude_patterns:
- dsbook/network/gsea.ipynb
- dsbook/statistics/testing.ipynb
- dsbook/statistics/qvalue.ipynb
- dsbook/unsupervised/VAEofCarcinomas.ipynb

# Define the name of the latex output file for PDF builds
latex:
1 change: 1 addition & 0 deletions dsbook/_toc.yml
@@ -37,3 +37,4 @@ parts:
numbered: True
chapters:
- file: network/pathway.ipynb
- file: network/gsea.ipynb
502 changes: 140 additions & 362 deletions dsbook/network/gsea.ipynb

Large diffs are not rendered by default.

81 changes: 67 additions & 14 deletions dsbook/network/pathway.ipynb
@@ -2,19 +2,14 @@
"cells": [
{
"cell_type": "markdown",
"id": "3c37aa1f",
"id": "165eca95",
"metadata": {},
"source": [
"# Pathway Analysis\n",
"\n",
"## Introduction"
]
},
{
"cell_type": "markdown",
"id": "90b68e9f",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"\n",
"A typical output from a high-throughput experiment is a list of genes, transcripts, or proteins. Given this list, one might want to identify common functions among these analytes. Here, pathway analysis has become a go-to method for associating functions with experimental findings. Such analysis provides essential insights into the complex relationships among biological molecules and how they contribute to specific cellular functions. A pathway represents a set of biochemical reactions and interactions that take place within a cell or organism, involving metabolites, genes, and proteins. Understanding these pathways allows researchers to link molecular changes to phenotypic outcomes, such as disease states or responses to treatment.\n",
"\n",
"## Pathway Databases\n",
@@ -69,10 +64,68 @@
"\n",
"### Enrichment Score and Null Distribution\n",
"\n",
"GSEA works by calculating an **enrichment score (ES)**, which measures how often genes from the gene set of interest appear in the ranked list. Starting at the top of the ranked gene list, an enrichment score is computed by walking down the list, increasing when a gene is in the gene set and decreasing otherwise.\n",
"GSEA works by calculating an **enrichment score (ES)**, which measures how often genes from the gene set of interest appear in the ranked list. Starting at the top of the ranked gene list, an enrichment score is computed by walking down the list, increasing when a gene is in the gene set and decreasing otherwise. In other words, the score reflects how many gene-set members have been encountered so far, compared with what you would expect if they were uniformly distributed along the ranked list.\n",
"\n",
"To assess the statistical significance of the observed enrichment score, GSEA uses a **null distribution** obtained through permutation. The ranked gene list is shuffled many times to generate a background distribution of ES values, which can then be used to calculate the p-value for the observed enrichment score.\n",
"\n",
"Here is an illustration of the enrichment score. We generate a normally distributed dataset of 30 samples covering 100 background genes. We also include 15 genes from the same pathway, which we simulate as \"regulated\", i.e. with an additional random offset between the \"Healthy\" and the \"Sick\" samples. GSEA ranks the genes and displays the positions of the pathway genes as black lines among the genes not in the pathway, which are shown as white lines. If the black lines were evenly distributed, the enrichment of the pathway genes would be zero. However, we devised the test so that the black lines are concentrated toward the left of the distribution. This results in an increased enrichment score."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe327d05",
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import gseapy as gp\n",
"\n",
"# Seed for reproducibility\n",
"np.random.seed(42)\n",
"\n",
"n_genes_in_pathway = 15\n",
"n_genes_in_background = 100\n",
"n_samples_per_group = 15\n",
"\n",
"pathway_genes = { f\"PathwayGene{i}\" for i in range(1, n_genes_in_pathway + 1 ) }\n",
"pathway_db = {\"my_pathway\" : pathway_genes }\n",
"genes = list(pathway_genes) + [f\"Gene{i}\" for i in range(1, n_genes_in_background + 1 )]\n",
"samples = [f\"Sample{j}\" for j in range(1, 2*n_samples_per_group + 1)]\n",
"fake_data = pd.DataFrame(np.random.normal(0, 1, size=(len(genes), len(samples))), index=genes, columns=samples)\n",
"\n",
"for gene in pathway_genes:\n",
"    if gene in fake_data.index:\n",
"        # Make pathway genes have higher values in the first half of the samples\n",
"        fake_data.loc[gene, fake_data.columns[:n_samples_per_group]] += np.random.normal(0.5, 0.5, size=n_samples_per_group)\n",
"\n",
"labels = [\"Healthy\"]*n_samples_per_group + [\"Sick\"]*n_samples_per_group\n",
"\n",
"gs = gp.GSEA(data=fake_data, \n",
"             gene_sets=pathway_db, \n",
"             classes=labels, # cls=class_vector\n",
"             permutation_type='phenotype', # null from permutations of class labels\n",
"             permutation_num=2000, # reduce number to speed up test\n",
"             outdir=None, # do not write output to disk\n",
"             no_plot=True, # Skip plotting\n",
"             method='signal_to_noise',\n",
"             threads=4, # Number of allowed parallel processes\n",
"             seed=42,\n",
"             format='png',)\n",
"gs.run()\n",
"gs.plot(\"my_pathway\", show_ranking=False, legend_kws={'loc': (1.05, 0)}, )"
]
},
{
"cell_type": "markdown",
"id": "0e90b7ba",
"metadata": {},
"source": [
"### Kolmogorov-Smirnov (KS) Test\n",
"\n",
"The **Kolmogorov-Smirnov (KS) test** is used in GSEA to calculate the enrichment score. The KS test is a non-parametric test that measures the maximum deviation between the observed cumulative distribution of the gene set and the expected distribution under the null hypothesis. In GSEA, the enrichment score is effectively the maximum deviation encountered as we move down the ranked list, capturing whether genes from the pathway are found disproportionately at the top or bottom of the list. This score is then compared against the null distribution to determine statistical significance.\n",
@@ -86,10 +139,10 @@
}
],
"metadata": {
"jupytext": {
"cell_metadata_filter": "-all",
"main_language": "python",
"notebook_metadata_filter": "-all"
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
60 changes: 59 additions & 1 deletion dsbook/network/pathway.md
@@ -1,3 +1,15 @@
---
kernelspec:
display_name: Python 3
language: python
name: python3
jupytext:
formats: md:myst
text_representation:
extension: .md
format_name: myst
---

# Pathway Analysis

## Introduction
@@ -57,10 +69,56 @@ In GSEA, gene sets corresponding to known biological pathways are tested for the

### Enrichment Score and Null Distribution

GSEA works by calculating an **enrichment score (ES)**, which measures how often genes from the gene set of interest appear in the ranked list. Starting at the top of the ranked gene list, an enrichment score is computed by walking down the list, increasing when a gene is in the gene set and decreasing otherwise.
GSEA works by calculating an **enrichment score (ES)**, which measures how often genes from the gene set of interest appear in the ranked list. Starting at the top of the ranked gene list, an enrichment score is computed by walking down the list, increasing when a gene is in the gene set and decreasing otherwise. In other words, the score reflects how many gene-set members have been encountered so far, compared with what you would expect if they were uniformly distributed along the ranked list.

To assess the statistical significance of the observed enrichment score, GSEA uses a **null distribution** obtained through permutation. The ranked gene list is shuffled many times to generate a background distribution of ES values, which can then be used to calculate the p-value for the observed enrichment score.
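
The permutation idea can be sketched in a few lines of NumPy. This is a simplified, unweighted illustration and not the implementation used by GSEA or gseapy: the function names `enrichment_score` and `permutation_p_value` are hypothetical, and the null is built by shuffling gene-set membership along the ranked list rather than by permuting phenotype labels.

```python
import numpy as np

def enrichment_score(in_set):
    """Unweighted running-sum ES for a boolean membership vector ordered by rank.

    Assumes at least one gene inside and one gene outside the gene set.
    """
    in_set = np.asarray(in_set, dtype=bool)
    n_hit = int(in_set.sum())
    n_miss = in_set.size - n_hit
    # Step up by 1/n_hit at a gene-set member and down by 1/n_miss otherwise,
    # so the walk starts and ends at zero.
    steps = np.where(in_set, 1.0 / n_hit, -1.0 / n_miss)
    running = np.cumsum(steps)
    # The ES is the running-sum value with the largest absolute deviation from zero.
    return running[np.argmax(np.abs(running))]

def permutation_p_value(in_set, n_perm=1000, seed=0):
    """Two-sided permutation p-value obtained by shuffling gene-set membership."""
    rng = np.random.default_rng(seed)
    observed = enrichment_score(in_set)
    null = np.array([enrichment_score(rng.permutation(in_set)) for _ in range(n_perm)])
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_perm + 1)
    return observed, p_value

# Hypothetical example: 10 gene-set members occupy the top of a 100-gene ranking.
in_set = np.zeros(100, dtype=bool)
in_set[:10] = True
observed_es, p_value = permutation_p_value(in_set, n_perm=1000, seed=0)
print(f"ES = {observed_es:.2f}, p = {p_value:.4f}")
```

The smallest p-value this sketch can report is 1 / (`n_perm` + 1), which is one reason the number of permutations matters.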

Here is an illustration of the enrichment score. We generate a normally distributed dataset of 30 samples covering 100 background genes. We also include 15 genes from the same pathway, which we simulate as "regulated", i.e. with an additional random offset between the "Healthy" and the "Sick" samples. GSEA ranks the genes and displays the positions of the pathway genes as black lines among the genes not in the pathway, which are shown as white lines. If the black lines were evenly distributed, the enrichment of the pathway genes would be zero. However, we devised the test so that the black lines are concentrated toward the left of the distribution. This results in an increased enrichment score among the top-ranked genes. Annoyingly, the enrichment plot appears twice in the output below.

```{code-cell} ipython3
:tags: [hide-input]

import numpy as np
import pandas as pd
import gseapy as gp

# Seed for reproducibility
np.random.seed(42)

n_genes_in_pathway = 15
n_genes_in_background = 100
n_samples_per_group = 15

pathway_genes = { f"PathwayGene{i}" for i in range(1, n_genes_in_pathway + 1 ) }
pathway_db = {"my_pathway" : pathway_genes }
genes = list(pathway_genes) + [f"Gene{i}" for i in range(1, n_genes_in_background + 1 )]
samples = [f"Sample{j}" for j in range(1, 2*n_samples_per_group + 1)]
fake_data = pd.DataFrame(np.random.normal(0, 1, size=(len(genes), len(samples))), index=genes, columns=samples)

for gene in pathway_genes:
    if gene in fake_data.index:
        # Make pathway genes have higher values in the first half of the samples
        fake_data.loc[gene, fake_data.columns[:n_samples_per_group]] += np.random.normal(0.5, 0.5, size=n_samples_per_group)

labels = ["Healthy"]*n_samples_per_group + ["Sick"]*n_samples_per_group

gs = gp.GSEA(data=fake_data,
             gene_sets=pathway_db,
             classes=labels, # cls=class_vector
             permutation_type='phenotype', # null from permutations of class labels
             permutation_num=2000, # reduce number to speed up test
             outdir=None, # do not write output to disk
             no_plot=True, # Skip plotting
             method='signal_to_noise',
             threads=4, # Number of allowed parallel processes
             seed=42,
             format='png',)
gs.run()
gs.plot("my_pathway", show_ranking=False)
gs.res2d
```

For a more detailed explanation of the enrichment score, please see the original paper, [Subramanian et al.](https://www.pnas.org/doi/10.1073/pnas.0506580102).

### Kolmogorov-Smirnov (KS) Test

The **Kolmogorov-Smirnov (KS) test** is used in GSEA to calculate the enrichment score. The KS test is a non-parametric test that measures the maximum deviation between the observed cumulative distribution of the gene set and the expected distribution under the null hypothesis. In GSEA, the enrichment score is effectively the maximum deviation encountered as we move down the ranked list, capturing whether genes from the pathway are found disproportionately at the top or bottom of the list. This score is then compared against the null distribution to determine statistical significance.
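
The link to the KS statistic can be checked numerically. The sketch below assumes the unweighted form of the enrichment score, whereas GSEA in practice typically weights each step by the gene's ranking metric. The toy ranking and all variable names are made up for illustration. It compares the maximum deviation between two empirical cumulative distributions (rank positions of genes inside versus outside the set) with the extreme value of the running sum.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_hits = 100, 10

# Hypothetical ranking: all gene-set members fall within the top 30 positions.
in_set = np.zeros(n_genes, dtype=bool)
in_set[rng.choice(30, size=n_hits, replace=False)] = True

positions = np.arange(1, n_genes + 1)
hit_pos = positions[in_set]     # rank positions of genes in the gene set
miss_pos = positions[~in_set]   # rank positions of all other genes

# Empirical CDFs of both groups, evaluated at every rank position.
ecdf_hit = np.searchsorted(hit_pos, positions, side="right") / hit_pos.size
ecdf_miss = np.searchsorted(miss_pos, positions, side="right") / miss_pos.size

# Two-sample KS statistic: the largest absolute gap between the two ECDFs.
ks_statistic = np.max(np.abs(ecdf_hit - ecdf_miss))

# Running-sum walk: up by 1/n_hits at a member, down by 1/n_misses otherwise.
steps = np.where(in_set, 1.0 / hit_pos.size, -1.0 / miss_pos.size)
running = np.cumsum(steps)
es = running[np.argmax(np.abs(running))]

print(f"KS statistic = {ks_statistic:.3f}, ES = {es:.3f}")
```

The running sum at each position equals the difference between the two ECDFs, so the absolute value of the unweighted ES coincides with the two-sample KS statistic, while the sign of the ES indicates whether the gene set sits toward the top or the bottom of the list.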
6 changes: 3 additions & 3 deletions dsbook/statistics/multiple.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "markdown",
"id": "5243e4a6",
"id": "05ac3adf",
"metadata": {},
"source": [
"# Multiple Hypothesis Corrections\n",
@@ -135,7 +135,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "88e76665",
"id": "4a4669ee",
"metadata": {
"tags": [
"hide-input"
@@ -187,7 +187,7 @@
},
{
"cell_type": "markdown",
"id": "111dc413",
"id": "82bc3085",
"metadata": {},
"source": [
"In the plot above, we simulate $p$ values for a large number of hypotheses (1,000 in this case). Half of these hypotheses represent **nulls** (meaning there is no effect), while the other half represent **alternative hypotheses** (meaning there is a true effect).\n",
261 changes: 240 additions & 21 deletions dsbook/statistics/qvalue.ipynb

Large diffs are not rendered by default.
