some updates after lecture
percolator committed Nov 27, 2024
1 parent 7ff6983 commit 67d85b6
Showing 7 changed files with 81 additions and 101 deletions.
18 changes: 8 additions & 10 deletions dsbook/network/pathway.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "markdown",
-"id": "8dce544d",
+"id": "99bb19c6",
"metadata": {},
"source": [
"# Pathway Analysis\n",
@@ -14,17 +14,15 @@
"\n",
"## Pathway Databases\n",
"\n",
-"Examples of commonly used pathway databases include [**KEGG**](https://www.kegg.jp/), [**Reactome**](https://reactome.org/), [**WikiPathways**](https://www.wikipathways.org/), and [**BioCyc**](https://biocyc.org/), which provide curated pathways for metabolic and signaling processes across various organisms. Specialized databases like [**MetaCyc**](https://metacyc.org/) (metabolic pathways), [**SIGNOR**](https://signor.uniroma2.it/) (signaling networks), and [**CTD**](http://ctdbase.org/) (toxicogenomic interactions) cater to specific research needs, while [**MSigDB**](https://www.gsea-msigdb.org/gsea/msigdb/) and [**PANTHER**](http://www.pantherdb.org/) are frequently used in enrichment analyses. Although not formally a pathway database, [**Gene Ontology (GO)**](http://geneontology.org/) is often used with pathway-based methods to identify enriched biological processes, cellular components, and molecular functions from gene or protein lists.\n",
-"\n",
-"Two of the most widely used databases for pathway analysis are KEGG and Reactome.\n",
+"Examples of commonly used pathway databases include [**KEGG**](https://www.kegg.jp/), [**Reactome**](https://reactome.org/), [**WikiPathways**](https://www.wikipathways.org/), and [**BioCyc**](https://biocyc.org/), which provide curated pathways for metabolic and signaling processes across various organisms. For specialized needs, researchers use databases like [**MetaCyc**](https://metacyc.org/) for metabolic pathways, [**SIGNOR**](https://signor.uniroma2.it/) for signaling networks, and [**CTD**](http://ctdbase.org/) for studying toxicogenomic interactions. Although not formally a pathway database, [**Gene Ontology (GO)**](http://geneontology.org/) is often utilized alongside pathway-based methods to identify enriched biological processes, cellular components, and molecular functions from gene or protein lists.\n",
"\n",
"### KEGG\n",
"\n",
-"The [**Kyoto Encyclopedia of Genes and Genomes (KEGG)**](https://www.genome.jp/kegg/) is a manually curated database that offers a collection of high-level maps integrating genomic, chemical, and systemic functional information. KEGG provides comprehensive pathway maps, including metabolic pathways, signal transduction pathways, and regulatory pathways. KEGG pathways are represented as graphical diagrams, which help in visualizing molecular interactions and their roles in specific biological functions.\n",
+"The [**Kyoto Encyclopedia of Genes and Genomes (KEGG)**](https://www.genome.jp/kegg/) is a manually curated database that integrates genomic, chemical, and systemic functional information into comprehensive pathway maps, including metabolic and regulatory pathways. KEGG pathways are depicted in graphical diagrams, facilitating the visualization of molecular interactions and their roles in specific biological functions.\n",
"\n",
"### Reactome\n",
"\n",
-"[**Reactome**](https://reactome.org/PathwayBrowser/) is another prominent pathway database that provides detailed information about cellular processes, including metabolic reactions, signal transduction, immune system functions, and more. Reactome is an open-source, manually curated knowledge base, focusing on the relationships between genes, proteins, and other molecules in the context of biological pathways. Compared to KEGG, Reactome provides finer details about molecular interactions and is enriched by contributions from experts in the field.\n",
+"[**Reactome**](https://reactome.org/PathwayBrowser/) is a detailed, open-source, manually curated knowledge base that provides information about cellular processes like metabolic reactions and immune system functions. Reactome focuses on the relationships between genes, proteins, and other molecules, offering more granular details about molecular interactions than KEGG.\n",
"\n",
"## Over Representation Analysis (ORA)\n",
"\n",
@@ -62,7 +60,7 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "170e3fdf",
+"id": "c7e615b7",
"metadata": {
"tags": [
"hide-input"
@@ -112,7 +110,7 @@
},
{
"cell_type": "markdown",
-"id": "25c2f7a1",
+"id": "40839767",
"metadata": {},
"source": [
"Above is an illustration of the expected overlap (6%) of a pathway consisting of 20% of all genes and a gene-list of 30% of all genes. \n",
@@ -133,7 +131,7 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "d44c99e9",
+"id": "3be344bc",
"metadata": {
"tags": [
"hide-input"
@@ -188,7 +186,7 @@
},
{
"cell_type": "markdown",
-"id": "0ccf737d",
+"id": "f1a71a16",
"metadata": {},
"source": [
"To assess the statistical significance of the observed enrichment score, GSEA uses a [sampling distribution](sec:statistics:sampling) of ES obtained through permutation. The ranked gene list is shuffled many times to generate a background distribution of ES values, which can then be used to calculate the p-value for the observed enrichment score.\n",
Expand Down
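The permutation step described in the last visible cell of this notebook can be sketched in a few lines. This is a minimal illustration, not the notebook's code: it assumes a simplified, unweighted running-sum enrichment score and a two-sided comparison, and the names `enrichment_score`, `permutation_p_value`, `genes` and `gene_set` are hypothetical.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Unweighted running sum: step up at hits, down at misses; return the largest deviation."""
    hits = sum(1 for g in ranked_genes if g in gene_set)
    misses = len(ranked_genes) - hits
    if hits == 0 or misses == 0:
        return 0.0
    up, down = 1.0 / hits, 1.0 / misses
    running, best = 0.0, 0.0
    for g in ranked_genes:
        running += up if g in gene_set else -down
        if abs(running) > abs(best):
            best = running
    return best

def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Shuffle the ranking n_perm times and compare |ES| against the observed value."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(enrichment_score(shuffled, gene_set)) >= abs(observed):
            count += 1
    # Add-one correction keeps the estimate strictly positive
    return observed, (count + 1) / (n_perm + 1)

genes = [f"g{i}" for i in range(100)]  # hypothetical ranked gene list
gene_set = set(genes[:10])             # hypothetical set concentrated at the top
es, p = permutation_p_value(genes, gene_set)
```

The add-one correction in the returned p-value is a common convention for permutation tests; whether the notebook applies it is not visible in this diff.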
8 changes: 3 additions & 5 deletions dsbook/network/pathway.md
@@ -19,17 +19,15 @@ A typical output from a high-throughput experiment is a list of genes, transcrip

## Pathway Databases

-Examples of commonly used pathway databases include [**KEGG**](https://www.kegg.jp/), [**Reactome**](https://reactome.org/), [**WikiPathways**](https://www.wikipathways.org/), and [**BioCyc**](https://biocyc.org/), which provide curated pathways for metabolic and signaling processes across various organisms. Specialized databases like [**MetaCyc**](https://metacyc.org/) (metabolic pathways), [**SIGNOR**](https://signor.uniroma2.it/) (signaling networks), and [**CTD**](http://ctdbase.org/) (toxicogenomic interactions) cater to specific research needs, while [**MSigDB**](https://www.gsea-msigdb.org/gsea/msigdb/) and [**PANTHER**](http://www.pantherdb.org/) are frequently used in enrichment analyses. Although not formally a pathway database, [**Gene Ontology (GO)**](http://geneontology.org/) is often used with pathway-based methods to identify enriched biological processes, cellular components, and molecular functions from gene or protein lists.
-
-Two of the most widely used databases for pathway analysis are KEGG and Reactome.
+Examples of commonly used pathway databases include [**KEGG**](https://www.kegg.jp/), [**Reactome**](https://reactome.org/), [**WikiPathways**](https://www.wikipathways.org/), and [**BioCyc**](https://biocyc.org/), which provide curated pathways for metabolic and signaling processes across various organisms. For specialized needs, researchers use databases like [**MetaCyc**](https://metacyc.org/) for metabolic pathways, [**SIGNOR**](https://signor.uniroma2.it/) for signaling networks, and [**CTD**](http://ctdbase.org/) for studying toxicogenomic interactions. Although not formally a pathway database, [**Gene Ontology (GO)**](http://geneontology.org/) is often utilized alongside pathway-based methods to identify enriched biological processes, cellular components, and molecular functions from gene or protein lists.

### KEGG

-The [**Kyoto Encyclopedia of Genes and Genomes (KEGG)**](https://www.genome.jp/kegg/) is a manually curated database that offers a collection of high-level maps integrating genomic, chemical, and systemic functional information. KEGG provides comprehensive pathway maps, including metabolic pathways, signal transduction pathways, and regulatory pathways. KEGG pathways are represented as graphical diagrams, which help in visualizing molecular interactions and their roles in specific biological functions.
+The [**Kyoto Encyclopedia of Genes and Genomes (KEGG)**](https://www.genome.jp/kegg/) is a manually curated database that integrates genomic, chemical, and systemic functional information into comprehensive pathway maps, including metabolic and regulatory pathways. KEGG pathways are depicted in graphical diagrams, facilitating the visualization of molecular interactions and their roles in specific biological functions.

### Reactome

-[**Reactome**](https://reactome.org/PathwayBrowser/) is another prominent pathway database that provides detailed information about cellular processes, including metabolic reactions, signal transduction, immune system functions, and more. Reactome is an open-source, manually curated knowledge base, focusing on the relationships between genes, proteins, and other molecules in the context of biological pathways. Compared to KEGG, Reactome provides finer details about molecular interactions and is enriched by contributions from experts in the field.
+[**Reactome**](https://reactome.org/PathwayBrowser/) is a detailed, open-source, manually curated knowledge base that provides information about cellular processes like metabolic reactions and immune system functions. Reactome focuses on the relationships between genes, proteins, and other molecules, offering more granular details about molecular interactions than KEGG.

## Over Representation Analysis (ORA)

Expand Down
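The ORA section that follows this hunk rests on the overlap arithmetic quoted in the notebook: a pathway covering 20% of all genes and a gene list covering 30% are expected to share 0.20 × 0.30 = 6% of genes by chance. The sketch below works that number out and adds a one-sided hypergeometric tail probability using only the standard library. The universe size of 1000 genes and the observed overlap of 80 are hypothetical values chosen for illustration, and the function names are not from the book.

```python
from math import comb

def expected_overlap_fraction(pathway_frac, list_frac):
    """Expected fraction of all genes in both sets if membership is independent."""
    return pathway_frac * list_frac

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K belong to the pathway."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)) / total

N = 1000  # hypothetical number of genes in the universe
K = 200   # pathway size (20% of N)
n = 300   # gene-list size (30% of N)

expected = K * n / N                 # expected overlap in genes
p_value = hypergeom_sf(80, N, K, n)  # chance of an overlap of at least 80 (hypothetical)
```

An overlap well above the expected count gives a small tail probability, which is the quantity an over-representation test reports.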
96 changes: 38 additions & 58 deletions dsbook/unsupervised/VAEofCarcinomas.ipynb

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions dsbook/unsupervised/autoenc.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "markdown",
-"id": "04df1d55",
+"id": "865f3d3b",
"metadata": {},
"source": [
"# Autoencoders\n",
@@ -82,7 +82,7 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "25158c7f",
+"id": "eb3162d7",
"metadata": {},
"outputs": [],
"source": [
@@ -94,7 +94,7 @@
"from torch.utils.data import DataLoader, TensorDataset\n",
"\n",
"# Step 1: Generate 2D points on a circle with noise\n",
-"n_samples = 1000\n",
+"n_samples = 5000\n",
"noise_level = 0.1\n",
"np.random.seed(42)\n",
"\n",
@@ -188,7 +188,7 @@
},
{
"cell_type": "markdown",
-"id": "8a333844",
+"id": "760d8f89",
"metadata": {},
"source": [
"In this implementation, the `Autoencoder` class defines both the encoder and decoder as sequential networks. The encoder compresses the input to a latent dimension, while the decoder reconstructs the input from the latent representation. The network is trained by minimizing the mean squared error (MSE) between the input and the reconstructed output.\n",
@@ -253,7 +253,7 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "e0c79023",
+"id": "814c3434",
"metadata": {},
"outputs": [],
"source": [
@@ -361,7 +361,7 @@
},
{
"cell_type": "markdown",
-"id": "ed408525",
+"id": "e6243d4c",
"metadata": {},
"source": [
"In this implementation, the VAE consists of an encoder that outputs the mean and log variance of the latent distribution, a reparameterization trick to sample from this distribution, and a decoder to reconstruct the input. The loss function includes both a reconstruction term (mean squared error) and a KL divergence term to ensure the latent space is well-regularized. After training, the latent space embeddings are plotted to visualize the learned structure."
Expand Down
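The VAE description in the last cell names three ingredients: an encoder output of mean and log variance, the reparameterization trick, and a loss with a reconstruction term plus a KL term. The NumPy sketch below shows the latter two in isolation. It is not the notebook's PyTorch code: the function names are hypothetical, and the use of a summed squared error with equal weighting of the two terms is an assumption.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), so the randomness sits in eps."""
    std = np.exp(0.5 * logvar)
    eps = rng.standard_normal(mu.shape)
    return mu + std * eps

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dimensions per sample."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1)

def vae_loss(x, x_hat, mu, logvar):
    """Squared reconstruction error plus KL term, averaged over the batch (assumed weighting)."""
    recon = np.sum((x - x_hat) ** 2, axis=1)
    return np.mean(recon + kl_to_standard_normal(mu, logvar))

rng = np.random.default_rng(0)
mu = np.zeros((4, 2))       # hypothetical encoder means for a batch of 4
logvar = np.zeros((4, 2))   # hypothetical encoder log variances
z = reparameterize(mu, logvar, rng)
```

When `mu` is zero and `logvar` is zero the KL term vanishes, which is the sense in which the term pulls the latent distribution toward a standard normal.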
2 changes: 1 addition & 1 deletion dsbook/unsupervised/autoenc.md
@@ -92,7 +92,7 @@ import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, TensorDataset
# Step 1: Generate 2D points on a circle with noise
-n_samples = 1000
+n_samples = 5000
noise_level = 0.1
np.random.seed(42)
Expand Down
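To see why a one-dimensional latent space is enough for this toy data, the sketch below regenerates noisy circle points with the new `n_samples = 5000` and replaces the trained network with a hand-written encoder and decoder. This is an illustration under stated assumptions, not the book's PyTorch model: the unit radius, the use of `np.random.default_rng`, and the `encode`/`decode` functions are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, noise_level = 5000, 0.1

# Noisy points around a circle (unit radius is an assumption)
theta = rng.uniform(0.0, 2.0 * np.pi, n_samples)
clean = np.column_stack([np.cos(theta), np.sin(theta)])
X = clean + noise_level * rng.normal(size=clean.shape)

def encode(x):
    """Hand-written stand-in for the encoder: compress each 2D point to its angle."""
    return np.arctan2(x[:, 1], x[:, 0])

def decode(z):
    """Hand-written stand-in for the decoder: map the angle back onto the unit circle."""
    return np.column_stack([np.cos(z), np.sin(z)])

X_hat = decode(encode(X))
mse = np.mean((X - X_hat) ** 2)  # the reconstruction loss an autoencoder would minimize
```

The remaining error comes from the radial part of the noise, which a single latent coordinate cannot represent; a trained autoencoder with a one-dimensional bottleneck would be expected to approach a similar reconstruction.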
34 changes: 17 additions & 17 deletions dsbook/unsupervised/cluster.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "markdown",
-"id": "e5c5b70f",
+"id": "3b111a3b",
"metadata": {},
"source": [
"# Unsupervised Machine Learning\n",
@@ -44,7 +44,7 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "8b756e63",
+"id": "eb1aa82d",
"metadata": {},
"outputs": [],
"source": [
@@ -105,7 +105,7 @@
},
{
"cell_type": "markdown",
-"id": "18e8cc04",
+"id": "1c17efae",
"metadata": {},
"source": [
"The algorithm automatically assigns the points to clusters, and we can see that it closely matches what we would expect by visual inspection.\n",
@@ -120,7 +120,7 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "42af42d4",
+"id": "4e984e9d",
"metadata": {},
"outputs": [],
"source": [
@@ -155,7 +155,7 @@
},
{
"cell_type": "markdown",
-"id": "898c3ccc",
+"id": "1a7401f1",
"metadata": {},
"source": [
"### Drawbacks of k-Means\n",
@@ -169,7 +169,7 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "987c7ba7",
+"id": "639af2c8",
"metadata": {},
"outputs": [],
"source": [
@@ -182,7 +182,7 @@
},
{
"cell_type": "markdown",
-"id": "95d7b980",
+"id": "06c36016",
"metadata": {},
"source": [
"3. **Linear Cluster Boundaries**: The k-Means algorithm assumes that clusters are spherical and separated by linear boundaries. It struggles with complex geometries. Consider the following dataset with two crescent-shaped clusters:"
@@ -191,7 +191,7 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "b2a0dabd",
+"id": "f9c01eff",
"metadata": {},
"outputs": [],
"source": [
@@ -205,7 +205,7 @@
},
{
"cell_type": "markdown",
-"id": "ab706b89",
+"id": "2ef46f74",
"metadata": {},
"source": [
"4. **Differences in Euclidean size**: K-Means assumes that the cluster sizes, in terms of Euclidean distance to its borders, are fairly similar for all clusters.\n",
@@ -215,7 +215,7 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "da83d2ea",
+"id": "74940f8f",
"metadata": {},
"outputs": [],
"source": [
@@ -249,7 +249,7 @@
},
{
"cell_type": "markdown",
-"id": "dfbfdb4c",
+"id": "2a0c72db",
"metadata": {},
"source": [
"## Multivariate Normal Distribution\n",
@@ -274,7 +274,7 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "be73f476",
+"id": "18e0a9de",
"metadata": {
"tags": [
"hide-input"
@@ -325,7 +325,7 @@
},
{
"cell_type": "markdown",
-"id": "e04c18ec",
+"id": "3c0ef7bd",
"metadata": {},
"source": [
"## Gaussian Mixture Models (GMM)\n",
@@ -400,7 +400,7 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "b9efe68f",
+"id": "1bbdc75c",
"metadata": {},
"outputs": [],
"source": [
@@ -472,7 +472,7 @@
},
{
"cell_type": "markdown",
-"id": "2bf86cb2",
+"id": "87a664ea",
"metadata": {},
"source": [
"Points near the cluster boundaries have lower certainty, reflected in smaller marker sizes.\n",
@@ -483,7 +483,7 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "03824a40",
+"id": "15527fe7",
"metadata": {},
"outputs": [],
"source": [
@@ -503,7 +503,7 @@
},
{
"cell_type": "markdown",
-"id": "9c481827",
+"id": "2bda0031",
"metadata": {},
"source": [
"GMM is able to model more complex, elliptical cluster boundaries, addressing one of the main limitations of k-Means.\n",
Expand Down
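The closing remark of this notebook, that a GMM can model elliptical cluster boundaries where k-Means cannot, can be checked on a single point. The sketch below computes the posterior component probabilities (the E-step responsibilities) for a hypothetical two-component mixture and compares them with a nearest-mean assignment. All parameters and function names are illustrative assumptions, not values from the notebook.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate normal evaluated at the rows of x."""
    d = x.shape[1]
    diff = x - mean
    inv = np.linalg.inv(cov)
    maha = np.einsum("ij,jk,ik->i", diff, inv, diff)  # squared Mahalanobis distance
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def responsibilities(x, weights, means, covs):
    """Posterior probability of each component for each point."""
    dens = np.column_stack(
        [w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, means, covs)]
    )
    return dens / dens.sum(axis=1, keepdims=True)

# Hypothetical mixture: component 0 is elongated along x, component 1 is small and round
weights = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
covs = [np.array([[9.0, 0.0], [0.0, 0.25]]), np.eye(2) * 0.25]

point = np.array([[2.5, 0.0]])  # Euclidean distance 2.5 to mean 0, 1.5 to mean 1
resp = responsibilities(point, weights, means, covs)
nearest_mean = int(np.argmin([np.linalg.norm(point - m) for m in means]))
```

Here the nearest mean is component 1, yet the responsibilities favor the elongated component 0, because the point lies well within its spread along the x axis. The same responsibilities are what the notebook's certainty-based marker sizes would reflect.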
12 changes: 8 additions & 4 deletions dsbook/unsupervised/pca.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "markdown",
-"id": "c231bb50",
+"id": "96a48260",
"metadata": {},
"source": [
"# Principal Component Analysis (PCA)\n",
@@ -176,7 +176,7 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "01c07042",
+"id": "48c7db92",
"metadata": {},
"outputs": [],
"source": [
@@ -266,7 +266,7 @@
},
{
"cell_type": "markdown",
-"id": "910b8965",
+"id": "690bb737",
"metadata": {},
"source": [
"In this example, we demonstrate the power of PCA by reconstructing facial images from a low-dimensional representation. The dataset we use is the Olivetti Faces Dataset, which contains images of different individuals' faces. We start by centering the data both globally (centering each feature, i.e., pixel, across all samples) and locally (centering each sample, i.e., face, across all features). This centering is important for PCA as it ensures that the data has a mean of zero, which simplifies the calculation of principal components.\n",
@@ -289,9 +289,13 @@
],
"metadata": {
"kernelspec": {
-"display_name": "Python 3",
+"display_name": "jb",
"language": "python",
"name": "python3"
+},
+"language_info": {
+"name": "python",
+"version": "3.11.8"
}
},
"nbformat": 4,
Expand Down
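The PCA notebook's text describes centering the face data both globally (each pixel across samples) and locally (each face across pixels) before computing principal components. The sketch below applies both centering steps and a rank-k reconstruction via SVD. It uses a small random matrix as a hypothetical stand-in for the Olivetti faces, and the choice of `k = 4` is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))  # hypothetical stand-in: 50 "faces" of 16 "pixels"

# Global centering: subtract each pixel's mean across all samples
X_centered = X - X.mean(axis=0)
# Local centering: subtract each sample's mean across its pixels
X_centered = X_centered - X_centered.mean(axis=1, keepdims=True)

# PCA via SVD of the centered data; rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 4
scores = U[:, :k] * S[:k]   # low-dimensional representation of each sample
X_approx = scores @ Vt[:k]  # reconstruction from the first k components
rel_error = np.linalg.norm(X_centered - X_approx) / np.linalg.norm(X_centered)
```

Applying the local step after the global one leaves the column means at zero, because the grand mean of a column-centered matrix is already zero, so both centering conditions hold at once.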
