some updates after lecture

statisticalbiotechnology · Nov 27, 2024 · 67d85b6
1 parent 7ff6983
commit 67d85b6

Showing 7 changed files with 81 additions and 101 deletions.
diff --git a/dsbook/network/pathway.ipynb b/dsbook/network/pathway.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "8dce544d",
+   "id": "99bb19c6",
    "metadata": {},
    "source": [
     "# Pathway Analysis\n",
@@ -14,17 +14,15 @@
     "\n",
     "## Pathway Databases\n",
     "\n",
-    "Examples of commonly used pathway databases include [**KEGG**](https://www.kegg.jp/), [**Reactome**](https://reactome.org/), [**WikiPathways**](https://www.wikipathways.org/), and [**BioCyc**](https://biocyc.org/), which provide curated pathways for metabolic and signaling processes across various organisms. Specialized databases like [**MetaCyc**](https://metacyc.org/) (metabolic pathways), [**SIGNOR**](https://signor.uniroma2.it/) (signaling networks), and [**CTD**](http://ctdbase.org/) (toxicogenomic interactions) cater to specific research needs, while [**MSigDB**](https://www.gsea-msigdb.org/gsea/msigdb/) and [**PANTHER**](http://www.pantherdb.org/) are frequently used in enrichment analyses. Although not formally a pathway database, [**Gene Ontology (GO)**](http://geneontology.org/) is often used with pathway-based methods to identify enriched biological processes, cellular components, and molecular functions from gene or protein lists.\n",
-    "\n",
-    "Two of the most widely used databases for pathway analysis are KEGG and Reactome.\n",
+    "Examples of commonly used pathway databases include [**KEGG**](https://www.kegg.jp/), [**Reactome**](https://reactome.org/), [**WikiPathways**](https://www.wikipathways.org/), and [**BioCyc**](https://biocyc.org/), which provide curated pathways for metabolic and signaling processes across various organisms. For specialized needs, researchers use databases like [**MetaCyc**](https://metacyc.org/) for metabolic pathways, [**SIGNOR**](https://signor.uniroma2.it/) for signaling networks, and [**CTD**](http://ctdbase.org/) for studying toxicogenomic interactions. Although not formally a pathway database, [**Gene Ontology (GO)**](http://geneontology.org/) is often utilized alongside pathway-based methods to identify enriched biological processes, cellular components, and molecular functions from gene or protein lists.\n",
     "\n",
     "### KEGG\n",
     "\n",
-    "The [**Kyoto Encyclopedia of Genes and Genomes (KEGG)**](https://www.genome.jp/kegg/) is a manually curated database that offers a collection of high-level maps integrating genomic, chemical, and systemic functional information. KEGG provides comprehensive pathway maps, including metabolic pathways, signal transduction pathways, and regulatory pathways. KEGG pathways are represented as graphical diagrams, which help in visualizing molecular interactions and their roles in specific biological functions.\n",
+    "The [**Kyoto Encyclopedia of Genes and Genomes (KEGG)**](https://www.genome.jp/kegg/) is a manually curated database that integrates genomic, chemical, and systemic functional information into comprehensive pathway maps, including metabolic and regulatory pathways. KEGG pathways are depicted in graphical diagrams, facilitating the visualization of molecular interactions and their roles in specific biological functions.\n",
     "\n",
     "### Reactome\n",
     "\n",
-    "[**Reactome**](https://reactome.org/PathwayBrowser/) is another prominent pathway database that provides detailed information about cellular processes, including metabolic reactions, signal transduction, immune system functions, and more. Reactome is an open-source, manually curated knowledge base, focusing on the relationships between genes, proteins, and other molecules in the context of biological pathways. Compared to KEGG, Reactome provides finer details about molecular interactions and is enriched by contributions from experts in the field.\n",
+    "[**Reactome**](https://reactome.org/PathwayBrowser/) is a detailed, open-source, manually curated knowledge base that provides information about cellular processes like metabolic reactions and immune system functions. Reactome focuses on the relationships between genes, proteins, and other molecules, offering more granular details about molecular interactions than KEGG.\n",
     "\n",
     "## Over Representation Analysis (ORA)\n",
     "\n",
@@ -62,7 +60,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "170e3fdf",
+   "id": "c7e615b7",
    "metadata": {
     "tags": [
      "hide-input"
@@ -112,7 +110,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "25c2f7a1",
+   "id": "40839767",
    "metadata": {},
    "source": [
     "Above is an illustration of the expected overlap (6%) of a pathway consisting of 20% of all genes and a gene-list of 30% of all genes. \n",
@@ -133,7 +131,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "d44c99e9",
+   "id": "3be344bc",
    "metadata": {
     "tags": [
      "hide-input"
@@ -188,7 +186,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "0ccf737d",
+   "id": "f1a71a16",
    "metadata": {},
    "source": [
     "To assess the statistical significance of the observed enrichment score, GSEA uses a [sampling distribution](sec:statistics:sampling) of ES obtained through permutation. The ranked gene list is shuffled many times to generate a background distribution of ES values, which can then be used to calculate the p-value for the observed enrichment score.\n",

diff --git a/dsbook/network/pathway.md b/dsbook/network/pathway.md
@@ -19,17 +19,15 @@ A typical output from a high-throughput experiment is a list of genes, transcrip
 
 ## Pathway Databases
 
-Examples of commonly used pathway databases include [**KEGG**](https://www.kegg.jp/), [**Reactome**](https://reactome.org/), [**WikiPathways**](https://www.wikipathways.org/), and [**BioCyc**](https://biocyc.org/), which provide curated pathways for metabolic and signaling processes across various organisms. Specialized databases like [**MetaCyc**](https://metacyc.org/) (metabolic pathways), [**SIGNOR**](https://signor.uniroma2.it/) (signaling networks), and [**CTD**](http://ctdbase.org/) (toxicogenomic interactions) cater to specific research needs, while [**MSigDB**](https://www.gsea-msigdb.org/gsea/msigdb/) and [**PANTHER**](http://www.pantherdb.org/) are frequently used in enrichment analyses. Although not formally a pathway database, [**Gene Ontology (GO)**](http://geneontology.org/) is often used with pathway-based methods to identify enriched biological processes, cellular components, and molecular functions from gene or protein lists.
-
-Two of the most widely used databases for pathway analysis are KEGG and Reactome.
+Examples of commonly used pathway databases include [**KEGG**](https://www.kegg.jp/), [**Reactome**](https://reactome.org/), [**WikiPathways**](https://www.wikipathways.org/), and [**BioCyc**](https://biocyc.org/), which provide curated pathways for metabolic and signaling processes across various organisms. For specialized needs, researchers use databases like [**MetaCyc**](https://metacyc.org/) for metabolic pathways, [**SIGNOR**](https://signor.uniroma2.it/) for signaling networks, and [**CTD**](http://ctdbase.org/) for studying toxicogenomic interactions. Although not formally a pathway database, [**Gene Ontology (GO)**](http://geneontology.org/) is often utilized alongside pathway-based methods to identify enriched biological processes, cellular components, and molecular functions from gene or protein lists.
 
 ### KEGG
 
-The [**Kyoto Encyclopedia of Genes and Genomes (KEGG)**](https://www.genome.jp/kegg/) is a manually curated database that offers a collection of high-level maps integrating genomic, chemical, and systemic functional information. KEGG provides comprehensive pathway maps, including metabolic pathways, signal transduction pathways, and regulatory pathways. KEGG pathways are represented as graphical diagrams, which help in visualizing molecular interactions and their roles in specific biological functions.
+The [**Kyoto Encyclopedia of Genes and Genomes (KEGG)**](https://www.genome.jp/kegg/) is a manually curated database that integrates genomic, chemical, and systemic functional information into comprehensive pathway maps, including metabolic and regulatory pathways. KEGG pathways are depicted in graphical diagrams, facilitating the visualization of molecular interactions and their roles in specific biological functions.
 
 ### Reactome
 
-[**Reactome**](https://reactome.org/PathwayBrowser/) is another prominent pathway database that provides detailed information about cellular processes, including metabolic reactions, signal transduction, immune system functions, and more. Reactome is an open-source, manually curated knowledge base, focusing on the relationships between genes, proteins, and other molecules in the context of biological pathways. Compared to KEGG, Reactome provides finer details about molecular interactions and is enriched by contributions from experts in the field.
+[**Reactome**](https://reactome.org/PathwayBrowser/) is a detailed, open-source, manually curated knowledge base that provides information about cellular processes like metabolic reactions and immune system functions. Reactome focuses on the relationships between genes, proteins, and other molecules, offering more granular details about molecular interactions than KEGG.
 
 ## Over Representation Analysis (ORA)
 

diff --git a/dsbook/unsupervised/VAEofCarcinomas.ipynb b/dsbook/unsupervised/VAEofCarcinomas.ipynb
diff --git a/dsbook/unsupervised/autoenc.ipynb b/dsbook/unsupervised/autoenc.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "04df1d55",
+   "id": "865f3d3b",
    "metadata": {},
    "source": [
     "# Autoencoders\n",
@@ -82,7 +82,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "25158c7f",
+   "id": "eb3162d7",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -94,7 +94,7 @@
     "from torch.utils.data import DataLoader, TensorDataset\n",
     "\n",
     "# Step 1: Generate 2D points on a circle with noise\n",
-    "n_samples = 1000\n",
+    "n_samples = 5000\n",
     "noise_level = 0.1\n",
     "np.random.seed(42)\n",
     "\n",
@@ -188,7 +188,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "8a333844",
+   "id": "760d8f89",
    "metadata": {},
    "source": [
     "In this implementation, the `Autoencoder` class defines both the encoder and decoder as sequential networks. The encoder compresses the input to a latent dimension, while the decoder reconstructs the input from the latent representation. The network is trained by minimizing the mean squared error (MSE) between the input and the reconstructed output.\n",
@@ -253,7 +253,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "e0c79023",
+   "id": "814c3434",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -361,7 +361,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ed408525",
+   "id": "e6243d4c",
    "metadata": {},
    "source": [
     "In this implementation, the VAE consists of an encoder that outputs the mean and log variance of the latent distribution, a reparameterization trick to sample from this distribution, and a decoder to reconstruct the input. The loss function includes both a reconstruction term (mean squared error) and a KL divergence term to ensure the latent space is well-regularized. After training, the latent space embeddings are plotted to visualize the learned structure."

diff --git a/dsbook/unsupervised/autoenc.md b/dsbook/unsupervised/autoenc.md
@@ -92,7 +92,7 @@ import matplotlib.pyplot as plt
 from torch.utils.data import DataLoader, TensorDataset
 
 # Step 1: Generate 2D points on a circle with noise
-n_samples = 1000
+n_samples = 5000
 noise_level = 0.1
 np.random.seed(42)
 

diff --git a/dsbook/unsupervised/cluster.ipynb b/dsbook/unsupervised/cluster.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "e5c5b70f",
+   "id": "3b111a3b",
    "metadata": {},
    "source": [
     "# Unsupervised Machine Learning\n",
@@ -44,7 +44,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "8b756e63",
+   "id": "eb1aa82d",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -105,7 +105,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "18e8cc04",
+   "id": "1c17efae",
    "metadata": {},
    "source": [
     "The algorithm automatically assigns the points to clusters, and we can see that it closely matches what we would expect by visual inspection.\n",
@@ -120,7 +120,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "42af42d4",
+   "id": "4e984e9d",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -155,7 +155,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "898c3ccc",
+   "id": "1a7401f1",
    "metadata": {},
    "source": [
     "### Drawbacks of k-Means\n",
@@ -169,7 +169,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "987c7ba7",
+   "id": "639af2c8",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -182,7 +182,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "95d7b980",
+   "id": "06c36016",
    "metadata": {},
    "source": [
     "3. **Linear Cluster Boundaries**: The k-Means algorithm assumes that clusters are spherical and separated by linear boundaries. It struggles with complex geometries. Consider the following dataset with two crescent-shaped clusters:"
@@ -191,7 +191,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "b2a0dabd",
+   "id": "f9c01eff",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -205,7 +205,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ab706b89",
+   "id": "2ef46f74",
    "metadata": {},
    "source": [
     "4. **Differences in euclidian size**: K-Means assumes that the cluster sizes, in terms of euclidian distance to its borders, are fairly similar for all clusters.\n",
@@ -215,7 +215,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "da83d2ea",
+   "id": "74940f8f",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -249,7 +249,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "dfbfdb4c",
+   "id": "2a0c72db",
    "metadata": {},
    "source": [
     "## Multivariate Normal Distribution\n",
@@ -274,7 +274,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "be73f476",
+   "id": "18e0a9de",
    "metadata": {
     "tags": [
      "hide-input"
@@ -325,7 +325,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "e04c18ec",
+   "id": "3c0ef7bd",
    "metadata": {},
    "source": [
     "## Gaussian Mixture Models (GMM)\n",
@@ -400,7 +400,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "b9efe68f",
+   "id": "1bbdc75c",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -472,7 +472,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "2bf86cb2",
+   "id": "87a664ea",
    "metadata": {},
    "source": [
     "Points near the cluster boundaries have lower certainty, reflected in smaller marker sizes.\n",
@@ -483,7 +483,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "03824a40",
+   "id": "15527fe7",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -503,7 +503,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "9c481827",
+   "id": "2bda0031",
    "metadata": {},
    "source": [
     "GMM is able to model more complex, elliptical cluster boundaries, addressing one of the main limitations of k-Means.\n",

diff --git a/dsbook/unsupervised/pca.ipynb b/dsbook/unsupervised/pca.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "c231bb50",
+   "id": "96a48260",
    "metadata": {},
    "source": [
     "# Principal Component Analysis (PCA)\n",
@@ -176,7 +176,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "01c07042",
+   "id": "48c7db92",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -266,7 +266,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "910b8965",
+   "id": "690bb737",
    "metadata": {},
    "source": [
     "In this example, we demonstrate the power of PCA by reconstructing facial images from a low-dimensional representation. The dataset we use is the Olivetti Faces Dataset, which contains images of different individuals' faces. We start by centering the data both globally (centering each feature, i.e., pixel, across all samples) and locally (centering each sample, i.e., face, across all features). This centering is important for PCA as it ensures that the data has a mean of zero, which simplifies the calculation of principal components.\n",
@@ -289,9 +289,13 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "jb",
    "language": "python",
    "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.11.8"
   }
  },
  "nbformat": 4,