Commit

Added maths for the first mentioning of GMMs
percolator committed Nov 20, 2024
1 parent 70e79d8 commit a3bced9
Showing 2 changed files with 133 additions and 29 deletions.
98 changes: 75 additions & 23 deletions dsbook/unsupervised/cluster.ipynb
@@ -2,16 +2,11 @@
"cells": [
{
"cell_type": "markdown",
"id": "7982a0ae",
"id": "0c97fd16",
"metadata": {},
"source": [
"# Unsupervised Machine Learning\n",
"\n",
"[//]: # \"Add maths for GMMs\"\n",
"\n",
"[//]: # \"Coloring of e-step i k-means\"\n",
"\n",
"\n",
"## Introduction\n",
"\n",
"Unsupervised machine learning aims to learn patterns from data without predefined labels. Specifically, the goal is to learn a function $f(x)$ from the dataset $D = \\{\\mathbf{x}_i\\}$ by optimizing an objective function $g(D, f)$, or by simply partitioning the dataset $D$. This chapter provides an overview of clustering methods, which are a core part of unsupervised machine learning.\n",
@@ -49,7 +44,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "82aaff85",
"id": "610a2053",
"metadata": {},
"outputs": [],
"source": [
@@ -110,7 +105,7 @@
},
{
"cell_type": "markdown",
"id": "684b06f7",
"id": "f8bc813e",
"metadata": {},
"source": [
"The algorithm automatically assigns the points to clusters, and we can see that it closely matches what we would expect by visual inspection.\n",
@@ -125,7 +120,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "c2c293fa",
"id": "bad308a6",
"metadata": {},
"outputs": [],
"source": [
@@ -160,7 +155,7 @@
},
{
"cell_type": "markdown",
"id": "e9fb9c38",
"id": "820460a4",
"metadata": {},
"source": [
"### Drawbacks of k-Means\n",
@@ -174,7 +169,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "7ea2c4b5",
"id": "c9bccdda",
"metadata": {},
"outputs": [],
"source": [
@@ -187,7 +182,7 @@
},
{
"cell_type": "markdown",
"id": "e9713f76",
"id": "a76af82c",
"metadata": {},
"source": [
"3. **Linear Cluster Boundaries**: The k-Means algorithm assumes that clusters are spherical and separated by linear boundaries. It struggles with complex geometries. Consider the following dataset with two crescent-shaped clusters:"
@@ -196,7 +191,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "47b982f9",
"id": "af09d7b1",
"metadata": {},
"outputs": [],
"source": [
@@ -210,7 +205,7 @@
},
{
"cell_type": "markdown",
"id": "f292fb4c",
"id": "a9e9050f",
"metadata": {},
"source": [
"4. **Differences in euclidian size**: K-Means assumes that the cluster sizes, in terms of euclidian distance to its borders, are fairly similar for all clusters.\n",
@@ -220,7 +215,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "5419aa2a",
"id": "3054ef33",
"metadata": {},
"outputs": [],
"source": [
@@ -254,7 +249,7 @@
},
{
"cell_type": "markdown",
"id": "18e9ef79",
"id": "d059cb0b",
"metadata": {},
"source": [
"## Multivariate Normal Distribution\n",
@@ -279,7 +274,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "690cfecc",
"id": "a28a80b8",
"metadata": {
"tags": [
"hide-input"
@@ -330,12 +325,69 @@
},
{
"cell_type": "markdown",
"id": "350f1a5a",
"id": "ead3886c",
"metadata": {},
"source": [
"## Gaussian Mixture Models (GMM)\n",
"\n",
"**Gaussian Mixture Models (GMMs)** provide a probabilistic approach to clustering and are an example of soft clustering. GMMs assume that the data is generated from a mixture of several Gaussian distributions, each representing a cluster. Unlike k-Means, GMM provides a soft clustering where each point is assigned a probability of belonging to each cluster.\n",
"Gaussian Mixture Models (GMMs) provide a probabilistic approach to clustering and are an example of soft clustering. GMMs assume that the data is generated from a mixture of several Gaussian distributions, each representing a cluster. Unlike k-Means, GMM provides a soft clustering where each point is assigned a probability of belonging to each cluster.\n",
"\n",
"Here is a detailed description of the **Gaussian Mixture Models (GMM)** algorithm with the mathematics you provided, outlining its steps:\n",
"\n",
"---\n",
"\n",
"## Gaussian Mixture Models (GMM)\n",
"\n",
"**Gaussian Mixture Models (GMMs)** provide a probabilistic approach to clustering and are an example of soft clustering. GMMs assume that the data is generated from a mixture of several Gaussian distributions, each representing a cluster. Unlike k-Means, GMM provides soft clustering where each point is assigned a probability of belonging to each cluster.\n",
"\n",
"### Steps of the GMM Algorithm\n",
"\n",
"1. **Initialization**:\n",
" - Define the number of clusters, $ K $.\n",
" - Initialize the parameters:\n",
" - Means $ \\mu_k $ for each component.\n",
" - Covariance matrices $ \\Sigma_k $ for each component.\n",
" - Mixing coefficients $ P_k $, such that $ \\sum_{k=1}^K P_k = 1 $.\n",
"\n",
"2. **Expectation Step (E-Step)**:\n",
" - Compute the probability that a data point $ \\mathbf{x}_n $ belongs to cluster $ k $, called the responsibility $ \\gamma_{nk} $:\n",
" ```{math}\n",
" \\gamma_{nk} = \\frac{P_k \\mathcal{N}(\\mathbf{x}_n | \\mu_k, \\Sigma_k)}{\\sum_{j=1}^K P_j \\mathcal{N}(\\mathbf{x}_n | \\mu_j, \\Sigma_j)} \n",
" ```\n",
" where:\n",
" - $ \\mathcal{N}(\\mathbf{x}_n | \\mu_k, \\Sigma_k) $ is the Gaussian probability density function:\n",
" ```{math}\n",
" \\mathcal{N}(\\mathbf{x}_n | \\mu_k, \\Sigma_k) = \\frac{1}{\\sqrt{(2\\pi)^d |\\Sigma_k|}} \\exp\\left( -\\frac{1}{2} (\\mathbf{x}_n - \\mu_k)^T \\Sigma_k^{-1} (\\mathbf{x}_n - \\mu_k) \\right)\n",
" ```\n",
"\n",
"3. **Maximization Step (M-Step)**:\n",
" - Recalculate the parameters based on the responsibilities $ \\gamma_{nk} $:\n",
" - Effective number of points in cluster $ k $:\n",
" ```{math}\n",
" N_k = \\sum_{n=1}^N \\gamma_{nk}\n",
" ```\n",
" - Updated cluster means:\n",
" ```{math}\n",
" \\mu_k^{\\text{new}} = \\frac{1}{N_k} \\sum_{n=1}^N \\gamma_{nk} \\mathbf{x}_n\n",
" ```\n",
" - Updated covariance matrices:\n",
" ```{math}\n",
" \\Sigma_k^{\\text{new}} = \\frac{1}{N_k} \\sum_{n=1}^N \\gamma_{nk} (\\mathbf{x}_n - \\mu_k^{\\text{new}})(\\mathbf{x}_n - \\mu_k^{\\text{new}})^T\n",
" ```\n",
" - Updated mixing coefficients:\n",
" ```{math}\n",
" P_k^{\\text{new}} = \\frac{N_k}{N}\n",
" ```\n",
"\n",
"4. **Log-Likelihood Calculation**:\n",
" - Evaluate the log-likelihood of the data given the current model parameters:\n",
" ```{math}\n",
" \\ln \\Pr(\\mathbf{X} | \\boldsymbol{\\mu}, \\boldsymbol{\\Sigma}, \\mathbf{P}) = \\sum_{n=1}^N \\ln \\left( \\sum_{k=1}^K P_k \\mathcal{N}(\\mathbf{x}_n | \\mu_k, \\Sigma_k) \\right)\n",
" ```\n",
"\n",
"5. **Convergence Check**:\n",
" - Repeat the E and M steps until convergence, which occurs when the log-likelihood no longer increases or the parameter updates become negligible.\n",
"\n",
"\n",
"\n",
"### Illustrations of GMM\n",
@@ -348,7 +400,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "0b302c2b",
"id": "49e31713",
"metadata": {},
"outputs": [],
"source": [
@@ -420,7 +472,7 @@
},
{
"cell_type": "markdown",
"id": "23ea42ef",
"id": "2ab89322",
"metadata": {},
"source": [
"Points near the cluster boundaries have lower certainty, reflected in smaller marker sizes.\n",
@@ -431,7 +483,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "7a827b7c",
"id": "6dc457c8",
"metadata": {},
"outputs": [],
"source": [
@@ -451,7 +503,7 @@
},
{
"cell_type": "markdown",
"id": "e715ca59",
"id": "87cce6ba",
"metadata": {},
"source": [
"GMM is able to model more complex, elliptical cluster boundaries, addressing one of the main limitations of k-Means.\n",
64 changes: 58 additions & 6 deletions dsbook/unsupervised/cluster.md
@@ -11,11 +11,6 @@ jupytext:
---
# Unsupervised Machine Learning

[//]: # "Add maths for GMMs"

[//]: # "Coloring of e-step i k-means"


## Introduction

Unsupervised machine learning aims to learn patterns from data without predefined labels. Specifically, the goal is to learn a function $f(x)$ from the dataset $D = \{\mathbf{x}_i\}$ by optimizing an objective function $g(D, f)$, or by simply partitioning the dataset $D$. This chapter provides an overview of clustering methods, which are a core part of unsupervised machine learning.
@@ -266,7 +261,7 @@

## Gaussian Mixture Models (GMM)

**Gaussian Mixture Models (GMMs)** provide a probabilistic approach to clustering and are an example of soft clustering. GMMs assume that the data is generated from a mixture of several Gaussian distributions, each representing a cluster. Unlike k-Means, GMM provides a soft clustering where each point is assigned a probability of belonging to each cluster.

### Steps of the GMM Algorithm

1. **Initialization**:
- Define the number of clusters, $ K $.
- Initialize the parameters:
- Means $ \mu_k $ for each component.
- Covariance matrices $ \Sigma_k $ for each component.
- Mixing coefficients $ P_k $, such that $ \sum_{k=1}^K P_k = 1 $.

2. **Expectation Step (E-Step)**:
- Compute the probability that a data point $ \mathbf{x}_n $ belongs to cluster $ k $, called the responsibility $ \gamma_{nk} $:
```{math}
\gamma_{nk} = \frac{P_k \mathcal{N}(\mathbf{x}_n | \mu_k, \Sigma_k)}{\sum_{j=1}^K P_j \mathcal{N}(\mathbf{x}_n | \mu_j, \Sigma_j)}
```
where:
   - $ \mathcal{N}(\mathbf{x}_n | \mu_k, \Sigma_k) $ is the Gaussian probability density function, where $d$ is the dimensionality of the data:
```{math}
\mathcal{N}(\mathbf{x}_n | \mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_k|}} \exp\left( -\frac{1}{2} (\mathbf{x}_n - \mu_k)^T \Sigma_k^{-1} (\mathbf{x}_n - \mu_k) \right)
```

3. **Maximization Step (M-Step)**:
- Recalculate the parameters based on the responsibilities $ \gamma_{nk} $:
- Effective number of points in cluster $ k $:
```{math}
N_k = \sum_{n=1}^N \gamma_{nk}
```
- Updated cluster means:
```{math}
\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} \mathbf{x}_n
```
- Updated covariance matrices:
```{math}
\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} (\mathbf{x}_n - \mu_k^{\text{new}})(\mathbf{x}_n - \mu_k^{\text{new}})^T
```
- Updated mixing coefficients:
```{math}
P_k^{\text{new}} = \frac{N_k}{N}
```
4. **Log-Likelihood Calculation**:
- Evaluate the log-likelihood of the data given the current model parameters:
```{math}
\ln \Pr(\mathbf{X} | \boldsymbol{\mu}, \boldsymbol{\Sigma}, \mathbf{P}) = \sum_{n=1}^N \ln \left( \sum_{k=1}^K P_k \mathcal{N}(\mathbf{x}_n | \mu_k, \Sigma_k) \right)
```

5. **Convergence Check**:
- Repeat the E and M steps until convergence, which occurs when the log-likelihood no longer increases or the parameter updates become negligible.
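
A minimal sketch of this loop in NumPy (assuming the data sits in an `(N, d)` array `X`; the function name `gmm_em` and the fixed iteration count are choices made for this example, not code from the chapter):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    # X is an (N, d) data matrix, K the number of mixture components
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # 1. Initialization: random points as means, identity covariances,
    #    uniform mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.eye(d) for _ in range(K)])
    P = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # 2. E-step: responsibilities gamma[n, k]
        dens = np.column_stack([
            P[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 3. M-step: effective counts, means, covariances, mixing weights
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        P = Nk / N
        # 4. Log-likelihood under the parameters used in this E-step;
        # 5. a convergence check would stop once it no longer increases
        log_likelihood = np.log(dens.sum(axis=1)).sum()
    return mu, Sigma, P, gamma
```

In practice one would typically rely on scikit-learn's `GaussianMixture`, which implements the same EM procedure with more careful initialization and convergence handling.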



### Illustrations of GMM
