Commit

Added maths for the first mentioning of GMMs
percolator committed Nov 20, 2024
1 parent 70e79d8 commit a3bced9
Showing 2 changed files with 133 additions and 29 deletions.
98 changes: 75 additions & 23 deletions dsbook/unsupervised/cluster.ipynb
@@ -2,16 +2,11 @@
"cells": [
{
"cell_type": "markdown",
"id": "7982a0ae",
"id": "0c97fd16",
"metadata": {},
"source": [
"# Unsupervised Machine Learning\n",
"\n",
"[//]: # \"Add maths for GMMs\"\n",
"\n",
"[//]: # \"Coloring of e-step i k-means\"\n",
"\n",
"\n",
"## Introduction\n",
"\n",
"Unsupervised machine learning aims to learn patterns from data without predefined labels. Specifically, the goal is to learn a function $f(x)$ from the dataset $D = \\{\\mathbf{x}_i\\}$ by optimizing an objective function $g(D, f)$, or by simply partitioning the dataset $D$. This chapter provides an overview of clustering methods, which are a core part of unsupervised machine learning.\n",
@@ -49,7 +44,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "82aaff85",
"id": "610a2053",
"metadata": {},
"outputs": [],
"source": [
@@ -110,7 +105,7 @@
},
{
"cell_type": "markdown",
"id": "684b06f7",
"id": "f8bc813e",
"metadata": {},
"source": [
"The algorithm automatically assigns the points to clusters, and we can see that it closely matches what we would expect by visual inspection.\n",
@@ -125,7 +120,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "c2c293fa",
"id": "bad308a6",
"metadata": {},
"outputs": [],
"source": [
@@ -160,7 +155,7 @@
},
{
"cell_type": "markdown",
"id": "e9fb9c38",
"id": "820460a4",
"metadata": {},
"source": [
"### Drawbacks of k-Means\n",
@@ -174,7 +169,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "7ea2c4b5",
"id": "c9bccdda",
"metadata": {},
"outputs": [],
"source": [
@@ -187,7 +182,7 @@
},
{
"cell_type": "markdown",
"id": "e9713f76",
"id": "a76af82c",
"metadata": {},
"source": [
"3. **Linear Cluster Boundaries**: The k-Means algorithm assumes that clusters are spherical and separated by linear boundaries. It struggles with complex geometries. Consider the following dataset with two crescent-shaped clusters:"
@@ -196,7 +191,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "47b982f9",
"id": "af09d7b1",
"metadata": {},
"outputs": [],
"source": [
@@ -210,7 +205,7 @@
},
{
"cell_type": "markdown",
"id": "f292fb4c",
"id": "a9e9050f",
"metadata": {},
"source": [
"4. **Differences in euclidian size**: K-Means assumes that the cluster sizes, in terms of euclidian distance to its borders, are fairly similar for all clusters.\n",
@@ -220,7 +215,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "5419aa2a",
"id": "3054ef33",
"metadata": {},
"outputs": [],
"source": [
@@ -254,7 +249,7 @@
},
{
"cell_type": "markdown",
"id": "18e9ef79",
"id": "d059cb0b",
"metadata": {},
"source": [
"## Multivariate Normal Distribution\n",
@@ -279,7 +274,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "690cfecc",
"id": "a28a80b8",
"metadata": {
"tags": [
"hide-input"
@@ -330,12 +325,69 @@
},
{
"cell_type": "markdown",
"id": "350f1a5a",
"id": "ead3886c",
"metadata": {},
"source": [
"## Gaussian Mixture Models (GMM)\n",
"\n",
"**Gaussian Mixture Models (GMMs)** provide a probabilistic approach to clustering and are an example of soft clustering. GMMs assume that the data is generated from a mixture of several Gaussian distributions, each representing a cluster. Unlike k-Means, GMM provides a soft clustering where each point is assigned a probability of belonging to each cluster.\n",
"Gaussian Mixture Models (GMMs) provide a probabilistic approach to clustering and are an example of soft clustering. GMMs assume that the data is generated from a mixture of several Gaussian distributions, each representing a cluster. Unlike k-Means, GMM provides a soft clustering where each point is assigned a probability of belonging to each cluster.\n",
"\n",
"Here is a detailed description of the **Gaussian Mixture Models (GMM)** algorithm with the mathematics you provided, outlining its steps:\n",
"\n",
"---\n",
"\n",
"## Gaussian Mixture Models (GMM)\n",
"\n",
"**Gaussian Mixture Models (GMMs)** provide a probabilistic approach to clustering and are an example of soft clustering. GMMs assume that the data is generated from a mixture of several Gaussian distributions, each representing a cluster. Unlike k-Means, GMM provides soft clustering where each point is assigned a probability of belonging to each cluster.\n",
"\n",
"### Steps of the GMM Algorithm\n",
"\n",
"1. **Initialization**:\n",
" - Define the number of clusters, $ K $.\n",
" - Initialize the parameters:\n",
" - Means $ \\mu_k $ for each component.\n",
" - Covariance matrices $ \\Sigma_k $ for each component.\n",
" - Mixing coefficients $ P_k $, such that $ \\sum_{k=1}^K P_k = 1 $.\n",
"\n",
"2. **Expectation Step (E-Step)**:\n",
" - Compute the probability that a data point $ \\mathbf{x}_n $ belongs to cluster $ k $, called the responsibility $ \\gamma_{nk} $:\n",
" ```{math}\n",
" \\gamma_{nk} = \\frac{P_k \\mathcal{N}(\\mathbf{x}_n | \\mu_k, \\Sigma_k)}{\\sum_{j=1}^K P_j \\mathcal{N}(\\mathbf{x}_n | \\mu_j, \\Sigma_j)} \n",
" ```\n",
" where:\n",
" - $ \\mathcal{N}(\\mathbf{x}_n | \\mu_k, \\Sigma_k) $ is the Gaussian probability density function:\n",
" ```{math}\n",
" \\mathcal{N}(\\mathbf{x}_n | \\mu_k, \\Sigma_k) = \\frac{1}{\\sqrt{(2\\pi)^d |\\Sigma_k|}} \\exp\\left( -\\frac{1}{2} (\\mathbf{x}_n - \\mu_k)^T \\Sigma_k^{-1} (\\mathbf{x}_n - \\mu_k) \\right)\n",
" ```\n",
"\n",
"3. **Maximization Step (M-Step)**:\n",
" - Recalculate the parameters based on the responsibilities $ \\gamma_{nk} $:\n",
" - Effective number of points in cluster $ k $:\n",
" ```{math}\n",
" N_k = \\sum_{n=1}^N \\gamma_{nk}\n",
" ```\n",
" - Updated cluster means:\n",
" ```{math}\n",
" \\mu_k^{\\text{new}} = \\frac{1}{N_k} \\sum_{n=1}^N \\gamma_{nk} \\mathbf{x}_n\n",
" ```\n",
" - Updated covariance matrices:\n",
" ```{math}\n",
" \\Sigma_k^{\\text{new}} = \\frac{1}{N_k} \\sum_{n=1}^N \\gamma_{nk} (\\mathbf{x}_n - \\mu_k^{\\text{new}})(\\mathbf{x}_n - \\mu_k^{\\text{new}})^T\n",
" ```\n",
" - Updated mixing coefficients:\n",
" ```{math}\n",
" P_k^{\\text{new}} = \\frac{N_k}{N}\n",
" ```\n",
"\n",
"4. **Log-Likelihood Calculation**:\n",
" - Evaluate the log-likelihood of the data given the current model parameters:\n",
" ```{math}\n",
" \\ln \\Pr(\\mathbf{X} | \\boldsymbol{\\mu}, \\boldsymbol{\\Sigma}, \\mathbf{P}) = \\sum_{n=1}^N \\ln \\left( \\sum_{k=1}^K P_k \\mathcal{N}(\\mathbf{x}_n | \\mu_k, \\Sigma_k) \\right)\n",
" ```\n",
"\n",
"5. **Convergence Check**:\n",
" - Repeat the E and M steps until convergence, which occurs when the log-likelihood no longer increases or the parameter updates become negligible.\n",
"\n",
"\n",
"\n",
"### Illustrations of GMM\n",
@@ -348,7 +400,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "0b302c2b",
"id": "49e31713",
"metadata": {},
"outputs": [],
"source": [
@@ -420,7 +472,7 @@
},
{
"cell_type": "markdown",
"id": "23ea42ef",
"id": "2ab89322",
"metadata": {},
"source": [
"Points near the cluster boundaries have lower certainty, reflected in smaller marker sizes.\n",
@@ -431,7 +483,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "7a827b7c",
"id": "6dc457c8",
"metadata": {},
"outputs": [],
"source": [
@@ -451,7 +503,7 @@
},
{
"cell_type": "markdown",
"id": "e715ca59",
"id": "87cce6ba",
"metadata": {},
"source": [
"GMM is able to model more complex, elliptical cluster boundaries, addressing one of the main limitations of k-Means.\n",
64 changes: 58 additions & 6 deletions dsbook/unsupervised/cluster.md
@@ -11,11 +11,6 @@ jupytext:
---
# Unsupervised Machine Learning

[//]: # "Add maths for GMMs"

[//]: # "Coloring of e-step i k-means"


## Introduction

Unsupervised machine learning aims to learn patterns from data without predefined labels. Specifically, the goal is to learn a function $f(x)$ from the dataset $D = \{\mathbf{x}_i\}$ by optimizing an objective function $g(D, f)$, or by simply partitioning the dataset $D$. This chapter provides an overview of clustering methods, which are a core part of unsupervised machine learning.
@@ -266,7 +261,7 @@

## Gaussian Mixture Models (GMM)

**Gaussian Mixture Models (GMMs)** provide a probabilistic approach to clustering and are an example of soft clustering. GMMs assume that the data is generated from a mixture of several Gaussian distributions, each representing a cluster. Unlike k-Means, GMM provides a soft clustering where each point is assigned a probability of belonging to each cluster.

### Steps of the GMM Algorithm

1. **Initialization**:
- Define the number of clusters, $ K $.
- Initialize the parameters:
- Means $ \mu_k $ for each component.
- Covariance matrices $ \Sigma_k $ for each component.
- Mixing coefficients $ P_k $, such that $ \sum_{k=1}^K P_k = 1 $.

2. **Expectation Step (E-Step)**:
- Compute the probability that a data point $ \mathbf{x}_n $ belongs to cluster $ k $, called the responsibility $ \gamma_{nk} $:
```{math}
\gamma_{nk} = \frac{P_k \mathcal{N}(\mathbf{x}_n | \mu_k, \Sigma_k)}{\sum_{j=1}^K P_j \mathcal{N}(\mathbf{x}_n | \mu_j, \Sigma_j)}
```
where:
   - $ \mathcal{N}(\mathbf{x}_n | \mu_k, \Sigma_k) $ is the Gaussian probability density function, where $d$ is the dimensionality of the data:
```{math}
\mathcal{N}(\mathbf{x}_n | \mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_k|}} \exp\left( -\frac{1}{2} (\mathbf{x}_n - \mu_k)^T \Sigma_k^{-1} (\mathbf{x}_n - \mu_k) \right)
```

3. **Maximization Step (M-Step)**:
- Recalculate the parameters based on the responsibilities $ \gamma_{nk} $:
- Effective number of points in cluster $ k $:
```{math}
N_k = \sum_{n=1}^N \gamma_{nk}
```
- Updated cluster means:
```{math}
\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} \mathbf{x}_n
```
- Updated covariance matrices:
```{math}
\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk} (\mathbf{x}_n - \mu_k^{\text{new}})(\mathbf{x}_n - \mu_k^{\text{new}})^T
```
- Updated mixing coefficients:
```{math}
P_k^{\text{new}} = \frac{N_k}{N}
```
4. **Log-Likelihood Calculation**:
- Evaluate the log-likelihood of the data given the current model parameters:
```{math}
\ln \Pr(\mathbf{X} | \boldsymbol{\mu}, \boldsymbol{\Sigma}, \mathbf{P}) = \sum_{n=1}^N \ln \left( \sum_{k=1}^K P_k \mathcal{N}(\mathbf{x}_n | \mu_k, \Sigma_k) \right)
```

5. **Convergence Check**:
- Repeat the E and M steps until convergence, which occurs when the log-likelihood no longer increases or the parameter updates become negligible.
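
A minimal sketch of this loop in NumPy (assuming the data sits in an `(N, d)` array `X`; the function name `gmm_em` and the fixed iteration count are choices made for this example, not code from the chapter):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    # X is an (N, d) data matrix, K the number of mixture components
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # 1. Initialization: random points as means, identity covariances,
    #    uniform mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.eye(d) for _ in range(K)])
    P = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # 2. E-step: responsibilities gamma[n, k]
        dens = np.column_stack([
            P[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 3. M-step: effective counts, means, covariances, mixing weights
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        P = Nk / N
        # 4. Log-likelihood under the parameters used in this E-step;
        # 5. a convergence check would stop once it no longer increases
        log_likelihood = np.log(dens.sum(axis=1)).sum()
    return mu, Sigma, P, gamma
```

In practice one would typically rely on scikit-learn's `GaussianMixture`, which implements the same EM procedure with more careful initialization and convergence handling.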



### Illustrations of GMM
