---
title: "An Introduction to Hierarchical Clustering"
format: html
typora-copy-images-to: images
---
Taken from the article "[An Introduction to Hierarchical Clustering in Python](https://www.datacamp.com/tutorial/introduction-hierarchical-clustering-python)" from DataCamp
## Definition of Hierarchical Clustering
Hierarchical clustering builds successive clusters from previously defined clusters. It is a technique aimed at grouping data into a tree of clusters, called a dendrogram, which graphically represents the hierarchical relationship between the underlying clusters.
## Comparison of Hierarchical Clustering to Other Clustering Techniques
Hierarchical clustering is a powerful algorithm, but it is not the only one out there, and each type of clustering comes with its set of advantages and drawbacks.
Let’s understand how it compares to other types of clustering, such as K-means and model-based clustering. Many more techniques exist, but these two, in addition to hierarchical clustering, are widely used and provide a framework to help better understand the others.
You can learn more about [clustering in machine learning](https://www.datacamp.com/blog/clustering-in-machine-learning-5-essential-clustering-algorithms) in our separate article, covering five essential clustering algorithms.
### Hierarchical clustering vs. K-means clustering
Unlike hierarchical clustering, K-means clustering seeks to partition the original data points into “K” groups or clusters, where the user specifies “K” in advance.
The general idea is to look for clusters that minimize the squared Euclidean distance of all the points from their cluster centers over all attributes (variables or features), assigning points to the nearest center and updating the centers in an iterative manner.
Our [K-means Clustering in Python with Scikit-learn](https://www.datacamp.com/tutorial/k-means-clustering-python) tutorial will help you understand the inner workings of K-means clustering with an interesting case study.
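For comparison, a minimal K-means sketch with scikit-learn might look as follows; the toy data `X` and the choice of three clusters are illustrative assumptions, not part of the article.

```python
# Minimal K-means sketch; the data X and n_clusters=3 are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))          # toy data: 100 points, 2 features

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)         # cluster index for each point
print(kmeans.cluster_centers_)         # the K centers minimizing within-cluster squared distance
```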
#### Benefits
- It is computationally efficient compared to hierarchical clustering and can be used to analyze large data sets.
- K-means is easier to understand and implement.
#### Drawbacks
- It is less flexible than hierarchical clustering because it constrains the user to specify the number of clusters beforehand, which may not be obvious in some situations.
- The result is not stable and can change from one run to another on the same data set because the cluster centers are initialized randomly.
- It is more sensitive to outliers because outliers pull the cluster means toward them.
- Both k-means and hierarchical clustering are incapable of handling categorical data directly and may not work well with data that is not continuous or has very large variance.
Despite its limitations, k-means clustering is still a popular method due to its ease of use and computational efficiency. It is frequently used as a reference point for comparing the performance of other clustering techniques.
### Model-based clustering
Both K-means and hierarchical clustering are driven by distances between data points (hierarchical clustering typically via a distance matrix). Model-based clustering, on the other hand, applies statistical techniques to identify clusters in the data. Below is the general process (a short code sketch follows the list):
- Decide the statistical model to use and choose the number of clusters.
- Fit the model on the data.
- Identify the clusters based on the model parameters.
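As a minimal sketch of this process, assuming a Gaussian mixture model fitted with scikit-learn (the article does not prescribe a specific model or library, and the data `X` is made up):

```python
# Minimal model-based clustering sketch; the data X and 3 components are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                   # fit the chosen statistical model to the data
labels = gmm.predict(X)      # clusters follow from the fitted model parameters
```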
#### Benefits
- Model-based clustering is more flexible than hierarchical clustering because it allows the use of different models to identify different types of clusters.
- It works better on data with complex shapes or structures.
#### Drawbacks
- It’s computationally more expensive than hierarchical clustering, especially for large data.
- It requires a better understanding of statistical modeling techniques since the choice of the model can affect the final result.
- Similarly to K-means clustering, it requires the number of clusters to be specified beforehand.
## Applications of Hierarchical Clustering
Hierarchical clustering has a variety of applications in our day-to-day life, including (but by no means limited to) biology, image processing, marketing, economics, and social network analysis.
### Biology
The clustering of DNA sequences is one of the biggest challenges in bioinformatics.
Biologists can leverage hierarchical clustering to study genetic relationships between organisms to classify those organisms into taxonomic groups. This is beneficial for quick analysis and visualization of the underlying relationships.
### Image processing
Hierarchical clustering can be performed in image processing to group similar regions or pixels of an image in terms of color, intensity, or other features. This can be useful for further tasks such as image segmentation, image classification, and object recognition.
### Marketing
Marketing specialists can use hierarchical clustering to build a hierarchy of customer segments based on purchasing habits, which supports better marketing strategies and product recommendations. For instance, different retail products can be recommended to customers depending on whether they are low, medium, or high spenders.
### Social network analysis
Social networks are a great source of valuable information when exploited efficiently. Hierarchical clustering can be used to identify groups or communities and to understand their relationships to each other and the structure of the network as a whole.
## The Hierarchical Clustering Algorithm
In this section, we will look at three main concepts: the steps of the hierarchical clustering algorithm, the two types of hierarchical clustering (agglomerative and divisive), and some techniques for choosing the right distance measure.
### Steps involved in the hierarchical clustering algorithm
The hierarchical clustering algorithm uses distance measures to generate clusters. This generation process involves the following main steps:
![Creating Hierarchical Clustering Algorithm](attachments/Creating_Hierarchical_Clustering_Algorithm_0e75661cab.png)
Preprocess the data by removing missing values and applying any additional steps needed to make the data as clean as possible. This preprocessing step is common to most machine learning tasks.
1. Compute the distance matrix containing the distance between each pair of data points using a particular distance metric such as Euclidean distance, Manhattan distance, or cosine similarity. Euclidean distance is the usual default.
2. Merge the two clusters that are the closest in distance.
3. Update the distance matrix with regard to the new clusters.
4. Repeat steps 2 and 3 until all the clusters have been merged into a single cluster (see the code sketch after this list).
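A minimal sketch of these steps with SciPy follows; the toy data `X` and the choice of three flat clusters are illustrative assumptions.

```python
# Sketch of the steps above with SciPy; the toy data X is an illustrative assumption.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))                 # 20 points with 2 features

# Step 1: pairwise distances (condensed form), Euclidean by default
distances = pdist(X, metric="euclidean")

# Steps 2-3, repeated: linkage() merges the two closest clusters and
# updates the distances until a single cluster remains
Z = linkage(distances, method="average")

# Cut the resulting tree to obtain, for example, 3 flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```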
### Hierarchical clustering examples
We can consider agglomerative and divisive clustering as mirrors of each other. Let’s have a better look at how each one operates, along with a hierarchical clustering example and graphical visualization.
#### Agglomerative hierarchical clustering
This first scenario corresponds to the approach explained above. It starts by considering each observation as a singleton cluster (a cluster with only one data point), then iteratively merges clusters until only one cluster is obtained. This process is also known as the bottom-up approach.
As shown in the illustration below:
- We start by considering each animal to be its own singleton cluster.
- Then we generate three different clusters from those unique animals based on their similarities:
- **Birds**: Eagle and Peacock
- **Mammals**: Lion and Bear
- **More than three legs**: Spider and Scorpion.
- We repeat the merging process to create the vertebrate cluster by combining the two most similar clusters: Birds and Mammals.
- After this step, the remaining two clusters, **Vertebrate** and **More than three legs**, are merged to create a single Animals cluster.
![Dendrogram of Agglomerative Clustering Approach](attachments/Dendrogram_of_Agglomerative_Clustering_Approach_4eba3586ec.png)
*Dendrogram of Agglomerative Clustering Approach*
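For illustration, a dendrogram like the one above could be produced with SciPy and Matplotlib; the animal "feature" matrix below is a made-up stand-in for real measurements, not data from the article.

```python
# Toy agglomerative clustering of the animals above; the features are invented.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

animals = ["Eagle", "Peacock", "Lion", "Bear", "Spider", "Scorpion"]
# invented features: [number of legs, has feathers, body size (arbitrary units)]
features = np.array([
    [2, 1, 5.0],    # Eagle
    [2, 1, 4.0],    # Peacock
    [4, 0, 9.0],    # Lion
    [4, 0, 10.0],   # Bear
    [8, 0, 1.0],    # Spider
    [8, 0, 1.5],    # Scorpion
])

Z = linkage(features, method="average")      # bottom-up merging of singleton clusters
dendrogram(Z, labels=animals)
plt.show()
```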
#### Divisive clustering
Divisive clustering, on the other hand, is a top-down approach: it starts by considering all the data points as a single cluster, then iteratively splits clusters until each data point stands alone.
From this divisive approach graphic:
- We notice that the whole animal dataset is considered as a single block.
- Then, we divide that block into two clusters: Vertebrate and More than three legs.
- The division process is iteratively applied to the previously created clusters until we get unique animals.
![Dendrogram of Divisive Clustering Approach](attachments/Dendrogram_of_Divisive_Clustering_Approach_8623354c7b.png)
*Dendrogram of Divisive Clustering Approach*
### Choosing the right distance measure
Your choice of distance measure is a critical step in clustering, and it depends on the problem you’re trying to solve. Consider the following scenario: we could cluster students based on any number of attributes, such as their:
- Country of origin
- Gender, either male or female
- Previous academic background.
These are all valid clusters but differ in meaning.
Even though Euclidean distance is the default in most clustering software, other distance measures exist, such as Manhattan distance, Canberra distance, Pearson or Spearman correlation, and Minkowski distance.
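As a small sketch, SciPy's `pdist` can compute several of these measures by changing the `metric` argument; the tiny matrix `X` below is illustrative. In SciPy, `cityblock` is the Manhattan distance and `correlation` is one minus the Pearson correlation.

```python
# Comparing distance measures on a toy matrix; X is an illustrative assumption.
import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [5.0, 1.0, 0.0]])

for metric in ("euclidean", "cityblock", "canberra", "correlation", "minkowski"):
    print(metric, pdist(X, metric=metric))
```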
### How to measure clusters before merging them
The distances mentioned above are defined between individual items. In this section, we will cover three standard ways (not an exhaustive list) of measuring how close two clusters are before merging them: (1) single linkage, (2) complete linkage, and (3) average linkage.
#### Single linkage
From all the pairwise distances between the items in the two clusters C1 and C2, single linkage takes the minimum distance as the distance between the clusters.
***Distance (C1, C2) = Min { d(i, j), where item i is within C1, and item j is within C2}***
Out of all the pairs of items from the two clusters, the ones highlighted in green have the minimum distance.
![Single linkage illustration](attachments/Single_linkage_illustration_ea623e18a4.png)
*Single linkage illustration*
#### Complete linkage
From all the pairwise distances between the items in the two clusters C1 and C2, complete linkage takes the **maximum** distance as the distance between the clusters.
***Distance (C1, C2) = Max { d(i, j), where item i is within C1, and item j is within C2}***
Out of all the pairs of items from the two clusters, the ones highlighted in green have the maximum distance.
![Complete linkage illustration](attachments/Complete_linkage_illustration_982fb26b4c.png)
*Complete linkage illustration*
#### Average linkage
In the average linkage clustering, the distance between two given clusters C1 and C2 corresponds to the average distances between all pairs of items in the two clusters.
***Distance (C1, C2) = Sum{ d(i, j) } / Total Number of distances***
![Average linkage illustration](attachments/Average_linkage_illustration_edeec7a09e.png)
*Average linkage illustration*
Then the average linkage clustering is performed as follows:
***Distance (C1, C2) = [ d(a,j) + d(a,h) + d(a,n) + d(d,j) + d(d,h) + d(d,n) ] / Total number of distances***, where the total number of distances = 6.
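A small worked example ties the three linkage criteria together; the two toy clusters `C1` and `C2` below are made-up coordinates, not data from the article.

```python
# Single, complete, and average linkage on two toy clusters (invented coordinates).
import numpy as np
from scipy.spatial.distance import cdist

C1 = np.array([[0.0, 0.0], [1.0, 0.0]])               # items of cluster C1
C2 = np.array([[4.0, 0.0], [5.0, 1.0], [6.0, 0.0]])   # items of cluster C2

D = cdist(C1, C2)             # all pairwise distances between C1 and C2 (2 x 3 = 6 values)

print("single  :", D.min())   # minimum pairwise distance
print("complete:", D.max())   # maximum pairwise distance
print("average :", D.mean())  # sum of the 6 distances divided by 6
```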