
The Whole God D_mn thing

In this wiki page I will explain the whole god d_mn thing.

What is cluster analysis and why should I give a sh_t?

Cluster analysis is the process of grouping objects by similarity. "But what makes two objects similar?", I hear you ask. Ah, I respond. That's the b_tch of it: there's no single answer that always works. At least not that anybody has found yet (if you have any ideas, hit me up). Instead, you're going to have to think about your problem, consider what properties you care about, quantify them, and select the algorithm that serves your purpose best. I'm so sorry.

In my opinion, cluster analysis libraries are about to become extremely popular because of Machine Learning. A fundamental component of ML is grouping similar nodes/memories and determining what makes a node/memory similar to another. That's Cluster Analysis, baby. The year is 2023 (as of writing). Everyone and their grandma is solving every problem with ML/LLMs (Large Language Models), whether or not they should be. Given that, the very least they can do is have someone else do their clustering for them. You wouldn't write your own sorting, would you? You suck at sorting. And what is cluster analysis if not sorting with extra steps? Sorting in N dimensions, then grouping the sorted objects. That sounds hard. Let me do it for you.
(If anything goes wrong let me know so I can fix it. Hell, I'll even throw in a full refund).

I'm still confused. How do the algorithms serve different purposes?

Of course you are. That's okay, I can keep talking.

There are 4 main types of clustering algorithms.

  • Centroid Based
  • Density Based
  • Grid Based
  • Hierarchical

Clustering algorithms generally fit into these categories, and which category an algorithm belongs to influences how it treats similarity. That said, please don't get mad when they don't fit into a precise category. They often won't.

Centroid Based

A centroid-based clustering algorithm groups each point with the centroid it is closest to. The resulting clusters form a Voronoi diagram, which looks pretty cool unless you have trypophobia. In other words, the clusters form a puzzle of convex polygons, which means the bounds of a cluster can be easily defined. There's a big problem though: centroid based clusters will always be convex polygons, even if the data says otherwise. This is bad.
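If that sounds abstract, here's a minimal sketch of the assignment step that produces those convex regions (plain C#, not ClusterF_ck's API): every point just joins whichever centroid it's closest to.

// Returns the index of the nearest centroid to a 2D point.
// Illustrative helper only; not part of the library.
static int NearestCentroid((double X, double Y) point, (double X, double Y)[] centroids)
{
    int best = 0;
    double bestDistSq = double.MaxValue;
    for (int i = 0; i < centroids.Length; i++)
    {
        double dx = point.X - centroids[i].X;
        double dy = point.Y - centroids[i].Y;
        double distSq = dx * dx + dy * dy; // squared distance is enough for comparison
        if (distSq < bestDistSq)
        {
            bestDistSq = distSq;
            best = i;
        }
    }
    return best;
}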

Density Based

Density based clustering determines clusters based on the density of points in a region. The big flaw of density-based methods is that they expect some clear drop in density to define the end of a cluster. One situation they do not allow is overlapping clusters (and yes, those do exist, unfortunately). Density based clustering can also result in extremely abstract looking shapes. Therefore, the bounds of these clusters are often hard to define, especially compared to the convex polygons or Voronoi diagrams of centroid based clustering.

Grid Based

Grid based clustering has the most easily defined bounds, but also suffers the most from the Arbitrary Bullsh_t factor (ABSF). Grid based clustering depends on splitting the points into cells following a grid pattern. Here's a 1D example:

0 1 2|3 4 5|6 7 8|9

Here the numbers {0,1,2,3,4,5,6,7,8,9} have formed the clusters {{0,1,2},{3,4,5},{6,7,8},{9}}. This does effectively split the points, and grouping 0 with 1 is appropriate, but the difference between 2 and 3 is just as small and they got split anyway. Even more tragic: 0 and 2 are less similar than 2 and 3, yet 0 and 2 share a cluster while 2 and 3 don't. The ABSF has struck.
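To see the ABSF in code, here's a minimal sketch of 1D grid clustering with a cell width of 3 (plain C#, not ClusterF_ck's API), reproducing the example above:

using System;
using System.Collections.Generic;

// Buckets 1D points into grid cells of a fixed width.
// Points that land in the same cell form a cluster, no matter how close they sit to a cell's edge.
static Dictionary<int, List<double>> GridCluster(IEnumerable<double> points, double cellWidth)
{
    var cells = new Dictionary<int, List<double>>();
    foreach (double p in points)
    {
        int cell = (int)Math.Floor(p / cellWidth);
        if (!cells.TryGetValue(cell, out var cluster))
            cells[cell] = cluster = new List<double>();
        cluster.Add(p);
    }
    return cells;
}

// GridCluster(new double[] {0,1,2,3,4,5,6,7,8,9}, 3)
// => {0,1,2}, {3,4,5}, {6,7,8}, {9}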

Hierarchical

Hierarchical clustering creates a hierarchy where the bottom levels are smaller, likely denser clusters, and those layers build upward until the top layer likely contains every cluster below it. This can be used to reduce the ABSF, because you can use the cluster above or below the current layer of the hierarchy if the clusters in that layer are more similar than other high-level clusters. The word "similar" is carrying a lot of weight here, but it will be described in greater detail when we get to cluster evaluation.

Distribution Based Clustering (The 5th one that I lied to you about)

Distribution Based clusters have a centroid, and then have a kernel/distribution around them describing the likely density of points belonging to that cluster as they get further from the centroid. Some will say this constitutes another category. That's cool. But, in my opinion this is a sub-category of centroid based clustering that takes little sparkles of inspiration from density based clustering, and this is my wiki. So it will not be joining the big 4.

How do I calculate the centroid? Is that even possible?

The answer to both is: "It depends. Now get ready for a roller coaster."

Euclidean Space

Let's assume we're in Euclidean space. In Euclidean space the centroid can be calculated by averaging the points, handling each dimension separately.

The N-dimensional formula, where $m$ is the number of points and $i_{n}$ is the $n$-th component of point $i$:

$$C_{n} = \frac{1}{m} \sum_{i=1}^{m} i_{n}$$

And an example in layman's terms:

2D Euclidean Centroid calculation example

Points: {(4,8), (2,5), (3,2)}

Centroid X = (4 + 2 + 3)/3
           = 3

Centroid Y = (8 + 5 + 2)/3
           = 5

Centroid   = (3,5)
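The same calculation in code, as a minimal sketch (plain C#, not ClusterF_ck's API):

// Computes the Euclidean centroid by averaging the points, one dimension at a time.
static double[] Centroid(double[][] points)
{
    int dims = points[0].Length;
    double[] centroid = new double[dims];
    foreach (double[] p in points)
        for (int n = 0; n < dims; n++)
            centroid[n] += p[n];
    for (int n = 0; n < dims; n++)
        centroid[n] /= points.Length;
    return centroid;
}

// Centroid(new[] { new double[] {4,8}, new double[] {2,5}, new double[] {3,2} })
// => {3, 5}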

Backtracking a bit, in Euclidean space the distance between points can be found using Euclidean Distance (A.K.A. Pythagorean Distance).

The N-dimensional formula:

$$\left\|p-q\right\| = \sqrt{(p_{1} - q_{1})^{2} + (p_{2} - q_{2})^{2} + \cdots + (p_{n} - q_{n})^{2}}$$

And again, an example in layman's terms:

2D Euclidean Distance calculation example

A = (8,1)
B = (2,9)

  A-B   = (8-2, 1-9)
        = (6,-8)

||A-B|| = sqrt(6^2 + (-8)^2)
        = sqrt(36 + 64)
        = 10
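And the distance formula in code, again as a minimal sketch (plain C#, not ClusterF_ck's API):

using System;

// Computes the Euclidean (Pythagorean) distance between two N-dimensional points.
static double EuclideanDistance(double[] p, double[] q)
{
    double sumOfSquares = 0;
    for (int n = 0; n < p.Length; n++)
    {
        double diff = p[n] - q[n];
        sumOfSquares += diff * diff;
    }
    return Math.Sqrt(sumOfSquares);
}

// EuclideanDistance(new double[] {8,1}, new double[] {2,9}) => 10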

Non-Euclidean space

Now the sh_t hits the fan.

In Non-Euclidean spaces the Centroid of a set of points can be calculated, but not necessarily with the formula above.

One example of these contrarian bastards is the Hue Saturation Value (HSV) Color Space.

(HSV cone diagram)

In HSV, the color red is hsv(0°,1,1) and the color cyan is hsv(180°,1,1). So following the Euclidean centroid formula you get hsv(90°,1,1), which is entirely incorrect. This happens because the hue determines the rotation around the circular top of the cone. If you naively average that value, you'll go around the arc bumping into all kinds of vibrant and irrelevant colors in your quest to find the centroid of two colors.

Instead, if you use a polar coordinate formula over the Hue and Saturation channels, and take a plain average of the Values, you get the average as depicted visually on the cone. As a result, the centroid would be hsv(0°,0,1), which sits between Red and Cyan on the HSV cone.

The HSV Centroid Formula:

Hue:

$$C_{H} = \arctan\left( \frac{\frac{1}{m}\sum_{i=1}^{m} i_{S}\sin(i_{H})}{\frac{1}{m}\sum_{i=1}^{m} i_{S}\cos(i_{H})} \right)$$

Saturation:

$$C_{S} = \sqrt{\left(\frac{1}{m}\sum_{i=1}^{m} i_{S}\sin(i_{H})\right)^{2}+\left(\frac{1}{m}\sum_{i=1}^{m} i_{S}\cos(i_{H})\right)^{2}}$$

Value:

$$C_{V} = \frac{1}{m}\sum_{i=1}^{m} i_{V}$$
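Here's that formula as a minimal sketch (plain C#, not ClusterF_ck's API). Math.Atan2 stands in for the arctangent so the hue lands in the correct quadrant.

using System;

// Computes the centroid of HSV colors (hue in degrees, saturation and value in [0,1]).
// Hue and saturation are treated as polar coordinates and averaged in Cartesian space;
// value is a plain average.
static (double H, double S, double V) HsvCentroid((double H, double S, double V)[] colors)
{
    double x = 0, y = 0, v = 0;
    foreach (var c in colors)
    {
        double radians = c.H * Math.PI / 180;
        x += c.S * Math.Cos(radians);
        y += c.S * Math.Sin(radians);
        v += c.V;
    }
    x /= colors.Length;
    y /= colors.Length;
    v /= colors.Length;

    double hue = Math.Atan2(y, x) * 180 / Math.PI;
    if (hue < 0) hue += 360;
    double saturation = Math.Sqrt(x * x + y * y);
    return (hue, saturation, v);
}

// HsvCentroid(new[] { (0.0, 1.0, 1.0), (180.0, 1.0, 1.0) }) => (0, 0, 1), i.e. white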

What a pain. Why do I want this?

Sometimes the most accurate results do not lie in Euclidean space.

For example, in this case, HSV is more accurate than RGB! In RGB, Red is rgb(255,0,0) and Cyan is rgb(0,255,255). Therefore their centroid is rgb(128,128,128), which is missing the bright value shared by both Red and Cyan. HSV caught it though.

Metric Space

In metric space, it is no longer guaranteed that centroids can be calculated. The only property a metric space defines is the distance between pairs of points. This space isn't n-dimensional, it's non-dimensional! That's because points in a metric space don't have positions. If you think this is confusing, you're actually wrong. Yup. You're a smart cookie, and you know graph theory. Every single graph you've ever looked at is just a definition of a metric space. If the graph was weighted, the weight of each edge is the distance between its endpoints. If the graph was unweighted, every distance is one.
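If that still feels weird, here's a tiny sketch (plain C#, not ClusterF_ck's API) of a metric space defined by a weighted graph: the points are just vertex indices, and the only thing the space knows is pairwise distance.

// edgeWeights[a, b] holds the weight of the edge between vertices a and b.
// That weight IS the distance; the vertices never get coordinates.
static double Distance(double[,] edgeWeights, int a, int b) => edgeWeights[a, b];

// For an unweighted graph, every edge weight would simply be 1.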

I don't like this. Can I just use a library?

My friend, you have come to the right place. ClusterF_ck has it all[citation needed]. ClusterF_ck is also bug free![citation needed], so that's good.

Okay, but how do I use it?

Hopefully, with ease. All clustering algorithms share a very similar interface.

// Replace <Algo> with the name of the algorithm you want to use
using ClusterF_ck.<Algo>;

// Replace <Algo> with the name of the algorithm you want to use
// T:
//    The type of your points.
//
// TSpace:
//    An implementation of one or more ISpace<T> interfaces, as required by the method.
//
// points:
//    The collection of points to cluster.
//
// ...:
//    Any arguments specific to that clustering method.
//
// space:
//    If your space depends on a graph, matrix, or any other object instance, that instance is passed in here.
//    Otherwise, a default instance of the TSpace type will be created so its instance methods can be accessed.
// 
<Algo>.Cluster<T, TSpace>(points, ..., TSpace space = default);

First let's talk about the base class Cluster<T>, which is returned as either a List<Cluster<T>> or a Cluster<T>[] from every clustering method ClusterF_ck implements. T is the type of the points clustered. In addition to the abstract base class, there are 3 different interfaces that an algorithm's clusters may implement.

  • ICentroidCluster<T>
  • IPointsCluster<T>
  • IWeightedCluster

ICentroidCluster<T>

Adds a property Centroid representing the centroid of the cluster.

/// <summary>
/// Gets the center point of all points in the <see cref="ICentroidCluster{T}"/>.
/// </summary>
T? Centroid { get; }

IPointsCluster<T>

Adds a property Points containing all points in the cluster.

/// <summary>
/// Gets an <see cref="IReadOnlyCollection{T}"/> of points in the cluster.
/// </summary>
IReadOnlyCollection<T> Points { get; }

IWeightedCluster

Adds a property Weight, which represents the collective weight of points in the cluster for weighted algorithms.

/// <summary>
/// Gets the weight of the cluster.
/// </summary>
double Weight { get; }

Enough abstractions. Let's get down to brass tacks.

K-Means

Here's a sample of how to run K-Means clustering on the points {0, 1, 8, 10, 12, 22, 24}.

using ClusterF_ck.Spaces;

// Make an alias "KM" for the static class KMeans, which contains the KMeans clustering methods.
using KM = ClusterF_ck.KMeans.KMeans;

int k = 3;
double[] points = new double[] {0, 1, 8, 10, 12, 22, 24};
KMeansCluster<double, DoubleShape>[] clusters = KM.Cluster<double, DoubleShape>(points, k);
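Assuming KMeansCluster<T, TShape> implements the ICentroidCluster<T> and IPointsCluster<T> interfaces described above (check the actual type if you're unsure; this is a hedged sketch, not gospel), reading the results could look something like this:

using System;

foreach (var cluster in clusters)
{
    // Centroid would come from ICentroidCluster<double>, Points from IPointsCluster<double>.
    Console.WriteLine($"Centroid: {cluster.Centroid}, Points: {string.Join(", ", cluster.Points)}");
}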

I did everything right, but it still sucks.

Cluster analysis is f_cky like that. Let's give that disappointment a unit. Welcome to ClusterF_ck.Evaluation.

Numerically defining our failure

There are two primary types of cluster evaluation.

  • Internal
  • External

Internal

Internal evaluation uses only the data that was clustered and the resulting clusters to evaluate the damage. Most of these methods come down to scoring well when similarity within clusters is maximized and similarity between clusters is minimized.
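For a taste, here's one of the simplest internal scores, the within-cluster sum of squares (WCSS), as a generic 1D sketch (plain C#, not necessarily what ClusterF_ck.Evaluation exposes):

using System.Collections.Generic;
using System.Linq;

// Sums the squared distance between each point and its own cluster's centroid.
// Lower is better: tight clusters score low, sprawling clusters score high.
static double WithinClusterSumOfSquares(IEnumerable<IEnumerable<double>> clusters)
{
    double total = 0;
    foreach (var cluster in clusters)
    {
        double[] points = cluster.ToArray();
        double centroid = points.Average();
        total += points.Sum(p => (p - centroid) * (p - centroid));
    }
    return total;
}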

External

External evaluation uses additional data that was not provided in clustering to evaluate the accuracy of the results. Often this means providing "ground truth" examples that some human made, referred to as benchmarks. The big flaw of external evaluation is that the data may contain anomalies not documented in a ground truth case. Given an anomaly, all the evaluation can do is guess. This results in a type 1 error, so at least you'll see that there's a problem, and you can choose to ignore it.
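For flavor, here's purity, one of the simplest external scores, as a generic sketch (plain C#, not necessarily what ClusterF_ck.Evaluation exposes): each cluster is credited for the points that match its most common ground-truth label.

using System.Collections.Generic;
using System.Linq;

// clusters: each inner collection holds the ground-truth labels of the points placed in that cluster.
// Returns the fraction of points whose cluster's majority label matches their own, in [0, 1].
static double Purity(IEnumerable<IEnumerable<string>> clusters)
{
    int total = 0, matched = 0;
    foreach (var cluster in clusters)
    {
        var labels = cluster.ToList();
        if (labels.Count == 0) continue;
        total += labels.Count;
        matched += labels.GroupBy(l => l).Max(g => g.Count());
    }
    return total == 0 ? 0 : (double)matched / total;
}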

Cluster Tendency (I lied to you again!)

If you've been making tweaks and nothing seems to work, it might be time to ask yourself: Does the data even want to be clustered? Some data isn't meant to be clustered. Try asking your data, "Hey there, data. I'm interested in clustering you. Are you okay with that?". Your data responds... "Bro, I'm just a set of pseudo-random points you generated for testing purposes. I mean nothing, and you know it. Stop hurting me. Stop hurting you. Stop hurting us". Listen to your data. Cluster Tendency puts a numerical value on how much your data wants to be clustered.
