Skip to content

The Whole God D_mn thing

Adam Dernis edited this page Aug 29, 2023 · 25 revisions

The Whole God D_mn thing

In this wiki page I will explain the whole god d_mn thing.

What is cluster analysis and why should I give a sh_t?

Cluster analysis is the process of grouping objects by similarities. "But what makes two object similar", I hear you ask. Ah, I respond. That's the b_tch of it, there's no single answer that always works. At least not that anybody has found yet (If you have any ideas hit me up). Instead, you're going to have to think about your problem, consider what properties you care about, quantify them, and select an algorithm that serves your purpose best. I'm so sorry.

In my opinion, cluster analysis libraries are about to become extremely popular because of Machine Learning. A fundamental component of ML is grouping similar nodes/memories and determining what makes a node/memory similar to another. That's Cluster Analysis, baby. The year is 2023 (as of writing). Everyone and their grandma is solving every problem with ML/LLMs (Large Language Models), whether or not they should be. Given that, the very least they can do is have someone else do their clustering for them. You wouldn't write your own sorting, would you? You suck at sorting. And what is cluster analysis if not sorting with extra steps. Sorting in N-dimensions, then grouping the sorted objects. That sounds hard. Let me do it for you.
(If anything goes wrong let me know so I can fix it. Hell, I'll even throw in a full refund).

I'm still confused. How do the algorithms serve different purposes?

Of course you are. That's okay, I can keep talking.

There are 4 main types of algorithms.

  • Centroid Based
  • Density Based
  • Grid Based
  • Hierarchical

Clustering Algorithms generally fit into these categories and which category an algorithm belongs to influences how it treats these similarities. That said, please don't get mad when they don't fit into a precise category. They often won't.

Centroid Based

Points grouped by a centroid-based clustering algorithm are a similar distance from their cluster's centroid.

Density Based

Grid Based

Hierarchical

Distribution Based Clustering (The 5th one that I lied to you about)

Distribution Based clusters have a centroid, and then have a kernel/distribution around them describing the likely density of points belonging to that cluster has they get further from the centroid. Some will say this constitutes another category. That's cool. But, in my opinion this is a sub-category of centroid based clustering that takes little sparkles of inspiration from density based clustering, and this is my wiki. So it will not be joining the big 4.

How do I calculate the centroid, is that even possible?

The answer to both is: "It depends. Now get ready for a roller coaster."

Euclidean Space

Let's assume we're in Euclidean space. In Euclidean space the centroid can be calculated by taking the average of each point, handling each dimension separately.

The N-dimensional formula:

$$C_{n} = \frac{1}{m} \sum_{i=0}^{m} n_{i}$$

And an example in layman's terms:

2D Euclidean Centroid calculation example

Points: {(4,8), (2,5), (3,2)}

Centroid X = (4 + 2 + 3)/3
           = 3

Centroid Y = (8 + 5 + 2)/3
           = 5

Centroid   = (3,5)

Back tracking a bit, in Euclidean space the distance between points can be found using Euclidean Distance (A.K.A. Pythagorean Distance)

The N-dimensional formula:

$$\left\|p-q\right\| = \sqrt{(p_{1} - q_{1})^{2} + (p_{2} - q_{2})^{2} + \cdots + (p_{n} - q_{n})^{2}}$$

And again, an example in layman's terms:

2D Euclidean Distance calculation example

A = (8,1)
B = (2,9)

  A-B   = (8-2, 1-9)
        = (6,-8)

||A-B|| = sqrt(6^2 + (-8)^2)
        = sqrt(36 + 64)
        = 10

Non-Euclidean space

Now the sh_t hits the fan.

In Non-Euclidean spaces the Centroid of a set of points can be calculated, but not necessarily with the formula above.

One example of these contrarian bastards is the Hue Saturation Value (HSV) Color Space.

HSV Cone

In HSV, the value of the color red is hsv(0°,1,1) hsv(0°,1,1) and the color cyan is hsv(180°,1,1) hsv(180°,1,1). So following the Euclidean centroid formula you get this hsv(90°,1,1) hsv(90°,1,1), which is entirely incorrect. This happens because the hue determines the rotation around the circular top of the cone. If you naively average that value, you'll go around the arc pumping into all kinds of vibrant and irrelevant colors in your quest to find the centroid of two colors.

Instead, the proper centroid would be hsv(0°,0,1) hsv(0°,0,1), which is located between Red and Cyan on the HSV Cone.

What a pain. Why do I want this?

Sometimes the most accurate results do not lie in Euclidean space.

For example, in this case, HSV is more accurate than RGB! In RGB, Red is rgb(255,0,0) rgb(255,0,0) and Cyan is rgb(0,255,255) rgb(0,255,255). Therefore their centroid is rgb(128,128,128) rgb(128,128,128), which is missing that bright value shared by both of Red and Cyan. HSV caught it though.

Metric Space

In metric space, it is no longer guaranteed that centroids can be calculated. In fact, it's no longer guaranteed that a point as an absolute position. The only property

I don't like this. Can I just use a library?

My friend, you have come to the right place. ClusterF_ck has it all[citation needed]. ClusterF_ck is also bug free![citation needed], so that's good.

Clone this wiki locally