Unexpected results with k-means on binary data #213

RenatoGeh · 2021-03-24T20:00:09Z

Hi,

I'm clustering binary data, and found that the package is giving me some weird results. Here's a minimal example.

using Clustering, Random, Distances

"Function simply returns which data points are in each cluster."
function extract(D::Matrix, R::ClusteringResult)::Vector{Matrix}
  A = assignments(R)
  k = nclusters(R)
  I = [Vector{Int}() for i ∈ 1:k]
  for i ∈ 1:length(A) 
    push!(I[A[i]], i)
  end
  return [D[I[i],:] for i ∈ 1:k]
end


M = Matrix{Bool}([0 0 1; 1 1 0; 1 1 1; 0 1 1; 1 1 1; 0 0 0]);
Random.seed!(73);
display(M)
T = reshape(M, (size(M)[[2, 1]])); # Transpose
R = kmeans(T, 2; maxiter = 100, distance = Hamming(), display = :iter)
for (i, X) ∈ enumerate(extract(M, R)) println(i, ":"); display(X) end

This gives the following output:

6×3 Array{Bool,2}:
 0  0  1
 1  1  0
 1  1  1
 0  1  1
 1  1  1
 0  0  0
  Iters               objv        objv-change | affected
-------------------------------------------------------------
      0       5.000000e+00
┌ Warning: The clustering cost increased at iteration #1
└ @ Clustering ~/.julia/packages/Clustering/tt9vc/src/kmeans.jl:188
      1       1.000000e+01       5.000000e+00 |        0
      2       1.000000e+01       0.000000e+00 |        0
K-means converged with 2 iterations (objv = 10)
1:
5×3 Array{Bool,2}:
 0  0  1
 1  1  0
 1  1  1
 0  1  1
 0  0  0
2:
1×3 Array{Bool,2}:
 1  1  1

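This looks wrong to me. Cluster 2 contains only the point (1, 1, 1), so its center should be (1, 1, 1). That center is at distance zero from data point 3, which is also (1, 1, 1), so point 3 should fall in cluster 2 as well. If we print the centers (`R.centers`, one center per column):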

3×2 Array{Float64,2}:
 0.4  1.0
 1.0  0.0
 0.4  1.0

So k-means gives cluster 2, which contains only the point (1, 1, 1), a center of (1, 0, 1). Am I misinterpreting how the package behaves? Maybe the centers are not being updated in the correct order, which would produce this result?
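
One thing worth checking in the example above is the line commented `# Transpose`. In Julia, `reshape` keeps the column-major element order and only changes the dimensions, so reshaping a 6×3 matrix to 3×6 does not transpose it. The sketch below uses only Base functions. It takes the cluster memberships from the seed-73 output and recomputes the means from the columns of the reshaped matrix, which appears to reproduce the printed centers. `mean_col` is a hypothetical helper introduced for illustration, not part of Clustering.jl.

```julia
M = Bool[0 0 1; 1 1 0; 1 1 1; 0 1 1; 1 1 1; 0 0 0]

# What the example passes to kmeans: the same elements in the same
# column-major order, viewed as 3×6. Its columns are NOT the rows of M.
T = reshape(M, (size(M, 2), size(M, 1)))

# An actual transpose: column j of P is row j of M.
P = permutedims(M)

println(T == P)          # false
println(Int.(T[:, 5]))   # [1, 0, 1]
println(Int.(P[:, 5]))   # [1, 1, 1]

# Hypothetical helper: coordinate-wise mean of the selected columns of X.
mean_col(X, idx) = vec(sum(X[:, idx], dims = 2)) ./ length(idx)

# Memberships read off the seed-73 output: cluster 1 = {1, 2, 3, 4, 6}, cluster 2 = {5}.
println(mean_col(T, [1, 2, 3, 4, 6]))   # [0.4, 1.0, 0.4]
println(mean_col(T, [5]))               # [1.0, 0.0, 1.0]
```

If the intent is one data point per column, `permutedims(M)` gives that layout. Whether the other symptoms described here persist after that change is not shown by this sketch.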

Another issue is that k-means should produce k non-empty clusters, yet the package may end up with fewer than k. Take the same example as above, but change the seed to 999, and we get the following output:

6×3 Array{Bool,2}:
 0  0  1
 1  1  0
 1  1  1
 0  1  1
 1  1  1
 0  0  0
  Iters               objv        objv-change | affected
-------------------------------------------------------------
      0       3.000000e+00
┌ Warning: The clustering cost increased at iteration #1
└ @ Clustering ~/.julia/packages/Clustering/tt9vc/src/kmeans.jl:188
      1       7.000000e+00       4.000000e+00 |        2
      2       7.000000e+00       0.000000e+00 |        2
K-means converged with 2 iterations (objv = 7)
1:
6×3 Array{Bool,2}:
 0  0  1
 1  1  0
 1  1  1
 0  1  1
 1  1  1
 0  0  0
2:
0×3 Array{Bool,2}

Both of these issues occur with the Hamming and Jaccard distances.
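
A possible contributing factor to the cost-increase warnings: the centers are coordinate-wise means, which are generally fractional, while a Hamming-style distance counts the positions that differ. A binary point then mismatches every fractional coordinate of its center. The sketch below assumes that behavior. `mismatches` is a hypothetical stand-in written for illustration, not the Distances.jl implementation. It sums the mismatch counts over the seed-73 clustering, using the columns of `T` exactly as constructed in the example.

```julia
# Hypothetical stand-in for a Hamming-style distance: number of differing positions.
mismatches(a, b) = count(a .!= b)

M = Bool[0 0 1; 1 1 0; 1 1 1; 0 1 1; 1 1 1; 0 0 0]
T = reshape(M, (size(M, 2), size(M, 1)))   # same construction as in the example

# Centers and assignments read off the seed-73 output.
centers = [[0.4, 1.0, 0.4], [1.0, 0.0, 1.0]]
assignment = [1, 1, 1, 1, 2, 1]

# Total cost: sum of mismatch counts between each column and its assigned center.
cost = sum(mismatches(T[:, i], centers[assignment[i]]) for i in 1:size(T, 2))
println(cost)   # 10
```

The total agrees with the reported `objv = 10`. A mean update is only guaranteed to lower the squared Euclidean cost, so an increase under a mismatch-counting distance would not be surprising.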

alyst added the bug label Jul 31, 2021

alyst added the help wanted label Apr 8, 2023
