Unexpected results with k-means on binary data #213

Open
RenatoGeh opened this issue Mar 24, 2021 · 0 comments

Hi,

I'm clustering binary data and found that the package gives some unexpected results. Here's a minimal example.

using Clustering, Random, Distances

"Function simply returns which data points are in each cluster."
function extract(D::Matrix, R::ClusteringResult)::Vector{Matrix}
  A = assignments(R)
  k = nclusters(R)
  I = [Vector{Int}() for i in 1:k]
  for i in 1:length(A)
    push!(I[A[i]], i)
  end
  return [D[I[i],:] for i in 1:k]
end


M = Matrix{Bool}([0 0 1; 1 1 0; 1 1 1; 0 1 1; 1 1 1; 0 0 0]);
Random.seed!(73);
display(M)
T = reshape(M, (size(M)[[2, 1]])); # Transpose
R = kmeans(T, 2; maxiter = 100, distance = Hamming(), display = :iter)
for (i, X) in enumerate(extract(M, R)) println(i, ":"); display(X) end

This gives the following output:

6×3 Array{Bool,2}:
 0  0  1
 1  1  0
 1  1  1
 0  1  1
 1  1  1
 0  0  0
  Iters               objv        objv-change | affected
-------------------------------------------------------------
      0       5.000000e+00
┌ Warning: The clustering cost increased at iteration #1
└ @ Clustering ~/.julia/packages/Clustering/tt9vc/src/kmeans.jl:188
      1       1.000000e+01       5.000000e+00 |        0
      2       1.000000e+01       0.000000e+00 |        0
K-means converged with 2 iterations (objv = 10)
1:
5×3 Array{Bool,2}:
 0  0  1
 1  1  0
 1  1  1
 0  1  1
 0  0  0
2:
1×3 Array{Bool,2}:
 1  1  1

This is weird to me: if cluster 2's center were (1, 1, 1), it would be at distance zero from data point 3 (1, 1, 1), so that point should fall under cluster 2 as well. If we print the centers:

3×2 Array{Float64,2}:
 0.4  1.0
 1.0  0.0
 0.4  1.0

So k-means is giving cluster 2, which contains only the point (1, 1, 1), a center of (1, 0, 1). Am I misinterpreting how the package behaves? Maybe the centers are not being updated in the correct order, which would give this weird result?
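
One thing that may be worth checking here, offered as a sketch rather than a confirmed diagnosis: in Julia, `reshape` keeps the column-major element order and only changes the shape, so `reshape(M, (3, 6))` is not the transpose of `M`. The snippet below uses only Base and the `Statistics` standard library (no Clustering.jl calls). It takes the cluster membership from the output above (point 5 alone in cluster 2) and compares the column means of the reshaped matrix with the printed centers.

```julia
using Statistics

M = Bool[0 0 1; 1 1 0; 1 1 1; 0 1 1; 1 1 1; 0 0 0]

# reshape reinterprets the same column-major data with a new shape.
T_reshape = reshape(M, (3, 6))
# permutedims swaps rows and columns, which is what a transpose needs.
T_transpose = permutedims(M)

println(T_reshape == T_transpose)   # false

# Membership taken from the output above: point 5 alone in cluster 2.
c1 = vec(mean(T_reshape[:, [1, 2, 3, 4, 6]], dims = 2))
c2 = vec(mean(T_reshape[:, [5]], dims = 2))
println(c1)   # [0.4, 1.0, 0.4]
println(c2)   # [1.0, 0.0, 1.0]
```

Since these means match the printed centers, the points actually clustered appear to be the columns of the reshaped matrix rather than the rows of `M`. Replacing the `reshape` call with `permutedims(M)` would be the first thing to try before rerunning `kmeans`.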

Another issue is that k-means should guarantee k non-empty clusters, yet the package may end up with fewer than k. Take the same example as above, but change the seed to 999, and we get the following output:

6×3 Array{Bool,2}:
 0  0  1
 1  1  0
 1  1  1
 0  1  1
 1  1  1
 0  0  0
  Iters               objv        objv-change | affected
-------------------------------------------------------------
      0       3.000000e+00
┌ Warning: The clustering cost increased at iteration #1
└ @ Clustering ~/.julia/packages/Clustering/tt9vc/src/kmeans.jl:188
      1       7.000000e+00       4.000000e+00 |        2
      2       7.000000e+00       0.000000e+00 |        2
K-means converged with 2 iterations (objv = 7)
1:
6×3 Array{Bool,2}:
 0  0  1
 1  1  0
 1  1  1
 0  1  1
 1  1  1
 0  0  0
2:
0×3 Array{Bool,2}

Both of these issues occur with the Hamming and Jaccard distances.
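
A possible reason for the cost-increase warning with these distances, sketched under assumptions: the printed center (0.4, 1.0, 0.4) is fractional, which suggests the centers are updated as coordinate-wise means. A Hamming-style distance counts every fractional coordinate as a mismatch against binary data, so a mean center can cost more than a binary one. The `hamming` helper below is a hand-written stand-in, not the Distances.jl implementation, and the points in `X` are hypothetical, chosen so that their mean equals that printed center.

```julia
using Statistics

# Hand-written stand-in for a Hamming distance: number of mismatching coordinates.
hamming(a, b) = count(a .!= b)

# Hypothetical binary points, one per column.
X = Bool[0 0 0 1 1;
         1 1 1 1 1;
         1 0 1 0 0]

mean_center = vec(mean(X, dims = 2))   # coordinate-wise mean: [0.4, 1.0, 0.4]
vote_center = mean_center .>= 0.5      # majority vote per coordinate: [0, 1, 0]

# Total Hamming cost of assigning every column of X to center c.
cost(c) = sum(hamming(X[:, j], c) for j in 1:size(X, 2))

println("cost with mean center:     ", cost(mean_center))   # 10
println("cost with majority center: ", cost(vote_center))   # 4
```

Under these assumptions the mean center costs 10 while the majority-vote center costs 4, so a mean update is not guaranteed to lower a Hamming objective.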

@alyst added the bug label on Jul 31, 2021