I'm clustering binary data, and found that the package is giving me some weird results. Here's a minimal example.
using Clustering, Random, Distances

"Function simply returns which data points are in each cluster."
function extract(D::Matrix, R::ClusteringResult)::Vector{Matrix}
    A = assignments(R)
    k = nclusters(R)
    I = [Vector{Int}() for i ∈ 1:k]
    for i ∈ 1:length(A)
        push!(I[A[i]], i)
    end
    return [D[I[i], :] for i ∈ 1:k]
end

M = Matrix{Bool}([0 0 1; 1 1 0; 1 1 1; 0 1 1; 1 1 1; 0 0 0]);
Random.seed!(73);
display(M)
T = reshape(M, (size(M)[[2, 1]])); # Transpose
R = kmeans(T, 2; maxiter = 100, distance = Hamming(), display = :iter)
for (i, X) ∈ enumerate(extract(M, R)) println(i, ":"); display(X) end
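A side note on the `# Transpose` line, offered as a possibility rather than a diagnosis: in Julia, `reshape` reinterprets the same column-major data under new dimensions and does not transpose, whereas `permutedims(M)` does. The sketch below uses only Base and the matrix `M` from the example to compare the two; the names `T_reshape` and `T_perm` are illustrative and not part of the original code.

```julia
# Same data as in the example above.
M = Matrix{Bool}([0 0 1; 1 1 0; 1 1 1; 0 1 1; 1 1 1; 0 0 0])

# What the example does: reinterpret the 6×3 data as 3×6 in column-major order.
T_reshape = reshape(M, (size(M)[[2, 1]]))

# An actual transpose: column j of the result is row j of M.
T_perm = permutedims(M)

println("same matrix? ", T_reshape == T_perm)
println("column 3 via reshape:     ", Int.(T_reshape[:, 3]))
println("column 3 via permutedims: ", Int.(T_perm[:, 3]))
println("row 3 of M:               ", Int.(M[3, :]))
```

If the two matrices differ, the columns passed to `kmeans` are not the rows of `M`, so indexing `M` with the resulting assignments in `extract` would pair points with clusters they were never assigned to. Column 5 of the reshaped matrix is (1, 0, 1), which may be where a center of (1, 0, 1) comes from.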
This is weird to me: cluster 2's center (1, 1, 1) is at distance zero from data point 3 (1, 1, 1), so that point should fall under cluster 2 instead. If we print the centers:
3×2 Array{Float64,2}:
 0.4  1.0
 1.0  0.0
 0.4  1.0
So k-means is giving cluster 2, which has only one point (1, 1, 1), a center of (1, 0, 1). Am I misinterpreting how the package behaves? Maybe the centers are not being updated in the correct order, which would explain this weird result?
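The distance reasoning above can be checked by hand. The sketch below uses a small helper, `mismatches`, which is hypothetical and not part of Clustering.jl or Distances.jl; it counts the coordinates where two vectors differ, which is how Hamming distance is usually defined. It assumes the two centers are the columns of the printed 3×2 matrix, i.e. (0.4, 1.0, 0.4) and (1.0, 0.0, 1.0).

```julia
# Hypothetical helper: number of coordinates in which a and b differ.
mismatches(a, b) = count(a .!= b)

point3  = [1, 1, 1]          # data point 3 from M
center1 = [0.4, 1.0, 0.4]    # assumed: first column of the printed centers
center2 = [1.0, 0.0, 1.0]    # assumed: second column of the printed centers

println(mismatches(point3, [1, 1, 1]))  # compared against an identical vector
println(mismatches(point3, center1))
println(mismatches(point3, center2))
```

A center with fractional entries such as 0.4 can never equal a binary coordinate, so under this counting every such entry contributes a mismatch. The fractional entries suggest the centers are computed as coordinate-wise means, though that is an inference from the printed output and not something confirmed here.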
Another issue is that k-means should guarantee at least k clusters, yet the package may end up with fewer than k clusters. Take the same example above, but change the seed to 999, and we'll get the following output:
Both of these issues occur on Hamming and Jaccard distances.
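To make the second issue easier to check, the number of non-empty clusters can be counted directly from an assignment vector. The sketch below uses Base only; `cluster_sizes` is a hypothetical helper, and the vector `A` is made-up illustrative data, not output from the example (with the package, `assignments(R)` and `nclusters(R)` would supply the real values).

```julia
# Hypothetical helper: how many points carry each label 1..k.
cluster_sizes(A::AbstractVector{<:Integer}, k::Integer) =
    [count(a -> a == c, A) for c in 1:k]

A = [1, 1, 1, 1, 1, 1]   # illustrative only: every point assigned to cluster 1
k = 2

sizes = cluster_sizes(A, k)
empty_clusters = findall(iszero, sizes)   # labels that received no points

println("cluster sizes:  ", sizes)
println("empty clusters: ", empty_clusters)
```

Running the same check on the results for both seeds, and for both Hamming and Jaccard, would show whether a cluster is truly empty or only appears so after the points are extracted.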