making kmeans take AbstractMatrix instead of Matrix #138
Thanks! Besides …

I made all the changes requested by @nalimilan and some of the changes requested by @alyst.

Re @alyst: since internally we have

```julia
assignments = Vector{Int}(undef, n)
costs = Vector{Float64}(undef, n)
counts = Vector{Int}(undef, k)
cweights = Vector{Float64}(undef, k)
```

I don't think it's useful to have subsequent helper functions take …

(on cosmetics, I hope you won't mind too much, I was just reviewing my changes and ended up doing a bit of cleaning up...)
I agree that `assignments`/`costs` type relaxations are not relevant (actually, I'd rather pass `KmeansResult` to `_kmeans!()`, but that's out of scope of this PR).
Thanks for doing cleanups in the other places. Would you mind running benchmarks once again to make sure we don't regress?
src/kmeans.jl (outdated)

```diff
-    _kmeans!(X, conv_weights(T, n, weights), centers,
-             assignments, costs, counts, cweights,
-             round(Int, maxiter), tol, display_level(display), distance)
+    _kmeans!(X, conv_weights(Float64, n, weights), centers, assignments,
```
`T` was actually needed in this function, otherwise it can become type unstable.
With apologies, I don't really understand this comment and, more importantly, I'm not sure what to do. Reverting to the previous version doesn't seem right to me, as it imposes that `X` and `centers` have the same type, which I don't think is correct. I've changed the `Float64` in the `conv_weights` call to `eltype(X)`; is that ok?
`eltype(X)` is fine. You could also have kept `T`, just not using it for `centers`. However, please check that the code (this function and others down the line) is type stable when `X` and `centers` have different eltypes: it may have been written with the assumption that they are equal.
With hindsight, I'm not happy with this conversion. If `X` contains, say, integers and the weights are floats, the weights should stay floats. So it should be something like `conv_weights(promote_type(eltype(X), Float64), n, weights)`.
Good catch. But `promote_type(eltype(X), Float64)` will almost always return `Float64`. Better use `float(eltype(X))`. Or maybe better, use `eltype(centers)`, which cannot be integers, so that the user can choose the most appropriate type.
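For illustration, a quick REPL sketch of the difference (nothing project-specific here, just base Julia behaviour): `promote_type` widens anything mixed with `Float64` up to `Float64`, while `float` preserves a narrower floating-point eltype:

```julia
julia> promote_type(Float32, Float64)  # Float32 data would get widened
Float64

julia> float(Float32)                  # float() keeps an existing float eltype
Float32

julia> float(Int)                      # and maps integer eltypes to a float type
Float64
```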
> `T` was actually needed in this function, otherwise it can become type unstable.

Are the relevant tests already there (`@inferred`, `typeof() ==`, etc.) or could they be added? Otherwise finding the right types is a little bit like wandering in the dark.
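As a sketch of what such a test might look like (the input sizes are made up, and whether this currently passes is exactly the open question), `@inferred` from the `Test` stdlib errors if the compiler cannot infer a concrete return type:

```julia
using Test, Clustering

X32 = rand(Float32, 10, 50)   # non-Float64 input, 10 features x 50 points

# fails loudly if kmeans is type unstable for this eltype
r = @inferred kmeans(X32, 3)
@test eltype(r.centers) <: AbstractFloat
```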
I don't think so, but maybe that could be for another PR, also related to #140, so it's not just for `kmeans` that such checks would be needed...

(PS: for this here, I've just reverted to ~ previous behaviour)
Co-Authored-By: tlienart <[email protected]>
Thanks for the detailed feedback. I think I applied all the changes requested apart from 2, where I'm not sure I understood the comment and may not have interpreted it correctly. Apologies for the bloated PR, I wasn't expecting to have to touch so many parts of the code 🙈

Edit: …

Edit 2: I found the culprit, turns out the suggestion by @alyst to use …

Mini benchmarks, before:

```julia
julia> using BenchmarkTools, Random, Clustering; Random.seed!(1641); X = randn(500, 5000); w = abs.(randn(5000)); @btime kmeans($X, 3); @btime kmeans($X, 3, weights=w);
[ Info: Recompiling stale cache file /Users/tlienart/.julia/compiled/v1.1/Clustering/Ovl2Q.ji for Clustering [aaaa29a8-35af-508c-8bc3-b662a17a0fe5]
  126.245 ms (142 allocations: 1.62 MiB)
  213.494 ms (325297 allocations: 441.28 MiB)
```

and after:

```julia
julia> using BenchmarkTools, Random, Clustering; Random.seed!(1641); X = randn(500, 5000); w = abs.(randn(5000)); @btime kmeans($X, 3); @btime kmeans($X, 3, weights=w);
  137.222 ms (278 allocations: 1.63 MiB)
  95.003 ms (184 allocations: 1.13 MiB)
```

(I guess the massive improvement in the weighted case is rather a good thing.) And there's the penalty for working with transposes, but at least it doesn't error anymore, as noted at the beginning:

```julia
julia> Xt = copy(transpose(X)); Xtt = Xt'; @btime kmeans($Xtt, 3);
  226.050 ms (218 allocations: 1.29 MiB)
```
…at was in place before but I believe more consistent. Benchmark results identical.
(ignore this, I failed at github)
I've applied the changes in 16ae803. I must admit there are some things that I still don't like with the types (e.g. centers are updated with `X*w`, but `X` and `c` have the same type while `w` could potentially be something else), but this PR is already too messy, so that would be for another issue. Potentially this is the reason why previously the type was restricted to …
src/kmeans.jl (outdated)

```julia
if cweights[cj] > 0
    centers[:, cj] .+= xj * wj
```
I guess this `*` (plus the one below) was the reason for the regression in the weighted case, as it creates a new vector. `.+= xj .* wj` should avoid that. I'm a little bit hesitant about converting everything to explicit loops, as it defeats the purpose of having broadcast support in the language.
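A minimal sketch of the allocation difference being described, with made-up toy sizes (`xj * wj` materializes a temporary vector before the addition, while the fused `xj .* wj` is computed inside the same broadcast kernel):

```julia
using BenchmarkTools

cej = zeros(500); xj = randn(500); wj = 2.0

@btime $cej .+= $xj * $wj;   # one temporary vector allocated per call
@btime $cej .+= $xj .* $wj;  # fully fused: zero allocations
```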
I'm happy to open another PR for this kind of stuff (with tests & benchmarks) after this one gets merged, so that we can isolate changes related to the initial problem from other changes?
+1. Then I'd rather not introduce explicit loop conversions in this PR to keep it as small as possible.
(FWIW, I just tried exactly what you said and it increases the allocations by a lot and doubles the time. There may be better things to do with `copyto!` etc., but I must admit I'd rather get this PR done with first...)
I thought `centers[:, cj] .+=` calls `setindex!()` directly, but it actually creates the `centers[:, cj]` vector behind the scenes (`centers[:, cj] .=` doesn't). So the better version is `cej = view(centers, :, cj); cej .+= xj .* wj`.
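For reference, a minimal sketch of that idiom with made-up sizes; the `@views` macro is an equivalent spelling that rewrites the slice into a view:

```julia
centers = zeros(500, 3); xj = randn(500); wj = 2.0; cj = 1

# update the column in place through a view, without copying it
cej = view(centers, :, cj)
cej .+= xj .* wj

# equivalent one-liner using @views
@views centers[:, cj] .+= xj .* wj
```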
```diff
@@ -311,7 +317,9 @@ function update_centers!(
     # sum ==> mean
     @inbounds for j = 1:k
         if to_update[j]
-            centers[:, j] .*= 1 / cweights[j]
+            for i = 1:d
+                centers[i, j] /= cweights[j]
```
I'm not 100% sure, but back in the day `/` was more expensive than `*`.
with `/=` → 132.276 ms; 76.272 ms
with `*= 1 / ...` → 132.628 ms; 77.268 ms
Note that the current code computes `1/cweights[j]` only once, and then does the multiplication for each `i`. IIUC the timings for `*= 1 /` recompute the division for each `i`, which isn't faster.
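A sketch of the hoisted pattern being described (`normalize_column!` is a hypothetical helper name, not code from this PR):

```julia
# divide column j of `centers` by its cluster weight, computing
# the reciprocal once instead of dividing on every iteration
function normalize_column!(centers::AbstractMatrix, cweights::AbstractVector, j::Int)
    invw = 1 / cweights[j]
    for i in axes(centers, 1)
        centers[i, j] *= invw
    end
    return centers
end
```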
Actually, I think LLVM is smart enough to calculate `1/wj` only once (at least that's what I see in `@code_llvm` for a toy example). But I still prefer the broadcast version (`cej = view(centers, :, j); cej .*= 1/cweights[j]`), where this would definitely be done once. :)
I agree, the broadcasted version was nicer. Better use `view`. Same above, where `view` was used but you removed it: use `xj .* wj` and everything should be OK.
Apart from the broadcasting questions, which are good for another PR (and possible refactoring of the code), is there anything else left to change before a merge?
```julia
    tol::Real,            # in: tolerance of change at convergence
    displevel::Int,       # in: the level of display
    distance::SemiMetric  # in: function to calculate the distance with
    ) where T<:Real
```
The pre-PR code contained both indented and unindented examples of the stray closing bracket + type constraints declaration. I like the unindented version more, as it helps to separate the function arguments from its body. @nalimilan What is the recommended convention?
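For illustration, the two styles in question on a hypothetical signature (neither function exists in the package):

```julia
# unindented closing bracket: visually separates the arguments from the body
function weighted_colsum(
    X::AbstractMatrix{T},
    w::AbstractVector{T}
) where T<:Real
    return X * w
end

# indented variant, also present in the pre-PR code
function weighted_colsum2(
    X::AbstractMatrix{T},
    w::AbstractVector{T}
    ) where T<:Real
    return X * w
end
```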
To be honest, it's not really like the current code (now and before) is super consistent in terms of style... but can we work on this in another PR, potentially with the broadcasting and type fixes?

I'm considering rewriting the whole backbone of this `kmeans.jl` file so that it can use more than one algorithm, is clearer on the type promotion front, uses broadcasting, and can take a `Table` instead of just matrices... style questions and better docstrings could go in there too.
I don't think there's a convention about this. In general the Julia code base doesn't use that style, but instead puts the first argument on the same line as the function name.
LGTM except …
Co-Authored-By: tlienart <[email protected]>
Ok so here we go, I did try (again) the broadcasted versions suggested and it's worse in terms of allocations. I don't really want to figure out what's going on, as it was not the purpose of this PR. (I believe views do allocate a little bit, but it's out of my depth so I'm not going to make strong statements about this.) Here is, verbatim, the code that was plugged in for the weighted case:

```julia
function update_centers!(
    X::AbstractMatrix{T},            # in: sample matrix (d x n)
    weights::AbstractVector{<:Real}, # in: sample weights (n)
    assignments::Vector{Int},        # in: assignments (n)
    to_update::Vector{Bool},         # in: whether a center needs update (k)
    centers::AbstractMatrix{T},      # out: updated centers (d x k)
    cweights::Vector{Float64}        # out: updated cluster weights (k)
    ) where T<:Real

    d, n = size(X)
    k = size(centers, 2)

    # initialize center weights
    cweights[to_update] .= 0.0

    # accumulate columns
    @inbounds for j = 1:n
        wj = weights[j]
        if wj > 0
            cj = assignments[j]
            1 <= cj <= k || error("assignment out of boundary.")
            cej = view(centers, :, cj)
            xj  = view(X, :, j)
            if to_update[cj]
                if cweights[cj] > 0
                    cej .+= xj .* wj
                    # for i = 1:d
                    #     centers[i, cj] += X[i, j] * wj
                    # end
                else
                    cej .= xj .* wj
                    # for i = 1:d
                    #     centers[i, cj] = X[i, j] * wj
                    # end
                end
                cweights[cj] += wj
            end
        end
    end

    # sum ==> mean
    @inbounds for j = 1:k
        cej = view(centers, :, j)
        if to_update[j]
            cej .*= 1/cweights[j]
            # for i = 1:d
            #     centers[i, j] /= cweights[j]
            # end
        end
    end
end
```
end same benchmark .... using BenchmarkTools, Random, Clustering; Random.seed!(1641); X = randn(500, 5000); w = abs.(randn(5000)); @btime kmeans($X, 3); @btime kmeans($X, 3, weights=w);
122.769 ms (285 allocations: 1.63 MiB)
88.693 ms (190248 allocations: 9.83 MiB) where the first result doesn't change because it uses the function for the unweighted case where I left the loops unrolled. So it's about the same speed but allocates a whole lot more... I personally think that if you're not unhappy with the PR as it stands in terms of solving issue #136 then it's probably best to merge it (or tell me what last changes to make) and look into broadcasting and style harmonisation in another PR? |
Fine with me, but then please revert all broadcasting changes which aren't related to the main purpose of this PR. It's never a good idea to mix different issues in a single PR (as we can see).
I don't think there are any non-trivial changes in that regard left?

Edit: the only non-trivial change is in the weighted case, where the exact same style as for the unweighted case is applied (i.e. the previous code), so I imagine it's ok? Also, given that the old code allocates like crazy for the weighted case, surely it's not desirable to revert these changes?
I mean these, but I don't really care.
Ok, thanks a lot for the detailed feedback. I'll be happy to try to contribute more to this repo now that I feel I have a bit better understanding of what's going on!
Thanks @tlienart!
See #136 and #129.

I duplicated the existing tests to take effectively the same matrix, but now wrapped in a `Transpose` object.

Benchmarks on 1.1 / libopenlibm, for what it's worth:

Before: …
After: …

So in short, the performances are identical; the only difference is that now it works if the user passes a transpose, which I think is desirable. Of course we can argue that there is a performance drop in that case, and it could be suggested that the user do `copy(transpose(X))` if they can. My opinion is that while we should suggest it, we should not force it, and let the user decide whether that's a good idea for them or not (the copy could be problematic if the data is very large).
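For concreteness, a small sketch of the two options the description mentions, with made-up dimensions (passing the lazy `Transpose` directly versus materializing it first):

```julia
using Clustering, Random

Random.seed!(1641)
X  = randn(5000, 500)   # hypothetical data with observations in rows
Xt = transpose(X)       # lazy 500 x 5000 wrapper (d x n), no copy made

# works after this PR since kmeans accepts any AbstractMatrix,
# at the cost of strided (non-contiguous) column access
r1 = kmeans(Xt, 3)

# materializing restores contiguous columns, trading memory for speed
r2 = kmeans(copy(Xt), 3)
```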