
making kmeans take AbstractMatrix instead of Matrix #138

Merged (13 commits) on Jan 29, 2019

Conversation

tlienart
Contributor

@tlienart tlienart commented Jan 25, 2019

See #136 and #129.

I duplicated the existing tests to take effectively the same matrix but now wrapped in a Transpose object.

Benchmarks on Julia 1.1 / libopenlibm, for what it's worth:

Before

julia> seed!(1234)
julia> X = randn(5000, 200);
julia> @btime kmeans($X, 2);
3.138 ms (35 allocations: 95.34 KiB)
julia> @btime kmeans(transpose($Xt), 2);          # ERRORS
julia> @btime kmeans(copy(transpose($Xt)), 2);
4.835 ms (37 allocations: 7.72 MiB)

After

julia> seed!(1234)
julia> X = randn(5000, 200);
julia> @btime kmeans($X, 2);
3.158 ms (35 allocations: 95.34 KiB)
julia> @btime kmeans(transpose($Xt), 2);     # WORKS
21.913 ms (54 allocations: 96.10 KiB)
julia> @btime kmeans(copy(transpose($Xt)), 2);
  4.782 ms (37 allocations: 7.72 MiB)

So in short, performance is identical; the only difference is that it now works if the user passes a transpose, which I think is desirable.
Of course, one can argue that there is a performance drop in that case, and it could be suggested that the user do copy(transpose(X)) if they can. My opinion is that while we should suggest it, we should not force it, and let the user decide whether that's a good idea for them or not (the copy could be problematic if the data is very large).
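To make the trade-off concrete, here is a minimal hypothetical sketch (not the package's actual internals; `colnorms` is an illustrative helper) of why relaxing a signature to AbstractMatrix lets the lazy Transpose wrapper through, at the cost of strided column access:

```julia
# Illustrative helper in the spirit of the kmeans internals: it only assumes
# AbstractMatrix, so a Matrix, a lazy Transpose, or a materialized copy all work.
colnorms(X::AbstractMatrix) = [sum(abs2, view(X, :, j)) for j in 1:size(X, 2)]

X = [1.0 2.0; 3.0 4.0]

colnorms(X)                   # [10.0, 20.0] -- contiguous columns
colnorms(transpose(X))        # [5.0, 25.0]  -- lazy wrapper, strided column access
colnorms(copy(transpose(X)))  # [5.0, 25.0]  -- materialized copy: fast columns, extra memory
```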

@alyst
Member

alyst commented Jan 25, 2019

Thanks! Besides Transpose there are other AbstractMatrix subtypes, which may work well without collect() (in particular, SubArray), so these changes would be quite useful.
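For instance (an illustrative sketch, not package code), a view produces a SubArray that is an AbstractMatrix but not a Matrix, so it would previously have been rejected by a ::Matrix signature:

```julia
X = randn(10, 100)
Xsub = view(X, :, 1:50)   # SubArray: selects 50 columns without copying

Xsub isa AbstractMatrix   # true  -- accepted by the relaxed signature
Xsub isa Matrix           # false -- would not match a ::Matrix restriction
size(Xsub)                # (10, 50)
```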

@tlienart
Contributor Author

I made all the changes requested by @nalimilan and some of the changes requested by @alyst.

Re @alyst: since kmeans! now effectively contains these lines:

assignments = Vector{Int}(undef, n)
costs = Vector{Float64}(undef, n)
counts = Vector{Int}(undef, k)
cweights = Vector{Float64}(undef, k)

I don't think it's useful to have subsequent helper functions take AbstractVector etc. I tried to make changes that I thought would help towards generality though and, where appropriate, tried to be explicit about the type if the code warrants it. I hope the changes are ok.

(on cosmetics, I hope you won't mind too much, I was just reviewing my changes and ended up doing a bit of cleaning up...)

Member

@alyst alyst left a comment

I agree that assignments/costs type relaxations are not relevant (actually, I'd rather pass KmeansResult to _kmeans!(), but that's out of scope of this PR).

Thanks for doing cleanups in the other places. Would you mind running benchmarks once again to make sure we don't regress?

src/kmeans.jl Outdated
-    _kmeans!(X, conv_weights(T, n, weights), centers,
-             assignments, costs, counts, cweights,
-             round(Int, maxiter), tol, display_level(display), distance)
+    _kmeans!(X, conv_weights(Float64, n, weights), centers, assignments,
Member

T was actually needed in this function, otherwise it can become type unstable.

Contributor Author

With apologies, I don't really understand this comment and, more importantly, I'm not sure what to do.
Reverting to the previous version doesn't seem right to me, as it forces X and centers to have the same type, which I don't think is correct.
I've changed the Float64 in conv_weights to eltype(X), is that ok?

Member

eltype(X) is fine. You could also have kept T, just not using it for centers.

However, please check that the code (this function and others down the line) is type stable when X and centers have different eltypes: it may have been written with the assumption that they are equal.

Contributor Author

With hindsight, I'm not happy with this conversion. If X is, say, integers and the weights are floats, the weights should stay floats. So it should be something like conv_weights(promote_type(eltype(X), Float64), n, weights).

Member

Good catch. But promote_type(eltype(X), Float64) will almost always return Float64. Better use float(eltype(X)). Or maybe better, use eltype(centers), which cannot be integers, so that the user can choose the most appropriate type.
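The difference is easy to check in the REPL (illustrative, not package code):

```julia
# promote_type with Float64 collapses nearly everything to Float64:
promote_type(Int, Float64)      # Float64
promote_type(Float32, Float64)  # Float64 -- the narrower Float32 is lost

# float() maps integer types to a float type but keeps float precision:
float(Int)      # Float64
float(Float32)  # Float32 -- the input's precision is preserved
```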

Member

T was actually needed in this function, otherwise it can become type unstable.

Are the relevant tests already there (@inferred typeof() == etc) or could they be added? Otherwise finding the right types is a little bit like wandering in the dark.

Contributor Author

@tlienart tlienart Jan 28, 2019

I don't think so, but maybe that could be for another PR, also related to #140, since such checks would be needed for more than just kmeans...

(PS: for this here, I've just reverted to ~ previous behaviour)

@tlienart
Contributor Author

tlienart commented Jan 28, 2019

Thanks for the detailed feedback. I think I applied all the changes requested, apart from two where I'm not sure I understood the comment and may have misinterpreted it.

Apologies for the bloated PR I wasn't expecting to have to touch so many parts of the code 🙈

Edit, a small benchmark shows that some of the changes make allocations where they shouldn't... 😞 I'll try to find where.

Edit 2: I found the culprit; it turns out the suggestion by @alyst to use centers[:, cj] .+= view... allocates while the previous version doesn't. I'll fix that and re-test.

mini benchmarks

before

julia> using BenchmarkTools, Random, Clustering; Random.seed!(1641); X = randn(500, 5000); w = abs.(randn(5000)); @btime kmeans($X, 3); @btime kmeans($X, 3, weights=w);
  126.245 ms (142 allocations: 1.62 MiB)
  213.494 ms (325297 allocations: 441.28 MiB)

after

julia> using BenchmarkTools, Random, Clustering; Random.seed!(1641); X = randn(500, 5000); w = abs.(randn(5000)); @btime kmeans($X, 3); @btime kmeans($X, 3, weights=w);
  137.222 ms (278 allocations: 1.63 MiB)
  95.003 ms (184 allocations: 1.13 MiB)

(I guess the massive improvement in the weighted case is a rather good thing.) There's a penalty for working with transposes, but at least it doesn't error anymore, as noted at the beginning:

julia> Xt = copy(transpose(X)); Xtt = Xt'; @btime kmeans($Xtt, 3);
  226.050 ms (218 allocations: 1.29 MiB)

Contributor Author

@tlienart tlienart left a comment

(ignore this, I failed at github)

@tlienart
Contributor Author

tlienart commented Jan 28, 2019

I've applied the changes in 16ae803.

I must admit there are some things I still don't like about the types (e.g. centers are updated with X*w, and X and centers have the same type while w could potentially be something else), but this PR is already too messy, so that would be for another issue.

Potentially this is why the type was previously restricted to AbstractFloat rather than Real: so that a user could not feed an array of Int. If that's judged important, I can revert all the <:Real to <:AbstractFloat (however, cmeans does the same thing).

src/kmeans.jl Outdated
if cweights[cj] > 0
centers[:, cj] .+= xj * wj
Member

I guess this * (plus the one below) was the reason for the regression in the weighted case, as it creates a new vector.
.+= xj .* wj should avoid that.
I'm a little bit hesitant about converting everything to explicit loops, as it defeats the purpose of having broadcast support in the language.

Contributor Author

@tlienart tlienart Jan 28, 2019

I'm happy to open another PR for this kind of thing (with tests & benchmarks) after this one gets merged, so that we can isolate changes related to the initial problem from other changes?

Member

+1. Then I'd rather not introduce explicit loop conversions in this PR to keep it as small as possible.

Contributor Author

@tlienart tlienart Jan 28, 2019

(FWIW, I just tried exactly what you said and it increases the allocation by a lot and doubles the time. There may be better things to do with copyto! etc but I must admit I'd rather get this PR done with first...)

Member

I thought centers[:, cj] .+= calls setindex!() directly, but it actually creates centers[:, cj] vector behind the scenes (centers[:, cj] .= doesn't). So the better version is cej = view(centers, :, cj); cej .+= xj .* wj.
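A small sketch of the difference (array names and the column index are illustrative): `C[:, 1] .+=` lowers so that the right-hand-side `C[:, 1]` is a regular getindex, which copies the column, while writing through a view reads and writes in place.

```julia
C = zeros(3, 2)
x = [1.0, 2.0, 3.0]
w = 2.0

# getindex form: the RHS C[:, 1] materializes a temporary column vector
C[:, 1] .+= x .* w

# view form: reads and writes through the same view, no temporary column
cj = view(C, :, 2)
cj .+= x .* w

C  # both columns now equal [2.0, 4.0, 6.0], but only the second avoided the copy
```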

@@ -311,7 +317,9 @@ function update_centers!(
     # sum ==> mean
     @inbounds for j = 1:k
         if to_update[j]
-            centers[:, j] .*= 1 / cweights[j]
+            for i = 1:d
+                centers[i, j] /= cweights[j]
Member

I'm not 100% sure, but back in the day / was more expensive than *.

Contributor Author

with /= --> 132.276ms; 76.272ms
with *= 1 / ... --> 132.628ms; 77.268ms

Member

Note that the current code computes 1/cweights[j] only once, and then does the multiplication for each i. IIUC the timings for *= 1 / recompute the division for each i, which isn't faster.

Member

Actually, I think LLVM is smart enough to calculate 1/wj only once (at least that's what I see in @code_llvm for a toy example). But I still prefer the broadcast version (cej = view(centers, :, j); cej .*= 1/cweights[j]), where this would be definitely done once. :)
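A sketch of that preferred broadcast form (names illustrative): the scalar 1/cweights[j] is evaluated once before the broadcast runs, so each column incurs a single division followed by in-place multiplications.

```julia
centers  = [2.0 9.0; 4.0 3.0]
cweights = [2.0, 3.0]

for j in axes(centers, 2)
    cej = view(centers, :, j)
    cej .*= 1 / cweights[j]   # one division per column, then in-place scaling
end

centers  # [1.0 3.0; 2.0 1.0]
```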

Member

I agree, the broadcasted version was nicer; better to use view. Same above, where view was used but you removed it: use xj .* wj and everything should be OK.

@tlienart
Contributor Author

Apart from the broadcasting questions, which are good for another PR (and possible refactoring of the code), is there anything else left to change before a merge?

    tol::Real,            # in: tolerance of change at convergence
    displevel::Int,       # in: the level of display
    distance::SemiMetric  # in: function to calculate the distance with
    ) where T<:Real
Member

The pre-PR code contained both indented and unindented examples of a stray closing bracket plus type-constraint declaration.
I like the unindented version more, as it helps separate the function arguments from its body. @nalimilan What is the recommended convention?

Suggested change
-    ) where T<:Real
+) where T<:Real

Contributor Author

@tlienart tlienart Jan 29, 2019

To be honest, it's not like the current code (now or before) is super consistent in terms of style... but can we work on this in another PR, potentially together with broadcasting and the type fix?
I'm considering rewriting the whole backbone of this kmeans.jl file so that it can use more than one algorithm, is clearer on the type-promotion front, uses broadcasting, and can take a Table instead of just matrices... style questions and better docstrings could go in there too.

Member

I don't think there's a convention about this. In general the Julia code base doesn't use that style, but instead puts the first argument on the same line as the function name.

@alyst
Member

alyst commented Jan 28, 2019

LGTM except the weights declaration in kmeans!().

@tlienart
Contributor Author

tlienart commented Jan 29, 2019

Ok so here we go: I did try (again) the broadcasted versions suggested, and it's worse in terms of allocations. I don't really want to figure out why, as that was not the purpose of this PR. (I believe views do allocate a little, but it's out of my depth, so I'm not going to make strong statements about this.)

Here is verbatim the code that was plugged in for the weighted case:

function update_centers!(
    X::AbstractMatrix{T},            # in: sample matrix (d x n)
    weights::AbstractVector{<:Real}, # in: sample weights (n)
    assignments::Vector{Int},        # in: assignments (n)
    to_update::Vector{Bool},         # in: whether a center needs update (k)
    centers::AbstractMatrix{T},      # out: updated centers (d x k)
    cweights::Vector{Float64}        # out: updated cluster weights (k)
    ) where T<:Real

    d, n = size(X)
    k = size(centers, 2)

    # initialize center weights
    cweights[to_update] .= 0.0

    # accumulate columns
    @inbounds for j = 1:n
        wj = weights[j]
        if wj > 0
            cj = assignments[j]
            1 <= cj <= k || error("assignment out of boundary.")
            cej = view(centers, :, cj);
            xj = view(X, :, j)
            if to_update[cj]
                if cweights[cj] > 0
                    cej .+= xj .* wj
                    # for i = 1:d
                    #     centers[i, cj] += X[i, j] * wj
                    # end
                else
                    cej .= xj .* wj
                    # for i = 1:d
                    #     centers[i, cj] = X[i, j] * wj
                    # end
                end
                cweights[cj] += wj
            end
        end
    end

    # sum ==> mean
    @inbounds for j = 1:k
        cej = view(centers, :, j)
        if to_update[j]
            cej .*= 1/cweights[j]
            # for i = 1:d
            #     centers[i, j] /= cweights[j]
            # end
        end
    end
end

Same benchmark as before:

using BenchmarkTools, Random, Clustering; Random.seed!(1641); X = randn(500, 5000); w = abs.(randn(5000)); @btime kmeans($X, 3); @btime kmeans($X, 3, weights=w);

  122.769 ms (285 allocations: 1.63 MiB)
  88.693 ms (190248 allocations: 9.83 MiB)

where the first result doesn't change because it uses the function for the unweighted case, where I kept the explicit loops. So it's about the same speed but allocates a whole lot more...

I personally think that if you're not unhappy with the PR as it stands in terms of solving issue #136, then it's probably best to merge it (or tell me what last changes to make) and look into broadcasting and style harmonisation in another PR?

@nalimilan
Member

Fine with me, but then please revert all broadcasting changes which aren't related to the main purpose of this PR. It's never a good idea to mix different issues in a single PR (as we can see).

@tlienart
Contributor Author

tlienart commented Jan 29, 2019

Fine with me, but then please revert all broadcasting changes which aren't related to the main purpose of this PR. It's never a good idea to mix different issues in a single PR (as we can see).

I don't think there are any non-trivial changes of that kind left?

Edit: the only non-trivial change is in the weighted case, where the exact same style as in the unweighted case is applied (i.e. the previous code), so I imagine it's ok? Also, given that the old code allocates like crazy in the weighted case, surely it's not desirable to revert these changes?

@nalimilan
Member

I mean these, but I don't really care.

@tlienart
Contributor Author

Ok thanks a lot for the detailed feedback. I'll be happy to try to contribute more to this repo now that I feel I have a bit better understanding of what's going on!

@alyst alyst merged commit eaae8dd into JuliaStats:master Jan 29, 2019
@nalimilan
Member

Thanks @tlienart!
