
making kmeans take AbstractMatrix instead of Matrix #138

Merged (13 commits) on Jan 29, 2019

Conversation

tlienart
Contributor

@tlienart tlienart commented Jan 25, 2019

See #136 and #129.

I duplicated the existing tests to take effectively the same matrix but now wrapped in a Transpose object.

Benchmarks on Julia 1.1 / libopenlibm, for what it's worth:

Before

julia> seed!(1234)
julia> X = randn(5000, 200);
julia> @btime kmeans($X, 2);
3.138 ms (35 allocations: 95.34 KiB)
julia> @btime kmeans(transpose($Xt), 2);          # ERRORS
julia> @btime kmeans(copy(transpose($Xt)), 2);
4.835 ms (37 allocations: 7.72 MiB)

After

julia> seed!(1234)
julia> X = randn(5000, 200);
julia> @btime kmeans($X, 2);
3.158 ms (35 allocations: 95.34 KiB)
julia> @btime kmeans(transpose($Xt), 2);     # WORKS
21.913 ms (54 allocations: 96.10 KiB)
julia> @btime kmeans(copy(transpose($Xt)), 2);
  4.782 ms (37 allocations: 7.72 MiB)

So in short, performance is identical; the only difference is that it now works if the user passes a transpose, which I think is desirable.
Of course, one can argue that there is a performance drop in that case, and it could be suggested that the user do copy(transpose(X)) if they can. My opinion is that while we should suggest it, we should not force it, and let the user decide whether that's a good idea for them or not (the copy could be problematic if the data is very large).
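To make the trade-off concrete, here is a minimal hypothetical sketch (not the package's actual internals; `colnorms` is an illustrative helper) of why relaxing a signature to AbstractMatrix lets the lazy Transpose wrapper through, at the cost of strided column access:

```julia
# Illustrative helper in the spirit of the kmeans internals: it only assumes
# AbstractMatrix, so a Matrix, a lazy Transpose, or a materialized copy all work.
colnorms(X::AbstractMatrix) = [sum(abs2, view(X, :, j)) for j in 1:size(X, 2)]

X = [1.0 2.0; 3.0 4.0]

colnorms(X)                   # [10.0, 20.0] -- contiguous columns
colnorms(transpose(X))        # [5.0, 25.0]  -- lazy wrapper, strided column access
colnorms(copy(transpose(X)))  # [5.0, 25.0]  -- materialized copy: fast columns, extra memory
```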

@alyst
Member

alyst commented Jan 25, 2019

Thanks! Besides Transpose there are other AbstractMatrix subtypes, which may work well without collect() (in particular, SubArray), so these changes would be quite useful.
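For instance (an illustrative sketch, not package code), a view produces a SubArray that is an AbstractMatrix but not a Matrix, so it would previously have been rejected by a ::Matrix signature:

```julia
X = randn(10, 100)
Xsub = view(X, :, 1:50)   # SubArray: selects 50 columns without copying

Xsub isa AbstractMatrix   # true  -- accepted by the relaxed signature
Xsub isa Matrix           # false -- would not match a ::Matrix restriction
size(Xsub)                # (10, 50)
```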

@tlienart
Contributor Author

I made all the changes requested by @nalimilan and some of the changes requested by @alyst.

Re @alyst: since kmeans! now effectively contains these lines:

assignments = Vector{Int}(undef, n)
costs = Vector{Float64}(undef, n)
counts = Vector{Int}(undef, k)
cweights = Vector{Float64}(undef, k)

I don't think it's useful to have subsequent helper functions take AbstractVector etc. I tried to make changes that I thought would help towards generality though and, where appropriate, tried to be explicit about the type if the code warrants it. I hope the changes are ok.

(on cosmetics, I hope you won't mind too much, I was just reviewing my changes and ended up doing a bit of cleaning up...)

Member

@alyst alyst left a comment

I agree that assignments/costs type relaxations are not relevant (actually, I'd rather pass KmeansResult to _kmeans!(), but that's out of scope of this PR).

Thanks for doing cleanups in the other places. Would you mind running benchmarks once again to make sure we don't regress?

src/kmeans.jl Outdated
-    _kmeans!(X, conv_weights(T, n, weights), centers,
-             assignments, costs, counts, cweights,
-             round(Int, maxiter), tol, display_level(display), distance)
+    _kmeans!(X, conv_weights(Float64, n, weights), centers, assignments,
Member

T was actually needed in this function, otherwise it can become type unstable.

Contributor Author

With apologies, I don't really understand this comment and, more importantly, I'm not sure what to do.
Reverting to the previous version doesn't seem right to me, as it forces X and centers to have the same type, which I don't think is correct.
I've changed the Float64 in conv_weights to eltype(X), is that ok?

Member

eltype(X) is fine. You could also have kept T, just not using it for centers.

However, please check that the code (this function and others down the line) is type stable when X and centers have different eltypes: it may have been written with the assumption that they are equal.

Contributor Author

With hindsight, I'm not happy with this conversion. If X is, say, integers and the weights are floats, the weights should stay floats. So it should be something like conv_weights(promote_type(eltype(X), Float64), n, weights).

Member

Good catch. But promote_type(eltype(X), Float64) will almost always return Float64. Better use float(eltype(X)). Or maybe better, use eltype(centers), which cannot be integers, so that the user can choose the most appropriate type.
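The difference is easy to check in the REPL (illustrative, not package code):

```julia
# promote_type with Float64 collapses nearly everything to Float64:
promote_type(Int, Float64)      # Float64
promote_type(Float32, Float64)  # Float64 -- the narrower Float32 is lost

# float() maps integer types to a float type but keeps float precision:
float(Int)      # Float64
float(Float32)  # Float32 -- the input's precision is preserved
```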

Member

T was actually needed in this function, otherwise it can become type unstable.

Are the relevant tests already there (@inferred typeof() == etc) or could they be added? Otherwise finding the right types is a little bit like wandering in the dark.

Contributor Author

@tlienart tlienart Jan 28, 2019

I don't think so, but maybe that could be for another PR, also related to #140, since such checks would be needed for more than just kmeans...

(PS: for this here, I've just reverted to ~ previous behaviour)

@tlienart
Contributor Author

tlienart commented Jan 28, 2019

Thanks for the detailed feedback. I think I applied all the changes requested, apart from two where I'm not sure I understood the comment and may have misinterpreted it.

Apologies for the bloated PR I wasn't expecting to have to touch so many parts of the code 🙈

Edit, a small benchmark shows that some of the changes make allocations where they shouldn't... 😞 I'll try to find where.

Edit 2: I found the culprit; it turns out the suggestion by @alyst to use centers[:, cj] .+= view... allocates while the previous version doesn't. I'll fix that and re-test.

mini benchmarks

before

julia> using BenchmarkTools, Random, Clustering; Random.seed!(1641); X = randn(500, 5000); w = abs.(randn(5000)); @btime kmeans($X, 3); @btime kmeans($X, 3, weights=w);
  126.245 ms (142 allocations: 1.62 MiB)
  213.494 ms (325297 allocations: 441.28 MiB)

after

julia> using BenchmarkTools, Random, Clustering; Random.seed!(1641); X = randn(500, 5000); w = abs.(randn(5000)); @btime kmeans($X, 3); @btime kmeans($X, 3, weights=w);
  137.222 ms (278 allocations: 1.63 MiB)
  95.003 ms (184 allocations: 1.13 MiB)

(I guess the massive improvement in the weighted case is a rather good thing.) There's a penalty for working with transposes, but at least it doesn't error anymore, as noted at the beginning:

julia> Xt = copy(transpose(X)); Xtt = Xt'; @btime kmeans($Xtt, 3);
  226.050 ms (218 allocations: 1.29 MiB)

Contributor Author

@tlienart tlienart left a comment

(ignore this, I failed at github)

@tlienart
Contributor Author

tlienart commented Jan 28, 2019

I've applied the changes in 16ae803.

I must admit there are some things I still don't like about the types (e.g. centers are updated with X*w, and X and centers have the same type while w could potentially be something else), but this PR is already too messy, so that would be for another issue.

Potentially this is why the type was previously restricted to AbstractFloat rather than Real: so that a user could not feed an array of Int. If that's judged important, I can revert all the <:Real to <:AbstractFloat (however, cmeans does the same thing).

src/kmeans.jl Outdated
if cweights[cj] > 0
centers[:, cj] .+= xj * wj
Member

I guess this * (plus the one below) was the reason for the regression in the weighted case, as it creates a new vector.
.+= xj .* wj should avoid that.
I'm a little bit hesitant about converting everything to explicit loops, as it defeats the purpose of having broadcast support in the language.

Contributor Author

@tlienart tlienart Jan 28, 2019

I'm happy to open another PR for this kind of thing (with tests & benchmarks) after this one gets merged, so that we can isolate changes related to the initial problem from other changes?

Member

+1. Then I'd rather not introduce explicit loop conversions in this PR to keep it as small as possible.

Contributor Author

@tlienart tlienart Jan 28, 2019

(FWIW, I just tried exactly what you said and it increases the allocation by a lot and doubles the time. There may be better things to do with copyto! etc but I must admit I'd rather get this PR done with first...)

Member

I thought centers[:, cj] .+= calls setindex!() directly, but it actually creates centers[:, cj] vector behind the scenes (centers[:, cj] .= doesn't). So the better version is cej = view(centers, :, cj); cej .+= xj .* wj.
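A small sketch of the difference (array names and the column index are illustrative): `C[:, 1] .+=` lowers so that the right-hand-side `C[:, 1]` is a regular getindex, which copies the column, while writing through a view reads and writes in place.

```julia
C = zeros(3, 2)
x = [1.0, 2.0, 3.0]
w = 2.0

# getindex form: the RHS C[:, 1] materializes a temporary column vector
C[:, 1] .+= x .* w

# view form: reads and writes through the same view, no temporary column
cj = view(C, :, 2)
cj .+= x .* w

C  # both columns now equal [2.0, 4.0, 6.0], but only the second avoided the copy
```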

@@ -311,7 +317,9 @@ function update_centers!(
     # sum ==> mean
     @inbounds for j = 1:k
         if to_update[j]
-            centers[:, j] .*= 1 / cweights[j]
+            for i = 1:d
+                centers[i, j] /= cweights[j]
Member

I'm not 100% sure, but back in the day / was more expensive than *.

Contributor Author

with /= --> 132.276ms; 76.272ms
with *= 1 / ... --> 132.628ms; 77.268ms

Member

Note that the current code computes 1/cweights[j] only once, and then does the multiplication for each i. IIUC the timings for *= 1 / recompute the division for each i, which isn't faster.

Member

Actually, I think LLVM is smart enough to calculate 1/wj only once (at least that's what I see in @code_llvm for a toy example). But I still prefer the broadcast version (cej = view(centers, :, j); cej .*= 1/cweights[j]), where this would be definitely done once. :)
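A sketch of that preferred broadcast form (names illustrative): the scalar 1/cweights[j] is evaluated once before the broadcast runs, so each column incurs a single division followed by in-place multiplications.

```julia
centers  = [2.0 9.0; 4.0 3.0]
cweights = [2.0, 3.0]

for j in axes(centers, 2)
    cej = view(centers, :, j)
    cej .*= 1 / cweights[j]   # one division per column, then in-place scaling
end

centers  # [1.0 3.0; 2.0 1.0]
```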

Member

I agree, the broadcasted version was nicer; better to use view. Same above, where view was used but you removed it: use xj .* wj and everything should be OK.

@tlienart
Contributor Author

Apart from the broadcasting questions, which are good for another PR (and possible refactoring of the code), is there anything else left to change before a merge?

    tol::Real,            # in: tolerance of change at convergence
    displevel::Int,       # in: the level of display
    distance::SemiMetric  # in: function to calculate the distance with
    ) where T<:Real
Member

The pre-PR code contained both indented and unindented examples of a stray closing bracket plus type-constraint declaration.
I like the unindented version more, as it helps separate the function arguments from its body. @nalimilan What is the recommended convention?

Suggested change
-    ) where T<:Real
+) where T<:Real

Contributor Author

@tlienart tlienart Jan 29, 2019

To be honest, it's not like the current code (now or before) is super consistent in terms of style... but can we work on this in another PR, potentially together with broadcasting and the type fix?
I'm considering rewriting the whole backbone of this kmeans.jl file so that it can use more than one algorithm, is clearer on the type-promotion front, uses broadcasting, and can take a Table instead of just matrices... style questions and better docstrings could go in there too.

Member

I don't think there's a convention about this. In general the Julia code base doesn't use that style, but instead puts the first argument on the same line as the function name.

@alyst
Member

alyst commented Jan 28, 2019

LGTM except the weights declaration in kmeans!().

@tlienart
Contributor Author

tlienart commented Jan 29, 2019

Ok so here we go: I did try (again) the broadcasted versions suggested, and it's worse in terms of allocations. I don't really want to figure out why, as that was not the purpose of this PR. (I believe views do allocate a little, but it's out of my depth, so I'm not going to make strong statements about this.)

Here is verbatim the code that was plugged in for the weighted case:

function update_centers!(
    X::AbstractMatrix{T},            # in: sample matrix (d x n)
    weights::AbstractVector{<:Real}, # in: sample weights (n)
    assignments::Vector{Int},        # in: assignments (n)
    to_update::Vector{Bool},         # in: whether a center needs update (k)
    centers::AbstractMatrix{T},      # out: updated centers (d x k)
    cweights::Vector{Float64}        # out: updated cluster weights (k)
    ) where T<:Real

    d, n = size(X)
    k = size(centers, 2)

    # initialize center weights
    cweights[to_update] .= 0.0

    # accumulate columns
    @inbounds for j = 1:n
        wj = weights[j]
        if wj > 0
            cj = assignments[j]
            1 <= cj <= k || error("assignment out of boundary.")
            cej = view(centers, :, cj);
            xj = view(X, :, j)
            if to_update[cj]
                if cweights[cj] > 0
                    cej .+= xj .* wj
                    # for i = 1:d
                    #     centers[i, cj] += X[i, j] * wj
                    # end
                else
                    cej .= xj .* wj
                    # for i = 1:d
                    #     centers[i, cj] = X[i, j] * wj
                    # end
                end
                cweights[cj] += wj
            end
        end
    end

    # sum ==> mean
    @inbounds for j = 1:k
        cej = view(centers, :, j)
        if to_update[j]
            cej .*= 1/cweights[j]
            # for i = 1:d
            #     centers[i, j] /= cweights[j]
            # end
        end
    end
end

Same benchmark as before:

using BenchmarkTools, Random, Clustering; Random.seed!(1641); X = randn(500, 5000); w = abs.(randn(5000)); @btime kmeans($X, 3); @btime kmeans($X, 3, weights=w);

  122.769 ms (285 allocations: 1.63 MiB)
  88.693 ms (190248 allocations: 9.83 MiB)

where the first result doesn't change because it uses the function for the unweighted case, where I kept the explicit loops. So it's about the same speed but allocates a whole lot more...

I personally think that if you're not unhappy with the PR as it stands in terms of solving issue #136, then it's probably best to merge it (or tell me what last changes to make) and look into broadcasting and style harmonisation in another PR?

@nalimilan
Member

Fine with me, but then please revert all broadcasting changes which aren't related to the main purpose of this PR. It's never a good idea to mix different issues in a single PR (as we can see).

@tlienart
Contributor Author

tlienart commented Jan 29, 2019

Fine with me, but then please revert all broadcasting changes which aren't related to the main purpose of this PR. It's never a good idea to mix different issues in a single PR (as we can see).

I don't think there are any non-trivial changes of that kind left?

Edit: the only non-trivial change is in the weighted case, where the exact same style as in the unweighted case is applied (i.e. the previous code), so I imagine it's ok? Also, given that the old code allocates like crazy in the weighted case, surely it's not desirable to revert these changes?

@nalimilan
Member

I mean these, but I don't really care.

@tlienart
Contributor Author

Ok thanks a lot for the detailed feedback. I'll be happy to try to contribute more to this repo now that I feel I have a bit better understanding of what's going on!

@alyst alyst merged commit eaae8dd into JuliaStats:master Jan 29, 2019
@nalimilan
Member

Thanks @tlienart!
