Potential type issues with `fuzzy_cmeans` #140

tlienart · 2019-01-28T10:04:36Z

Following explorations in #138 I had a look at the code for fuzzy_cmeans and tried:

julia> Random.seed!(51)
julia> X = rand(Int16, 100, 5);
julia> fuzzy_cmeans(X, 3, 1.2)
ERROR: InexactError: Int16(-9526.902803360517)

I believe this may stem from the fact that the algorithm accepts any subtype of Real and therefore <:Integer but with integers there may be type issues with how centers are assigned and updated.

Two suggestions

restrict the input type to AbstractFloat everywhere in clustering, I believe that solves the problem but maybe that we would like clustering algorithms to work on ints...
cast X to float if it isn't which, I believe, would be the expected behaviour from a maths perspective

The text was updated successfully, but these errors were encountered:

nalimilan · 2019-01-29T16:31:09Z

Solution 1. isn't very interesting since it fixes the error by throwing another error. Better choose the appropriate element type using float(eltype(X)) or something similar.

tlienart · 2019-01-29T22:09:04Z

I agree. The disadvantage being that it requires converting the matrix to a floating point one, so effectively copying the data.

With respect to float(eltype(X)), one thing I noted is that float(Int) for Int16, Int32, ... always gives Float64. That's not super intuitive to me. Wouldn't it be preferrable to use the a Float32 if we can to speed up computations?

alyst · 2019-01-29T23:36:03Z

The disadvantage being that it requires converting the matrix to a floating point one, so effectively copying the data.

Why so? The centers are calculated in update_centers!(), and from that code it doesn't look like you need to do any explicit conversion of data matrix or its elements.

With respect to float(eltype(X)), one thing I noted is that float(Int) for Int16, Int32, ... always gives Float64. That's not super intuitive to me. Wouldn't it be preferrable to use the a Float32 if we can to speed up computations?

float(<:Integer) returns the most reasonable type, which is the one that is most precise.
Note that for Int16 the maximal number is 32767, but maxintfloat(Float16) is 2048.0; not every larger integer could be precisely represented as Float16.
But maybe it would make sense -- similar to kmeans!() -- to have fuzzy_cmeans!(data, centers) method, which would allow specifying the centers eltype.
And speaking of kmeans!(): actually after #138 it's still not possible that centers have eltype different from X.

tlienart · 2019-01-30T00:10:23Z

In kmeans, these lines:

centers[i, cj] += X[i, j]

and

centers[i ,cj] += X[i, j] * wj

are the ones causing issues if the type of centers, X and w are not set carefully. At the moment (previous behaviour prior to #138 as well), if X is T then necessarily centers is T too but w can be anything which can cause problems. And I agree that centers should be allowed to be provided with its own type separate from X but then more type casting error possibilities arise.

Allegedly you don't necessarily need to fully construct Xf = convert(Matrix{FloatXX}, X) but then you need to make sure these computations are done correctly and without loss. Casting individual computations without doing the full thing is IMO a path to problems.

What I would suggest:

write specialised functions that act on X::AbstractMatrix{<:Union{Integer, Rational, AbstractIrrational}}, compute a conversion to the nearest floating type that allows least precision loss (Float64 for Int64, Rational, Irrational, possibly Float32 for Int16 and Int32,, possibly Float16 for Bool)
write specialised functions that act on X::AbstractFloat (=the current one but reverting to AbstractFloat instead of Real) and use that after the conversion done in (1).

PS:

float(Int16) # Float64
float(Bool) # Float64

nalimilan · 2019-01-30T10:05:14Z

But maybe it would make sense -- similar to kmeans!() -- to have fuzzy_cmeans!(data, centers) method, which would allow specifying the centers eltype.

I agree with this. That sounds simpler and more reliable than trying to choose the floating-point type based on the input type. Indeed, depending on the situation, you may need more or less precision, independent from whether the data can be converted without loss. For example, Float16 is very imprecise, so even if your data is boolean or in a small range you probably don't want to use it. Float64 is the default everywhere in Julia because of this.

tlienart · 2019-01-30T10:18:01Z

See #143 where basically what I implemented:

allows to have any type for center
if X is a non-floating point it gets converted to Float64
centers get converted to eltype(X) after this first conversion so effectively Float64 unless X is specified as lower-precision float like Float32 in which case I don't think it makes a lot of sense to keep the centers in -say- Float64 while updating them with Float32 operations.
(Probably Float16 should be treated as Int etc, and just be converted to Float64, who uses Float16 in data anlytics anyway apart from in Neural Nets maybe? I didn't do this and just kept the behaviour pre making kmeans take AbstractMatrix instead of Matrix #138 in place.)

wildart · 2019-01-30T21:10:01Z

It does not make sense to use Integer data input for k-means without initial conversion to floating point format. The mean calculation results into a floating point value, I would not expect it to be integer.

I think that better to limit input type to be derived from AbstractFloat rather then Real, instead of converting data to Float64.

Moreover, make sure that the update_centers! is agnostic to input data types. I can spot a problem when the center weights array is initialized with Float64 zero value instead of appropriate data type:

Clustering.jl/src/kmeans.jl

Line 251 in eaae8dd

cweights[to_update] .= 0.0

IMHO, there should be no silent data conversion: whatever the input data type is provided the same data type should be expected in the result. It's a user responsibility to decide what kind of data to feed into the algorithm. We should just provide a clear description of parameter limitations.

nalimilan · 2019-01-30T21:23:19Z

Do we really need to convert the input, though? Can't we just allocate the resulting vector in the appropriate type without converting the input? If so, it sounds wasteful to require the user to convert beforehand (especially given that it requires making a copy of the data).

Moreover, make sure that the update_centers! is agnostic to input data types. I can spot a problem when the center weights array is initialized with Float64 zero value instead of appropriate data type:

Assignment into an array will never change its type, so what happen here is that 0.0 is converted to eltype(cweights). Using .= 0 would have the same effect.

tlienart · 2019-01-30T22:17:13Z

The proposal specifically uses .=zero(T) see

Clustering.jl/src/kmeans.jl

Line 314 in 8e156b5

cweights[to_update] .= zero(T)

etc

Re "silent data conversion" I'm in favour of throwing a warning message and do the conversion rather than just throw an InexactError.

If someone enters data that looks like:

1 1 1 2 1 2 2 2 1 1 50 55 60 50 65 50 65

you'd still want your kmeans to pick the obvious two clusters.

Do we really need to convert the input, though? Can't we just allocate the resulting vector in the appropriate type without converting the input? If so, it sounds wasteful to require the user to convert beforehand (especially given that it requires making a copy of the data).

Maybe not, but then copyseeds needs to be modified so that the type of centers is not tied to that of X and, rather, is tied to that of what X would be after conversion. This may work. I can give it a shot.

wildart · 2019-01-30T23:14:03Z

Assignment into an array will never change its type, so what happen here is that 0.0 is converted to eltype(cweights). Using .= 0 would have the same effect.

Sorry, my bad. I spotted assignment and acted on it.

Re "silent data conversion" I'm in favour of throwing a warning message and do the conversion rather than just throw an InexactError.

Restricting input to AbstractFloat will not allow run the function with incompatible input data, leaving a user a choice to convert data to an acceptable type.

Do we really need to convert the input, though? Can't we just allocate the resulting vector in the appropriate type without converting the input?

I believe its how the function works now - all internal structures and output have the same time as input, and that is how it always should be unless a particular algorithm specifies differently.

tlienart · 2019-01-30T23:22:03Z

Restricting input to AbstractFloat will not allow run the function with incompatible input data, leaving a user a choice to convert data to an acceptable type.

I think this is an odd restriction. In fact the example given on ScikitLearn's page itself is with integer data... https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

Also the previous behaviour with which you may be familiar had a few issues:

it required the type of X and centers to be the same in kmeans! which is not necessary though you may say that it would happen most of the time
you could still enter data as Float16 and potentially get shafted by InexactErrors
it oddly forced some of the things to be exactly Float64 when not necessarily required.

The attempt at #143 is to change this while offering better performances at the same time. I'm working on @nalimilan's suggestion which should allow to not use explicit conversion of the input data.

Also I'm doing all this with a view on future use of kmeans where the user has potentially no idea of the internals and may in fact try to feed a DataFrame or even a Table to kmeans, in that case a novice user may have no idea why it's failing with integer input and not know (or want) to convert it to float...

I think all these tools from super basic ML / Data analytics should be as robust to user input as possible... (especially when it's not very complicated to make them so)

wildart · 2019-01-30T23:53:46Z

I think this is an odd restriction. In fact the example given on ScikitLearn's page itself is with integer data...

That results in floating point array of centers.

the user has potentially no idea of the internals and may in fact try to feed a DataFrame or even a Table to kmeans,

I think users should be aware of type of their data and limitations of the algorithms that they use.

But, I see a reason for providing a different type of a centers array, only when it's done explicitly.

tlienart · 2019-01-30T23:55:19Z

That results in floating point array of centers.

The code as it stands currently in #143 does this as well fwiw.

tlienart mentioned this issue Jan 28, 2019

making kmeans take AbstractMatrix instead of Matrix #138

Merged

tlienart mentioned this issue Jan 30, 2019

refactoring of kmeans + type fixes #143

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Potential type issues with `fuzzy_cmeans` #140

Potential type issues with `fuzzy_cmeans` #140

tlienart commented Jan 28, 2019 •

edited

Loading

nalimilan commented Jan 29, 2019

tlienart commented Jan 29, 2019

alyst commented Jan 29, 2019

tlienart commented Jan 30, 2019 •

edited

Loading

nalimilan commented Jan 30, 2019

tlienart commented Jan 30, 2019 •

edited

Loading

wildart commented Jan 30, 2019 •

edited

Loading

nalimilan commented Jan 30, 2019

tlienart commented Jan 30, 2019 •

edited

Loading

wildart commented Jan 30, 2019

tlienart commented Jan 30, 2019 •

edited

Loading

wildart commented Jan 30, 2019

tlienart commented Jan 30, 2019

Potential type issues with fuzzy_cmeans #140

Potential type issues with fuzzy_cmeans #140

Comments

tlienart commented Jan 28, 2019 • edited Loading

nalimilan commented Jan 29, 2019

tlienart commented Jan 29, 2019

alyst commented Jan 29, 2019

tlienart commented Jan 30, 2019 • edited Loading

nalimilan commented Jan 30, 2019

tlienart commented Jan 30, 2019 • edited Loading

wildart commented Jan 30, 2019 • edited Loading

nalimilan commented Jan 30, 2019

tlienart commented Jan 30, 2019 • edited Loading

wildart commented Jan 30, 2019

tlienart commented Jan 30, 2019 • edited Loading

wildart commented Jan 30, 2019

tlienart commented Jan 30, 2019

Potential type issues with `fuzzy_cmeans` #140

Potential type issues with `fuzzy_cmeans` #140

tlienart commented Jan 28, 2019 •

edited

Loading

tlienart commented Jan 30, 2019 •

edited

Loading

tlienart commented Jan 30, 2019 •

edited

Loading

wildart commented Jan 30, 2019 •

edited

Loading

tlienart commented Jan 30, 2019 •

edited

Loading

tlienart commented Jan 30, 2019 •

edited

Loading