Potential type issues with fuzzy_cmeans
#140
Comments
Solution 1. isn't very interesting since it fixes the error by throwing another error. Better choose the appropriate element type using |
I agree. The disadvantage is that it requires converting the matrix to a floating-point one, which effectively copies the data. With respect to |
Why so? The
|
In centers[i, cj] += X[i, j] and centers[i, cj] += X[i, j] * wj are the ones causing issues if the type of Allegedly you don't necessarily need to fully construct What I would suggest:
PS: float(Int16) # Float64
float(Bool) # Float64 |
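To make the float(T) remark concrete, here is a minimal sketch of accumulating weighted centers into an array whose element type is float(eltype(X)), so the input is never copied or converted. The function name weighted_centers_sketch and its arguments are hypothetical and are not taken from Clustering.jl; this only illustrates the idea, not the code in the package or in #143.

```julia
# Hypothetical helper, not part of Clustering.jl: weighted cluster centers
# accumulated in a floating-point array, leaving the input X untouched.
function weighted_centers_sketch(X::AbstractMatrix{T},
                                 assignments::AbstractVector{Int},
                                 weights::AbstractVector{<:Real},
                                 k::Int) where {T<:Real}
    d, n = size(X)
    # float(T) maps Int16, Bool, Int64, ... to Float64 and keeps Float32 as Float32.
    centers = zeros(float(T), d, k)
    wsums = zeros(float(T), k)
    for j in 1:n
        cj = assignments[j]
        wj = weights[j]
        wsums[cj] += wj
        for i in 1:d
            # The accumulator is floating point, so no InexactError here.
            centers[i, cj] += X[i, j] * wj
        end
    end
    # Normalise each center by its total weight (an empty cluster gives NaN).
    for c in 1:k, i in 1:d
        centers[i, c] /= wsums[c]
    end
    return centers
end

X = Int16[1 2 10 11; 1 3 10 12]   # two features, four points, integer data
centers = weighted_centers_sketch(X, [1, 1, 2, 2], [1.0, 1.0, 1.0, 3.0], 2)
```

With this layout the only allocation is the d-by-k centers array, which is usually much smaller than X.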
I agree with this. That sounds simpler and more reliable than trying to choose the floating-point type based on the input type. Indeed, depending on the situation, you may need more or less precision, independent from whether the data can be converted without loss. For example, |
See #143 where basically what I implemented:
|
It does not make sense to use Integer data input for k-means without initial conversion to floating point format. The mean calculation results in a floating point value; I would not expect it to be an integer. I think it is better to limit the input type to be derived from Moreover, make sure that the Line 251 in eaae8dd
IMHO, there should be no silent data conversion: whatever input data type is provided, the same data type should be expected in the result. It is the user's responsibility to decide what kind of data to feed into the algorithm. We should just provide a clear description of parameter limitations. |
Do we really need to convert the input, though? Can't we just allocate the resulting vector in the appropriate type without converting the input? If so, it sounds wasteful to require the user to convert beforehand (especially given that it requires making a copy of the data).
Assignment into an array will never change its type, so what happens here is that |
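A small illustration of the point about assignment, using only Base Julia. The variable names are made up for the example. Storing a value in an integer array converts the value to the array's element type and throws an InexactError when that conversion is not exact; the array itself stays an integer array.

```julia
a = zeros(Int, 2)      # Vector{Int}
a[1] = 2.0             # fine: 2.0 converts exactly to 2

# a[2] += 0.5 means a[2] = a[2] + 0.5, so it tries to store 0.5 in an Int slot.
err = try
    a[2] += 0.5
    nothing
catch e
    e
end

# Allocating the accumulator with a floating-point element type avoids this.
b = zeros(float(Int), 2)   # Vector{Float64}
b[2] += 0.5
```

Here err holds the InexactError, a is still a Vector{Int}, and b accepts the fractional update.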
The proposal specifically uses Line 314 in 8e156b5
Re "silent data conversion" I'm in favour of emitting a warning message and doing the conversion rather than just throwing an If someone enters data that looks like: 1 1 1 2 1 2 2 2 1 1 50 55 60 50 65 50 65 you'd still want your
Maybe not, but then |
Sorry, my bad. I spotted assignment and acted on it.
Restricting input to
I believe that is how the function works now: all internal structures and the output have the same type as the input, and that is how it always should be unless a particular algorithm specifies differently. |
I think this is an odd restriction. In fact the example given on ScikitLearn's page itself is with integer data... https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html Also the previous behaviour with which you may be familiar had a few issues:
The attempt at #143 is to change this while offering better performance at the same time. I'm working on @nalimilan's suggestion, which should make it possible to avoid explicit conversion of the input data. Also I'm doing all this with a view on future use of I think all these tools from super basic ML / Data analytics should be as robust to user input as possible... (especially when it's not very complicated to make them so) |
That results in floating point array of centers.
I think users should be aware of type of their data and limitations of the algorithms that they use. But, I see a reason for providing a different type of a centers array, only when it's done explicitly. |
The code as it stands currently in #143 does this as well fwiw. |
Following explorations in #138 I had a look at the code for fuzzy_cmeans and tried:
I believe this may stem from the fact that the algorithm accepts any subtype of Real and therefore <:Integer, but with integers there may be type issues with how centers are assigned and updated.
Two suggestions:
1. Restrict to AbstractFloat everywhere in clustering. I believe that solves the problem, but maybe we would like clustering algorithms to work on ints...
2. Convert X to float if it isn't, which, I believe, would be the expected behaviour from a maths perspective.
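The second suggestion (converting X to floating point when it is not already) could look roughly like the sketch below. The helper name to_float_sketch is hypothetical. It relies on Base's float for arrays, which returns the same array when the element type is already an AbstractFloat and otherwise allocates a converted copy.

```julia
# Hypothetical helper, not part of Clustering.jl.
# float(X) is a no-op for floating-point arrays and a converting copy otherwise.
to_float_sketch(X::AbstractMatrix{<:Real}) = float(X)

Xi = [1 2; 3 4]               # Matrix{Int}
Xf = Float32[1 2; 3 4]        # Matrix{Float32}

Yi = to_float_sketch(Xi)      # new Matrix{Float64}, data copied
Yf = to_float_sketch(Xf)      # the very same array, no copy
```

The copy for integer input is the cost raised in the comments; allocating only the centers in a floating-point type avoids it.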