Minimum samples to cluster & cluster count linked #99

smangham · 2018-06-18T09:21:45Z

There's a slightly odd interaction between the minimum cluster size and cells with few entries. In kmapper.py:372, cells are only considered for clustering if they contain >= min_cluster_samples samples. But min_cluster_samples is set to n_clusters.

So if you set n_clusters to 3, then any cell with 3 samples will produce 3 separate 1-sample clusters in the output. Any cell with 2 samples will produce 0 clusters (and thus likely a different unique sample count in the output). This probably has little to no impact on the graph and is unlikely to show up except in small trial datasets, but it is a bit confusing that the parameter is reused for a different (if related) purpose.
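To make the interaction concrete, here is a minimal toy model of the gate described above. It is not kmapper's actual code: the function name `clusters_per_cell` and the cell names are hypothetical, and it assumes a clusterer that always returns exactly n_clusters non-empty clusters.

```python
def clusters_per_cell(cell_sizes, n_clusters):
    """Toy model of the size gate (hypothetical helper, not kmapper's code)."""
    # Assumption: the threshold is tied to n_clusters, as reported in this issue.
    min_cluster_samples = n_clusters
    counts = {}
    for cell_id, size in cell_sizes.items():
        if size >= min_cluster_samples:
            # Assumption: the clusterer yields exactly n_clusters non-empty clusters.
            counts[cell_id] = n_clusters
        else:
            # The cell is skipped, so its samples never reach the output.
            counts[cell_id] = 0
    return counts


cell_sizes = {"cell_a": 3, "cell_b": 2}  # hypothetical cells
result = clusters_per_cell(cell_sizes, n_clusters=3)
print(result)
```

Under these assumptions, the 3-sample cell is split into 3 singleton clusters while the 2-sample cell contributes nothing.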

sauln · 2018-06-19T09:23:12Z

This is a bit confusing. If you have any ideas for how to fix it, a PR would be very welcome.

I think this started as a necessary condition for clustering methods that fail unless they are given at least n samples, and it was not meant as a user-facing facility.

Do you want to expose the option to set min_cluster_samples to the user in a better way?

rintrah · 2018-07-05T14:39:58Z

Dear all,

I ran into the same odd interaction and decided to make the number of cells explicit. I also had some odd interaction between agglomerative clustering and min_cluster_samples.

I think it would be better to expose min_cluster_samples as an option that the user can set.

Kind regards,

MLWave · 2018-07-10T08:47:06Z

Cool, we can do that.

Originally this was added because some clustering algorithms in scikit-learn exit with an error if you try to cluster data with fewer samples than n_clusters (for example, n_clusters=3 with 2 samples in a hypercube results in an error).
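A small sketch of that failure mode, using scikit-learn's KMeans as one example of such an algorithm. The helper name `try_cluster` is hypothetical, and the exact error text may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans


def try_cluster(points, n_clusters):
    """Return the error message if fitting fails, else None (hypothetical helper)."""
    try:
        KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(points)
    except ValueError as exc:
        # KMeans rejects inputs with fewer samples than clusters.
        return str(exc)
    return None


two_samples = np.array([[0.0, 0.0], [1.0, 1.0]])  # a "hypercube" with 2 samples
error = try_cluster(two_samples, n_clusters=3)  # fewer samples than clusters
ok = try_cluster(two_samples, n_clusters=2)     # exactly enough samples
print(error)
```

This is why some size check is needed before clustering each cell, though the threshold does not have to be the same value as n_clusters.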

sauln · 2018-07-12T22:40:25Z

Hi @rintrah, Pull requests are very welcome if you have already made some of these changes!

Thank you!

rintrah · 2018-07-19T15:55:23Z

Hello, Sauln,

I am sorry for my late response, but the past days have been hectic.

I will write a clean version and open a pull request. I haven't done a pull request before, but I don't think it will be too difficult.

Best wishes,

sauln · 2018-07-19T16:36:24Z

No worries!

We look forward to the PR and will try to help in whatever way possible to make the experience easy.

Please let me know if you have any questions.

sauln added the help wanted label Aug 14, 2018
