Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minimum samples to cluster & cluster count linked #99

Open
smangham opened this issue Jun 18, 2018 · 6 comments
Open

Minimum samples to cluster & cluster count linked #99

smangham opened this issue Jun 18, 2018 · 6 comments

Comments

@smangham
Copy link
Contributor

There's a slightly odd interaction with minimum cluster size and cells with few entries. In kmapper.py:372, cells are only checked for clustering if there are >= min_cluster_samples samples within them. But min_cluster_samples is set to n_clusters.

So if you set n_clusters to 3, then any cell with 3 samples in will produce 3 separate 1-sample clusters in the output. Any cell with 2 samples will produce 0 clusters (and thus likely a different unique sample count in the output). This probably has little to no impact on the graph and is unlikely to show up except in small trial datasets, but it is a bit confusing that the parameter is reused for a different (if related) purpose.

@sauln
Copy link
Member

sauln commented Jun 19, 2018

This is a bit confusing. If you have any ideas for how to fix it, a PR would be very welcome.

I think this started as a necessary condition when using a clustering method that requires n many samples or will fail and was not meant as a user facing facility.

Do you want to expose the option to set min_cluster_samples to the user in a better way?

@rintrah
Copy link

rintrah commented Jul 5, 2018

Dear all,

I ran into the same odd interaction and decided to make explicit the number of cells. I also had some odd interaction with agglomerative clustering and min_cluster_samples.

I think that it would be better to expose the option min_cluster_samples to set to the user.

Kind regards,

@MLWave
Copy link
Member

MLWave commented Jul 10, 2018

Cool, we can do that.

Originally this was added, because some cluster algorithms in Scikit-learn exit with an error if you try clustering data that has a size less than n_clusters. (n_clusters=3 and 2 samples in hypercube results in an error).

@sauln
Copy link
Member

sauln commented Jul 12, 2018

Hi @rintrah, Pull requests are very welcome if you have already made some of these changes!

Thank you!

@rintrah
Copy link

rintrah commented Jul 19, 2018

Hello, Sauln,

I am sorry for my late response, but the past days have been hectic.

I will write a clean version and do a pull request. I haven't done pull request before, but I think it won't be so difficult.

Best wishes,

@sauln
Copy link
Member

sauln commented Jul 19, 2018

No worries!

We look forward to the PR and will try to help in whatever way possible to make the experience easy.

Please let me know if you have any questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants