-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add HDBSCAN from HorseML.jl #273
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #273 +/- ##
==========================================
+ Coverage 95.40% 95.56% +0.15%
==========================================
Files 20 22 +2
Lines 1503 1647 +144
==========================================
+ Hits 1434 1574 +140
- Misses 69 73 +4 ☔ View full report in Codecov by Sentry. |
@MommaWatasu Thanks for the PR! I think it would be a useful addition to the package. I will try to review it soon. Meanwhile -- it looks like there are no unit tests for the method. Could you please add some? |
there was no test for it
test failed on a ubuntu-latest-x86
tests still fails on ubuntu-latest-x86
I added some test for hdbscan. But I'm sorry for not being able to write unit test for utility functions.
|
@MommaWatasu Thanks for adding unit tests. Generally we don't test internal utility functions, we only test public API, but aim to cover both meaningful examples (small datasets with nontrivial clusterings) and corner cases (e.g. single data point). |
there were functions for Xmeans
deleted some utility functions
Documentation workflow failed due to the docs for it
I forgot to export it
I noticed that |
Great, erf is provided by SpecialFunctions.jl, but as we want to be conservative on the number of package dependencies, it is convenient that we don't need it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for the PR!
The first iteration looks good -- we don't need much refactoring, mostly some method renames and object field tweaks for clarity.
Of a bigger things:
- check the dimensions of the
points
matrix - use
Distances.jl
API to calculate the point distances (see the other methods, e.g.clustering_quality()
for how we do it); it is worth precalculating all pairwise distances and passing it to the methods that calculate core distances and build the graph. We can also allow the user to specify the metric as a kwarg tohdbscan
, which would be a useful generalization. - add more tests to check that the point assignments to the clusters are correct
there is no need to create new file
add comment and alias
hdbscan.md remains
the progress is temporary
add comment and improve performance, etc.
changes sugges alyst
error occured with Julia1.10
forgot to remove debugging code
add description for detail
fix links
add space
make isnoise available for user
@alyst |
we don't really need a function, this is one line operation
for consistency
and remove Base.getindex()
add_edges!() is used only once
@MommaWatasu Thank you for adjusting the PR! We are getting close to be able to merge, but we would need one more iteration. Note that I have pushed some adjustments to the code directly to your branch, so please make sure to pull them first. TODO items:
|
replace eachcol with alternatives
add newline
cleanup UnionFind terminology and move it to the other file
there wes no test for it
rename HdbscanCluster into HdbscanNode and create new one to expose to the user
the algorithm went wrong
ensure that `min_size` effects properly
add counts field into HdbscanResult
@alyst I have one thing to apology. I found that the reason why the clustering result went wrong was my serious mistake about the algorithm. I fixed the algorithm and checked the result is correct. In addition, all the TODO items have been done(but I couldn't add the unit test about I would appreciate it if you could check for any performance issues regarding the fixed algorithm. |
Hi, thanks for the effort of bringing HDBSCAN to Clustering.jl! I wonder what is the current status of this issue? |
Add HDBSCAN
I found this issue. I wanted to my code to be useful, so compiled it into hdbscan.jl so it works as is.
I changed some code from my original to get closer to the code of this repo.
Abstract of functions and structures
As I wrote in comment, many utility functions are just converted from numpy by myself, so I don't know many about them.
Usage
This is the usage of main function.
Parameters
points
: the d×n matrix, where each column is a d-dimensional coordinate of a pointk
: we will define "core distance of point A" as the distance between point A and thek
th neighbor point of point A.min_cluster_size
: minimum number of points in the clustergen_mst
: whether to generate minimum-spannig-tree or notmst
: when is specified andgen_mst
is false, new mst won't be generatedExample
I checked that this following code is available: