-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite reclustering #340
Rewrite reclustering #340
Conversation
3e28676
to
12d6dde
Compare
Hm, it may be that the changed deduplication is actually worse - perhaps because most contigs have no marker information, so this new approach that dynamically finds the best clusters is not as good as globally finding the best eps value, allowing the clusters with markers to inform the eps value of the clusters without markers. Needs tests. |
bbf74f3
to
900cabf
Compare
This commit makes parsemarkers work even when the split contigs do not have integer identifiers. It also allows easier subsetting of the contigs to predict genes for. The API is still a little bad, but it's being restricted by Python's bad multiprocessing.
This commit completely rewrites reclustering. A few important differences: * It now uses Markers from vamb.parsemarkers, avoiding shelling out to prodigal and hmmer * The algorithm used to pick the best disjoint bins in the DBScan algorithm is faster, and should also produce better results. * The code is almost completely statically analyzable * Pandas is no longer used
Commit cb93655 removed the last use of Pandas in the codebase. Good riddance!
900cabf
to
3adaa41
Compare
I'm going to merge this and then commence the larger quality control and testing needed for v5. That implies there will probably be some bugs in this commit. |
This PR completely rewrites reclustering. A few important differences:
and hmmer
faster, and should also produce better results.
It also refactors
parsemarkers.py
.It is built on #339, so that needs to be merged first.
TODO