Rewrite reclustering #340

jakobnissen · 2024-07-16T12:51:34Z

This PR completely rewrites reclustering. A few important differences:

* It now uses Markers from vamb.parsemarkers, avoiding shelling out to prodigal and hmmer
* The algorithm used to pick the best disjoint bins in the DBScan algorithm is faster, and should also produce better results.
* The code is almost completely statically analyzable
* Pandas is no longer used
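
The PR text does not show the selection algorithm itself, so the following is only a minimal sketch of one way to pick disjoint bins greedily with a priority queue. It uses the standard library's `heapq` with lazy invalidation as a stand-in (a later commit in this PR mentions pqdict). `pick_disjoint_bins`, the `candidates` mapping, and the `score` callback are hypothetical names, and scoring by `len` is for illustration only, not the scoring this PR uses.

```python
import heapq


def pick_disjoint_bins(candidates, score):
    """Greedily pick non-overlapping bins, best score first.

    candidates: dict mapping a bin name to a set of contig names (hypothetical layout).
    score: callable taking a set of contigs and returning a number; higher is better.
    """
    # Copy the sets so the caller's data is left untouched.
    remaining = {name: set(contigs) for name, contigs in candidates.items()}

    # Index each contig to the bins that contain it.
    owners = {}
    for name, contigs in remaining.items():
        for contig in contigs:
            owners.setdefault(contig, set()).add(name)

    # heapq is a min-heap, so scores are negated. The version number lets us
    # recognise and skip heap entries that were superseded by a later push.
    version = {name: 0 for name in remaining}
    heap = [(-score(contigs), name, 0) for name, contigs in remaining.items()]
    heapq.heapify(heap)

    picked = {}
    while heap:
        _, name, entry_version = heapq.heappop(heap)
        if name not in remaining or entry_version != version[name]:
            continue  # stale entry: bin was picked, emptied, or rescored
        contigs = remaining.pop(name)
        picked[name] = contigs

        # Strip the picked contigs from every overlapping bin.
        touched = set()
        for contig in contigs:
            for other in owners[contig]:
                if other in remaining:
                    remaining[other].discard(contig)
                    touched.add(other)

        # Rescore the shrunken bins, dropping those that became empty.
        for other in touched:
            version[other] += 1
            if remaining[other]:
                heapq.heappush(
                    heap, (-score(remaining[other]), other, version[other])
                )
            else:
                del remaining[other]
    return picked


example = {"a": {"c1", "c2", "c3"}, "b": {"c3", "c4"}, "c": {"c4"}}
print(pick_disjoint_bins(example, len))
```

In this toy input, bin `a` is picked first, `b` shrinks to `{"c4"}` and is picked next, and `c` becomes empty and is dropped. Every contig ends up in at most one output bin.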

It also refactors parsemarkers.py.
It is built on #339, so that needs to be merged first.

TODO

* Merge #339 (Make TaxVamb taxonomy output format well-defined)
* Add tests

jakobnissen · 2024-08-30T13:14:27Z

Hm, it may be that the changed deduplication is actually worse. Most contigs have no marker information, so the new approach, which dynamically finds the best clusters, may not be as good as globally finding the best eps value, which lets the clusters with markers inform the eps value used for the clusters without markers. Needs tests.
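
To make the "global eps" idea concrete, here is a small sketch under loud assumptions. It is not the code from this PR. The clustering is a simplified 1-D stand-in for DBSCAN (neighbouring points within eps share a cluster), and `cluster_1d`, `is_good`, `best_global_eps`, the `expected` marker set, and the toy data are all hypothetical. The point it illustrates is that a single eps is chosen using only the contigs that carry markers, and that same eps then also decides how the marker-free contigs are grouped.

```python
from collections import Counter


def cluster_1d(points, eps):
    """Group 1-D points: sorted neighbours at most eps apart share a cluster."""
    order = sorted(points, key=points.get)
    clusters = []
    for name in order:
        if clusters and points[name] - points[clusters[-1][-1]] <= eps:
            clusters[-1].append(name)
        else:
            clusters.append([name])
    return clusters


def is_good(cluster, markers, expected):
    """True if the cluster holds every expected marker exactly once."""
    counts = Counter(m for name in cluster for m in markers.get(name, ()))
    return set(counts) == expected and all(n == 1 for n in counts.values())


def best_global_eps(points, markers, expected, eps_candidates):
    """Pick the single eps that yields the most 'good' clusters overall."""
    def n_good(eps):
        return sum(is_good(c, markers, expected) for c in cluster_1d(points, eps))
    return max(eps_candidates, key=n_good)


# Toy data: contigs a-d carry markers, contigs e-g carry none.
points = {"a": 0.0, "b": 0.1, "c": 1.0, "d": 1.1, "e": 5.0, "f": 5.1, "g": 6.0}
markers = {"a": ["m1"], "b": ["m2"], "c": ["m1"], "d": ["m2"]}

eps = best_global_eps(points, markers, {"m1", "m2"}, [0.05, 0.2, 1.0])
print(eps, cluster_1d(points, eps))
```

Here the marker-bearing contigs select eps = 0.2, and that value also splits the marker-free contigs into `["e", "f"]` and `["g"]`. A per-cluster approach would have no marker signal to guide that split.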

jakobnissen · 2024-11-11T09:41:24Z

I'm going to merge this and then commence the larger quality control and testing needed for v5. That implies there will probably be some bugs in this commit.

jakobnissen force-pushed the recluster branch from e9b7619 to 2e399f5 July 17, 2024 05:48

jakobnissen mentioned this pull request Jul 17, 2024

CLI errors do not fail CI #344

Closed

jakobnissen force-pushed the recluster branch 3 times, most recently from 3e28676 to 12d6dde July 22, 2024 08:57

jakobnissen added the labels Needs benchmark (Must benchmark before merging this), needs tests, and Needs workflow run (Test the associated Snakemake workflow before merge) Jul 22, 2024

jakobnissen force-pushed the recluster branch from 6e05a7a to ba4124b August 27, 2024 07:29

jakobnissen force-pushed the recluster branch from bbf74f3 to 900cabf September 4, 2024 08:37

jakobnissen added 15 commits November 6, 2024 12:53

Refactor parsemarkers

eee3073

This commit makes parsemarkers work even when the split contigs do not have integer identifiers. It also allows easier subsetting of the contigs to predict genes for. The API is still a little bad, but it's being restricted by Python's bad multiprocessing.

Remove Pandas dependency

226e27a

Commit cb93655 removed the last use of Pandas in the codebase. Good riddance!

Fixup: Assert tax is canonical in reclustering

fe0ec14

Fixup: Use Union types correctly

2537e91

Fixup reclustering: Raise better error

e21b6ed

Use new reclustering in main

b043641

Fixup: Add attribute

39b9903

Fixup: Amend tests

85740b1

Make predict_taxonomy return the computed object

4ffc098

Remove use of stdlib

0cdd944

Add tests for parsemarkers

4fa4504

Fixup: Add comment

735edf4

Make reclustering work

a2e3c54

Use pqdict for efficient reclustering deduplication

3adaa41

jakobnissen force-pushed the recluster branch from 900cabf to 3adaa41 November 6, 2024 11:54

jakobnissen marked this pull request as ready for review November 11, 2024 09:30

Back to old alg

e548fa0

jakobnissen merged commit 1127238 into master Nov 11, 2024
6 checks passed

jakobnissen deleted the recluster branch November 11, 2024 09:41
