AAE mode taking too long for clustering and tsv file generation #232

Open
microbiomix opened this issue Oct 24, 2023 · 4 comments

@microbiomix

Hello,

I am binning ~22M contigs from ~900 samples using AAE with default parameters. Training finished in 26 hours on an A100 GPU. Since then, it has been clustering and writing the clusters.tsv file extremely slowly. The first 18M contigs were written quickly, but progress then slowed sharply: it went from 18.4M contigs at the 12-hour mark to only 18.8M contigs at the 18-hour mark. After 3.5 days, it is still at 20.5M contigs. I am not sure how much longer I will have to wait!
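For a rough sense of scale, the checkpoints above can be turned into a write rate and a remaining-time estimate. This is only a back-of-the-envelope sketch: it assumes the 12-hour, 18-hour, and 3.5-day marks are all measured from the same starting point, treats "~22M" as exactly 22M, and assumes the most recent rate holds, which is probably optimistic since the rate has been falling. The names `checkpoints`, `rates`, and `eta_hours` are made up for illustration.

```python
# Progress checkpoints from the post: (elapsed hours, contigs written).
# Assumption: all timestamps share the same starting point.
checkpoints = [
    (12.0, 18.4e6),
    (18.0, 18.8e6),
    (84.0, 20.5e6),  # 3.5 days
]
TOTAL_CONTIGS = 22e6  # "~22M" in the post, so only approximate


def rates(points):
    """Contigs written per hour between consecutive checkpoints."""
    return [(c1 - c0) / (t1 - t0) for (t0, c0), (t1, c1) in zip(points, points[1:])]


def eta_hours(points, total):
    """Hours left if the most recent rate stayed constant."""
    latest_rate = rates(points)[-1]
    return (total - points[-1][1]) / latest_rate


if __name__ == "__main__":
    for r in rates(checkpoints):
        print(f"{r:,.0f} contigs/hour")
    print(f"~{eta_hours(checkpoints, TOTAL_CONTIGS):.0f} hours remaining at the latest rate")
```

Under these assumptions the rate drops from roughly 67k contigs/hour to roughly 26k contigs/hour, which leaves more than two further days for the last 1.5M contigs even if the slowdown does not get worse.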

Currently, the GPU shows 25GB of memory in use and 99% utilization. The Linux OS shows that vamb is using 1 CPU and 150GB of memory.

For comparison, on the same server/GPU, VAE mode took 17 hours to train and 4.5 hours to cluster. I am using version 4.1.3.

Is this expected or am I doing something wrong? Happy to provide more info if needed.

Thanks a lot!

@jakobnissen
Member

Dear @microbiomix

Thanks for the report. I'll need some more information:

  • What version of AVAMB are you running?
  • AVAMB clusters three times: with the VAE, the Z space, and the Y space. The Y should be of no concern. The two others use the same underlying clustering algorithm as the VAE. Can you post the log file, so I can see which clustering step is taking time?

I'm aware of some issues in the clustering algorithm when run on GPU with a lot of contigs. I think these have been resolved on master, but I'll take another look.

@microbiomix
Author

Hi @jakobnissen,

Thanks for the response.

  • I am using AVAMB 4.1.3 installed using pip, not the master branch from GitHub.
  • I ran VAE and AAE separately but in parallel to save time (cf. "Parallel training of VAE and AAE?" #170). The problematic run is the AAE-only mode. The file that is being generated slowly is aae_z_clusters.tsv. I am attaching the partial log file: log.txt

@microbiomix
Author

Hi @jakobnissen ,

The AVAMB 4.1.3 installation that I have via conda uses the following parameters in cluster.py:

_DEFAULT_RADIUS = 0.06
# Distance within which to search for medoid point
_MEDOID_RADIUS = 0.05

So it doesn't seem to be affected by the issues listed in #250 and the issues linked from it? Any idea where this slowdown might be originating? Should I try the master version to see whether it is resolved there?
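For anyone who wants to repeat this check on their own installation, here is a minimal sketch that pulls the radius constants out of the text of a cluster.py file. It assumes the constants are plain module-level assignments, as in the excerpt above; the helper name `read_radius_constants` and the `sample` string are made up for illustration and are not part of vamb.

```python
import re


def read_radius_constants(source_text):
    """Return {name: value} for assignments such as `_DEFAULT_RADIUS = 0.06`."""
    # Matches names that start with an underscore and end in RADIUS,
    # assigned a plain decimal literal at the start of a line.
    pattern = re.compile(r"^(_[A-Z_]*RADIUS)\s*=\s*([0-9.]+)", re.MULTILINE)
    return {name: float(value) for name, value in pattern.findall(source_text)}


# Stand-in for the file contents, copied from the excerpt in this comment.
sample = """_DEFAULT_RADIUS = 0.06
# Distance within which to search for medoid point
_MEDOID_RADIUS = 0.05
"""

if __name__ == "__main__":
    print(read_radius_constants(sample))
```

To run it against an installed copy, pass in the contents of that installation's cluster.py instead of `sample`; the file's location depends on the environment and is not shown in this thread.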

@ShriramHPatel

@microbiomix any luck resolving the slow clustering issue with AAE mode? I adapted the avamb workflow v4.1.3, and this step takes forever (even with a GPU). Thanks
