Do not store data multiple times in memory if possible #354

Open
jakobnissen opened this issue Aug 28, 2024 · 0 comments

For large datasets, memory becomes a constraining factor. Large datasets typically consist of millions of contigs and 100+ samples. I currently believe that the most memory-consuming part of Vamb is the encoding. Here, the largest data structures are:

  • The abundance array, which is n_contigs * n_samples floats.
  • The composition array, which is n_contigs * 103 floats.
  • The latent array, which is n_contigs * n_latent floats.

There is also other data around - the contig lengths (one uint32 per contig), the contig names and such. However, when n_samples >> 100, this should matter less. A back-of-the-envelope version of the estimate is sketched below.
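A minimal sketch of that estimate in plain Python (the function name and the default n_latent of 32 are my assumptions, not taken from Vamb's code):

```python
def estimate_encoding_bytes(
    n_contigs: int,
    n_samples: int,
    n_latent: int = 32,  # assumed latent dimension; adjust to the actual setting
    float_size: int = 4,  # float32
) -> int:
    """Rough lower bound on memory for the three arrays listed above,
    plus one uint32 length per contig."""
    abundance = n_contigs * n_samples * float_size
    composition = n_contigs * 103 * float_size
    latent = n_contigs * n_latent * float_size
    lengths = n_contigs * 4  # uint32
    return abundance + composition + latent + lengths

# The ~1600-sample dataset discussed below:
print(estimate_encoding_bytes(15_000_000, 1600) / 1e9)  # ~104 GB
```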

@shiraz-shah had a dataset of ~1600 samples and ~1.5e7 contigs, so we should estimate the memory usage to be 4 * (1600 + 103) * 1.5e7 ≈ 1.02e11 bytes. However, memory usage was twice as high: it apparently showed up in the system as two processes, each taking ~102 GB.
That suggests that, somehow, the underlying data is duplicated.
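For reference, one common source of this kind of duplication in numpy/PyTorch pipelines is copying the array when converting it to a tensor rather than sharing its buffer; whether that is the cause here is only a guess to be verified:

```python
import numpy as np
import torch

arr = np.zeros((1_000, 103), dtype=np.float32)

# torch.from_numpy shares the underlying buffer - no extra memory.
shared = torch.from_numpy(arr)
assert np.shares_memory(shared.numpy(), arr)

# torch.tensor copies the data - memory for this array is doubled.
copied = torch.tensor(arr)
assert not np.shares_memory(copied.numpy(), arr)
```

(Another usual suspect is a forked worker process, e.g. a DataLoader with num_workers > 0, which would also show up as a second process, though copy-on-write means it does not necessarily double physical memory.)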

To solve this:

  • Get a big dataset of ~2e6 contigs * 1000 samples, which should take around 9 GB of memory.
  • Run it through Vamb with reduced epochs and see if memory exceeds the estimate by roughly 2x (one way to measure the peak is sketched after this list).
  • Then drill down into why.
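For the second step, a minimal way to read out peak memory from inside the process, assuming Linux (where ru_maxrss is reported in KiB; on macOS it is in bytes):

```python
import resource
import sys

def peak_rss_gb() -> float:
    """Peak resident set size of this process in GB (Linux: ru_maxrss is KiB)."""
    kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return kib * 1024 / 1e9

# ... run the encoding here ...
print(f"Peak RSS: {peak_rss_gb():.1f} GB", file=sys.stderr)
```

Since the report above mentions two processes, it may also be necessary to sum RSS over child processes (e.g. with an external tool or psutil) rather than trusting a single number.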