Do not store data multiple times in memory if possible #354

Open
jakobnissen opened this issue Aug 28, 2024 · 0 comments

For large datasets, memory becomes a constraining factor. Large datasets typically consist of millions of contigs and 100+ samples. I currently believe that the most memory-consuming part of Vamb is the encoding. Here, the largest data structures are:

  • The abundance array, which is n_contigs * n_samples floats.
  • The composition array, which is n_contigs * 103 floats.
  • The latent array, which is n_contigs * n_latent floats.

There is also other data around - the contig lengths (one uint32 per contig), the contig names and such. However, when n_samples >> 100, this should matter less. A back-of-the-envelope version of the estimate is sketched below.
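A minimal sketch of that estimate in plain Python (the function name and the default n_latent of 32 are my assumptions, not taken from Vamb's code):

```python
def estimate_encoding_bytes(
    n_contigs: int,
    n_samples: int,
    n_latent: int = 32,  # assumed latent dimension; adjust to the actual setting
    float_size: int = 4,  # float32
) -> int:
    """Rough lower bound on memory for the three arrays listed above,
    plus one uint32 length per contig."""
    abundance = n_contigs * n_samples * float_size
    composition = n_contigs * 103 * float_size
    latent = n_contigs * n_latent * float_size
    lengths = n_contigs * 4  # uint32
    return abundance + composition + latent + lengths

# The ~1600-sample dataset discussed below:
print(estimate_encoding_bytes(15_000_000, 1600) / 1e9)  # ~104 GB
```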

@shiraz-shah had a dataset of ~1600 samples and ~1.5e7 contigs, so we should estimate the memory usage to be 4 * (1600 + 103) * 1.5e7 ≈ 1.02e11 bytes. However, memory usage was twice as high: it apparently showed up in the system as two processes, each taking ~102 GB.
That suggests that, somehow, the underlying data is duplicated.
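For reference, one common source of this kind of duplication in numpy/PyTorch pipelines is copying the array when converting it to a tensor rather than sharing its buffer; whether that is the cause here is only a guess to be verified:

```python
import numpy as np
import torch

arr = np.zeros((1_000, 103), dtype=np.float32)

# torch.from_numpy shares the underlying buffer - no extra memory.
shared = torch.from_numpy(arr)
assert np.shares_memory(shared.numpy(), arr)

# torch.tensor copies the data - memory for this array is doubled.
copied = torch.tensor(arr)
assert not np.shares_memory(copied.numpy(), arr)
```

(Another usual suspect is a forked worker process, e.g. a DataLoader with num_workers > 0, which would also show up as a second process, though copy-on-write means it does not necessarily double physical memory.)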

To solve this:

  • Get a big dataset of ~2e6 contigs * 1000 samples, which should take around 9 GB of memory.
  • Run it through Vamb with reduced epochs and see if memory exceeds the estimate by roughly 2x (one way to measure the peak is sketched after this list).
  • Then drill down into why.
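For the second step, a minimal way to read out peak memory from inside the process, assuming Linux (where ru_maxrss is reported in KiB; on macOS it is in bytes):

```python
import resource
import sys

def peak_rss_gb() -> float:
    """Peak resident set size of this process in GB (Linux: ru_maxrss is KiB)."""
    kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return kib * 1024 / 1e9

# ... run the encoding here ...
print(f"Peak RSS: {peak_rss_gb():.1f} GB", file=sys.stderr)
```

Since the report above mentions two processes, it may also be necessary to sum RSS over child processes (e.g. with an external tool or psutil) rather than trusting a single number.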