For large datasets, memory becomes a constraining factor. Large datasets typically consist of millions of contigs and 100+ samples. I currently believe that the most memory-consuming part of Vamb is the encoding step. Here, the largest data structures are:
- The abundance array, which is n_contigs * n_samples floats.
- The composition array, which is n_contigs * 103 floats.
- The latent array, which is n_contigs * n_latent floats.
There is also other data around - the contig lengths (one uint32 each), the contig names and such. However, when n_samples >> 100, this should matter less.
@shiraz-shah had a dataset of ~1600 samples and ~1.5e7 contigs, so we would estimate the memory usage at 4 * (1600 + 103) * 1.5e7 ≈ 1.02e11 bytes (~102 GB). However, actual memory usage was twice as high: it apparently showed up in the system as two processes, each taking ~102 GB.
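For reference, a quick back-of-the-envelope sketch of that estimate (not Vamb code; the latent size of 32 is an assumed default, and the figure above drops that term since it is small compared to n_samples):

```python
# Rough estimate of the float32 arrays held during encoding (sketch only).
# n_latent = 32 is an assumed latent dimension.

def estimate_encoding_bytes(n_contigs: float, n_samples: int, n_latent: int = 32) -> float:
    """Bytes for the abundance, composition and latent float32 arrays."""
    bytes_per_float = 4
    columns = n_samples + 103 + n_latent  # abundance + composition + latent
    return bytes_per_float * columns * n_contigs

print(estimate_encoding_bytes(1.5e7, 1600) / 1e9)  # ~104 GB (~102 GB without the latent term)
print(estimate_encoding_bytes(2e6, 1000) / 1e9)    # ~9 GB, the test dataset proposed below
```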
That suggests that somehow, the underlying data is duplicated.
To solve this:
- Get a big dataset of ~2e6 contigs and 1000 samples, which should take around 9 GB of memory.
- Run it through Vamb with a reduced number of epochs and check whether memory usage exceeds the estimate by roughly 2x (see the measurement sketch below).
- Then drill down into why.
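A minimal sketch for the second step, assuming psutil is available: it polls the Vamb process tree and reports peak resident memory summed over the parent and any child processes, which matters here since the large run showed up as two processes. The command line at the bottom is a placeholder, not a verified invocation.

```python
# Sketch: measure peak RSS of a command and all its child processes.
# Assumes the third-party psutil package is installed.
import subprocess
import time

import psutil


def run_and_measure_peak_rss(cmd: list[str], poll_seconds: float = 1.0) -> int:
    """Run cmd and return the peak total RSS (in bytes) of its process tree."""
    proc = subprocess.Popen(cmd)
    ps_proc = psutil.Process(proc.pid)
    peak = 0
    while proc.poll() is None:
        try:
            tree = [ps_proc] + ps_proc.children(recursive=True)
            rss = sum(p.memory_info().rss for p in tree)
            peak = max(peak, rss)
        except psutil.NoSuchProcess:
            break
        time.sleep(poll_seconds)
    return peak


# Hypothetical invocation - adjust the flags and paths to the actual test dataset.
# peak = run_and_measure_peak_rss(["vamb", "--outdir", "out", "--fasta", "contigs.fna"])
# print(peak / 1e9, "GB")
```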