Loading large HDFDatasets inside MetaDataset is slow #1669
Comments
We even use a custom variant of the HDFDataset that avoids …
This doesn't really work for …
I thought you still have the seq indices as the order?
How do you do that?
But this is not in master? Why did you not make a PR for this?
What we use for large MT trainings is: HDF files which are already shuffled themselves, put into HDFDatasets with the default sequence order, combined via CombinedDataset with … set to True. For MetaDataset it is trickier in general, because you have to assume the sub-datasets contain different / differently ordered sequences. But if you need such a sequence mapping at runtime, there is not much you can do. Maybe ask the sub-datasets to translate tag to index before storing it all in memory, but that might again be slow.
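As a rough illustration of that setup (a sketch only, not the actual config: file names and data keys are placeholders, and the data_map layout as well as the truncated boolean option above are assumptions):

```python
# Sketch: pre-shuffled HDF files, each read with HDFDataset in default order,
# concatenated via CombinedDataset. All names here are placeholders.
train = {
    "class": "CombinedDataset",
    "datasets": {
        "part1": {"class": "HDFDataset", "files": ["shuffled_part1.hdf"]},
        "part2": {"class": "HDFDataset", "files": ["shuffled_part2.hdf"]},
    },
    # Assumed layout: (sub-dataset name, sub-dataset data key) -> combined data key.
    "data_map": {
        ("part1", "data"): "data",
        ("part1", "classes"): "classes",
        ("part2", "data"): "data",
        ("part2", "classes"): "classes",
    },
    "seq_ordering": "default",
}
```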
I ended up splitting the dataset into smaller chunks and loading them with DistributeFilesDataset.
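A minimal sketch of that workaround, assuming DistributeFilesDataset takes a files list, a get_sub_epoch_dataset callback and a partition_epoch; the chunk paths and data keys are placeholders:

```python
from typing import Any, Dict, List


def get_sub_epoch_dataset(files: List[str]) -> Dict[str, Any]:
    """Build the dataset covering only the HDF chunks of the current sub-epoch."""
    return {"class": "HDFDataset", "files": files}


train = {
    "class": "DistributeFilesDataset",
    # Placeholder paths: the one big HDF file split into many small chunks.
    "files": [f"chunks/part-{i:03d}.hdf" for i in range(100)],
    "get_sub_epoch_dataset": get_sub_epoch_dataset,
    # Only the chunks of the current sub-epoch need to be loaded at a time.
    "partition_epoch": 20,
}
```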
This is just a workaround, not really a solution to the problem itself, i.e. the issue still persists.
I am loading two HDFDatasets (each ~8 GB, 40310479 seqs) inside a MetaDataset. This process takes about 30 minutes and uses up ~36 GB of RAM, which I find excessive.
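For reference, the kind of setup described above looks roughly like this (a sketch; dataset names, file names and data keys are placeholders, not taken from the actual config):

```python
# Two large HDFDatasets wrapped in a MetaDataset; MetaDataset matches the
# sub-datasets by sequence tag, which is where the tag-to-index lookup
# mentioned below becomes relevant. All names are placeholders.
train = {
    "class": "MetaDataset",
    "datasets": {
        "source": {"class": "HDFDataset", "files": ["source.hdf"]},  # ~8 GB, ~40M seqs
        "target": {"class": "HDFDataset", "files": ["target.hdf"]},  # ~8 GB, ~40M seqs
    },
    # MetaDataset maps its own data keys to (sub-dataset name, sub-dataset data key).
    "data_map": {
        "data": ("source", "data"),
        "classes": ("target", "data"),
    },
    # The sub-dataset whose sequence order/tags drive the others.
    "seq_order_control_dataset": "source",
}
```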
Profiling the program with py-spy suggests that _update_tag_idx in returnn/datasets/cached.py:148 is the bottleneck here. (cc @patrick-wilken @NeoLegends @JackTemaki)