Loading large HDFDatasets inside MetaDataset is slow #1669
Comments
We even use a custom variant of the HDFDataset that avoids …
This doesn't really work for …
I thought you still have the seq indices as the order?
How do you do that?
But this is not in master? Why did you not make a PR for this?
What we use for large MT trainings is: HDF files which are already shuffled themselves, put into HDFDatasets with the default sequence order, combined via CombinedDataset with … set to True. For MetaDataset it is trickier in general, because you have to assume the sub-datasets contain different / differently ordered sequences. But if you need such a sequence mapping at runtime, there is not much you can do. Maybe ask the sub-datasets to translate tag to index before storing it all in memory, but that might again be slow.
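As a rough illustration of that setup (a sketch only, not the actual config: file names and data keys are placeholders, and the data_map layout as well as the truncated boolean option above are assumptions):

```python
# Sketch: pre-shuffled HDF files, each read with HDFDataset in default order,
# concatenated via CombinedDataset. All names here are placeholders.
train = {
    "class": "CombinedDataset",
    "datasets": {
        "part1": {"class": "HDFDataset", "files": ["shuffled_part1.hdf"]},
        "part2": {"class": "HDFDataset", "files": ["shuffled_part2.hdf"]},
    },
    # Assumed layout: (sub-dataset name, sub-dataset data key) -> combined data key.
    "data_map": {
        ("part1", "data"): "data",
        ("part1", "classes"): "classes",
        ("part2", "data"): "data",
        ("part2", "classes"): "classes",
    },
    "seq_ordering": "default",
}
```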
I ended up splitting the dataset into smaller chunks and loading them with DistributeFilesDataset.
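A minimal sketch of that workaround, assuming DistributeFilesDataset takes a files list, a get_sub_epoch_dataset callback and a partition_epoch; the chunk paths and data keys are placeholders:

```python
from typing import Any, Dict, List


def get_sub_epoch_dataset(files: List[str]) -> Dict[str, Any]:
    """Build the dataset covering only the HDF chunks of the current sub-epoch."""
    return {"class": "HDFDataset", "files": files}


train = {
    "class": "DistributeFilesDataset",
    # Placeholder paths: the one big HDF file split into many small chunks.
    "files": [f"chunks/part-{i:03d}.hdf" for i in range(100)],
    "get_sub_epoch_dataset": get_sub_epoch_dataset,
    # Only the chunks of the current sub-epoch need to be loaded at a time.
    "partition_epoch": 20,
}
```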
This is just a workaround, not really a solution to the problem itself, i.e. the issue still persists.
I am loading two HDFDatasets (each ~8 GB, 40310479 seqs) inside a MetaDataset. This process takes about 30 minutes and uses up ~36 GB of RAM, which I find excessive.
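For reference, the kind of setup described above looks roughly like this (a sketch; dataset names, file names and data keys are placeholders, not taken from the actual config):

```python
# Two large HDFDatasets wrapped in a MetaDataset; MetaDataset matches the
# sub-datasets by sequence tag, which is where the tag-to-index lookup
# mentioned below becomes relevant. All names are placeholders.
train = {
    "class": "MetaDataset",
    "datasets": {
        "source": {"class": "HDFDataset", "files": ["source.hdf"]},  # ~8 GB, ~40M seqs
        "target": {"class": "HDFDataset", "files": ["target.hdf"]},  # ~8 GB, ~40M seqs
    },
    # MetaDataset maps its own data keys to (sub-dataset name, sub-dataset data key).
    "data_map": {
        "data": ("source", "data"),
        "classes": ("target", "data"),
    },
    # The sub-dataset whose sequence order/tags drive the others.
    "seq_order_control_dataset": "source",
}
```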
Profiling the program with py-spy suggests that _update_tag_idx in returnn/datasets/cached.py:148 is the bottleneck here. (cc @patrick-wilken @NeoLegends @JackTemaki)