Request for PackedDataset support for mmap #1051
Comments
@parthsarthi03 Yes, this is a known limitation of the current version of sample packing. We are planning to migrate to iterable datasets entirely for better support of large datasets that require streaming. This will significantly reduce the memory cost of sample packing, since you no longer have to hold the entire packed dataset in memory; instead you pack sequences on the fly, or pack up to a buffer size. This is a large refactor and is still in the works :)

One possible approach with our current in-memory datasets that can reduce CPU RAM is to partition the packing itself. Currently, on a distributed setup, every rank does the same packing, which is highly inefficient. Ideally, each rank packs its own shard of the dataset, so you don't keep multiple copies of the same dataset. The challenge is making sure that shuffling and distributed sampling are still random and pull data from the correct shards. If this is blocking you, we can consider adding this approach in the interim until iterable datasets land. cc @ebsmothers, @rohan-varma for any other thoughts on this.

Out of curiosity, what is the underlying dataset you are using? Are you using this via Instruct/ChatDataset?
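To make the two ideas in the comment above concrete, here is a minimal sketch of packing on the fly combined with per-rank sharding. It is not torchtune's implementation: `shard_for_rank`, `pack_on_the_fly`, and the strided sharding scheme are hypothetical names and choices made for illustration, and the sketch ignores the position IDs, attention masks, and shuffling that a real packed dataset has to handle.

```python
from typing import Iterable, Iterator, List


def shard_for_rank(
    samples: Iterable[List[int]], rank: int, world_size: int
) -> Iterator[List[int]]:
    """Yield every world_size-th sample starting at index `rank` (strided sharding).

    Hypothetical helper: each rank sees a disjoint slice of the stream,
    so no rank packs the full dataset.
    """
    for i, sample in enumerate(samples):
        if i % world_size == rank:
            yield sample


def pack_on_the_fly(
    samples: Iterable[List[int]], max_seq_len: int
) -> Iterator[List[int]]:
    """Greedily concatenate token sequences into packs of at most max_seq_len tokens.

    Only the pack currently being filled is held in memory.
    """
    current: List[int] = []
    for tokens in samples:
        tokens = tokens[:max_seq_len]  # truncate samples longer than one pack
        if current and len(current) + len(tokens) > max_seq_len:
            yield current  # the next sample would overflow, so emit this pack
            current = []
        current = current + tokens
    if current:
        yield current  # emit the final, possibly short, pack


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packs = list(pack_on_the_fly(samples, max_seq_len=5))
rank0_packs = list(pack_on_the_fly(shard_for_rank(samples, 0, 2), max_seq_len=5))
```

Because both functions are generators, memory use is bounded by one pack rather than by the dataset size. Strided sharding keeps the ranks disjoint but does nothing about the shuffling problem raised above.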
Thank you! Looking forward to the refactor. Having a distributed packing setup in the interim would be great too, if possible. I'm using the
@RdoubleA any update on this?
@parthsarthi03 No update, unfortunately, since we haven't had bandwidth for this. But we're having early discussions around improving the data experience and performance optimizations, and this issue is on our mind. Will update you once we have more concrete plans!
Thank you for starting discussions around this. Looking forward to trying out streaming datasets with packing enabled, while getting performance similar to in-memory datasets.
Hi, thank you for the great work!
The current implementation of the PackedDataset class only supports in-memory map-style datasets. When working with large datasets, this in-memory limitation can cause issues with CPU RAM usage. I often run out of CPU RAM when using PackedDataset, which leads to errors.

I wanted to know whether there are any plans to support an mmap version of PackedDataset, or whether there are ways to work around high CPU RAM utilisation while creating the packed dataset.
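For illustration, one way an mmap-backed variant could look is sketched below, using only the Python standard library. This is an assumption about the general technique, not torchtune's API: `write_packs`, `MmapPackedRows`, the fixed-width padded row layout, and the `pad_id` default are all hypothetical. Packs are streamed to a flat binary file once, then read back through a read-only memory map, so the OS pages rows in on demand instead of the process holding every pack in RAM.

```python
import mmap
from array import array
from typing import Iterable, List


def write_packs(
    path: str, packs: Iterable[List[int]], max_seq_len: int, pad_id: int = 0
) -> int:
    """Write each pack to `path` as one fixed-width row of C ints; return the row count.

    Rows shorter than max_seq_len are right-padded with pad_id (an assumed convention).
    """
    n = 0
    with open(path, "wb") as f:
        for pack in packs:
            row = array("i", pack + [pad_id] * (max_seq_len - len(pack)))
            row.tofile(f)
            n += 1
    return n


class MmapPackedRows:
    """Map-style, read-only view over a non-empty file written by write_packs."""

    def __init__(self, path: str, max_seq_len: int) -> None:
        self._file = open(path, "rb")
        # Length 0 maps the whole file; pages are loaded lazily by the OS.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._row_bytes = max_seq_len * array("i").itemsize

    def __len__(self) -> int:
        return len(self._mm) // self._row_bytes

    def __getitem__(self, idx: int) -> List[int]:
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        start = idx * self._row_bytes
        row = array("i")
        row.frombytes(self._mm[start : start + self._row_bytes])  # copies one row only
        return row.tolist()

    def close(self) -> None:
        self._mm.close()
        self._file.close()
```

Fixed-width rows keep random access to a simple offset computation, at the cost of storing padding on disk. A variable-length layout with a separate offsets index would avoid the padding but needs more bookkeeping.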