Request for PackedDataset support for mmap #1051

Open
parthsarthi03 opened this issue Jun 4, 2024 · 6 comments

Comments
@parthsarthi03
Contributor

Hi, thank you for the great work!

The current implementation of the PackedDataset class only supports in-memory map-style datasets. When working with large datasets, this in-memory limitation drives up CPU RAM usage. I often run out of CPU RAM when using PackedDataset, leading to errors like:

Packing dataset:  10%|███▊                                | 364975/3494093 [15:39<2:38:09, 329.73it/s]
W0604 09:22:40.995000 140716581643456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3066680 closing signal SIGTERM
W0604 09:22:40.995000 140716581643456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3066681 closing signal SIGTERM
W0604 09:22:40.996000 140716581643456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3066682 closing signal SIGTERM
E0604 09:23:02.230000 140716581643456 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 0 (pid: 3066679) of binary:
in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/ps/torchtune/recipes/full_finetune_distributed.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-04_09:22:40
  host      : spk-4h100-hgx-14
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 3066679)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 3066679
============================================================

I wanted to know if there are any plans to support an mmap version of PackedDataset, or if there are ways to get around the high CPU RAM utilization while creating the packed dataset?
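
In case it helps frame the request: roughly what I have in mind is writing the packed token ids to disk once and memory-mapping them afterwards, so only the pages actually read stay resident in RAM. A minimal sketch (the class name and file layout are made up for illustration, not torchtune API):

    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class MemmapPackedDataset(Dataset):
        """Serves fixed-length packs of token ids from a pre-built .npy file via mmap."""

        def __init__(self, path, max_seq_len):
            # mmap_mode="r" keeps the array on disk; pages are faulted in lazily as they are read.
            self._tokens = np.load(path, mmap_mode="r")  # expected shape: (num_packs, max_seq_len)
            self.max_seq_len = max_seq_len

        def __len__(self):
            return self._tokens.shape[0]

        def __getitem__(self, idx):
            # Copy a single pack out of the memmap before handing it to torch.
            tokens = torch.from_numpy(np.array(self._tokens[idx], dtype=np.int64))
            return {"tokens": tokens, "labels": tokens.clone()}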

@RdoubleA
Contributor

RdoubleA commented Jun 4, 2024

@parthsarthi03 Yes, this is a known limitation of the current version of sample packing. We are planning to migrate entirely to iterable datasets for better support of large datasets that require streaming. This will significantly reduce the memory cost of sample packing, since you no longer have to hold the entire packed dataset in memory; instead, you pack sequences on the fly or pack up to a buffer size. This is a large refactor and is still in the works :)
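
To illustrate the idea, on-the-fly packing could look roughly like the sketch below (only an illustration, not the planned API):

    import torch
    from torch.utils.data import IterableDataset

    class OnTheFlyPacker(IterableDataset):
        """Greedily concatenates tokenized samples into fixed-length packs while iterating,
        so only the current partial pack is held in memory rather than the whole packed dataset."""

        def __init__(self, samples, max_seq_len):
            self.samples = samples          # any iterable of token-id lists
            self.max_seq_len = max_seq_len

        def __iter__(self):
            buffer = []
            for tokens in self.samples:
                buffer.extend(tokens)
                # Emit full packs; carry the remainder over into the next pack.
                while len(buffer) >= self.max_seq_len:
                    pack, buffer = buffer[:self.max_seq_len], buffer[self.max_seq_len:]
                    yield torch.tensor(pack, dtype=torch.long)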

One possible approach with our current in-memory datasets that can reduce CPU RAM is to partition the packing itself: currently, in a distributed setup, every rank does the same packing, which is highly inefficient. Ideally, each rank packs its own shard of the dataset, so you don't keep multiple copies of the same dataset in memory. The challenge is making sure that shuffling and distributed sampling remain random and pull data from the correct shards. If this is something that is blocking you, we can consider adding this approach in the interim until iterable datasets land. cc @ebsmothers, @rohan-varma for any other thoughts on this.
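
To make that concrete, per-rank sharding before packing could look roughly like this (a hypothetical helper, not existing torchtune code; the distributed sampler would still need adjusting so ranks don't re-partition already-sharded data):

    import torch.distributed as dist

    def shard_indices_for_rank(num_samples):
        # Strided shard of sample indices for the current rank, so each rank only
        # packs its own slice of the dataset instead of packing everything.
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        return list(range(rank, num_samples, world_size))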

Out of curiosity, what is the underlying dataset you are using? Are you using this via Instruct/ChatDataset?

@parthsarthi03
Contributor Author

Thank you! Looking forward to the refactor; having a distributed packing setup in the interim would be great too, if possible. I'm using the TextCompletionDataset, which I wrapped with PackedDataset similar to how it is done in the InstructDataset.
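
Roughly like this (a sketch of my setup; argument names may differ slightly across torchtune versions):

    from torchtune.datasets import PackedDataset, TextCompletionDataset

    def build_packed_text_completion(tokenizer, source, max_seq_len):
        # Tokenize the raw text first, then pack it into fixed-length sequences,
        # mirroring how packing is applied on top of InstructDataset.
        ds = TextCompletionDataset(tokenizer=tokenizer, source=source, max_seq_len=max_seq_len)
        return PackedDataset(ds, max_seq_len=max_seq_len)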

@parthsarthi03
Contributor Author

@RdoubleA any update on this?

@RdoubleA
Contributor

RdoubleA commented Oct 29, 2024

@parthsarthi03 No update unfortunately since we haven't had bandwidth for this. But we're having early discussions around improving the data experience and performance optimizations and this issue is on our mind. Will update you once we have more concrete plans!

@Tandon-A

@RdoubleA,

Thank you for starting discussions around this. Looking forward to trying out a streaming dataset with packing enabled while getting performance similar to in-memory datasets.

@RdoubleA
Contributor

RdoubleA commented Nov 11, 2024

@Tandon-A See the draft PR for the torchdata integration, which should bring iterable datasets with multithreading: #1929
