Request for PackedDataset support for mmap #1051

Open
parthsarthi03 opened this issue Jun 4, 2024 · 6 comments

Comments
@parthsarthi03
Contributor

Hi, thank you for the great work!

The current implementation of the PackedDataset class only supports in-memory map-style datasets. When working with large datasets, this in-memory limitation drives up CPU RAM usage. I often run out of CPU RAM when using PackedDataset, leading to errors like:

Packing dataset:  10%|███▊                                | 364975/3494093 [15:39<2:38:09, 329.73it/s]
W0604 09:22:40.995000 140716581643456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3066680 closing signal SIGTERM
W0604 09:22:40.995000 140716581643456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3066681 closing signal SIGTERM
W0604 09:22:40.996000 140716581643456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3066682 closing signal SIGTERM
E0604 09:23:02.230000 140716581643456 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 0 (pid: 3066679) of binary:
in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/ps/torchtune/recipes/full_finetune_distributed.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-04_09:22:40
  host      : spk-4h100-hgx-14
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 3066679)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 3066679
============================================================

I wanted to know if there are any plans to support an mmap version of PackedDataset, or if there are ways to get around the high CPU RAM utilization while creating the packed dataset?
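
In case it helps frame the request: roughly what I have in mind is writing the packed token ids to disk once and memory-mapping them afterwards, so only the pages actually read stay resident in RAM. A minimal sketch (the class name and file layout are made up for illustration, not torchtune API):

    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class MemmapPackedDataset(Dataset):
        """Serves fixed-length packs of token ids from a pre-built .npy file via mmap."""

        def __init__(self, path, max_seq_len):
            # mmap_mode="r" keeps the array on disk; pages are faulted in lazily as they are read.
            self._tokens = np.load(path, mmap_mode="r")  # expected shape: (num_packs, max_seq_len)
            self.max_seq_len = max_seq_len

        def __len__(self):
            return self._tokens.shape[0]

        def __getitem__(self, idx):
            # Copy a single pack out of the memmap before handing it to torch.
            tokens = torch.from_numpy(np.array(self._tokens[idx], dtype=np.int64))
            return {"tokens": tokens, "labels": tokens.clone()}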

@RdoubleA
Contributor

RdoubleA commented Jun 4, 2024

@parthsarthi03 Yes, this is a known limitation of the current version of sample packing. We are planning to migrate entirely to iterable datasets for better support of large datasets that require streaming. This will significantly reduce the memory cost of sample packing, since you no longer have to hold the entire packed dataset in memory; instead, you pack sequences on the fly or pack up to a buffer size. This is a large refactor and is still in the works :)
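
To illustrate the idea, on-the-fly packing could look roughly like the sketch below (only an illustration, not the planned API):

    import torch
    from torch.utils.data import IterableDataset

    class OnTheFlyPacker(IterableDataset):
        """Greedily concatenates tokenized samples into fixed-length packs while iterating,
        so only the current partial pack is held in memory rather than the whole packed dataset."""

        def __init__(self, samples, max_seq_len):
            self.samples = samples          # any iterable of token-id lists
            self.max_seq_len = max_seq_len

        def __iter__(self):
            buffer = []
            for tokens in self.samples:
                buffer.extend(tokens)
                # Emit full packs; carry the remainder over into the next pack.
                while len(buffer) >= self.max_seq_len:
                    pack, buffer = buffer[:self.max_seq_len], buffer[self.max_seq_len:]
                    yield torch.tensor(pack, dtype=torch.long)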

One possible approach with our current in-memory datasets that can reduce CPU RAM is to partition the packing itself: currently, in a distributed setup, every rank does the same packing, which is highly inefficient. Ideally, each rank packs its own shard of the dataset, so you don't keep multiple copies of the same dataset in memory. The challenge is making sure that shuffling and distributed sampling remain random and pull data from the correct shards. If this is something that is blocking you, we can consider adding this approach in the interim until iterable datasets land. cc @ebsmothers, @rohan-varma for any other thoughts on this.
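
To make that concrete, per-rank sharding before packing could look roughly like this (a hypothetical helper, not existing torchtune code; the distributed sampler would still need adjusting so ranks don't re-partition already-sharded data):

    import torch.distributed as dist

    def shard_indices_for_rank(num_samples):
        # Strided shard of sample indices for the current rank, so each rank only
        # packs its own slice of the dataset instead of packing everything.
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        return list(range(rank, num_samples, world_size))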

Out of curiosity, what is the underlying dataset you are using? Are you using this via Instruct/ChatDataset?

@parthsarthi03
Contributor Author

Thank you! Looking forward to the refactor; having a distributed packing setup in the interim would be great too, if possible. I'm using the TextCompletionDataset, which I wrapped with PackedDataset similar to how it is done in the InstructDataset.
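
Roughly like this (a sketch of my setup; argument names may differ slightly across torchtune versions):

    from torchtune.datasets import PackedDataset, TextCompletionDataset

    def build_packed_text_completion(tokenizer, source, max_seq_len):
        # Tokenize the raw text first, then pack it into fixed-length sequences,
        # mirroring how packing is applied on top of InstructDataset.
        ds = TextCompletionDataset(tokenizer=tokenizer, source=source, max_seq_len=max_seq_len)
        return PackedDataset(ds, max_seq_len=max_seq_len)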

@parthsarthi03
Contributor Author

@RdoubleA any update on this?

@RdoubleA
Contributor

RdoubleA commented Oct 29, 2024

@parthsarthi03 No update unfortunately since we haven't had bandwidth for this. But we're having early discussions around improving the data experience and performance optimizations and this issue is on our mind. Will update you once we have more concrete plans!

@Tandon-A

@RdoubleA,

Thank you for starting discussions around this. Looking forward to trying out a streaming dataset with packing enabled while getting performance similar to in-memory datasets.

@RdoubleA
Contributor

RdoubleA commented Nov 11, 2024

@Tandon-A See the draft PR for the torchdata integration, which should bring iterable datasets with multithreading: #1929
