Add multiprocess dataset packing #2180

Open
bratao opened this issue Dec 19, 2024 · 21 comments
Assignees: joecummings
Labels: enhancement (New feature or request), triaged (This issue has been assigned an owner and appropriate label)

Comments

@bratao

bratao commented Dec 19, 2024

Hello,

I have a custom JSONL dataset with 4 million examples. In Axolotl, I can load it with packing enabled on a 64-core machine in 10 minutes, since it uses all cores. It also caches the packed dataset, which is super handy.

The same dataset in torchtune takes 6 hours if packing is enabled. Apparently, it uses only one core.

Is there any way to accelerate this?

joecummings added the triaged label Dec 19, 2024
joecummings self-assigned this Dec 19, 2024
@joecummings
Contributor

This is a great suggestion! Let me look into how we could add this to the library.

Caching is something that we may need to think a little more about so we don't accidentally take up too much of our users' memory, but we should definitely be able to utilize more cores to pack.

@joecummings
Contributor

It looks like Axolotl makes use of the map and filter functions on the Hugging Face Dataset abstraction, which is pretty neat. That way they can just set a default num_processes and pick up a previously cached dataset. I'll look into whether this is feasible for torchtune to do or whether we'd have to roll our own solution based on that.
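
(For reference, a minimal sketch of that Hugging Face datasets pattern; the file name and map function are placeholders, not torchtune code:)

from datasets import load_dataset

ds = load_dataset("json", data_files="train.jsonl", split="train")

def add_num_turns(example):
    # stand-in for real tokenization / chat templating
    example["num_turns"] = len(example["conversations"])
    return example

# num_proc fans the work out across CPU cores, and datasets caches the result
# on disk so an identical re-run reuses the cached files instead of recomputing
ds = ds.map(add_num_turns, num_proc=16)
ds = ds.filter(lambda ex: ex["num_turns"] > 1, num_proc=16)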

joecummings added the enhancement label Dec 19, 2024
@ebsmothers
Contributor

cc @andrewkho

@andrewkho
Contributor

@bratao what's the command you're using to run this with tune? It'd be interesting to see how we can improve on the baseline using torchdata.

@bratao
Author

bratao commented Dec 19, 2024

@andrewkho
This is my config (I tried the torchdata dataloader, but it makes no difference for me compared to the main branch; it still uses only 1 CPU core):

dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  data_files: llama_finetune_train_modelo_unico_v1.jsonl
  split: train
  conversation_column: conversations
  conversation_style: sharegpt
  packed: True  # True increases speed
seed: null
shuffle: True
use_torchdata: true
dataloader:
  parallel_method: thread
  num_workers: 8
  packed: True

The run command is tune run full_finetune_single_device --config 32B_full_single_device_mi300.yaml. This config is based on the qwen2.5-7b one, adapted for the 32B model.

@andrewkho
Contributor

andrewkho commented Dec 19, 2024

Thanks @bratao! What version of Python are you using? Are you able to share the jsonl file or some mangled version of it? Have you tried parallel_method: "process"? Threads may be GIL-bound, so processes are probably better, but it's probably easiest if I try to run this myself.

edit: are you running on a branch? I'd like to see how torchdata is set up for this recipe
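
(A generic illustration of the thread-vs-process point for CPU-bound work like tokenization; this is not torchtune or torchdata code, just a quick way to see the GIL effect:)

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def fake_tokenize(text):
    # CPU-bound stand-in for tokenization; threads serialize on the GIL here
    return sum(ord(c) for c in text * 20)

if __name__ == "__main__":
    data = ["lorem ipsum dolor sit amet " * 100] * 1000
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.perf_counter()
        with pool_cls(max_workers=8) as pool:
            list(pool.map(fake_tokenize, data))
        # the thread pool gives little to no speedup, the process pool does
        print(pool_cls.__name__, f"{time.perf_counter() - start:.2f}s")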

@bratao
Author

bratao commented Dec 19, 2024

@andrewkho thank you so much. I'm using Python 3.11 on Ubuntu 22.04.
Unfortunately I cannot share the jsonl file, as it has sensitive information. But it looks like this, with 4 million lines:


{"conversations": [{"from": "system", "value": "Você é o EscavAI, o assistente virtual do Escavador. Seu objetivo é tirar dúvidas jurídicas dos usuários do Escavador da melhor forma possível"}, {"from": "human", "value": "Oi"}, {"from": "gpt", "value": "Oi! Eu sou o EscavAI! Em que posso te ajudar hoje?"}]}
{"conversations": [{"from": "system", "value": "Você é o EscavAI, o assistente virtual do Escavador. Seu objetivo é tirar dúvidas jurídicas dos usuários do Escavador da melhor forma possível"}, {"from": "human", "value": "Oi"}, {"from": "gpt", "value": "Oi! Prazer, sou o EscavAI! Espero que esteja bem. Como posso te ajudar?"}]}
{"conversations": [{"from": "system", "value": "Você é o EscavAI, o assistente virtual do Escavador. Seu objetivo é tirar dúvidas jurídicas dos usuários do Escavador da melhor forma possível"}, {"from": "human", "value": "Oi"}, {"from": "gpt", "value": "Que bom te ver por aqui! Me chamo EscavAI! Posso te ajudar com algo?"}]}
{"conversations": [{"from": "system", "value": "Você é o EscavAI, o assistente virtual do Escavador. Seu objetivo é tirar dúvidas jurídicas dos usuários do Escavador da melhor forma possível"}, {"from": "human", "value": "Oi"}, {"from": "gpt", "value": "Olá, tudo bem? Eu sou o EscavAI! Como posso te ajudar?"}]}
{"conversations": [{"from": "system", "value": "Você é o EscavAI, o assistente virtual do Escavador. Seu objetivo é tirar dúvidas jurídicas dos usuários do Escavador da melhor forma possível"}, {"from": "human", "value": "Bom dia"}, {"from": "gpt", "value": "Bom dia! Em que posso te ajudar?"}]}

I will try with parallel_method: "process"

@andrewkho
Contributor

andrewkho commented Dec 19, 2024

@bratao totally understand, thank you, this is enough to go on, will report back soon :) obrigado

@alanrios2001

> Caching is something that we may need to think a little more about so we don't accidentally take up too much of our users' memory, but we should definitely be able to utilize more cores to pack.

Caching could be optional.

@andrewkho
Contributor

andrewkho commented Dec 21, 2024

A small update: I copy/pasted the sample lines until I got around 1.6M lines of JSONL, which takes around 12 minutes (estimated) to load on my machine with the current implementation.

  • With a straightforward torchdata.nodes implementation on a duplicated version of this dataset, I see around a 6x speedup, from 12 minutes to a little under 2 minutes with 16 workers and multiprocessing (a rough sketch of this shape follows the list). I also found a feature we should add to our parallel mapper for automatic pre-batching.
  • If I jump through more hoops, I see around a 20-30x speedup with 32 workers, bringing the time down to around 20-25s if you ignore warmup time, which is much better. This will require us to implement a new mode of parallelism in torchdata, which we probably need to do anyway.
  • The hoops/pain above all come from passing data between processes. With free-threaded Python (3.13t) I imagine a lot of this pain will go away, but datasets cannot run on free-threaded Python yet because aiohttp doesn't work there (cffi might be the offending package).
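
(As a rough illustration of the "straightforward" shape above, with tokenization fanned out to worker processes while packing stays in the main process; the file path, tokenize stand-in, and the exact torchdata.nodes names IterableWrapper, ParallelMapper, and Loader are assumptions based on the nightly API discussed here:)

import json

from torchdata.nodes import IterableWrapper, Loader, ParallelMapper

def read_jsonl(path):
    # stream one record per line so the whole file never sits in memory
    with open(path) as f:
        for line in f:
            yield json.loads(line)

def tokenize(sample):
    # placeholder for the real tokenizer + chat template
    return [len(turn["value"]) for turn in sample["conversations"]]

if __name__ == "__main__":
    source = IterableWrapper(read_jsonl("train.jsonl"))
    node = ParallelMapper(source, map_fn=tokenize, num_workers=16, method="process")
    packs = []
    for tokens in Loader(node):
        packs.append(tokens)  # packing itself still runs in a single process here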

But one fundamental question: is the point here to race through packing into a map-style dataset as fast as possible, or is the real goal to set up a streaming packer to be used during training? cc @ebsmothers @joecummings @bratao

@bratao
Author

bratao commented Dec 21, 2024

Thank you so much @andrewkho for analyzing this problem!

Would you mind sharing the configuration and command line you used? In my case, it's taking hours rather than minutes, so achieving 12 minutes would be a significant improvement for me.

Regarding the other options, while I'm not familiar with the project's specific goals, the straightforward torchdata.nodes approach sounds like the better option. Waiting for free-threaded Python isn't practical right now since it's still experimental. I don't think we need to sacrifice features just for maximum speed, especially since the training phase should dominate the overall running time. Being able to load the data in minutes would be perfectly sufficient for my needs!

@andrewkho
Contributor

@bratao got it, so the end goal is training. In that case I suggest we go with a streaming model of packing; then you won't wait for the job to start at all, and packing will happen on the fly. This should also be more (CPU) memory-efficient.
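
(A minimal sketch of what on-the-fly packing could look like, assuming a greedy fill up to a fixed token budget; this is an illustration, not torchtune's actual PackedDataset logic:)

def stream_packs(tokenized_samples, max_seq_len=4096):
    # greedily fill a buffer and yield each pack as soon as the next sample would overflow it
    buffer = []
    for tokens in tokenized_samples:
        if buffer and len(buffer) + len(tokens) > max_seq_len:
            yield buffer
            buffer = []
        # a real packer would also split or truncate samples longer than max_seq_len
        buffer.extend(tokens)
    if buffer:
        yield buffer  # flush the final partial pack

Because packs are produced lazily, training can start immediately and only about one pack's worth of tokens is held in memory at a time.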

@joecummings
Contributor

@andrewkho But how long would it take to get streaming packing in vs. the straightforward torchdata.nodes approach for map-style datasets?

Map-style will always need to be supported, so it wouldn't be wasted effort IMO to get this initial approach in as long as it's not going to take a month to implement.

@andrewkho
Contributor

@joecummings agree that both are still necessary. It's going to be similar foundational work: land some version of a streaming packer (could be in torchdata.nodes), then set up some nodes to do multi-process streaming to fill _packs. From there it should be somewhat straightforward to enable packed streaming with nodes, although recipe integration will take more time.

However, this basic version of map-style isn't going to maximally utilize the cores. Improving that (my second bullet point) will take more work on the torchdata side; we need a new stream parallelizer (which we need anyway), but it's going to need some design work on our end to get it right.

One thing we could do is land the packer and integrate it with the existing packed dataset to improve startup. Then we can enable streaming packing in torchtune while we work on the stream parallelizer for a further time-to-first-batch (TTFB) win later on. Wdyt?

@joecummings
Contributor

> @joecummings agree that both are still necessary. It's going to be similar foundational work: land some version of a streaming packer (could be in torchdata.nodes), then set up some nodes to do multi-process streaming to fill _packs. From there it should be somewhat straightforward to enable packed streaming with nodes, although recipe integration will take more time.
>
> However, this basic version of map-style isn't going to maximally utilize the cores. Improving that (my second bullet point) will take more work on the torchdata side; we need a new stream parallelizer (which we need anyway), but it's going to need some design work on our end to get it right.
>
> One thing we could do is land the packer and integrate it with the existing packed dataset to improve startup. Then we can enable streaming packing in torchtune while we work on the stream parallelizer for a further time-to-first-batch (TTFB) win later on. Wdyt?

Sounds brilliant. 10/10. Very excited. Lmk how we can help facilitate.

@andrewkho
Contributor

@bratao just to set expectations, I'll be out for Christmas and New Year's, and we'll get going on this in January, hope that's alright!

@bratao
Author

bratao commented Dec 22, 2024

@andrewkho of course. I will be using Axolotl for this run but am anxiously waiting to use the new implementation. Have a good Christmas and a prosperous new year; take the opportunity to rest and enjoy time with your family!!!

BTW, just to let you know, I tried

use_torchdata: true
dataloader:
  parallel_method: process
  num_workers: 8
  shuffle: True
  packed: True

But it still used only one core, with the following ETA:
Packing dataset: 7%|████████▏ | 247394/3449510 [18:04<4:02:07, 220.42it/s]

@andrewkho
Contributor

@bratao after some digging, that's because torchdata hasn't been integrated into this recipe yet, so those settings have no effect.

@andrewkho
Contributor

andrewkho commented Jan 7, 2025

@bratao happy new year! I put a small demo together of the most straightforward solution: https://github.com/pytorch/torchtune/compare/main...andrewkho:torchtune:andrewkh/parallel-packer?expand=1

You'll need torchdata's nightly build to test this:
pip install --index-url https://download.pytorch.org/whl/nightly/cpu torchdata

I ran this with tune run full_finetune_single_device --config recipes/configs/qwen2_5/32B_full_single_device_mi300.yaml dataset.packer_num_workers=12 and got around 14k QPS (lines of input processed) vs. 2k QPS with packer_num_workers=0.

At this point the Packer itself is the bottleneck, and the next step is to run that in parallel, which will need a bit more work on our end since it's stateful. But as discussed above, for a streaming solution this is probably sufficient for most use cases. Could you give this a shot on your dataset and see how much it helps? You will need to apply the patches for _packed.py and _chat.py, then update your config accordingly.

@bratao
Author

bratao commented Jan 7, 2025

Thank you so much. Happy new year! I will try and report ASAP

@bratao
Author

bratao commented Jan 18, 2025

@andrewkho sorry, I forgot to report back. I got an 8x speedup, which is good enough for me!!
Looking forward to seeing it in the main branch.

Thanks for the improvement!!!
