Add multiprocess dataset packing #2180

Open
bratao opened this issue Dec 19, 2024 · 21 comments
Assignees: joecummings
Labels: enhancement (New feature or request), triaged (This issue has been assigned an owner and appropriate label)

Comments

@bratao

bratao commented Dec 19, 2024

Hello,

I have a custom JSONL dataset with 4 million examples. In Axolotl, I can load it with packing enabled on a 64-core machine in 10 minutes, since it uses all cores. It also caches the packed dataset, which is super handy.

The same dataset in torchtune takes 6 hours if packing is enabled. Apparently, it uses only one core.

Is there any way to accelerate this?

joecummings added the triaged label Dec 19, 2024
joecummings self-assigned this Dec 19, 2024
@joecummings
Contributor

This is a great suggestion! Let me look into how we could add this to the library.

Caching is something that we may need to think a little more about so we don't accidentally take up too much of our users' memory, but we should definitely be able to utilize more cores to pack.

@joecummings
Contributor

It looks like Axolotl makes use of the map and filter functions on the Hugging Face Dataset abstraction, which is pretty neat. That way they can just set a default num_processes and pick up a previously cached dataset. I'll look into whether this is feasible for torchtune to do or whether we'd have to roll our own solution based on that.
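
(For reference, a minimal sketch of that Hugging Face datasets pattern; the file name and map function are placeholders, not torchtune code:)

from datasets import load_dataset

ds = load_dataset("json", data_files="train.jsonl", split="train")

def add_num_turns(example):
    # stand-in for real tokenization / chat templating
    example["num_turns"] = len(example["conversations"])
    return example

# num_proc fans the work out across CPU cores, and datasets caches the result
# on disk so an identical re-run reuses the cached files instead of recomputing
ds = ds.map(add_num_turns, num_proc=16)
ds = ds.filter(lambda ex: ex["num_turns"] > 1, num_proc=16)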

joecummings added the enhancement label Dec 19, 2024
@ebsmothers
Contributor

cc @andrewkho

@andrewkho
Contributor

@bratao what's the command you're using to run this with tune? It'd be interesting to see how we can improve on the baseline using torchdata.

@bratao
Author

bratao commented Dec 19, 2024

@andrewkho
This is my config (I tried the torchdata dataloader, but it makes no difference for me compared to the main branch; it still uses only 1 CPU core):

dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  data_files: llama_finetune_train_modelo_unico_v1.jsonl
  split: train
  conversation_column: conversations
  conversation_style: sharegpt
  packed: True  # True increases speed
seed: null
shuffle: True
use_torchdata: true
dataloader:
  parallel_method: thread
  num_workers: 8
  packed: True

The run command is tune run full_finetune_single_device --config 32B_full_single_device_mi300.yaml. This config is based on the qwen2.5-7b one, adapted for the 32B model.

@andrewkho
Contributor

andrewkho commented Dec 19, 2024

Thanks @bratao! What version of Python are you using? Are you able to share the jsonl file or some mangled version of it? Have you tried parallel_method: "process"? Threads may be GIL-bound, so processes are probably better, but it's probably easiest if I try to run this myself.

edit: are you running on a branch? I'd like to see how torchdata is set up for this recipe
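
(A generic illustration of the thread-vs-process point for CPU-bound work like tokenization; this is not torchtune or torchdata code, just a quick way to see the GIL effect:)

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def fake_tokenize(text):
    # CPU-bound stand-in for tokenization; threads serialize on the GIL here
    return sum(ord(c) for c in text * 20)

if __name__ == "__main__":
    data = ["lorem ipsum dolor sit amet " * 100] * 1000
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.perf_counter()
        with pool_cls(max_workers=8) as pool:
            list(pool.map(fake_tokenize, data))
        # the thread pool gives little to no speedup, the process pool does
        print(pool_cls.__name__, f"{time.perf_counter() - start:.2f}s")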

@bratao
Author

bratao commented Dec 19, 2024

@andrewkho thank you so much. I'm using Python 3.11 on Ubuntu 22.04.
Unfortunately I cannot share the jsonl file, as it has sensitive information. But it looks like this, with 4 million lines:


{"conversations": [{"from": "system", "value": "Você é o EscavAI, o assistente virtual do Escavador. Seu objetivo é tirar dúvidas jurídicas dos usuários do Escavador da melhor forma possível"}, {"from": "human", "value": "Oi"}, {"from": "gpt", "value": "Oi! Eu sou o EscavAI! Em que posso te ajudar hoje?"}]}
{"conversations": [{"from": "system", "value": "Você é o EscavAI, o assistente virtual do Escavador. Seu objetivo é tirar dúvidas jurídicas dos usuários do Escavador da melhor forma possível"}, {"from": "human", "value": "Oi"}, {"from": "gpt", "value": "Oi! Prazer, sou o EscavAI! Espero que esteja bem. Como posso te ajudar?"}]}
{"conversations": [{"from": "system", "value": "Você é o EscavAI, o assistente virtual do Escavador. Seu objetivo é tirar dúvidas jurídicas dos usuários do Escavador da melhor forma possível"}, {"from": "human", "value": "Oi"}, {"from": "gpt", "value": "Que bom te ver por aqui! Me chamo EscavAI! Posso te ajudar com algo?"}]}
{"conversations": [{"from": "system", "value": "Você é o EscavAI, o assistente virtual do Escavador. Seu objetivo é tirar dúvidas jurídicas dos usuários do Escavador da melhor forma possível"}, {"from": "human", "value": "Oi"}, {"from": "gpt", "value": "Olá, tudo bem? Eu sou o EscavAI! Como posso te ajudar?"}]}
{"conversations": [{"from": "system", "value": "Você é o EscavAI, o assistente virtual do Escavador. Seu objetivo é tirar dúvidas jurídicas dos usuários do Escavador da melhor forma possível"}, {"from": "human", "value": "Bom dia"}, {"from": "gpt", "value": "Bom dia! Em que posso te ajudar?"}]}

I will try with parallel_method: "process"

@andrewkho
Contributor

andrewkho commented Dec 19, 2024

@bratao totally understand, thank you, this is enough to go on, will report back soon :) obrigado

@alanrios2001

> Caching is something that we may need to think a little more about so we don't accidentally take up too much of our users' memory, but we should definitely be able to utilize more cores to pack.

Caching could be optional.

@andrewkho
Contributor

andrewkho commented Dec 21, 2024

A small update: I copy/pasted the sample lines until I got around 1.6M lines of JSONL, which takes around 12 minutes (estimated) to load on my machine with the current implementation.

  • With a straightforward torchdata.nodes implementation on a duplicated version of this dataset, I see around a 6x speedup, from 12 minutes to a little under 2 minutes with 16 workers and multiprocessing (a rough sketch of this shape follows the list). I also found a feature we should add to our parallel mapper for automatic pre-batching.
  • If I jump through more hoops, I see around a 20-30x speedup with 32 workers, bringing the time down to around 20-25s if you ignore warmup time, which is much better. This will require us to implement a new mode of parallelism in torchdata, which we probably need to do anyway.
  • The hoops/pain above all come from passing data between processes. With free-threaded Python (3.13t) I imagine a lot of this pain will go away, but datasets cannot run on free-threaded Python yet because aiohttp doesn't work there (cffi might be the offending package).
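
(As a rough illustration of the "straightforward" shape above, with tokenization fanned out to worker processes while packing stays in the main process; the file path, tokenize stand-in, and the exact torchdata.nodes names IterableWrapper, ParallelMapper, and Loader are assumptions based on the nightly API discussed here:)

import json

from torchdata.nodes import IterableWrapper, Loader, ParallelMapper

def read_jsonl(path):
    # stream one record per line so the whole file never sits in memory
    with open(path) as f:
        for line in f:
            yield json.loads(line)

def tokenize(sample):
    # placeholder for the real tokenizer + chat template
    return [len(turn["value"]) for turn in sample["conversations"]]

if __name__ == "__main__":
    source = IterableWrapper(read_jsonl("train.jsonl"))
    node = ParallelMapper(source, map_fn=tokenize, num_workers=16, method="process")
    packs = []
    for tokens in Loader(node):
        packs.append(tokens)  # packing itself still runs in a single process here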

But one fundamental question: is the point here to race through packing into a map-style dataset as fast as possible, or is the real goal to set up a streaming packer to be used during training? cc @ebsmothers @joecummings @bratao

@bratao
Author

bratao commented Dec 21, 2024

Thank you so much @andrewkho for analyzing this problem!

Would you mind sharing the configuration and command line you used? In my case, it's taking hours rather than minutes, so achieving 12 minutes would be a significant improvement for me.

Regarding the other options, while I'm not familiar with the project's specific goals, the straightforward torchdata.nodes approach sounds like the better option. Waiting for free-threaded Python isn't practical right now since it's still experimental. I don't think we need to sacrifice features just for maximum speed, especially since the training phase should dominate the overall running time. Being able to load the data in minutes would be perfectly sufficient for my needs!

@andrewkho
Contributor

@bratao got it, so the end goal is training. In that case I suggest we go with a streaming model of packing; then you won't wait for the job to start at all, and packing will happen on the fly. This should also be more (CPU) memory-efficient.
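
(A minimal sketch of what on-the-fly packing could look like, assuming a greedy fill up to a fixed token budget; this is an illustration, not torchtune's actual PackedDataset logic:)

def stream_packs(tokenized_samples, max_seq_len=4096):
    # greedily fill a buffer and yield each pack as soon as the next sample would overflow it
    buffer = []
    for tokens in tokenized_samples:
        if buffer and len(buffer) + len(tokens) > max_seq_len:
            yield buffer
            buffer = []
        # a real packer would also split or truncate samples longer than max_seq_len
        buffer.extend(tokens)
    if buffer:
        yield buffer  # flush the final partial pack

Because packs are produced lazily, training can start immediately and only about one pack's worth of tokens is held in memory at a time.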

@joecummings
Contributor

@andrewkho But how long would it take to get streaming packing in vs. the straightforward torchdata.nodes approach for map-style datasets?

Map-style will always need to be supported, so it wouldn't be wasted effort IMO to get this initial approach in as long as it's not going to take a month to implement.

@andrewkho
Contributor

@joecummings agree that both are still necessary. It's going to be similar foundational work: land some version of a streaming packer (could be in torchdata.nodes), then set up some nodes to do multi-process streaming to fill _packs. From there it should be somewhat straightforward to enable packed streaming with nodes, although recipe integration will take more time.

However, this basic version of map-style isn't going to maximally utilize the cores. Improving that (my second bullet point) will take more work on the torchdata side; we need a new stream parallelizer (which we need anyway), but it's going to need some design work on our end to get it right.

One thing we could do is land the packer and integrate it with the existing packed dataset to improve startup. Then we can enable streaming packing in torchtune while we work on the stream parallelizer for a further time-to-first-batch (TTFB) win later on. Wdyt?

@joecummings
Contributor

> @joecummings agree that both are still necessary. It's going to be similar foundational work: land some version of a streaming packer (could be in torchdata.nodes), then set up some nodes to do multi-process streaming to fill _packs. From there it should be somewhat straightforward to enable packed streaming with nodes, although recipe integration will take more time.
>
> However, this basic version of map-style isn't going to maximally utilize the cores. Improving that (my second bullet point) will take more work on the torchdata side; we need a new stream parallelizer (which we need anyway), but it's going to need some design work on our end to get it right.
>
> One thing we could do is land the packer and integrate it with the existing packed dataset to improve startup. Then we can enable streaming packing in torchtune while we work on the stream parallelizer for a further time-to-first-batch (TTFB) win later on. Wdyt?

Sounds brilliant. 10/10. Very excited. Lmk how we can help facilitate.

@andrewkho
Contributor

@bratao just to set expectations, I'll be out for Christmas and New Year's, and we'll get going on this in January, hope that's alright!

@bratao
Author

bratao commented Dec 22, 2024

@andrewkho of course. I will be using Axolotl for this run but am anxiously waiting to use the new implementation. Have a good Christmas and a prosperous new year; take the opportunity to rest and enjoy time with your family!!!

BTW, just to let you know, I tried

use_torchdata: true
dataloader:
  parallel_method: process
  num_workers: 8
  shuffle: True
  packed: True

But it still used only one core, with the following ETA:
Packing dataset: 7%|████████▏ | 247394/3449510 [18:04<4:02:07, 220.42it/s]

@andrewkho
Contributor

@bratao after some digging, that's because torchdata hasn't been integrated into this recipe yet, so those settings have no effect.

@andrewkho
Contributor

andrewkho commented Jan 7, 2025

@bratao happy new year! I put a small demo together of the most straightforward solution: https://github.com/pytorch/torchtune/compare/main...andrewkho:torchtune:andrewkh/parallel-packer?expand=1

You'll need torchdata's nightly build to test this:
pip install --index-url https://download.pytorch.org/whl/nightly/cpu torchdata

I ran this with tune run full_finetune_single_device --config recipes/configs/qwen2_5/32B_full_single_device_mi300.yaml dataset.packer_num_workers=12 and got around 14k QPS (lines of input processed) vs. 2k QPS with packer_num_workers=0.

At this point the Packer itself is the bottleneck, and the next step is to run that in parallel, which will need a bit more work on our end since it's stateful. But as discussed above, for a streaming solution this is probably sufficient for most use cases. Could you give this a shot on your dataset and see how much it helps? You will need to apply the patches for _packed.py and _chat.py, then update your config accordingly.

@bratao
Author

bratao commented Jan 7, 2025

Thank you so much. Happy new year! I will try and report ASAP

@bratao
Author

bratao commented Jan 18, 2025

@andrewkho sorry, I forgot to report back. I got an 8x speedup, which is good enough for me!!
Looking forward to seeing it in the main branch.

Thanks for the improvement!!!
