Add multiprocess dataset packing #2180
This is a great suggestion! Let me look into how we could add this to the library. Caching is something we may need to think a little more about so we don't accidentally take up too much of our users' memory, but we should definitely be able to utilize more cores to pack.
It looks like Axolotl makes use of the map and filter functions on the Hugging Face Dataset abstraction, which is pretty neat. That way they can just set a default num_processes to use and pick up a potentially cached dataset. I'll look into whether this is feasible for torchtune to do or whether we might have to roll our own solution based on that.
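For reference, a minimal sketch of the Hugging Face datasets pattern described above, assuming a JSONL file and a placeholder tokenize function (neither is torchtune or Axolotl code): map and filter fan the work out across processes via num_proc, and results are cached on disk so a re-run with the same arguments reuses the cached Arrow files.

```python
from datasets import load_dataset

# Load a JSONL file as a map-style Hugging Face dataset (file name is illustrative).
ds = load_dataset("json", data_files="train.jsonl", split="train")

def tokenize(example):
    # Placeholder tokenizer: replace with the real tokenizer for your model.
    example["tokens"] = example["text"].split()
    return example

# num_proc spreads the work across processes; results are cached automatically,
# so a second run with identical arguments skips straight to the cached output.
ds = ds.map(tokenize, num_proc=64)
ds = ds.filter(lambda ex: len(ex["tokens"]) > 0, num_proc=64)
```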
cc @andrewkho
@bratao what's the command you're using to run this with tune? It'd be interesting to see how we can improve on the baseline using torchdata.
@andrewkho
The run command is:
Thanks @bratao! What version of Python are you using? Are you able to share the JSONL file or some mangled version of it? Have you tried using
Edit: are you running on a branch? I'd like to see how torchdata is set up for this recipe.
@andrewkho thank you so much. I'm using Python 3.11 on Ubuntu 22.04.
I will try with parallel_method: "process"
@bratao totally understand, thank you, this is enough to go on, will report back soon :) obrigado
Caching could be optional.
So, a small update: I copy-pasted the sample files until I got around 1.6M lines of JSONL, which takes around 12 minutes (estimated) to load on my machine with the current implementation.
But one fundamental question: is the point here to race through packing into a map-style dataset as fast as possible, or is the real goal to set up a streaming packer to be used in training? cc @ebsmothers @joecummings @bratao
Thank you so much @andrewkho for analyzing this problem! Would you mind sharing the configuration and command line you used? In my case, it's taking hours rather than minutes, so achieving 12 minutes would be a significant improvement for me. Regarding the other options, while I'm not familiar with the project's specific goals, the torchdata.nodes approach with a straightforward implementation sounds like the better option. Waiting for free-threading Python isn't practical right now since it's still experimental. I don't think we need to sacrifice features just for maximum speed, especially since the training phase should dominate the overall running time. Being able to load the data in minutes would be perfectly sufficient for my needs!
@bratao got it, so the end goal is training. In that case I suggest we go with a streaming model of packing: then you won't wait for the job to start at all, since packing will happen on the fly. This should also be more (CPU) memory efficient.
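To make the streaming idea concrete, here is a minimal, framework-free sketch of greedy on-the-fly packing (illustrative only, not torchtune's actual packer): samples are consumed from an iterator and each pack is yielded as soon as it fills up, so training can begin without a pre-pass over the dataset.

```python
from typing import Iterable, Iterator, List

def stream_packs(samples: Iterable[List[int]], max_seq_len: int) -> Iterator[List[int]]:
    """Greedily pack token sequences into packs of at most max_seq_len tokens.

    Illustrative sketch only: sequences longer than max_seq_len are truncated,
    and the final partial pack is flushed at the end of the stream.
    """
    pack: List[int] = []
    for tokens in samples:
        tokens = tokens[:max_seq_len]
        if pack and len(pack) + len(tokens) > max_seq_len:
            yield pack  # emit a full pack immediately; no need to materialize the dataset
            pack = []
        pack.extend(tokens)
    if pack:
        yield pack

# Example: pack a stream of pre-tokenized samples into blocks of up to 8 tokens.
packs = list(stream_packs(iter([[1, 2, 3], [4, 5, 6, 7], [8, 9]]), max_seq_len=8))
# -> [[1, 2, 3, 4, 5, 6, 7], [8, 9]]
```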
@andrewkho But how long would it take to get streaming packing in vs. the straightforward torchdata.nodes approach for map-style datasets? Map-style will always need to be supported, so it wouldn't be wasted effort IMO to get this initial approach in as long as it's not going to take a month to implement.
@joecummings agree that both are still necessary. It's going to be similar foundational work: land some version of the streaming packer (could be in torchdata.nodes), and then set up some nodes to do multi-process streaming to fill _packs. From there it should be somewhat straightforward to enable packed streaming with nodes, although recipe integration will take more time. However, this basic version of map-style isn't going to maximally utilize the cores. To improve that (my second bullet point), it'll take more work on the torchdata side: we need a new stream parallelizer (which we need anyway), and it's going to need some design work on our end to get it right. One thing we could do is land the packer and integrate it with the existing packed dataset to improve startup. Then we can enable streaming packing in torchtune while we work on the stream parallelizer for a further TTFB win later on. Wdyt?
Sounds brilliant. 10/10. Very excited. Lmk how we can help facilitate.
@bratao just to set expectations, I'll be out for Christmas and New Year's, and we'll get going on this in January, hope that's alright!
@andrewkho of course. I will be using Axolotl for this run but am eagerly waiting to use the new implementation. Have a good Christmas and a prosperous new year, and take the opportunity to rest and enjoy time with your family! BTW, just to let you know, I tried
But it still used only one core, with the following ETA:
@bratao after some digging, that's because torchdata hasn't been integrated into this recipe yet, so those settings would have no effect.
@bratao happy new year! I put a small demo together of the most straightforward solution: https://github.com/pytorch/torchtune/compare/main...andrewkho:torchtune:andrewkh/parallel-packer?expand=1

You'll need torchdata's nightly build to test this. I ran this with

At this point the Packer itself is the bottleneck, and the next step is to run that in parallel, which will need a bit more work on our end since it's stateful. But as discussed above, for a streaming solution this is probably sufficient for most use cases.

Could you give this a shot on your dataset and see how much it helps? You will need to apply the patches for _packed.py and _chat.py, then update your config appropriately.
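For anyone who wants to experiment before the branch lands, the shape of the demo can be approximated with torchdata.nodes along these lines; treat the node names and arguments as assumptions against the nightly build, and the tokenize function and sample data as placeholders rather than torchtune code.

```python
from torchdata.nodes import IterableWrapper, Loader, ParallelMapper

def tokenize(sample):
    # Placeholder tokenizer; swap in the model's real tokenizer.
    return sample["text"].split()

# Illustrative in-memory samples standing in for a JSONL dataset.
samples = [{"text": "hello world"}, {"text": "multiprocess packing demo"}]

# ParallelMapper fans tokenization out across workers; the stateful packer
# stays downstream and single-threaded, matching the discussion above.
node = IterableWrapper(samples)
node = ParallelMapper(node, map_fn=tokenize, num_workers=8, method="thread")

for tokens in Loader(node):
    ...  # feed each tokenized sample into the packer
```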
Thank you so much. Happy new year! I will try and report back ASAP.
@andrewkho sorry I forgot to report back. I got an 8x speedup, which is good enough for me! Thanks for the improvement!
Hello,
I have a custom JSONL dataset with 4 million examples. In Axolotl, I am able to load this with packing on a 64-core machine in 10 minutes, as it uses all cores. It also caches the packed dataset, which is super handy.
The same dataset in torchtune takes 6 hours if packing is enabled. Apparently, it uses only one core.
Is there any way to accelerate this?