Training with Data and Expert Parallelism #204
Comments
Tutel MoE works with data loaders and models exactly as it does in DDP mode, so you can safely stack the Tutel MoE layer in your original forward graph designed for DDP (e.g. https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_ddp.py#L79). There is only one thing you need to pay attention to: expert parameters must not be managed by DDP's allreduce.
|
Thank you very much, I'll try it out! Is there any explanation of the different parallel argument types and the differences between them? |
Whichever parallel type you choose, it doesn't change how you use the MoE layer out of the box. The different types only change the parallelism that the MoE layer uses internally; those choices are all transparent to users and mathematically equivalent to each other. Whether you run at large or small scale, setting that option well will improve the execution time of the Tutel MoE layer, since each parallelism has its own network complexity and local memory consumption. |
Additionally, you can change the parallel option at every iteration, e.g. https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_switch.py#L88 The value of |
Wow, thank you very much I'll try it out! |
Hello, I use both DeepSpeed and your Tutel framework with 4 experts and 4 GPUs, with both Data and Expert Parallelism. deepspeed.initialize( args=self.args, but it doesn't work: expert parameters can't be updated correctly. What should I do? Thanks! |
Is In other words, below is exactly the parameter list you need to exclude from allreduce: Next, you need to check DeepSpeed's documentation for how their framework can avoid doing all_reduce for |
Thanks, but I can't find how to avoid doing all_reduce. |
The action of skipping all-reduce is controlled outside the Tutel MoE layer, e.g. by DeepSpeed/Fairseq/Megatron, so your remaining question has to be answered by DeepSpeed. |
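To illustrate why the skip matters regardless of framework, here is a framework-agnostic toy simulation in plain Python. It is not DeepSpeed's or Tutel's API: `sync_gradients` and the parameter names are hypothetical, and gradients are plain floats. Shared parameters are averaged across ranks, while parameters in the skip set keep their local gradient, since each rank holds a different expert.

```python
from typing import Dict, List, Set


def sync_gradients(
    per_rank_grads: List[Dict[str, float]], skip_names: Set[str]
) -> List[Dict[str, float]]:
    """Simulate a gradient all-reduce (mean) that skips expert parameters."""
    world_size = len(per_rank_grads)
    synced = [dict(grads) for grads in per_rank_grads]  # copy, keep inputs intact
    for name in per_rank_grads[0]:
        if name in skip_names:
            continue  # expert gradients stay local to the rank owning that expert
        mean = sum(grads[name] for grads in per_rank_grads) / world_size
        for grads in synced:
            grads[name] = mean
    return synced


if __name__ == "__main__":
    # Two ranks: same dense layer replicated, but a different expert on each rank.
    grads = [
        {"dense.weight": 1.0, "moe.experts.weight": 10.0},
        {"dense.weight": 3.0, "moe.experts.weight": 20.0},
    ]
    for rank, g in enumerate(sync_gradients(grads, {"moe.experts.weight"})):
        print(rank, g)
```

If the expert gradients were averaged as well, every rank would update its own expert with a mixture of gradients belonging to other experts, which matches the "expert parameters can't be updated correctly" symptom described above.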
If I place all experts on specific GPUs just by setting a custom process group when creating moe_layer(), I don't need to do all_reduce, is that right? |
Nope, I don't think This is PyTorch DDP's standard way to bypass allreduce, but DeepSpeed may maintain its own allreduce in a different way: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_ddp.py#L84-L98 If you are interested, can you share an example of DeepSpeed training in your context? Given a reproducible training script, we can help you with the answer. |
How should I prepare my code (data loaders, model, etc.) in order to train in both Data and Expert Parallel mode?
And what changes between the "auto", "model" and "data" --parallel types?
In my current setup I'm training with DDP, wrapping the model with torch DistributedDataParallel and using the distributed sampler in the loaders.
Now I want to insert a MoE layer with 2 experts into the model (I have 2 GPUs, so 1 local expert each), using both Data and Expert Parallelism.
Some help would be appreciated.