[MoE][PoC] Expert Parallel: dp2ep #732

tianyu-l · 2024-12-12T03:55:13Z

Stack from ghstack (oldest at bottom):

Temporary changes to unblock exploration

[pytorch] comment out the check at https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/parallel/api.py#L66
[torchtitan] turn off the optimizers' foreach option and clip_grad_norm_, because not all parameters are DTensors on the same mesh: (1) MoE non-shared experts and the other params are on different FSDP meshes, and (2) moe.router.gate is a replicated plain torch.Tensor
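
To make the "different FSDP meshes" point concrete, here is a toy sketch in plain Python (no torch.distributed calls). It assumes dp2ep means the EP dimension is carved out of the data-parallel ranks, so dense params are sharded over all DP ranks while expert params are FSDP-sharded only over the ranks that hold the same experts. The function name `dp2ep_groups` and the contiguous rank layout are hypothetical, chosen for illustration; the actual mesh construction in this PR may differ.

```python
def dp2ep_groups(dp_degree: int, ep_degree: int) -> dict:
    """Toy rank grouping for dp2ep (layout is an assumption, not the PR's code)."""
    if dp_degree % ep_degree != 0:
        raise ValueError("ep_degree must divide dp_degree")
    ranks = list(range(dp_degree))
    # Dense (non-expert) params: one FSDP group spanning every DP rank.
    dense_fsdp = [ranks]
    # EP groups: contiguous blocks of ep_degree ranks; each block holds all experts once.
    ep = [ranks[i:i + ep_degree] for i in range(0, dp_degree, ep_degree)]
    # Expert params: FSDP only across ranks at the same position in each EP group,
    # i.e. the ranks that own the same subset of experts.
    expert_fsdp = [ranks[i::ep_degree] for i in range(ep_degree)]
    return {"dense_fsdp": dense_fsdp, "ep": ep, "expert_fsdp": expert_fsdp}


groups = dp2ep_groups(dp_degree=8, ep_degree=4)
for name, value in groups.items():
    print(name, value)
```

Under these assumptions a dense param and an expert param on the same rank belong to FSDP groups of different sizes (8 vs. 2 here), which is why a single fused optimizer step or a single gradient-norm reduction cannot treat them uniformly.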

Also need to

set fullgraph=False for the block-level compile, because there will be an additional FSDP wrapping inside each TransformerBlock, at the non-shared-experts level.

Things that won't work

For EP + TP, DCP resharding would likely fail: the experts would "forget" that they are sharded, because this metadata is not tracked as part of the 1-D DTensor. This could be solved by storing a 2-D DTensor (ep + tp), but that requires several code changes, including strided sharding in FSDP when it is given a 2-D DTensor.
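
A toy sketch of the "forgetting" problem, in plain Python rather than the DTensor API. It assumes the expert weight is split along dim 0 by EP and along dim 2 by TP; `global_shape` and the `(tensor_dim, mesh_size)` pairs are hypothetical stand-ins for the placement metadata a checkpoint would read back, and the shapes are made up.

```python
def global_shape(local_shape, shardings):
    """Reconstruct the global shape from a local shard.

    shardings: list of (tensor_dim, mesh_size) pairs recorded in the metadata.
    """
    shape = list(local_shape)
    for tensor_dim, mesh_size in shardings:
        shape[tensor_dim] *= mesh_size
    return tuple(shape)


# Assumed global expert weight: (num_experts=8, dim=16, hidden=32), ep=2, tp=2.
local = (4, 16, 16)

tp_only = [(2, 2)]            # 1-D metadata: only the TP sharding is recorded
ep_and_tp = [(0, 2), (2, 2)]  # 2-D metadata: EP and TP shardings are both recorded

print(global_shape(local, tp_only))    # (4, 16, 32)
print(global_shape(local, ep_and_tp))  # (8, 16, 32)
```

With only the 1-D metadata, each rank's shard looks like a complete tensor of 4 experts, so anything that reshards from the saved metadata has no way to know that 8 experts exist.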

Not including

shared expert overlapping

[ghstack-poisoned]

ghstack-source-id: 17160930f23950b91faca7b822cd3e7f9d075f7d Pull Request resolved: #732

[MoE][PoC] Expert Parallel: dp2ep

82bcdd4


This was referenced Dec 12, 2024

[MoE][PoC] model code #730

Draft

[MoE][PoC] Expert Parallel: tp and tp2ep #731

Draft

facebook-github-bot added the CLA Signed label Dec 12, 2024

tianyu-l added a commit that referenced this pull request Dec 12, 2024

[MoE][PoC] Expert Parallel: dp2ep

25cfe6d

ghstack-source-id: 17160930f23950b91faca7b822cd3e7f9d075f7d Pull Request resolved: #732

This was referenced Dec 12, 2024

[PoC][MoE & EP] model code and various parallelisms #725

Closed

[PoC][MoE & EP] integrate with FSDP & CP #726

Closed

tianyu-l marked this pull request as draft December 12, 2024 04:09
