[MoE][PoC] Expert Parallel: dp2ep #732

Draft

tianyu-l wants to merge 1 commit into base: gh/tianyu-l/26/base

Conversation

@tianyu-l (Contributor) commented on Dec 12, 2024

Stack from ghstack (oldest at bottom):

Temporary changes to unblock exploration

Also need to

  • turn the block-level compile to full_graph=False, because there will be an additional FSDP instance inside each TransformerBlock, wrapping the non-shared experts (see the sketch after this list).

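A minimal sketch of what the relaxed block-level compile could look like, assuming the common torchtitan pattern of compiling each TransformerBlock individually and re-registering it on the parent module; the loop, the `model.layers` attribute, and the helper name below are assumptions, not the actual change in this PR (the keyword argument to `torch.compile` is `fullgraph`).

```python
# Sketch (assumption, not the actual PR change): compile each TransformerBlock
# with fullgraph=False so that graph breaks caused by the inner FSDP wrapping
# of the non-shared experts fall back to eager instead of raising an error.
import torch
import torch.nn as nn

def apply_block_compile(model: nn.Module) -> None:
    # assumes the model keeps its TransformerBlocks under `model.layers`
    for layer_id, block in model.layers.named_children():
        compiled_block = torch.compile(block, fullgraph=False)
        model.layers.register_module(layer_id, compiled_block)
```
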
Things that won't work

  • For EP + TP, DCP resharding would likely fail because the experts would "forget" that they are sharded: this meta info is not tracked as part of the 1-D DTensor. This can be solved by storing a 2-D DTensor (ep + tp), but that requires several code changes, including strided sharding from FSDP given a 2-D DTensor (see the sketch after this list).

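A rough sketch of the 2-D DTensor idea, assuming a device mesh with named "ep" and "tp" dimensions; the mesh sizes, tensor shapes, and placements are illustrative only, and the strided-sharding interaction with FSDP mentioned above is not shown. It would need to run under a distributed launcher (e.g. torchrun with 8 ranks).

```python
# Sketch (illustrative assumption): store a grouped expert weight as a single
# 2-D DTensor over an (ep, tp) mesh, so the EP sharding is part of the DTensor
# placement metadata that DCP resharding can see, instead of a 1-D (tp-only)
# DTensor that "forgets" the expert sharding.
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

# e.g. 8 ranks split into 2-way expert parallel x 4-way tensor parallel
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("ep", "tp"))

# grouped expert weight: [num_experts, dim_in, dim_out]
full_weight = torch.randn(16, 4096, 1024)

# Shard the expert dim over "ep" and each expert's output dim over "tp";
# both shardings are now recorded on one DTensor.
expert_weight = distribute_tensor(full_weight, mesh_2d, [Shard(0), Shard(2)])
```
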
Not including

  • shared expert overlapping

tianyu-l added a commit that referenced this pull request Dec 12, 2024
ghstack-source-id: 17160930f23950b91faca7b822cd3e7f9d075f7d
Pull Request resolved: #732
@tianyu-l tianyu-l marked this pull request as draft December 12, 2024 04:09