Support CUDA Graph for MoE models #1233
base: main
Conversation
Signed-off-by: Robin Zhang <[email protected]> Co-authored-by: Yifei Song <[email protected]>
Technically this seems mostly reasonable, although I have questions and stylistic suggestions. Have you tested that it works with Mcore?
@ptrendx @ksivaman @sbhavani What is our priority for this feature? The custom Mcore logic in make_graphed_callables is already messy and fragile, and this PR does exacerbate those problems.
for m_chunk in range(num_model_chunks):
    for _ in range(num_microbatches):
        for l_no in range(num_layers):
            per_callable_module_params.append(
                tuple(callables[m_chunk * num_layers + l_no].parameters())
                if isinstance(callables[m_chunk * num_layers + l_no], torch.nn.Module)
                else ()
            )
This change seems correct to me, but it's odd if the Mcore integration was working before. @ksivaman Have we run this with Mcore, or did we run with num_microbatches=1?
This changes the interpretation of per_callable_module_params from (num_chunks, layers_per_chunk, num_microbatches) to (num_chunks, num_microbatches, layers_per_chunk). This matches the interpretation of the per_callable_* lists when capturing graphs:
TransformerEngine/transformer_engine/pytorch/graph.py
Lines 237 to 239 in 3b89c36
per_callable_fwd_idx = (m_chunk * num_microbatches * num_layers) + (
    fwd_idx[m_chunk] * num_layers + l_no
)
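For reference, a minimal standalone sketch (plain Python with made-up sizes, not TE code) that checks the new (chunk, microbatch, layer) append order against this flattened index, with the microbatch counter standing in for fwd_idx[m_chunk]:

num_model_chunks, num_microbatches, num_layers = 2, 3, 4

# Build the per-callable list in the new (chunk, microbatch, layer) order.
flat_order = []
for m_chunk in range(num_model_chunks):
    for microbatch in range(num_microbatches):
        for l_no in range(num_layers):
            flat_order.append((m_chunk, microbatch, l_no))

# Recompute per_callable_fwd_idx and confirm it points at the right entry.
for m_chunk in range(num_model_chunks):
    for microbatch in range(num_microbatches):
        for l_no in range(num_layers):
            idx = (m_chunk * num_microbatches * num_layers) + (microbatch * num_layers + l_no)
            assert flat_order[idx] == (m_chunk, microbatch, l_no)
print("flattened ordering matches per_callable_fwd_idx")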
for module in func.modules():
    if hasattr(module, "is_first_microbatch"):
        module.is_first_microbatch = True
TE modules don't set or read the is_first_microbatch attr. It's a kwarg in the forward function. Also, this assumes callables contains torch.nn.Modules.
Thank you for the reminder. I also believe that modifications are needed here. The issue we originally intended to address is that the FP8 weight-caching behavior in MoE leads to different behavior for the first microbatch compared to other microbatches. If we do not include this piece of code, the warmup process will update is_first_microbatch to False, causing all captured graphs to exhibit non-first-microbatch behavior, which does not align with our requirements. Therefore, we chose to reset this parameter after warmup.
In summary, we need either to prevent is_first_microbatch from being updated during warmup or to reset it after warmup. Choosing the former may require adding a flag to inform MoE that it is currently in the warmup phase, while choosing the latter might require making this code a bit more general. Do you have any input on the modification plan that could serve as a reference for us?
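As a rough sketch of the second option (resetting after warmup), assuming the callables are torch.nn.Module instances and that the Mcore-side MoE wrapper reads an is_first_microbatch attribute; the helper name is hypothetical:

import torch

def reset_is_first_microbatch(callables):
    # Hypothetical helper illustrating the "reset after warmup" option: warmup
    # iterations flip the Mcore-side flag to False, so restore it before capture
    # so the captured graphs keep first-microbatch (FP8 weight caching) behavior.
    for func in callables:
        if not isinstance(func, torch.nn.Module):
            continue  # guard against non-Module callables
        for module in func.modules():
            if hasattr(module, "is_first_microbatch"):
                module.is_first_microbatch = True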
I see, so this is Mcore-specific logic. It's uncomfortable that it's made its way into TE, but it's a tricky problem and I can't think of a better solution either, at least without significant changes in Mcore.
We should document what this is doing, especially for TE developers with no knowledge of Mcore.
Done. I've tried to explain the reason for this MoE-specific logic.
transformer_engine/pytorch/graph.py (outdated)
if (
    not fp8_recipe.fp8_mha
    and not fp8_recipe.fp8_dpa
    and hasattr(m, "attention_dropout")
    and m.deterministic
):
    continue
Why are we skipping the FP8 scale update logic for this case?
This is for a deterministic test with FP8. Even without CUDA graphs, we find that FP8_DPA leads to random output, even in deterministic mode. With CUDA graphs, we not only need to set DPA to BF16 but also need to skip the FP8 meta update for DPA. After that, we finally obtain stable output under the deterministic test.
It seems like we're covering up a correctness bug. This may also affect convergence, since we are no longer doing amax reductions within MHA (e.g. for the FP8 casts before qkv and proj).
It makes sense, and I think we should 'continue' whenever fp8_mha and fp8_dpa are disabled, regardless of whether the test is deterministic or not.
Actually, this change will cause correctness issues. We do a max all-reduce on the amaxes so that FP8 scaling factors are synchronized over the TP group. If we skip the amax reduction, then the MHA's TP communication will be wrong. In particular, we'll all-gather FP8 inputs to the qkv GEMM, but the scaling factors will be different for each TP rank. In pseudocode:
def mha_qkv(x_local, w_local, x_fp8_scale):
    x_local_fp8, x_amax = cast_to_fp8(x_local, x_fp8_scale)
    x_fp8 = all_gather(x_local_fp8)
    y_local = gemm(x_fp8, w_local, x_fp8_scale)
    max_all_reduce(x_amax)  # Without this, x_fp8_scale is different between ranks
    update_fp8_scale(x_fp8_scale, x_amax)
    return y_local
Turns out this isn't relevant, since MultiheadAttention is not a TransformerEngineBaseModule.
As far as I can tell, this logic is specific to DotProductAttention. We could make the intent much more obvious with:
if (
    isinstance(m, DotProductAttention)
    and not fp8_recipe.fp8_mha
    and not fp8_recipe.fp8_dpa
):
    # Don't need to update FP8 meta for non-FP8 DPA
    continue
Thanks! I have modified this.
/te-ci pytorch
Signed-off-by: Robin Zhang <[email protected]>
Signed-off-by: Robin Zhang <[email protected]>
Signed-off-by: Robin Zhang <[email protected]> Co-authored-by: Yifei Song <[email protected]>
Signed-off-by: Robin Zhang <[email protected]>
for more information, see https://pre-commit.ci
/te-ci pytorch
Yes, we also made some changes in Mcore, together with the TE changes in this PR, to enable MoE cudagraph. You can refer to issue 193 in our Megatron-LM repo.
Signed-off-by: Xin Yao <[email protected]>
/te-ci pytorch
Signed-off-by: Robin Zhang <[email protected]>
Signed-off-by: Yifei Song <[email protected]>
Description
Different from non-MoE models like Llama 2, MoE models have dynamically shaped activations in their FFN layers, so one CUDA graph can only capture part of a transformer layer instead of covering the whole layer. We call this the "breaking-layer" CUDA graph mode. This PR adds breaking-layer CUDA graph support for MoE models on the TE side and fixes several related bugs in TE.
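As a rough illustration of the breaking-layer idea only (a minimal sketch with stand-in modules, using stock torch.cuda.make_graphed_callables rather than TE's implementation, and assuming a CUDA device):

import torch

# Stand-in modules: attention has static shapes and can be graph-captured;
# the MoE FFN sees a varying number of dispatched tokens and stays eager.
attn = torch.nn.Linear(1024, 1024).cuda()
moe_ffn = torch.nn.Linear(1024, 1024).cuda()

sample = torch.randn(8, 1024, device="cuda", requires_grad=True)
graphed_attn = torch.cuda.make_graphed_callables(attn, (sample,))

def layer_forward(hidden_states, dispatched_tokens):
    # Replay the captured graph for the static-shape part of the layer...
    attn_out = graphed_attn(hidden_states)
    # ...and run the dynamically shaped expert FFN outside the graph.
    expert_out = moe_ffn(dispatched_tokens)
    return attn_out, expert_out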
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
- Add is_initialized() method in CudaRNGStatesTracker to align with what is already done in MCore.
- Fix per_callable_module_params order bug in _make_graphed_callables when _order is given.
- Support breaking-layer CUDA graph capture for MoE models in _make_graphed_callables when _order is given.
- Add fp8_group argument to make_graphed_callables() and modify the is_first_microbatch, skip_fp8_weight_update and fp8_meta code.
Checklist: