Issues: NVIDIA/Megatron-LM
[QUESTION] Where can I download the tokenizer for the model mcore-llava-mistral-7b-instruct-clip336-pretraining? (#1281, opened Nov 11, 2024 by herolxl)
[QUESTION] Is there any restriction on using allgather with moe_expert_capacity_factor? (#1277, opened Nov 7, 2024 by Louis-J)
[BUG] TP-comm-overlap bug when replacing TELayerNormColumnParallelLinear with TEColumnParallelLinear (#1275, opened Nov 6, 2024 by wplf)
[BUG] The cached_loss_mask may be modified unexpectedly in GPTDataset (#1269, opened Nov 1, 2024 by shmily326)
[QUESTION] How to use loader_mcore, and why does it require torch distributed? (#1266, opened Oct 29, 2024 by KookHoiKim)
[ENHANCEMENT] Enable LR scaling for a specific layer (e.g. down-projection...) during pretraining (#1263, opened Oct 28, 2024 by dhia680)
[ENHANCEMENT] Add layer name in a layer to improve code debugging (#1198, opened Oct 4, 2024 by rybakov)
[BUG] "ValueError: optimizer got an empty parameter list" under pipeline parallelism (#1166, opened Oct 2, 2024 by takuya576)
[QUESTION] Why aren't all SMs active when NCCL kernels and compute kernels overlap? (#1161, opened Sep 27, 2024 by yu-depend)