[BUG] TP-comm-overlap bug when replacing TELayerNormColumnParallelLinear with TEColumnParallelLinear
Describe the bug
The bug happens when I use `TEColumnParallelLinear` instead of `TELayerNormColumnParallelLinear` with `--tp-comm-overlap` enabled.
The cause might be a misuse of the `ub_split_rs` and `ub_split_ag` options for the column-parallel and row-parallel linear layers: as far as I can tell, `ub_split_rs` is not supposed to be enabled for `TEColumnParallelLinear`.
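To make that suspicion concrete, this is the mapping I would expect between the userbuffer overlap flags and the parallel mode of each linear layer. It is only an illustrative sketch, not the actual Megatron-LM or TransformerEngine code; the helper name and its arguments are made up:

```python
# Illustrative sketch only: the function name and arguments below are
# hypothetical and do not exist in Megatron-LM or TransformerEngine.

def expected_ub_overlap_flags(parallel_mode: str, tp_comm_overlap: bool) -> dict:
    """Which tensor-parallel comm overlap should apply to a linear layer.

    With sequence parallelism, a column-parallel linear consumes a
    sequence-sharded input, so the collective that can be overlapped is
    the all-gather of its input (`ub_split_ag`). A row-parallel linear
    produces partial outputs that are reduce-scattered, so the collective
    that can be overlapped is that reduce-scatter (`ub_split_rs`).
    """
    if not tp_comm_overlap:
        return {"ub_split_ag": False, "ub_split_rs": False}
    if parallel_mode == "column":  # TEColumnParallelLinear, TELayerNormColumnParallelLinear
        return {"ub_split_ag": True, "ub_split_rs": False}
    if parallel_mode == "row":     # TERowParallelLinear
        return {"ub_split_ag": False, "ub_split_rs": True}
    raise ValueError(f"unknown parallel_mode: {parallel_mode!r}")
```

This is just my expectation; I have not verified what the wrapper actually passes at the commit below.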
To Reproduce
The Megatron-LM commit ID is f39c48dba01ffb2f91cbee992252bb9543200633.
1. Prepare the dataset.
2. Enable `--tp-comm-overlap` and `--sequence-parallel`, and set a tensor-parallel size greater than 1 in the scripts.
3. Change every `TELayerNormColumnParallelLinear` to `TEColumnParallelLinear` in `megatron/core/models/gpt/gpt_layer_specs.py` (see the sketch after this list).
4. Run `train_gpt3_175b_distributed.sh`.
The error then occurs.
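The swap in step 3 amounts to something like the following. This is a rough sketch of the relevant part of the layer spec, paraphrased from memory rather than an exact diff; the import paths may differ at the commit above:

```python
# Sketch of the swap in megatron/core/models/gpt/gpt_layer_specs.py
# (paraphrased; import paths are from memory and may differ at this commit).
from megatron.core.transformer.attention import SelfAttentionSubmodules
from megatron.core.transformer.custom_layers.transformer_engine import (
    TEColumnParallelLinear,
    TEDotProductAttention,
    TERowParallelLinear,
)

# Before the change, the spec uses the fused variant:
#     linear_qkv=TELayerNormColumnParallelLinear
# After the change, every column-parallel linear is the unfused variant
# (the MLP's linear_fc1 is changed the same way):
self_attention_submodules = SelfAttentionSubmodules(
    linear_qkv=TEColumnParallelLinear,   # was TELayerNormColumnParallelLinear
    core_attention=TEDotProductAttention,
    linear_proj=TERowParallelLinear,
)
```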
Stack trace/logs
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 300, in forward
attention_output_with_bias = self.self_attention(
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return self._call_impl(*args, **kwargs)
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 360, in forward
query, key, value = self.get_query_key_value_tensors(hidden_states, key_value_states)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 611, in get_query_key_value_tensors
return self._call_impl(*args, **kwargs)
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 360, in forward
return forward_call(*args, **kwargs)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 360, in forward
mixed_qkv = mixed_qkv.view(*new_tensor_shape)
RuntimeError: shape '[256, 4, 16, 192]' is invalid for input of size 4194304
query, key, value = self.get_query_key_value_tensors(hidden_states, key_value_states)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 611, in get_query_key_value_tensors
query, key, value = self.get_query_key_value_tensors(hidden_states, key_value_states)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 611, in get_query_key_value_tensors
return forward_call(*args, **kwargs)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 360, in forward
mixed_qkv = mixed_qkv.view(*new_tensor_shape)
RuntimeError: shape '[256, 4, 16, 192]' is invalid for input of size 4194304
mixed_qkv = mixed_qkv.view(*new_tensor_shape)
RuntimeError: shape '[256, 4, 16, 192]' is invalid for input of size 4194304
query, key, value = self.get_query_key_value_tensors(hidden_states, key_value_states)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 611, in get_query_key_value_tensors
mixed_qkv = mixed_qkv.view(*new_tensor_shape)
RuntimeError: shape '[256, 4, 16, 192]' is invalid for input of size 4194304
[2024-11-06 15:47:22,476] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2633092 closing signal SIGTERM
[2024-11-06 15:47:24,194] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 2633093) of binary: /vepfs/home/lijinliang/miniconda3/envs/megatron/bin/python
Traceback (most recent call last):
File "/home/lijinliang/.local/bin/torchrun", line 33, in<module>
sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
time: 2024-11-06_15:47:22
host : localhost
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 2633094)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time: 2024-11-06_15:47:22
host : localhost
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 2633095)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time: 2024-11-06_15:47:22
host : localhost
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2633093)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
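As a side note on the numbers in the RuntimeError, the element counts work out as follows (plain arithmetic on the values printed in the trace):

```python
# Arithmetic on the values reported in the RuntimeError above.
expected = 256 * 4 * 16 * 192   # elements in the requested view -> 3,145,728
actual = 4194304                # element count reported for mixed_qkv
print(expected // (256 * 4))    # 3072: per-token width the view expects (16 * 192)
print(actual // (256 * 4))      # 4096: per-token width the tensor actually has
```

Assuming the first two view dimensions are taken from `mixed_qkv` itself, the tensor coming out of the QKV projection already has a per-token width of 4096 instead of the expected 3072 before any attention math runs.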
Environment (please complete the following information):
- Megatron-LM commit ID: f39c48dba01ffb2f91cbee992252bb9543200633
- PyTorch 2.2.1+cu121
- CUDA 12.1
- NCCL 2.19.3
Proposed fix
As described above, my guess is that the fix is to ensure `ub_split_rs` is not enabled for `TEColumnParallelLinear`; only the all-gather overlap (`ub_split_ag`) should apply to column-parallel layers.