
[BUG] TP-comm-overlap bug when replacing TELayerNormColumnParallelLinear with TEColumnParallelLinear #1275

wplf opened this issue Nov 6, 2024 · 0 comments
wplf commented Nov 6, 2024

Describe the bug
The bug happens when I use TEColumnParallelLinear instead of TELayerNormColumnParallelLinear with tp-comm-overlap enabled.
The reason might be a misuse of ub_split_rs and ub_split_ag for the column-parallel and row-parallel linear layers.
I believe ub_split_rs is not supposed to be enabled for TEColumnParallelLinear.
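
For illustration, the suspected mismatch looks roughly like this: with sequence parallelism, a column-parallel linear needs its sequence-sharded input all-gathered (the ub_split_ag style overlap), while a row-parallel linear needs its partial output reduce-scattered (the ub_split_rs style overlap). The sketch below only encodes that mapping; it is hypothetical and does not reproduce the actual Megatron-LM/TransformerEngine wiring, and select_ub_overlap_flags is a made-up helper name.

```python
# Hypothetical sketch of which userbuffer-overlap flags should apply to each
# parallel linear flavor under sequence parallelism. Not real Megatron-LM code.
def select_ub_overlap_flags(parallel_mode: str, tp_comm_overlap: bool) -> dict:
    if not tp_comm_overlap:
        return {"ub_split_ag": False, "ub_split_rs": False}
    if parallel_mode == "column":
        # Column-parallel: overlap the all-gather of the sequence-sharded input.
        return {"ub_split_ag": True, "ub_split_rs": False}
    if parallel_mode == "row":
        # Row-parallel: overlap the reduce-scatter of the partial output.
        return {"ub_split_ag": False, "ub_split_rs": True}
    raise ValueError(f"unknown parallel_mode: {parallel_mode}")
```

If TEColumnParallelLinear ends up running with ub_split_rs enabled, its output would be communicated at the wrong point, which could be consistent with the shape mismatch in the trace below.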

To Reproduce
The Megatron-LM commit ID is f39c48dba01ffb2f91cbee992252bb9543200633.

  1. Prepare the dataset.
  2. Enable tp-comm-overlap, sequence-parallel, and tensor parallelism in the scripts.
  3. Change all TELayerNormColumnParallelLinear to TEColumnParallelLinear in megatron/core/models/gpt/gpt_layer_specs.py (see the sketch after this list).
  4. Run train_gpt3_175b_distributed.sh.
     The bug then occurs.
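
For reference, the edit in step 3 is roughly the following fragment of get_gpt_layer_with_transformer_engine_spec() in megatron/core/models/gpt/gpt_layer_specs.py after the change. This is a sketch from memory; the exact submodule fields at this commit may differ slightly.

```python
# Sketch of the edited layer spec: every TELayerNormColumnParallelLinear is
# swapped for TEColumnParallelLinear (field names may differ at this commit).
self_attention=ModuleSpec(
    module=SelfAttention,
    submodules=SelfAttentionSubmodules(
        linear_qkv=TEColumnParallelLinear,   # was TELayerNormColumnParallelLinear
        core_attention=TEDotProductAttention,
        linear_proj=TERowParallelLinear,
    ),
),
...
mlp=ModuleSpec(
    module=MLP,
    submodules=MLPSubmodules(
        linear_fc1=TEColumnParallelLinear,   # was TELayerNormColumnParallelLinear
        linear_fc2=TERowParallelLinear,
    ),
),
```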

Stack trace/logs

  File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 300, in forward
    attention_output_with_bias = self.self_attention(
  File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return self._call_impl(*args, **kwargs)
  File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 360, in forward
    query, key, value = self.get_query_key_value_tensors(hidden_states, key_value_states)
  File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 611, in get_query_key_value_tensors
    return self._call_impl(*args, **kwargs)
  File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 360, in forward
    return forward_call(*args, **kwargs)
  File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 360, in forward
    mixed_qkv = mixed_qkv.view(*new_tensor_shape)
RuntimeError: shape '[256, 4, 16, 192]' is invalid for input of size 4194304
    query, key, value = self.get_query_key_value_tensors(hidden_states, key_value_states)
  File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 611, in get_query_key_value_tensors
    query, key, value = self.get_query_key_value_tensors(hidden_states, key_value_states)
  File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 611, in get_query_key_value_tensors
    return forward_call(*args, **kwargs)
  File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 360, in forward
    mixed_qkv = mixed_qkv.view(*new_tensor_shape)
RuntimeError: shape '[256, 4, 16, 192]' is invalid for input of size 4194304    
mixed_qkv = mixed_qkv.view(*new_tensor_shape)
RuntimeError: shape '[256, 4, 16, 192]' is invalid for input of size 4194304
    query, key, value = self.get_query_key_value_tensors(hidden_states, key_value_states)
  File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 611, in get_query_key_value_tensors
    mixed_qkv = mixed_qkv.view(*new_tensor_shape)
RuntimeError: shape '[256, 4, 16, 192]' is invalid for input of size 4194304
[2024-11-06 15:47:22,476] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2633092 closing signal SIGTERM
[2024-11-06 15:47:24,194] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 2633093) of binary: /vepfs/home/lijinliang/miniconda3/envs/megatron/bin/python
Traceback (most recent call last):
  File "/home/lijinliang/.local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
  File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-11-06_15:47:22
  host      : localhost
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2633094)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-11-06_15:47:22
  host      : localhost
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2633095)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-06_15:47:22
  host      : localhost
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2633093)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
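
A quick check of the reported shape error: the requested view [256, 4, 16, 192] needs 256·4·16·192 = 3,145,728 elements, but the tensor holds 4,194,304. Assuming the leading sequence and micro-batch dimensions (256 and 4) are correct, the QKV output carries 4096 elements per position instead of the expected 16·192 = 3072, so the projection output reaching attention.py:611 has the wrong partitioned width, which fits the suspicion above that the overlap communication is applied at the wrong point. The sizes come straight from the traceback; the interpretation of the leading dimensions is an assumption.

```python
# Arithmetic check of the mismatch reported in the traceback above.
expected_shape = (256, 4, 16, 192)       # shape requested by mixed_qkv.view(...)
expected_elems = 256 * 4 * 16 * 192      # 3,145,728
actual_elems = 4_194_304                 # "input of size 4194304" from the error

# Assuming the leading dims (sequence=256, micro-batch=4) are correct, compare
# the per-position width the view expects with what the tensor actually holds.
per_position_expected = 16 * 192                  # 3072
per_position_actual = actual_elems // (256 * 4)   # 4096

print(expected_elems, actual_elems)                 # 3145728 4194304
print(per_position_expected, per_position_actual)   # 3072 4096
```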

Environment (please complete the following information):

  • Megatron-LM commit ID: f39c48dba01ffb2f91cbee992252bb9543200633
  • PyTorch 2.2.1+cu121
  • CUDA 12.1
  • NCCL 2.19.3

Proposed fix
If you have a proposal for how to fix the issue state it here or link to a PR.

Additional context
Add any other context about the problem here.
