[BUG] TP-comm-overlap bug when replacing TELayerNormColumnParallelLinear with TEColumnParallelLinear
Describe the bug
The bug happens when I use `TEColumnParallelLinear` instead of `TELayerNormColumnParallelLinear` with `--tp-comm-overlap` enabled.
The cause might be a misuse of the `ub_split_rs` and `ub_split_ag` options for the column-parallel and row-parallel linear layers: as far as I can tell, `ub_split_rs` is not supposed to be enabled for `TEColumnParallelLinear`.
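To make that suspicion concrete, this is the mapping I would expect between the userbuffer overlap flags and the parallel mode of each linear layer. It is only an illustrative sketch, not the actual Megatron-LM or TransformerEngine code; the helper name and its arguments are made up:

```python
# Illustrative sketch only: the function name and arguments below are
# hypothetical and do not exist in Megatron-LM or TransformerEngine.

def expected_ub_overlap_flags(parallel_mode: str, tp_comm_overlap: bool) -> dict:
    """Which tensor-parallel comm overlap should apply to a linear layer.

    With sequence parallelism, a column-parallel linear consumes a
    sequence-sharded input, so the collective that can be overlapped is
    the all-gather of its input (`ub_split_ag`). A row-parallel linear
    produces partial outputs that are reduce-scattered, so the collective
    that can be overlapped is that reduce-scatter (`ub_split_rs`).
    """
    if not tp_comm_overlap:
        return {"ub_split_ag": False, "ub_split_rs": False}
    if parallel_mode == "column":  # TEColumnParallelLinear, TELayerNormColumnParallelLinear
        return {"ub_split_ag": True, "ub_split_rs": False}
    if parallel_mode == "row":     # TERowParallelLinear
        return {"ub_split_ag": False, "ub_split_rs": True}
    raise ValueError(f"unknown parallel_mode: {parallel_mode!r}")
```

This is just my expectation; I have not verified what the wrapper actually passes at the commit below.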
To Reproduce
The Megatron-LM commit ID is f39c48dba01ffb2f91cbee992252bb9543200633.
1. Prepare the dataset.
2. Enable `--tp-comm-overlap` and `--sequence-parallel`, and set a tensor-parallel size greater than 1 in the scripts.
3. Change every `TELayerNormColumnParallelLinear` to `TEColumnParallelLinear` in `megatron/core/models/gpt/gpt_layer_specs.py` (see the sketch after this list).
4. Run `train_gpt3_175b_distributed.sh`.
The error then occurs.
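The swap in step 3 amounts to something like the following. This is a rough sketch of the relevant part of the layer spec, paraphrased from memory rather than an exact diff; the import paths may differ at the commit above:

```python
# Sketch of the swap in megatron/core/models/gpt/gpt_layer_specs.py
# (paraphrased; import paths are from memory and may differ at this commit).
from megatron.core.transformer.attention import SelfAttentionSubmodules
from megatron.core.transformer.custom_layers.transformer_engine import (
    TEColumnParallelLinear,
    TEDotProductAttention,
    TERowParallelLinear,
)

# Before the change, the spec uses the fused variant:
#     linear_qkv=TELayerNormColumnParallelLinear
# After the change, every column-parallel linear is the unfused variant
# (the MLP's linear_fc1 is changed the same way):
self_attention_submodules = SelfAttentionSubmodules(
    linear_qkv=TEColumnParallelLinear,   # was TELayerNormColumnParallelLinear
    core_attention=TEDotProductAttention,
    linear_proj=TERowParallelLinear,
)
```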
Stack trace/logs
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 300, in forward
attention_output_with_bias = self.self_attention(
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return self._call_impl(*args, **kwargs)
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 360, in forward
query, key, value = self.get_query_key_value_tensors(hidden_states, key_value_states)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 611, in get_query_key_value_tensors
return self._call_impl(*args, **kwargs)
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 360, in forward
return forward_call(*args, **kwargs)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 360, in forward
mixed_qkv = mixed_qkv.view(*new_tensor_shape)
RuntimeError: shape '[256, 4, 16, 192]' is invalid for input of size 4194304
query, key, value = self.get_query_key_value_tensors(hidden_states, key_value_states)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 611, in get_query_key_value_tensors
query, key, value = self.get_query_key_value_tensors(hidden_states, key_value_states)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 611, in get_query_key_value_tensors
return forward_call(*args, **kwargs)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 360, in forward
mixed_qkv = mixed_qkv.view(*new_tensor_shape)
RuntimeError: shape '[256, 4, 16, 192]' is invalid for input of size 4194304
mixed_qkv = mixed_qkv.view(*new_tensor_shape)
RuntimeError: shape '[256, 4, 16, 192]' is invalid for input of size 4194304
query, key, value = self.get_query_key_value_tensors(hidden_states, key_value_states)
File "/vepfs/home/lijinliang/official_proj/Megatron-LM/megatron/core/transformer/attention.py", line 611, in get_query_key_value_tensors
mixed_qkv = mixed_qkv.view(*new_tensor_shape)
RuntimeError: shape '[256, 4, 16, 192]' is invalid for input of size 4194304
[2024-11-06 15:47:22,476] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2633092 closing signal SIGTERM
[2024-11-06 15:47:24,194] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 2633093) of binary: /vepfs/home/lijinliang/miniconda3/envs/megatron/bin/python
Traceback (most recent call last):
File "/home/lijinliang/.local/bin/torchrun", line 33, in<module>
sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
time: 2024-11-06_15:47:22
host : localhost
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 2633094)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time: 2024-11-06_15:47:22
host : localhost
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 2633095)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time: 2024-11-06_15:47:22
host : localhost
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2633093)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
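As a side note on the numbers in the RuntimeError, the element counts work out as follows (plain arithmetic on the values printed in the trace):

```python
# Arithmetic on the values reported in the RuntimeError above.
expected = 256 * 4 * 16 * 192   # elements in the requested view -> 3,145,728
actual = 4194304                # element count reported for mixed_qkv
print(expected // (256 * 4))    # 3072: per-token width the view expects (16 * 192)
print(actual // (256 * 4))      # 4096: per-token width the tensor actually has
```

Assuming the first two view dimensions are taken from `mixed_qkv` itself, the tensor coming out of the QKV projection already has a per-token width of 4096 instead of the expected 3072 before any attention math runs.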
Environment (please complete the following information):
- Megatron-LM commit ID: f39c48dba01ffb2f91cbee992252bb9543200633
- PyTorch 2.2.1+cu121
- CUDA 12.1
- NCCL 2.19.3
Proposed fix
As described above, my guess is that the fix is to ensure `ub_split_rs` is not enabled for `TEColumnParallelLinear`; only the all-gather overlap (`ub_split_ag`) should apply to column-parallel layers.