double free or corruption (!prev) #770

Open
ltm920716 opened this issue Oct 15, 2024 · 1 comment
Comments

@ltm920716

Hello,
I am testing llama2-70b-lora, but with the base model replaced by llama2-7b, on a single node with 2 RTX 4090 GPUs; the swap amounts to something like the sketch below.
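A minimal, hypothetical sketch of that change (not the reference script itself): the model name, dtype, and attention arguments are my assumptions, while the LoRA hyperparameters match what the log prints (r=16, alpha=32, dropout 0.1, 4,194,304 trainable params).

```python
# Hypothetical sketch of the model swap, not the MLPerf reference code.
# The LoRA settings below match the values logged further down.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # swapped in for the 70B checkpoint (assumed name)
    torch_dtype=torch.bfloat16,          # assumed dtype
    attn_implementation="flash_attention_2",
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["o_proj"],           # matches the module tree printed in the log
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # trainable params: 4,194,304 || ...
```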
Running log:

Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
[2024-10-15 09:46:30,947] torch.distributed.run: [WARNING]
[2024-10-15 09:46:30,947] torch.distributed.run: [WARNING] *****************************************
[2024-10-15 09:46:30,947] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-15 09:46:30,947] torch.distributed.run: [WARNING] *****************************************
[2024-10-15 09:46:39,862] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-15 09:46:39,955] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-15 09:46:40,024] [INFO] [comm.py:637:init_distributed] cdb=None
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2024-10-15 09:46:40,173] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-10-15 09:46:40,173] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2024-10-15 09:46:41,603] [INFO] [partition_parameters.py:343:__exit__] finished initializing model - num_params = 291, num_elems = 6.74B
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.22s/it]
Loading checkpoint shards:  50%|█████████████████████████████████████████████████████████████████████████                                                                         | 1/2 [00:07<00:07,  7.24s/it]trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.88s/it]
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaFlashAttention2(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (o_proj): lora.Linear(
                (base_layer): Linear(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (rotary_emb): LlamaRotaryEmbedding()
            )
            (mlp): LlamaMLP(
              (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
              (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
              (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): LlamaRMSNorm()
            (post_attention_layernorm): LlamaRMSNorm()
          )
        )
        (norm): LlamaRMSNorm()
      )
      (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
    )
  )
)
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
Parameter Offload: Total persistent parameters: 4460544 in 129 params
:::MLLOG {"namespace": "", "time_ms": 1728985613696, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": "True", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 94}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "llama2_70b_lora", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 97}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "closed", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 101}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "referece", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 105}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "referece", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 108}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "submission_poc_name", "value": "referece", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 112}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "submission_poc_email", "value": "referece", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 116}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "onprem", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 120}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "global_batch_size", "value": 2, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 124}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "train_samples", "value": 3901, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 128}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "eval_samples", "value": 173, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 132}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "seed", "value": 2234, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 136}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_warmup_factor", "value": 0.0, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 137}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_training_steps", "value": 1024, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 138}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_adamw_weight_decay", "value": 0.0001, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 139}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_gradient_clip_norm", "value": 0.3, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 140}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_base_learning_rate", "value": 0.0004, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 141}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "lora_alpha", "value": 32, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 142}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "lora_rank", "value": 16, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 143}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "gradient_accumulation_steps", "value": 1, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 144}}
:::MLLOG {"namespace": "", "time_ms": 1728985613699, "event_type": "INTERVAL_START", "key": "init_start", "value": "", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 145}}
:::MLLOG {"namespace": "", "time_ms": 1728985613699, "event_type": "INTERVAL_END", "key": "init_stop", "value": "", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 147}}
:::MLLOG {"namespace": "", "time_ms": 1728985613699, "event_type": "INTERVAL_START", "key": "run_start", "value": "", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 148}}
  0%|                                                                                                                                                                                  | 0/1024 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:428: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:428: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
{'loss': 3.818, 'grad_norm': 1.016438921422074, 'learning_rate': 0.00039945809133573807, 'epoch': 0.01}
  2%|███▉                                                                                                                                                                   | 24/1024 [01:36<1:06:20,  3.98s/it]:::MLLOG {"namespace": "", "time_ms": 1728985710476, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 3.818, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 166, "samples_count": 48}}
{'loss': 3.3317, 'grad_norm': 1.0225008307763737, 'learning_rate': 0.0003994120140678966, 'epoch': 0.01}
{'loss': 2.7643, 'grad_norm': 0.9247777329488452, 'learning_rate': 0.0003978353019929562, 'epoch': 0.02}
{'loss': 2.2395, 'grad_norm': 0.7798605687888748, 'learning_rate': 0.0003951404260077057, 'epoch': 0.04}
  7%|███████████▋                                                                                                                                                           | 72/1024 [04:48<1:03:26,  4.00s/it]:::MLLOG {"namespace": "", "time_ms": 1728985902004, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 2.2395, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 166, "samples_count": 144}}
{'loss': 2.0295, 'grad_norm': 1.0274388936605494, 'learning_rate': 0.00039500506901339887, 'epoch': 0.04}
{'loss': 2.067, 'grad_norm': 0.686157303906154, 'learning_rate': 0.0003913880671464418, 'epoch': 0.05}
{'eval_loss': 1.5954195261001587, 'eval_runtime': 136.6648, 'eval_samples_per_second': 1.266, 'eval_steps_per_second': 0.637, 'epoch': 0.05}
{'loss': 1.884, 'grad_norm': 0.6294034907656876, 'learning_rate': 0.0003865985597669478, 'epoch': 0.06}
 12%|███████████████████▍                                                                                                                                                  | 120/1024 [10:16<1:00:27,  4.01s/it]:::MLLOG {"namespace": "", "time_ms": 1728986230616, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 1.884, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 166, "samples_count": 240}}
{'loss': 1.8423, 'grad_norm': 0.5743935530993334, 'learning_rate': 0.00038637685311633367, 'epoch': 0.06}
{'loss': 1.8052, 'grad_norm': 0.6059068728905395, 'learning_rate': 0.0003807978586246887, 'epoch': 0.07}
{'eval_loss': 1.4383825063705444, 'eval_runtime': 136.9896, 'eval_samples_per_second': 1.263, 'eval_steps_per_second': 0.635, 'epoch': 0.07}
 14%|███████████████████████▋                                                                                                                                                | 144/1024 [14:10<58:46,  4.01s/it:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "INTERVAL_END", "key": "block_stop", "value": "", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 174, "samples_count": 288}}
:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 1.4383825063705444, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 179, "samples_count": 288}}
:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "INTERVAL_START", "key": "block_start", "value": "", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 184, "samples_count": 144}}
:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "INTERVAL_END", "key": "run_stop", "value": 1.4383825063705444, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 195, "samples_count": 288, "status": "success"}}
{'loss': 1.8023, 'grad_norm': 0.8252179367975583, 'learning_rate': 0.0003805346636474518, 'epoch': 0.07}
{'train_runtime': 854.0888, 'train_samples_per_second': 2.398, 'train_steps_per_second': 1.199, 'train_loss': 2.429245588697236, 'epoch': 0.07}
 14%|███████████████████████▌                                                                                                                                              | 145/1024 [14:14<1:26:17,  5.89s/it]
double free or corruption (!prev)
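
Note that the abort only shows up after run_stop and the final progress line, i.e. during process teardown rather than inside a training step. Below is a hedged sketch (an assumption on my part, not a confirmed fix) of the kind of explicit cleanup that can be added at the end of scripts/train.py to rule out teardown-order problems:

```python
# Hedged sketch: explicit teardown at the very end of the training script,
# assuming the abort happens while NCCL/DeepSpeed state is released implicitly
# at interpreter exit. Not a confirmed fix for this issue.
import torch
import torch.distributed as dist

def shutdown_cleanly(model=None):
    # Drop references to large CUDA objects before the interpreter exits.
    del model
    torch.cuda.synchronize()
    torch.cuda.empty_cache()
    # Tear down the NCCL process group explicitly instead of relying on atexit hooks.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
        dist.destroy_process_group()

# Hypothetical call site: once per rank, after trainer.train() has returned.
```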
@sharlynxy

I ran into this problem too. Did you find a solution?

I am running the llama2-70b-lora model on 8×H100 GPUs.

nb-e3p9s3o28rnk-0:21619:22709 [7] NCCL INFO comm 0x25b97d10 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e4000 commId 0x593e48e14bee5768 - Init COMPLETE
nb-e3p9s3o28rnk-0:21615:22705 [3] NCCL INFO comm 0x26458d30 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x593e48e14bee5768 - Init COMPLETE
nb-e3p9s3o28rnk-0:21612:22703 [0] NCCL INFO comm 0x255eb730 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 18000 commId 0x593e48e14bee5768 - Init COMPLETE
nb-e3p9s3o28rnk-0:21616:22708 [4] NCCL INFO comm 0x25c32b10 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 84000 commId 0x593e48e14bee5768 - Init COMPLETE
nb-e3p9s3o28rnk-0:21618:22707 [6] NCCL INFO comm 0x245f18f0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId 91000 commId 0x593e48e14bee5768 - Init COMPLETE
nb-e3p9s3o28rnk-0:21617:22710 [5] NCCL INFO comm 0x24ad1f00 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId 8b000 commId 0x593e48e14bee5768 - Init COMPLETE
nb-e3p9s3o28rnk-0:21614:22706 [2] NCCL INFO comm 0x264b4b60 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3a000 commId 0x593e48e14bee5768 - Init COMPLETE
nb-e3p9s3o28rnk-0:21613:22704 [1] NCCL INFO comm 0x250a03d0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 2a000 commId 0x593e48e14bee5768 - Init COMPLETE
{'loss': 1.5303, 'grad_norm': 0.14347858031266167, 'learning_rate': 0.00039945809133573807, 'epoch': 0.03}                                                                                      
  2%|███▌                                                                                                                                                   | 24/1024 [03:41<2:27:53,  8.87s/it]:::MLLOG {"namespace": "", "time_ms": 1731563076562, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 1.5303, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 166, "samples_count": 192}}
{'loss': 1.1428, 'grad_norm': 0.4939029303148046, 'learning_rate': 0.0003994120140678966, 'epoch': 0.04}                                                                                        
{'loss': 0.9623, 'grad_norm': 0.10289518496271374, 'learning_rate': 0.0003978353019929562, 'epoch': 0.07}                                                                                       
{'loss': 0.91, 'grad_norm': 0.0725064982570268, 'learning_rate': 0.0003951404260077057, 'epoch': 0.1}                                                                                           
  7%|██████████▌                                                                                                                                            | 72/1024 [10:48<2:21:16,  8.90s/it]:::MLLOG {"namespace": "", "time_ms": 1731563504195, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 0.91, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 166, "samples_count": 576}}
{'loss': 0.8095, 'grad_norm': 0.09953358797550853, 'learning_rate': 0.00039500506901339887, 'epoch': 0.1}                                                                                       
{'loss': 0.8977, 'grad_norm': 0.10508718530036582, 'learning_rate': 0.0003913880671464418, 'epoch': 0.14}                                                                                       
{'loss': 0.8651, 'grad_norm': 0.07821072058792573, 'learning_rate': 0.0003865985597669478, 'epoch': 0.17}                                                                                       
 12%|█████████████████▌                                                                                                                                    | 120/1024 [17:56<2:14:24,  8.92s/it]:::MLLOG {"namespace": "", "time_ms": 1731563932422, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 0.8651, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 166, "samples_count": 960}}
{'loss': 0.8989, 'grad_norm': 0.11927183336185655, 'learning_rate': 0.00038637685311633367, 'epoch': 0.17}                                                                                      
{'loss': 0.9052, 'grad_norm': 0.05735218159573083, 'learning_rate': 0.0003807978586246887, 'epoch': 0.21}                                                                                       
{'eval_loss': 0.8295302987098694, 'eval_runtime': 85.5208, 'eval_samples_per_second': 2.841, 'eval_steps_per_second': 0.362, 'epoch': 0.21}                                                     
 14%|█████████████████████                                                                                                                                 | 144/1024 [22:56<2:10:39,  8.91s/it:::MLLOG {"namespace": "", "time_ms": 1731564232086, "event_type": "INTERVAL_END", "key": "run_stop", "value": 0.8295302987098694, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 195, "samples_count": 1152, "status": "success"}}
{'train_runtime': 1385.6147, 'train_samples_per_second': 5.912, 'train_steps_per_second': 0.739, 'train_loss': 1.0105113058254636, 'epoch': 0.21}                                               
 14%|█████████████████████▏                                                                                                                                | 145/1024 [23:05<2:19:59,  9.56s/it]
nb-e3p9s3o28rnk-0:21613:22756 [32586] NCCL INFO [Service thread] Connection closed by localRank 2
nb-e3p9s3o28rnk-0:21613:21983 [32593] NCCL INFO [Service thread] Connection closed by localRank 2
double free or corruption (!prev)
nb-e3p9s3o28rnk-0:21612:22750 [32736] NCCL INFO [Service thread] Connection closed by localRank 2
nb-e3p9s3o28rnk-0:21612:21984 [32746] NCCL INFO [Service thread] Connection closed by localRank 2
nb-e3p9s3o28rnk-0:21615:22755 [32622] NCCL INFO [Service thread] Connection closed by localRank 2
nb-e3p9s3o28rnk-0:21615:21978 [32626] NCCL INFO [Service thread] Connection closed by localRank 2
nb-e3p9s3o28rnk-0:21612:22750 [32736] NCCL INFO [Service thread] Connection closed by localRank 4
nb-e3p9s3o28rnk-0:21612:21984 [32746] NCCL INFO [Service thread] Connection closed by localRank 4
nb-e3p9s3o28rnk-0:21615:22755 [32622] NCCL INFO [Service thread] Connection closed by localRank 4
nb-e3p9s3o28rnk-0:21615:21978 [32626] NCCL INFO [Service thread] Connection closed by localRank 4
nb-e3p9s3o28rnk-0:21612:22750 [32736] NCCL INFO [Service thread] Connection closed by localRank 7
nb-e3p9s3o28rnk-0:21612:21984 [32746] NCCL INFO [Service thread] Connection closed by localRank 7
nb-e3p9s3o28rnk-0:21618:21981 [32647] NCCL INFO [Service thread] Connection closed by localRank 7
nb-e3p9s3o28rnk-0:21618:22752 [32637] NCCL INFO [Service thread] Connection closed by localRank 7
nb-e3p9s3o28rnk-0:21613:22756 [32586] NCCL INFO [Service thread] Connection closed by localRank 0
nb-e3p9s3o28rnk-0:21613:21983 [32593] NCCL INFO [Service thread] Connection closed by localRank 0
[2024-11-14 06:04:09,095] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 5 (pid: 21617) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    deepspeed_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 724, in deepspeed_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
scripts/train.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-14_06:04:09
  host      : nb-e3p9s3o28rnk-0
  rank      : 5 (local_rank: 5)
  exitcode  : -6 (pid: 21617)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 21617
======================================================
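
Signal 6 here is glibc aborting on its heap-corruption check (the "double free or corruption" message). A small, hedged addition (standard library only, not part of the reference script) can make the failing rank print a Python traceback when that abort fires, which helps narrow down where in teardown it happens:

```python
# Hedged debugging aid: dump a Python traceback from every thread when glibc
# aborts the process with SIGABRT. Add near the top of scripts/train.py.
import faulthandler

faulthandler.enable(all_threads=True)  # installs handlers for SIGABRT, SIGSEGV, SIGFPE, SIGBUS, SIGILL
```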
