
multi-node training error during eval/best_loss checkpoint saving: Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2315, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808543 milliseconds before timing out. #19

Open
iamhappytoo opened this issue Jul 13, 2024 · 5 comments

Comments

@iamhappytoo

iamhappytoo commented Jul 13, 2024

Hi @tdrussell,

First of all, thank you so much for your helpful discussion in another issue earlier!
Now I am able to use qlora-pipe with DeepSpeed in a two-node environment with 12 × 80 GB GPUs for full-parameter tuning of a 70B model using the adamw_kahan optimizer.
I'm using a hostfile like this:
node01 slots=4
node04 slots=8
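
For reference, the launch command looks roughly like this (the config path is a placeholder for my actual config):

```bash
deepspeed --hostfile=/path/to/hostfile train.py --deepspeed --config my_config.toml
```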

The training works fine for the first epoch and the first several evaluation steps, but when it tries to save the best_loss checkpoint, it hangs for 30 minutes and then times out.

The error log looks like this:

node04: [2024-07-12 19:13:00.439] [INFO] [qlora-pipe] step:  1400 /  2112 loss: 0.8146 iter time (s): 1.010 samples/sec: 0.990 eta: 11m53s 
node04: Running eval
node04: before GAS splitting, batch size: 1, total tokens: 1024
node04: before GAS splitting, batch size: 1, total tokens: 1024
node04: before GAS splitting, batch size: 1, total tokens: 1024
node04: before GAS splitting, batch size: 1, total tokens: 1024
node04: [E ProcessGroupNCCL.cpp:474] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800240 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800281 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800325 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800357 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800409 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800447 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800493 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800526 milliseconds before timing out.
node04: node04:748072:749919 [5] NCCL INFO [Service thread] Connection closed by localRank 5
node04: node04:748071:749926 [4] NCCL INFO [Service thread] Connection closed by localRank 4
node04: node04:748073:749923 [6] NCCL INFO [Service thread] Connection closed by localRank 6
node04: node04:748070:749924 [3] NCCL INFO [Service thread] Connection closed by localRank 3
node04: node04:748074:749920 [7] NCCL INFO [Service thread] Connection closed by localRank 7
node01: node01:1559709:1560673 [0] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer

I checked the best_loss/ folder and it still contains a tmp/ directory after the timeout, so it seems the hang happens while the best_loss checkpoint is being saved.
I'm saving the checkpoints onto an NFS file system shared between the two nodes; I'm not sure whether this is causing the timeout. The training dataset is pretty small, only ~3 million tokens, and the evaluation dataset is about 0.1% the size of the training dataset. I found some threads discussing similar issues, huggingface/accelerate#314 (comment) and axolotl-ai-cloud/axolotl#967, but I'm not sure whether they are relevant.
Do you have any thoughts on the potential cause or fix?
Thank you so much!
Looking forward to your reply!

@tdrussell
Owner

I'm not sure what's going wrong. Unfortunately I don't have access to a multi-machine environment so I can't really debug anything. All the development I did for the code assumed single-machine training.

Are you using eval_before_first_step? Did it ever complete an eval and save the model? If training works but eval hangs at some point, I guess you would want to always trigger eval first and try to find what's wrong. You'd have to add prints/logs everywhere to figure out exactly which line of code it's hanging on. I would try to debug this myself, but without a multi-node setup there's no easy way for me to do that, so you're mostly on your own here.
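
As a generic starting point (nothing qlora-pipe-specific, and the PID below is a placeholder), something like this shows where each rank is stuck while the job hangs:

```bash
# Turn up distributed/NCCL logging before launching the run
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL

# While the job is hanging, dump the Python stack of the stuck rank on each node
pip install py-spy
py-spy dump --pid <pid>   # <pid> = PID of that rank's python process
```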

@iamhappytoo
Author

Hi @tdrussell, many thanks for your helpful reply! By adjusting the settings of the multi-machine environment, especially the InfiniBand and NCCL socket settings, to appropriate values, I can now run multi-node training with qlora-pipe. This confirms that train.py has no bugs related to multi-node training.
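
For anyone who hits the same thing, the settings involved look roughly like the sketch below; the interface and HCA names are just examples and depend on the cluster, so treat the values as placeholders. The same VAR=VALUE lines can also go in a .deepspeed_env file, which the DeepSpeed launcher exports on every node.

```bash
# Example NCCL settings; check `ip addr` for the NIC name and `ibstat` for the IB adapter.
export NCCL_SOCKET_IFNAME=eth0   # NIC for NCCL's socket/bootstrap traffic
export NCCL_IB_HCA=mlx5_0        # InfiniBand adapter(s) NCCL should use
export NCCL_IB_DISABLE=0         # keep InfiniBand enabled (1 forces TCP instead)
export NCCL_DEBUG=INFO           # verbose NCCL logging while debugging
```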

@jukofyork
Contributor

jukofyork commented Nov 5, 2024

I've got this working now too - it was surprisingly easy to set up:

  • You can point model = (and dataset_path =, if needed) at a local copy on each node, so there's no need to send the model over the network.
  • The output_dir = needs to be on shared storage.

I used this to help me setup the /job/hostfile file:

https://nlp.stanford.edu/mistral/tutorials/deepspeed.html

and you can setup Passwordless SSH as explained here:

https://linuxize.com/post/how-to-setup-passwordless-ssh-login/
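
which basically boils down to this (username and hostname are examples):

```bash
# On the launch node: create a key if needed, then copy it to every other node in the hostfile
ssh-keygen -t ed25519
ssh-copy-id user@node04
ssh user@node04 hostname   # verify it no longer prompts for a password
```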

You can also mix in pipeline_stages = to use pipeline parallelism within each node (don't try between-node pipeline parallelism unless you have something like InfiniBand, as the amount of data getting passed adds up quickly...).

It doesn't seem to use much network bandwidth for the LoRA all-reduce step, but for full fine-tuning it would probably require InfiniBand too.
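
Very roughly: full fine-tuning has to all-reduce one gradient per trainable parameter every optimizer step, which for a 70B model in bf16 is on the order of 140 GB of traffic spread across the data-parallel groups, whereas a LoRA adapter is usually well under a gigabyte. That's why ordinary Ethernet copes with the LoRA sync but full fine-tuning really wants InfiniBand.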

I'll try and put in a PR for the readme file to explain this better if I get the chance.

@kallewoof
Contributor

> I'll try and put in a PR for the readme file to explain this better if I get the chance.

That would be very cool. I am looking at multi node options so having docs would be very nice.

@jukofyork
Contributor

jukofyork commented Jan 16, 2025

> I'll try and put in a PR for the readme file to explain this better if I get the chance.
>
> That would be very cool. I am looking at multi node options so having docs would be very nice.

I'll try if I get time, but really just using the two articles above was all it took and it "just worked" (surprisingly!).

You do have to be careful about how much you are passing over the network if you only have consumer-grade Ethernet, and you may need to extend the default 30-minute DeepSpeed timeout if loading the models takes a long time (I found it was easier to just duplicate the models on each node at the same mount location).
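
(For reference, the 30 minutes is PyTorch's default distributed process-group timeout; as far as I know, deepspeed.init_distributed() accepts a timeout argument that it forwards to torch.distributed.init_process_group(), so that's the knob to raise if a save or load legitimately needs longer.)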
