
multi-node training error during eval/best_loss checkpoint saving: Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2315, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808543 milliseconds before timing out. #19

Open
iamhappytoo opened this issue Jul 13, 2024 · 5 comments

Comments

@iamhappytoo

iamhappytoo commented Jul 13, 2024

Hi @tdrussell,

First of all, thank you so much for your helpful discussion in another issue earlier!
Now I am able to use qlora-pipe with DeepSpeed in a two-node environment with 12 × 80 GB GPUs for full-parameter tuning of a 70B model using the adamw_kahan optimizer.
I'm using a hostfile like this:
node01 slots=4
node04 slots=8
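
For reference, the launch command looks roughly like this (the config path is a placeholder for my actual config):

```bash
deepspeed --hostfile=/path/to/hostfile train.py --deepspeed --config my_config.toml
```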

The training works fine for the first epoch and the first several evaluation steps, but when it tries to save the best_loss checkpoint, it hangs for 30 minutes and then times out.

The error log looks like this:

node04: [2024-07-12 19:13:00.439] [INFO] [qlora-pipe] step:  1400 /  2112 loss: 0.8146 iter time (s): 1.010 samples/sec: 0.990 eta: 11m53s 
node04: Running eval
node04: before GAS splitting, batch size: 1, total tokens: 1024
node04: before GAS splitting, batch size: 1, total tokens: 1024
node04: before GAS splitting, batch size: 1, total tokens: 1024
node04: before GAS splitting, batch size: 1, total tokens: 1024
node04: [E ProcessGroupNCCL.cpp:474] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800240 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800281 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800325 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800357 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800409 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800447 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800493 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800526 milliseconds before timing out.
node04: node04:748072:749919 [5] NCCL INFO [Service thread] Connection closed by localRank 5
node04: node04:748071:749926 [4] NCCL INFO [Service thread] Connection closed by localRank 4
node04: node04:748073:749923 [6] NCCL INFO [Service thread] Connection closed by localRank 6
node04: node04:748070:749924 [3] NCCL INFO [Service thread] Connection closed by localRank 3
node04: node04:748074:749920 [7] NCCL INFO [Service thread] Connection closed by localRank 7
node01: node01:1559709:1560673 [0] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer

I checked the best_loss/ folder and it still contains a tmp/ directory after the timeout, so it seems the hang happens while the best_loss checkpoint is being saved.
I'm saving the checkpoints onto an NFS file system shared between the two nodes; I'm not sure whether this is causing the timeout. The training dataset is pretty small, only ~3 million tokens, and the evaluation dataset is about 0.1% the size of the training dataset. I found some threads discussing similar issues, huggingface/accelerate#314 (comment) and axolotl-ai-cloud/axolotl#967, but I'm not sure whether they are relevant.
Do you have any thoughts on the potential cause or fix?
Thank you so much!
Looking forward to your reply!

@tdrussell
Owner

I'm not sure what's going wrong. Unfortunately I don't have access to a multi-machine environment so I can't really debug anything. All the development I did for the code assumed single-machine training.

Are you using eval_before_first_step? Did it ever complete an eval and save the model? If training works but eval hangs at some point, I guess you would want to always trigger eval first and try to find what's wrong. You'd have to add prints/logs everywhere to figure out exactly which line of code it's hanging on. I would try to debug this myself, but without a multi-node setup there's no easy way for me to do that, so you're mostly on your own here.
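
As a generic starting point (nothing qlora-pipe-specific, and the PID below is a placeholder), something like this shows where each rank is stuck while the job hangs:

```bash
# Turn up distributed/NCCL logging before launching the run
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL

# While the job is hanging, dump the Python stack of the stuck rank on each node
pip install py-spy
py-spy dump --pid <pid>   # <pid> = PID of that rank's python process
```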

@iamhappytoo
Author

Hi @tdrussell, many thanks for your helpful reply! By adjusting the settings of the multi-machine environment, especially the InfiniBand and NCCL socket settings, to appropriate values, I can now run multi-node training with qlora-pipe. This confirms that train.py has no bugs related to multi-node training.
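
For anyone who hits the same thing, the settings involved look roughly like the sketch below; the interface and HCA names are just examples and depend on the cluster, so treat the values as placeholders. The same VAR=VALUE lines can also go in a .deepspeed_env file, which the DeepSpeed launcher exports on every node.

```bash
# Example NCCL settings; check `ip addr` for the NIC name and `ibstat` for the IB adapter.
export NCCL_SOCKET_IFNAME=eth0   # NIC for NCCL's socket/bootstrap traffic
export NCCL_IB_HCA=mlx5_0        # InfiniBand adapter(s) NCCL should use
export NCCL_IB_DISABLE=0         # keep InfiniBand enabled (1 forces TCP instead)
export NCCL_DEBUG=INFO           # verbose NCCL logging while debugging
```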

@jukofyork
Contributor

jukofyork commented Nov 5, 2024

I've got this working now too - it was surprisingly easy to set up:

  • You can point model = (and dataset_path =, if needed) at a local copy on each node, so there's no need to send the model over the network.
  • The output_dir = needs to be on shared storage.

I used this to help me setup the /job/hostfile file:

https://nlp.stanford.edu/mistral/tutorials/deepspeed.html

and you can setup Passwordless SSH as explained here:

https://linuxize.com/post/how-to-setup-passwordless-ssh-login/
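
which basically boils down to this (username and hostname are examples):

```bash
# On the launch node: create a key if needed, then copy it to every other node in the hostfile
ssh-keygen -t ed25519
ssh-copy-id user@node04
ssh user@node04 hostname   # verify it no longer prompts for a password
```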

You can also mix in pipeline_stages = to use pipeline parallelism within each node (don't try between-node pipeline parallelism unless you have something like InfiniBand, as the amount of data getting passed adds up quickly...).

It doesn't seem to use much network bandwidth for the LoRA all-reduce step, but for full fine-tuning it would probably require InfiniBand too.
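
Very roughly: full fine-tuning has to all-reduce one gradient per trainable parameter every optimizer step, which for a 70B model in bf16 is on the order of 140 GB of traffic spread across the data-parallel groups, whereas a LoRA adapter is usually well under a gigabyte. That's why ordinary Ethernet copes with the LoRA sync but full fine-tuning really wants InfiniBand.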

I'll try and put in a PR for the readme file to explain this better if I get the chance.

@kallewoof
Contributor

> I'll try and put in a PR for the readme file to explain this better if I get the chance.

That would be very cool. I am looking at multi node options so having docs would be very nice.

@jukofyork
Contributor

jukofyork commented Jan 16, 2025

> I'll try and put in a PR for the readme file to explain this better if I get the chance.
>
> That would be very cool. I am looking at multi node options so having docs would be very nice.

I'll try if I get time, but really just using the two articles above was all it took and it "just worked" (surprisingly!).

You do have to be careful about how much you are passing over the network if you only have consumer-grade Ethernet, and you may need to extend the default 30-minute DeepSpeed timeout if loading the models takes a long time (I found it was easier to just duplicate the models on each node at the same mount location).
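
(For reference, the 30 minutes is PyTorch's default distributed process-group timeout; as far as I know, deepspeed.init_distributed() accepts a timeout argument that it forwards to torch.distributed.init_process_group(), so that's the knob to raise if a save or load legitimately needs longer.)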
