multi-node training error during eval/best_loss checkpoint saving: Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2315, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808543 milliseconds before timing out. #19
Comments
I'm not sure what's going wrong. Unfortunately I don't have access to a multi-machine environment, so I can't really debug anything; all the development I did for this code assumed single-machine training. Are you using eval_before_first_step? Did it ever complete an eval and save the model? If training works but eval hangs at some point, I would always trigger eval first and try to find what's wrong. You'd have to add prints/logs everywhere to figure out exactly which line of code it's hanging at. I would try to debug this myself, but without a multi-node setup there's no easy way for me to do that, so you're mostly on your own here.
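For anyone who ends up adding those prints, a minimal rank-tagged helper like the one below is one way to do it; the function name and the call sites are only illustrative, not something that exists in the repo:

import sys
import torch.distributed as dist

def checkpoint_log(msg):
    # Tag each message with the rank and flush immediately so output
    # isn't lost in buffers when the process hangs.
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"[rank {rank}] {msg}", flush=True)

# e.g. around the suspect code in train.py:
# checkpoint_log("before eval")
# checkpoint_log("before best_loss save")
# checkpoint_log("after best_loss save")

The rank that stops printing first (or the message every rank prints last) tells you which collective the others are stuck waiting on.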
Hi @tdrussell, many thanks for your helpful reply! By adjusting the multi-machine environment settings, especially the InfiniBand and NCCL socket settings, to appropriate values, I can now run multi-node training with qlora-pipe. This confirms that train.py has no bugs related to multi-node training.
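For anyone else hitting similar NCCL setup problems, these are the kinds of settings involved; the interface and HCA names below are placeholders and will differ on your cluster. With the DeepSpeed launcher they can go in a .deepspeed_env file so they get exported on every node:

NCCL_SOCKET_IFNAME=eth0    # replace with the NIC that connects the nodes (e.g. ens3, bond0)
NCCL_IB_HCA=mlx5_0         # only relevant if you have InfiniBand; device name is cluster-specific
NCCL_DEBUG=INFO            # verbose NCCL logging while debugging
# NCCL_IB_DISABLE=1        # uncomment to force TCP if InfiniBand isn't available/working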
I've got this working now too - it was surprisingly easy to set up:
I used this guide to help me set up multi-node DeepSpeed: https://nlp.stanford.edu/mistral/tutorials/deepspeed.html, and you can set up passwordless SSH as explained here: https://linuxize.com/post/how-to-setup-passwordless-ssh-login/. It doesn't seem to use much network bandwidth for the LoRA all-reduce step, but full fine-tuning would probably require InfiniBand too. I'll try to put in a PR for the readme file to explain this better if I get the chance.
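For reference, the multi-node launch is basically the single-node one with a hostfile; the train.py flags below are placeholders from memory of the repo's README, so check it for the exact invocation:

deepspeed --hostfile=/path/to/hostfile train.py --deepspeed --config your_config.toml

The hostfile just lists each node and how many GPUs (slots) it contributes, as in the example later in this thread, and the launcher reaches the other nodes over the passwordless SSH you set up.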
That would be very cool. I am looking at multi-node options, so having docs would be very nice.
I'll try if I get time, but really, just using the two articles above was all it took and it "just worked" (surprisingly!). You do have to be careful about how much you are passing over the network if you have consumer-grade Ethernet, and you may need to extend the default 30-minute DeepSpeed timeout when loading models (I found it was easier to just duplicate the models on each node at the same mount location).
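On the timeout point: if you do need more than the 30-minute default, the process-group timeout can be raised where the distributed backend is initialized. A rough sketch, assuming train.py goes through deepspeed.init_distributed (I haven't checked the exact call site):

from datetime import timedelta
import deepspeed

# Raise the collective timeout so slow model loading or slow checkpoint
# writes don't trip the NCCL watchdog (the default is 30 minutes).
deepspeed.init_distributed(dist_backend="nccl", timeout=timedelta(hours=2))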
Hi @tdrussell,
First of all, thank you so much for your helpful discussion in another issue earlier!
I am now able to use qlora-pipe with DeepSpeed in a two-node environment with 12 x 80 GB GPUs for full-parameter tuning of a 70B model using the adamw_kahan optimizer.
I'm using the hostfile like this:
node01 slots=4
node04 slots=8
Training works fine for the first epoch and the first several evaluation steps, but when it tries to save the best_loss checkpoint, it hangs for 30 minutes and then times out.
The error log looks like this:
I checked best_loss/ and the folder still contains a tmp/ subfolder after the timeout, so it seems the best_loss save is what was in progress during the hang.
I'm saving the checkpoints onto an NFS file system shared between the two nodes; I'm not sure if this is causing the timeout. The training dataset is pretty small, only ~3 million tokens, and the evaluation dataset is about 0.1% of the size of the training dataset. I found some threads discussing similar things: huggingface/accelerate#314 (comment) and axolotl-ai-cloud/axolotl#967. Not sure if they are relevant.
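One quick way to rule the NFS mount in or out would be to time a checkpoint-sized write to it from each node outside of training; a rough standalone sketch (the path is a placeholder for your shared mount):

import time
import torch

path = "/mnt/shared/nfs_write_test.pt"  # placeholder: your NFS mount point
blob = torch.empty(512 * 1024 * 1024, dtype=torch.float32)  # roughly 2 GB of data
start = time.time()
torch.save(blob, path)
print(f"wrote ~2 GB to NFS in {time.time() - start:.1f} s")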
Do you have any thoughts about the potential reason/fix?
Thank you so much!
Looking forward to your reply!