About training with multi-gpu #79
Can you clarify a bit? With "second time", do you mean the second sample in a worker thread, or the first sample when you start the training script a second time? Are you talking about this value here: https://github.com/facebookresearch/fairseq/blob/master/fairseq/torchnet/ResumableDPOptimEngine.lua#L375 ?
It means I start the training script a second time.
And so the value is different for different threads? Is it just that the association from thread ID to loss changes, or do you get 4 completely different losses for 4 GPUs, for example?
I can give you some logs, with 2 GPUs and 3 GPUs. The first time I run the training script with 2 GPUs:
The second time I run the training script:
With 3 GPUs, the second time: the first number is the thread ID. The training loss on thread 1 is the same, but the training loss on the other threads is different.
With a single GPU there is no problem. With multiple GPUs, the loss in every thread except thread 1 differs between runs. I also checked the data and confirmed that the samples are the same in every training run. You can also change ResumableDPOptimEngine.lua like this https://gist.github.com/StillKeepTry/61c76b6e377d9d17e103849079e2b1ff#file-test-lua-L375 to check the loss in each thread. I am not sure whether this behavior is expected.
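For reference, here is a minimal sketch of the kind of per-thread logging change described above. It is only an illustration: the names `threadId`, `sampleId`, and `loss` are assumptions and may not match the actual fields used in ResumableDPOptimEngine.lua or the linked gist.

```lua
-- Hypothetical helper: print the loss computed in each worker thread so that
-- the values can be compared across runs. The surrounding engine code is
-- assumed to already have the thread id, sample index, and loss available.
local function logThreadLoss(threadId, sampleId, loss)
   print(string.format('thread %d | sample %d | loss %.12f',
      threadId, sampleId, loss))
end

-- Illustrative call site: inside the per-thread closure, right after the
-- criterion has produced the loss for the current sample. In torch 'threads'
-- worker code, the thread id is available as __threadid.
-- logThreadLoss(__threadid, sampleId, loss)
```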
Hi, when I try to debug the code, I find a strange situation. Every time I run the fconv model and print the loss for the first sample, the loss in every thread differs between runs, except for the thread with thread ID 1.
For example, the first time I run, the per-thread losses can be
33.596954345703, 42.148887634277.
However, the second time I run, the losses can be
33.596954345703, 41.906055450439.
Is this behavior expected?
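One diagnostic worth trying (my own suggestion, not something done in the report above) is to pin the random seeds for the CPU and for every GPU before training starts, to see whether the run-to-run differences disappear. A minimal sketch, assuming stock torch/cutorch; `baseSeed` is an arbitrary illustrative value:

```lua
-- Minimal sketch (assumption): seed the CPU RNG and the RNG on every visible
-- GPU explicitly, so that any remaining run-to-run loss differences cannot
-- come from unseeded random state.
require 'torch'
require 'cutorch'

local baseSeed = 1234
torch.manualSeed(baseSeed)        -- CPU RNG
cutorch.manualSeedAll(baseSeed)   -- RNG on every visible GPU
```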