This repository has been archived by the owner on Oct 26, 2022. It is now read-only.

About training with multi-gpu #79

Open
StillKeepTry opened this issue Jul 12, 2017 · 5 comments

Comments

@StillKeepTry

Hi, when I try to debug the code, I notice something strange. Every time I run the fconv model and print the loss of the first sample, the loss differs between runs on every thread except the thread whose ID is 1.

For example, the first time I run, the losses of the threads can be
33.596954345703, 42.148887634277.
However, the second time I run, the losses can be
33.596954345703, 41.906055450439.

Is this behavior normal?

@jgehring
Contributor

Can you clarify a bit? By "second time", do you mean the second sample in a worker thread, or the first sample when you start the training script a second time? Are you talking about this value here: https://github.com/facebookresearch/fairseq/blob/master/fairseq/torchnet/ResumableDPOptimEngine.lua#L375 ?

@StillKeepTry
Author

I mean starting the training script a second time.
Yes, that is the value I mean.

@jgehring
Contributor

And so the value is different for different threads? Is it just that the association from thread ID to loss changes, or do you get 4 completely different losses for 4 GPUs, for example?

@StillKeepTry
Author

I can give you some logs, with 2 GPUs and 3 GPUs.

The first time I ran the training script with 2 GPUs:

| [fr] Dictionary: 33350 types
| [en] Dictionary: 31497 types
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 10462375 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 6003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 3003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 6003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 3003 examples
1 33.596954345703
2 44.332809448242

The second time I ran the training script:

| [fr] Dictionary: 33350 types
| [en] Dictionary: 31497 types
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 10462375 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 6003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 3003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 6003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 3003 examples
2 44.683753967285
1 33.596954345703

With 3 GPUs, the first time I ran the training:
| [fr] Dictionary: 33350 types
| [en] Dictionary: 31497 types
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 10462375 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 6003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 3003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 6003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 3003 examples
1 461.95471191406
2 545.96984863281
3 601.23303222656

The second time:
| [fr] Dictionary: 33350 types
| [en] Dictionary: 31497 types
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 10462375 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 6003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 3003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 6003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 3003 examples
2 557.49401855469
1 461.95471191406
3 603.85430908203

The first number is the thread ID. The training loss on thread 1 is the same across runs, but the losses on the other threads differ.

@StillKeepTry
Author

With a single GPU there is no problem. With multiple GPUs, the loss on every thread except thread 1 differs between runs. I also checked the data and made sure the samples were the same in every training run. You can also change ResumableDPOptimEngine.lua as in https://gist.github.com/StillKeepTry/61c76b6e377d9d17e103849079e2b1ff#file-test-lua-L375 to check the loss in each thread. I am not sure whether this behavior is expected.
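One pattern consistent with these logs is per-thread RNG seeding: if only one worker's random generator is seeded deterministically while the others draw their seeds from a nondeterministic source, anything stochastic in the forward pass (e.g. dropout masks) will be reproducible on the seeded worker and vary between runs on the rest. This is only a hypothesis about the cause, not fairseq's actual code; the function name and seed below are illustrative. A minimal Python sketch of the effect:

```python
import random

# Hypothetical sketch (not fairseq code): worker 1 gets a fixed seed,
# the other workers are seeded from OS entropy. Re-creating the workers
# then reproduces worker 1's random stream but not the others'.
def make_worker_rngs(num_workers, base_seed=1234):
    rngs = []
    for worker_id in range(1, num_workers + 1):
        if worker_id == 1:
            rngs.append(random.Random(base_seed))  # deterministic seed
        else:
            rngs.append(random.Random())           # nondeterministic seed
    return rngs

# Simulate two separate training runs, each drawing one "loss-affecting"
# random number per worker.
run_a = [rng.random() for rng in make_worker_rngs(3)]
run_b = [rng.random() for rng in make_worker_rngs(3)]

print(run_a[0] == run_b[0])  # True: worker 1 is reproducible across runs
print(run_a[1], run_b[1])    # almost certainly different values
```

If that is what is happening here, checking how ResumableDPOptimEngine.lua seeds each worker thread (and whether the seed depends on the thread ID or on wall-clock time) would be the place to look.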
