This repository has been archived by the owner on Oct 26, 2022. It is now read-only.

About training with multi-gpu #79

Open
StillKeepTry opened this issue Jul 12, 2017 · 5 comments

Comments

@StillKeepTry

Hi, when I try to debug the code, I notice something strange. Every time I run the fconv model and print the loss of the first sample, the loss differs between runs on every thread except the thread whose ID is 1.

For example, the first time I run, the losses of the threads can be
33.596954345703, 42.148887634277.
However, the second time I run, the losses can be
33.596954345703, 41.906055450439.

Is this behavior normal?

@jgehring
Contributor

Can you clarify a bit? By "second time", do you mean the second sample in a worker thread, or the first sample when you start the training script a second time? Are you talking about this value here: https://github.com/facebookresearch/fairseq/blob/master/fairseq/torchnet/ResumableDPOptimEngine.lua#L375 ?

@StillKeepTry
Author

I mean starting the training script a second time.
Yes, that is the value I mean.

@jgehring
Contributor

And so the value is different for different threads? Is it just that the association from thread ID to loss changes, or do you get 4 completely different losses for 4 GPUs, for example?

@StillKeepTry
Author

I can give you some logs, with 2 GPUs and 3 GPUs.

The first time I ran the training script with 2 GPUs:

| [fr] Dictionary: 33350 types
| [en] Dictionary: 31497 types
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 10462375 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 6003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 3003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 6003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 3003 examples
1 33.596954345703
2 44.332809448242

The second time I ran the training script:

| [fr] Dictionary: 33350 types
| [en] Dictionary: 31497 types
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 10462375 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 6003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 3003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 6003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 3003 examples
2 44.683753967285
1 33.596954345703

With 3 GPUs, the first time I ran the training:
| [fr] Dictionary: 33350 types
| [en] Dictionary: 31497 types
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 10462375 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 6003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 3003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 6003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 3003 examples
1 461.95471191406
2 545.96984863281
3 601.23303222656

The second time:
| [fr] Dictionary: 33350 types
| [en] Dictionary: 31497 types
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 10462375 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 6003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 3003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 6003 examples
| IndexedDataset: loaded /home/kaitao/Downloads/bpe/data-bin with 3003 examples
2 557.49401855469
1 461.95471191406
3 603.85430908203

The first number is the thread ID. The training loss on thread 1 is the same across runs, but the losses on the other threads differ.

@StillKeepTry
Author

With a single GPU there is no problem. With multiple GPUs, the loss on every thread except thread 1 differs between runs. I also checked the data and made sure the samples were the same in every training run. You can also change ResumableDPOptimEngine.lua as in https://gist.github.com/StillKeepTry/61c76b6e377d9d17e103849079e2b1ff#file-test-lua-L375 to check the loss in each thread. I am not sure whether this behavior is expected.
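One pattern consistent with these logs is per-thread RNG seeding: if only one worker's random generator is seeded deterministically while the others draw their seeds from a nondeterministic source, anything stochastic in the forward pass (e.g. dropout masks) will be reproducible on the seeded worker and vary between runs on the rest. This is only a hypothesis about the cause, not fairseq's actual code; the function name and seed below are illustrative. A minimal Python sketch of the effect:

```python
import random

# Hypothetical sketch (not fairseq code): worker 1 gets a fixed seed,
# the other workers are seeded from OS entropy. Re-creating the workers
# then reproduces worker 1's random stream but not the others'.
def make_worker_rngs(num_workers, base_seed=1234):
    rngs = []
    for worker_id in range(1, num_workers + 1):
        if worker_id == 1:
            rngs.append(random.Random(base_seed))  # deterministic seed
        else:
            rngs.append(random.Random())           # nondeterministic seed
    return rngs

# Simulate two separate training runs, each drawing one "loss-affecting"
# random number per worker.
run_a = [rng.random() for rng in make_worker_rngs(3)]
run_b = [rng.random() for rng in make_worker_rngs(3)]

print(run_a[0] == run_b[0])  # True: worker 1 is reproducible across runs
print(run_a[1], run_b[1])    # almost certainly different values
```

If that is what is happening here, checking how ResumableDPOptimEngine.lua seeds each worker thread (and whether the seed depends on the thread ID or on wall-clock time) would be the place to look.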
