Using multiple GPUs to accomplish distributed training #30

EddieEduardo · 2023-03-18T16:10:16Z

Hello! Thanks for sharing this excellent work!

When I run the code on multiple GPUs with nn.parallel.DistributedDataParallel, it always raises the following error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

However, when I run on a single GPU, no error is raised. I am confused...
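
For reference, here is a minimal sketch of how the hint in the error message could be applied. It is not this repository's code: the toy tensor and the function name `reproduce_inplace_error` are hypothetical stand-ins, and it runs on CPU in a single process rather than under DistributedDataParallel. It only shows the general mechanism, where a tensor saved for the backward pass is modified in place, and how anomaly detection adds a forward-pass traceback pointing at the operation whose gradient could not be computed.

```python
import torch


def reproduce_inplace_error():
    """Trigger the 'modified by an inplace operation' error on a toy tensor.

    Hypothetical stand-in for the real model: exp() saves its output for the
    backward pass, so changing that output in place invalidates the graph.
    """
    w = torch.ones(2, requires_grad=True)
    # Anomaly detection records forward-pass tracebacks so the failing
    # operation can be located; it slows training, so use it for debugging only.
    with torch.autograd.detect_anomaly():
        y = w.exp()      # the output of exp() is saved for gradient computation
        y.add_(1.0)      # in-place update bumps the version counter of y
        try:
            y.sum().backward()
        except RuntimeError as err:
            return str(err)
    return None


message = reproduce_inplace_error()
print(message)
```

In an actual training script, the same effect can be obtained by calling torch.autograd.set_detect_anomaly(True) once before the training loop, as the error message suggests.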

EddieEduardo · 2023-03-20T01:43:33Z

Hi, when I run the code, the loss of each part looks like this. Are these values correct?
Thanks in advance for replying.
