
Distributed Training Slower #21

Open
akshay-bapat-magna opened this issue Jan 4, 2023 · 11 comments

Comments

@akshay-bapat-magna

Hi,

I have two RTX A6000 GPUs available for training (device IDs 0 and 1).
I run the GDRN training as: "./core/gdrn_modeling/train_gdrn.sh <config_file> 0,1". The training starts as usual, but it is much slower (takes almost twice as long) than when I use just one GPU. The terminal also shows this warning:
"[W reducer.cpp:313] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance."
Please note that there are no errors in the output; training is just far too slow.
Can anyone help me with this issue?
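For reference, the warning quoted above comes from DDP's gradient bucketing: some parameter gradients are produced with a memory layout (strides) different from the one DDP recorded when it built its buckets, which forces extra copies. Below is a minimal, generic DDP sketch with one commonly suggested mitigation (keeping the model and inputs in a single, contiguous memory format). This is not the GDRNPP training code; the model, optimizer, and launch method are placeholders for illustration.

```python
# Generic PyTorch DDP sketch (not the GDRNPP code). Shows where the
# "Grad strides do not match bucket view strides" warning originates and a
# commonly suggested mitigation: keep parameters/inputs in one memory format.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                 # launched via torchrun, one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1)
    ).to(local_rank)
    # Keep everything in the default contiguous (NCHW) layout; mixing
    # channels_last and contiguous tensors is a frequent cause of the warning.
    model = model.to(memory_format=torch.contiguous_format)

    ddp_model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    x = torch.randn(8, 3, 64, 64, device=local_rank).contiguous()
    loss = ddp_model(x).mean()
    loss.backward()                                 # grads are copied into DDP buckets here
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```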

@shanice-l
Owner

That's odd, since we've trained the network on two 2080Ti GPUs and the speed was about 2x faster than training on a single 2080Ti.

@akshay-bapat-magna
Author

Are there any other changes required to run distributed training, apart from specifying multiple device IDs? For example, in the config file or somewhere else?

@shanice-l
Owner

No, no extra changes are required.

@akshay-bapat-magna
Author

Here is some more information: if I train a YOLO model on two GPUs, I see a big jump in speed. It is only when I train GDRNPP on two GPUs that I see a drop in speed.

@shanice-l
Owner

shanice-l commented Jan 10, 2023

I assume the problem lies with the EGL renderer. You can try generating the XYZ coordinate maps offline.
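For context, the XYZ coordinate map is a per-pixel map of 3D object-frame coordinates used as a regression target; producing it online requires a renderer (EGL) inside the data pipeline. The sketch below illustrates the idea of precomputing such a map from a rendered depth image, the camera intrinsics K, and the object pose (R, t), then caching it to disk. It is only an illustration under those assumptions, not the repository's actual offline-generation tool (the thread later references #23 for that), and all values in the `__main__` block are dummies.

```python
# Illustrative sketch (not the repo's tool): build an XYZ coordinate map from a
# rendered depth image plus the camera intrinsics K and object pose (R, t),
# then cache it to disk so the dataloader can skip online EGL rendering.
import numpy as np

def depth_to_xyz_map(depth, K, R, t):
    """depth: (H, W) in the same metric unit as t; K: (3, 3); R: (3, 3); t: (3,).
    Returns an (H, W, 3) map of object-frame coordinates (zeros where depth == 0)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)   # (H, W, 3)
    # Back-project each pixel to a camera-frame point: p_cam = depth * K^-1 [u, v, 1]^T
    cam = (pix @ np.linalg.inv(K).T) * depth[..., None]
    # Transform into the object frame: p_obj = R^T (p_cam - t)
    obj = (cam - t.reshape(1, 1, 3)) @ R     # row-vector form of R^T (p_cam - t)
    obj[depth == 0] = 0.0
    return obj.astype(np.float32)

if __name__ == "__main__":
    # Dummy inputs purely for illustration.
    depth = np.full((480, 640), 0.5, dtype=np.float32)
    K = np.array([[572.4, 0.0, 325.3], [0.0, 573.6, 242.0], [0.0, 0.0, 1.0]])
    R, t = np.eye(3), np.array([0.0, 0.0, 0.5])
    np.save("xyz_000000.npy", depth_to_xyz_map(depth, K, R, t))  # cached offline
```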

@CHDrago

CHDrago commented Mar 12, 2023

Hi, I want to ask how to generate the XYZ coordinate map offline.

@ustblogistics87

@shanice-l Hello, may I also ask a question here? I am training on the ycbv dataset with two RTX 3090s, with the config set to IMS_PER_BATCH=48 and TOTAL_EPOCHS=40, and the displayed estimated training time is 14 days. Likewise, training the tless dataset on two 3090s with TOTAL_EPOCHS reduced to 8 still takes about 2 days to finish; probably because the number of epochs is too small, the test results differ greatly from the officially reported ones.
Is such a training time normal? Even when training a single object with the tlessSO config, training still takes quite long. According to #23, how much speedup can I expect after generating the xyz maps offline?
Looking forward to your reply.

@CHDrago

CHDrago commented Mar 13, 2023

Hi, I have the same problem as you: the training time is very long. Would training a separate model for each individual class shorten the time, so that each single model only needs to be fine-tuned?

@ustblogistics87

Hi, I have the same problem as you: the training time is very long. Would training a separate model for each individual class shorten the time, so that each single model only needs to be fine-tuned?

Training one model per class, taking tlesspbrSO as an example, shows a training time of about 20 hours on two 3090s. Presumably every object class would need to be trained separately that way.

@shanice-l
Owner

The speed bottleneck is the CPU, not the GPU: organizing the data inside the dataloader takes a long time, while the GPU inference time is comparatively short. You can try using a better CPU or increasing num_of_workers.
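For reference, below is a generic PyTorch illustration of the dataloader tuning suggested above. The exact knob in this repository is set through the training config rather than a raw DataLoader call, so the dataset and parameter values here are placeholders, not the project's actual settings.

```python
# Generic PyTorch sketch of the dataloader tuning suggested above; the actual
# GDRNPP config key for the worker count may differ, so these are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3, 64, 64), torch.randint(0, 10, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=48,
    shuffle=True,
    num_workers=8,            # more CPU worker processes preparing batches in parallel
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # avoid respawning workers every epoch
    prefetch_factor=4,        # batches each worker prepares ahead of time
)

for images, labels in loader:
    pass  # the training step would go here
```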

@shanice-l
Owner

@shanice-l Hello, may I also ask a question here? I am training on the ycbv dataset with two RTX 3090s, with the config set to IMS_PER_BATCH=48 and TOTAL_EPOCHS=40, and the displayed estimated training time is 14 days. Likewise, training the tless dataset on two 3090s with TOTAL_EPOCHS reduced to 8 still takes about 2 days to finish; probably because the number of epochs is too small, the test results differ greatly from the officially reported ones. Is such a training time normal? Even when training a single object with the tlessSO config, training still takes quite long. According to #23, how much speedup can I expect after generating the xyz maps offline? Looking forward to your reply.

Please don't piggyback on someone else's issue; open a new one instead, otherwise I don't receive the e-mail notification.
