
training speed #31

Open · hhhddddddd opened this issue Apr 13, 2024 · 1 comment
@hhhddddddd

Hello, I have a strange problem with training time.

I executed `evc-train -c configs/exps/4k4d/4k4d_0013_01_r4.yaml,configs/specs/static.yaml,configs/specs/tiny.yaml exp_name=4k4d_0013_01_r4_static` on an NVIDIA GeForce RTX 4090,
but it takes about 40 minutes to train a single frame.
[screenshot]

It's even worse when I execute `evc-train -c configs/exps/4k4d/4k4d_0013_01_r4.yaml`: it takes about 4 days to train all frames (NVIDIA GeForce RTX 4090).
I also observed a strange phenomenon during training: when I ran a 4K4D training experiment on the 4090, the gpustat command showed two experiments running.
[screenshot]
(The same is true on the 4090.)

In addition, the training PSNR of 4k4d_0013_01_r4_static also fails to reach about 30.
[screenshot]

Can you give me any advice? Thank you so much for all your help!

@dendenxu (Member) commented Jun 7, 2024

Hi, first of all, thanks for using our code! Sorry for the late reply.

For the dynamic dataset, the released default config trains for 800k iterations (defined in r4dv.yaml with the epochs parameter). It typically only requires 400k iterations (epochs=800) to converge. Another thing to note is that we test the training speed without evaluation (runner_cfg.eval_ep=800) and report training metrics only every 100 iterations (runner_cfg.log_interval=100) to reflect the real training time.
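For concreteness, here is a minimal sketch of a dynamic-scene run with those overrides passed on the command line in the same key=value style used above. Whether `epochs` needs the `runner_cfg.` prefix depends on how r4dv.yaml nests the parameter, so treat the exact keys as assumptions and check the config:

```shell
# Sketch: train the dynamic scene for ~400k iterations (epochs=800),
# defer evaluation to the last epoch, and log training metrics every
# 100 iterations. The runner_cfg. prefix on epochs is an assumption;
# match it to how r4dv.yaml actually defines the parameter.
evc-train -c configs/exps/4k4d/4k4d_0013_01_r4.yaml \
    runner_cfg.epochs=800 \
    runner_cfg.eval_ep=800 \
    runner_cfg.log_interval=100
```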

The same story goes for the static scene. It only takes 2-3k iterations to converge.
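Applying the same numbers to the static run: if one epoch is roughly 500 iterations (inferred from epochs=800 corresponding to 400k iterations above), 2-3k iterations is about 4-6 epochs. A hedged sketch, with the epoch count as an illustrative guess rather than a released default:

```shell
# Sketch: single-frame static run stopped after roughly 3k iterations.
# epochs=6 assumes ~500 iterations per epoch, which is only inferred
# from the dynamic-scene numbers above; adjust after checking the config.
evc-train -c configs/exps/4k4d/4k4d_0013_01_r4.yaml,configs/specs/static.yaml,configs/specs/tiny.yaml \
    exp_name=4k4d_0013_01_r4_static \
    runner_cfg.epochs=6 \
    runner_cfg.eval_ep=6 \
    runner_cfg.log_interval=100
```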

The iteration speed looks fine (60-70ms), though. I'm not sure about the cause of the two experiments showing up; the VRAM usage seems OK.

Another way to speed up the training is to use our latest CUDA-backend implementation; you can enable it via this option: https://github.com/zju3dv/4K4D/blob/712eccb0e0eeef744c19eb221cfb424a2915b474/easyvolcap/models/samplers/r4dv_sampler.py#L43C18-L43C27
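A sketch of what toggling that sampler option from the command line could look like, assuming the sampler is configured under model_cfg.sampler_cfg as in other EasyVolcap configs; the option name below is a hypothetical placeholder, so substitute the actual attribute defined at the linked line of r4dv_sampler.py:

```shell
# Sketch: enable the CUDA-backend implementation of the sampler.
# `use_cuda_backend` is a HYPOTHETICAL placeholder; use the real attribute
# name from r4dv_sampler.py line 43 (linked above).
evc-train -c configs/exps/4k4d/4k4d_0013_01_r4.yaml \
    model_cfg.sampler_cfg.use_cuda_backend=True
```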

As for the training PSNR, the 0013_01 scene is the hardest of the four DNA-Rendering scenes, so its training PSNR is slightly lower.
