
training speed #31

Open · hhhddddddd opened this issue Apr 13, 2024 · 1 comment
@hhhddddddd

Hello, I have a strange problem with training time.

I executed `evc-train -c configs/exps/4k4d/4k4d_0013_01_r4.yaml,configs/specs/static.yaml,configs/specs/tiny.yaml exp_name=4k4d_0013_01_r4_static` on an NVIDIA GeForce RTX 4090,
but it takes about 40 minutes to train a single frame.
[screenshot]

It's even worse when I execute `evc-train -c configs/exps/4k4d/4k4d_0013_01_r4.yaml`: it takes about 4 days to train all frames (NVIDIA GeForce RTX 4090).
I also observed a strange phenomenon during training: when I ran a 4K4D training experiment on the 4090, the gpustat command showed two experiments running.
[screenshot]
(The same is true on the 4090.)

In addition, the training PSNR of 4k4d_0013_01_r4_static also fails to reach about 30.
[screenshot]

Can you give me any advice? Thank you so much for all your help!

@dendenxu (Member) commented Jun 7, 2024

Hi, first of all, thanks for using our code! Sorry for the late reply.

For the dynamic dataset, the released default config trains for 800k iterations (defined in r4dv.yaml with the epochs parameter). It typically only requires 400k iterations (epochs=800) to converge. Another thing to note is that we test the training speed without evaluation (runner_cfg.eval_ep=800) and report training metrics only every 100 iterations (runner_cfg.log_interval=100) to reflect the real training time.
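For concreteness, here is a minimal sketch of a dynamic-scene run with those overrides passed on the command line in the same key=value style used above. Whether `epochs` needs the `runner_cfg.` prefix depends on how r4dv.yaml nests the parameter, so treat the exact keys as assumptions and check the config:

```shell
# Sketch: train the dynamic scene for ~400k iterations (epochs=800),
# defer evaluation to the last epoch, and log training metrics every
# 100 iterations. The runner_cfg. prefix on epochs is an assumption;
# match it to how r4dv.yaml actually defines the parameter.
evc-train -c configs/exps/4k4d/4k4d_0013_01_r4.yaml \
    runner_cfg.epochs=800 \
    runner_cfg.eval_ep=800 \
    runner_cfg.log_interval=100
```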

The same story goes for the static scene. It only takes 2-3k iterations to converge.
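Applying the same numbers to the static run: if one epoch is roughly 500 iterations (inferred from epochs=800 corresponding to 400k iterations above), 2-3k iterations is about 4-6 epochs. A hedged sketch, with the epoch count as an illustrative guess rather than a released default:

```shell
# Sketch: single-frame static run stopped after roughly 3k iterations.
# epochs=6 assumes ~500 iterations per epoch, which is only inferred
# from the dynamic-scene numbers above; adjust after checking the config.
evc-train -c configs/exps/4k4d/4k4d_0013_01_r4.yaml,configs/specs/static.yaml,configs/specs/tiny.yaml \
    exp_name=4k4d_0013_01_r4_static \
    runner_cfg.epochs=6 \
    runner_cfg.eval_ep=6 \
    runner_cfg.log_interval=100
```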

The iteration speed looks fine (60-70ms), though. I'm not sure about the cause of the two experiments showing up; the VRAM usage seems OK.

Another way to speed up the training is to use our latest CUDA-backend implementation; you can enable it via this option: https://github.com/zju3dv/4K4D/blob/712eccb0e0eeef744c19eb221cfb424a2915b474/easyvolcap/models/samplers/r4dv_sampler.py#L43C18-L43C27
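A sketch of what toggling that sampler option from the command line could look like, assuming the sampler is configured under model_cfg.sampler_cfg as in other EasyVolcap configs; the option name below is a hypothetical placeholder, so substitute the actual attribute defined at the linked line of r4dv_sampler.py:

```shell
# Sketch: enable the CUDA-backend implementation of the sampler.
# `use_cuda_backend` is a HYPOTHETICAL placeholder; use the real attribute
# name from r4dv_sampler.py line 43 (linked above).
evc-train -c configs/exps/4k4d/4k4d_0013_01_r4.yaml \
    model_cfg.sampler_cfg.use_cuda_backend=True
```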

As for the training PSNR, the 0013_01 scene is the hardest of the four DNA-Rendering scenes, so its training PSNR is slightly lower.
