Hello,
I'm starting to train VoiceCraft on a custom dataset. I have a different hardware setup (an L4 GPU instead of an A40), so I'm adjusting the training configuration.
I noticed that you use an unusually large number of gradient accumulation steps (12), and when you backpropagate, it looks like the loss isn't averaged over the accumulation steps:
VoiceCraft/steps/trainer.py
Lines 87 to 91 in 4873249
VoiceCraft/steps/trainer.py
Lines 138 to 141 in 4873249
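To make sure I'm describing the concern correctly, here is a minimal sketch (not the actual trainer.py code; the tiny linear model, random data, and `accum_steps` split are placeholders) of why the accumulated gradient grows with the number of accumulation steps when the per-step loss isn't divided by that number:

```python
# Minimal sketch, assuming the per-step loss is a mean over its micro-batch.
# Toy model/data are placeholders, not VoiceCraft's trainer.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
x, y = torch.randn(96, 8), torch.randn(96, 1)

def grad_norm_after_accumulation(accum_steps, average):
    model.zero_grad()
    per_step = 96 // accum_steps                    # smaller micro-batch as accum_steps grows
    for i in range(accum_steps):
        xb = x[i * per_step:(i + 1) * per_step]
        yb = y[i * per_step:(i + 1) * per_step]
        loss = torch.nn.functional.mse_loss(model(xb), yb)  # mean over the micro-batch
        if average:
            loss = loss / accum_steps               # keeps the update scale constant
        loss.backward()                             # .backward() sums gradients across calls
    return model.weight.grad.norm().item()

for steps in (4, 12):
    print(steps, grad_norm_after_accumulation(steps, average=False))
# Without the division, the accumulated gradient (and hence the effective LR)
# grows roughly in proportion to accum_steps.
```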
Does this mean the backpropagated loss becomes proportional to the number of gradient accumulation steps? Say you are doing 12 steps now on an A40 GPU with 48 GB of memory; since I'm using an L4 GPU with 24 GB, I need to halve the per-step batch size and double the accumulation steps. If the loss isn't averaged, that effectively doubles the gradient magnitude, so I'd have to drop the LR by 2 to keep the runs equivalent.
Alternatively, I've been reworking the dynamic sampler and I'm able to fit the same 20,000 audio tokens per update on 8 L4 GPUs in 4 accumulation steps instead of 12. If I don't adjust the LR, that means the effective LR would drop to 1/3. A back-of-the-envelope check of both scenarios is below.
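This assumes the per-step loss is a mean over its micro-batch and the per-step losses are summed (not averaged) across accumulation steps; `base_lr` is a placeholder, not the repo's actual setting:

```python
# Effective-LR scaling under the assumption above.
base_lr = 1e-4          # placeholder value
baseline_steps = 12     # A40 config

for new_steps in (24, 4):   # 24: halve batch, double steps; 4: reworked dynamic sampler
    scale = new_steps / baseline_steps
    print(f"{new_steps} steps -> effective LR scale x{scale:.2f} ({base_lr * scale:.2e})")
# 24 steps -> x2.00, so I'd drop the LR by 2 to match the A40 run;
# 4 steps  -> x0.33, i.e. the effective LR drops to 1/3 if left unchanged.
```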
What do you think?
Thanks