English latest checkpoint #97

TalapMukhamejan · 2024-07-14T18:39:34Z

Hello, thank you for writing this cool repo.
Could you share the latest LJ Speech model? I have been struggling to find any VITS2 checkpoint other than the one you shared at 64k steps. Someone in the Vietnamese samples discussion shared models and configs on Drive, but not the symbols, which is why I get the following error when trying to run inference with it:
RuntimeError: Error(s) in loading state_dict for SynthesizerTrn:
size mismatch for enc_p.emb.weight: copying a param with shape torch.Size([184, 192]) from checkpoint, the shape in current model is torch.Size([178, 192]).
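
To read the error above more easily, here is a minimal sketch that pulls the two shapes out of the message with the standard library only. The helper name `parse_size_mismatch` is hypothetical, not part of PyTorch or this repo. It rests on the assumption that, in VITS-style code, `enc_p.emb` is the text encoder's embedding table, so its first dimension is the number of text symbols. If that holds, the checkpoint was trained with 184 symbols while the local symbol list has 178.

```python
import re

# Matches one "size mismatch" line from a PyTorch load_state_dict error message.
_MISMATCH = re.compile(
    r"size mismatch for (?P<name>[\w.]+): copying a param with shape "
    r"torch\.Size\(\[(?P<ckpt>[\d, ]*)\]\) from checkpoint, "
    r"the shape in current model is torch\.Size\(\[(?P<model>[\d, ]*)\]\)"
)


def parse_size_mismatch(message):
    """Return a list of (param_name, checkpoint_shape, model_shape) tuples."""
    results = []
    for m in _MISMATCH.finditer(message):
        ckpt = tuple(int(x) for x in m.group("ckpt").split(",") if x.strip())
        model = tuple(int(x) for x in m.group("model").split(",") if x.strip())
        results.append((m.group("name"), ckpt, model))
    return results


# The error text quoted in the post above.
error_text = (
    "size mismatch for enc_p.emb.weight: copying a param with shape "
    "torch.Size([184, 192]) from checkpoint, the shape in current model is "
    "torch.Size([178, 192])."
)

for name, ckpt_shape, model_shape in parse_size_mismatch(error_text):
    # Assumption: rows of enc_p.emb.weight correspond to entries in the symbol list.
    diff = ckpt_shape[0] - model_shape[0]
    print(f"{name}: checkpoint rows={ckpt_shape[0]}, model rows={model_shape[0]}, difference={diff}")
```

Under that assumption, the difference of 6 rows points to six symbols that the checkpoint's symbol list had and the local one lacks. The second dimension (192) already matches.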

I did find a model trained with plain VITS, but no matter how long I fine-tuned it, the output was gibberish-sounding audio. If you have one, please share it with us.

Best regards.

p0p4k (Maintainer) · 2024-08-01T14:12:15Z

Hi, I have neither a model trained on the latest code nor the time/resources to train one. I can help you debug your training if needed.

JohnHerry · 2024-08-02T01:46:16Z

Ah, it seems that you are training from scratch with a different audio sample rate. If that is true, you should first familiarize yourself with the HiFiGAN model framework. Your weights are [184, 192] while the current model expects [178, 192]; most likely you are using a different speech sample rate or a different hop_size for mel frame extraction. In either case, you should first adjust your decoder [HiFiGAN] parameters to fit your custom config.
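
One concrete constraint behind the decoder advice above, sketched under assumptions: in HiFi-GAN-style decoders, the product of the upsampling rates generally has to equal the hop length used for mel extraction, so that each frame expands to exactly one hop of audio samples. The helper name `decoder_matches_hop` is hypothetical, and the values follow the commonly used 22.05 kHz LJ Speech setup, not a config confirmed in this thread.

```python
import math


def decoder_matches_hop(upsample_rates, hop_length):
    """True when the decoder's total upsampling factor equals the mel hop length."""
    return math.prod(upsample_rates) == hop_length


# Assumed example values in the style of a stock LJ Speech config.
print(decoder_matches_hop([8, 8, 2, 2], 256))  # 8 * 8 * 2 * 2 = 256
print(decoder_matches_hop([8, 8, 2, 2], 320))  # would need different upsample rates
```

This check covers the decoder only. As far as I can tell, `enc_p.emb.weight` in the reported error is sized by the number of text symbols and not by the sample rate or hop size, so a differing symbol list is also worth ruling out.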
