
I finetuned voicecraft on commonvoice-french, here are some of my findings/thoughts #154

zmy1116 opened this issue Aug 15, 2024 · 3 comments


zmy1116 commented Aug 15, 2024

Hello,

So I finetuned VoiceCraft on the French Common Voice dataset. It's quite exciting, since it's my first time working on an LLM and on a full audio model (not just spectrogram -> classification, which is basically image recognition)! I just want to share some of my thoughts/findings/questions here, because I see many open issues about finetuning; hopefully @jasonppy can also provide some insights/suggestions!

data preparation

I already answered this under issue #138. Again, I want to emphasize that while the algorithm itself is involved and the VoiceCraft model code is pretty hairy and intimidating, preparing finetuning data is really straightforward. Essentially you need to do the following:

  • generate the Encodec codes for each audio file and save them
  • generate the phoneme sequence for each transcript and save it
  • modify the model's text embedding weights if the total number of phonemes exceeds the number the pretrained model uses (a rough sketch follows this list):
    • the pretrained model uses 80 phonemes, but the embedding table has 101 rows, with the last one reserved for padding; so if your total phoneme count stays within 100 you don't need to do anything, otherwise you need to expand this tensor.
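
For the third bullet, here is a minimal sketch of what expanding the text embedding could look like. The checkpoint layout and the "text_embedding.word_embeddings.weight" key are assumptions (inspect your checkpoint's state dict for the real names), and note that the pretrained table reserves its last row for padding:

import torch

# Minimal sketch (hypothetical key names; check ckpt["model"].keys() for the real ones).
ckpt = torch.load("pretrained_voicecraft.pth", map_location="cpu")
key = "text_embedding.word_embeddings.weight"      # placeholder key for the phoneme embedding
old_emb = ckpt["model"][key]                       # e.g. shape [101, dim], last row = padding

new_vocab = 120                                    # example: your phoneme count + 1 padding slot
if new_vocab > old_emb.shape[0]:
    extra = torch.randn(new_vocab - old_emb.shape[0], old_emb.shape[1]) * old_emb.std()
    # keep the original padding row at the end of the expanded table
    ckpt["model"][key] = torch.cat([old_emb[:-1], extra, old_emb[-1:]], dim=0)
    # remember to also bump the corresponding vocab-size hyperparameter in the model args
    torch.save(ckpt, "pretrained_voicecraft_expanded.pth")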

I want to address an issue I found when generating French phonemes. VoiceCraft generates IPA phonemes using the phonemizer package; if you use the same piece of code to generate phonemes for your language, sometimes you will get this:

for sentence:  Il va ensuite se positionner sur le dos de la femelle et s'accoupler.
['i', 'l', '_', 'v', 'a', '_', 'ɑ', '̃', 's', 'y', 'i', 't', '_', 's', 'ə', '_', 'p', 'o', 'z', 'i', 's', 'j', 'ɔ', 'n', 'e', '_', 's', 'y', 'ʁ', '_', 'l', 'ə', '_', '(', 'en', ')', 'd', 'ɒ', 's', '(', 'fr', ')', '_', 'd', 'ə', '_', 'l', 'a', '_', 'f', 'ə', 'm', 'ɛ', 'l', '_', 'e', '_', 's', 'a', 'k', 'u', 'p', 'l', 'e', '.']

You see, the phoneme sequence contains these (en) and (fr) markers because the phonemizer thinks there is a language switch. These are of course not real phoneme tokens; to remove them, set the flag:

text_tokenizer = TextTokenizer(language='fr-fr', language_switch='remove-flags')
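
If you want to double-check the behavior outside of VoiceCraft, the same option exists in the phonemizer package itself (which TextTokenizer builds on); a small sketch, assuming espeak-ng is installed as the backend:

from phonemizer.backend import EspeakBackend
from phonemizer.separator import Separator

# 'remove-flags' drops the (en)/(fr) language-switch markers instead of keeping them as tokens
backend = EspeakBackend("fr-fr", language_switch="remove-flags")
text = "Il va ensuite se positionner sur le dos de la femelle et s'accoupler."
print(backend.phonemize([text], separator=Separator(phone=" ", word="_ "))[0])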

training code related

If you go through steps/train_utils.py, you will see that training batches do not have fixed sizes. They are created such that:

  • each batch processes roughly max_token_num tokens,
  • all sequences in a batch have roughly the same length.

Once a batch is distributed to a GPU process, it is further split into multiple gradient-accumulation steps. However, for whatever reason, THIS DID NOT WORK WELL ON MY GPU SETUP. I'm training on 8x L4, and I always got OOM errors even when I set the number of accumulation steps very high. Therefore, I rewrote a portion of the sampler: instead of building a large batch of 10000 tokens and then splitting it into 10+ small steps, I make the sampler directly produce batches of at most 1000 tokens and do a gradient update every 10 batches. The difference between the two methods is that now I control exactly how many tokens are processed in a single step. If you have smaller GPUs and encounter similar issues, you can do what I did (a rough sketch follows).
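
This is roughly the idea, not my actual code: a token-budget sampler that packs length-sorted sequences until a cap is hit, plus a plain gradient-accumulation loop on top (model/optimizer/loss below are placeholders):

import random

def token_budget_batches(lengths, max_tokens=1000):
    """Group sample indices so each padded batch holds at most ~max_tokens tokens;
    sorting by length first keeps the sequences in a batch roughly the same size."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, cur, longest = [], [], 0
    for i in order:
        if cur and (len(cur) + 1) * max(longest, lengths[i]) > max_tokens:
            batches.append(cur)
            cur, longest = [], 0
        cur.append(i)
        longest = max(longest, lengths[i])
    if cur:
        batches.append(cur)
    return batches

# sanity check with fake sequence lengths: padded size stays under the budget
lengths = [random.randint(50, 400) for _ in range(200)]
for b in token_budget_batches(lengths)[:3]:
    print(len(b), len(b) * max(lengths[i] for i in b))

# In the training loop (model/optimizer/compute_loss are placeholders for the real code),
# step the optimizer every `accum` batches instead of splitting one giant batch:
#
#   accum = 10
#   optimizer.zero_grad()
#   for step, idx in enumerate(token_budget_batches(lengths, max_tokens=1000)):
#       (compute_loss(model, idx) / accum).backward()
#       if (step + 1) % accum == 0:
#           optimizer.step()
#           optimizer.zero_grad()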

Training

One thing I think would be beneficial for people is if @jasonppy could put training curves in the paper or in the repository, so we know what to expect. Since this is my first time training an LLM, I had no idea what to expect. My training curves looked like the screenshot below after 5 days. I saw a top-10 accuracy of 0.56 and thought this was horrible!! For the past two days I had been reviewing/validating the entire data generation/training process. Today I started to wonder what the actual loss/accuracy is when the model is trained on GigaSpeech, so I computed the loss and accuracy on 4 GigaSpeech examples... it turns out the returned loss and accuracy are worse than the values I currently have.

Then I realized that you are not supposed to get super high accuracy in the first place, because there are infinitely many ways to say the same sentence...
[screenshot: training curves]
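
As an aside, for anyone unfamiliar with the metric: top-10 accuracy just measures how often the ground-truth codec token is among the model's 10 highest-scoring predictions. A rough sketch (the vocabulary size of 2048 is only an example):

import torch

def topk_accuracy(logits, targets, k=10):
    """logits: [num_tokens, vocab_size]; targets: [num_tokens] ground-truth codec tokens."""
    topk = logits.topk(k, dim=-1).indices                 # [num_tokens, k]
    hits = (topk == targets.unsqueeze(-1)).any(dim=-1)    # True where the target is in the top-k
    return hits.float().mean().item()

# toy example with random predictions
logits = torch.randn(500, 2048)
targets = torch.randint(0, 2048, (500,))
print(topk_accuracy(logits, targets, k=10))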

how well it works

It works about as well as the model trained on English (and shares similar problems)! Also, since the Common Voice French dataset has its own problems, I think that for a fully functional French model we probably need to curate a higher-quality dataset with more diverse intonation.

I guess the biggest problem now is that the tempo of the generated speech is not very realistic, especially the long pauses. I know the paper suggests generating multiple samples and picking the shortest one. I'm wondering if we can do the following:

  • at generation time, set the silence-token logits to -inf if multiple silence tokens are generated one after another (see the sketch after this list)... but then you need to know a priori which portions are not supposed to have long pauses... it could also be done as a post-process:
    • generate the utterance
    • run forced alignment to get word timestamps
    • find unnecessarily large gaps...
    • remove them? or restart generation from where the gap starts?
      it seems to take quite a long time to do all of this,
  • at training time, instead of inserting one mask token, why don't we insert multiple tokens that represent the tempo and intonation of the masked portion... so at inference time these could be used as control tokens...
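
A rough sketch of the first idea, for a single codebook stream (the silence token id, vocabulary size, and max_run threshold are all assumptions); the point is to mask the silence token with -inf once it has already been generated several times in a row:

import torch

def suppress_repeated_silence(next_logits, generated, silence_id, max_run=3):
    """If the last `max_run` generated tokens were all silence, set the silence token's
    logit to -inf so sampling cannot extend the pause any further.
    next_logits: [vocab_size] scores for the next token; generated: token ids so far."""
    if len(generated) >= max_run and all(t == silence_id for t in generated[-max_run:]):
        next_logits = next_logits.clone()
        next_logits[silence_id] = float("-inf")
    return next_logits

# toy usage: silence_id=0 and a 2048-token vocabulary are assumptions
logits = suppress_repeated_silence(torch.randn(2048), generated=[17, 0, 0, 0], silence_id=0)
print(logits[0])  # -inf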

Anyway, thanks again to @jasonppy for this work!


Revln9 commented Aug 29, 2024

Any chance you can share that model? I'd love to try VoiceCraft with a different language ^^

Thanks for the feedback. It will for sure help a lot of people!

@TheWayLost

@zmy1116 Thank you for sharing this valuable experience! There are 2 things I'm wondering about:

  1. How much VRAM does it require to train the model?
  2. Did you train it from the English model, or from scratch with random initial weights?


zmy1116 commented Dec 27, 2024

  1. I used 8x L4 GPUs; each L4 has 24 GB of VRAM. Again, the more important thing is the learning rate versus the total number of tokens processed per batch; you can use gradient accumulation if you don't have enough GPU VRAM.

  2. I used the English model directly.
