Finetuning 1B first-stage on non-English datasets: thoughts #157
Comments
As promised, here are the training results.

The result is not very impressive, as I suspected. The model now clones voice features but produces gibberish for non-English phrases. Moreover, the output for English has degraded significantly, with the model struggling to generate words or producing highly noisy output. Feel free to ask if you have questions about the setup used for training.
Hello. I also want to fine-tune the model to a language other than English; I would really appreciate your response.
According to the original Discord message:
Hello everyone! I am fine-tuning the model in a non-English language. The dataset consists of 200 hours of audio recordings. Following the guideline (#70 (comment)) and the latest updates, particularly the `fam/llm/finetune.py` script, I have set up the following:

- `speaker_encoder`: I did not change the model; I left the one that is in the HF repo.
- `tokenizer`: I trained a new BPE tokenizer (`mergeable_ranks` length = 512) on my dataset and replaced it when loading from the model checkpoint (`metavoice-src/fam/llm/finetune.py`, line 137 in 12df077).
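For readers unfamiliar with the tokenizer-training step: the idea of learning a small BPE merge table from a corpus can be sketched in plain Python as below. This is a minimal illustration of the classic BPE algorithm, not the actual pipeline used here (the repo uses a tiktoken-style tokenizer with `mergeable_ranks`; the function names and corpus are mine).

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    a, b = pair
    out = {}
    for word, freq in vocab.items():
        symbols = word.split()
        new_syms, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                new_syms.append(a + b)
                i += 2
            else:
                new_syms.append(symbols[i])
                i += 1
        out[" ".join(new_syms)] = freq
    return out

def train_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merge rules from a list of words."""
    vocab = Counter(" ".join(w) for w in corpus)  # words as space-separated chars
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# Tiny toy corpus: the most frequent pairs get merged first.
merges = train_bpe(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)
```

A vocabulary of 512 merged tokens (as used above) is very small compared to typical text tokenizers, which may itself be a factor worth experimenting with.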
The model training is currently in progress, and I'm monitoring the `training_loss`, which is barely dropping below 2.000 (~2.200).

Regarding the dataset: it's a mix of various data, including:

- `sample_rate = 16000`; mean duration: 5.3s; low noise / high SNR; about 300 speakers
- `sample_rate = 44100`; mean duration: 4s; high SNR; about 100 speakers
- `sample_rate = 16000 / 44100`; mean duration: 7.4s; high SNR; about 30 speakers

Additionally, I performed dataset cleaning based on the number of samples per speaker and on duration.
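The cleaning step described above (filtering by samples per speaker and by duration) can be sketched roughly as follows. The field names and thresholds here are illustrative assumptions on my part, not from the repo:

```python
from collections import Counter

def clean_dataset(samples, min_per_speaker=2, min_dur=1.0, max_dur=30.0):
    """Keep clips within a duration window, then drop speakers
    that end up with too few remaining clips."""
    in_window = [s for s in samples if min_dur <= s["duration"] <= max_dur]
    counts = Counter(s["speaker"] for s in in_window)
    return [s for s in in_window if counts[s["speaker"]] >= min_per_speaker]

data = [
    {"speaker": "a", "duration": 5.3},
    {"speaker": "a", "duration": 0.4},  # too short, dropped by the window
    {"speaker": "a", "duration": 7.0},
    {"speaker": "b", "duration": 4.0},  # speaker "b" has only one clip left
]
kept = clean_dataset(data, min_per_speaker=2)
print(kept)  # only speaker "a"'s two in-window clips survive
```

Ordering matters: filtering by duration first and counting speakers afterwards avoids keeping a speaker whose clips were mostly removed by the duration window.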
What am I doing wrong? The results, in my opinion, will not be impressive. I currently don't have datasets with a mean duration of ~30s; will this significantly impact training, or did I make a mistake in the training process itself (rather than at the dataset-selection stage)?
I also believe that such a high loss at this stage of training is not a good sign. I would like to hear your opinion on this high loss, and what results you have obtained on your data with the latest published model checkpoints.
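For intuition (my own interpretation, not stated in the thread): if `training_loss` is a token-level cross-entropy in nats, a plateau around 2.2 maps to a perplexity of exp(2.2) ≈ 9, i.e. the model is effectively choosing among roughly nine equally likely tokens at each step:

```python
import math

loss = 2.2                      # observed training_loss plateau (nats/token)
perplexity = math.exp(loss)     # cross-entropy in nats -> perplexity
print(round(perplexity, 2))
```

Whether that is "high" depends on the token vocabulary size and the stage being trained, so comparing against the maintainers' loss curves on their own data would be more informative than the absolute number.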
I will provide the report from wandb in the thread!