Finetuning 1B first-stage on non-English datasets: thoughts #157
Comments
As promised, here are the training results.

The result is not very impressive, as I suspected. The model now clones voice features but produces gibberish for non-English phrases. Moreover, the output for English has degraded significantly, with the model struggling to generate words or producing highly noisy output. Feel free to ask if you have questions about the setup used for training.
Hello. I also want to fine-tune the model to a language other than English; I would really appreciate your response.
According to the original Discord message:
Hello everyone! I am fine-tuning the model in a non-English language. The dataset consists of 200 hours of audio recordings. Following the guideline (#70 (comment)) and the latest updates, particularly the `fam/llm/finetune.py` script, I have set up the following:

- `speaker_encoder`: I did not change the model; I left the one that is in the HF repo.
- `tokenizer`: I trained a new BPE tokenizer (`mergeable_ranks` length = 512) on my dataset and replaced it when loading from the model checkpoint (`metavoice-src/fam/llm/finetune.py`, line 137 in 12df077).
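For readers unfamiliar with the tokenizer-training step: the idea of learning a small BPE merge table from a corpus can be sketched in plain Python as below. This is a minimal illustration of the classic BPE algorithm, not the actual pipeline used here (the repo uses a tiktoken-style tokenizer with `mergeable_ranks`; the function names and corpus are mine).

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    a, b = pair
    out = {}
    for word, freq in vocab.items():
        symbols = word.split()
        new_syms, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                new_syms.append(a + b)
                i += 2
            else:
                new_syms.append(symbols[i])
                i += 1
        out[" ".join(new_syms)] = freq
    return out

def train_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merge rules from a list of words."""
    vocab = Counter(" ".join(w) for w in corpus)  # words as space-separated chars
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# Tiny toy corpus: the most frequent pairs get merged first.
merges = train_bpe(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)
```

A vocabulary of 512 merged tokens (as used above) is very small compared to typical text tokenizers, which may itself be a factor worth experimenting with.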
The model training is currently in progress, and I'm monitoring the `training_loss`, which is barely dropping below 2.000 (~2.200).

Regarding the dataset: it's a mix of various data, including:

- `sample_rate = 16000`; mean duration: 5.3s; low noise / high SNR; about 300 speakers
- `sample_rate = 44100`; mean duration: 4s; high SNR; about 100 speakers
- `sample_rate = 16000 / 44100`; mean duration: 7.4s; high SNR; about 30 speakers

Additionally, I performed dataset cleaning based on the number of samples per speaker and on duration.
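The cleaning step described above (filtering by samples per speaker and by duration) can be sketched roughly as follows. The field names and thresholds here are illustrative assumptions on my part, not from the repo:

```python
from collections import Counter

def clean_dataset(samples, min_per_speaker=2, min_dur=1.0, max_dur=30.0):
    """Keep clips within a duration window, then drop speakers
    that end up with too few remaining clips."""
    in_window = [s for s in samples if min_dur <= s["duration"] <= max_dur]
    counts = Counter(s["speaker"] for s in in_window)
    return [s for s in in_window if counts[s["speaker"]] >= min_per_speaker]

data = [
    {"speaker": "a", "duration": 5.3},
    {"speaker": "a", "duration": 0.4},  # too short, dropped by the window
    {"speaker": "a", "duration": 7.0},
    {"speaker": "b", "duration": 4.0},  # speaker "b" has only one clip left
]
kept = clean_dataset(data, min_per_speaker=2)
print(kept)  # only speaker "a"'s two in-window clips survive
```

Ordering matters: filtering by duration first and counting speakers afterwards avoids keeping a speaker whose clips were mostly removed by the duration window.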
What am I doing wrong? The results, in my opinion, will not be impressive. I currently don't have datasets with a mean duration of ~30s; will this significantly impact training, or did I make a mistake in the training process itself (rather than at the dataset-selection stage)?
I also believe that such a high loss at this stage of training is not a good sign. I would like to hear your opinion on this high loss, and what results you have obtained on your data with the latest published model checkpoints.
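For intuition (my own interpretation, not stated in the thread): if `training_loss` is a token-level cross-entropy in nats, a plateau around 2.2 maps to a perplexity of exp(2.2) ≈ 9, i.e. the model is effectively choosing among roughly nine equally likely tokens at each step:

```python
import math

loss = 2.2                      # observed training_loss plateau (nats/token)
perplexity = math.exp(loss)     # cross-entropy in nats -> perplexity
print(round(perplexity, 2))
```

Whether that is "high" depends on the token vocabulary size and the stage being trained, so comparing against the maintainers' loss curves on their own data would be more informative than the absolute number.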
I will provide the report from wandb in the thread!