Bad timestamp prediction with some finetuned Whisper models #173
Thank you @lumpidu. Can you please also give the options you use to get the transcription with the bad (too short) second segment? If I just run
I could see some problems with option --accurate. And here is my guess: the only thing you can do to alleviate the impact with whisper-timestamped is to use that option. There will still be the issue that some parts of the audio might be either repeated or missing in your transcription when transcribing audio of more than 30 seconds with such a model.
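The 30-second issue comes from Whisper's fixed input window. How whisper-timestamped handles long audio internally may differ, but the overlapping-segments idea discussed in this thread can be sketched as follows; the window and overlap sizes are arbitrary assumptions for illustration:

```python
def overlapping_windows(n_samples, sr=16000, window_s=30.0, overlap_s=5.0):
    """Split an audio length into 30 s windows that overlap by `overlap_s`,
    so words cut at one window boundary reappear whole in the next window."""
    win = int(window_s * sr)
    hop = win - int(overlap_s * sr)
    spans = []
    start = 0
    while start < n_samples:
        spans.append((start, min(start + win, n_samples)))
        if start + win >= n_samples:
            break
        start += hop
    return spans

# 70 s of 16 kHz audio -> three windows, each sharing 5 s with its neighbour
spans = overlapping_windows(70 * 16000)
```

Transcripts of the overlap regions would then have to be merged, e.g. by keeping whichever copy of a duplicated word has the higher confidence.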
Also, another thing you could try is with the regular model.
I used for the above output. The best result was using no options at all. I will rerun with your proposals.
I just ran with that. Could you elaborate: what exactly would be needed for fine-tuning models to predict better timestamps?
Have you tried that with the finetuned model? Concerning text normalization: you mean that there are digits instead of numbers written with letters, upper-case letters, and punctuation marks? Concerning fine-tuning: models should be finetuned to predict timestamps at the end of each segment.
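For context on "predict timestamps at the end of each segment": Whisper's vocabulary contains special timestamp tokens at 0.02 s resolution, and timestamp-aware fine-tuning keeps them in the training targets instead of stripping them. A minimal sketch of building such a target string; the segment texts are made up, and the real tokenizer handling is simplified:

```python
def timestamp_token(seconds):
    """Render a Whisper-style timestamp token, quantized to 0.02 s steps."""
    return f"<|{round(seconds / 0.02) * 0.02:.2f}|>"

def build_target(segments):
    """Bracket each (start, end, text) segment with timestamp tokens,
    so the model learns to emit an end timestamp for every segment."""
    return "".join(
        f"{timestamp_token(start)}{text}{timestamp_token(end)}"
        for start, end, text in segments
    )

# two invented segments; in real fine-tuning these come from aligned training data
target = build_target([(0.0, 2.4, " Halló"), (2.4, 5.12, " heimur")])
```

A model fine-tuned on targets without these tokens loses the ability to predict segment boundaries, which matches the behaviour reported in this issue.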
@lumpidu, have you tried option
I am using it. I also found that some voices are not being detected by the script, maybe because of the VAD?
Maybe... It cannot hurt to try and see
Indeed, silero is a statistical model (a neural net) with some weird behaviour sometimes. You can try. Also, you can use the --plot option to watch the VAD result on your signal, to see if that's the problem.
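Independently of the --plot option, a quick way to see why quiet voices might be dropped is to look at frame energies yourself. This toy energy-based detector is not silero and only illustrates the idea; the frame size and threshold are arbitrary assumptions:

```python
import numpy as np

def energy_vad(samples, sr, frame_ms=30, threshold_db=-35.0):
    """Toy VAD: flag each frame whose RMS level (in dBFS) exceeds a threshold.
    Quiet voices fall below the threshold and would be treated as silence."""
    frame = int(sr * frame_ms / 1000)
    flags = []
    for i in range(len(samples) // frame):
        chunk = samples[i * frame:(i + 1) * frame]
        rms = np.sqrt(np.mean(chunk ** 2)) + 1e-12
        flags.append(bool(20 * np.log10(rms) > threshold_db))
    return flags

# synthetic signal: 1 s silence, 1 s of a 220 Hz tone, 1 s silence, at 16 kHz
sr = 16000
t = np.arange(sr) / sr
sig = np.concatenate([np.zeros(sr), 0.5 * np.sin(2 * np.pi * 220 * t), np.zeros(sr)])
flags = energy_vad(sig, sr)
```

If a real voice in your recording sits near such a threshold, a neural VAD like silero can behave just as erratically, which is what the --plot output would reveal.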
I see the following incorrect transcriptions when running my tests with the fine-tuned model language-and-voice-lab/whisper-large-icelandic-62640-steps-967h:
Take a look at the start and end segment data:

- Segments 1 + 2: the start of segment 2 doesn't begin at the end of segment 1.
- Segment 1: end minus start is 1.58 seconds, but in fact the segment length is 29.74 seconds, as can be seen from segment 2's start time of 59.74.
- Segment 2: end minus start is 1.06 seconds, but in fact the segment length is 29.76 seconds, as segment 3 starts at 89.5.
- The text of segment 1 ("ilög") and segment 2 ("fsiréttar") isn't correct, because these segments start in the middle of a spoken word.
- The confidence of the start/end timings is very low: in the above case it's << 0.1. For all non-problematic segments it's often close to 1.0, e.g. 0.986.
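Inconsistencies like these can be detected mechanically by comparing each segment's end with the next segment's start. A minimal sketch, assuming segments carry id, start, and end fields as in Whisper's JSON output; the numbers mirror the case above but are illustrative:

```python
def check_segments(segments, tol=0.1):
    """Return (id_a, id_b, gap_seconds) for every adjacent pair whose
    boundary doesn't line up within `tol` seconds (gap or overlap)."""
    problems = []
    for a, b in zip(segments, segments[1:]):
        gap = b["start"] - a["end"]
        if abs(gap) > tol:
            problems.append((a["id"], b["id"], gap))
    return problems

segs = [
    {"id": 1, "start": 30.0, "end": 31.58},   # claims a 1.58 s duration ...
    {"id": 2, "start": 59.74, "end": 60.80},  # ... yet the next start is 59.74
    {"id": 3, "start": 89.50, "end": 92.00},
]
issues = check_segments(segs)
```

Both pairs above are flagged with gaps of roughly 28 seconds, the same order of magnitude as the mismatch described in the report.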
There is no warning on stderr/stdout about non-aligning segments or low confidence values of the transcripts. There is also no way any ASR system can generate correct first or last words if segments start or end in the middle of a spoken word. Therefore my suggestion is to use a less naive approach, either via VAD or via overlapping segments. It's not clear to me which of these approaches has already been implemented by whisper_timestamped.

Originally posted by @lumpidu in #64 (comment)
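The missing-warning suggestion could be prototyped on top of the JSON output, since each segment carries a confidence value; a thin wrapper can surface low values on stderr. The 0.5 threshold below is an arbitrary assumption:

```python
import sys

def warn_low_confidence(segments, threshold=0.5):
    """Print a stderr warning for each segment whose confidence is below
    `threshold`; return the flagged segment ids."""
    flagged = []
    for seg in segments:
        conf = seg.get("confidence", 1.0)
        if conf < threshold:
            print(f"WARNING: segment {seg['id']} has low confidence {conf:.3f}",
                  file=sys.stderr)
            flagged.append(seg["id"])
    return flagged

# values taken from the report above: << 0.1 for bad segments, ~0.986 otherwise
flagged = warn_low_confidence([
    {"id": 1, "confidence": 0.05},
    {"id": 2, "confidence": 0.986},
])
```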
Here is an audio file (wav converted to webm with highest possible quality), that can be used to reproduce the error:
demo1_ice.webm
I have tried several different approaches: default values, the default values as stated for whisper, and with or without VAD (silero-4.0). The best results were with VAD turned on.