Benchmark faster whisper turbo v3 #1030
We now support the new whisper-large-v3-turbo on Sieve! Use it via … Just set …
Would be great if medium (or more sizes) were added for comparison! In OpenAI's implementation, turbo is 8x faster than large-v3 (medium is 2x and base is 7x), while offering a WER similar to large-v2, which sounds surreal. I wonder how that translates to the faster-whisper version.
In my test, with the same 10-minute audio, Medium took 52 seconds, and Turbo took 39 seconds.
You may find this discussion helpful:
Compared to Medium, Turbo and large-v3 have subtitle timings that are shifted earlier: subtitles generated by Turbo appear early but end precisely on time. Medium subtitles also appear early, though to a much lesser extent than Turbo; however, Medium subtitles are delayed in disappearing. I find subtitles disappearing late to be a better experience than subtitles appearing early, so I'll keep using Medium.
Or you could use a forced alignment model after the transcription; that gives much better timings than Whisper itself.
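For anyone who wants to try the forced-alignment route, here's a rough sketch that re-aligns faster-whisper output with WhisperX's alignment models. It's a sketch under assumptions, not something from this thread: the file name is a placeholder, and the segment-to-dict conversion is my own glue code.

```python
# Sketch: re-align faster-whisper segments with a forced-alignment model (WhisperX).
# Assumes `pip install whisperx` and a CUDA device; the audio path is a placeholder.
import whisperx
from faster_whisper import WhisperModel

device = "cuda"
audio_path = "speech.wav"  # placeholder

model = WhisperModel("large-v3", device=device, compute_type="float16")
segments, info = model.transcribe(audio_path)

# WhisperX expects plain dicts with start/end/text.
transcript = [{"start": s.start, "end": s.end, "text": s.text} for s in segments]

audio = whisperx.load_audio(audio_path)
align_model, metadata = whisperx.load_align_model(language_code=info.language, device=device)
aligned = whisperx.align(transcript, align_model, metadata, audio, device)

for seg in aligned["segments"]:
    print(f"{seg['start']:.2f} -> {seg['end']:.2f}: {seg['text']}")
```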
Great, thanks for your efforts! I hope it gets added to faster-whisper/faster_whisper/utils.py (lines 12 to 29 in d57c5b4).
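In the meantime, the model can already be loaded by passing a community CTranslate2 conversion's Hugging Face repo id directly to WhisperModel. A minimal sketch; the deepdml repo id below is one of the conversions mentioned later in this thread and could be swapped for another, and the file name is a placeholder:

```python
from faster_whisper import WhisperModel

# Load a community CTranslate2 conversion of large-v3-turbo by its HF repo id,
# since "turbo" is not yet an alias in faster_whisper/utils.py at this point.
model = WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2",
                     device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3")  # placeholder file name
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```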
I benchmarked the models on my laptop using the same audio file in both sequential and batched processing. Seems that …

System Specifications
Benchmark Details
Sequential Processing Benchmark
Observations:
Batched Processing Benchmark

For batched processing, I used 10 batches for each model. I tried to use 16 batches, but some models threw out-of-memory (OOM) errors due to the 6 GB VRAM limit.
Observations:
Conclusions
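For anyone who wants to reproduce a sequential-vs-batched comparison like the one above, here's a minimal sketch using faster-whisper's BatchedInferencePipeline. It assumes a faster-whisper version that ships the batched pipeline; the model repo id and batch size of 10 mirror this thread, and the audio file is a placeholder.

```python
import time
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2",
                     device="cuda", compute_type="float16")

# Sequential transcription.
start = time.perf_counter()
segments, _ = model.transcribe("benchmark.mp3")  # placeholder file
text_sequential = " ".join(s.text for s in segments)  # the generator is consumed here
print(f"sequential: {time.perf_counter() - start:.1f}s")

# Batched transcription (batch_size=10, as in the benchmark above).
batched = BatchedInferencePipeline(model=model)
start = time.perf_counter()
segments, _ = batched.transcribe("benchmark.mp3", batch_size=10)
text_batched = " ".join(s.text for s in segments)
print(f"batched:    {time.perf_counter() - start:.1f}s")
```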
Just chiming in, I've tried using v3-turbo for streaming and found that it hallucinates more/misses audio more than other faster-whisper models. For example, for this 10-second audio clip of an Obama speech:
produces
Whereas using medium:
produces
Not sure if any of you are experiencing anything similar? Or maybe an official faster-whisper turbo-v3 release would perform better.
Haven't encountered that. I've tried the same audio and both models return the same transcription. I did notice that the turbo model hallucinates more on noisy data than v3, but that's to be expected considering what we saw with the Common Voice 15 benchmark.
Sharing the benchmarking results for the Turbo model compared to other large Whisper models on one of the biggest open-source long-form ASR evaluation datasets.
The whisper-turbo model achieves a Word Error Rate (WER) similar to large models and excels in speed among multilingual models.
Also, it does not seem to support the translation task, even though it is mentioned.
They specifically mentioned the translation task being excluded... openai/whisper#2363
Oh okay, I looked at the Hugging Face page and the translation task is mentioned there. Thanks for pointing that out.
Interesting, did you try on the audio I provided here? It's actually remarkably consistent in how it's worse at short audio clips for me (with "deepdml/faster-whisper-large-v3-turbo-ct2").
Yeah, I did specifically try it on your clip with the same model. But I also did it with batched processing and not sequential, so I'm not sure if this is a specific issue with it or not. Batched processing with 6-10 batches works best on my setup and actually provides more accurate transcriptions, as you can see from my little benchmark earlier in this thread, so I use it for everything.
Btw, this happens on the non-turbo v3 model as well. I've tried this with a lot of audio files of variable length and it happens a lot, so I've rolled back to the v2 model.
Thank you so much for recording these benchmarks, I almost cannot believe the speed and quality of these models. I have one request for anyone reading this who is benchmarking: could some of the people performing benchmarks please record the hardware they are benchmarking on if possible? Thanks in advance.
It seems that long-duration audio files cannot be processed correctly. When I tested with an 11-hour MP3 file, the memory usage quickly spiked above 26 GB. After 3 minutes, the CLI displayed "Killed" and then exited.
Is it a problem only with the turbo model?
No, using "large-v3" also doesn't work. The memory usage spikes to 27 GB, and then it shows "Killed":

`segments, info = model.transcribe("11hours.mp3", word_timestamps=True)`
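One possible workaround while the memory behaviour is being investigated (a sketch, not something reported in this thread): decode the long file once and transcribe it in fixed-size chunks so only a bounded slice is featurized at a time. The chunk length is an arbitrary assumption.

```python
from faster_whisper import WhisperModel, decode_audio

SAMPLE_RATE = 16000
CHUNK_SECONDS = 30 * 60  # 30-minute chunks, an arbitrary choice

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
audio = decode_audio("11hours.mp3", sampling_rate=SAMPLE_RATE)  # 1-D float32 array

all_segments = []
chunk_samples = CHUNK_SECONDS * SAMPLE_RATE
for start in range(0, len(audio), chunk_samples):
    chunk = audio[start:start + chunk_samples]
    offset = start / SAMPLE_RATE  # shift timestamps back onto the full-file timeline
    segments, _ = model.transcribe(chunk, word_timestamps=True)
    for s in segments:
        all_segments.append((s.start + offset, s.end + offset, s.text))
```

Note that naive fixed boundaries can cut a word or sentence in half; overlapping chunks or VAD-based split points would be more robust.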
Does torch.stft cause GPU OOM? I know that in the old STFT implementation, the matrix multiplication `mel_spec = self.mel_filters @ magnitudes` would use a large amount of memory for long audio files. For this reason, I wrote a batched version a while back: https://github.com/ben91lin/faster-whisper/blob/mochi/faster_whisper/feature_extractor.py
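To illustrate the idea (this is not the linked implementation itself): the filterbank product can be applied over time chunks so peak memory scales with the chunk size rather than the whole file. The chunk size below is an arbitrary assumption.

```python
import torch

def chunked_mel(mel_filters: torch.Tensor, magnitudes: torch.Tensor,
                chunk_frames: int = 3000) -> torch.Tensor:
    """Apply the mel filterbank chunk by chunk along the time axis.

    mel_filters: (n_mels, n_freq), magnitudes: (n_freq, n_frames).
    Keeps peak memory proportional to chunk_frames instead of n_frames.
    """
    outputs = []
    for start in range(0, magnitudes.shape[-1], chunk_frames):
        chunk = magnitudes[..., start:start + chunk_frames]
        outputs.append(mel_filters @ chunk)
    return torch.cat(outputs, dim=-1)
```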
Why does this repo use …? See code: https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/utils.py HF links: …
Because the …
@tjongsma I could reproduce your findings (with float32 due to my graphics card, but that shouldn't make a difference). Every single point of the following fixes it in my experimenting:
(@usergit @NilaierMusic @zxl777) Large-v3 does work on @tjongsma's clip:
(Due to my graphics card I use float32, but it should probably be the same with float16.) But I see that OpenAI themselves use large-v2 for their API, see here. That could just be because they didn't test large-v3 enough; why wouldn't they use large-v2 for making turbo instead, if they really saw problems with v3? @MahmoudAshraf97 The third point of the list above confirms what you're saying. But I still wonder why we don't have something like …

Edit: Question: what thresholds should be set to get less hallucination in low-volume parts?
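Not an authoritative answer, but these are the transcribe parameters usually tuned when hallucination shows up on quiet or silent stretches. The specific values below are starting-point assumptions (mostly the library defaults), not recommendations from this thread, and the file name is a placeholder:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "quiet_audio.mp3",                       # placeholder file
    vad_filter=True,                         # drop non-speech before decoding
    vad_parameters={"min_silence_duration_ms": 500},
    no_speech_threshold=0.6,                 # segments above this no-speech prob are treated as silence
    log_prob_threshold=-1.0,                 # discard low-confidence decodes
    compression_ratio_threshold=2.4,         # guard against repetitive hallucinated text
    condition_on_previous_text=False,        # stop hallucinations from propagating across windows
)
```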
Number of downloads doesn't imply trustworthiness; deepdml has many more downloads because it was uploaded first and shared more widely than mobiuslabs. But when I chose between the two, I used the OpenAI model as the reference, and mobiuslabs was identical, unlike deepdml. The difference between the two conversions is subtle enough that almost no one will notice any difference in performance except in some edge cases. Systran didn't upload the new model because they are busy with internal projects, so the community took it into their own hands. Having a Systran conversion wouldn't make any difference anyway, because converting a model is a single line of code that anyone can execute.
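For reference, a conversion like that can be done with CTranslate2's Transformers converter. A rough sketch, assuming the ctranslate2 Python package is installed; the output directory name is arbitrary:

```python
from ctranslate2.converters import TransformersConverter

# Convert openai/whisper-large-v3-turbo to the CTranslate2 format used by faster-whisper.
converter = TransformersConverter(
    "openai/whisper-large-v3-turbo",
    copy_files=["tokenizer.json", "preprocessor_config.json"],
)
converter.convert("faster-whisper-large-v3-turbo", quantization="float16")
```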
Thanks so much for verifying @DoS007! I was a bit suspicious of the deepdml model at the time, but unfortunately there were no alternatives. Will use mobiuslabsgmbh/faster-whisper-large-v3-turbo going forward!
Hi, any idea when Turbo V3 will be available in https://pypi.org/project/faster-whisper/? I'm interested in trying it out at https://github.com/runpod-workers/worker-faster_whisper/tree/main. Thank you for all your effort.
@yccheok hopefully within the next two weeks |
#WIP
Benchmark with faster-whisper-large-v3-turbo-ct2
For reference, here's the time and memory usage that are required to transcribe 13 minutes of audio using different implementations:
Large-v3 model on GPU
WER on the LibriSpeech clean validation split.