Benchmark faster whisper turbo v3 #1030
We now support the new whisper-large-v3-turbo on Sieve! Use it via … Just set …
Would be great if medium (or more sizes) were added for comparison! In OpenAI's implementation, turbo is 8x faster than large-v3 (medium is 2x and base is 7x), while offering a WER similar to large-v2, which sounds surreal. I wonder how that translates to the faster-whisper version.
In my test, with the same 10-minute audio, Medium took 52 seconds, and Turbo took 39 seconds.
You may find this discussion helpful:
Compared to Medium, Turbo and large-v3 have subtitle timings that are shifted earlier: subtitles generated by Turbo appear early but end precisely on time. Medium subtitles also appear early, though to a much lesser extent than Turbo; however, Medium subtitles are delayed in disappearing. I find subtitles disappearing late to be a better experience than subtitles appearing early, so I'll keep using Medium.
Or you could use a forced alignment model after the transcription; that gives much better timings than Whisper itself.
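For anyone who wants to try the forced-alignment route, here's a rough sketch that re-aligns faster-whisper output with WhisperX's alignment models. It's a sketch under assumptions, not something from this thread: the file name is a placeholder, and the segment-to-dict conversion is my own glue code.

```python
# Sketch: re-align faster-whisper segments with a forced-alignment model (WhisperX).
# Assumes `pip install whisperx` and a CUDA device; the audio path is a placeholder.
import whisperx
from faster_whisper import WhisperModel

device = "cuda"
audio_path = "speech.wav"  # placeholder

model = WhisperModel("large-v3", device=device, compute_type="float16")
segments, info = model.transcribe(audio_path)

# WhisperX expects plain dicts with start/end/text.
transcript = [{"start": s.start, "end": s.end, "text": s.text} for s in segments]

audio = whisperx.load_audio(audio_path)
align_model, metadata = whisperx.load_align_model(language_code=info.language, device=device)
aligned = whisperx.align(transcript, align_model, metadata, audio, device)

for seg in aligned["segments"]:
    print(f"{seg['start']:.2f} -> {seg['end']:.2f}: {seg['text']}")
```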
Great, thanks for your efforts! I hope it gets added to faster-whisper/faster_whisper/utils.py (lines 12 to 29 in d57c5b4).
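In the meantime, the model can already be loaded by passing a community CTranslate2 conversion's Hugging Face repo id directly to WhisperModel. A minimal sketch; the deepdml repo id below is one of the conversions mentioned later in this thread and could be swapped for another, and the file name is a placeholder:

```python
from faster_whisper import WhisperModel

# Load a community CTranslate2 conversion of large-v3-turbo by its HF repo id,
# since "turbo" is not yet an alias in faster_whisper/utils.py at this point.
model = WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2",
                     device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3")  # placeholder file name
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```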
I benchmarked the models on my laptop using the same audio file in both sequential and batched processing. Seems that …

System Specifications
Benchmark Details
Sequential Processing Benchmark
Observations:
Batched Processing Benchmark

For batched processing, I used 10 batches for each model. I tried to use 16 batches, but some models threw out-of-memory (OOM) errors due to the 6 GB VRAM limit.
Observations:
Conclusions
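For anyone who wants to reproduce a sequential-vs-batched comparison like the one above, here's a minimal sketch using faster-whisper's BatchedInferencePipeline. It assumes a faster-whisper version that ships the batched pipeline; the model repo id and batch size of 10 mirror this thread, and the audio file is a placeholder.

```python
import time
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2",
                     device="cuda", compute_type="float16")

# Sequential transcription.
start = time.perf_counter()
segments, _ = model.transcribe("benchmark.mp3")  # placeholder file
text_sequential = " ".join(s.text for s in segments)  # the generator is consumed here
print(f"sequential: {time.perf_counter() - start:.1f}s")

# Batched transcription (batch_size=10, as in the benchmark above).
batched = BatchedInferencePipeline(model=model)
start = time.perf_counter()
segments, _ = batched.transcribe("benchmark.mp3", batch_size=10)
text_batched = " ".join(s.text for s in segments)
print(f"batched:    {time.perf_counter() - start:.1f}s")
```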
Just chiming in, I've tried using v3-turbo for streaming and found that it hallucinates more/misses audio more than other faster-whisper models. For example, for this 10-second audio clip of an Obama speech:
produces
Whereas using medium:
produces
Not sure if any of you are experiencing anything similar? Or maybe an official faster-whisper turbo-v3 release would perform better.
Haven't encountered that. I've tried the same audio and both models return the same transcription. I did notice that the turbo model hallucinates more on noisy data than v3, but that's to be expected considering what we saw with the Common Voice 15 benchmark.
Sharing the benchmarking results for the Turbo model compared to other large Whisper models on one of the biggest open-source long-form ASR evaluation datasets.
The whisper-turbo model achieves a Word Error Rate (WER) similar to large models and excels in speed among multilingual models.
Also, it does not seem to support the translation task, even though it is mentioned.
They specifically mentioned the translation task being excluded... openai/whisper#2363
Oh okay, I looked at the Hugging Face page and the translation task is mentioned there. Thanks for pointing that out.
Interesting, did you try on the audio I provided here? It's actually remarkably consistent in how it's worse at short audio clips for me (with "deepdml/faster-whisper-large-v3-turbo-ct2").
Yeah, I did specifically try it on your clip with the same model. But I also did it with batched processing and not sequential, so I'm not sure if this is a specific issue with it or not. Batched processing with 6-10 batches works best on my setup and actually provides more accurate transcriptions, as you can see from my little benchmark earlier in this thread, so I use it for everything.
Btw, this happens on the non-turbo v3 model as well. I've tried this with a lot of audio files of variable length and it happens a lot, so I've rolled back to the v2 model.
Thank you so much for recording these benchmarks, I almost cannot believe the speed and quality of these models. I have one request for anyone reading this who is benchmarking: could some of the people performing benchmarks please record the hardware they are benchmarking on if possible? Thanks in advance.
It seems that long-duration audio files cannot be processed correctly. When I tested with an 11-hour MP3 file, the memory usage quickly spiked above 26 GB. After 3 minutes, the CLI displayed "Killed" and then exited.
Is it a problem only with the turbo model?
No, using "large-v3" also doesn't work. The memory usage spikes to 27 GB, and then it shows "Killed":

`segments, info = model.transcribe("11hours.mp3", word_timestamps=True)`
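One possible workaround while the memory behaviour is being investigated (a sketch, not something reported in this thread): decode the long file once and transcribe it in fixed-size chunks so only a bounded slice is featurized at a time. The chunk length is an arbitrary assumption.

```python
from faster_whisper import WhisperModel, decode_audio

SAMPLE_RATE = 16000
CHUNK_SECONDS = 30 * 60  # 30-minute chunks, an arbitrary choice

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
audio = decode_audio("11hours.mp3", sampling_rate=SAMPLE_RATE)  # 1-D float32 array

all_segments = []
chunk_samples = CHUNK_SECONDS * SAMPLE_RATE
for start in range(0, len(audio), chunk_samples):
    chunk = audio[start:start + chunk_samples]
    offset = start / SAMPLE_RATE  # shift timestamps back onto the full-file timeline
    segments, _ = model.transcribe(chunk, word_timestamps=True)
    for s in segments:
        all_segments.append((s.start + offset, s.end + offset, s.text))
```

Note that naive fixed boundaries can cut a word or sentence in half; overlapping chunks or VAD-based split points would be more robust.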
Does torch.stft cause GPU OOM? I know that in the old STFT implementation, the matrix multiplication `mel_spec = self.mel_filters @ magnitudes` would use a large amount of memory for long audio files. For this reason, I wrote a batched version a while back: https://github.com/ben91lin/faster-whisper/blob/mochi/faster_whisper/feature_extractor.py
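To illustrate the idea (this is not the linked implementation itself): the filterbank product can be applied over time chunks so peak memory scales with the chunk size rather than the whole file. The chunk size below is an arbitrary assumption.

```python
import torch

def chunked_mel(mel_filters: torch.Tensor, magnitudes: torch.Tensor,
                chunk_frames: int = 3000) -> torch.Tensor:
    """Apply the mel filterbank chunk by chunk along the time axis.

    mel_filters: (n_mels, n_freq), magnitudes: (n_freq, n_frames).
    Keeps peak memory proportional to chunk_frames instead of n_frames.
    """
    outputs = []
    for start in range(0, magnitudes.shape[-1], chunk_frames):
        chunk = magnitudes[..., start:start + chunk_frames]
        outputs.append(mel_filters @ chunk)
    return torch.cat(outputs, dim=-1)
```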
Why does this repo use …? See code: https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/utils.py HF links: …
Because the …
@tjongsma I could reproduce your findings (with float32 due to my graphics card, but that shouldn't make a difference). Every single point of the following fixes it in my experimenting:
(@usergit @NilaierMusic @zxl777) Large-v3 does work on @tjongsma's clip:
(Due to my graphics card I use float32, but it should probably be the same with float16.) But I see that OpenAI themselves use large-v2 for their API, see here. That could just be because they didn't test large-v3 enough; why wouldn't they use large-v2 for making turbo instead, if they really saw problems with v3? @MahmoudAshraf97 The third point of the list above confirms what you're saying. But I still wonder why we don't have something like …

Edit: Question: what thresholds should be set to get less hallucination in low-volume parts?
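Not an authoritative answer, but these are the transcribe parameters usually tuned when hallucination shows up on quiet or silent stretches. The specific values below are starting-point assumptions (mostly the library defaults), not recommendations from this thread, and the file name is a placeholder:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "quiet_audio.mp3",                       # placeholder file
    vad_filter=True,                         # drop non-speech before decoding
    vad_parameters={"min_silence_duration_ms": 500},
    no_speech_threshold=0.6,                 # segments above this no-speech prob are treated as silence
    log_prob_threshold=-1.0,                 # discard low-confidence decodes
    compression_ratio_threshold=2.4,         # guard against repetitive hallucinated text
    condition_on_previous_text=False,        # stop hallucinations from propagating across windows
)
```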
Number of downloads doesn't imply trustworthiness; deepdml has many more downloads because it was uploaded first and shared more widely than mobiuslabs. But when I chose between the two, I used the OpenAI model as the reference, and mobiuslabs was identical, unlike deepdml. The difference between the two conversions is subtle enough that almost no one will notice any difference in performance except in some edge cases. Systran didn't upload the new model because they are busy with internal projects, so the community took it into their own hands. Having a Systran conversion wouldn't make any difference anyway, because converting a model is a single line of code that anyone can execute.
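For reference, a conversion like that can be done with CTranslate2's Transformers converter. A rough sketch, assuming the ctranslate2 Python package is installed; the output directory name is arbitrary:

```python
from ctranslate2.converters import TransformersConverter

# Convert openai/whisper-large-v3-turbo to the CTranslate2 format used by faster-whisper.
converter = TransformersConverter(
    "openai/whisper-large-v3-turbo",
    copy_files=["tokenizer.json", "preprocessor_config.json"],
)
converter.convert("faster-whisper-large-v3-turbo", quantization="float16")
```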
Thanks so much for verifying @DoS007! I was a bit suspicious of the deepdml model at the time, but unfortunately there were no alternatives. Will use mobiuslabsgmbh/faster-whisper-large-v3-turbo going forward!
Hi, any idea when Turbo V3 will be available in https://pypi.org/project/faster-whisper/? I'm interested in trying it out at https://github.com/runpod-workers/worker-faster_whisper/tree/main. Thank you for all your effort.
@yccheok hopefully within the next two weeks |
#WIP
Benchmark with faster-whisper-large-v3-turbo-ct2
For reference, here's the time and memory usage that are required to transcribe 13 minutes of audio using different implementations:
Large-v3 model on GPU
WER on the LibriSpeech clean validation split.