
Benchmark faster whisper turbo v3 #1030

Open
asr-lord opened this issue Oct 1, 2024 · 29 comments
@asr-lord commented Oct 1, 2024

#WIP

Benchmark with faster-whisper-large-v3-turbo-ct2

For reference, here's the time and memory usage required to transcribe 13 minutes of audio using different implementations:

Large models on GPU

| Implementation | Precision | Beam size | Time | Max. GPU memory | Max. CPU memory | WER (%) |
|---|---|---|---|---|---|---|
| openai/whisper-large-v3 | fp16 | 5 | 2m23s | MB | MB | |
| openai/whisper-turbo | fp16 | 5 | 39s | MB | MB | |
| faster-whisper | fp16 | 5 | 52.023s | 4521 MB | 901 MB | 2.883 |
| faster-whisper | int8 | 5 | 52.639s | 2953 MB | 2261 MB | 4.594 |
| faster-distil-large-v3 | fp16 | 5 | 26.126s | 2409 MB | 900 MB | 2.392 |
| faster-distil-large-v3 | int8 | 5 | 22.537s | 1481 MB | 1468 MB | 2.392 |
| faster-large-v3-turbo | fp16 | 5 | 19.155s | 2537 MB | 899 MB | 1.919 |
| faster-large-v3-turbo | int8 | 5 | 19.591s | 1545 MB | 1526 MB | 1.919 |

WER on the LibriSpeech clean validation split.
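For anyone wanting to reproduce numbers like these, here is a minimal sketch of one way to measure them. The file name is hypothetical, and GPU memory is sampled with pynvml because CTranslate2 does not allocate through PyTorch, so torch.cuda.max_memory_allocated() would not see it:

```python
import time
import pynvml
from faster_whisper import WhisperModel

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

model = WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2",
                     device="cuda", compute_type="float16")

start = time.perf_counter()
segments, info = model.transcribe("13min_audio.wav", beam_size=5)
text = " ".join(s.text for s in segments)  # segments is a generator; consume it to run the decode
elapsed = time.perf_counter() - start

# GPU memory in use after transcription (a rough proxy; poll in a background
# thread if you need the true peak)
used_mb = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2
print(f"time: {elapsed:.1f}s, GPU memory in use: {used_mb:.0f} MB")
```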

@mvoodarla

We now support the new whisper-large-v3-turbo on Sieve!

Use it via sieve/speech_transcriber: https://www.sievedata.com/functions/sieve/speech_transcriber
Use sieve/whisper directly: https://www.sievedata.com/functions/sieve/whisper

Just set speed_boost to True. The API guide is under the "Usage Guide" tab.

@George0828Zhang commented Oct 2, 2024

Would be great if medium (or more sizes) were added for comparison! In OpenAI's implementation, turbo is 8x faster than v3 (medium is 2x and base is 7x), while offering WER similar to large-v2, which sounds surreal. I wonder how that translates to the FW version.

@zx3777 commented Oct 2, 2024

> Would be great if medium (or more sizes) were added for comparison! In OpenAI's implementation, turbo is 8x faster than v3 (medium is 2x and base is 7x), while offering WER similar to large-v2, which sounds surreal. I wonder how that translates to the FW version.

In my test, with the same 10-minute audio, Medium took 52 seconds, and Turbo took 39 seconds.

@createOne999

You may find this discussion helpful:
openai/whisper#2363 (comment)


@zx3777 commented Oct 2, 2024

Compared to Medium, Turbo and large-v3 have timestamps that are shifted earlier. Subtitles generated by Turbo appear earlier but end precisely on time. Medium subtitles also appear early, though to a much lesser extent than Turbo; however, Medium subtitles are late to disappear.

I find subtitles that disappear late to be a better experience than ones that appear early, so I will stick with Medium.

@MahmoudAshraf97 (Collaborator)

> Compared to Medium, Turbo and large-v3 have timestamps that are shifted earlier. [...] I find subtitles that disappear late to be a better experience than ones that appear early, so I will stick with Medium.

Or you could use a forced-alignment model after transcription; that gives much better timings than Whisper's own.
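A rough sketch of that approach, assuming WhisperX's wav2vec2-based aligner and its documented load_align_model/align API (the audio path is hypothetical):

```python
# Re-time faster-whisper segments with a forced aligner (pip install whisperx).
import whisperx
from faster_whisper import WhisperModel

device = "cuda"
model = WhisperModel("large-v3", device=device, compute_type="float16")
segments, info = model.transcribe("audio.wav", language="en")

# WhisperX expects plain dicts with text/start/end
transcript = [{"text": s.text, "start": s.start, "end": s.end} for s in segments]

audio = whisperx.load_audio("audio.wav")
align_model, metadata = whisperx.load_align_model(language_code="en", device=device)
aligned = whisperx.align(transcript, align_model, metadata, audio, device)

for seg in aligned["segments"]:
    print(f"[{seg['start']:.2f}s -> {seg['end']:.2f}s] {seg['text']}")
```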

@jhj0517 commented Oct 2, 2024

Great, thanks for your efforts! I hope turbo will be officially added here soon!

_MODELS = {
"tiny.en": "Systran/faster-whisper-tiny.en",
"tiny": "Systran/faster-whisper-tiny",
"base.en": "Systran/faster-whisper-base.en",
"base": "Systran/faster-whisper-base",
"small.en": "Systran/faster-whisper-small.en",
"small": "Systran/faster-whisper-small",
"medium.en": "Systran/faster-whisper-medium.en",
"medium": "Systran/faster-whisper-medium",
"large-v1": "Systran/faster-whisper-large-v1",
"large-v2": "Systran/faster-whisper-large-v2",
"large-v3": "Systran/faster-whisper-large-v3",
"large": "Systran/faster-whisper-large-v3",
"distil-large-v2": "Systran/faster-distil-whisper-large-v2",
"distil-medium.en": "Systran/faster-distil-whisper-medium.en",
"distil-small.en": "Systran/faster-distil-whisper-small.en",
"distil-large-v3": "Systran/faster-distil-whisper-large-v3",
}
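Until an official alias is added there, a Hugging Face repo id can already be passed directly as model_size_or_path, which is what several comments in this thread do; a minimal example (the file name is hypothetical):

```python
from faster_whisper import WhisperModel

# Any CTranslate2 conversion on the Hub can be loaded directly until an
# official "large-v3-turbo" alias lands in _MODELS.
model = WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2",
                     device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav")
```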

@NilaierMusic commented Oct 2, 2024

I benchmarked the models on my laptop using the same audio file in both sequential and batched processing. Seems that large-v3-turbo generally performs exceptionally well, offering greater accuracy than the base model while maintaining efficient processing times.

System Specifications

  • CPU: Intel Core i7-12650H
  • GPU: NVIDIA GeForce RTX 3060 Laptop (6 GB VRAM)
  • RAM: SODIMM Samsung DDR4 8x2 GB 3200 MHz

Benchmark Details

  • All models were tested with int8 precision.
  • WER (Word Error Rate) was calculated by comparing the original French subtitles of a video with the transcriptions generated by the models (see the sketch below).
  • The language was explicitly set to French to prevent any translation errors or incorrect transcriptions.
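For context, a minimal sketch of that WER computation, assuming the jiwer package (the strings are placeholders):

```python
import jiwer

reference = "texte des sous-titres français d'origine"
hypothesis = "texte transcrit par le modèle"
print(f"WER: {jiwer.wer(reference, hypothesis) * 100:.1f}%")
```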

Sequential Processing Benchmark

| Model | WER (%) | Total Time (s) | Transcribe Time (s) | Model Load Time (s) |
|---|---|---|---|---|
| tiny | 24.1 | 28.95 | 28.44 | 0.51 |
| base | 16.0 | 33.42 | 32.72 | 0.70 |
| small | 10.5 | 55.62 | 53.21 | 2.41 |
| medium | 10.7 | 113.25 | 106.30 | 6.95 |
| large | 17.6 | 240.52 | 227.31 | 13.20 |
| large-v1 | 8.7 | 168.58 | 155.14 | 13.44 |
| large-v2 | 8.5 | 178.28 | 164.74 | 13.53 |
| large-v3 | 17.6 | 230.77 | 217.43 | 13.34 |
| large-v3-turbo | 9.5 | 46.14 | 38.99 | 7.15 |

Observations:

  • The large-v3-turbo model achieves a WER of 9.5%, which is significantly better than the base model and comparable to large-v2.
  • In terms of speed, large-v3-turbo completes transcription in 38.99 seconds, much faster than other large models.

Batched Processing Benchmark

For batched processing, I used a batch size of 10 for each model. I tried a batch size of 16, but some models threw out-of-memory (OOM) errors due to the 6 GB VRAM limit (see the batched-inference sketch at the end of this comment).

| Model | WER (%) | Total Time (s) | Transcribe Time (s) | Model Load Time (s) |
|---|---|---|---|---|
| tiny | 23.6 | 5.48 | 4.56 | 0.92 |
| base | 16.5 | 6.92 | 5.70 | 1.22 |
| small | 9.8 | 12.45 | 9.98 | 2.47 |
| medium | 8.9 | 26.33 | 19.47 | 6.86 |
| large | 7.9 | 35.97 | 29.66 | 6.31 |
| large-v1 | 12.1 | 42.90 | 29.64 | 13.26 |
| large-v2 | 8.8 | 43.17 | 29.71 | 13.46 |
| large-v3 | 7.9 | 42.97 | 29.69 | 13.28 |
| large-v3-turbo | 7.7 | 18.68 | 11.47 | 7.20 |

Observations:

  • With batched processing, large-v3-turbo achieves the best WER of 7.7%, outperforming all other models in both accuracy and speed.
  • The transcribe time for large-v3-turbo is 11.47 seconds, making it suitable for real-time applications even on a laptop GPU.

Conclusions

  • The large-v3-turbo model offers an excellent balance between accuracy and processing speed, especially evident in batched processing scenarios.
  • It outperforms the base model in terms of WER while maintaining significantly lower processing times compared to other large models.
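For anyone reproducing the batched runs above, a minimal sketch of the batched pipeline (the wrapper ships with recent faster-whisper versions; depending on the release you may need to install from the repository, and the model choice and file name are illustrative):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2",
                     device="cuda", compute_type="int8")
batched = BatchedInferencePipeline(model=model)

# batch_size=10 mirrors the runs above; 16 hit OOM on a 6 GB GPU
segments, info = batched.transcribe("audio.wav", batch_size=10, language="fr")
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```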

@tjongsma commented Oct 4, 2024

Just chiming in, I've tried using v3-turbo for streaming and found that it hallucinates more/misses audio more than other faster-whisper models. For example, for this 10-second audio clip of an Obama speech:
temp_audio_wav.zip
Using v3-turbo:

from faster_whisper import WhisperModel

model = WhisperModel(model_size_or_path="deepdml/faster-whisper-large-v3-turbo-ct2",
                     device="cuda", compute_type="float16")

segments, info = model.transcribe("temp_audio.wav",
                                  initial_prompt="",
                                  max_new_tokens=224,
                                  beam_size=5,
                                  temperature=0,
                                  language="en",
                                  word_timestamps=True,
                                  vad_filter=False)
for seg in segments:
    print("[%.2fs -> %.2fs] %s" % (seg.start, seg.end, seg.text))

produces

[0.00s -> 9.66s] ...to give the president a chance.

Whereas using medium:

from faster_whisper import WhisperModel

model = WhisperModel(model_size_or_path="medium",
                     device="cuda", compute_type="float16")

segments, info = model.transcribe("temp_audio.wav",
                                  initial_prompt="",
                                  max_new_tokens=224,
                                  beam_size=5,
                                  temperature=0,
                                  language="en",
                                  word_timestamps=True,
                                  vad_filter=False)
for seg in segments:
    print("[%.2fs -> %.2fs] %s" % (seg.start, seg.end, seg.text))

produces

[0.00s -> 1.70s] So give the president a chance.
[2.00s -> 4.96s] Governor Romney, I'm glad that you recognize that al-Qaeda is a threat.
[5.38s -> 9.66s] Because a few months ago, when you were asked what's the biggest geopolitical threat facing America, you said...

Is anyone else experiencing anything similar? Or maybe an official faster-whisper large-v3-turbo release would perform better.

@NilaierMusic

> Just chiming in, I've tried using v3-turbo for streaming and found that it hallucinates more/misses audio more than other faster-whisper models. [...] Is anyone else experiencing anything similar? Or maybe an official faster-whisper large-v3-turbo release would perform better.

Haven't encountered that. I've tried the same audio and both models return the same transcription. I did notice that the turbo model hallucinates more on noisy data than v3, but that's to be expected considering what we saw with the Common Voice 15 benchmark.

@Jiltseb (Contributor) commented Oct 5, 2024

Sharing benchmarking results for the Turbo model compared to other large Whisper models on one of the biggest open-source long-form ASR evaluation datasets.
Our tests were conducted on a subset of YouTube-Commons (youtube-commons-asr-eval).

| Model | WER | Speed |
|---|---|---|
| large-v3-turbo | 13.40% | 129.5x |
| large-v3 | 13.20% | 55.3x |
| large-v2 | 14.10% | 54.6x |
| distil-large-v3 (en) | 15.00% | 142.9x |

The whisper-turbo model achieves a Word Error Rate (WER) similar to large models and excels in speed among multilingual models.

@Sharrnah commented Oct 5, 2024

Also, it does not seem to support the translation task, even though it is mentioned.
I also tried it with the Transformers large-v3-turbo; same behaviour.

@George0828Zhang

> Also, it does not seem to support the translation task, even though it is mentioned. I also tried it with the Transformers large-v3-turbo; same behaviour.

They specifically mentioned the translation task being excluded... openai/whisper#2363

> excluding translation data, on which we don't expect turbo to perform well.
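For reference, the task being discussed is selected like this in faster-whisper (a minimal sketch; the model choice and file name are illustrative, and poor output is expected for turbo given the excluded translation data):

```python
from faster_whisper import WhisperModel

model = WhisperModel("mobiuslabsgmbh/faster-whisper-large-v3-turbo",
                     device="cuda", compute_type="float16")

# task="translate" asks Whisper for X->English translation; large-v3-turbo was
# fine-tuned without translation data, so weak results are expected here.
segments, info = model.transcribe("german_audio.wav", task="translate")
print(" ".join(s.text for s in segments))
```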

@Sharrnah commented Oct 6, 2024

Oh, okay. I looked at the Hugging Face page and the translation task is mentioned there.
But maybe it was just copy-pasted from the original large-v3 model.

Thanks for pointing that out.

@tjongsma commented Oct 7, 2024

> Just chiming in, I've tried using v3-turbo for streaming and found that it hallucinates more/misses audio more than other faster-whisper models. [...]

> Haven't encountered that. I've tried the same audio and both models return the same transcription. I did notice that the turbo model hallucinates more on noisy data than v3, but that's to be expected considering what we saw with the Common Voice 15 benchmark.

Interesting, did you try it on the audio I provided here? It's actually remarkably consistent in how it's worse at short audio clips for me (with deepdml/faster-whisper-large-v3-turbo-ct2).

@NilaierMusic

> Interesting, did you try it on the audio I provided here? It's actually remarkably consistent in how it's worse at short audio clips for me (with deepdml/faster-whisper-large-v3-turbo-ct2).

Yeah, I did specifically try it on your clip with the same model. But I also did it with batched processing rather than sequential, so I'm not sure whether this is an issue specific to that or not. Batched processing with a batch size of 6-10 works best on my setup and actually provides more accurate transcriptions, as you can see from my little benchmark earlier in this thread, so I use it for everything.

@usergit commented Oct 8, 2024

> Just chiming in, I've tried using v3-turbo for streaming and found that it hallucinates more/misses audio more than other faster-whisper models. [...] Is anyone else experiencing anything similar? Or maybe an official faster-whisper large-v3-turbo release would perform better.

Btw, this happens with the non-turbo v3 model as well. I've tried it with a lot of audio files of variable length and it happens a lot, so I've rolled back to the v2 model.

@klebster2

Thank you so much for recording these benchmarks, I almost cannot believe the speed and quality of these models.

I have one request - for anyone reading this who is benchmarking.

Could some of the people performing benchmarks please record the hardware they are benchmarking on, if possible?
GPU, CPU, and RAM. Such information will help estimate the minimal cost of developing a real-time ASR application, e.g. the smallest hardware budget that can run large-v3-turbo.

Thanks in advance.

@zxl777 commented Oct 9, 2024

It seems that long-duration audio files cannot be processed correctly.

When I tested with an 11-hour MP3 file, the memory usage quickly spiked above 26 GB. After 3 minutes, the CLI displayed "Killed" and then exited.

@Jiltseb (Contributor) commented Oct 9, 2024

> It seems that long-duration audio files cannot be processed correctly. When I tested with an 11-hour MP3 file, the memory usage quickly spiked above 26 GB. After 3 minutes, the CLI displayed "Killed" and then exited.

Is it a problem only with the Turbo model?

@zxl777 commented Oct 9, 2024

> Is it a problem only with the Turbo model?

No, using “large-v3” also doesn’t work. The memory usage spikes to 27GB, and then it shows “Killed.”

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("11hours.mp3", word_timestamps=True)
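One possible workaround until the feature extraction handles very long files more frugally: decode the audio once and feed fixed-size slices to transcribe(), which also accepts NumPy arrays. A rough sketch, assuming faster_whisper's decode_audio helper (the chunk length and file name are illustrative, and the naive boundaries can split words across chunks):

```python
from faster_whisper import WhisperModel, decode_audio

SAMPLE_RATE = 16000
CHUNK_SECONDS = 30 * 60  # 30-minute slices

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
audio = decode_audio("11hours.mp3", sampling_rate=SAMPLE_RATE)  # float32 waveform

for start in range(0, len(audio), CHUNK_SECONDS * SAMPLE_RATE):
    chunk = audio[start:start + CHUNK_SECONDS * SAMPLE_RATE]
    offset = start / SAMPLE_RATE
    segments, info = model.transcribe(chunk, word_timestamps=True)
    for seg in segments:
        print(f"[{offset + seg.start:.2f}s -> {offset + seg.end:.2f}s] {seg.text}")
```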

@ben91lin (Contributor)

Does torch.stft cause the GPU OOM? I know that in the old STFT path, the matrix multiplication mel_spec = self.mel_filters @ magnitudes would use a large amount of memory for long audio files. For this reason, I've written a batched version before:

https://github.com/ben91lin/faster-whisper/blob/mochi/faster_whisper/feature_extractor.py
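A quick back-of-the-envelope, assuming Whisper's standard front end (16 kHz, hop 160, n_fft 400, 128 mel bins for large-v3), shows why an 11-hour file balloons in the non-batched path:

```python
# Illustrative arithmetic only; exact peaks depend on the implementation.
samples = 11 * 3600 * 16000                      # ~634 M samples (~2.5 GB as float32)
frames = samples // 160                          # ~3.96 M STFT frames
stft_bins = 400 // 2 + 1                         # 201 frequency bins
magnitudes_gb = frames * stft_bins * 4 / 1e9     # ~3.2 GB for the magnitude matrix
mel_gb = frames * 128 * 4 / 1e9                  # ~2.0 GB for the mel spectrogram
print(frames, round(magnitudes_gb, 1), round(mel_gb, 1))
# Intermediate copies (complex STFT output, padding, log/clamp temporaries) can
# multiply this several times over, consistent with the spike reported above.
```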


@DoS007 commented Oct 27, 2024

Why does this repo use mobiuslabsgmbh/faster-whisper-large-v3-turbo and not deepdml/faster-whisper-large-v3-turbo-ct2? And why not something like Systran/faster-whisper-v3-turbo?

See Code: https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/utils.py

HF-Links:
https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2
https://huggingface.co/mobiuslabsgmbh/faster-whisper-large-v3-turbo

@MahmoudAshraf97 (Collaborator)

Because the deepdml conversion has wrong alignment heads and tokenizer config; this mainly affects word timestamps. The Mobius Labs conversion is closer to the official one.

@DoS007 commented Oct 27, 2024

> Just chiming in, I've tried using v3-turbo for streaming and found that it hallucinates more/misses audio more than other faster-whisper models. For example, for this 10-second audio clip of an Obama speech: [...]

@tjongsma I could reproduce your findings (with float32 due to my graphics card, but shouldn't make a difference).

Each of the following fixes it in my experiments:

  • Use initial_prompt="The following is a speech:"
  • Use word_timestamps=False
  • Use mobiuslabsgmbh/faster-whisper-large-v3-turbo instead of deepdml/faster-whisper-large-v3-turbo-ct2
  • (Adding 5 seconds of silence at the beginning of the audio file makes it somewhat better, but then "to give the president a chance." is missed.)

(@usergit @NilaierMusic @zxl777) Large-v3 does work on @tjongsma's clip:

[0.00s -> 1.58s] So give the president a chance.
[1.68s -> 4.76s] Governor Romney, I'm glad that you recognize that al-Qaeda is a threat.
[4.92s -> 9.58s] Because a few months ago, when you were asked what's the biggest geopolitical threat facing America, you said

(Due to my graphics card I use float32, but it should probably be the same with float16.) I see that OpenAI themselves use large-v2 for their API (see here), but that could be just because they didn't test large-v3 enough.

Why wouldn't they use large-v2 for making the turbo instead if they really saw problems with v3?

@MahmoudAshraf97 The third point in the list above confirms what you're saying. But I still wonder why we don't have something like Systran/faster-whisper-v3-turbo, as for the other model sizes. Also, is mobiuslabsgmbh trustworthy enough (fewer downloads)?

Edit: Question: what thresholds should I set to get fewer hallucinations in low-volume parts?
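Putting the mitigations above together with the usual anti-hallucination knobs in faster-whisper, a sketch of a more conservative call (the threshold values are illustrative starting points, not tuned settings):

```python
from faster_whisper import WhisperModel

model = WhisperModel("mobiuslabsgmbh/faster-whisper-large-v3-turbo",
                     device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "temp_audio.wav",
    language="en",
    beam_size=5,
    initial_prompt="The following is a speech:",
    word_timestamps=False,
    # knobs commonly adjusted to suppress output on quiet or non-speech parts;
    # values here are illustrative starting points
    vad_filter=True,
    no_speech_threshold=0.6,
    log_prob_threshold=-1.0,
    compression_ratio_threshold=2.4,
)
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```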

@MahmoudAshraf97 (Collaborator)

The number of downloads doesn't imply trustworthiness. deepdml has far more downloads because it was uploaded first and was shared more widely than mobiuslabs. But when I chose between the two, I used the OpenAI model as the reference, and mobiuslabs was identical to it, unlike deepdml. The difference between the two conversions is so subtle that almost no one will notice any difference in performance except in some edge cases. Systran didn't upload the new model because they are busy with internal projects, so the community took it into their own hands. Having a Systran conversion wouldn't make any difference, though, because converting a model is a single line of code that anyone can execute.
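For reference, that "single line of code" conversion looks roughly like this, assuming CTranslate2's Transformers converter API (the ct2-transformers-converter CLI from the faster-whisper README is equivalent):

```python
# Sketch: convert the official checkpoint to a CTranslate2 model that
# faster-whisper can load; output directory and quantization are choices.
import ctranslate2

converter = ctranslate2.converters.TransformersConverter(
    "openai/whisper-large-v3-turbo",
    copy_files=["tokenizer.json", "preprocessor_config.json"],
)
converter.convert("faster-whisper-large-v3-turbo", quantization="float16")
```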

@tjongsma

> @tjongsma I could reproduce your findings (with float32 due to my graphics card, but shouldn't make a difference). Each of the following fixes it in my experiments: [...] Use mobiuslabsgmbh/faster-whisper-large-v3-turbo instead of deepdml/faster-whisper-large-v3-turbo-ct2 [...]

Thanks so much for verifying, @DoS007! I was a bit suspicious of the deepdml model at the time, but unfortunately there were no alternatives. Will use mobiuslabsgmbh/faster-whisper-large-v3-turbo going forward!

@yccheok commented Oct 28, 2024

Hi,

Any idea when Turbo V3 will be available in https://pypi.org/project/faster-whisper/ ?

I am interested in trying it out at https://github.com/runpod-workers/worker-faster_whisper/tree/main

Thank you for all your effort.

@MahmoudAshraf97 (Collaborator)

@yccheok hopefully within the next two weeks
