
Finetuned large-v3 inference problem #1099

Open
sinisha opened this issue Oct 28, 2024 · 14 comments

Comments


sinisha commented Oct 28, 2024

I am using whisperx for inference (which is built upon faster-whisper).

I have fine-tuned the large-v3 model on 1k hours of domain-specific data. When I run standard inference the results are OK. The fine-tuned model was converted using ctranslate2, but the results obtained with whisperx are almost entirely hallucinations and repetitions (maybe the first couple of phonemes at the beginning are correct). I used the same ctranslate2 command to convert the original large-v3 model, and whisperx inference with that model is correct. The model was fine-tuned with Transformers 4.45.2; I have tried a couple of different Transformers versions at inference time and the results are similar. Has anyone encountered a similar problem?

The problem with hallucinations and repetitions is not sporadic; it happens with every input audio. Hence I am convinced it is some problem in token generation.
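For reference, the conversion step described above typically looks like the following (the thread does not show the exact command used; the paths and quantization setting here are illustrative placeholders):

```shell
# Convert a fine-tuned Hugging Face Whisper checkpoint to CTranslate2 format.
# --model can be a local directory containing the fine-tuned weights;
# the output directory is what faster-whisper / whisperx then load.
ct2-transformers-converter \
    --model /path/to/finetuned-whisper-large-v3 \
    --output_dir /path/to/finetuned-whisper-large-v3-ct2 \
    --copy_files tokenizer.json preprocessor_config.json \
    --quantization float16
```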

@MahmoudAshraf97 (Collaborator)

Do these problems occur only with faster-whisper, or does Transformers have this issue as well?


sinisha commented Oct 28, 2024

> Do these problems occur only with faster-whisper, or does Transformers have this issue as well?

There is no problem with standard inference; the output has no repetitions.


bchinnari commented Oct 29, 2024

Hi @sinisha ... I think I have a similar problem. I have also fine-tuned a (small) model, converted it to CT2 format, and I am trying to use it as part of whisper_streaming (https://github.com/ufal/whisper_streaming, which also uses faster-whisper as the backend).

I found that if the model doesn't output the end-timestamp token, faster-whisper runs into problems. Could you try the following code?

from faster_whisper import WhisperModel

# Paths are placeholders
model_path = "/path/to/your/ct2/model"
model = WhisperModel(model_path, device="cpu", compute_type="int8")
wav = "/path/to/audio.wav"

# Call 1: with word-level timestamps
segments1, info = model.transcribe(wav, task="transcribe", beam_size=5, word_timestamps=True)
for segment in segments1:
    print(segment)

# Call 2: without word-level timestamps
segments2, info = model.transcribe(wav, task="transcribe", beam_size=5)
for segment in segments2:
    print(segment)

Essentially, the only difference between the two transcribe calls is word_timestamps.
If the two calls produce a different number of segments, we may conclude that "not outputting the end-timestamp token" could be the problem.
I think faster-whisper has problems when we need word/segment timestamps but the model doesn't emit that end-timestamp token.
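The check described above can be wrapped in a small helper (a sketch; `count_mismatch` is a hypothetical name, and the two arguments would be the segment iterables returned by the two transcribe() calls):

```python
# Hypothetical helper: compare the number of segments returned by the two
# transcribe() calls (with and without word_timestamps=True). A mismatch
# would suggest the model is not emitting the end-timestamp token reliably.
def count_mismatch(segments_with_ts, segments_without_ts):
    n_with = len(list(segments_with_ts))
    n_without = len(list(segments_without_ts))
    return n_with != n_without, n_with, n_without

# Example with dummy lists standing in for real transcribe() output:
mismatch, a, b = count_mismatch(["seg1", "seg2", "seg3"], ["seg1"])
print(mismatch, a, b)  # True 3 1
```

Note that transcribe() returns a lazy generator, so materializing it with list() is what actually runs the decoding.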


sinisha commented Oct 29, 2024

Hi @bchinnari, thanks for the suggestion. Adding word_timestamps=True really improves everything a lot.
I am still confused about this: I've fine-tuned models before and used them with whisperx without problems.

@bchinnari

OK, that's wonderful. Just re-checking with you: adding the True flag improves everything for you?
It makes everything worse for me; I am observing the opposite of what you are saying :)


sinisha commented Oct 29, 2024

> OK, that's wonderful. Just re-checking with you: adding the True flag improves everything for you? It makes everything worse for me; I am observing the opposite of what you are saying :)

Oh, wait. Setting word_timestamps=True returns more segments, whereas without word_timestamps there is always one segment. And the results only looked OK for one specific run; they differ from run to run.

@bchinnari

Overall, does the True flag make it worse?


sinisha commented Oct 29, 2024

> Overall, does the True flag make it worse?

I can't answer this, since without the True flag I got only one short segment.

@bchinnari

I did not understand.
If you run on multiple audio files, I believe adding the flag makes it worse overall.
Could you test on multiple audio files and let me know? That would also help me make progress on my own work.


sinisha commented Oct 31, 2024

> I did not understand. If you run on multiple audio files, I believe adding the flag makes it worse overall. Could you test on multiple audio files and let me know? That would also help me make progress on my own work.

It gets worse overall, but for some files it suddenly improves a bit.


asr-lord commented Nov 2, 2024

I have the same issue with my own fine-tuned model when I use it with faster-whisper: #987


bchinnari commented Nov 4, 2024

@asr-lord, are you using faster-whisper directly for transcription, or are you using some other repo (like whisperX or whisper_streaming) that uses faster-whisper in the backend?

@bchinnari

@asr-lord, I saw your issue mentioned in #987.
As I said in my previous comments, I believe word_timestamps=True is causing the issues. Try your example with word_timestamps=False and let us know.


asr-lord commented Nov 7, 2024

> @asr-lord, are you using faster-whisper directly for transcription, or are you using some other repo (like whisperX or whisper_streaming) that uses faster-whisper in the backend?

I'm using faster-whisper v1.0.3 directly.

> @asr-lord, I saw your issue mentioned in #987. As I said in my previous comments, I believe word_timestamps=True is causing the issues. Try your example with word_timestamps=False and let us know.

Since I need word timestamps, this can't be the solution for me...
