Batch decoding #588
Conversation
unfinished_encoder_output = encoder_output
unfinished_batch_prompt = batch_prompt
if len(convert_index) != encoder_output.shape[0]:
if torch_encoder_output is None:
Does anyone know how to do that with ct2 efficiently?
Take this with a pinch of salt, as I am not an expert in this matter. In faster-whisper's implementation, VAD actually does the same thing by splitting the audio into chunks based on pauses/silence. Apart from removing the silent audio, this form of "batching" is somewhat dynamic, allowing the chunks to preserve their completeness. As this is sequential, it preserves the context (in the form of a prefix) but loses the advantage of parallel computing. I suspect the reason you did not gain any significant boost in speed is that the underlying decoder (CTranslate2) is still decoding sequentially. Unless you modify the CTranslate2 code to take batches through a buffer, it is unlikely that there will be a significant improvement. In that case, to preserve the prefix, you need another algorithm that can extract the important features within the audio, convert them into tokens, and feed them as a prefix before doing batch decoding in CTranslate2. One way would be to use a super tiny model with batching to create the prefix; this method would then amount to speculative decoding.
#588 (comment)
I think that's impossible, no?
Are you sure the ct2 forward pass does not support batching? I would be very surprised if Guillaume did a for loop over the batches. insanely-fast-whisper is a CLI for huggingface.
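For reference, here is a minimal sketch of what a single batched call through CTranslate2's Whisper API could look like, as far as I understand it (the model path, the dummy features, and the prompt token IDs are placeholders, and keyword arguments may differ between CTranslate2 versions):

```python
import numpy as np
import ctranslate2

# Placeholder path to a CTranslate2-converted Whisper model directory.
model = ctranslate2.models.Whisper("whisper-small-ct2", device="cuda")

# Placeholder log-Mel features for 4 windows of 30 s: shape [batch, n_mels, frames].
mel = np.zeros((4, 80, 3000), dtype=np.float32)
features = ctranslate2.StorageView.from_array(mel)

# One prompt (a list of token IDs) per batch entry; a real prompt would contain the
# <|startoftranscript|>, language and task tokens taken from the tokenizer.
prompts = [[50258]] * 4

# Both calls take the whole batch at once -- no Python loop over batch entries.
encoder_output = model.encode(features)
results = model.generate(encoder_output, prompts, beam_size=5)

for result in results:
    print(result.sequences_ids[0])
```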
It is definitely possible to keep the prompt, but it will be really difficult to keep the prefix. There is a slight difference between the two. Since the prompt is given at the start, it is known and can be maintained. The prefix, on the other hand, is only known after you decode the audio.
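As a rough illustration of the distinction, assuming the usual Whisper special-token layout (token strings are written out instead of IDs for readability; this is a sketch, not the actual faster-whisper code):

```python
def build_decoder_input(prompt_tokens, prefix_tokens):
    """Illustrate where prompt vs. prefix tokens land in the decoder input."""
    tokens = []
    if prompt_tokens:
        # "Prompt": text from *previous* windows -- known before decoding starts,
        # so it can still be supplied when several windows are decoded in a batch.
        tokens += ["<|startofprev|>"] + prompt_tokens
    tokens += ["<|startoftranscript|>", "<|en|>", "<|transcribe|>"]
    if prefix_tokens:
        # "Prefix": the forced beginning of the *current* window's transcription --
        # only available once the preceding audio has actually been decoded.
        tokens += prefix_tokens
    return tokens

# Example: previous window said "Hello world", current window is forced to start with "And".
print(build_decoder_input(["Hello", " world"], ["And"]))
```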
Have you taken a quick look at the code of this PR?
Apologies if I misunderstood your PR. The misunderstanding arose because I read the attached article, and the method they employ there is splitting the audio into 30-second chunks.
Python multiprocessing/subprocessing is unfortunately not a way to handle batching on a GPU / on a CUDA-optimized model.
True, multiprocessing is only for CPU.
Did you try to set the
Hello,
I wanted to try implementing a batching algorithm in light of the Hugging Face results in this article: https://arxiv.org/pdf/2311.00430.pdf
This code is not yet ready for a complete review.
I chose to split the audio into batch_size chunks of the same size without any overlap (for now), and to proceed with an algorithm almost identical to the one in faster-whisper in order to keep the prompt mechanism. I kept beam_size at 5.
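For concreteness, the chunking step looks roughly like the sketch below (a simplified stand-in for the actual PR code; the function name and the zero-padding of the tail are illustrative choices):

```python
import numpy as np

def split_into_equal_chunks(audio: np.ndarray, batch_size: int) -> np.ndarray:
    """Split a 1-D audio array into `batch_size` equal, non-overlapping chunks.

    The tail is zero-padded so the chunks can be stacked into a single
    [batch_size, chunk_len] array and fed to the model as one batch.
    """
    chunk_len = int(np.ceil(len(audio) / batch_size))
    padded = np.zeros(batch_size * chunk_len, dtype=audio.dtype)
    padded[: len(audio)] = audio
    return padded.reshape(batch_size, chunk_len)
```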
However, I see no inference speed-up at all when changing the batch_size...
If anyone wants to take a look and point out potential issues that make the code in this PR inefficient / unable to take advantage of the batching, I am all ears.
Best,