Batch decoding #588

Closed

Conversation

funboarder13920

Hello,

I wanted to try implementing a batching algorithm in light of the Hugging Face results in this paper: https://arxiv.org/pdf/2311.00430.pdf

This code is not ready for a complete review yet.

I chose to split the audio into batch_size chunks of equal length without any overlap (for now), and to proceed with an algorithm almost identical to the one in faster-whisper in order to keep the prompt mechanism. I kept beam_size at 5 (see the chunking sketch below).

However, I see no inference speed-up at all when changing batch_size...

If anyone wants to take a look and point out potential issues that make the code in this PR inefficient or unable to take advantage of batching, I am all ears.

Best,
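
A minimal sketch of the chunking described above, assuming a NumPy waveform; the helper name is illustrative and not taken from the PR:

import numpy as np

def split_into_chunks(audio: np.ndarray, batch_size: int) -> list:
    # Split the waveform into batch_size contiguous, equally sized chunks
    # with no overlap (the last chunk may be slightly shorter).
    chunk_len = int(np.ceil(len(audio) / batch_size))
    return [audio[i * chunk_len : (i + 1) * chunk_len] for i in range(batch_size)]

# Example: 60 minutes of 16 kHz audio split into 6 chunks of 10 minutes each.
audio = np.zeros(60 * 60 * 16000, dtype=np.float32)
chunks = split_into_chunks(audio, batch_size=6)
assert all(len(chunk) == 10 * 60 * 16000 for chunk in chunks)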

# From the PR diff: narrow the encoder output / prompts to the batch entries
# that are still being decoded once some entries have already finished.
unfinished_encoder_output = encoder_output
unfinished_batch_prompt = batch_prompt
if len(convert_index) != encoder_output.shape[0]:
    if torch_encoder_output is None:
@funboarder13920
Author

Does anyone know how to do that with ct2 efficiently?
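
One possible approach (a hedged sketch, not the PR's code): round-trip through torch using the array interface that StorageView exposes, assuming encoder_output is a CUDA ctranslate2.StorageView and unfinished_indices (an illustrative name) lists the rows still being decoded.

import ctranslate2
import torch

def select_unfinished_rows(encoder_output, unfinished_indices):
    # Zero-copy view of the CUDA StorageView as a torch tensor
    # (StorageView implements __cuda_array_interface__).
    full = torch.as_tensor(encoder_output, device="cuda")
    index = torch.tensor(unfinished_indices, device="cuda")
    subset = full.index_select(0, index).contiguous()
    # Wrap the selected rows back into a StorageView for CTranslate2.
    return ctranslate2.StorageView.from_array(subset)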

@blackpolarz

Take this with a pinch of salt, as I am not an expert in this matter.
From my understanding, the reason batching helps the Hugging Face Whisper implementation is that they first split the audio into chunks, encode them, and decode the chunks in parallel with greedy decoding. This reduces the amount of sequential decoding work over the encoded features, making it significantly faster than their original implementation. (Side note: the original implementation is slow mainly because of the inefficiency of its decoding algorithm.)
However, with this implementation, context (in the form of a prefix) is no longer preserved, as it is impossible for the decoder to gain any knowledge about the previously decoded chunks (other than the initial prompt).
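
For illustration, the Hugging Face chunked/batched setup described above looks roughly like this (model name, chunk length, and batch size are illustrative):

from transformers import pipeline

# Long audio is cut into fixed-length chunks and the chunks are decoded
# in parallel as a batch.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,
    batch_size=8,
    device=0,
)
print(asr("audio.wav")["text"])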

In faster-whisper's implementation, VAD does something similar by splitting the audio into chunks based on pauses/silence. Apart from removing the silent audio, this form of "batching" is somewhat dynamic, allowing the chunks to preserve their completeness. As the chunks are processed sequentially, the context (in the form of a prefix) is preserved, but the advantage of parallel computing is lost.

I suspect the reason you did not gain any significant speed boost is that the underlying decoder (CTranslate2) is still decoding sequentially. Unless you modify the CTranslate2 code to take batches through a buffer, it is unlikely that there will be a significant improvement. In that case, to preserve the prefix, you need another algorithm that can extract the important features from the audio, convert them into tokens, and feed them as a prefix before doing batch decoding in CTranslate2. One way would be to use a very small model with batching to create the prefix, which would then essentially be speculative decoding.

@Purfview
Contributor

Purfview commented Dec 1, 2023

#588 (comment)
Didn't they implement batching there? -> https://github.com/Vaibhavs10/insanely-fast-whisper

keep the prompt mechanism

I think that's impossible, no?

@funboarder13920
Author

funboarder13920 commented Dec 1, 2023

Are you sure the ct2 forward pass does not support batching? I would be very surprised if Guillaume did a for loop over the batches.
Why would it be impossible? You can split the audio into chunks, run all the model generation in batches, and keep the prompting. If it is possible with seq2seq models, I think it is possible with Whisper.
If you look at the PR, you can see that I didn't split the audio into 30s chunks, but into batch_size chunks. For a 60-minute audio with a batch_size of 6, I split the audio into 10-minute segments and decode them in parallel by batching the model calls, hence I can keep the prompt within each 10-minute segment.

insanely-fast-whisper is a CLI for the Hugging Face implementation.
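
For reference, CTranslate2's Whisper bindings do accept batched inputs; a minimal sketch (model path, features, and prompt tokens are illustrative):

import ctranslate2
import numpy as np

# Dummy log-Mel features for two 30s windows, stacked into one batch.
mel_batch = np.random.rand(2, 80, 3000).astype(np.float32)
features = ctranslate2.StorageView.from_array(mel_batch)

model = ctranslate2.models.Whisper("whisper-small-ct2", device="cuda")

# One prompt (a list of tokens) per batch entry; the whole batch goes
# through a single generate call instead of a Python loop.
prompts = [["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]] * 2
results = model.generate(features, prompts, beam_size=5)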

@blackpolarz

It is definitely possible to keep the prompt, but it will be really difficult to keep the prefix. There is a slight difference between the two: since the prompt is given at the start, it is known and can be maintained, whereas the prefix is only known after you decode the audio.
As for whether the ct2 forward pass supports batching, I am not sure, but quoting [guillaumekln] in OpenNMT/CTranslate2#1333, it is possible to batch the inputs together before calling ctranslate2.

@funboarder13920
Author

Have you taken a quick look at the code of this PR?

@blackpolarz

blackpolarz commented Dec 1, 2023

Apologies if I misunderstood your PR. The misunderstanding arose because I read the attached article, where the method employed is splitting into 30s chunks.
I have just reread the code of your PR, though not in depth.
If what you want to do is split the audio into n different parts and then process those parts, wouldn't it be easier to use the multiprocessing module: break the audio into parts with a function, then set up subprocesses to run different instances of ctranslate2? (A rough sketch follows below.)
I will need more time to understand why there is no increase in inference speed.
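
A rough sketch of that suggestion, assuming CPU inference and one audio file per part (paths and model size are illustrative); as the next comment points out, this is process-level parallelism rather than GPU batching:

import multiprocessing as mp

from faster_whisper import WhisperModel

def transcribe_part(part_path):
    # One CTranslate2 model instance per process, each handling one part.
    model = WhisperModel("small", device="cpu", compute_type="int8")
    segments, _ = model.transcribe(part_path)
    return " ".join(segment.text for segment in segments)

if __name__ == "__main__":
    parts = ["part_0.wav", "part_1.wav", "part_2.wav"]
    with mp.Pool(processes=len(parts)) as pool:
        texts = pool.map(transcribe_part, parts)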

@funboarder13920
Author

funboarder13920 commented Dec 1, 2023

Python multiprocessing/subprocessing is unfortunately not a way to handle batching on a GPU / on a CUDA-optimized model.

@blackpolarz

True, multiprocessing is only for CPU.
You might want to include a check for whether the user is running on CPU or GPU; those running on CPU will get errors.
As for batch size, I did see some improvement when I increased the batch size in your code. Going from a batch size of 2 to a batch size of 10, there is some improvement, but nothing significant compared to running without batching. (I am writing to a file, so that could be the bottleneck.)
Code-wise, I haven't found any problem or reason why it is not faster than the version without batching. I will update if I ever find one.

@minhthuc2502
Collaborator

Did you try setting inter_threads to 2? To my knowledge, when you split the input into multiple batches, they go into a queue to be handled by a worker. If you have only 1 worker, the batches are handled sequentially; with 2 workers, I think you can speed up the process.
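
A hedged sketch of that suggestion from the faster-whisper side, where the num_workers argument of WhisperModel maps to CTranslate2's inter_threads (model size is illustrative):

from faster_whisper import WhisperModel

# Two workers allow two batches to be decoded concurrently instead of
# queuing behind a single worker.
model = WhisperModel("small", device="cuda", num_workers=2)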
