Batch decoding #588
Conversation
unfinished_encoder_output = encoder_output
unfinished_batch_prompt = batch_prompt
if len(convert_index) != encoder_output.shape[0]:
if torch_encoder_output is None:
Does anyone know how to do that with ct2 efficiently?
Take this with a pinch of salt, as I am not an expert in this matter. In faster-whisper's implementation, VAD actually does the same thing by splitting the audio into chunks based on pauses/silence. Apart from removing the silent audio, this form of "batching" is somewhat dynamic, allowing the chunks to preserve their completeness. As this is sequential, it preserves the context (in the form of a prefix) but loses the advantage of parallel computing. I suspect the reason you did not gain any significant boost in speed is that the underlying decoder (CTranslate2) is still decoding sequentially. Unless you modify the CTranslate2 code to take batches through a buffer, it is unlikely that there will be a significant improvement. In that case, to preserve the prefix, you need another algorithm that can extract the important features within the audio, convert them into tokens, and feed them as a prefix before doing batch decoding in CTranslate2. One way would be to use a super tiny model with batching to create the prefix; this method would then amount to speculative decoding.
#588 (comment)
I think that's impossible, no?
Are you sure the ct2 forward pass does not support batching? I would be very surprised if Guillaume did a for loop over the batches. insanely-fast-whisper is a CLI for huggingface.
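For reference, here is a minimal sketch of what a single batched call through CTranslate2's Whisper API could look like, as far as I understand it (the model path, the dummy features, and the prompt token IDs are placeholders, and keyword arguments may differ between CTranslate2 versions):

```python
import numpy as np
import ctranslate2

# Placeholder path to a CTranslate2-converted Whisper model directory.
model = ctranslate2.models.Whisper("whisper-small-ct2", device="cuda")

# Placeholder log-Mel features for 4 windows of 30 s: shape [batch, n_mels, frames].
mel = np.zeros((4, 80, 3000), dtype=np.float32)
features = ctranslate2.StorageView.from_array(mel)

# One prompt (a list of token IDs) per batch entry; a real prompt would contain the
# <|startoftranscript|>, language and task tokens taken from the tokenizer.
prompts = [[50258]] * 4

# Both calls take the whole batch at once -- no Python loop over batch entries.
encoder_output = model.encode(features)
results = model.generate(encoder_output, prompts, beam_size=5)

for result in results:
    print(result.sequences_ids[0])
```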
It is definitely possible to keep the prompt, but it will be really difficult to keep the prefix. There is a slight difference between the two. Since the prompt is given at the start, it is known and can be maintained. The prefix, on the other hand, is only known after you decode the audio.
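As a rough illustration of the distinction, assuming the usual Whisper special-token layout (token strings are written out instead of IDs for readability; this is a sketch, not the actual faster-whisper code):

```python
def build_decoder_input(prompt_tokens, prefix_tokens):
    """Illustrate where prompt vs. prefix tokens land in the decoder input."""
    tokens = []
    if prompt_tokens:
        # "Prompt": text from *previous* windows -- known before decoding starts,
        # so it can still be supplied when several windows are decoded in a batch.
        tokens += ["<|startofprev|>"] + prompt_tokens
    tokens += ["<|startoftranscript|>", "<|en|>", "<|transcribe|>"]
    if prefix_tokens:
        # "Prefix": the forced beginning of the *current* window's transcription --
        # only available once the preceding audio has actually been decoded.
        tokens += prefix_tokens
    return tokens

# Example: previous window said "Hello world", current window is forced to start with "And".
print(build_decoder_input(["Hello", " world"], ["And"]))
```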
Have you taken a quick look at the code of this PR?
Apologies if I misunderstood your PR. The misunderstanding arose because I read the attached article, and the method they employ there is splitting the audio into 30-second chunks.
Python multiprocessing/subprocessing is unfortunately not a way to handle batching on a GPU / on a CUDA-optimized model.
True, multiprocessing is only for CPU.
Did you try to set the
Hello,
I wanted to try implementing a batching algorithm in light of the Hugging Face results in this article: https://arxiv.org/pdf/2311.00430.pdf
This code is not yet ready for a complete review.
I chose to split the audio into batch_size chunks of the same size without any overlap (for now), and to proceed with an algorithm almost identical to the one in faster-whisper in order to keep the prompt mechanism. I kept beam_size at 5.
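For concreteness, the chunking step looks roughly like the sketch below (a simplified stand-in for the actual PR code; the function name and the zero-padding of the tail are illustrative choices):

```python
import numpy as np

def split_into_equal_chunks(audio: np.ndarray, batch_size: int) -> np.ndarray:
    """Split a 1-D audio array into `batch_size` equal, non-overlapping chunks.

    The tail is zero-padded so the chunks can be stacked into a single
    [batch_size, chunk_len] array and fed to the model as one batch.
    """
    chunk_len = int(np.ceil(len(audio) / batch_size))
    padded = np.zeros(batch_size * chunk_len, dtype=audio.dtype)
    padded[: len(audio)] = audio
    return padded.reshape(batch_size, chunk_len)
```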
However, I see no inference speed-up at all when changing the batch_size...
If anyone wants to take a look and point out potential issues that make the code in this PR inefficient / unable to take advantage of the batching, I am all ears.
Best,