Continuous batching #1333
@michaelfeil is this related? Yes, vLLM supports continuous batching, but I'm looking to understand whether CTranslate2 can be extended to support it, without using vLLM.
Yes, buffering incoming requests and sending them together is what I meant by static batching. Is 1. not possible today because of an architectural difference between CT2 and HF Transformers, or is it possible in theory but simply not implemented yet?
CT2 was not designed with this feature in mind, so it is not trivial to implement. But it is certainly possible in theory.
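The static batching described above (buffer incoming requests, then send them to the model together) can be sketched roughly as below. This is a minimal illustration, not CTranslate2 code: `translate_batch` here is a placeholder for any batched inference call (e.g. something like `ctranslate2.Translator.translate_batch`), and the class name and flush policy are made up for the example.

```python
import time


class StaticBatcher:
    """Buffer requests; flush when the batch is full or the wait window expires."""

    def __init__(self, translate_batch, max_batch_size=8, max_wait_s=0.05):
        self.translate_batch = translate_batch  # placeholder for a batched model call
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.pending = []          # list of (prompt, callback) pairs
        self.first_arrival = None  # when the oldest pending request arrived

    def submit(self, prompt, callback):
        """Queue one request; flush immediately once the batch is full."""
        if not self.pending:
            self.first_arrival = time.monotonic()
        self.pending.append((prompt, callback))
        if len(self.pending) >= self.max_batch_size:
            self.flush()

    def should_flush(self):
        """True if the batch is full or the oldest request has waited long enough."""
        return bool(self.pending) and (
            len(self.pending) >= self.max_batch_size
            or time.monotonic() - self.first_arrival >= self.max_wait_s
        )

    def flush(self):
        """Run one forward pass over all buffered requests and deliver results."""
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        outputs = self.translate_batch([p for p, _ in batch])
        for (_, cb), out in zip(batch, outputs):
            cb(out)
```

The key limitation this illustrates: every request in the batch waits for the whole forward pass, and new arrivals cannot join a batch that is already running.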
Recently, a lot of benchmarks point to the fact that if you want to serve your models behind an API, continuous batching grants higher throughput and lower latency compared to static batching. Some examples of systems that implement continuous batching:
In order to enable continuous batching, it is necessary to be able to:
Is this concept compatible with the CTranslate2 architecture? I am keen to build an inference engine on top of CTranslate2 and would love to hear some thoughts on this before I dive into it.
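For reference, the core idea of continuous (iteration-level) batching can be sketched as a scheduler that runs one decode step per loop over all live sequences, retires finished ones, and admits waiting requests the moment a slot frees up. Everything here is hypothetical scaffolding: `step_fn` stands in for a per-step batched decode call into a real model, and the names and data layout are invented for the example.

```python
from collections import deque


class ContinuousBatcher:
    """Iteration-level scheduler: one decode step per loop over all live sequences.

    `step_fn(sequences) -> list[next_token]` is a hypothetical batched
    single-step decode; a real engine would call into the model here.
    """

    def __init__(self, step_fn, eos_token, max_live=16):
        self.step_fn = step_fn
        self.eos = eos_token
        self.max_live = max_live
        self.waiting = deque()  # requests not yet admitted: (req_id, tokens)
        self.live = {}          # req_id -> token list currently being decoded
        self.finished = {}      # req_id -> completed token list

    def submit(self, req_id, prompt_tokens):
        self.waiting.append((req_id, list(prompt_tokens)))

    def run_step(self):
        # Admit new requests as soon as slots free up -- the key difference
        # from static batching, where the whole batch must finish first.
        while self.waiting and len(self.live) < self.max_live:
            req_id, toks = self.waiting.popleft()
            self.live[req_id] = toks
        if not self.live:
            return
        ids = list(self.live)
        next_tokens = self.step_fn([self.live[i] for i in ids])
        for req_id, tok in zip(ids, next_tokens):
            self.live[req_id].append(tok)
            if tok == self.eos:  # retire finished sequences immediately
                self.finished[req_id] = self.live.pop(req_id)

    def run_until_done(self):
        while self.waiting or self.live:
            self.run_step()
        return self.finished
```

Because sequences leave and join the batch between steps, short requests do not wait for long ones to finish, which is where the throughput and latency gains over static batching come from.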