-
@pvardanis - did you get any answer for this?
-
I know it is currently possible to start the C++ server and process concurrent requests in parallel, but I cannot find anything similar with the Python bindings without spinning up the C++ server and sending concurrent requests to it from Python.
With vLLM I can serve my model from an async FastAPI server, for example:
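Something along these lines (a minimal sketch, not my exact code; the vLLM class names and the `engine.generate` signature can differ between versions, and the model name is just a placeholder):

```python
import uuid

from fastapi import FastAPI
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams

app = FastAPI()

# One engine instance shared by all requests; vLLM batches concurrent
# generations internally (continuous batching).
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="meta-llama/Llama-2-7b-chat-hf")  # placeholder model
)


@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 256) -> dict:
    sampling_params = SamplingParams(max_tokens=max_tokens)
    request_id = str(uuid.uuid4())

    # engine.generate returns an async generator of partial outputs;
    # iterate to the last item to get the finished completion.
    final_output = None
    async for request_output in engine.generate(prompt, sampling_params, request_id):
        final_output = request_output

    return {"text": final_output.outputs[0].text}
```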
That way I can serve concurrent requests and also take advantage of continuous batching. Is something like this possible with the Python bindings of llama.cpp?