-
@pvardanis - did you get any answer for this?
-
I know it is currently possible to start the C++ server and process concurrent requests in parallel, but I cannot find anything similar with the Python bindings without spinning up the C++ server and sending concurrent requests to it from Python.
With vLLM I can serve my model from an async FastAPI server, for example:
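Something along these lines (a minimal sketch, not my exact code; the vLLM class names and the `engine.generate` signature can differ between versions, and the model name is just a placeholder):

```python
import uuid

from fastapi import FastAPI
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams

app = FastAPI()

# One engine instance shared by all requests; vLLM batches concurrent
# generations internally (continuous batching).
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="meta-llama/Llama-2-7b-chat-hf")  # placeholder model
)


@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 256) -> dict:
    sampling_params = SamplingParams(max_tokens=max_tokens)
    request_id = str(uuid.uuid4())

    # engine.generate returns an async generator of partial outputs;
    # iterate to the last item to get the finished completion.
    final_output = None
    async for request_output in engine.generate(prompt, sampling_params, request_id):
        final_output = request_output

    return {"text": final_output.outputs[0].text}
```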
That way I can serve concurrent requests and also take advantage of continuous batching. Is something like this possible with the Python bindings of llama.cpp?