
Adding basic kv-cache transfer to vllm v1 #1

Open · wants to merge 3,384 commits into main
Conversation

@mrn3088 (Collaborator) commented Nov 19, 2024

A working implementation that lets vllm transfer the kv-cache between a prefill engine and a decode engine.
Tested on examples/offline_inference.py.
To run it, use the following commands in two terminals:

Prefill Engine

vllm/examples# VLLM_PORT=47651 python3 offline_inference.py --dist-factor 2 --rank 0 --local-rank 0 --role prefill --max-tokens 1

Decode Engine

vllm/examples# VLLM_PORT=47651 python3 offline_inference.py --dist-factor 2 --rank 1 --local-rank 1 --role decode --max-tokens 16

The first process executes the model for one step (the prefill step), then sends the hidden_state and kv_cache to the second process, which completes the remaining decode steps.
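On the decode side, the receive loop roughly mirrors the prefill-side send quoted in the review thread below. A minimal sketch, not the exact code in this PR: first_decode_step is a hypothetical flag, and the destination tensors must already be allocated with matching shapes and dtypes.

    import torch.distributed as dist

    # Sketch of the decode-side counterpart to the prefill-side send:
    # receive the hidden states first, then each per-layer kv cache in order.
    if role == "decode" and first_decode_step:
        dist.recv(hidden_states, src=0)
        for i in range(len(self.kv_caches)):
            dist.recv(self.kv_caches[i], src=0)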

For now, at least it's working.

Decode Engine Output (looks correct):

Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  3.18it/s, est. speed input: 20.65 toks/s, output: 50.82 toks/s]
Prompt: 'Hello, my name is', Generated text: " Joel. I'm from Massachusetts and live in Melbourne, Australia.\nI'm"
Prompt: 'The president of the United States is', Generated text: ' about to be arrested in Europe for allegedly meddling in the 2016 election.\n\n'
Prompt: 'The capital of France is', Generated text: ' becoming a state of chaos with a significant urban and industrial boom. France’'
Prompt: 'The future of AI is', Generated text: ' not as simple as you think, and you have to understand it in order to'

Things to pay attention to:

  1. The current implementation has not been tested on other examples, especially the online ones. I'll check this later today.
  2. The current kv transfer is not efficient. Ideally we would do a single send/recv that transfers all of the kv_cache along with the hidden_state (see the sketch after this list).
  3. What about more engines? Currently, TP and PP are disabled since they aren't compatible yet.
  4. Lots of hardcoded stuff...
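As a rough illustration of item 2, here is a minimal sketch of packing everything into a single send/recv. The helper names are hypothetical and not part of this PR; it assumes all tensors share a dtype and device and that the receiver knows the shapes in advance.

    import torch
    import torch.distributed as dist

    def send_prefill_state(hidden_states, kv_caches, dst=1):
        # Hypothetical helper: pack hidden_states and every per-layer kv_cache
        # tensor into one contiguous buffer so a single send replaces the
        # per-layer send loop. Assumes all tensors share a dtype and device.
        flat = torch.cat([hidden_states.flatten()] +
                         [kv.flatten() for kv in kv_caches])
        dist.send(flat, dst=dst)

    def recv_prefill_state(hidden_shape, kv_shapes, dtype, device, src=0):
        # The receiver must know the shapes and dtype ahead of time (e.g. from
        # the model config) to allocate one buffer and slice it apart after a
        # single recv.
        numels = [torch.Size(hidden_shape).numel()] + \
                 [torch.Size(s).numel() for s in kv_shapes]
        flat = torch.empty(sum(numels), dtype=dtype, device=device)
        dist.recv(flat, src=src)
        chunks = torch.split(flat, numels)
        hidden_states = chunks[0].view(hidden_shape)
        kv_caches = [c.view(s) for c, s in zip(chunks[1:], kv_shapes)]
        return hidden_states, kv_caches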

if role == "prefill" and prefill_step:
    dist.send(hidden_states, dst=1)
    for i in range(len(self.kv_caches)):
        dist.send(self.kv_caches[i], dst=1)
@Jocn2020 (Collaborator) Nov 19, 2024
Nice work on the rank hack!
Quick comment from me: currently you send the entire kv-cache. The next step we want is to send only the kv-cache for the specific requests' block ids, which you can find in scheduler_output.scheduled_new_reqs or scheduler_output.scheduled_resumed_reqs.
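A rough sketch of that idea (the block_ids attribute and the per-layer cache layout with the block dimension at dim 1 are assumptions for illustration, not necessarily the actual vLLM v1 structures):

    import torch
    import torch.distributed as dist

    def send_kv_for_scheduled_reqs(scheduler_output, kv_caches, dst=1):
        # Gather the block ids of the requests scheduled this step. The
        # `block_ids` attribute name is an assumption based on the comment above.
        block_ids = []
        for req in scheduler_output.scheduled_new_reqs:
            block_ids.extend(req.block_ids)
        index = torch.tensor(block_ids, dtype=torch.long,
                             device=kv_caches[0].device)

        for kv in kv_caches:
            # Assumes a per-layer layout with the block dimension at dim 1,
            # e.g. (2, num_blocks, block_size, num_kv_heads, head_size).
            selected = kv.index_select(1, index).contiguous()
            dist.send(selected, dst=dst)
            # The decode side would need the same block ids (or at least the
            # count) to allocate a matching recv buffer and scatter the
            # blocks back into its own cache.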
