
Adding basic kv-cache transfer to vllm v1 #1

Open · wants to merge 3,384 commits into main
Conversation

@mrn3088 (Collaborator) commented Nov 19, 2024

A working implementation that lets vllm transfer the kv-cache between a prefill engine and a decode engine.
Tested on examples/offline_inference.py.
To run it, use the following commands in two terminals:

Prefill Engine

vllm/examples# VLLM_PORT=47651 python3 offline_inference.py --dist-factor 2 --rank 0 --local-rank 0 --role prefill --max-tokens 1

Decode Engine

vllm/examples# VLLM_PORT=47651 python3 offline_inference.py --dist-factor 2 --rank 1 --local-rank 1 --role decode --max-tokens 16

The first process executes the model for one step (the prefill step), then sends the hidden_state and kv_cache to the second process, which completes the remaining decode steps.
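On the decode side, the receive loop roughly mirrors the prefill-side send quoted in the review thread below. A minimal sketch, not the exact code in this PR: first_decode_step is a hypothetical flag, and the destination tensors must already be allocated with matching shapes and dtypes.

    import torch.distributed as dist

    # Sketch of the decode-side counterpart to the prefill-side send:
    # receive the hidden states first, then each per-layer kv cache in order.
    if role == "decode" and first_decode_step:
        dist.recv(hidden_states, src=0)
        for i in range(len(self.kv_caches)):
            dist.recv(self.kv_caches[i], src=0)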

For now, at least it's working.

Decode Engine Output (looks correct):

Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  3.18it/s, est. speed input: 20.65 toks/s, output: 50.82 toks/s]
Prompt: 'Hello, my name is', Generated text: " Joel. I'm from Massachusetts and live in Melbourne, Australia.\nI'm"
Prompt: 'The president of the United States is', Generated text: ' about to be arrested in Europe for allegedly meddling in the 2016 election.\n\n'
Prompt: 'The capital of France is', Generated text: ' becoming a state of chaos with a significant urban and industrial boom. France’'
Prompt: 'The future of AI is', Generated text: ' not as simple as you think, and you have to understand it in order to'

Things to pay attention to:

  1. The current implementation has not been tested on other examples, especially the online ones. I'll check this later today.
  2. The current kv transfer is not efficient. Ideally we would do a single send/recv that transfers all of the kv_cache along with the hidden_state (see the sketch after this list).
  3. What about more engines? Currently, TP and PP are disabled since they aren't compatible yet.
  4. Lots of hardcoded stuff...
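As a rough illustration of item 2, here is a minimal sketch of packing everything into a single send/recv. The helper names are hypothetical and not part of this PR; it assumes all tensors share a dtype and device and that the receiver knows the shapes in advance.

    import torch
    import torch.distributed as dist

    def send_prefill_state(hidden_states, kv_caches, dst=1):
        # Hypothetical helper: pack hidden_states and every per-layer kv_cache
        # tensor into one contiguous buffer so a single send replaces the
        # per-layer send loop. Assumes all tensors share a dtype and device.
        flat = torch.cat([hidden_states.flatten()] +
                         [kv.flatten() for kv in kv_caches])
        dist.send(flat, dst=dst)

    def recv_prefill_state(hidden_shape, kv_shapes, dtype, device, src=0):
        # The receiver must know the shapes and dtype ahead of time (e.g. from
        # the model config) to allocate one buffer and slice it apart after a
        # single recv.
        numels = [torch.Size(hidden_shape).numel()] + \
                 [torch.Size(s).numel() for s in kv_shapes]
        flat = torch.empty(sum(numels), dtype=dtype, device=device)
        dist.recv(flat, src=src)
        chunks = torch.split(flat, numels)
        hidden_states = chunks[0].view(hidden_shape)
        kv_caches = [c.view(s) for c, s in zip(chunks[1:], kv_shapes)]
        return hidden_states, kv_caches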

if role == "prefill" and prefill_step:
    dist.send(hidden_states, dst=1)
    for i in range(len(self.kv_caches)):
        dist.send(self.kv_caches[i], dst=1)
@Jocn2020 (Collaborator) Nov 19, 2024
Nice work on the rank hack!
Quick comment from me: currently you send the entire kv-cache. The next step we want is to send only the kv-cache for the specific requests' block ids, which you can find in scheduler_output.scheduled_new_reqs or scheduler_output.scheduled_resumed_reqs.
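A rough sketch of that idea (the block_ids attribute and the per-layer cache layout with the block dimension at dim 1 are assumptions for illustration, not necessarily the actual vLLM v1 structures):

    import torch
    import torch.distributed as dist

    def send_kv_for_scheduled_reqs(scheduler_output, kv_caches, dst=1):
        # Gather the block ids of the requests scheduled this step. The
        # `block_ids` attribute name is an assumption based on the comment above.
        block_ids = []
        for req in scheduler_output.scheduled_new_reqs:
            block_ids.extend(req.block_ids)
        index = torch.tensor(block_ids, dtype=torch.long,
                             device=kv_caches[0].device)

        for kv in kv_caches:
            # Assumes a per-layer layout with the block dimension at dim 1,
            # e.g. (2, num_blocks, block_size, num_kv_heads, head_size).
            selected = kv.index_select(1, index).contiguous()
            dist.send(selected, dst=dst)
            # The decode side would need the same block ids (or at least the
            # count) to allocate a matching recv buffer and scatter the
            # blocks back into its own cache.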
