Multi models hpu #575

Merged

merged 5 commits into HabanaAI:multi_model from xuechendi:multi_models_hpu on Dec 4, 2024

Conversation

@xuechendi xuechendi commented Dec 2, 2024

The same code works in the upstream-based vLLM HPU version:
https://github.com/xuechendi/vllm-fork/commits/multi_models_rebase/

This PR is the habana_main-based implementation.

* start server with multi models:
```
VLLM_CONTIGUOUS_PA=false VLLM_SKIP_WARMUP=true python3 -m \
    vllm.entrypoints.openai.mm_api_server \
    --models mistralai/Mistral-7B-Instruct-v0.3 meta-llama/Llama-3.1-8B-Instruct \
    --port 8080 --device hpu --dtype bfloat16 \
    --gpu-memory-utilization=0.3 --use-v2-block-manager --max-model-len 4096 > multi_models.log 2>&1 &
```
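A quick way to confirm both models are being served (a minimal sketch, assuming the mm_api_server entrypoint keeps the standard OpenAI-compatible /v1/models and /v1/completions routes unchanged):
```
# Sketch, not part of this PR: list the served models and send one completion
# request to each of them through the OpenAI-compatible API on port 8080.
curl http://localhost:8080/v1/models

curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3", "prompt": "Hello", "max_tokens": 16}'

curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 16}'
```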
* run test
```
bs=128
in_len=1024
out_len=1024

python benchmarks/benchmark_serving.py \
		--backend vllm \
		--model mistralai/Mistral-7B-Instruct-v0.3 \
		--dataset-name sonnet \
		--dataset-path benchmarks/sonnet.txt \
		--request-rate 512 \
		--num-prompts ${bs} \
		--port 8080 \
		--sonnet-input-len ${in_len} \
		--sonnet-output-len ${out_len} \
		--sonnet-prefix-len 100 \
		--save-result > mistral-sonnet-1.log 2>&1 &

python benchmarks/benchmark_serving.py \
		--backend vllm \
		--model meta-llama/Llama-3.1-8B-Instruct \
		--dataset-name sonnet \
		--dataset-path benchmarks/sonnet.txt \
		--request-rate 512 \
		--num-prompts ${bs} \
		--port 8080 \
		--sonnet-input-len ${in_len} \
		--sonnet-output-len ${out_len} \
		--sonnet-prefix-len 100 \
		--save-result > llama-sonnet-1.log 2>&1 &
```
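Once both background runs finish, the summary metrics that benchmark_serving.py prints can be pulled from the two logs (a minimal sketch; exact metric names depend on the benchmark script version):
```
# Sketch: wait for the two background benchmark jobs, then show the
# throughput/latency summary lines each run wrote to its log.
wait
grep -iE "throughput|latency" mistral-sonnet-1.log llama-sonnet-1.log
```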

mswiniarsk and others added 5 commits December 2, 2024 17:03
Set vllm-hpu-extension to fb36408, which includes support for non-GQA workloads in PipelinedPA
Signed-off-by: Kunshang Ji <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Chendi.Xue <[email protected]>
@michalkuligowski michalkuligowski marked this pull request as draft December 3, 2024 10:25
@xuechendi xuechendi changed the title [WIP]Multi models hpu Multi models hpu Dec 3, 2024
@xuechendi xuechendi marked this pull request as ready for review December 3, 2024 15:05
@michalkuligowski michalkuligowski merged commit c6fe99b into HabanaAI:multi_model Dec 4, 2024
1 check passed
xuechendi added a commit to xuechendi/vllm-fork that referenced this pull request Dec 4, 2024
Signed-off-by: Kunshang Ji <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Chendi.Xue <[email protected]>
Co-authored-by: Marcin Swiniarski <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
@xuechendi xuechendi deleted the multi_models_hpu branch December 19, 2024 21:50