Multi models hpu #575

Merged

merged 5 commits into HabanaAI:multi_model from xuechendi:multi_models_hpu on Dec 4, 2024

Conversation

@xuechendi xuechendi commented Dec 2, 2024

The same code works in the upstream-based vLLM HPU version:
https://github.com/xuechendi/vllm-fork/commits/multi_models_rebase/

This PR is the habana_main-based implementation.

* start server with multi models:
```
VLLM_CONTIGUOUS_PA=false VLLM_SKIP_WARMUP=true python3 -m \
    vllm.entrypoints.openai.mm_api_server \
    --models mistralai/Mistral-7B-Instruct-v0.3 meta-llama/Llama-3.1-8B-Instruct \
    --port 8080 --device hpu --dtype bfloat16 \
    --gpu-memory-utilization=0.3 --use-v2-block-manager --max-model-len 4096 > multi_models.log 2>&1 &
```
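A quick way to confirm both models are being served (a minimal sketch, assuming the mm_api_server entrypoint keeps the standard OpenAI-compatible /v1/models and /v1/completions routes unchanged):
```
# Sketch, not part of this PR: list the served models and send one completion
# request to each of them through the OpenAI-compatible API on port 8080.
curl http://localhost:8080/v1/models

curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3", "prompt": "Hello", "max_tokens": 16}'

curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 16}'
```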
* run test
```
bs=128
in_len=1024
out_len=1024

python benchmarks/benchmark_serving.py \
		--backend vllm \
		--model mistralai/Mistral-7B-Instruct-v0.3 \
		--dataset-name sonnet \
		--dataset-path benchmarks/sonnet.txt \
		--request-rate 512 \
		--num-prompts ${bs} \
		--port 8080 \
		--sonnet-input-len ${in_len} \
		--sonnet-output-len ${out_len} \
		--sonnet-prefix-len 100 \
		--save-result > mistral-sonnet-1.log 2>&1 &

python benchmarks/benchmark_serving.py \
		--backend vllm \
		--model meta-llama/Llama-3.1-8B-Instruct \
		--dataset-name sonnet \
		--dataset-path benchmarks/sonnet.txt \
		--request-rate 512 \
		--num-prompts ${bs} \
		--port 8080 \
		--sonnet-input-len ${in_len} \
		--sonnet-output-len ${out_len} \
		--sonnet-prefix-len 100 \
		--save-result > llama-sonnet-1.log 2>&1 &
```
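Once both background runs finish, the summary metrics that benchmark_serving.py prints can be pulled from the two logs (a minimal sketch; exact metric names depend on the benchmark script version):
```
# Sketch: wait for the two background benchmark jobs, then show the
# throughput/latency summary lines each run wrote to its log.
wait
grep -iE "throughput|latency" mistral-sonnet-1.log llama-sonnet-1.log
```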

mswiniarsk and others added 5 commits December 2, 2024 17:03
Set vllm-hpu-extension to fb36408, which includes support for non-GQA workloads in PipelinedPA
Signed-off-by: Kunshang Ji <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Chendi.Xue <[email protected]>
@michalkuligowski michalkuligowski marked this pull request as draft December 3, 2024 10:25
@xuechendi xuechendi changed the title [WIP]Multi models hpu Multi models hpu Dec 3, 2024
@xuechendi xuechendi marked this pull request as ready for review December 3, 2024 15:05
@michalkuligowski michalkuligowski merged commit c6fe99b into HabanaAI:multi_model Dec 4, 2024
1 check passed
xuechendi added a commit to xuechendi/vllm-fork that referenced this pull request Dec 4, 2024
Signed-off-by: Kunshang Ji <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Chendi.Xue <[email protected]>
Co-authored-by: Marcin Swiniarski <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
@xuechendi xuechendi deleted the multi_models_hpu branch December 19, 2024 21:50