Set vllm-hpu-extension to c2cd742 #588

Closed · wants to merge 678 commits

Conversation

szutenberg

joerunde and others added 30 commits November 6, 2024 11:57
Signed-off-by: Kunshang Ji <[email protected]>
Signed-off-by: yan ma <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Add multi-step scheduling scenario to Jenkins CI
…t#9506)

Signed-off-by: Flavia Beo <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Co-authored-by: Max de Bayser <[email protected]>
Req - https://jira.habana-labs.com/browse/REQ-289 => target for 1.19

TODO:
- One hardcoded reference to HPUWorker remains and needs to be removed

Next Steps:

- 1. Submit the necessary code changes to the vllm upstream branch => WIP
- 2. Support all 3 draft_model_types: mlp_speculator, medusa, and others
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: DarkLight1337 <[email protected]>
Co-authored-by: litianjian <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
nirda7 and others added 28 commits November 25, 2024 09:41
Fixes issue with multi LoRA during `profile_run`.
We are seeing a 10% performance regression in llama-based models due to
vllm-project#10239. The mark_step()
function needs to be configured differently for each model to achieve
the best performance. For some models, calling mark_step() on every decoder
step is optimal, but for other models it is better to run it every
n-th step. We are adding a counter to only register the hook for every
n-th step, which can be configured with VLLM_CONFIG_HIDDEN_LAYERS.
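A minimal sketch of the counter-based hook described above, assuming a forward-hook registration and a default of 1 (these details are illustrative, not the exact vLLM implementation):

```python
import os
import habana_frameworks.torch as htorch  # HPU bridge, assumed available

# Run mark_step() only after every n-th decoder layer; n comes from
# VLLM_CONFIG_HIDDEN_LAYERS (a default of 1 here means "every layer").
_N = int(os.environ.get("VLLM_CONFIG_HIDDEN_LAYERS", "1"))
_layer_counter = 0

def _maybe_mark_step(module, inputs, outputs):
    """Forward hook that triggers HPU graph execution every n-th layer."""
    global _layer_counter
    _layer_counter += 1
    if _layer_counter % _N == 0:
        htorch.core.mark_step()
    return outputs

# Hypothetical registration on each decoder layer:
# for layer in model.model.layers:
#     layer.register_forward_hook(_maybe_mark_step)
```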
Makes sure the model runs on Habana devices. The original code did not run
due to an error in the split_qkv code: parameter unpacking assumed there was
no batch dimension. Inference was tested with these changes, and InternLM2
works on Gaudi2 as expected.

---------

Co-authored-by: Stan Kirdey <[email protected]>
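A hedged sketch of the kind of fix described above: unpack q/k/v along the trailing dimensions so the code works with or without a leading batch dimension. The shapes, argument names, and reshaping below are illustrative assumptions, not the exact InternLM2 implementation:

```python
import torch

def split_qkv(qkv: torch.Tensor, num_kv_heads: int, q_per_kv: int, head_dim: int):
    # qkv: (..., num_kv_heads * (q_per_kv + 2) * head_dim); any leading
    # batch/sequence dimensions are preserved because only the trailing
    # dimension is viewed and split.
    qkv = qkv.view(*qkv.shape[:-1], num_kv_heads, q_per_kv + 2, head_dim)
    q, k, v = torch.split(qkv, [q_per_kv, 1, 1], dim=-2)
    q = q.reshape(*q.shape[:-3], -1)   # (..., num_q_heads * head_dim)
    k = k.reshape(*k.shape[:-3], -1)   # (..., num_kv_heads * head_dim)
    v = v.reshape(*v.shape[:-3], -1)
    return q, k, v
```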
…556)

This PR adds the configurable device argument as 'hpu' to the test
test_lora_manager_hpu.py, with respect to the changes in vllm-project#10223.
…de, fix in this PR (#523)

Noticed that Spec Decode produced incorrect results after the rebase to 0.6.4.

Identified the root causes and fixed them in this PR:
1. Incorrect return value position in batch_expansion.py
2. ContinuousPA generates faulty results in spec decode

CI added: #524

---------

Signed-off-by: Chendi.Xue <[email protected]>
Fix for the wrong flavor in CI config
Update vllm-hpu-extension to 50e10ea, which introduces PipelinedPA:
HabanaAI/vllm-hpu-extension#42
Set vllm-hpu-extension to bc01901
This PR fixes the very slow sampling process when a repetition penalty is
set.

The fix includes:
1. Enable pin_memory on HPU
2. Pad prompt tokens and output tokens to avoid recompilation
3. Replace slow ops
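A minimal sketch of the padding idea in item 2, assuming illustrative bucket sizes and helper names (not the actual code from this PR):

```python
import torch

def pad_to_bucket(token_ids: list[int], bucket: int = 128,
                  pad_id: int = 0) -> torch.Tensor:
    """Pad a sequence's token list up to the next bucket boundary so the
    penalty kernels always see a small, fixed set of shapes and HPU graphs
    are not recompiled for every new sequence length."""
    padded_len = ((len(token_ids) + bucket - 1) // bucket) * bucket
    out = torch.full((padded_len,), pad_id, dtype=torch.long)
    out[:len(token_ids)] = torch.tensor(token_ids, dtype=torch.long)
    # pin_memory on the host tensor (item 1) would additionally speed up the
    # later host-to-HPU copy; omitted here for brevity.
    return out
```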

Before the fix:
SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0,
repetition_penalty=1.06, temperature=1.0, top_p=1.0, top_k=-1,
min_p=0.0, seed=None, stop=[], stop_token_ids=[],
include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024,
min_tokens=0, logprobs=None, prompt_logprobs=None,
skip_special_tokens=True, spaces_between_special_tokens=True,
truncate_prompt_tokens=None), guided_decoding=None
Warming up...
Profiling iterations: 100%|5/5 [03:24<00:00, 40.99s/it]
Avg latency: 40.98862759781768 seconds
10% percentile latency: 11.699748958216514 seconds
25% percentile latency: 11.73845003999304 seconds
50% percentile latency: 11.801458386995364 seconds
75% percentile latency: 11.861465670051984 seconds
90% percentile latency: 99.46527566103033 seconds
99% percentile latency: 152.02756165561732 seconds

After the fix:
SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0,
repetition_penalty=1.06, temperature=1.0, top_p=1.0, top_k=-1,
min_p=0.0, seed=None, stop=[], stop_token_ids=[],
include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024,
min_tokens=0, logprobs=None, prompt_logprobs=None,
skip_special_tokens=True, spaces_between_special_tokens=True,
truncate_prompt_tokens=None), guided_decoding=None
Warming up...
Profiling iterations: 100%| 5/5 [00:57<00:00, 11.59s/it]
Avg latency: 11.58703240059549 seconds
10% percentile latency: 11.444069900200702 seconds
25% percentile latency: 11.511425047006924 seconds
50% percentile latency: 11.525146245025098 seconds
75% percentile latency: 11.556680046953261 seconds
90% percentile latency: 11.788318535778672 seconds
99% percentile latency: 11.927301629073918 seconds

Testing was done with:
https://github.com/ccrhx4/huanxing.vllm-fork/blob/slow_repetition_penalty/benchmarks/reproduce.sh
Remove assert for alibi in case of FusedSDPA.
Set vllm-hpu-extension to fb36408, which includes support for non-GQA
workloads in PipelinedPA.
Add support for regional compilation. It is turned on by default, but
can be turned off with the `VLLM_REGIONAL_COMPILATION` env variable. It
works only in torch.compile execution mode. It significantly speeds up
warmup and slightly increases throughput.
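A hedged sketch of what regional compilation means here: compile each decoder layer separately instead of the whole model. The attribute path model.model.layers, the default value, and the function name are assumptions for illustration:

```python
import os
import torch

def compile_regions(model: torch.nn.Module) -> torch.nn.Module:
    regional = os.environ.get("VLLM_REGIONAL_COMPILATION",
                              "true").lower() in ("1", "true")
    if regional:
        # Compile layer by layer: each region is a much smaller graph, so
        # warmup compiles far less code per unique input shape.
        for i, layer in enumerate(model.model.layers):
            model.model.layers[i] = torch.compile(layer, backend="hpu_backend")
        return model
    # Whole-model compilation (the previous behaviour).
    return torch.compile(model, backend="hpu_backend")
```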
Moving sin/cos buffer preparation for RoPE outside the model forward boosts
performance by getting rid of unnecessary gather and memcopy ops before RoPE.
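A minimal sketch of precomputing the RoPE sin/cos buffers once, outside forward; head_dim, max_len, base, and the function name are illustrative parameters, not the exact code from this change:

```python
import torch

def build_rope_buffers(head_dim: int, max_len: int, base: float = 10000.0,
                       device: str = "hpu",
                       dtype: torch.dtype = torch.bfloat16):
    """Build sin/cos tables once; forward() only slices them by position,
    avoiding per-step gather and memcopy ops before RoPE."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2,
                                            dtype=torch.float32) / head_dim))
    positions = torch.arange(max_len, dtype=torch.float32)
    freqs = torch.outer(positions, inv_freq)   # (max_len, head_dim // 2)
    emb = torch.cat((freqs, freqs), dim=-1)    # (max_len, head_dim)
    return (emb.sin().to(device=device, dtype=dtype),
            emb.cos().to(device=device, dtype=dtype))
```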

---------

Co-authored-by: Barak Goldberg <[email protected]>
Co-authored-by: barak goldberg <[email protected]>
Enable DeepseekV2 Lite/Chat models
@szutenberg force-pushed the dev/mszutenberg/c2cd742 branch from 1c5b02c to 51d3afa on December 4, 2024 13:56
@szutenberg closed this on Dec 4, 2024