Set vllm-hpu-extension to c2cd742 #588

Closed · wants to merge 678 commits

Conversation

szutenberg

joerunde and others added 30 commits November 6, 2024 11:57
Signed-off-by: Kunshang Ji <[email protected]>
Signed-off-by: yan ma <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Add multi-step scheduling scenario to Jenkins CI
…t#9506)

Signed-off-by: Flavia Beo <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Co-authored-by: Max de Bayser <[email protected]>
Req - https://jira.habana-labs.com/browse/REQ-289 => target for 1.19

TODO:
- One hardcoded reference to HPUWorker remains and needs to be removed

Next Steps:

- 1. Submit the necessary code changes to the vllm upstream branch => WIP
- 2. Support all 3 draft_model_types: mlp_speculator, medusa, and others
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: DarkLight1337 <[email protected]>
Co-authored-by: litianjian <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
nirda7 and others added 28 commits November 25, 2024 09:41
Fixes issue with multi LoRA during `profile_run`.
We are seeing a 10% performance regression in llama-based models due to
vllm-project#10239. The mark_step()
function needs to be configured differently for each model to achieve
the best performance. For some models, calling mark_step() on every decoder
step is optimal, but for other models it is better to run it every
n-th step. We are adding a counter to only register the hook for every
n-th step, which can be configured with VLLM_CONFIG_HIDDEN_LAYERS.
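A minimal sketch of the counter-based hook described above, assuming a forward-hook registration and a default of 1 (these details are illustrative, not the exact vLLM implementation):

```python
import os
import habana_frameworks.torch as htorch  # HPU bridge, assumed available

# Run mark_step() only after every n-th decoder layer; n comes from
# VLLM_CONFIG_HIDDEN_LAYERS (a default of 1 here means "every layer").
_N = int(os.environ.get("VLLM_CONFIG_HIDDEN_LAYERS", "1"))
_layer_counter = 0

def _maybe_mark_step(module, inputs, outputs):
    """Forward hook that triggers HPU graph execution every n-th layer."""
    global _layer_counter
    _layer_counter += 1
    if _layer_counter % _N == 0:
        htorch.core.mark_step()
    return outputs

# Hypothetical registration on each decoder layer:
# for layer in model.model.layers:
#     layer.register_forward_hook(_maybe_mark_step)
```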
Makes sure the model runs on Habana devices. The original code did not run
due to an error in the split_qkv code: parameter unpacking assumed there was
no batch dimension. Inference was tested with these changes, and InternLM2
works on Gaudi2 as expected.

---------

Co-authored-by: Stan Kirdey <[email protected]>
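A hedged sketch of the kind of fix described above: unpack q/k/v along the trailing dimensions so the code works with or without a leading batch dimension. The shapes, argument names, and reshaping below are illustrative assumptions, not the exact InternLM2 implementation:

```python
import torch

def split_qkv(qkv: torch.Tensor, num_kv_heads: int, q_per_kv: int, head_dim: int):
    # qkv: (..., num_kv_heads * (q_per_kv + 2) * head_dim); any leading
    # batch/sequence dimensions are preserved because only the trailing
    # dimension is viewed and split.
    qkv = qkv.view(*qkv.shape[:-1], num_kv_heads, q_per_kv + 2, head_dim)
    q, k, v = torch.split(qkv, [q_per_kv, 1, 1], dim=-2)
    q = q.reshape(*q.shape[:-3], -1)   # (..., num_q_heads * head_dim)
    k = k.reshape(*k.shape[:-3], -1)   # (..., num_kv_heads * head_dim)
    v = v.reshape(*v.shape[:-3], -1)
    return q, k, v
```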
…556)

This PR adds the configurable device argument as 'hpu' to the test
test_lora_manager_hpu.py, with respect to the changes in vllm-project#10223.
…de, fix in this PR (#523)

Noticed that Spec Decode produced incorrect results after the rebase to 0.6.4.

Identified the root causes and fixed them in this PR:
1. Incorrect return value position in batch_expansion.py
2. ContinuousPA generates faulty results in spec decode

CI added: #524

---------

Signed-off-by: Chendi.Xue <[email protected]>
Fix for the wrong flavor in CI config
Update vllm-hpu-extension to 50e10ea, which introduces PipelinedPA:
HabanaAI/vllm-hpu-extension#42
Set vllm-hpu-extension to bc01901
This PR fixes the very slow sampling process when a repetition penalty is
set.

The fix includes:
1. Enable pin_memory on HPU
2. Pad prompt tokens and output tokens to avoid recompilation
3. Replace slow ops
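A minimal sketch of the padding idea in item 2, assuming illustrative bucket sizes and helper names (not the actual code from this PR):

```python
import torch

def pad_to_bucket(token_ids: list[int], bucket: int = 128,
                  pad_id: int = 0) -> torch.Tensor:
    """Pad a sequence's token list up to the next bucket boundary so the
    penalty kernels always see a small, fixed set of shapes and HPU graphs
    are not recompiled for every new sequence length."""
    padded_len = ((len(token_ids) + bucket - 1) // bucket) * bucket
    out = torch.full((padded_len,), pad_id, dtype=torch.long)
    out[:len(token_ids)] = torch.tensor(token_ids, dtype=torch.long)
    # pin_memory on the host tensor (item 1) would additionally speed up the
    # later host-to-HPU copy; omitted here for brevity.
    return out
```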

Before the fix:
SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0,
repetition_penalty=1.06, temperature=1.0, top_p=1.0, top_k=-1,
min_p=0.0, seed=None, stop=[], stop_token_ids=[],
include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024,
min_tokens=0, logprobs=None, prompt_logprobs=None,
skip_special_tokens=True, spaces_between_special_tokens=True,
truncate_prompt_tokens=None), guided_decoding=None
Warming up...
Profiling iterations: 100%|5/5 [03:24<00:00, 40.99s/it]
Avg latency: 40.98862759781768 seconds
10% percentile latency: 11.699748958216514 seconds
25% percentile latency: 11.73845003999304 seconds
50% percentile latency: 11.801458386995364 seconds
75% percentile latency: 11.861465670051984 seconds
90% percentile latency: 99.46527566103033 seconds
99% percentile latency: 152.02756165561732 seconds

After the fix:
SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0,
repetition_penalty=1.06, temperature=1.0, top_p=1.0, top_k=-1,
min_p=0.0, seed=None, stop=[], stop_token_ids=[],
include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024,
min_tokens=0, logprobs=None, prompt_logprobs=None,
skip_special_tokens=True, spaces_between_special_tokens=True,
truncate_prompt_tokens=None), guided_decoding=None
Warming up...
Profiling iterations: 100%| 5/5 [00:57<00:00, 11.59s/it]
Avg latency: 11.58703240059549 seconds
10% percentile latency: 11.444069900200702 seconds
25% percentile latency: 11.511425047006924 seconds
50% percentile latency: 11.525146245025098 seconds
75% percentile latency: 11.556680046953261 seconds
90% percentile latency: 11.788318535778672 seconds
99% percentile latency: 11.927301629073918 seconds

Testing was done with:
https://github.com/ccrhx4/huanxing.vllm-fork/blob/slow_repetition_penalty/benchmarks/reproduce.sh
Remove assert for alibi in case of FusedSDPA.
Set vllm-hpu-extension to fb36408, which includes support for non-GQA
workloads in PipelinedPA.
Add support for regional compilation. It is turned on by default, but
can be turned off with the `VLLM_REGIONAL_COMPILATION` env variable. It
works only in torch.compile execution mode. It significantly speeds up
warmup and slightly increases throughput.
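A hedged sketch of what regional compilation means here: compile each decoder layer separately instead of the whole model. The attribute path model.model.layers, the default value, and the function name are assumptions for illustration:

```python
import os
import torch

def compile_regions(model: torch.nn.Module) -> torch.nn.Module:
    regional = os.environ.get("VLLM_REGIONAL_COMPILATION",
                              "true").lower() in ("1", "true")
    if regional:
        # Compile layer by layer: each region is a much smaller graph, so
        # warmup compiles far less code per unique input shape.
        for i, layer in enumerate(model.model.layers):
            model.model.layers[i] = torch.compile(layer, backend="hpu_backend")
        return model
    # Whole-model compilation (the previous behaviour).
    return torch.compile(model, backend="hpu_backend")
```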
Moving sin/cos buffer preparation for RoPE outside the model forward boosts
performance by getting rid of unnecessary gather and memcopy ops before RoPE.
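A minimal sketch of precomputing the RoPE sin/cos buffers once, outside forward; head_dim, max_len, base, and the function name are illustrative parameters, not the exact code from this change:

```python
import torch

def build_rope_buffers(head_dim: int, max_len: int, base: float = 10000.0,
                       device: str = "hpu",
                       dtype: torch.dtype = torch.bfloat16):
    """Build sin/cos tables once; forward() only slices them by position,
    avoiding per-step gather and memcopy ops before RoPE."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2,
                                            dtype=torch.float32) / head_dim))
    positions = torch.arange(max_len, dtype=torch.float32)
    freqs = torch.outer(positions, inv_freq)   # (max_len, head_dim // 2)
    emb = torch.cat((freqs, freqs), dim=-1)    # (max_len, head_dim)
    return (emb.sin().to(device=device, dtype=dtype),
            emb.cos().to(device=device, dtype=dtype))
```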

---------

Co-authored-by: Barak Goldberg <[email protected]>
Co-authored-by: barak goldberg <[email protected]>
Enable DeepseekV2 Lite/Chat models
@szutenberg force-pushed the dev/mszutenberg/c2cd742 branch from 1c5b02c to 51d3afa on December 4, 2024 13:56
@szutenberg closed this on Dec 4, 2024