forked from vllm-project/vllm
fix slow sampling when repetition_penalty is set. (#584)
This PR fixes slow sampling on HPU when repetition_penalty is set in the sampling parameters. It replaces a PyTorch API that is slow on HPU and mitigates the dynamic shapes in the code.

Without this PR:

SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.06, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None)
Warming up...
Profiling iterations: 100%|5/5 [03:32<00:00, 42.49s/it]
Avg latency: 42.49439047839987 seconds
10% percentile latency: 11.322476224999628 seconds
25% percentile latency: 11.32563829100036 seconds
50% percentile latency: 11.331052645000455 seconds
75% percentile latency: 11.333669468998778 seconds
90% percentile latency: 104.8302020711999 seconds
99% percentile latency: 160.92812163252054 seconds

With this PR:

Avg latency: 11.038154767800005 seconds
10% percentile latency: 10.964674918200398 seconds
25% percentile latency: 10.964709408001 seconds
50% percentile latency: 10.966433088000485 seconds
75% percentile latency: 10.967024742998547 seconds
90% percentile latency: 11.18358270219942 seconds
99% percentile latency: 11.313517477719943 seconds

Testing code: https://github.com/ccrhx4/huanxing.vllm-fork/blob/slow_repetition_penalty/benchmarks/reproduce.sh

The only difference between this PR and #442 is that I do not enable pin_memory, because support for this feature is still poor on HPU.
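The diff itself is not reproduced here, so the following is only an illustrative sketch of the general idea the commit message describes, not the code from this PR. It shows one way a repetition penalty can be applied with fixed tensor shapes: a dense boolean "seen" mask and `torch.where` are used in place of boolean-mask indexing, whose result size depends on the data. The function names `seen_token_mask` and `apply_repetition_penalty`, and the convention that padding uses the id `vocab_size`, are assumptions made for this example.

```python
import torch


def seen_token_mask(token_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Return a bool mask [num_seqs, vocab_size] of tokens that already appeared.

    Assumption for this sketch: token_ids is an int64 tensor of shape
    [num_seqs, max_len], padded with the id `vocab_size`, so padding lands
    in an extra column that is sliced away afterwards.
    """
    num_seqs = token_ids.shape[0]
    counts = torch.zeros(num_seqs, vocab_size + 1, dtype=torch.long)
    # scatter_add_ keeps the output shape fixed regardless of the token values.
    counts.scatter_add_(1, token_ids, torch.ones_like(token_ids))
    return counts[:, :vocab_size] > 0


def apply_repetition_penalty(
    logits: torch.Tensor,     # [num_seqs, vocab_size]
    seen: torch.Tensor,       # bool, [num_seqs, vocab_size]
    penalties: torch.Tensor,  # [num_seqs]
) -> torch.Tensor:
    # Use the penalty where a token was seen and 1.0 elsewhere. torch.where
    # produces a dense tensor of the same shape as logits, so no
    # data-dependent shape is introduced.
    per_token = torch.where(seen, penalties.unsqueeze(1), torch.ones_like(logits))
    # Positive logits are divided by the penalty, negative ones multiplied.
    return torch.where(logits > 0, logits / per_token, logits * per_token)


# Small usage example: vocabulary of 4 tokens, one sequence, padding id 4.
logits = torch.tensor([[2.0, -2.0, 1.0, -1.0]])
token_ids = torch.tensor([[0, 1, 4, 4]])
penalties = torch.tensor([2.0])

seen = seen_token_mask(token_ids, vocab_size=4)
out = apply_repetition_penalty(logits, seen, penalties)
print(out)  # tokens 0 and 1 are penalized, tokens 2 and 3 are unchanged
```

Keeping every intermediate tensor at the shape `[num_seqs, vocab_size]` is what avoids data-dependent shapes in this sketch. Whether this matches the exact operations changed by the PR cannot be confirmed from the commit message alone.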
Showing 3 changed files with 88 additions and 16 deletions.