[DO NOT MERGE] Upstream codebase diff #470

Draft · wants to merge 1,007 commits into base: main

Changes from all commits (1,007 commits)

25d806e
[misc] add torch.compile compatibility check (#10618)
youkaichao Nov 25, 2024
39c6b6c
Limit decode block size (#532)
mfylcek Nov 25, 2024
5eb8b1f
fix marlin flag set on hpu (#540)
nirda7 Nov 25, 2024
05d1f8c
[misc] move functions to config.py (#10624)
youkaichao Nov 25, 2024
ed46f14
[Model] Support `is_causal` HF config field for Qwen2 model (#10621)
DarkLight1337 Nov 25, 2024
2b0879b
Super tiny little typo fix (#10633)
fzyzcjy Nov 25, 2024
d04b13a
[Bug]: Authorization ignored when root_path is set (#10606)
chaunceyjiang Nov 25, 2024
c27df94
[Bugfix] Fix chunked prefill with model dtype float32 on Turing Devic…
wallashss Nov 25, 2024
452a4e8
[Docs] Add Snowflake Slides (#10641)
simon-mo Nov 25, 2024
b1d9205
[Model]: Add support for Aria model (#10514)
xffxff Nov 25, 2024
cf73f0c
[Model] Enable optional prefix when loading embedding models (#10639)
DarkLight1337 Nov 25, 2024
1b583cf
[Doc] Fix typos in docs (#10636)
DarkLight1337 Nov 25, 2024
9db713a
[Model] Add OLMo November 2024 model (#10503)
2015aroras Nov 25, 2024
6e9ff05
[misc] do not read HOST_IP (#10644)
youkaichao Nov 26, 2024
45ac4ff
[bugfix] fix aria model and add torch.compile (#10645)
youkaichao Nov 26, 2024
a6760f6
[Feature] vLLM ARM Enablement for AARCH64 CPUs (#9228)
sanketkaleoss Nov 26, 2024
519e8e4
[v1] EngineArgs for better config handling for v1 (#10382)
rickyyx Nov 26, 2024
9a88f89
custom allreduce + torch.compile (#10121)
SageMoore Nov 26, 2024
9406353
[Misc] Remove outdated init protocols (#10655)
DarkLight1337 Nov 26, 2024
334d64d
[ci] add vllm_test_utils (#10659)
youkaichao Nov 26, 2024
0f513bd
Fix profile run for multi LoRA (#549)
kdamaszk Nov 26, 2024
7133502
fix cutlass_fp8_supported flag set on hpu
nirda7 Nov 26, 2024
38c2d10
Fix cutlass_fp8_supported flag set on HPU (#550)
nirda7 Nov 26, 2024
b62f1b2
[HPU] Add mark_step configurable for the decoder layer. (#525)
jiminha Nov 26, 2024
633df59
Update cpu-test.yml (#544)
michalkuligowski Nov 26, 2024
4d8185f
Update *.sh (#545)
michalkuligowski Nov 26, 2024
1f6584e
[V1] Enable profile for LLMEngine (#10665)
jikunshang Nov 26, 2024
3f0b0e4
Update run-lm-eval-gsm-vllm-baseline.sh (#552)
michalkuligowski Nov 26, 2024
b099337
Add HPU information to collect_env script (#430)
michalkuligowski Nov 26, 2024
b7d75b8
Intern2 habana (#489)
skirdey-inflection Nov 26, 2024
db66e01
[Bugfix] Fix for Spec model TP + Chunked Prefill (#10232)
andoorve Nov 26, 2024
677741e
Added hpu as device argument
rsshaik1 Nov 26, 2024
f5792c7
[Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson (#9735)
conroy-cheers Nov 26, 2024
9a99273
[Bugfix] Fix using `-O[0,3]` with LLM entrypoint (#10677)
mgoin Nov 26, 2024
7576cd3
[Bugfix] Check bnb_4bit_quant_storage for bitsandbytes (#10642)
mgoin Nov 26, 2024
2f0a0a1
[V1] Refactor model executable interface for multimodal models (#10570)
ywang96 Nov 26, 2024
0a71900
Remove hard-dependencies of Speculative decode to CUDA workers (#10587)
xuechendi Nov 27, 2024
0a4d968
[V1] Update interface for idefics3 (#10680)
ywang96 Nov 27, 2024
1bf905d
[Bugfix][SpecDecode] apply sampling parameters to target probabilitie…
jeongin601 Nov 27, 2024
cfb3bf2
[bugfix] fix the default value of llm_int8_threshold in BitsAndBytesC…
yansh97 Nov 27, 2024
0c62b0b
Added "hpu" as configurable device argument in test_lora_manager_hpu …
vivekgoe Nov 27, 2024
e85250b
[Hardware][Gaudi]add get_name method for HPUAttentionBackend (#10667)
jikunshang Nov 27, 2024
15cc2a9
[Misc]Further reduce BNB static variable (#10597)
jeejeelee Nov 27, 2024
e225110
[Kernel] Remove if-else with identical branches in marlin 2:4 (#10687)
tlrmchlsmth Nov 27, 2024
1209261
[Model] Support telechat2 (#10311)
shunxing12345 Nov 27, 2024
418cb3b
[Bugfix][Hardware][CPU] Fix intel-omp version to avoid segfault (#10700)
bigPYJ1151 Nov 27, 2024
9e0a147
[V1] Update interface for mistral-format Pixtral (#10703)
ywang96 Nov 27, 2024
308cc5e
[ci] fix slow tests (#10698)
youkaichao Nov 27, 2024
c411def
[torch.compile] fix shape specialization (#10722)
youkaichao Nov 27, 2024
b98c62b
[Bugfix] Fix GGUF inference with FP16 unquantized checkpoint (#10675)
Isotr0py Nov 27, 2024
197b448
[Bugfix][Mamba] Fix Multistep on Mamba-like models (#10705)
mzusman Nov 27, 2024
9b4b150
[Bugfix] Ignore `lm_head` when loading embedding models (#10719)
DarkLight1337 Nov 27, 2024
395b1c7
[Frontend] don't block event loop in tokenization (preprocess) in Ope…
tomeras91 Nov 27, 2024
cb4e1c3
[misc] upgrade filelock version (#10731)
youkaichao Nov 28, 2024
70dc14f
[Model] support bitsandbytes quantization with minicpm3 model (#10682)
zixuanzhang226 Nov 28, 2024
278be67
[Doc] Update model in arch_overview.rst to match comment (#10701)
spacewander Nov 28, 2024
d9b4b3f
[Bug][CLI] Allow users to disable prefix caching explicitly (#10724)
rickyyx Nov 28, 2024
a79b122
[V1] Do not allocate beyond the max_model_len (#10730)
WoosukKwon Nov 28, 2024
756485f
[BUG FIX] [SPEC DECODE] 0.6.4 rebase cause incorrectness in spec deco…
xuechendi Nov 28, 2024
d83b62f
CI fix (#563)
tzielinski-habana Nov 28, 2024
9a8bff0
[Kernel] Update vllm-flash-attn version (#10736)
WoosukKwon Nov 28, 2024
3ed5e73
[TPU] Update requirements-tpu (#10726)
richardsliu Nov 28, 2024
5fc5ce0
[Model] Added GLM-4 series hf format model support vllm==0.6.4 (#10561)
sixsixcoder Nov 28, 2024
637bb57
Set vllm-hpu-extension to 50e10ea (#565)
mswiniarsk Nov 28, 2024
8c1e77f
[Kernel] Update vllm-flash-attn version to reduce CPU overheads (#10742)
WoosukKwon Nov 28, 2024
98f47f2
[V1] Optimize the CPU overheads in FlashAttention custom op (#10733)
WoosukKwon Nov 28, 2024
c83919c
[Model] Add Internlm2 LoRA support (#5064)
Isotr0py Nov 28, 2024
fa6ecb9
[Model] Clean up MiniCPMV (#10751)
DarkLight1337 Nov 29, 2024
c82b432
[Misc] typo find in sampling_metadata.py (#10740)
noooop Nov 29, 2024
cff5c7f
Refactor FP8 Inc config and flow (#564)
nirda7 Nov 29, 2024
f295f07
Set vllm-hpu-extension to bc01901
iboiko-habana Nov 29, 2024
2aeea0b
Set vllm-hpu-extension to bc01901 (#567)
iboiko-habana Nov 29, 2024
3132aac
[Bugfix] Fix Idefics3 bug (#10778)
jeejeelee Nov 29, 2024
cef2df0
to make repetition penalty faster (#442)
ccrhx4 Nov 29, 2024
661175b
[platform] Add verify_quantization in platform. (#10757)
wangxiyuan Nov 29, 2024
49c9efa
Enable alibi fusedsdpa (#561)
itaraban Nov 29, 2024
40bc242
[Bugfix] Fix OpenVino/Neuron `driver_worker` init (#10779)
NickLucche Nov 30, 2024
16ee07f
[Model] Refactor Molmo weights loading to use AutoWeightsLoader (#10771)
Isotr0py Nov 30, 2024
e7cfc4e
[Interleaved ATTN] Support for Mistral-8B (#10591)
patrickvonplaten Nov 30, 2024
7e4bbda
[doc] format fix (#10789)
wangxiyuan Nov 30, 2024
1337071
[Model] Replace embedding models with pooling adapter (#10769)
DarkLight1337 Dec 1, 2024
f877a7d
[Misc] Improve type annotations for `support_torch_compile` (#10763)
DarkLight1337 Dec 1, 2024
d2f058e
[Misc] Rename embedding classes to pooling (#10801)
DarkLight1337 Dec 1, 2024
169a0ff
[doc] add warning about comparing hf and vllm outputs (#10805)
youkaichao Dec 1, 2024
c11f172
[Misc] Adding `MMMU-Pro` vision dataset to serving benchmark (#10804)
ywang96 Dec 1, 2024
0590ec3
[Core] Implement disagg prefill by StatelessProcessGroup (#10502)
KuntaiDu Dec 2, 2024
b18c9bb
[Model] Add BNB support to Llava and Pixtral-HF (#10795)
Isotr0py Dec 2, 2024
b795477
[core] Avoid metrics log noise when idle - include speculative decodi…
cduk Dec 2, 2024
073a4bd
[Kernel] Use `out` arg in flash_attn_varlen_func (#10811)
WoosukKwon Dec 2, 2024
e25810a
Fill TorchSDPAAttentionMetadata seq_lens_field for prefill (#10799)
maxdebayser Dec 2, 2024
63a1641
[misc] remove xverse modeling file (#10814)
youkaichao Dec 2, 2024
995a148
[doc]Update config docstring (#10732)
wangxiyuan Dec 2, 2024
ef31eab
[Model]: add some tests for aria model (#10770)
xffxff Dec 2, 2024
e95f275
[CI/Build] Update `mistral_common` version for tests and docs (#10825)
DarkLight1337 Dec 2, 2024
a4c4daf
[misc] use out argument for flash attention (#10822)
youkaichao Dec 2, 2024
56da9fc
Merge remote-tracking branch 'upstream/main' into HEAD
kzawora-intel Dec 2, 2024
e438503
fix syntax error
kzawora-intel Dec 2, 2024
4b502a6
Set vllm-hpu-extension to fb36408 (#572)
mswiniarsk Dec 2, 2024
b45f0d7
[Misc][LoRA] Move the implementation of lora bias to punica.py (#10829)
jeejeelee Dec 2, 2024
519cc6c
[Misc][XPU] Avoid torch compile for XPU platform (#10747)
yma11 Dec 2, 2024
9b14d97
Fix openvino on GPU (#10793)
janimo Dec 2, 2024
4c05edb
[Model] Add TP and BNB quantization support to LlavaMultiModalProject…
Isotr0py Dec 2, 2024
4433195
[Bugfix] Prevent benchmark_throughput.py from using duplicated random…
mgoin Dec 3, 2024
d746268
[Model] support bitsandbytes quantization with minicpm model (#10842)
zixuanzhang226 Dec 3, 2024
a4cf256
[Bugfix] Fix QKVParallelLinearWithShardedLora bias bug (#10844)
jeejeelee Dec 3, 2024
21fe7b4
[core][distributed] add pynccl broadcast (#10843)
youkaichao Dec 3, 2024
dc5ce86
[torch.compile] remove compilation_context and simplify code (#10838)
youkaichao Dec 3, 2024
3cb5420
Set vllm-hpu-extension to cd520df (#574)
mswiniarsk Dec 3, 2024
ef51831
[Doc] Add github links for source code references (#10672)
russellb Dec 3, 2024
3257d44
[Misc] Remove deprecated names (#10817)
DarkLight1337 Dec 3, 2024
9323a31
[Core][Performance] Add XGrammar support for guided decoding and set …
aarnphm Dec 3, 2024
1440f45
Revert "to make repetition penalty faster" (#570)
michalkuligowski Dec 3, 2024
f6084f6
[Speculative Decoding] Move indices to device before filtering output…
zhengy001 Dec 3, 2024
3bc94ca
[V1] VLM - Run the mm_mapper preprocessor in the frontend process (#1…
alexm-neuralmagic Dec 3, 2024
2f2cdc7
[MISC][XPU] quick fix for XPU CI (#10859)
yma11 Dec 3, 2024
7090c27
[Bugfix] Only require XGrammar on x86 (#10865)
mgoin Dec 3, 2024
7c32b68
[Frontend] correctly record prefill and decode time metrics (#10853)
tomeras91 Dec 3, 2024
a061fe6
[Build][Bugfix] Using the correct type hint (#10866)
gshtras Dec 3, 2024
381ac93
[Benchmark] Benchmark structured output with datasets (#10557)
xuechendi Dec 4, 2024
d2bd88b
[CI/Build] Replace mean with torch.all in test_pynccl.py (#10876)
tlrmchlsmth Dec 4, 2024
b5b647b
Drop ROCm load format check (#10767)
wangxiyuan Dec 4, 2024
fa2dea6
[ci/build] Change queue name for Release jobs (#10875)
khluu Dec 4, 2024
c9ca4fc
[ci/build] Job to build and push release image (#10877)
khluu Dec 4, 2024
b9d6f69
Regional compilation support (#576)
Kacper-Pietkun Dec 4, 2024
8db957e
[bugfix] fixed parameter “n” when set parameter “bestof” > 1 (#10854)
o2363286 Dec 4, 2024
c92acb9
[ci/build] Update vLLM postmerge ECR repo (#10887)
khluu Dec 4, 2024
4796d16
Revert "Enable alibi fusedsdpa" (#585)
madamczykhabana Dec 4, 2024
8c76728
Prepare sin/cos buffers for rope outside model forward (#566)
tzielinski-habana Dec 4, 2024
f6865f4
Enable DeepseekV2 Lite/Chat models (#516)
hlin99 Dec 4, 2024
8754e17
Set vllm-hpu-extension to 070591a (#591)
mswiniarsk Dec 4, 2024
01d079f
[LoRA] Change lora_tokenizers capacity (#10796)
xyang16 Dec 4, 2024
10398b4
[Model] Consolidate ViTs attention implementation without mask (#10893)
Isotr0py Dec 4, 2024
82eb5ea
Benchmark serving structured output (#10880)
xuechendi Dec 4, 2024
e4c34c2
[CI/Build] improve python-only dev setup (#9621)
dtrifiro Dec 4, 2024
2a56e12
[V1] Fix when max_model_len is not divisible by block_size (#10903)
WoosukKwon Dec 5, 2024
7883c2b
[benchmark] Make H100 benchmark optional (#10908)
khluu Dec 5, 2024
8d370e9
[Bugfix] Fallback to outlines for complex json schemas (#10899)
mgoin Dec 5, 2024
aa39a8e
[Doc] Create a new "Usage" section (#10827)
DarkLight1337 Dec 5, 2024
1f958a7
[Bugfix] Fix BNB loader target_modules (#10720)
jeejeelee Dec 5, 2024
39c89e7
[Misc] Update llama 3.2 template to support system prompt with images…
tjohnson31415 Dec 5, 2024
ad29332
[CI/BUILD] Spec decode ci (#524)
xuechendi Dec 5, 2024
571da8f
[Misc][LoRA] Clean up the function interface of Punica (#10917)
jeejeelee Dec 5, 2024
998eeaf
[CI/Build] Bump test transformers version (#10106)
Isotr0py Dec 5, 2024
a430652
[Misc][Gaudi] Avoid torch.compile and enable lazy collectives (#10897)
kzawora-intel Dec 5, 2024
9743d64
[ci][build] add tests for python only compilation (#10915)
youkaichao Dec 5, 2024
db87eb6
[torch.compile] use size tuning for specific sizes (#10933)
youkaichao Dec 6, 2024
b031a45
[torch.compile] add logging for compilation time (#10941)
youkaichao Dec 6, 2024
a805205
Add host traces to high-level profilings (#577)
szutenberg Dec 6, 2024
222f5b0
[CI/Build] Fix broken multimodal test (#10950)
DarkLight1337 Dec 6, 2024
a1887f2
[torch.compile] fix deprecated code (#10948)
youkaichao Dec 6, 2024
e349f70
Enable patching Fused SDPA (#569)
nirda7 Dec 6, 2024
6a4f673
revert INC fixed version installation in requirements-hpu.txt for 1.1…
xuechendi Dec 6, 2024
e0e47ed
Add multiprocessing HPU executor (#559)
kzawora-intel Dec 6, 2024
858e0a0
fix WorkerWrapperBase and spec_decode rebase (#582)
xuechendi Dec 6, 2024
21323ed
Merge remote-tracking branch 'origin/habana_main' into HEAD
kzawora-intel Dec 6, 2024
d8f395e
Merge remote-tracking branch 'upstream/main' into HEAD
kzawora-intel Dec 6, 2024
8b59631
[Core] Support Lark grammars for XGrammar (#10870)
mgoin Dec 6, 2024
48ab12b
fix mypy errors
kzawora-intel Dec 6, 2024
9204975
fix (hopefully) all linter errors
kzawora-intel Dec 6, 2024
7406274
[Doc] add KubeAI to serving integrations (#10837)
samos123 Dec 6, 2024
c05cfb6
[misc] fix typo (#10960)
youkaichao Dec 6, 2024
dcdc3fa
[ci] fix broken tests (#10956)
youkaichao Dec 6, 2024
69d357b
[Core] Cleanup startup logging a bit (#10961)
russellb Dec 7, 2024
acf092d
[Bugfix] Fix test-pipeline.yaml (#10973)
jeejeelee Dec 7, 2024
955fa95
[3/N] Support and implement merged input processor for LLaVA model (#…
DarkLight1337 Dec 7, 2024
f13cf9a
[Build] Fix for the Wswitch-bool clang warning (#10060)
gshtras Dec 7, 2024
b26b4cd
[Misc][LoRA] Refactor and clean MergedQKVParallelLinearWithLora imple…
Isotr0py Dec 7, 2024
bf0e382
[Model] Composite weight loading for multimodal Qwen2 (#10944)
DarkLight1337 Dec 7, 2024
1c768fe
[Doc] Explicitly state that InternVL 2.5 is supported (#10978)
DarkLight1337 Dec 7, 2024
39e227c
[Model] Update multi-modal processor to support Mantis(LLaVA) model (…
DarkLight1337 Dec 7, 2024
c889d58
[Doc] Explicitly state that PP isn't compatible with speculative deco…
DarkLight1337 Dec 7, 2024
78029b3
[BugFix][Kernel]: fix illegal memory access in causal_conv1d when con…
xffxff Dec 7, 2024
1b62745
[core][executor] simplify instance id (#10976)
youkaichao Dec 7, 2024
7be15d9
[core][misc] remove use_dummy driver for _run_workers (#10920)
youkaichao Dec 7, 2024
fd57d2b
[torch.compile] allow candidate compile sizes (#10984)
youkaichao Dec 8, 2024
a11f326
[V1] Initial support of multimodal models for V1 re-arch (#10699)
ywang96 Dec 8, 2024
43b05fa
[torch.compile][misc] fix comments (#10993)
youkaichao Dec 8, 2024
46004e8
[misc] clean up and unify logging (#10999)
youkaichao Dec 9, 2024
af7c4a9
[Doc][V1] Add V1 support column for multimodal models (#10998)
ywang96 Dec 9, 2024
d1c2e15
[torch.compile] add dynamo time tracking (#11005)
youkaichao Dec 9, 2024
ad8d5b7
Dec 06 rebase (#571)
kzawora-intel Dec 9, 2024
db68690
fix hpu destructors flow and remove finish_measurements (#379)
nirda7 Dec 9, 2024
c690357
[V1] Fix Detokenizer loading in `AsyncLLM` (#10997)
ywang96 Dec 9, 2024
e691b26
[Core] Require xgrammar >= 0.1.6 (#11021)
russellb Dec 9, 2024
aea2fc3
[Platform] Move `async output` check to platform (#10768)
wangxiyuan Dec 9, 2024
25b79d9
[V1] Input Batch Relocation (#10962)
varun-sundar-rabindranath Dec 9, 2024
edc4fa3
[ci/build] Recompile CI dependencies list with Python 3.12 (#11013)
khluu Dec 9, 2024
3b61cb4
[V1] Further reduce CPU overheads in flash-attn (#10989)
WoosukKwon Dec 9, 2024
ca87149
[Misc][LoRA] Abstract PunicaWrapper (#10955)
jeejeelee Dec 9, 2024
a811dd6
[Model] merged input processor for Phi-3-Vision models (#10977)
Isotr0py Dec 9, 2024
cbcbdb1
[Bugfix][Hardware][Gaudi] Bump vllm_hpu_extension version (#11028)
kzawora-intel Dec 9, 2024
1a2f8fb
[v1] fix use compile sizes (#11000)
youkaichao Dec 9, 2024
9c6459e
[Neuron] Upgrade neuron to 2.20.2 (#11016)
xendo Dec 9, 2024
b63ba84
[ROCm][bugfix] scpecilative decoding worker class (#11035)
gshtras Dec 9, 2024
5ed5d5f
Build tpu image in release pipeline (#10936)
richardsliu Dec 9, 2024
6faec54
[V1] Do not store `None` in self.generators (#11038)
WoosukKwon Dec 9, 2024
6d52528
[Docs] Add dedicated tool calling page to docs (#10554)
mgoin Dec 10, 2024
d1f6d1c
[Model] Add has_weight to RMSNorm and re-enable weights loading track…
Isotr0py Dec 10, 2024
0cce63a
Set vllm-hpu-extension to 4312768
SanjuCSudhakaran Dec 10, 2024
3473bc1
Set vllm-hpu-extension to 4312768 (#604)
vivekgoe Dec 10, 2024
391d7b2
[Bugfix] Fix usage of `deprecated` decorator (#11025)
DarkLight1337 Dec 10, 2024
980ad39
[Frontend] Use request id from header (#10968)
joerunde Dec 10, 2024
bc192a2
[Pixtral] Improve loading (#11040)
patrickvonplaten Dec 10, 2024
28b3a1c
[V1] Multiprocessing Tensor Parallel Support for v1 (#9856)
tlrmchlsmth Dec 10, 2024
ebf7780
monitor metrics of tokens per step using cudagraph batchsizes (#11031)
youkaichao Dec 10, 2024
e35879c
[Bugfix] Fix xgrammar failing to read a vocab_size from LlavaConfig o…
sjuxax Dec 10, 2024
bfd6104
Update README.md (#11034)
dmoliveira Dec 10, 2024
82c73fd
[Bugfix] cuda error running llama 3.2 (#11047)
GeneDer Dec 10, 2024
fe2e10c
Add example of helm chart for vllm deployment on k8s (#9199)
mfournioux Dec 10, 2024
239739c
Support mllama (llama 3.2) model for HPU (#491)
yisonzhu Dec 10, 2024
2126fd2
Merge remote-tracking branch 'upstream/main' into HEAD
kzawora-intel Dec 10, 2024
89266bc
Merge remote-tracking branch 'origin/habana_main' into private/kzawor…
kzawora-intel Dec 10, 2024
5a166da
Update ray_hpu_executor.py
michalkuligowski Dec 10, 2024
0ad9b59
Enable padding aware scheduling by default on HPU (#606)
kzawora-intel Dec 10, 2024
17e6be7
Update CODEOWNERS
kzawora-intel Dec 10, 2024
15774c4
Update CODEOWNERS (#608)
kzawora-intel Dec 10, 2024
def7ac2
Fix TP>1 in encoder-decoder models (#607)
jkaniecki Dec 10, 2024
b8fff21
Add PunicaWrapperHPU to handle LoRA computations
SanjuCSudhakaran Dec 11, 2024
381453c
Align LoRA handling in HPU with PunicaWrapper class (#614)
kzawora-intel Dec 11, 2024
a9fde5f
Dec 10 rebase (#605)
michalkuligowski Dec 11, 2024
641367b
Revert "Dec 10 rebase"
michalkuligowski Dec 11, 2024
55f99ea
Revert "Dec 10 rebase" (#618)
kzawora-intel Dec 11, 2024
ad10b73
Revert "Revert "Dec 10 rebase""
kzawora-intel Dec 11, 2024
df7dd05
Revert "Revert "Dec 10 rebase"" (#619)
kzawora-intel Dec 11, 2024
07dbd34
fix graceful shutdown
kzawora-intel Dec 10, 2024
d312c92
Fix multiprocessing executor shutdown (#621)
michalkuligowski Dec 11, 2024
7ef6b2c
Update GitHub Actions targets (#622)
kzawora-intel Dec 11, 2024
449a89d
Add padding to encoder_seq_lens (#610)
kdamaszk Dec 12, 2024
d2128b4
Remove workaround for one_hot in eager/compile (#632)
anko-intel Dec 16, 2024
11c07e3
Add shutdown_inc method to MultiprocessingHPUExecutor (#634)
nirda7 Dec 16, 2024
ba1d24b
Fix recompilations due to different batch_sizes in MSS (#637)
mfylcek Dec 16, 2024
c9a740f
Fix CI reports (#636)
afierka-intel Dec 16, 2024
da61ecf
Unit scales in FP8 CI scenarios (#633)
afierka-intel Dec 16, 2024
d81f829
TC llama recompile fix - no_grad to inference_mode (#640)
RafLit Dec 18, 2024
88ef381
Generic call for prepare_cos_sin in rotary embedding (#638)
tzielinski-habana Dec 18, 2024
9555fef
Update CODEOWNERS (#649)
vivekgoe Dec 19, 2024
2443ba9
Fix long contexts in LoRA (#624)
SanjuCSudhakaran Jan 2, 2025
2012336
Lora manager tests fix (#652)
rsshaik1 Jan 2, 2025
5b5bf26
Fix LoRA tests (#664)
SanjuCSudhakaran Jan 2, 2025
2d24be7
[BUG fix] Rebase caused spec decode fix (#613)
xuechendi Jan 7, 2025
27a22ab
fix slow sampling when repetition_penalty is set. (#584)
ccrhx4 Jan 7, 2025
9d6917f
Optimize for topk=1 case if we do not handle duplicates (#603)
ssarkar2 Jan 7, 2025
5d582b5
[bugfix] fix RuntimeError on apc (#648)
kkimmk Jan 7, 2025
585ca9a
Add llava support to benchmark_throuhput (#665)
adobrzyniewicz-habana Jan 8, 2025
8f53dee
Add mllama support to benchmark_throughput (#668)
kdamaszk Jan 8, 2025
49a11e2
Add mark_step for encoder layers (#669)
yma11 Jan 8, 2025
cccf363
Use FusedSDPA for MllamaVisionSdpaAttention (#620)
kdamaszk Jan 8, 2025
fa9dbf2
Limit number of dummy cross attention blocks (#667)
kdamaszk Jan 8, 2025
73aaf71
[SW-197036] - use torch._scaled_mm with hpu (#660)
nirda7 Jan 9, 2025
c5975f8
Handle LoRA specific changes in MSS (#675)
SanjuCSudhakaran Jan 11, 2025
6 changes: 3 additions & 3 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -41,6 +41,6 @@ while getopts "m:b:l:f:" OPT; do
done

lm_eval --model hf \
--model_args pretrained=$MODEL,parallelize=True \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
--batch_size $BATCH_SIZE
--model_args "pretrained=$MODEL,parallelize=True" \
--tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size "$BATCH_SIZE"
6 changes: 3 additions & 3 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh
@@ -46,6 +46,6 @@ while getopts "m:b:l:f:t:" OPT; do
done

lm_eval --model vllm \
--model_args pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend="ray",trust_remote_code=true,max_model_len=4096 \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
--batch_size $BATCH_SIZE
--model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096" \
--tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size "$BATCH_SIZE"
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/run-tests.sh
@@ -30,7 +30,7 @@ while getopts "c:t:" OPT; do
done

# Parse list of configs.
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < $CONFIG
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < "$CONFIG"

for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"
do
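
For context (not part of this PR's diff): the IFS=$'\n' read -d '' -r -a idiom above loads one entry per line of the config file into a bash array, and quoting "$CONFIG" keeps a path containing spaces intact. A small standalone sketch with a hypothetical config list:

#!/usr/bin/env bash
# Hypothetical config list, one model config per line.
CONFIG=$(mktemp)
printf '%s\n' "Meta-Llama-3-8B-Instruct.yaml" "Mixtral-8x7B-Instruct-v0.1.yaml" > "$CONFIG"

# Read the whole file, splitting on newlines only.
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < "$CONFIG"

for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"; do
    echo "would run lm-eval for: $MODEL_CONFIG"
done
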
65 changes: 48 additions & 17 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -9,16 +9,19 @@ steps:
- image: badouralix/curl-jq
command:
- sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh

- wait

- label: "A100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: A100
plugins:
- kubernetes:
podSpec:
priorityClassName: perf-benchmark
containers:
- image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
- image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
resources:
@@ -41,20 +44,48 @@ steps:
- name: devshm
emptyDir:
medium: Memory
# - label: "H100"
# agents:
# queue: H100
# plugins:
# - docker#v5.11.0:
# image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
# command:
# - bash
# - .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
# mount-buildkite-agent: true
# propagate-environment: true
# ipc: host
# gpus: all
# environment:
# - VLLM_USAGE_SOURCE
# - HF_TOKEN

- label: "H200"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H200
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: 4,5,6,7
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN

- block: "Run H100 Benchmark"
key: block-h100
depends_on: ~

- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H100
depends_on: block-h100
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: all # see CUDA_VISIBLE_DEVICES for actual GPUs used
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN
@@ -157,6 +157,18 @@ def results_to_json(latency, throughput, serving):
throughput_results,
serving_results)

for df in [latency_results, serving_results, throughput_results]:
if df.empty:
continue

# Sort all dataframes by their respective "Test name" columns
df.sort_values(by="Test name", inplace=True)

# The GPUs sometimes come in format of "GPUTYPE\nGPUTYPE\n...",
# we want to turn it into "8xGPUTYPE"
df["GPU"] = df["GPU"].apply(
lambda x: f"{len(x.split('\n'))}x{x.split('\n')[0]}")

# get markdown tables
latency_md_table = tabulate(latency_results,
headers='keys',
63 changes: 25 additions & 38 deletions .buildkite/nightly-benchmarks/scripts/launch-server.sh
@@ -50,58 +50,54 @@ launch_trt_server() {
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
git lfs install
cd tensorrtllm_backend
git checkout $trt_llm_version
tensorrtllm_backend_dir=$(pwd)
git checkout "$trt_llm_version"
git submodule update --init --recursive

# build trtllm engine
cd /tensorrtllm_backend
cd ./tensorrt_llm/examples/${model_type}
cd "./tensorrt_llm/examples/${model_type}"
python3 convert_checkpoint.py \
--model_dir ${model_path} \
--dtype ${model_dtype} \
--tp_size ${model_tp_size} \
--output_dir ${trt_model_path}
--model_dir "${model_path}" \
--dtype "${model_dtype}" \
--tp_size "${model_tp_size}" \
--output_dir "${trt_model_path}"
trtllm-build \
--checkpoint_dir ${trt_model_path} \
--checkpoint_dir "${trt_model_path}" \
--use_fused_mlp \
--reduce_fusion disable \
--workers 8 \
--gpt_attention_plugin ${model_dtype} \
--gemm_plugin ${model_dtype} \
--tp_size ${model_tp_size} \
--max_batch_size ${max_batch_size} \
--max_input_len ${max_input_len} \
--max_seq_len ${max_seq_len} \
--max_num_tokens ${max_num_tokens} \
--output_dir ${trt_engine_path}
--gpt_attention_plugin "${model_dtype}" \
--gemm_plugin "${model_dtype}" \
--tp_size "${model_tp_size}" \
--max_batch_size "${max_batch_size}" \
--max_input_len "${max_input_len}" \
--max_seq_len "${max_seq_len}" \
--max_num_tokens "${max_num_tokens}" \
--output_dir "${trt_engine_path}"

# handle triton protobuf files and launch triton server
cd /tensorrtllm_backend
mkdir triton_model_repo
cp -r all_models/inflight_batcher_llm/* triton_model_repo/
cd triton_model_repo
rm -rf ./tensorrt_llm/1/*
cp -r ${trt_engine_path}/* ./tensorrt_llm/1
cp -r "${trt_engine_path}"/* ./tensorrt_llm/1
python3 ../tools/fill_template.py -i tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,engine_dir:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1,decoupled_mode:true,batching_strategy:inflight_fused_batching,batch_scheduler_policy:guaranteed_no_evict,exclude_input_in_output:true,triton_max_batch_size:2048,max_queue_delay_microseconds:0,max_beam_width:1,max_queue_size:2048,enable_kv_cache_reuse:false
python3 ../tools/fill_template.py -i preprocessing/config.pbtxt triton_max_batch_size:2048,tokenizer_dir:$model_path,preprocessing_instance_count:5
python3 ../tools/fill_template.py -i postprocessing/config.pbtxt triton_max_batch_size:2048,tokenizer_dir:$model_path,postprocessing_instance_count:5,skip_special_tokens:false
python3 ../tools/fill_template.py -i ensemble/config.pbtxt triton_max_batch_size:$max_batch_size
python3 ../tools/fill_template.py -i tensorrt_llm_bls/config.pbtxt triton_max_batch_size:$max_batch_size,decoupled_mode:true,accumulate_tokens:"False",bls_instance_count:1
python3 ../tools/fill_template.py -i preprocessing/config.pbtxt "triton_max_batch_size:2048,tokenizer_dir:$model_path,preprocessing_instance_count:5"
python3 ../tools/fill_template.py -i postprocessing/config.pbtxt "triton_max_batch_size:2048,tokenizer_dir:$model_path,postprocessing_instance_count:5,skip_special_tokens:false"
python3 ../tools/fill_template.py -i ensemble/config.pbtxt triton_max_batch_size:"$max_batch_size"
python3 ../tools/fill_template.py -i tensorrt_llm_bls/config.pbtxt "triton_max_batch_size:$max_batch_size,decoupled_mode:true,accumulate_tokens:False,bls_instance_count:1"
cd /tensorrtllm_backend
python3 scripts/launch_triton_server.py \
--world_size=${model_tp_size} \
--world_size="${model_tp_size}" \
--model_repo=/tensorrtllm_backend/triton_model_repo &

}

launch_tgi_server() {
model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
server_args=$(json2args "$server_params")

if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
@@ -129,10 +125,7 @@ launch_tgi_server() {
launch_lmdeploy_server() {
model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
server_args=$(json2args "$server_params")

server_command="lmdeploy serve api_server $model \
@@ -149,10 +142,7 @@ launch_sglang_server() {

model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
server_args=$(json2args "$server_params")

if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
@@ -185,10 +175,7 @@ launch_vllm_server() {

model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
server_args=$(json2args "$server_params")

if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
@@ -217,19 +204,19 @@

main() {

if [[ $CURRENT_LLM_SERVING_ENGINE == "trt" ]]; then
if [[ "$CURRENT_LLM_SERVING_ENGINE" == "trt" ]]; then
launch_trt_server
fi

if [[ $CURRENT_LLM_SERVING_ENGINE == "tgi" ]]; then
if [[ "$CURRENT_LLM_SERVING_ENGINE" == "tgi" ]]; then
launch_tgi_server
fi

if [[ $CURRENT_LLM_SERVING_ENGINE == "lmdeploy" ]]; then
if [[ "$CURRENT_LLM_SERVING_ENGINE" == "lmdeploy" ]]; then
launch_lmdeploy_server
fi

if [[ $CURRENT_LLM_SERVING_ENGINE == "sglang" ]]; then
if [[ "$CURRENT_LLM_SERVING_ENGINE" == "sglang" ]]; then
launch_sglang_server
fi

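
For context (not part of this PR's diff): each launch_*_server function above reads its settings from a JSON blob with jq -r, and the fp8 branches test key presence through jq -e's exit status. A minimal standalone sketch of both patterns, with made-up parameters:

#!/usr/bin/env bash
# Hypothetical parameter blob; the real values come from the nightly benchmark test definitions.
common_params='{"model": "meta-llama/Llama-3.1-8B-Instruct", "tp": 2, "port": 8000}'

model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
port=$(echo "$common_params" | jq -r '.port')

if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
    echo "would serve $model (tp=$tp) on port $port from an fp8 checkpoint"
else
    echo "would serve $model (tp=$tp) on port $port"
fi
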
12 changes: 6 additions & 6 deletions .buildkite/nightly-benchmarks/scripts/nightly-annotate.sh
@@ -16,10 +16,10 @@ main() {
fi

# initial annotation
description="$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/nightly-descriptions.md"
#description="$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/nightly-descriptions.md"

# download results
cd $VLLM_SOURCE_CODE_LOC/benchmarks
cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
mkdir -p results/
/workspace/buildkite-agent artifact download 'results/*nightly_results.json' results/
ls
Expand All @@ -30,15 +30,15 @@ main() {
/workspace/buildkite-agent artifact upload "results.zip"

# upload benchmarking scripts
cd $VLLM_SOURCE_CODE_LOC/
cd "$VLLM_SOURCE_CODE_LOC/"
zip -r nightly-benchmarks.zip .buildkite/ benchmarks/
/workspace/buildkite-agent artifact upload "nightly-benchmarks.zip"

cd $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/
cd "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/"
# upload benchmarking pipeline
/workspace/buildkite-agent artifact upload "nightly-pipeline.yaml"

cd $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/
cd "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/"
/workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly-annotation.md


@@ -75,4 +75,4 @@ main() {
# /workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly_results.md
}

main "$@"
main "$@"