Multi models support for upstream #590

Open
wants to merge 211 commits into
base: multi_models_upstream
211 commits
5be4e52
[Model][LoRA]LoRA support added for glm-4v (#10418)
B-201 Nov 18, 2024
e7ebb66
[Model] Remove transformers attention porting in VITs (#10414)
Isotr0py Nov 18, 2024
4186be8
[Doc] Update doc for LoRA support in GLM-4V (#10425)
B-201 Nov 18, 2024
7851b45
[5/N][torch.compile] torch.jit.script --> torch.compile (#10406)
youkaichao Nov 18, 2024
31894a2
[Doc] Add documentation for Structured Outputs (#9943)
ismael-dm Nov 18, 2024
4f686d1
Fix open_collective value in FUNDING.yml (#10426)
andrew Nov 18, 2024
281cc4b
[Model][Bugfix] Support TP for PixtralHF ViT (#10405)
mgoin Nov 18, 2024
6b2d25e
[Hardware][XPU] AWQ/GPTQ support for xpu backend (#10107)
yma11 Nov 18, 2024
c2170a5
[Kernel] Explicitly specify other value in tl.load calls (#9014)
angusYuhao Nov 18, 2024
96d999f
[Kernel] Initial Machete W4A8 support + Refactors (#9855)
LucasWilkinson Nov 18, 2024
a03ea40
[3/N][torch.compile] consolidate custom op logging (#10399)
youkaichao Nov 18, 2024
2298e69
[ci][bugfix] fix kernel tests (#10431)
youkaichao Nov 18, 2024
90a6c75
[misc] partial prefix & random input generation benchmark (#9929)
rickyyx Nov 18, 2024
284203f
[ci/build] Have dependabot ignore all patch update (#10436)
khluu Nov 19, 2024
7eb719d
[Bugfix]Fix Phi-3 BNB online quantization (#10417)
jeejeelee Nov 19, 2024
8c1fb50
[Platform][Refactor] Extract func `get_default_attn_backend` to `Plat…
MengqingCao Nov 19, 2024
74f8c2c
Add openai.beta.chat.completions.parse example to structured_outputs.…
mgoin Nov 19, 2024
272e31c
[Bugfix] Guard for negative counter metrics to prevent crash (#10430)
tjohnson31415 Nov 19, 2024
382b6a4
[Misc] Avoid misleading warning messages (#10438)
jeejeelee Nov 19, 2024
5390d66
[Doc] Add the start of an arch overview page (#10368)
russellb Nov 19, 2024
25f9c78
[misc][plugin] improve plugin loading (#10443)
youkaichao Nov 19, 2024
b461465
[CI][CPU] adding numa node number as container name suffix (#10441)
zhouyuan Nov 19, 2024
f028dff
[BugFix] Fix hermes tool parser output error stream arguments in some…
xiyuan-lee Nov 19, 2024
11fd7ea
[Pixtral-Large] Pixtral actually has no bias in vision-lang adapter (…
patrickvonplaten Nov 19, 2024
1ea291a
Fix: Build error seen on Power Architecture (#10421)
mikejuliet13 Nov 19, 2024
fd9f124
[Doc] fix link for page that was renamed (#10455)
russellb Nov 19, 2024
803f37e
[6/N] torch.compile rollout to users (#10437)
youkaichao Nov 19, 2024
efa9084
[Core] Avoid metrics log noise when idle (#8868)
russellb Nov 19, 2024
b00b33d
[Model][Quantization] HQQ support through Marlin kernel expansion (#9…
ElizaWszola Nov 19, 2024
a324d3a
Change granite chat template to keep json list formatting for tool ca…
maxdebayser Nov 20, 2024
d5b68ab
[CI/Build] Update Dockerfile.rocm (#10434)
Alexei-V-Ivanov-AMD Nov 20, 2024
d200972
[Bugfix] Marlin 2:4 temp fix for large M dim (>256) (#10464)
LucasWilkinson Nov 20, 2024
9e05252
[Misc] Add __setitem__ for LazyDict (#10469)
liuyanyi Nov 20, 2024
ad44437
[Bugfix] Fix Mamba model initialization and MLP Speculator weights lo…
Isotr0py Nov 20, 2024
b4be5a8
[Bugfix] Enforce no chunked prefill for embedding models (#10470)
DarkLight1337 Nov 20, 2024
709c9f1
[CI/Build] Add sphinx/rst linter for docs (#10366)
rafvasq Nov 20, 2024
7629a9c
[CI/Build] Support compilation with local cutlass path (#10423) (#10424)
wchen61 Nov 20, 2024
ed701ca
[ci/build] Combine nightly and optional (#10465)
khluu Nov 20, 2024
343041c
[model] Reduce medusa weight (#10454)
skylee-01 Nov 20, 2024
09dbf9f
[Bugfix] Handle conflicts between modern and legacy fields (#10471)
DarkLight1337 Nov 20, 2024
d5b2844
[Platforms] Refactor xpu code (#10468)
MengqingCao Nov 20, 2024
63f1fde
[Hardware][CPU] Support chunked-prefill and prefix-caching on CPU (#1…
bigPYJ1151 Nov 20, 2024
772a667
[platforms] restore xpu check for parallel config (#10479)
youkaichao Nov 20, 2024
5f1d6af
[perf bench] H200 development (#9768)
simon-mo Nov 20, 2024
0cd3d97
[7/N] torch.compile, reduce compilation time (#10460)
youkaichao Nov 20, 2024
c68f7ed
[Bugfix]: allow extra fields in requests to openai compatible server …
gcalmettes Nov 20, 2024
2f77b6c
[TPU] Implement prefix caching for TPUs (#10307)
WoosukKwon Nov 20, 2024
388ee3d
[torch.compile] limit inductor threads and lazy import quant (#10482)
youkaichao Nov 21, 2024
6c1208d
[Core] Add Sliding Window Support with Flashinfer (#10462)
pavanimajety Nov 21, 2024
9d82717
[Platforms] Add `device_type` in `Platform` (#10508)
MengqingCao Nov 21, 2024
8b0fe06
[torch.compile] Inductor code caching fix (#10273)
ProExpertProg Nov 21, 2024
3430857
[Misc] Increase default video fetch timeout (#10495)
DarkLight1337 Nov 21, 2024
aaddce5
[platforms] improve error message for unspecified platforms (#10520)
youkaichao Nov 21, 2024
f0e0238
[Doc] fix a small typo in docstring of llama_tool_parser (#10513)
FerdinandZhong Nov 21, 2024
1cfde82
[Model] Add Support for Multimodal Granite Models (#10291)
alex-jw-brooks Nov 21, 2024
8a93a59
fix the issue that len(tokenizer(prompt)["input_ids"]) > prompt_len (…
sywangyi Nov 21, 2024
d5ec121
[Model] Expose `dynamic_image_size` as mm_processor_kwargs for Intern…
Isotr0py Nov 21, 2024
4d676f0
[Bugfix] Embedding model pooling_type equals ALL and multi input's bu…
BBuf Nov 21, 2024
da7e702
[Bug]: When apply continue_final_message for OpenAI server, the "echo…
chaunceyjiang Nov 21, 2024
2385b60
[Kernel] Register punica ops directly (#10522)
jeejeelee Nov 21, 2024
c51e397
[Misc] Suppress duplicated logging regarding multimodal input pipelin…
ywang96 Nov 21, 2024
e7a8341
[Bugfix] Allow token ID-only inputs in Qwen2-Audio (#10536)
DarkLight1337 Nov 21, 2024
7560ae5
[8/N] enable cli flag without a space (#10529)
youkaichao Nov 21, 2024
f9310cb
[V1] Fix Compilation config & Enable CUDA graph by default (#10528)
WoosukKwon Nov 21, 2024
edec338
[CI][Installation] Avoid uploading CUDA 11.8 wheel (#10535)
cermeng Nov 21, 2024
cf656f5
[misc] improve error message (#10553)
youkaichao Nov 21, 2024
46fe9b4
[Minor] Revert change in offline inference example (#10545)
WoosukKwon Nov 21, 2024
9afa014
Add small example to metrics.rst (#10550)
mgoin Nov 21, 2024
aed0748
[Benchmark] Add new H100 machine (#10547)
simon-mo Nov 22, 2024
33e0a25
[9/N] torch.compile LLM usage (#10552)
youkaichao Nov 22, 2024
446c780
[Minor] Fix line-too-long (#10563)
WoosukKwon Nov 22, 2024
a111d01
[platforms] absorb worker cls difference into platforms folder (#10555)
youkaichao Nov 22, 2024
b6374e0
[Bugfix] Fix Phi-3 BNB quantization with tensor parallel (#9948)
Isotr0py Nov 22, 2024
11fcf0e
Remove token-adding chat embedding params (#10551)
noamgat Nov 22, 2024
db100c5
[bugfix] fix full graph tests (#10581)
youkaichao Nov 22, 2024
eebad39
[torch.compile] support all attention backends (#10558)
youkaichao Nov 22, 2024
97814fb
[v1] Refactor KVCacheManager for more hash input than token ids (#10507)
rickyyx Nov 22, 2024
948c859
support bitsandbytes quantization with qwen model (#10549)
zixuanzhang226 Nov 23, 2024
28598f3
[Core] remove temporary local variables in LLMEngine.__init__ (#10577)
russellb Nov 23, 2024
d345f40
[V1] EngineCore supports profiling (#10564)
Abatom Nov 23, 2024
d559979
[bugfix] fix cpu tests (#10585)
youkaichao Nov 23, 2024
9195dbd
[Bugfix][Frontend] Update Llama Chat Templates to also support Non-To…
tjohnson31415 Nov 23, 2024
ebda519
[Core] Fix broken log configuration (#10458)
russellb Nov 23, 2024
978b397
[Misc] Add pynccl wrappers for all_gather and reduce_scatter (#9432)
tlrmchlsmth Nov 23, 2024
4aba6e3
[core] gemma2 full context length support (#10584)
youkaichao Nov 23, 2024
7d8ffb3
[Bugfix] Internal Server Error when tool_choice is incorrect. (#10567)
shenoyvvarun Nov 23, 2024
cfea9c0
[Model] Fix Baichuan BNB online quantization (#10572)
CNTRYROA Nov 23, 2024
02a43f8
Update default max_num_batch_tokens for chunked prefill to 2048 (#10544)
mgoin Nov 23, 2024
7c25fe4
[AMD] Add support for GGUF quantization on ROCm (#10254)
kliuae Nov 23, 2024
4634a89
Prefix Cache Aware Scheduling [1/n] (#10128)
rickyyx Nov 23, 2024
c8acd80
[2/N] handling placeholders in merged multi-modal processor (#10485)
DarkLight1337 Nov 23, 2024
4cfe5d2
[Bugfix] `multi_modal_kwargs` broadcast for CPU tensor parallel (#10541)
Isotr0py Nov 23, 2024
86a44fb
[Platforms] Refactor openvino code (#10573)
ji-huazhong Nov 23, 2024
651f6c3
For ppc64le, disabled tests for now and addressed space issues (#10538)
npanpaliya Nov 23, 2024
04668eb
[Bugfix] Avoid import AttentionMetadata explicitly in Mllama (#10593)
Isotr0py Nov 23, 2024
17d8fc1
[bugfix] Fix example/tensorize_vllm_model tests (#10595)
jeejeelee Nov 24, 2024
1700c54
[Bugfix] Fix LoRA weight sharding (#10450)
jeejeelee Nov 24, 2024
1c445dc
[CI/Build] Print running script to enhance CI log readability (#10594)
jeejeelee Nov 24, 2024
eda2b35
Revert "Print running script to enhance CI log readability" (#10601)
youkaichao Nov 24, 2024
c055747
[model][utils] add extract_layer_index utility function (#10599)
youkaichao Nov 24, 2024
e4fbb14
[doc] update the code to add models (#10603)
youkaichao Nov 24, 2024
49628fe
[Doc] Update README.md with Ray Summit talk links (#10610)
zhuohan123 Nov 25, 2024
214efc2
Support Cross encoder models (#10400)
maxdebayser Nov 25, 2024
7ea3cd7
[Refactor][MISC] del redundant code in ParallelConfig.postinit (#10614)
MengqingCao Nov 25, 2024
571841b
[torch.compile] support encoder based models (#10613)
youkaichao Nov 25, 2024
a30a605
[Doc] Add encoder-based models to Supported Models page (#10616)
DarkLight1337 Nov 25, 2024
7c2134b
[torch.compile] force inductor threads (#10620)
jeejeelee Nov 25, 2024
6581378
[torch.compile] add warning for unsupported models (#10622)
youkaichao Nov 25, 2024
25d806e
[misc] add torch.compile compatibility check (#10618)
youkaichao Nov 25, 2024
05d1f8c
[misc] move functions to config.py (#10624)
youkaichao Nov 25, 2024
ed46f14
[Model] Support `is_causal` HF config field for Qwen2 model (#10621)
DarkLight1337 Nov 25, 2024
2b0879b
Super tiny little typo fix (#10633)
fzyzcjy Nov 25, 2024
d04b13a
[Bug]: Authorization ignored when root_path is set (#10606)
chaunceyjiang Nov 25, 2024
c27df94
[Bugfix] Fix chunked prefill with model dtype float32 on Turing Devic…
wallashss Nov 25, 2024
452a4e8
[Docs] Add Snowflake Slides (#10641)
simon-mo Nov 25, 2024
b1d9205
[Model]: Add support for Aria model (#10514)
xffxff Nov 25, 2024
cf73f0c
[Model] Enable optional prefix when loading embedding models (#10639)
DarkLight1337 Nov 25, 2024
1b583cf
[Doc] Fix typos in docs (#10636)
DarkLight1337 Nov 25, 2024
9db713a
[Model] Add OLMo November 2024 model (#10503)
2015aroras Nov 25, 2024
6e9ff05
[misc] do not read HOST_IP (#10644)
youkaichao Nov 26, 2024
45ac4ff
[bugfix] fix aria model and add torch.compile (#10645)
youkaichao Nov 26, 2024
a6760f6
[Feature] vLLM ARM Enablement for AARCH64 CPUs (#9228)
sanketkaleoss Nov 26, 2024
519e8e4
[v1] EngineArgs for better config handling for v1 (#10382)
rickyyx Nov 26, 2024
9a88f89
custom allreduce + torch.compile (#10121)
SageMoore Nov 26, 2024
9406353
[Misc] Remove outdated init protocols (#10655)
DarkLight1337 Nov 26, 2024
334d64d
[ci] add vllm_test_utils (#10659)
youkaichao Nov 26, 2024
1f6584e
[V1] Enable profile for LLMEngine (#10665)
jikunshang Nov 26, 2024
db66e01
[Bugfix] Fix for Spec model TP + Chunked Prefill (#10232)
andoorve Nov 26, 2024
f5792c7
[Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson (#9735)
conroy-cheers Nov 26, 2024
9a99273
[Bugfix] Fix using `-O[0,3]` with LLM entrypoint (#10677)
mgoin Nov 26, 2024
7576cd3
[Bugfix] Check bnb_4bit_quant_storage for bitsandbytes (#10642)
mgoin Nov 26, 2024
2f0a0a1
[V1] Refactor model executable interface for multimodal models (#10570)
ywang96 Nov 26, 2024
0a71900
Remove hard-dependencies of Speculative decode to CUDA workers (#10587)
xuechendi Nov 27, 2024
0a4d968
[V1] Update interface for idefics3 (#10680)
ywang96 Nov 27, 2024
1bf905d
[Bugfix][SpecDecode] apply sampling parameters to target probabilitie…
jeongin601 Nov 27, 2024
cfb3bf2
[bugfix] fix the default value of llm_int8_threshold in BitsAndBytesC…
yansh97 Nov 27, 2024
e85250b
[Hardware][Gaudi]add get_name method for HPUAttentionBackend (#10667)
jikunshang Nov 27, 2024
15cc2a9
[Misc]Further reduce BNB static variable (#10597)
jeejeelee Nov 27, 2024
e225110
[Kernel] Remove if-else with identical branches in marlin 2:4 (#10687)
tlrmchlsmth Nov 27, 2024
1209261
[Model] Support telechat2 (#10311)
shunxing12345 Nov 27, 2024
418cb3b
[Bugfix][Hardware][CPU] Fix intel-omp version to avoid segfault (#10700)
bigPYJ1151 Nov 27, 2024
9e0a147
[V1] Update interface for mistral-format Pixtral (#10703)
ywang96 Nov 27, 2024
308cc5e
[ci] fix slow tests (#10698)
youkaichao Nov 27, 2024
c411def
[torch.compile] fix shape specialization (#10722)
youkaichao Nov 27, 2024
b98c62b
[Bugfix] Fix GGUF inference with FP16 unquantized checkpoint (#10675)
Isotr0py Nov 27, 2024
197b448
[Bugfix][Mamba] Fix Multistep on Mamba-like models (#10705)
mzusman Nov 27, 2024
9b4b150
[Bugfix] Ignore `lm_head` when loading embedding models (#10719)
DarkLight1337 Nov 27, 2024
395b1c7
[Frontend] don't block event loop in tokenization (preprocess) in Ope…
tomeras91 Nov 27, 2024
cb4e1c3
[misc] upgrade filelock version (#10731)
youkaichao Nov 28, 2024
70dc14f
[Model] support bitsandbytes quantization with minicpm3 model (#10682)
zixuanzhang226 Nov 28, 2024
278be67
[Doc] Update model in arch_overview.rst to match comment (#10701)
spacewander Nov 28, 2024
d9b4b3f
[Bug][CLI] Allow users to disable prefix caching explicitly (#10724)
rickyyx Nov 28, 2024
a79b122
[V1] Do not allocate beyond the max_model_len (#10730)
WoosukKwon Nov 28, 2024
9a8bff0
[Kernel] Update vllm-flash-attn version (#10736)
WoosukKwon Nov 28, 2024
3ed5e73
[TPU] Update requirements-tpu (#10726)
richardsliu Nov 28, 2024
5fc5ce0
[Model] Added GLM-4 series hf format model support vllm==0.6.4 (#10561)
sixsixcoder Nov 28, 2024
8c1e77f
[Kernel] Update vllm-flash-attn version to reduce CPU overheads (#10742)
WoosukKwon Nov 28, 2024
98f47f2
[V1] Optimize the CPU overheads in FlashAttention custom op (#10733)
WoosukKwon Nov 28, 2024
c83919c
[Model] Add Internlm2 LoRA support (#5064)
Isotr0py Nov 28, 2024
fa6ecb9
[Model] Clean up MiniCPMV (#10751)
DarkLight1337 Nov 29, 2024
c82b432
[Misc] typo find in sampling_metadata.py (#10740)
noooop Nov 29, 2024
3132aac
[Bugfix] Fix Idefics3 bug (#10778)
jeejeelee Nov 29, 2024
661175b
[platform] Add verify_quantization in platform. (#10757)
wangxiyuan Nov 29, 2024
40bc242
[Bugfix] Fix OpenVino/Neuron `driver_worker` init (#10779)
NickLucche Nov 30, 2024
16ee07f
[Model] Refactor Molmo weights loading to use AutoWeightsLoader (#10771)
Isotr0py Nov 30, 2024
e7cfc4e
[Interleaved ATTN] Support for Mistral-8B (#10591)
patrickvonplaten Nov 30, 2024
7e4bbda
[doc] format fix (#10789)
wangxiyuan Nov 30, 2024
1337071
[Model] Replace embedding models with pooling adapter (#10769)
DarkLight1337 Dec 1, 2024
f877a7d
[Misc] Improve type annotations for `support_torch_compile` (#10763)
DarkLight1337 Dec 1, 2024
d2f058e
[Misc] Rename embedding classes to pooling (#10801)
DarkLight1337 Dec 1, 2024
169a0ff
[doc] add warning about comparing hf and vllm outputs (#10805)
youkaichao Dec 1, 2024
c11f172
[Misc] Adding `MMMU-Pro` vision dataset to serving benchmark (#10804)
ywang96 Dec 1, 2024
0590ec3
[Core] Implement disagg prefill by StatelessProcessGroup (#10502)
KuntaiDu Dec 2, 2024
b18c9bb
[Model] Add BNB support to Llava and Pixtral-HF (#10795)
Isotr0py Dec 2, 2024
b795477
[core] Avoid metrics log noise when idle - include speculative decodi…
cduk Dec 2, 2024
073a4bd
[Kernel] Use `out` arg in flash_attn_varlen_func (#10811)
WoosukKwon Dec 2, 2024
e25810a
Fill TorchSDPAAttentionMetadata seq_lens_field for prefill (#10799)
maxdebayser Dec 2, 2024
63a1641
[misc] remove xverse modeling file (#10814)
youkaichao Dec 2, 2024
995a148
[doc]Update config docstring (#10732)
wangxiyuan Dec 2, 2024
ef31eab
[Model]: add some tests for aria model (#10770)
xffxff Dec 2, 2024
e95f275
[CI/Build] Update `mistral_common` version for tests and docs (#10825)
DarkLight1337 Dec 2, 2024
a4c4daf
[misc] use out argument for flash attention (#10822)
youkaichao Dec 2, 2024
b45f0d7
[Misc][LoRA] Move the implementation of lora bias to punica.py (#10829)
jeejeelee Dec 2, 2024
519cc6c
[Misc][XPU] Avoid torch compile for XPU platform (#10747)
yma11 Dec 2, 2024
9b14d97
Fix openvino on GPU (#10793)
janimo Dec 2, 2024
4c05edb
[Model] Add TP and BNB quantization support to LlavaMultiModalProject…
Isotr0py Dec 2, 2024
4433195
[Bugfix] Prevent benchmark_throughput.py from using duplicated random…
mgoin Dec 3, 2024
d746268
[Model] support bitsandbytes quantization with minicpm model (#10842)
zixuanzhang226 Dec 3, 2024
a4cf256
[Bugfix] Fix QKVParallelLinearWithShardedLora bias bug (#10844)
jeejeelee Dec 3, 2024
21fe7b4
[core][distributed] add pynccl broadcast (#10843)
youkaichao Dec 3, 2024
dc5ce86
[torch.compile] remove compilation_context and simplify code (#10838)
youkaichao Dec 3, 2024
ef51831
[Doc] Add github links for source code references (#10672)
russellb Dec 3, 2024
3257d44
[Misc] Remove deprecated names (#10817)
DarkLight1337 Dec 3, 2024
9323a31
[Core][Performance] Add XGrammar support for guided decoding and set …
aarnphm Dec 3, 2024
f6084f6
[Speculative Decoding] Move indices to device before filtering output…
zhengy001 Dec 3, 2024
3bc94ca
[V1] VLM - Run the mm_mapper preprocessor in the frontend process (#1…
alexm-neuralmagic Dec 3, 2024
2f2cdc7
[MISC][XPU] quick fix for XPU CI (#10859)
yma11 Dec 3, 2024
7090c27
[Bugfix] Only require XGrammar on x86 (#10865)
mgoin Dec 3, 2024
7c32b68
[Frontend] correctly record prefill and decode time metrics (#10853)
tomeras91 Dec 3, 2024
a061fe6
[Build][Bugfix] Using the correct type hint (#10866)
gshtras Dec 3, 2024
381ac93
[Benchmark] Benchmark structured output with datasets (#10557)
xuechendi Dec 4, 2024
d2bd88b
[CI/Build] Replace mean with torch.all in test_pynccl.py (#10876)
tlrmchlsmth Dec 4, 2024
b5b647b
Drop ROCm load format check (#10767)
wangxiyuan Dec 4, 2024
fa2dea6
[ci/build] Change queue name for Release jobs (#10875)
khluu Dec 4, 2024
c9ca4fc
[ci/build] Job to build and push release image (#10877)
khluu Dec 4, 2024
8db957e
[bugfix] fixed parameter “n” when set parameter “bestof” > 1 (#10854)
o2363286 Dec 4, 2024
c92acb9
[ci/build] Update vLLM postmerge ECR repo (#10887)
khluu Dec 4, 2024
41685cb
copy origin file to mm_ files
jikunshang Nov 22, 2024
883fac0
support multi models
jikunshang Nov 22, 2024
655b610
add model arg to original client
xuechendi Dec 3, 2024
e4ad1c5
fix GPU memory allocation for multi_models
xuechendi Dec 5, 2024
60 changes: 43 additions & 17 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -9,16 +9,19 @@ steps:
- image: badouralix/curl-jq
command:
- sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh

- wait

- label: "A100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: A100
plugins:
- kubernetes:
podSpec:
priorityClassName: perf-benchmark
containers:
- image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
- image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
resources:
@@ -41,20 +44,43 @@ steps:
- name: devshm
emptyDir:
medium: Memory
# - label: "H100"
# agents:
# queue: H100
# plugins:
# - docker#v5.11.0:
# image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
# command:
# - bash
# - .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
# mount-buildkite-agent: true
# propagate-environment: true
# ipc: host
# gpus: all
# environment:
# - VLLM_USAGE_SOURCE
# - HF_TOKEN

- label: "H200"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H200
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: 4,5,6,7
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN

- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H100
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: all # see CUDA_VISIBLE_DEVICES for actual GPUs used
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN
@@ -157,6 +157,18 @@ def results_to_json(latency, throughput, serving):
throughput_results,
serving_results)

for df in [latency_results, serving_results, throughput_results]:
if df.empty:
continue

# Sort all dataframes by their respective "Test name" columns
df.sort_values(by="Test name", inplace=True)

# The GPUs sometimes come in format of "GPUTYPE\nGPUTYPE\n...",
# we want to turn it into "8xGPUTYPE"
df["GPU"] = df["GPU"].apply(
lambda x: f"{len(x.split('\n'))}x{x.split('\n')[0]}")

# get markdown tables
latency_md_table = tabulate(latency_results,
headers='keys',
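
The lambda added in this hunk collapses a newline-separated GPU list into a count-prefixed label. One caveat worth flagging: it puts `'\n'` inside the expression part of an f-string, which Python versions before 3.12 reject with a `SyntaxError`. A minimal sketch of the same transformation that keeps the backslash out of the f-string, using a hypothetical helper name `collapse_gpu_label` that is not part of the PR:

```python
def collapse_gpu_label(raw: str) -> str:
    # Hypothetical helper for illustration; the PR uses an inline lambda.
    # Input looks like "GPUTYPE<newline>GPUTYPE<newline>...".
    # Splitting outside the f-string avoids a backslash in the
    # replacement field, which older Python versions do not allow.
    names = raw.split("\n")
    return f"{len(names)}x{names[0]}"


if __name__ == "__main__":
    print(collapse_gpu_label("H100\nH100\nH100\nH100"))  # prints 4xH100
```

Under that assumption, the column update in the diff could be written as `df["GPU"] = df["GPU"].apply(collapse_gpu_label)`. A value with a single GPU name comes out as `1xGPUTYPE`.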
@@ -6,6 +6,7 @@

# Do not set -e, as the mixtral 8x22B model tends to crash occasionally
# and we still want to see other benchmarking results even when mixtral crashes.
set -x
set -o pipefail

check_gpus() {
@@ -85,11 +86,7 @@ kill_gpu_processes() {

ps -aux
lsof -t -i:8000 | xargs -r kill -9
pkill -f pt_main_thread
# this line doesn't work now
# ps aux | grep python | grep openai | awk '{print $2}' | xargs -r kill -9
pkill -f python3
pkill -f /usr/bin/python3
pgrep python3 | xargs -r kill -9


# wait until GPU memory usage smaller than 1GB
@@ -289,7 +286,7 @@ run_serving_tests() {
# run the server
echo "Running test case $test_name"
echo "Server command: $server_command"
eval "$server_command" &
bash -c "$server_command" &
server_pid=$!

# wait until the server is alive
@@ -322,7 +319,7 @@ run_serving_tests() {
echo "Running test case $test_name with qps $qps"
echo "Client command: $client_command"

eval "$client_command"
bash -c "$client_command"

# record the benchmarking commands
jq_output=$(jq -n \
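
The server and client hunks of `run_serving_tests` both swap `eval` for `bash -c`. One visible effect is that each benchmark command now runs in a fresh child shell rather than in the script's own shell. A rough Python sketch of that isolation property, assuming `bash` is on `PATH`; `run_isolated` is a hypothetical name and not part of the benchmark scripts:

```python
import os
import subprocess


def run_isolated(command: str) -> subprocess.CompletedProcess:
    """Run a command string in a child bash process, like `bash -c "$cmd"`."""
    # The child works on a copy of the environment, so anything it exports
    # (or any directory change it makes) is gone once it exits.
    return subprocess.run(
        ["bash", "-c", command],
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    result = run_isolated("export DEMO_VAR=1; echo $DEMO_VAR")
    print(result.stdout.strip())     # value as seen inside the child shell
    print("DEMO_VAR" in os.environ)  # the parent environment is unchanged
```

With `eval`, the same `export` would have modified the calling shell's environment for every later test case.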
4 changes: 2 additions & 2 deletions .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
@@ -1,6 +1,6 @@
#!/bin/sh
TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-test-repo:pull" | jq -r .token)
URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-test-repo/manifests/$BUILDKITE_COMMIT"
TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-postmerge-repo:pull" | jq -r .token)
URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-postmerge-repo/manifests/$BUILDKITE_COMMIT"

TIMEOUT_SECONDS=10

17 changes: 15 additions & 2 deletions .buildkite/release-pipeline.yaml
@@ -1,7 +1,7 @@
steps:
- label: "Build wheel - CUDA 12.1"
agents:
queue: cpu_queue
queue: cpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts"
@@ -18,11 +18,24 @@ steps:
- label: "Build wheel - CUDA 11.8"
# depends_on: block-build-cu118-wheel
agents:
queue: cpu_queue
queue: cpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
- "bash .buildkite/upload-wheels.sh"
env:
DOCKER_BUILDKIT: "1"

- block: "Build release image"
depends_on: ~
key: block-release-image-build

- label: "Build release image"
depends_on: block-release-image-build
agents:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
1 change: 0 additions & 1 deletion .buildkite/run-amd-test.sh
@@ -85,7 +85,6 @@ if [[ $commands == *" kernels "* ]]; then
--ignore=kernels/test_encoder_decoder_attn.py \
--ignore=kernels/test_flash_attn.py \
--ignore=kernels/test_flashinfer.py \
--ignore=kernels/test_gguf.py \
--ignore=kernels/test_int8_quant.py \
--ignore=kernels/test_machete_gemm.py \
--ignore=kernels/test_mamba_ssm.py \
44 changes: 3 additions & 41 deletions .buildkite/run-cpu-test-ppc64le.sh
@@ -4,49 +4,11 @@
# It serves a sanity check for compilation and basic model usage.
set -ex

# Try building the docker image
docker build -t cpu-test -f Dockerfile.ppc64le .

# Setup cleanup
remove_docker_container() { docker rm -f cpu-test || true; }
remove_docker_container() { docker rm -f cpu-test || true; docker system prune -f; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image, setting --shm-size=4g for tensor parallel.
source /etc/environment
#docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test cpu-test
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN="$HF_TOKEN" --name cpu-test cpu-test

function cpu_tests() {
set -e

# Run basic model test
docker exec cpu-test bash -c "
set -e
pip install pytest pytest-asyncio \
decord einops librosa peft Pillow sentence-transformers soundfile \
transformers_stream_generator matplotlib datamodel_code_generator
pip install torchvision --index-url https://download.pytorch.org/whl/cpu
pytest -v -s tests/models/decoder_only/language -m cpu_model
pytest -v -s tests/models/embedding/language -m cpu_model
pytest -v -s tests/models/encoder_decoder/language -m cpu_model
pytest -v -s tests/models/decoder_only/audio_language -m cpu_model
pytest -v -s tests/models/decoder_only/vision_language -m cpu_model"

# online inference
docker exec cpu-test bash -c "
set -e
python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m &
timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
python3 benchmarks/benchmark_serving.py \
--backend vllm \
--dataset-name random \
--model facebook/opt-125m \
--num-prompts 20 \
--endpoint /v1/completions \
--tokenizer facebook/opt-125m"
}
# Try building the docker image
docker build -t cpu-test -f Dockerfile.ppc64le .

# All of CPU tests are expected to be finished less than 25 mins.
export -f cpu_tests
timeout 25m bash -c "cpu_tests"
25 changes: 16 additions & 9 deletions .buildkite/run-cpu-test.sh
@@ -13,26 +13,27 @@ numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build -t cpu-test -f Dockerfile.
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-avx2 -f Dockerfile.cpu .

# Setup cleanup
remove_docker_container() { docker rm -f cpu-test cpu-test-avx2 || true; }
remove_docker_container() { docker rm -f cpu-test-"$NUMA_NODE" cpu-test-avx2-"$NUMA_NODE" || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image, setting --shm-size=4g for tensor parallel.
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test cpu-test
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-avx2 cpu-test-avx2
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-avx2-"$NUMA_NODE" cpu-test-avx2

function cpu_tests() {
set -e
export NUMA_NODE=$2

# offline inference
docker exec cpu-test-avx2 bash -c "
docker exec cpu-test-avx2-"$NUMA_NODE" bash -c "
set -e
python3 examples/offline_inference.py"

# Run basic model test
docker exec cpu-test bash -c "
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pip install pytest pytest-asyncio \
decord einops librosa peft Pillow sentence-transformers soundfile \
@@ -45,20 +46,26 @@ function cpu_tests() {
pytest -v -s tests/models/decoder_only/vision_language -m cpu_model"

# Run compressed-tensor test
docker exec cpu-test bash -c "
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynamic_per_token"

# Run AWQ test
docker exec cpu-test bash -c "
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/quantization/test_ipex_quant.py"

# Run chunked-prefill and prefix-cache test
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -s -v -k cpu_model \
tests/basic_correctness/test_chunked_prefill.py"

# online inference
docker exec cpu-test bash -c "
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
export VLLM_CPU_KVCACHE_SPACE=10
export VLLM_CPU_OMP_THREADS_BIND=$1
@@ -75,4 +82,4 @@ function cpu_tests() {

# All of CPU tests are expected to be finished less than 25 mins.
export -f cpu_tests
timeout 25m bash -c "cpu_tests $CORE_RANGE"
timeout 30m bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"
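
The changes to this script suffix both container names with `$NUMA_NODE`, matching the commit titled "adding numa node number as container name suffix". Presumably this lets jobs pinned to different NUMA nodes share one host without clashing on `docker run --name`; that motivation is an inference, not something the diff states. A minimal sketch of the naming scheme, where `container_names` is a hypothetical helper and not something in the PR:

```python
from typing import Dict


def container_names(numa_node: str) -> Dict[str, str]:
    """Map each image name to its per-NUMA-node container name."""
    # Image names are taken from the script; the suffix mirrors
    # the cpu-test-"$NUMA_NODE" pattern used in the diff.
    images = ("cpu-test", "cpu-test-avx2")
    return {image: f"{image}-{numa_node}" for image in images}


if __name__ == "__main__":
    for image, name in container_names("1").items():
        print(f"docker run --name {name} {image}")
```

The cleanup function in the script has to use the same suffixed names, which is why `remove_docker_container` changes in the same hunk.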
7 changes: 5 additions & 2 deletions .buildkite/run-xpu-test.sh
@@ -12,5 +12,8 @@ remove_docker_container() { docker rm -f xpu-test || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --network host --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test python3 examples/offline_inference.py
# Run the image and test offline inference/tensor parallel
docker run --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test sh -c '
python3 examples/offline_inference.py
python3 examples/offline_inference_cli.py -tp 2
'