Nov 18 rebase #485

Merged 187 commits on Nov 18, 2024
2003cc3
[Model][LoRA]LoRA support added for LlamaEmbeddingModel (#10071)
jeejeelee Nov 6, 2024
a5bba7d
[Model] Add Idefics3 support (#9767)
jeejeelee Nov 6, 2024
406d4cc
[Model][LoRA]LoRA support added for Qwen2VLForConditionalGeneration (…
ericperfect Nov 6, 2024
399c798
Remove ScaledActivation for AWQ (#10057)
mgoin Nov 6, 2024
098f94d
[CI/Build] Drop Python 3.8 support (#10038)
russellb Nov 6, 2024
87bd7e0
[CI/Build] change conflict PR comment from mergify (#10080)
russellb Nov 6, 2024
d58268c
[V1] Make v1 more testable (#9888)
joerunde Nov 6, 2024
74f2f8a
[CI/Build] Always run the ruff workflow (#10092)
russellb Nov 6, 2024
719c1ca
[core][distributed] add stateless_init_process_group (#10072)
youkaichao Nov 7, 2024
4ab3256
[Bugfix] Fix FP8 torch._scaled_mm fallback for torch>2.5 with CUDA<12…
mgoin Nov 7, 2024
d3859f1
[Misc][XPU] Upgrade to Pytorch 2.5 for xpu backend (#9823)
yma11 Nov 7, 2024
29862b8
[Frontend] Adjust try/except blocks in API impl (#10056)
njhill Nov 7, 2024
a4b3e0c
[Hardware][CPU] Update torch 2.5 (#9911)
bigPYJ1151 Nov 7, 2024
e7b84c3
[doc] add back Python 3.8 ABI (#10100)
youkaichao Nov 7, 2024
1fa020c
[V1][BugFix] Fix Generator construction in greedy + seed case (#10097)
njhill Nov 7, 2024
db7db4a
[Misc] Consolidate ModelConfig code related to HF config (#10104)
DarkLight1337 Nov 7, 2024
104d729
[CI/Build] re-add codespell to CI (#10083)
russellb Nov 7, 2024
d7263a1
Doc: Improve benchmark documentation (#9927)
rafvasq Nov 7, 2024
6192e9b
[Core][Distributed] Refactor ipc buffer init in CustomAllreduce (#10030)
hanzhi713 Nov 7, 2024
e036e52
[CI/Build] Improve mypy + python version matrix (#10041)
russellb Nov 7, 2024
aa9078f
Adds method to read the pooling types from model's files (#9506)
flaviabeo Nov 7, 2024
0dfba97
[Frontend] Fix multiple values for keyword argument error (#10075) (#…
DIYer22 Nov 7, 2024
a6f332d
[Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target (#…
bigPYJ1151 Nov 7, 2024
999df95
[Bugfix] Make image processor respect `mm_processor_kwargs` for Qwen2…
li-plus Nov 7, 2024
a62bc01
[Misc] Add Gamma-Distribution Request Generation Support for Serving …
spliii Nov 7, 2024
ae62fd1
[Frontend] Tool calling parser for Granite 3.0 models (#9027)
maxdebayser Nov 7, 2024
9d43afc
[Feature] [Spec decode]: Combine chunked prefill with speculative dec…
NickLucche Nov 7, 2024
de0e61a
[CI/Build] Always run mypy (#10122)
russellb Nov 7, 2024
3be5b26
[CI/Build] Add shell script linting using shellcheck (#7925)
russellb Nov 7, 2024
a2f1f3b
[CI/Build] Automate PR body text cleanup (#10082)
russellb Nov 7, 2024
97b8475
Bump actions/setup-python from 5.2.0 to 5.3.0 (#9745)
dependabot[bot] Nov 7, 2024
28b2877
Online video support for VLMs (#10020)
litianjian Nov 7, 2024
93bff42
Bump actions/checkout from 4.2.1 to 4.2.2 (#9746)
dependabot[bot] Nov 7, 2024
073a472
[Misc] report relevant env vars in collect_env.py tool (#9293)
ycool Nov 8, 2024
42b4f46
[V1] Add all_token_ids attribute to Request (#10135)
WoosukKwon Nov 8, 2024
201fc07
[V1] Prefix caching (take 2) (#9972)
comaniac Nov 8, 2024
6bb52b0
[CI/Build] Give PR cleanup job PR write access (#10139)
russellb Nov 8, 2024
40d0e74
[Doc] Update FAQ links in spec_decode.rst (#9662)
whyiug Nov 8, 2024
ad39bd6
[Bugfix] Add error handling when server cannot respond any valid toke…
DearPlanet Nov 8, 2024
7371749
[Misc] Fix ImportError causing by triton (#9493)
MengqingCao Nov 8, 2024
3a7f15a
[Doc] Move CONTRIBUTING to docs site (#9924)
russellb Nov 8, 2024
da07a9e
Fixes a typo about 'max_decode_seq_len' which causes crashes with cud…
sighingnow Nov 8, 2024
aea6ad6
Add hf_transfer to testing image (#10096)
mgoin Nov 8, 2024
f4c2187
[Misc] Fix typo in #5895 (#10145)
DarkLight1337 Nov 8, 2024
f10797c
[Bugfix][XPU] Fix xpu tp by introducing XpuCommunicator (#10144)
yma11 Nov 8, 2024
1ff4aed
[Model] Expose size to Idefics3 as mm_processor_kwargs (#10146)
Isotr0py Nov 8, 2024
208ce62
[V1]Enable APC by default only for text models (#10148)
ywang96 Nov 8, 2024
b489fc3
[CI/Build] Update CPU tests to include all "standard" tests (#5481)
DarkLight1337 Nov 8, 2024
0535e5f
Fix edge case Mistral tokenizer (#10152)
patrickvonplaten Nov 8, 2024
f677862
Disable spec-decode + chunked-prefill for draft models with tensor pa…
sroy745 Nov 8, 2024
6b30471
[Misc] Improve Web UI (#10090)
rafvasq Nov 8, 2024
b5815c8
[V1] Fix non-cudagraph op name (#10166)
WoosukKwon Nov 8, 2024
87713c6
[CI/Build] Ignore .gitignored files for shellcheck (#10162)
ProExpertProg Nov 8, 2024
e1b5a82
Rename vllm.logging to vllm.logging_utils (#10134)
flozi00 Nov 8, 2024
4f93dfe
[torch.compile] Fuse RMSNorm with quant (#9138)
ProExpertProg Nov 8, 2024
10b67d8
[Bugfix] SymIntArrayRef expected to contain concrete integers (#10170)
bnellnm Nov 8, 2024
127c074
[Kernel][Triton] Add Triton implementation for scaled_mm_triton to su…
rasmith Nov 9, 2024
d7edca1
[CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking …
bigPYJ1151 Nov 9, 2024
e0191a9
[0/N] Rename `MultiModalInputs` to `MultiModalKwargs` (#10040)
DarkLight1337 Nov 9, 2024
f83fecc
[Bugfix] Ignore GPTQ quantization of Qwen2-VL visual module (#10169)
mgoin Nov 9, 2024
47672f3
[CI/Build] Fix VLM broadcast tests `tensor_parallel_size` passing (#1…
Isotr0py Nov 9, 2024
49d2a41
[Doc] Adjust RunLLM location (#10176)
DarkLight1337 Nov 9, 2024
1a95f10
[5/N] pass the whole config to model (#9983)
youkaichao Nov 9, 2024
8e1529d
[CI/Build] Add run-hpu-test.sh script (#10167)
xuechendi Nov 9, 2024
f192aeb
[Bugfix] Enable some fp8 and quantized fullgraph tests (#10171)
bnellnm Nov 9, 2024
bd46357
[bugfix] fix broken tests of mlp speculator (#10177)
youkaichao Nov 9, 2024
8a4358e
[doc] explaining the integration with huggingface (#10173)
youkaichao Nov 9, 2024
9e37266
bugfix: fix the bug that stream generate not work (#2756)
caijizhuo Nov 9, 2024
d88bff1
[Frontend] add `add_request_id` middleware (#9594)
cjackal Nov 9, 2024
b09895a
[Frontend][Core] Override HF `config.json` via CLI (#5836)
KrishnaM251 Nov 9, 2024
51c2e1f
[CI/Build] Split up models tests (#10069)
DarkLight1337 Nov 9, 2024
9fa4bdd
[ci][build] limit cmake version (#10188)
youkaichao Nov 10, 2024
1968202
[Doc] Fix typo error in CONTRIBUTING.md (#10190)
FuryMartin Nov 10, 2024
bfb7d61
[doc] Polish the integration with huggingface doc (#10195)
CRZbulabula Nov 10, 2024
20cf2f5
[Misc] small fixes to function tracing file path (#9543)
ShawnD200 Nov 10, 2024
73b9083
[misc] improve cloudpickle registration and tests (#10202)
youkaichao Nov 11, 2024
ad9a78b
[Doc] Fix typo error in vllm/entrypoints/openai/cli_args.py (#10196)
yansh97 Nov 11, 2024
f0f2e56
[doc] improve debugging code (#10206)
youkaichao Nov 11, 2024
f89d18f
[6/N] pass whole config to inner model (#10205)
youkaichao Nov 11, 2024
9804ac7
Bump the patch-update group with 5 updates (#10210)
dependabot[bot] Nov 11, 2024
58170d6
[Hardware][CPU] Add embedding models support for CPU backend (#10193)
Isotr0py Nov 11, 2024
36e4acd
[LoRA][Kernel] Remove the unused libentry module (#10214)
jeejeelee Nov 11, 2024
5fb1f93
[V1] Allow `tokenizer_mode` and `trust_remote_code` for Detokenizer (…
ywang96 Nov 11, 2024
2cebda4
[Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner (#10218)
Isotr0py Nov 11, 2024
874f551
[Metrics] add more metrics (#4464)
HarryWu99 Nov 11, 2024
36fc439
[Doc] fix doc string typo in block_manager `swap_out` function (#10212)
yyccli Nov 11, 2024
e6de978
[core][distributed] add stateless process group (#10216)
youkaichao Nov 11, 2024
25144ce
Bump actions/setup-python from 5.2.0 to 5.3.0 (#10209)
dependabot[bot] Nov 11, 2024
f9dadfb
[V1] Fix detokenizer ports (#10224)
WoosukKwon Nov 11, 2024
d7a4f22
[V1] Do not use inductor for piecewise CUDA graphs (#10225)
WoosukKwon Nov 11, 2024
330e82d
[v1][torch.compile] support managing cudagraph buffer (#10203)
youkaichao Nov 11, 2024
fe15729
[V1] Use custom ops for piecewise CUDA graphs (#10227)
WoosukKwon Nov 11, 2024
4800339
Add docs on serving with Llama Stack (#10183)
terrytangyuan Nov 11, 2024
8a7fe47
[misc][distributed] auto port selection and disable tests (#10226)
youkaichao Nov 11, 2024
9d5b4e4
[V1] Enable custom ops with piecewise CUDA graphs (#10228)
WoosukKwon Nov 11, 2024
08f93e7
Make shutil rename in python_only_dev (#10233)
shcheglovnd Nov 11, 2024
6ace6fb
[V1] `AsyncLLM` Implementation (#9826)
robertgshaw2-neuralmagic Nov 11, 2024
d1c6799
[doc] update debugging guide (#10236)
youkaichao Nov 11, 2024
9cdba96
[Doc] Update help text for `--distributed-executor-backend` (#10231)
russellb Nov 12, 2024
eea55cc
[1/N] torch.compile user interface design (#10237)
youkaichao Nov 12, 2024
7f5edb5
[Misc][LoRA] Replace hardcoded cuda device with configurable argument…
jeejeelee Nov 12, 2024
812c981
Splitting attention kernel file (#10091)
maleksan85 Nov 12, 2024
3a28f18
[doc] explain the class hierarchy in vLLM (#10240)
youkaichao Nov 12, 2024
d201d41
[CI][CPU]refactor CPU tests to allow to bind with different cores (#1…
zhouyuan Nov 12, 2024
36c513a
[BugFix] Do not raise a `ValueError` when `tool_choice` is set to the…
gcalmettes Nov 12, 2024
65a920e
Merge remote-tracking branch 'upstream/main' into HEAD
kzawora-intel Nov 12, 2024
a838ba7
[Misc]Fix Idefics3Model argument (#10255)
jeejeelee Nov 12, 2024
176fcb1
[Bugfix] Fix QwenModel argument (#10262)
DamonFool Nov 12, 2024
47db6ec
[Frontend] Add per-request number of cached token stats (#10174)
zifeitong Nov 12, 2024
7c65527
[V1] Use pickle for serializing EngineCoreRequest & Add multimodal in…
WoosukKwon Nov 12, 2024
b41fb9d
[Encoder Decoder] Update Mllama to run with both FlashAttention and X…
sroy745 Nov 12, 2024
8a06428
[LoRA] Adds support for bias in LoRA (#5733)
followumesh Nov 12, 2024
1f55e05
[V1] Enable Inductor when using piecewise CUDA graphs (#10268)
WoosukKwon Nov 12, 2024
96ae0ea
[doc] fix location of runllm widget (#10266)
youkaichao Nov 12, 2024
1808145
[doc] improve debugging doc (#10270)
youkaichao Nov 12, 2024
377b74f
Revert "[ci][build] limit cmake version" (#10271)
youkaichao Nov 12, 2024
112fa0b
[V1] Fix CI tests on V1 engine (#10272)
WoosukKwon Nov 13, 2024
0d4ea3f
[core][distributed] use tcp store directly (#10275)
youkaichao Nov 13, 2024
bbd3e86
[V1] Support VLMs with fine-grained scheduling (#9871)
WoosukKwon Nov 13, 2024
56a955e
Bump to compressed-tensors v0.8.0 (#10279)
dsikka Nov 13, 2024
032fcf1
[Doc] Fix typo in arg_utils.py (#10264)
xyang16 Nov 13, 2024
3945c82
[Model] Add support for Qwen2-VL video embeddings input & multiple im…
imkero Nov 13, 2024
1b886aa
[Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLig…
FurtherAI Nov 13, 2024
b6dde33
[Core] Flashinfer - Remove advance step size restriction (#10282)
pavanimajety Nov 13, 2024
d909acf
[Model][LoRA]LoRA support added for idefics3 (#10281)
B-201 Nov 13, 2024
bb7991a
[V1] Add missing tokenizer options for `Detokenizer` (#10288)
ywang96 Nov 13, 2024
0b8bb86
[1/N] Initial prototype for multi-modal processor (#10044)
DarkLight1337 Nov 13, 2024
ac49b59
[Bugfix] bitsandbytes models fail to run pipeline parallel (#10200)
HoangCongDuc Nov 13, 2024
15bb833
[Bugfix] Fix tensor parallel for qwen2 classification model (#10297)
Isotr0py Nov 14, 2024
504ac53
[misc] error early for old-style class (#10304)
youkaichao Nov 14, 2024
e0853b6
[Misc] format.sh: Simplify tool_version_check (#10305)
russellb Nov 14, 2024
f67ce05
[Frontend] Pythonic tool parser (#9859)
mdepinet Nov 14, 2024
52b48c1
[BugFix]: properly deserialize `tool_calls` iterator before processin…
gcalmettes Nov 14, 2024
294bf46
[Model] Add BNB quantization support for Idefics3 (#10310)
B-201 Nov 14, 2024
29f3ef2
[ci][distributed] disable hanging tests (#10317)
youkaichao Nov 14, 2024
03025c0
[CI/Build] Fix CPU CI online inference timeout (#10314)
Isotr0py Nov 14, 2024
675d603
[CI/Build] Make shellcheck happy (#10285)
DarkLight1337 Nov 14, 2024
1dbae03
[Docs] Publish meetup slides (#10331)
WoosukKwon Nov 14, 2024
4a18fd1
Support Roberta embedding models (#9387)
maxdebayser Nov 14, 2024
b2e0ad3
[Perf] Reduce peak memory usage of llama (#10339)
andoorve Nov 15, 2024
554af92
[Bugfix] use AF_INET6 for OpenAI Compatible Server with ipv6 (#9583)
jxpxxzj Nov 15, 2024
11cd1ae
[Tool parsing] Improve / correct mistral tool parsing (#10333)
patrickvonplaten Nov 15, 2024
972112d
[Bugfix] Fix unable to load some models (#10312)
DarkLight1337 Nov 15, 2024
bf2ddc6
[bugfix] Fix static asymmetric quantization case (#10334)
ProExpertProg Nov 15, 2024
2885ba0
[Misc] Change RedundantReshapesPass and FusionPass logging from info …
tlrmchlsmth Nov 15, 2024
b40cf64
[Model] Support Qwen2 embeddings and use tags to select model tests (…
DarkLight1337 Nov 15, 2024
2ec8827
[Bugfix] Qwen-vl output is inconsistent in speculative decoding (#10…
skylee-01 Nov 15, 2024
2ac6d0e
[Misc] Consolidate pooler config overrides (#10351)
DarkLight1337 Nov 15, 2024
02dbf30
[Build] skip renaming files for release wheels pipeline (#9671)
simon-mo Nov 15, 2024
3d158cd
Add default value to avoid Falcon crash (#5363) (#10347)
wchen61 Nov 15, 2024
b311efd
[Misc] Fix import error in tensorizer tests and cleanup some code (#1…
DarkLight1337 Nov 15, 2024
2690855
[Doc] Remove float32 choice from --lora-dtype (#10348)
xyang16 Nov 15, 2024
1d65ec7
[Bugfix] Fix fully sharded LoRA bug (#10352)
jeejeelee Nov 15, 2024
f2056f7
[Misc] Fix some help info of arg_utils to improve readability (#10362)
ShangmingCai Nov 15, 2024
3a763ba
[core][misc] keep compatibility for old-style classes (#10356)
youkaichao Nov 15, 2024
691a3ec
[Bugfix] Ensure special tokens are properly filtered out for guided s…
gcalmettes Nov 15, 2024
79ee45b
[Misc] Bump up test_fused_moe tolerance (#10364)
ElizaWszola Nov 15, 2024
a6221a1
[Misc] bump mistral common version (#10367)
simon-mo Nov 15, 2024
c76ac49
[Docs] Add Nebius as sponsors (#10371)
simon-mo Nov 15, 2024
a067f85
[Frontend] Add --version flag to CLI (#10369)
russellb Nov 15, 2024
3e8d14d
[Doc] Move PR template content to docs (#10159)
russellb Nov 15, 2024
4f168f6
[Docs] Misc updates to TPU installation instructions (#10165)
mikegre-google Nov 15, 2024
32e46e0
[Frontend] Automatic detection of chat content format from AST (#9919)
DarkLight1337 Nov 16, 2024
755b853
[doc] add doc for the plugin system (#10372)
youkaichao Nov 16, 2024
2f427c2
[misc][plugin] improve log messages (#10386)
youkaichao Nov 16, 2024
1d75472
[BugFix] [Kernel] Fix GPU SEGV occuring in fused_moe kernel (#10385)
rasmith Nov 16, 2024
8b6725b
[Misc] Update benchmark to support image_url file or http (#10287)
kakao-steve-ai Nov 16, 2024
b98d89e
[Misc] Medusa supports custom bias (#10361)
skylee-01 Nov 16, 2024
361c29e
[Bugfix] Fix M-RoPE position calculation when chunked prefill is enab…
imkero Nov 16, 2024
661a34f
[V1] Add code owners for V1 (#10397)
WoosukKwon Nov 16, 2024
4fd9375
[2/N][torch.compile] make compilation cfg part of vllm cfg (#10383)
youkaichao Nov 17, 2024
643ecf7
[V1] Refactor model executable interface for all text-only language m…
ywang96 Nov 17, 2024
905d0f0
[CI/Build] Fix IDC hpu [Device not found] issue (#10384)
xuechendi Nov 17, 2024
cf349c4
[Bugfix][CPU] Fix CPU embedding runner with tensor parallel (#10394)
Isotr0py Nov 17, 2024
8d74b5a
[platforms] refactor cpu code (#10402)
youkaichao Nov 17, 2024
76aab90
[Hardware] [HPU]add `mark_step` for hpu (#10239)
jikunshang Nov 17, 2024
80d85c5
[Bugfix] Fix mrope_position_delta in non-last prefill chunk (#10403)
imkero Nov 17, 2024
d1557e6
[Misc] Enhance offline_inference to support user-configurable paramet…
wchen61 Nov 17, 2024
c4e4643
[Misc] Add uninitialized params tracking for `AutoWeightsLoader` (#10…
Isotr0py Nov 18, 2024
47826ca
[Bugfix] Ignore ray reinit error when current platform is ROCm or XPU…
HollowMan6 Nov 18, 2024
51bb12d
[4/N][torch.compile] clean up set_torch_compile_backend (#10401)
youkaichao Nov 18, 2024
c7dec92
[VLM] Report multi_modal_placeholders in output (#10407)
lk-chen Nov 18, 2024
01aae1c
[Model] Remove redundant softmax when using PoolingType.STEP (#10415)
Maybewuss Nov 18, 2024
9ebcb9b
Merge remote-tracking branch 'origin/habana_main' into HEAD
kzawora-intel Nov 18, 2024
295cabe
Merge remote-tracking branch 'upstream/main' into HEAD
kzawora-intel Nov 18, 2024
8155ba7
Merge remote-tracking branch 'origin/habana_main' into HEAD
kzawora-intel Nov 18, 2024
3400180
format.sh
kzawora-intel Nov 18, 2024
6 changes: 3 additions & 3 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -41,6 +41,6 @@ while getopts "m:b:l:f:" OPT; do
  done

  lm_eval --model hf \
- --model_args pretrained=$MODEL,parallelize=True \
- --tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
- --batch_size $BATCH_SIZE
+ --model_args "pretrained=$MODEL,parallelize=True" \
+ --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
+ --batch_size "$BATCH_SIZE"
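Review note: the quoting added in this hunk matters because an unquoted expansion is split on whitespace before the command sees it. A minimal sketch of the effect, assuming a hypothetical `count_args` helper and a made-up `MODEL` value (neither is part of the script):

```shell
# Hypothetical helper: prints how many arguments it received.
count_args() {
    echo "$#"
}

# Illustrative value only; a real MODEL would be a model name or path.
MODEL="my model,with spaces"

# Unquoted: the expansion is split on whitespace before the call.
unquoted_count=$(count_args pretrained=$MODEL,parallelize=True)

# Quoted: the whole string reaches the command as a single argument.
quoted_count=$(count_args "pretrained=$MODEL,parallelize=True")

echo "unquoted: $unquoted_count argument(s)"
echo "quoted:   $quoted_count argument(s)"
```

With this sample value the unquoted form arrives as three arguments and the quoted form as one, which is the difference ShellCheck-style quoting is meant to remove.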
6 changes: 3 additions & 3 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh
@@ -46,6 +46,6 @@ while getopts "m:b:l:f:t:" OPT; do
  done

  lm_eval --model vllm \
- --model_args pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend="ray",trust_remote_code=true,max_model_len=4096 \
- --tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
- --batch_size $BATCH_SIZE
+ --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096" \
+ --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
+ --batch_size "$BATCH_SIZE"
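Review note: dropping the inner quotes around `ray` does not change the argument, because the shell removes quote characters before the command runs. A small sketch, assuming a hypothetical `show_arg` helper and an illustrative `TP_SIZE` value:

```shell
# Hypothetical helper: prints its first argument exactly as received.
show_arg() {
    printf '%s\n' "$1"
}

# Illustrative value only.
TP_SIZE=2

# Old form: inner quotes around ray, outer word unquoted.
old_form=$(show_arg tensor_parallel_size=$TP_SIZE,distributed_executor_backend="ray")

# New form: one pair of quotes around the whole argument.
new_form=$(show_arg "tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray")

echo "old: $old_form"
echo "new: $new_form"
```

Both forms pass the same string here; the new one additionally protects `$TP_SIZE` from word splitting.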
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/run-tests.sh
@@ -30,7 +30,7 @@ while getopts "c:t:" OPT; do
  done

  # Parse list of configs.
- IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < $CONFIG
+ IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < "$CONFIG"

  for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"
  do
63 changes: 25 additions & 38 deletions .buildkite/nightly-benchmarks/scripts/launch-server.sh
@@ -50,58 +50,54 @@ launch_trt_server() {
  git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
  git lfs install
  cd tensorrtllm_backend
- git checkout $trt_llm_version
- tensorrtllm_backend_dir=$(pwd)
+ git checkout "$trt_llm_version"
  git submodule update --init --recursive

  # build trtllm engine
  cd /tensorrtllm_backend
- cd ./tensorrt_llm/examples/${model_type}
+ cd "./tensorrt_llm/examples/${model_type}"
  python3 convert_checkpoint.py \
- --model_dir ${model_path} \
- --dtype ${model_dtype} \
- --tp_size ${model_tp_size} \
- --output_dir ${trt_model_path}
+ --model_dir "${model_path}" \
+ --dtype "${model_dtype}" \
+ --tp_size "${model_tp_size}" \
+ --output_dir "${trt_model_path}"
  trtllm-build \
- --checkpoint_dir ${trt_model_path} \
+ --checkpoint_dir "${trt_model_path}" \
  --use_fused_mlp \
  --reduce_fusion disable \
  --workers 8 \
- --gpt_attention_plugin ${model_dtype} \
- --gemm_plugin ${model_dtype} \
- --tp_size ${model_tp_size} \
- --max_batch_size ${max_batch_size} \
- --max_input_len ${max_input_len} \
- --max_seq_len ${max_seq_len} \
- --max_num_tokens ${max_num_tokens} \
- --output_dir ${trt_engine_path}
+ --gpt_attention_plugin "${model_dtype}" \
+ --gemm_plugin "${model_dtype}" \
+ --tp_size "${model_tp_size}" \
+ --max_batch_size "${max_batch_size}" \
+ --max_input_len "${max_input_len}" \
+ --max_seq_len "${max_seq_len}" \
+ --max_num_tokens "${max_num_tokens}" \
+ --output_dir "${trt_engine_path}"

  # handle triton protobuf files and launch triton server
  cd /tensorrtllm_backend
  mkdir triton_model_repo
  cp -r all_models/inflight_batcher_llm/* triton_model_repo/
  cd triton_model_repo
  rm -rf ./tensorrt_llm/1/*
- cp -r ${trt_engine_path}/* ./tensorrt_llm/1
+ cp -r "${trt_engine_path}"/* ./tensorrt_llm/1
  python3 ../tools/fill_template.py -i tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,engine_dir:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1,decoupled_mode:true,batching_strategy:inflight_fused_batching,batch_scheduler_policy:guaranteed_no_evict,exclude_input_in_output:true,triton_max_batch_size:2048,max_queue_delay_microseconds:0,max_beam_width:1,max_queue_size:2048,enable_kv_cache_reuse:false
- python3 ../tools/fill_template.py -i preprocessing/config.pbtxt triton_max_batch_size:2048,tokenizer_dir:$model_path,preprocessing_instance_count:5
- python3 ../tools/fill_template.py -i postprocessing/config.pbtxt triton_max_batch_size:2048,tokenizer_dir:$model_path,postprocessing_instance_count:5,skip_special_tokens:false
- python3 ../tools/fill_template.py -i ensemble/config.pbtxt triton_max_batch_size:$max_batch_size
- python3 ../tools/fill_template.py -i tensorrt_llm_bls/config.pbtxt triton_max_batch_size:$max_batch_size,decoupled_mode:true,accumulate_tokens:"False",bls_instance_count:1
+ python3 ../tools/fill_template.py -i preprocessing/config.pbtxt "triton_max_batch_size:2048,tokenizer_dir:$model_path,preprocessing_instance_count:5"
+ python3 ../tools/fill_template.py -i postprocessing/config.pbtxt "triton_max_batch_size:2048,tokenizer_dir:$model_path,postprocessing_instance_count:5,skip_special_tokens:false"
+ python3 ../tools/fill_template.py -i ensemble/config.pbtxt triton_max_batch_size:"$max_batch_size"
+ python3 ../tools/fill_template.py -i tensorrt_llm_bls/config.pbtxt "triton_max_batch_size:$max_batch_size,decoupled_mode:true,accumulate_tokens:False,bls_instance_count:1"
  cd /tensorrtllm_backend
  python3 scripts/launch_triton_server.py \
- --world_size=${model_tp_size} \
+ --world_size="${model_tp_size}" \
  --model_repo=/tensorrtllm_backend/triton_model_repo &

  }

  launch_tgi_server() {
  model=$(echo "$common_params" | jq -r '.model')
  tp=$(echo "$common_params" | jq -r '.tp')
- dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
- dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
  port=$(echo "$common_params" | jq -r '.port')
- num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
  server_args=$(json2args "$server_params")

  if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
@@ -129,10 +125,7 @@ launch_tgi_server() {
  launch_lmdeploy_server() {
  model=$(echo "$common_params" | jq -r '.model')
  tp=$(echo "$common_params" | jq -r '.tp')
- dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
- dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
  port=$(echo "$common_params" | jq -r '.port')
- num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
  server_args=$(json2args "$server_params")

  server_command="lmdeploy serve api_server $model \
@@ -149,10 +142,7 @@ launch_sglang_server() {

  model=$(echo "$common_params" | jq -r '.model')
  tp=$(echo "$common_params" | jq -r '.tp')
- dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
- dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
  port=$(echo "$common_params" | jq -r '.port')
- num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
  server_args=$(json2args "$server_params")

  if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
@@ -185,10 +175,7 @@ launch_vllm_server() {

  model=$(echo "$common_params" | jq -r '.model')
  tp=$(echo "$common_params" | jq -r '.tp')
- dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
- dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
  port=$(echo "$common_params" | jq -r '.port')
- num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
  server_args=$(json2args "$server_params")

  if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
@@ -217,19 +204,19 @@ launch_vllm_server() {

  main() {

- if [[ $CURRENT_LLM_SERVING_ENGINE == "trt" ]]; then
+ if [[ "$CURRENT_LLM_SERVING_ENGINE" == "trt" ]]; then
  launch_trt_server
  fi

- if [[ $CURRENT_LLM_SERVING_ENGINE == "tgi" ]]; then
+ if [[ "$CURRENT_LLM_SERVING_ENGINE" == "tgi" ]]; then
  launch_tgi_server
  fi

- if [[ $CURRENT_LLM_SERVING_ENGINE == "lmdeploy" ]]; then
+ if [[ "$CURRENT_LLM_SERVING_ENGINE" == "lmdeploy" ]]; then
  launch_lmdeploy_server
  fi

- if [[ $CURRENT_LLM_SERVING_ENGINE == "sglang" ]]; then
+ if [[ "$CURRENT_LLM_SERVING_ENGINE" == "sglang" ]]; then
  launch_sglang_server
  fi

12 changes: 6 additions & 6 deletions .buildkite/nightly-benchmarks/scripts/nightly-annotate.sh
@@ -16,10 +16,10 @@ main() {
  fi

  # initial annotation
- description="$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/nightly-descriptions.md"
+ #description="$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/nightly-descriptions.md"

  # download results
- cd $VLLM_SOURCE_CODE_LOC/benchmarks
+ cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
  mkdir -p results/
  /workspace/buildkite-agent artifact download 'results/*nightly_results.json' results/
  ls
@@ -30,15 +30,15 @@ main() {
  /workspace/buildkite-agent artifact upload "results.zip"

  # upload benchmarking scripts
- cd $VLLM_SOURCE_CODE_LOC/
+ cd "$VLLM_SOURCE_CODE_LOC/"
  zip -r nightly-benchmarks.zip .buildkite/ benchmarks/
  /workspace/buildkite-agent artifact upload "nightly-benchmarks.zip"

- cd $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/
+ cd "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/"
  # upload benchmarking pipeline
  /workspace/buildkite-agent artifact upload "nightly-pipeline.yaml"

- cd $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/
+ cd "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/"
  /workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly-annotation.md


@@ -75,4 +75,4 @@ main() {
  # /workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly_results.md
  }

- main "$@"
+ main "$@"
30 changes: 14 additions & 16 deletions .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
@@ -12,7 +12,7 @@ check_gpus() {
  echo "Need at least 1 GPU to run benchmarking."
  exit 1
  fi
- declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}')
+ declare -g gpu_type="$(nvidia-smi --query-gpu=name --format=csv,noheader | awk '{print $2}')"
  echo "GPU type is $gpu_type"
  }
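Review note: removing the inner `echo $(...)` is not purely cosmetic when several GPUs are present. The old form joined every line of output into one before `awk` ran, while the new form lets `awk` print one field per line. The sketch below uses `fake_gpu_names`, a made-up stand-in for the `nvidia-smi` query, with illustrative device names; the `head -n 1` variant is only one possible way to keep a single value and is not taken from this PR.

```shell
# Stand-in for `nvidia-smi --query-gpu=name --format=csv,noheader` on a
# hypothetical two-GPU machine; the device names are illustrative only.
fake_gpu_names() {
    printf '%s\n' "NVIDIA A100-SXM4-80GB" "NVIDIA A100-SXM4-80GB"
}

# Old form: the unquoted $(...) is word-split, so echo joins both lines
# into one and awk prints a single second field.
old_style=$(echo $(fake_gpu_names) | awk '{print $2}')

# New form: awk sees each line separately and prints one field per GPU.
new_style=$(fake_gpu_names | awk '{print $2}')

# One possible way to keep a single value: look at the first line only.
first_only=$(fake_gpu_names | head -n 1 | awk '{print $2}')

echo "old form:   $old_style"
echo "new form:   $new_style"
echo "first only: $first_only"
```

On a single-GPU machine all three forms give the same result; the difference only shows up when the query returns more than one line.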

@@ -102,7 +102,7 @@ kill_gpu_processes() {
  pkill -f text-generation
  pkill -f lmdeploy

- while [ $(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1) -ge 1000 ]; do
+ while [ "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)" -ge 1000 ]; do
  sleep 1
  done
  }
@@ -119,8 +119,8 @@ wait_for_server() {
  ensure_installed() {
  # Ensure that the given command is installed by apt-get
  local cmd=$1
- if ! which $cmd >/dev/null; then
- apt-get update && apt-get install -y $cmd
+ if ! which "$cmd" >/dev/null; then
+ apt-get update && apt-get install -y "$cmd"
  fi
  }

@@ -173,13 +173,11 @@ run_serving_tests() {
  echo "Reuse previous server for test case $test_name"
  else
  kill_gpu_processes
- bash $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/launch-server.sh \
+ bash "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/launch-server.sh" \
  "$server_params" "$common_params"
  fi

- wait_for_server
-
- if [ $? -eq 0 ]; then
+ if wait_for_server; then
  echo ""
  echo "$CURRENT_LLM_SERVING_ENGINE server is up and running."
  else
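Review note: testing the command directly in `if` avoids a fragile dependency on `$?`, which is overwritten by any command that runs in between. A sketch under stated assumptions: `wait_for_server` here is a hypothetical stub driven by a `SERVER_READY` variable, not the function from the script, and the intervening `echo` is added only to show the failure mode.

```shell
# Hypothetical stub: succeeds only when SERVER_READY is "yes".
wait_for_server() {
    [ "$SERVER_READY" = "yes" ]
}

# Fragile pattern: a command between the call and the check resets $?.
check_with_status_var() {
    wait_for_server
    echo "finished waiting" >/dev/null
    if [ $? -eq 0 ]; then
        echo "up"
    else
        echo "down"
    fi
}

# Pattern adopted in the diff: the exit status is consumed immediately.
check_directly() {
    if wait_for_server; then
        echo "up"
    else
        echo "down"
    fi
}

SERVER_READY="no"
fragile_result=$(check_with_status_var)
direct_result=$(check_directly)
echo "status-variable check: $fragile_result"
echo "direct check:          $direct_result"
```

With the stub reporting failure, the status-variable version still reports "up" because it inspects the exit status of `echo`, while the direct version reports "down".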
@@ -190,13 +188,13 @@ run_serving_tests() {

  # prepare tokenizer
  # this is required for lmdeploy.
- cd $VLLM_SOURCE_CODE_LOC/benchmarks
+ cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
  rm -rf /tokenizer_cache
  mkdir /tokenizer_cache
  python3 ../.buildkite/nightly-benchmarks/scripts/download-tokenizer.py \
  --model "$model" \
  --cachedir /tokenizer_cache
- cd $VLLM_SOURCE_CODE_LOC/benchmarks
+ cd "$VLLM_SOURCE_CODE_LOC/benchmarks"


  # change model name for lmdeploy (it will not follow standard hf name)
@@ -307,11 +305,11 @@ run_serving_tests() {
  prepare_dataset() {

  # download sharegpt dataset
- cd $VLLM_SOURCE_CODE_LOC/benchmarks
+ cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
  wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

  # duplicate sonnet by 4x, to allow benchmarking with input length 2048
- cd $VLLM_SOURCE_CODE_LOC/benchmarks
+ cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
  echo "" > sonnet_4x.txt
  for _ in {1..4}
  do
@@ -339,17 +337,17 @@ main() {

  prepare_dataset

- cd $VLLM_SOURCE_CODE_LOC/benchmarks
+ cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
  declare -g RESULTS_FOLDER=results/
  mkdir -p $RESULTS_FOLDER
- BENCHMARK_ROOT=$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/
+ BENCHMARK_ROOT="$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/"

  # run the test
- run_serving_tests $BENCHMARK_ROOT/tests/nightly-tests.json
+ run_serving_tests "$BENCHMARK_ROOT/tests/nightly-tests.json"

  # upload benchmark results to buildkite
  python3 -m pip install tabulate pandas
- python3 $BENCHMARK_ROOT/scripts/summary-nightly-results.py
+ python3 "$BENCHMARK_ROOT/scripts/summary-nightly-results.py"
  upload_to_buildkite

  }
Original file line number Diff line number Diff line change
@@ -17,7 +17,7 @@ check_gpus() {
  echo "Need at least 1 GPU to run benchmarking."
  exit 1
  fi
- declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}')
+ declare -g gpu_type=$(nvidia-smi --query-gpu=name --format=csv,noheader | awk '{print $2}')
  echo "GPU type is $gpu_type"
  }

@@ -93,7 +93,7 @@ kill_gpu_processes() {


  # wait until GPU memory usage smaller than 1GB
- while [ $(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1) -ge 1000 ]; do
+ while [ "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)" -ge 1000 ]; do
  sleep 1
  done

@@ -117,7 +117,7 @@ upload_to_buildkite() {
  fi

  # Use the determined command to annotate and upload artifacts
- $BUILDKITE_AGENT_COMMAND annotate --style "info" --context "$BUILDKITE_LABEL-benchmark-results" <$RESULTS_FOLDER/benchmark_results.md
+ $BUILDKITE_AGENT_COMMAND annotate --style "info" --context "$BUILDKITE_LABEL-benchmark-results" < "$RESULTS_FOLDER/benchmark_results.md"
  $BUILDKITE_AGENT_COMMAND artifact upload "$RESULTS_FOLDER/*"
  }

@@ -150,7 +150,7 @@ run_latency_tests() {
  # check if there is enough GPU to run the test
  tp=$(echo "$latency_params" | jq -r '.tensor_parallel_size')
  if [[ $gpu_count -lt $tp ]]; then
- echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $testname."
+ echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $test_name."
  continue
  fi
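Review note: the `$testname` to `$test_name` fix corrects a variable-name typo that the shell accepts silently, since an unset variable expands to an empty string by default. One possible way to surface such typos earlier is `set -u` (nounset); whether these scripts enable it is not visible in this diff, so the sketch below is an assumption-laden illustration with a hypothetical `report_skip` function and a made-up test name.

```shell
# Make sure nounset is off and the misspelled name is really unset.
set +u
unset testname

# Hypothetical function reproducing the typo fixed in this hunk.
report_skip() {
    test_name="latency_example_tp1"   # illustrative test name
    # Typo: $testname was never assigned, so it expands to nothing.
    echo "Skip testcase $testname."
}

# Default behaviour: the message silently loses the test name.
typo_output=$(report_skip)

# With nounset enabled in a subshell, the same typo becomes a hard error.
if (set -u; report_skip) >/dev/null 2>&1; then
    nounset_result="no error"
else
    nounset_result="unbound variable detected"
fi

echo "default output: $typo_output"
echo "with set -u:    $nounset_result"
```

Static analysis such as ShellCheck can flag the same class of mistake without running the script.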

@@ -206,9 +206,9 @@ run_throughput_tests() {
  throughput_args=$(json2args "$throughput_params")

  # check if there is enough GPU to run the test
- tp=$(echo $throughput_params | jq -r '.tensor_parallel_size')
+ tp=$(echo "$throughput_params" | jq -r '.tensor_parallel_size')
  if [[ $gpu_count -lt $tp ]]; then
- echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $testname."
+ echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $test_name."
  continue
  fi

@@ -270,15 +270,15 @@ run_serving_tests() {
  # check if there is enough GPU to run the test
  tp=$(echo "$server_params" | jq -r '.tensor_parallel_size')
  if [[ $gpu_count -lt $tp ]]; then
- echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $testname."
+ echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $test_name."
  continue
  fi

  # check if server model and client model is aligned
  server_model=$(echo "$server_params" | jq -r '.model')
  client_model=$(echo "$client_params" | jq -r '.model')
  if [[ $server_model != "$client_model" ]]; then
- echo "Server model and client model must be the same. Skip testcase $testname."
+ echo "Server model and client model must be the same. Skip testcase $test_name."
  continue
  fi

@@ -293,8 +293,7 @@ run_serving_tests() {
  server_pid=$!

  # wait until the server is alive
- wait_for_server
- if [ $? -eq 0 ]; then
+ if wait_for_server; then
  echo ""
  echo "vllm server is up and running."
  else
4 changes: 2 additions & 2 deletions .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
@@ -6,7 +6,7 @@ TIMEOUT_SECONDS=10

  retries=0
  while [ $retries -lt 1000 ]; do
- if [ $(curl -s --max-time $TIMEOUT_SECONDS -L -H "Authorization: Bearer $TOKEN" -o /dev/null -w "%{http_code}" $URL) -eq 200 ]; then
+ if [ "$(curl -s --max-time "$TIMEOUT_SECONDS" -L -H "Authorization: Bearer $TOKEN" -o /dev/null -w "%{http_code}" "$URL")" -eq 200 ]; then
  exit 0
  fi

@@ -16,4 +16,4 @@ while [ $retries -lt 1000 ]; do
  sleep 5
  done

- exit 1
+ exit 1