Adding basic kv-cache transfer to vllm v1 #1

Open

wants to merge 3,384 commits into base: main

Commits (3,384)
bb87acb
[Frontend] Use a proper chat template for VLM2Vec (#9912)
DarkLight1337 Nov 1, 2024
b37933e
[Bugfix] Fix edge cases for MistralTokenizer (#9625)
tjohnson31415 Nov 1, 2024
6b3e1c2
[Core] Refactor: Clean up unused argument in Scheduler._preempt (#9696)
andrejonasson Nov 1, 2024
e1c27fc
[torch.compile] use interpreter with stable api from pytorch (#9889)
youkaichao Nov 1, 2024
04506c3
[Bugfix/Core] Flashinfer k_scale and v_scale (#9861)
pavanimajety Nov 1, 2024
2d75d7c
[1/N] pass the complete config from engine to executor (#9933)
youkaichao Nov 1, 2024
7c2fc9c
[Bugfix] PicklingError on RayTaskError (#9934)
GeneDer Nov 1, 2024
ac2ddd2
[ci/build] Bump the patch-update group with 10 updates (#9897)
dependabot[bot] Nov 1, 2024
569fcc0
[Core][VLM] Add precise multi-modal placeholder tracking (#8346)
petersalas Nov 1, 2024
d4775ce
[ci/build] Have dependabot ignore pinned dependencies (#9935)
khluu Nov 1, 2024
9550387
[Encoder Decoder] Add flash_attn kernel support for encoder-decoder m…
sroy745 Nov 2, 2024
c812fe5
[torch.compile] fix cpu broken code (#9947)
youkaichao Nov 2, 2024
68fc181
[Docs] Update Granite 3.0 models in supported models table (#9930)
njhill Nov 2, 2024
171ccd6
[Doc] Updated tpu-installation.rst with more details (#9926)
mikegre-google Nov 2, 2024
22c99c0
[2/N] executor pass the complete config to worker/modelrunner (#9938)
youkaichao Nov 2, 2024
6d3ca46
[V1] Fix `EngineArgs` refactor on V1 (#9954)
robertgshaw2-neuralmagic Nov 2, 2024
21e61ba
[bugfix] fix chatglm dummy_data_for_glmv (#9955)
youkaichao Nov 2, 2024
f09a8e0
[3/N] model runner pass the whole config to model (#9958)
youkaichao Nov 2, 2024
d19ef1b
[CI/Build] Quoting around > (#9956)
nokados Nov 2, 2024
f386574
[torch.compile] Adding torch compile to vision-language models (#9946)
CRZbulabula Nov 2, 2024
685bcc3
[bugfix] fix tsts (#9959)
youkaichao Nov 2, 2024
9455b48
[V1] Support per-request seed (#9945)
njhill Nov 3, 2024
a45ebaf
[Model] Add support for H2OVL-Mississippi models (#9747)
cooleel Nov 4, 2024
89a2c17
[V1] Fix Configs (#9971)
robertgshaw2-neuralmagic Nov 4, 2024
d2310a1
[Bugfix] Fix MiniCPMV and Mllama BNB bug (#9917)
jeejeelee Nov 4, 2024
0f1221b
[Bugfix]Using the correct type hints (#9885)
gshtras Nov 4, 2024
2f70c75
[Misc] Compute query_start_loc/seq_start_loc on CPU (#9447)
zhengy001 Nov 4, 2024
a95a2ff
[Bugfix] Fix E2EL mean and median stats (#9984)
daitran2k1 Nov 4, 2024
fe486ec
[Bugfix][OpenVINO] Fix circular reference #9939 (#9974)
MengqingCao Nov 4, 2024
a435ea4
[Frontend] Multi-Modality Support for Loading Local Image Files (#9915)
chaunceyjiang Nov 4, 2024
928e1f8
[4/N] make quant config first-class citizen (#9978)
youkaichao Nov 4, 2024
9fa00ee
[Misc]Reduce BNB static variable (#9987)
jeejeelee Nov 4, 2024
022a7ef
[Model] factoring out MambaMixer out of Jamba (#8993)
mzusman Nov 4, 2024
2a650a3
[CI] Basic Integration Test For TPU (#9968)
robertgshaw2-neuralmagic Nov 4, 2024
2730ca6
[Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests ru…
hissu-hyvarinen Nov 4, 2024
2f516f0
[Doc] Update VLM doc about loading from local files (#9999)
ywang96 Nov 4, 2024
be0ebf7
[Bugfix] Fix `MQLLMEngine` hanging (#9973)
robertgshaw2-neuralmagic Nov 4, 2024
afc6238
[Misc] Refactor benchmark_throughput.py (#9779)
lk-chen Nov 4, 2024
0e80d15
[Frontend] Add max_tokens prometheus metric (#9881)
tomeras91 Nov 4, 2024
be187f3
[Bugfix] Upgrade to pytorch 2.5.1 (#10001)
bnellnm Nov 4, 2024
4b203ad
[4.5/N] bugfix for quant config in speculative decode (#10007)
youkaichao Nov 4, 2024
7c0dce4
[Bugfix] Respect modules_to_not_convert within awq_marlin (#9895)
mgoin Nov 4, 2024
b4262a3
[Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep (#9994)
tlrmchlsmth Nov 5, 2024
eb2eede
[Core] Make encoder-decoder inputs a nested structure to be more comp…
DarkLight1337 Nov 5, 2024
a5f8263
[Bugfix] Fixup Mamba (#10004)
tlrmchlsmth Nov 5, 2024
a58b63e
[BugFix] Lazy import ray (#10021)
GeneDer Nov 5, 2024
1505111
[Misc] vllm CLI flags should be ordered for better user readability (…
chaunceyjiang Nov 5, 2024
875993d
[Frontend] Fix tcp port reservation for api server (#10012)
russellb Nov 5, 2024
0fe7a6a
Refactor TPU requirements file and pin build dependencies (#10010)
richardsliu Nov 5, 2024
335d966
[Misc] Add logging for CUDA memory (#10027)
yangalan123 Nov 5, 2024
d9f8572
[CI/Build] Limit github CI jobs based on files changed (#9928)
russellb Nov 5, 2024
3dd9fad
[Model] Support quantization of PixtralHFTransformer for PixtralHF (#…
mgoin Nov 5, 2024
f3e5037
[Feature] Update benchmark_throughput.py to support image input (#9851)
lk-chen Nov 5, 2024
e4cafe7
[Misc] Modify BNB parameter name (#9997)
jeejeelee Nov 5, 2024
164ebf1
[CI] Prune tests/models/decoder_only/language/* tests (#9940)
mgoin Nov 5, 2024
7f20782
[CI] Prune back the number of tests in tests/kernels/* (#9932)
mgoin Nov 5, 2024
1e78983
[bugfix] fix weak ref in piecewise cudagraph and tractable test (#10048)
youkaichao Nov 5, 2024
eb33f4b
[Bugfix] Properly propagate trust_remote_code settings (#10047)
zifeitong Nov 6, 2024
235503b
[Bugfix] Fix pickle of input when async output processing is on (#9931)
wallashss Nov 6, 2024
032d62d
[Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode (…
llsj14 Nov 6, 2024
b33d834
[v1] reduce graph capture time for piecewise cudagraph (#10059)
youkaichao Nov 6, 2024
c1a00d6
[Misc] Sort the list of embedding models (#10037)
DarkLight1337 Nov 6, 2024
0dbd820
[Model][OpenVINO] Fix regressions from #8346 (#10045)
petersalas Nov 6, 2024
c1813a8
[Bugfix] Fix edge-case crash when using chat with the Mistral Tekken …
tjohnson31415 Nov 6, 2024
5895fdb
[Bugfix] Gpt-j-6B patch kv_scale to k_scale path (#10063)
arakowsk-amd Nov 6, 2024
0950ed6
[Bugfix] Remove CustomChatCompletionContentPartParam multimodal input…
zifeitong Nov 6, 2024
8f8deb4
[V1] Integrate Piecewise CUDA graphs (#10058)
WoosukKwon Nov 6, 2024
623b808
[distributed] add function to create ipc buffers directly (#10064)
youkaichao Nov 6, 2024
d910cb4
[CI/Build] drop support for Python 3.8 EOL (#8464)
aarnphm Nov 6, 2024
9799c39
[CI/Build] Fix large_gpu_mark reason (#10070)
Isotr0py Nov 6, 2024
a8baff6
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend (#6143)
kzawora-intel Nov 6, 2024
0e870e9
[Hotfix] Fix ruff errors (#10073)
WoosukKwon Nov 6, 2024
911b6ad
[Model][LoRA]LoRA support added for LlamaEmbeddingModel (#10071)
jeejeelee Nov 6, 2024
4a1e6eb
[Model] Add Idefics3 support (#9767)
jeejeelee Nov 6, 2024
13630ad
[Model][LoRA]LoRA support added for Qwen2VLForConditionalGeneration (…
ericperfect Nov 6, 2024
ba73345
Remove ScaledActivation for AWQ (#10057)
mgoin Nov 6, 2024
9544bf6
[CI/Build] Drop Python 3.8 support (#10038)
russellb Nov 6, 2024
03cab74
[CI/Build] change conflict PR comment from mergify (#10080)
russellb Nov 6, 2024
a946941
[V1] Make v1 more testable (#9888)
joerunde Nov 6, 2024
cf8c108
[CI/Build] Always run the ruff workflow (#10092)
russellb Nov 6, 2024
81ca531
[core][distributed] add stateless_init_process_group (#10072)
youkaichao Nov 7, 2024
c6e222d
[Bugfix] Fix FP8 torch._scaled_mm fallback for torch>2.5 with CUDA<12…
mgoin Nov 7, 2024
b8b6e25
[Misc][XPU] Upgrade to Pytorch 2.5 for xpu backend (#9823)
yma11 Nov 7, 2024
f706cd4
[Frontend] Adjust try/except blocks in API impl (#10056)
njhill Nov 7, 2024
edd385c
[Hardware][CPU] Update torch 2.5 (#9911)
bigPYJ1151 Nov 7, 2024
71b51ec
[doc] add back Python 3.8 ABI (#10100)
youkaichao Nov 7, 2024
079e436
[V1][BugFix] Fix Generator construction in greedy + seed case (#10097)
njhill Nov 7, 2024
f6d5d4c
[Misc] Consolidate ModelConfig code related to HF config (#10104)
DarkLight1337 Nov 7, 2024
13c2162
[CI/Build] re-add codespell to CI (#10083)
russellb Nov 7, 2024
f1d5aa4
Doc: Improve benchmark documentation (#9927)
rafvasq Nov 7, 2024
30d3f31
[Core][Distributed] Refactor ipc buffer init in CustomAllreduce (#10030)
hanzhi713 Nov 7, 2024
f72df98
[CI/Build] Improve mypy + python version matrix (#10041)
russellb Nov 7, 2024
216cc83
Adds method to read the pooling types from model's files (#9506)
flaviabeo Nov 7, 2024
d327465
[Frontend] Fix multiple values for keyword argument error (#10075) (#…
DIYer22 Nov 7, 2024
d407193
[Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target (#…
bigPYJ1151 Nov 7, 2024
768b283
[Bugfix] Make image processor respect `mm_processor_kwargs` for Qwen2…
li-plus Nov 7, 2024
55bf35c
[Misc] Add Gamma-Distribution Request Generation Support for Serving …
spliii Nov 7, 2024
69c92a4
[Frontend] Tool calling parser for Granite 3.0 models (#9027)
maxdebayser Nov 7, 2024
eb0f67e
[Feature] [Spec decode]: Combine chunked prefill with speculative dec…
NickLucche Nov 7, 2024
4de2161
[CI/Build] Always run mypy (#10122)
russellb Nov 7, 2024
a1afb5a
[CI/Build] Add shell script linting using shellcheck (#7925)
russellb Nov 7, 2024
c2b5e84
[CI/Build] Automate PR body text cleanup (#10082)
russellb Nov 7, 2024
2b5b225
Bump actions/setup-python from 5.2.0 to 5.3.0 (#9745)
dependabot[bot] Nov 7, 2024
b2352bd
Online video support for VLMs (#10020)
litianjian Nov 7, 2024
0180e7b
Bump actions/checkout from 4.2.1 to 4.2.2 (#9746)
dependabot[bot] Nov 7, 2024
b2678bc
[Misc] report relevant env vars in collect_env.py tool (#9293)
ycool Nov 8, 2024
4d5bbf7
[V1] Add all_token_ids attribute to Request (#10135)
WoosukKwon Nov 8, 2024
2a99570
[V1] Prefix caching (take 2) (#9972)
comaniac Nov 8, 2024
ac0f819
[CI/Build] Give PR cleanup job PR write access (#10139)
russellb Nov 8, 2024
838e463
[Doc] Update FAQ links in spec_decode.rst (#9662)
whyiug Nov 8, 2024
e339030
[Bugfix] Add error handling when server cannot respond any valid toke…
DearPlanet Nov 8, 2024
331ae66
[Misc] Fix ImportError causing by triton (#9493)
MengqingCao Nov 8, 2024
4685207
[Doc] Move CONTRIBUTING to docs site (#9924)
russellb Nov 8, 2024
ce36a71
Fixes a typo about 'max_decode_seq_len' which causes crashes with cud…
sighingnow Nov 8, 2024
ba039fe
Add hf_transfer to testing image (#10096)
mgoin Nov 8, 2024
ae09bc7
[Misc] Fix typo in #5895 (#10145)
DarkLight1337 Nov 8, 2024
d10efba
[Bugfix][XPU] Fix xpu tp by introducing XpuCommunicator (#10144)
yma11 Nov 8, 2024
8c656f1
[Model] Expose size to Idefics3 as mm_processor_kwargs (#10146)
Isotr0py Nov 8, 2024
c990f27
[V1]Enable APC by default only for text models (#10148)
ywang96 Nov 8, 2024
2cc2ac8
[CI/Build] Update CPU tests to include all "standard" tests (#5481)
DarkLight1337 Nov 8, 2024
f61c86e
Fix edge case Mistral tokenizer (#10152)
patrickvonplaten Nov 8, 2024
11cd990
Disable spec-decode + chunked-prefill for draft models with tensor pa…
sroy745 Nov 8, 2024
e29989d
[Misc] Improve Web UI (#10090)
rafvasq Nov 8, 2024
21d7d43
[V1] Fix non-cudagraph op name (#10166)
WoosukKwon Nov 8, 2024
2708e46
[CI/Build] Ignore .gitignored files for shellcheck (#10162)
ProExpertProg Nov 8, 2024
cde4d88
Rename vllm.logging to vllm.logging_utils (#10134)
flozi00 Nov 8, 2024
36fa841
[torch.compile] Fuse RMSNorm with quant (#9138)
ProExpertProg Nov 8, 2024
e5f31f8
[Bugfix] SymIntArrayRef expected to contain concrete integers (#10170)
bnellnm Nov 8, 2024
fe09246
[Kernel][Triton] Add Triton implementation for scaled_mm_triton to su…
rasmith Nov 9, 2024
10bf3f0
[CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking …
bigPYJ1151 Nov 9, 2024
6c05c67
[0/N] Rename `MultiModalInputs` to `MultiModalKwargs` (#10040)
DarkLight1337 Nov 9, 2024
65fb386
[Bugfix] Ignore GPTQ quantization of Qwen2-VL visual module (#10169)
mgoin Nov 9, 2024
1ace54e
[CI/Build] Fix VLM broadcast tests `tensor_parallel_size` passing (#1…
Isotr0py Nov 9, 2024
17437d0
[Doc] Adjust RunLLM location (#10176)
DarkLight1337 Nov 9, 2024
32f0c6c
[5/N] pass the whole config to model (#9983)
youkaichao Nov 9, 2024
1042e29
[CI/Build] Add run-hpu-test.sh script (#10167)
xuechendi Nov 9, 2024
28effad
[Bugfix] Enable some fp8 and quantized fullgraph tests (#10171)
bnellnm Nov 9, 2024
3fe9637
[bugfix] fix broken tests of mlp speculator (#10177)
youkaichao Nov 9, 2024
1b10a98
[doc] explaining the integration with huggingface (#10173)
youkaichao Nov 9, 2024
e12b591
bugfix: fix the bug that stream generate not work (#2756)
caijizhuo Nov 9, 2024
611bd3b
[Frontend] add `add_request_id` middleware (#9594)
cjackal Nov 9, 2024
bf011ad
[Frontend][Core] Override HF `config.json` via CLI (#5836)
KrishnaM251 Nov 9, 2024
6b5651d
[CI/Build] Split up models tests (#10069)
DarkLight1337 Nov 9, 2024
be9c852
[ci][build] limit cmake version (#10188)
youkaichao Nov 10, 2024
723f08d
[Doc] Fix typo error in CONTRIBUTING.md (#10190)
FuryMartin Nov 10, 2024
3a5c530
[doc] Polish the integration with huggingface doc (#10195)
CRZbulabula Nov 10, 2024
54876ea
[Misc] small fixes to function tracing file path (#9543)
ShawnD200 Nov 10, 2024
84b0ded
[misc] improve cloudpickle registration and tests (#10202)
youkaichao Nov 11, 2024
75ef93b
[Doc] Fix typo error in vllm/entrypoints/openai/cli_args.py (#10196)
yansh97 Nov 11, 2024
c700999
[doc] improve debugging code (#10206)
youkaichao Nov 11, 2024
29ef173
[6/N] pass whole config to inner model (#10205)
youkaichao Nov 11, 2024
820bf72
Bump the patch-update group with 5 updates (#10210)
dependabot[bot] Nov 11, 2024
b02f529
[Hardware][CPU] Add embedding models support for CPU backend (#10193)
Isotr0py Nov 11, 2024
73e8dcb
[LoRA][Kernel] Remove the unused libentry module (#10214)
jeejeelee Nov 11, 2024
efc1948
[V1] Allow `tokenizer_mode` and `trust_remote_code` for Detokenizer (…
ywang96 Nov 11, 2024
f8b502f
[Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner (#10218)
Isotr0py Nov 11, 2024
1cc0ac7
[Metrics] add more metrics (#4464)
HarryWu99 Nov 11, 2024
05f9584
[Doc] fix doc string typo in block_manager `swap_out` function (#10212)
yyccli Nov 11, 2024
49e2ff8
[core][distributed] add stateless process group (#10216)
youkaichao Nov 11, 2024
2421f39
Bump actions/setup-python from 5.2.0 to 5.3.0 (#10209)
dependabot[bot] Nov 11, 2024
250b6a1
[V1] Fix detokenizer ports (#10224)
WoosukKwon Nov 11, 2024
f805a6c
[V1] Do not use inductor for piecewise CUDA graphs (#10225)
WoosukKwon Nov 11, 2024
f5c0f30
[v1][torch.compile] support managing cudagraph buffer (#10203)
youkaichao Nov 11, 2024
791df99
[V1] Use custom ops for piecewise CUDA graphs (#10227)
WoosukKwon Nov 11, 2024
290b6fe
Add docs on serving with Llama Stack (#10183)
terrytangyuan Nov 11, 2024
4c81612
[misc][distributed] auto port selection and disable tests (#10226)
youkaichao Nov 11, 2024
eadece4
[V1] Enable custom ops with piecewise CUDA graphs (#10228)
WoosukKwon Nov 11, 2024
cbe9807
Make shutil rename in python_only_dev (#10233)
shcheglovnd Nov 11, 2024
0b65c33
[V1] `AsyncLLM` Implementation (#9826)
robertgshaw2-neuralmagic Nov 11, 2024
509f846
[doc] update debugging guide (#10236)
youkaichao Nov 11, 2024
3bae750
[Doc] Update help text for `--distributed-executor-backend` (#10231)
russellb Nov 12, 2024
a2f06b3
[1/N] torch.compile user interface design (#10237)
youkaichao Nov 12, 2024
a62fb02
[Misc][LoRA] Replace hardcoded cuda device with configurable argument…
jeejeelee Nov 12, 2024
077c1cd
Splitting attention kernel file (#10091)
maleksan85 Nov 12, 2024
d2a3352
[doc] explain the class hierarchy in vLLM (#10240)
youkaichao Nov 12, 2024
6d0517d
[CI][CPU]refactor CPU tests to allow to bind with different cores (#1…
zhouyuan Nov 12, 2024
3efae76
[BugFix] Do not raise a `ValueError` when `tool_choice` is set to the…
gcalmettes Nov 12, 2024
7f2f001
[Misc]Fix Idefics3Model argument (#10255)
jeejeelee Nov 12, 2024
a7b00f4
[Bugfix] Fix QwenModel argument (#10262)
DamonFool Nov 12, 2024
e0ad748
[Frontend] Add per-request number of cached token stats (#10174)
zifeitong Nov 12, 2024
aade215
[V1] Use pickle for serializing EngineCoreRequest & Add multimodal in…
WoosukKwon Nov 12, 2024
05b47fe
[Encoder Decoder] Update Mllama to run with both FlashAttention and X…
sroy745 Nov 12, 2024
e989df4
[LoRA] Adds support for bias in LoRA (#5733)
followumesh Nov 12, 2024
2a763eb
[V1] Enable Inductor when using piecewise CUDA graphs (#10268)
WoosukKwon Nov 12, 2024
6dc5c20
[doc] fix location of runllm widget (#10266)
youkaichao Nov 12, 2024
cfe93cc
[doc] improve debugging doc (#10270)
youkaichao Nov 12, 2024
3b33736
Revert "[ci][build] limit cmake version" (#10271)
youkaichao Nov 12, 2024
bc36761
[V1] Fix CI tests on V1 engine (#10272)
WoosukKwon Nov 13, 2024
13b85d4
[core][distributed] use tcp store directly (#10275)
youkaichao Nov 13, 2024
275e40a
[V1] Support VLMs with fine-grained scheduling (#9871)
WoosukKwon Nov 13, 2024
2b952e4
Bump to compressed-tensors v0.8.0 (#10279)
dsikka Nov 13, 2024
787e49c
[Doc] Fix typo in arg_utils.py (#10264)
xyang16 Nov 13, 2024
a0766d2
[Model] Add support for Qwen2-VL video embeddings input & multiple im…
imkero Nov 13, 2024
e53d9d9
[Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLig…
FurtherAI Nov 13, 2024
6282943
[Core] Flashinfer - Remove advance step size restriction (#10282)
pavanimajety Nov 13, 2024
5fe4de3
[Model][LoRA]LoRA support added for idefics3 (#10281)
B-201 Nov 13, 2024
87b0ae5
[V1] Add missing tokenizer options for `Detokenizer` (#10288)
ywang96 Nov 13, 2024
0b80787
[1/N] Initial prototype for multi-modal processor (#10044)
DarkLight1337 Nov 13, 2024
4a86fd8
[Bugfix] bitsandbytes models fail to run pipeline parallel (#10200)
HoangCongDuc Nov 13, 2024
a1db9e8
[Bugfix] Fix tensor parallel for qwen2 classification model (#10297)
Isotr0py Nov 14, 2024
141a018
[misc] error early for old-style class (#10304)
youkaichao Nov 14, 2024
029e233
[Misc] format.sh: Simplify tool_version_check (#10305)
russellb Nov 14, 2024
79ebfb5
[Frontend] Pythonic tool parser (#9859)
mdepinet Nov 14, 2024
4d532a1
[BugFix]: properly deserialize `tool_calls` iterator before processin…
gcalmettes Nov 14, 2024
4b9feed
[Model] Add BNB quantization support for Idefics3 (#10310)
B-201 Nov 14, 2024
e68d8fd
[ci][distributed] disable hanging tests (#10317)
youkaichao Nov 14, 2024
b5f144d
[CI/Build] Fix CPU CI online inference timeout (#10314)
Isotr0py Nov 14, 2024
f00713e
[CI/Build] Make shellcheck happy (#10285)
DarkLight1337 Nov 14, 2024
8415dc4
[Docs] Publish meetup slides (#10331)
WoosukKwon Nov 14, 2024
267c88a
Support Roberta embedding models (#9387)
maxdebayser Nov 14, 2024
b158536
[Perf] Reduce peak memory usage of llama (#10339)
andoorve Nov 15, 2024
c41a4c5
[Bugfix] use AF_INET6 for OpenAI Compatible Server with ipv6 (#9583)
jxpxxzj Nov 15, 2024
b2b4cca
[Tool parsing] Improve / correct mistral tool parsing (#10333)
patrickvonplaten Nov 15, 2024
ed457e5
[Bugfix] Fix unable to load some models (#10312)
DarkLight1337 Nov 15, 2024
d901d11
[bugfix] Fix static asymmetric quantization case (#10334)
ProExpertProg Nov 15, 2024
b36aa70
[Misc] Change RedundantReshapesPass and FusionPass logging from info …
tlrmchlsmth Nov 15, 2024
bf32934
[Model] Support Qwen2 embeddings and use tags to select model tests (…
DarkLight1337 Nov 15, 2024
82474f8
[Bugfix] Qwen-vl output is inconsistent in speculative decoding (#10…
skylee-01 Nov 15, 2024
22cc268
[Misc] Consolidate pooler config overrides (#10351)
DarkLight1337 Nov 15, 2024
9a3ec17
[Build] skip renaming files for release wheels pipeline (#9671)
simon-mo Nov 15, 2024
416ef5d
Add default value to avoid Falcon crash (#5363) (#10347)
wchen61 Nov 15, 2024
dc25db9
[Misc] Fix import error in tensorizer tests and cleanup some code (#1…
DarkLight1337 Nov 15, 2024
9902c04
[Doc] Remove float32 choice from --lora-dtype (#10348)
xyang16 Nov 15, 2024
5a62465
[Bugfix] Fix fully sharded LoRA bug (#10352)
jeejeelee Nov 15, 2024
1be25ac
[Misc] Fix some help info of arg_utils to improve readability (#10362)
ShangmingCai Nov 15, 2024
ffddf91
[core][misc] keep compatibility for old-style classes (#10356)
youkaichao Nov 15, 2024
affa3bb
[Bugfix] Ensure special tokens are properly filtered out for guided s…
gcalmettes Nov 15, 2024
e6d15ee
[Misc] Bump up test_fused_moe tolerance (#10364)
ElizaWszola Nov 15, 2024
b866673
[Misc] bump mistral common version (#10367)
simon-mo Nov 15, 2024
4467cd1
[Docs] Add Nebius as sponsors (#10371)
simon-mo Nov 15, 2024
de1a339
[Frontend] Add --version flag to CLI (#10369)
russellb Nov 15, 2024
b0a608b
[Doc] Move PR template content to docs (#10159)
russellb Nov 15, 2024
5bef6c8
[Docs] Misc updates to TPU installation instructions (#10165)
mikegre-google Nov 15, 2024
42cdb3c
[Frontend] Automatic detection of chat content format from AST (#9919)
DarkLight1337 Nov 16, 2024
6d5a548
[doc] add doc for the plugin system (#10372)
youkaichao Nov 16, 2024
e7257f4
[misc][plugin] improve log messages (#10386)
youkaichao Nov 16, 2024
9a62e9a
[BugFix] [Kernel] Fix GPU SEGV occuring in fused_moe kernel (#10385)
rasmith Nov 16, 2024
2692313
[Misc] Update benchmark to support image_url file or http (#10287)
kakao-steve-ai Nov 16, 2024
fae08af
[Misc] Medusa supports custom bias (#10361)
skylee-01 Nov 16, 2024
24ec29c
[Bugfix] Fix M-RoPE position calculation when chunked prefill is enab…
imkero Nov 16, 2024
d1bc041
[V1] Add code owners for V1 (#10397)
WoosukKwon Nov 16, 2024
80d031d
[2/N][torch.compile] make compilation cfg part of vllm cfg (#10383)
youkaichao Nov 17, 2024
fb0e946
[V1] Refactor model executable interface for all text-only language m…
ywang96 Nov 17, 2024
cf37750
[CI/Build] Fix IDC hpu [Device not found] issue (#10384)
xuechendi Nov 17, 2024
0399523
[Bugfix][CPU] Fix CPU embedding runner with tensor parallel (#10394)
Isotr0py Nov 17, 2024
dc08693
[platforms] refactor cpu code (#10402)
youkaichao Nov 17, 2024
52002dd
[Hardware] [HPU]add `mark_step` for hpu (#10239)
jikunshang Nov 17, 2024
287ed74
[Bugfix] Fix mrope_position_delta in non-last prefill chunk (#10403)
imkero Nov 17, 2024
c0adeb8
[Misc] Enhance offline_inference to support user-configurable paramet…
wchen61 Nov 17, 2024
83f6707
Implemented kvcache transfer (naive send/recv)
mrn3088 Nov 19, 2024
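The last commit above carries the substance of this PR: a naive, blocking send/recv of the KV cache between workers. As a rough illustration only (the function names, cache layout, and process-group setup below are assumptions, not the PR's actual code), a per-layer transfer with torch.distributed could look like:

import torch
import torch.distributed as dist


def send_kv_cache(kv_cache: list[torch.Tensor], dst: int) -> None:
    # Naive path: one blocking point-to-point send per layer's KV tensor.
    for layer_kv in kv_cache:
        dist.send(layer_kv.contiguous(), dst=dst)


def recv_kv_cache(kv_cache: list[torch.Tensor], src: int) -> None:
    # Receive each layer's KV tensor in place from the sending worker.
    for layer_kv in kv_cache:
        dist.recv(layer_kv, src=src)

Sending layer by layer keeps the sketch simple but serializes the transfer; packing layers into one buffer or switching to async isend/irecv would be the natural next step.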
43 changes: 43 additions & 0 deletions .buildkite/check-wheel-size.py
@@ -0,0 +1,43 @@
import os
import sys
import zipfile

# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 250 MB
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 250))


def print_top_10_largest_files(zip_file):
    """Print the top 10 largest files in the given zip file."""
    with zipfile.ZipFile(zip_file, 'r') as z:
        file_sizes = [(f, z.getinfo(f).file_size) for f in z.namelist()]
        file_sizes.sort(key=lambda x: x[1], reverse=True)
        for f, size in file_sizes[:10]:
            print(f"{f}: {size / (1024 * 1024):.2f} MBs uncompressed.")


def check_wheel_size(directory):
    """Check the size of .whl files in the given directory."""
    for root, _, files in os.walk(directory):
        for file_name in files:
            if file_name.endswith(".whl"):
                wheel_path = os.path.join(root, file_name)
                wheel_size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
                if wheel_size_mb > VLLM_MAX_SIZE_MB:
                    print(f"Not allowed: Wheel {wheel_path} is larger "
                          f"({wheel_size_mb:.2f} MB) than the limit "
                          f"({VLLM_MAX_SIZE_MB} MB).")
                    print_top_10_largest_files(wheel_path)
                    return 1
                else:
                    print(f"Wheel {wheel_path} is within the allowed size "
                          f"({wheel_size_mb:.2f} MB).")
    return 0


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python check-wheel-size.py <directory>")
        sys.exit(1)

    directory = sys.argv[1]
    sys.exit(check_wheel_size(directory))
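To exercise the checker locally, point it at a directory containing a wheel. The snippet below builds a throwaway wheel (a wheel is just a zip archive) and runs the script as a subprocess; the paths are illustrative:

import os
import subprocess
import sys
import zipfile

os.makedirs("dist", exist_ok=True)
# A minimal stand-in wheel; real wheels also carry metadata, which the
# size check does not look at.
with zipfile.ZipFile("dist/demo-0.1-py3-none-any.whl", "w") as z:
    z.writestr("demo/__init__.py", "")

proc = subprocess.run(
    [sys.executable, ".buildkite/check-wheel-size.py", "dist"])
print("exit code:", proc.returncode)  # 0 if within the limit, 1 otherwise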
12 changes: 12 additions & 0 deletions .buildkite/lm-eval-harness/configs/DeepSeek-V2-Lite-Chat.yaml
@@ -0,0 +1,12 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m deepseek-ai/DeepSeek-V2-Lite-Chat -b "auto" -l 1000 -f 5 -t 2
model_name: "deepseek-ai/DeepSeek-V2-Lite-Chat"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.671
  - name: "exact_match,flexible-extract"
    value: 0.664
limit: 1000
num_fewshot: 5
trust_remote_code: True
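Each of these configs pins expected GSM8k scores for one model at a fixed sample limit and few-shot count. As a minimal sketch of how such a config could be consumed through lm-eval's Python API (the 0.05 tolerance and the comparison loop are assumptions, not the harness's actual test code):

import lm_eval
import yaml

with open("DeepSeek-V2-Lite-Chat.yaml") as f:
    cfg = yaml.safe_load(f)

# Run the pinned tasks with the pinned settings.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=f"pretrained={cfg['model_name']}",
    tasks=[t["name"] for t in cfg["tasks"]],
    num_fewshot=cfg["num_fewshot"],
    limit=cfg["limit"],
    batch_size="auto",
)

# Compare measured scores against the pinned values.
for task in cfg["tasks"]:
    for metric in task["metrics"]:
        measured = results["results"][task["name"]][metric["name"]]
        # 0.05 is an assumed tolerance, not the CI's actual threshold.
        assert abs(measured - metric["value"]) < 0.05, metric["name"]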
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform -b auto -l 1000 -f 5
model_name: "nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.905
  - name: "exact_match,flexible-extract"
    value: 0.905
limit: 1000
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-70B-Instruct.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-70B-Instruct -b 32 -l 250 -f 5
model_name: "meta-llama/Meta-Llama-3-70B-Instruct"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.892
  - name: "exact_match,flexible-extract"
    value: 0.892
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8A8-FP8-Channelwise-compressed-tensors -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8A8-FP8-Channelwise-compressed-tensors"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.752
  - name: "exact_match,flexible-extract"
    value: 0.754
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.753
  - name: "exact_match,flexible-extract"
    value: 0.753
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test -b 32 -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.755
  - name: "exact_match,flexible-extract"
    value: 0.755
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1
model_name: "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.753
  - name: "exact_match,flexible-extract"
    value: 0.753
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.764
  - name: "exact_match,flexible-extract"
    value: 0.764
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.728
  - name: "exact_match,flexible-extract"
    value: 0.728
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.758
  - name: "exact_match,flexible-extract"
    value: 0.759
limit: 1000
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5 -t 1
model_name: "meta-llama/Meta-Llama-3-8B-Instruct"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.756
  - name: "exact_match,flexible-extract"
    value: 0.752
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
model_name: "HandH1998/QQQ-Llama-3-8b-g128"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.419
  - name: "exact_match,flexible-extract"
    value: 0.416
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.356
  - name: "exact_match,flexible-extract"
    value: 0.358
limit: 1000
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Minitron-4B-Base-FP8.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m mgoin/Minitron-4B-Base-FP8 -b auto -l 1000 -f 5 -t 1
model_name: "mgoin/Minitron-4B-Base-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.233
  - name: "exact_match,flexible-extract"
    value: 0.236
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic -b "auto" -l 250 -f 5 -t 8
model_name: "neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.86
  - name: "exact_match,flexible-extract"
    value: 0.86
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 -b "auto" -l 250 -f 5 -t 4
model_name: "neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.624
  - name: "exact_match,flexible-extract"
    value: 0.624
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Mixtral-8x7B-Instruct-v0.1.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5 -t 4
model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.616
  - name: "exact_match,flexible-extract"
    value: 0.632
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen2-1.5B-Instruct-FP8W8.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-FP8W8 -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Qwen2-1.5B-Instruct-FP8W8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.578
  - name: "exact_match,flexible-extract"
    value: 0.585
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.593
  - name: "exact_match,flexible-extract"
    value: 0.588
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise -b "auto" -l 1000 -f 5 -t 1
model_name: "nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.595
  - name: "exact_match,flexible-extract"
    value: 0.582
limit: 1000
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen2-57B-A14-Instruct.yaml
@@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m Qwen/Qwen2-57B-A14B-Instruct -b "auto" -l 250 -f 5 -t 4
model_name: "Qwen/Qwen2-57B-A14B-Instruct"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.792
  - name: "exact_match,flexible-extract"
    value: 0.824
limit: 250
num_fewshot: 5
5 changes: 5 additions & 0 deletions .buildkite/lm-eval-harness/configs/models-large.txt
@@ -0,0 +1,5 @@
Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform.yaml
Meta-Llama-3-70B-Instruct.yaml
Mixtral-8x7B-Instruct-v0.1.yaml
Qwen2-57B-A14-Instruct.yaml
DeepSeek-V2-Lite-Chat.yaml
10 changes: 10 additions & 0 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -0,0 +1,10 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
Minitron-4B-Base-FP8.yaml
Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
Qwen2-1.5B-Instruct-FP8W8.yaml
Meta-Llama-3-8B-QQQ.yaml
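These .txt lists name the configs each CI tier runs. A small sketch of iterating one of them (the paths match this PR; the loop itself is illustrative, not the CI's actual runner):

import pathlib

import yaml

config_dir = pathlib.Path(".buildkite/lm-eval-harness/configs")
for name in (config_dir / "models-small.txt").read_text().split():
    cfg = yaml.safe_load((config_dir / name).read_text())
    print(cfg["model_name"], "limit:", cfg["limit"], "fewshot:", cfg["num_fewshot"])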
46 changes: 46 additions & 0 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -0,0 +1,46 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on GSM for transformers.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.4

usage() {
    echo
    echo "Runs lm eval harness on GSM8k using huggingface transformers."
    echo "This pathway is intended to be used to create baselines for"
    echo "our automated nm-test-accuracy workflow."
    echo
    echo "usage: ${0} <options>"
    echo
    echo "  -m  - huggingface stub or local directory of the model"
    echo "  -b  - batch size to run the evaluation at"
    echo "  -l  - limit number of samples to run"
    echo "  -f  - number of fewshot samples to use"
    echo
}

while getopts "m:b:l:f:" OPT; do
    case ${OPT} in
        m )
            MODEL="$OPTARG"
            ;;
        b )
            BATCH_SIZE="$OPTARG"
            ;;
        l )
            LIMIT="$OPTARG"
            ;;
        f )
            FEWSHOT="$OPTARG"
            ;;
        \? )
            usage
            exit 1
            ;;
    esac
done

lm_eval --model hf \
    --model_args "pretrained=$MODEL,parallelize=True" \
    --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
    --batch_size "$BATCH_SIZE"