GPTQ Support #421

Closed
wants to merge 362 commits
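For context on what this PR enables (an illustrative sketch, not code taken from this PR): GPTQ-quantized checkpoints are selected in vLLM by passing quantization="gptq", and the commits below wire that path up for HPU (Gaudi) devices as well. The model id here is only a placeholder.

from vllm import LLM, SamplingParams

# Placeholder GPTQ checkpoint; any GPTQ-quantized model id is used the same way.
llm = LLM(model="TheBloke/Llama-2-7B-GPTQ", quantization="gptq")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)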
362 commits
29061ed
[Misc] Add an env var VLLM_LOGGING_PREFIX, if set, it will be prepend…
sfc-gh-zhwang Oct 23, 2024
831540c
[Model] Support E5-V (#9576)
DarkLight1337 Oct 23, 2024
51c24c9
[Build] Fix `FetchContent` multiple build issue (#9596)
ProExpertProg Oct 23, 2024
2394962
[Hardware][XPU] using current_platform.is_xpu (#9605)
MengqingCao Oct 23, 2024
3ff57eb
[Model] Initialize Florence-2 language backbone support (#9555)
Isotr0py Oct 23, 2024
c18e1a3
[VLM] Enable overriding whether post layernorm is used in vision enco…
DarkLight1337 Oct 23, 2024
31a08f5
[Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs…
alex-jw-brooks Oct 23, 2024
e7116c0
[Bugfix] Fix `_init_vision_model` in NVLM_D model (#9611)
DarkLight1337 Oct 23, 2024
dbdd3b5
[misc] comment to avoid future confusion about baichuan (#9620)
youkaichao Oct 23, 2024
e5ac6a4
[Bugfix] Fix divide by zero when serving Mamba models (#9617)
tlrmchlsmth Oct 23, 2024
fd0e2cf
[Misc] Separate total and output tokens in benchmark_throughput.py (#…
mgoin Oct 23, 2024
9013e24
[torch.compile] Adding torch compile annotations to some models (#9614)
CRZbulabula Oct 23, 2024
150b779
[Frontend] Enable Online Multi-image Support for MLlama (#9393)
alex-jw-brooks Oct 23, 2024
fc6c274
[Model] Add Qwen2-Audio model support (#9248)
faychu Oct 23, 2024
d1fbc94
gptq hpu support added
maktukmak Oct 23, 2024
b548d7a
[CI/Build] Add bot to close stale issues and PRs (#9436)
russellb Oct 23, 2024
bb01f29
[Bugfix][Model] Fix Mllama SDPA illegal memory access for batched mul…
mgoin Oct 24, 2024
b7df53c
[Bugfix] Use "vision_model" prefix for MllamaVisionModel (#9628)
mgoin Oct 24, 2024
33bab41
[Bugfix]: Make chat content text allow type content (#9358)
vrdn-23 Oct 24, 2024
056a68c
[XPU] avoid triton import for xpu (#9440)
yma11 Oct 24, 2024
836e8ef
[Bugfix] Fix PP for ChatGLM and Molmo (#9422)
DarkLight1337 Oct 24, 2024
3770071
[V1][Bugfix] Clean up requests when aborted (#9629)
WoosukKwon Oct 24, 2024
4fdc581
[core] simplify seq group code (#9569)
youkaichao Oct 24, 2024
8a02cd0
[torch.compile] Adding torch compile annotations to some models (#9639)
CRZbulabula Oct 24, 2024
295a061
[Kernel] add kernel for FATReLU (#9610)
jeejeelee Oct 24, 2024
ad6f780
[torch.compile] expanding support and fix allgather compilation (#9637)
CRZbulabula Oct 24, 2024
b979143
[Doc] Move additional tips/notes to the top (#9647)
DarkLight1337 Oct 24, 2024
f584549
[Bugfix]Disable the post_norm layer of the vision encoder for LLaVA m…
litianjian Oct 24, 2024
de662d3
Increase operation per run limit for "Close inactive issues and PRs" …
hmellor Oct 24, 2024
d27cfbf
[torch.compile] Adding torch compile annotations to some models (#9641)
CRZbulabula Oct 24, 2024
c866e00
[CI/Build] Fix VLM test failures when using transformers v4.46 (#9666)
DarkLight1337 Oct 24, 2024
722d46e
[Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints (#…
alex-jw-brooks Oct 24, 2024
e26d37a
[Log][Bugfix] Fix default value check for `image_url.detail` (#9663)
mgoin Oct 24, 2024
5944909
[Performance][Kernel] Fused_moe Performance Improvement (#9384)
charlifu Oct 24, 2024
c91ed47
[Bugfix] Remove xformers requirement for Pixtral (#9597)
mgoin Oct 24, 2024
9f7b4ba
[ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test #…
khluu Oct 25, 2024
a6f3721
[Model] add a lora module for granite 3.0 MoE models (#9673)
willmj Oct 25, 2024
9645b9f
[V1] Support sliding window attention (#9679)
WoosukKwon Oct 25, 2024
f603353
Update README_GAUDI about fp8 calibration procedure (#423)
afierka-intel Oct 25, 2024
a5136ec
Set vllm-hpu-extension to 341a77f (#428)
madamczykhabana Oct 25, 2024
a926d14
Create scorecard.yml
rozhukov Oct 25, 2024
5b7f685
Contiguous PA (#424)
mfylcek Oct 25, 2024
e3ae2eb
Revert "Contiguous PA" (#432)
madamczykhabana Oct 25, 2024
93609a2
Enable Dynamic MoE for Mixtral on 1.19.0 (#425)
tpawlows Oct 25, 2024
ca0d922
[Bugfix] Fix compressed_tensors_moe bad config.strategy (#9677)
mgoin Oct 25, 2024
228cfbd
[Doc] Improve quickstart documentation (#9256)
rafvasq Oct 25, 2024
6567e13
[Bugfix] Fix crash with llama 3.2 vision models and guided decoding (…
tjohnson31415 Oct 25, 2024
067e77f
[Bugfix] Steaming continuous_usage_stats default to False (#9709)
samos123 Oct 26, 2024
5cbdccd
[Hardware][openvino] is_openvino --> current_platform.is_openvino (#9…
MengqingCao Oct 26, 2024
55137e8
Fix: MI100 Support By Bypassing Custom Paged Attention (#9560)
MErkinSag Oct 26, 2024
07e981f
[Frontend] Bad words sampling parameter (#9717)
Alvant Oct 26, 2024
6650e6a
[Model] Add classification Task with Qwen2ForSequenceClassification …
kakao-kevin-us Oct 26, 2024
67a6882
[Misc] SpecDecodeWorker supports profiling (#9719)
Abatom Oct 27, 2024
8549c82
[core] cudagraph output with tensor weak reference (#9724)
youkaichao Oct 27, 2024
3cb07a3
[Misc] Upgrade to pytorch 2.5 (#9588)
bnellnm Oct 27, 2024
e130c40
Fix cache management in "Close inactive issues and PRs" actions workf…
hmellor Oct 27, 2024
34a9941
[Bugfix] Fix load config when using bools (#9533)
madt2709 Oct 27, 2024
4e2d95e
[Hardware][ROCM] using current_platform.is_rocm (#9642)
wangshuai09 Oct 28, 2024
32176fe
[torch.compile] support moe models (#9632)
youkaichao Oct 28, 2024
feb92fb
Fix beam search eos (#9627)
robertgshaw2-neuralmagic Oct 28, 2024
2adb440
[Bugfix] Fix ray instance detect issue (#9439)
yma11 Oct 28, 2024
3a55e77
Support long contexts with LoRA (#418)
SanjuCSudhakaran Oct 28, 2024
4fd5c4c
Add HPU specific changes to benchmark_latency.py (#436)
kdamaszk Oct 28, 2024
3e06110
Merge remote-tracking branch 'upstream/main' into HEAD
kzawora-intel Oct 28, 2024
96e0d6f
Rebase fix
kzawora-intel Oct 28, 2024
ebebbbb
fix ci fails
kzawora-intel Oct 28, 2024
4c0caa5
fix ci again
kzawora-intel Oct 28, 2024
72a2856
formatting
kzawora-intel Oct 28, 2024
8b0e4f2
[CI/Build] Adopt Mergify for auto-labeling PRs (#9259)
russellb Oct 28, 2024
2a38e6f
sarkar/Add htrandom generator for hpu (#246)
ssarkar2 Oct 28, 2024
5f8d807
[Model][VLM] Add multi-video support for LLaVA-Onevision (#8905)
litianjian Oct 28, 2024
aa0addb
Adding "torch compile" annotations to moe models (#9758)
CRZbulabula Oct 28, 2024
97b61bf
[misc] avoid circular import (#9765)
youkaichao Oct 28, 2024
76ed534
[torch.compile] add deepseek v2 compile (#9775)
youkaichao Oct 28, 2024
c5d7fb9
[Doc] fix third-party model example (#9771)
russellb Oct 29, 2024
7a4df5f
[Model][LoRA]LoRA support added for Qwen (#9622)
jeejeelee Oct 29, 2024
e74f2d4
[Doc] Specify async engine args in docs (#9726)
DarkLight1337 Oct 29, 2024
eae3d48
[Bugfix] Use temporary directory in registry (#9721)
DarkLight1337 Oct 29, 2024
3e135ae
Fix one_hot bug in torch compile mode (#427)
yuwenzho Oct 29, 2024
3203bd9
HPU: offload logits processing to CPU (#358)
madamczykhabana Oct 29, 2024
2fa54e2
Lora layers (#435)
rsshaik1 Oct 29, 2024
1dcdb37
initial works on enabling automatic prefix caching (#162)
huijjj Oct 29, 2024
ef7865b
[Frontend] re-enable multi-modality input in the new beam search impl…
FerdinandZhong Oct 29, 2024
09500f7
[Model] Add BNB quantization support for Mllama (#9720)
Isotr0py Oct 29, 2024
78e947a
Multi step scheduling (#441)
tzielinski-habana Oct 29, 2024
622b7ab
[Hardware] using current_platform.seed_everything (#9785)
wangshuai09 Oct 29, 2024
74fc2d7
[Misc] Add metrics for request queue time, forward time, and execute …
Abatom Oct 29, 2024
08600dd
Fix the log to correct guide user to install modelscope (#9793)
tastelikefeet Oct 29, 2024
0f43387
[Bugfix] Use host argument to bind to interface (#9798)
svenseeberg Oct 29, 2024
0ce7798
[Misc]: Typo fix: Renaming classes (casualLM -> causalLM) (#9801)
yannicks1 Oct 29, 2024
ac3d748
[Model] Add LlamaEmbeddingModel as an embedding Implementation of Ll…
jsato8094 Oct 29, 2024
ab6f981
[CI][Bugfix] Skip chameleon for transformers 4.46.1 (#9808)
mgoin Oct 29, 2024
7585ec9
[CI/Build] mergify: fix rules for ci/build label (#9804)
russellb Oct 29, 2024
0ad216f
[MISC] Set label value to timestamp over 0, to keep track of recent h…
coolkp Oct 29, 2024
67bdf8e
[Bugfix][Frontend] Guard against bad token ids (#9634)
joerunde Oct 29, 2024
882a1ad
[Model] tool calling support for ibm-granite/granite-20b-functioncall…
wseaton Oct 29, 2024
8d77241
[Docs] Add notes about Snowflake Meetup (#9814)
simon-mo Oct 29, 2024
bc73e98
[Bugfix] Fix prefix strings for quantized VLMs (#9772)
mgoin Oct 29, 2024
1ab6f6b
[core][distributed] fix custom allreduce in pytorch 2.5 (#9815)
youkaichao Oct 30, 2024
64cb1cd
Update README.md (#9819)
LiuXiaoxuanPKU Oct 30, 2024
226688b
[Bugfix][VLM] Make apply_fp8_linear work with >2D input (#9812)
mgoin Oct 30, 2024
62fac4b
[ci/build] Pin CI dependencies version with pip-compile (#9810)
khluu Oct 30, 2024
04a3ae0
[Bugfix] Fix multi nodes TP+PP for XPU (#8884)
yma11 Oct 30, 2024
7b0365e
[Doc] Add the DCO to CONTRIBUTING.md (#9803)
russellb Oct 30, 2024
ff5ed6e
[torch.compile] rework compile control with piecewise cudagraph (#9715)
youkaichao Oct 30, 2024
6aa6020
[Misc] Specify minimum pynvml version (#9827)
jeejeelee Oct 30, 2024
211fe91
[TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA (#9438)
WoosukKwon Oct 30, 2024
a821717
Add fp8 test to jenkins CI (#429)
afierka-intel Oct 30, 2024
79dc102
Enable FusedSDPA prefill by default (#447)
kzawora-intel Oct 30, 2024
2f7f963
Contiguous PA (#433)
mfylcek Oct 30, 2024
94858b5
Fix default value for FSDPA (#448)
madamczykhabana Oct 30, 2024
d3257b2
Fix performance of top_p and top_k calculations (#449)
kdamaszk Oct 30, 2024
cc98f1e
[CI/Build] VLM Test Consolidation (#9372)
alex-jw-brooks Oct 30, 2024
81f09cf
[Model] Support math-shepherd-mistral-7b-prm model (#9697)
Went-Liang Oct 30, 2024
9ff4511
[Misc] Add chunked-prefill support on FlashInfer. (#9781)
elfiegg Oct 30, 2024
3b3f1e7
[Bugfix][core] replace heartbeat with pid check (#9818)
joerunde Oct 30, 2024
4272c16
row vs column paralel fix
maktukmak Oct 30, 2024
33d2577
[Doc] link bug for multistep guided decoding (#9843)
joerunde Oct 30, 2024
c787f2d
[Neuron] Update Dockerfile.neuron to fix build failure (#9822)
hbikki Oct 30, 2024
c2cd1a2
[doc] update pp support (#9853)
youkaichao Oct 30, 2024
00d91c8
[CI/Build] Simplify exception trace in api server tests (#9787)
CRZbulabula Oct 30, 2024
64384bb
[torch.compile] upgrade tests (#9858)
youkaichao Oct 30, 2024
abbfb61
[Misc][OpenAI] deprecate max_tokens in favor of new max_completion_to…
gcalmettes Oct 31, 2024
890ca36
Revert "[Bugfix] Use host argument to bind to interface (#9798)" (#9852)
khluu Oct 31, 2024
d087bf8
[Model] Support quantization of Qwen2VisionTransformer (#9817)
mgoin Oct 31, 2024
3ea2dc2
[Misc] Remove deprecated arg for cuda graph capture (#9864)
ywang96 Oct 31, 2024
5608e61
[Doc] Update Qwen documentation (#9869)
jeejeelee Oct 31, 2024
d42c2a2
Reduce block fragmentation (#426)
yangw1234 Oct 31, 2024
16b8f7a
[CI/Build] Add Model Tests for Qwen2-VL (#9846)
alex-jw-brooks Oct 31, 2024
6643aa6
Create scorecard.yml (#431)
rozhukov Oct 31, 2024
77f7ef2
[CI/Build] Adding a forced docker system prune to clean up space (#9849)
Alexei-V-Ivanov-AMD Oct 31, 2024
55650c8
[Bugfix] Fix `illegal memory access` error with chunked prefill, pref…
sasha0552 Oct 31, 2024
9fb12f7
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 (…
mzusman Oct 31, 2024
b63c64d
[ci/build] Configure dependabot to update pip dependencies (#9811)
khluu Oct 31, 2024
031a799
[Bugfix][Frontend] Reject guided decoding in multistep mode (#9892)
joerunde Nov 1, 2024
96e0c9c
[torch.compile] directly register custom op (#9896)
youkaichao Nov 1, 2024
37a4947
[Bugfix] Fix layer skip logic with bitsandbytes (#9887)
mgoin Nov 1, 2024
566cd27
[torch.compile] rework test plans (#9866)
youkaichao Nov 1, 2024
93a76dd
[Model] Support bitsandbytes for MiniCPMV (#9891)
mgoin Nov 1, 2024
2b5bf20
[torch.compile] Adding torch compile annotations to some models (#9876)
CRZbulabula Nov 1, 2024
d3aa2a8
[Doc] Update multi-input support (#9906)
DarkLight1337 Nov 1, 2024
06386a6
[Frontend] Chat-based Embeddings API (#9759)
DarkLight1337 Nov 1, 2024
30a2e80
[CI/Build] Add Model Tests for PixtralHF (#9813)
mgoin Nov 1, 2024
ba0d892
[Frontend] Use a proper chat template for VLM2Vec (#9912)
DarkLight1337 Nov 1, 2024
1dd4cb2
[Bugfix] Fix edge cases for MistralTokenizer (#9625)
tjohnson31415 Nov 1, 2024
4581d2c
[Core] Refactor: Clean up unused argument in Scheduler._preempt (#9696)
andrejonasson Nov 1, 2024
aff1fd8
[torch.compile] use interpreter with stable api from pytorch (#9889)
youkaichao Nov 1, 2024
598b6d7
[Bugfix/Core] Flashinfer k_scale and v_scale (#9861)
pavanimajety Nov 1, 2024
48a90dc
g_idx check added
maktukmak Nov 1, 2024
18bd758
[1/N] pass the complete config from engine to executor (#9933)
youkaichao Nov 1, 2024
27cd36e
[Bugfix] PicklingError on RayTaskError (#9934)
GeneDer Nov 1, 2024
d151fde
[ci/build] Bump the patch-update group with 10 updates (#9897)
dependabot[bot] Nov 1, 2024
6c0b7f5
[Core][VLM] Add precise multi-modal placeholder tracking (#8346)
petersalas Nov 1, 2024
d522034
[ci/build] Have dependabot ignore pinned dependencies (#9935)
khluu Nov 1, 2024
a78dd33
[Encoder Decoder] Add flash_attn kernel support for encoder-decoder m…
sroy745 Nov 2, 2024
af7380d
[torch.compile] fix cpu broken code (#9947)
youkaichao Nov 2, 2024
eed92f1
[Docs] Update Granite 3.0 models in supported models table (#9930)
njhill Nov 2, 2024
1d4cfe2
[Doc] Updated tpu-installation.rst with more details (#9926)
mikegre-google Nov 2, 2024
e893795
[2/N] executor pass the complete config to worker/modelrunner (#9938)
youkaichao Nov 2, 2024
d6459b4
[V1] Fix `EngineArgs` refactor on V1 (#9954)
robertgshaw2-neuralmagic Nov 2, 2024
74b529c
[bugfix] fix chatglm dummy_data_for_glmv (#9955)
youkaichao Nov 2, 2024
cea808f
[3/N] model runner pass the whole config to model (#9958)
youkaichao Nov 2, 2024
1b73ab2
[CI/Build] Quoting around > (#9956)
nokados Nov 2, 2024
ae5279a
[torch.compile] Adding torch compile to vision-language models (#9946)
CRZbulabula Nov 2, 2024
3bb4bef
[bugfix] fix tsts (#9959)
youkaichao Nov 2, 2024
1f1b6d6
[V1] Support per-request seed (#9945)
njhill Nov 3, 2024
5459772
[Model] Add support for H2OVL-Mississippi models (#9747)
cooleel Nov 4, 2024
91c9ebb
[V1] Fix Configs (#9971)
robertgshaw2-neuralmagic Nov 4, 2024
c49f040
[Bugfix] Fix MiniCPMV and Mllama BNB bug (#9917)
jeejeelee Nov 4, 2024
0cc72b9
Enable HPUGraphs for lora long-contexts tests
SanjuCSudhakaran Nov 4, 2024
b67feb1
[Bugfix]Using the correct type hints (#9885)
gshtras Nov 4, 2024
24ba4d4
[CI] Add Llama2 to torch compile tests (#446)
anko-intel Nov 4, 2024
4dbcbbe
[Misc] Compute query_start_loc/seq_start_loc on CPU (#9447)
zhengy001 Nov 4, 2024
ea4aded
[Bugfix] Fix E2EL mean and median stats (#9984)
daitran2k1 Nov 4, 2024
1bb808a
Enable HPUGraphs for lora long-contexts tests (#454)
vivekgoe Nov 4, 2024
ccb5376
[Bugfix][OpenVINO] Fix circular reference #9939 (#9974)
MengqingCao Nov 4, 2024
ac6b8f1
[Frontend] Multi-Modality Support for Loading Local Image Files (#9915)
chaunceyjiang Nov 4, 2024
8d72bb2
[4/N] make quant config first-class citizen (#9978)
youkaichao Nov 4, 2024
fb2716d
[Misc]Reduce BNB static variable (#9987)
jeejeelee Nov 4, 2024
603a661
[Model] factoring out MambaMixer out of Jamba (#8993)
mzusman Nov 4, 2024
1b8e7d4
exllama state removed
maktukmak Nov 4, 2024
c305f09
removed custom ops check
maktukmak Nov 4, 2024
2ea889a
format fixes
maktukmak Nov 4, 2024
1c45f4c
[CI] Basic Integration Test For TPU (#9968)
robertgshaw2-neuralmagic Nov 4, 2024
5208dc7
[Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests ru…
hissu-hyvarinen Nov 4, 2024
6e056bc
[Doc] Update VLM doc about loading from local files (#9999)
ywang96 Nov 4, 2024
04cef2c
[Bugfix] Fix `MQLLMEngine` hanging (#9973)
robertgshaw2-neuralmagic Nov 4, 2024
9a5664d
[Misc] Refactor benchmark_throughput.py (#9779)
lk-chen Nov 4, 2024
ac04a97
[Frontend] Add max_tokens prometheus metric (#9881)
tomeras91 Nov 4, 2024
d93478b
[Bugfix] Upgrade to pytorch 2.5.1 (#10001)
bnellnm Nov 4, 2024
2094062
[4.5/N] bugfix for quant config in speculative decode (#10007)
youkaichao Nov 4, 2024
8f0a9ca
[Bugfix] Respect modules_to_not_convert within awq_marlin (#9895)
mgoin Nov 4, 2024
04bbf38
[Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep (#9994)
tlrmchlsmth Nov 5, 2024
bbc3619
[Core] Make encoder-decoder inputs a nested structure to be more comp…
DarkLight1337 Nov 5, 2024
ad23318
[Bugfix] Fixup Mamba (#10004)
tlrmchlsmth Nov 5, 2024
ac12d53
Fix SchedulerConfig params (#459)
ldurejko Nov 5, 2024
653e56c
Tensor parallelism for multi-step scheduling (#457)
tzielinski-habana Nov 5, 2024
7a83b1a
[BugFix] Lazy import ray (#10021)
GeneDer Nov 5, 2024
93dee88
[Misc] vllm CLI flags should be ordered for better user readability (…
chaunceyjiang Nov 5, 2024
1033c3e
Set tokenizers version to <0.20.2 (#460)
madamczykhabana Nov 5, 2024
5e56d88
Merge remote-tracking branch 'origin/habana_main' into private/kzawor…
kzawora-intel Nov 5, 2024
18f00d7
Merge remote-tracking branch 'upstream/main' into private/kzawora/oct…
kzawora-intel Nov 5, 2024
d397ba5
fix hpu execution
kzawora-intel Nov 5, 2024
4c0647f
format.sh
kzawora-intel Nov 5, 2024
c41788f
fix type checks
kzawora-intel Nov 5, 2024
5952d81
[Frontend] Fix tcp port reservation for api server (#10012)
russellb Nov 5, 2024
cd34029
Refactor TPU requirements file and pin build dependencies (#10010)
richardsliu Nov 5, 2024
09d3550
[Misc] Add logging for CUDA memory (#10027)
yangalan123 Nov 5, 2024
731aec5
[CI/Build] Limit github CI jobs based on files changed (#9928)
russellb Nov 5, 2024
a53046b
[Model] Support quantization of PixtralHFTransformer for PixtralHF (#…
mgoin Nov 5, 2024
d2e8033
[Feature] Update benchmark_throughput.py to support image input (#9851)
lk-chen Nov 5, 2024
b9c64c0
[Misc] Modify BNB parameter name (#9997)
jeejeelee Nov 5, 2024
0246246
[CI] Prune tests/models/decoder_only/language/* tests (#9940)
mgoin Nov 5, 2024
235366f
[CI] Prune back the number of tests in tests/kernels/* (#9932)
mgoin Nov 5, 2024
ca9844b
[bugfix] fix weak ref in piecewise cudagraph and tractable test (#10048)
youkaichao Nov 5, 2024
43300bd
[Bugfix] Properly propagate trust_remote_code settings (#10047)
zifeitong Nov 6, 2024
966e316
[Bugfix] Fix pickle of input when async output processing is on (#9931)
wallashss Nov 6, 2024
0c63c34
[Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode (…
llsj14 Nov 6, 2024
c4cacba
[v1] reduce graph capture time for piecewise cudagraph (#10059)
youkaichao Nov 6, 2024
82bfc38
[Misc] Sort the list of embedding models (#10037)
DarkLight1337 Nov 6, 2024
ffc0f2b
[Model][OpenVINO] Fix regressions from #8346 (#10045)
petersalas Nov 6, 2024
2bcbae7
[Bugfix] Fix edge-case crash when using chat with the Mistral Tekken …
tjohnson31415 Nov 6, 2024
ea928f6
[Bugfix] Gpt-j-6B patch kv_scale to k_scale path (#10063)
arakowsk-amd Nov 6, 2024
9d59b75
[Bugfix] Remove CustomChatCompletionContentPartParam multimodal input…
zifeitong Nov 6, 2024
4089985
[V1] Integrate Piecewise CUDA graphs (#10058)
WoosukKwon Nov 6, 2024
4be3a45
[distributed] add function to create ipc buffers directly (#10064)
youkaichao Nov 6, 2024
21063c1
[CI/Build] drop support for Python 3.8 EOL (#8464)
aarnphm Nov 6, 2024
a5fda50
[CI/Build] Fix large_gpu_mark reason (#10070)
Isotr0py Nov 6, 2024
a02a50e
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend (#6143)
kzawora-intel Nov 6, 2024
6a585a2
[Hotfix] Fix ruff errors (#10073)
WoosukKwon Nov 6, 2024
c3c0e90
[BugFix][Habana_main][Multistep]Fix multistep deepcopy overhead (#452)
xuechendi Nov 6, 2024
dc5cdfb
Set vllm-hpu-extension to 0063520 (#455)
madamczykhabana Nov 6, 2024
7578f3b
Oct 28 rebase (#439)
kzawora-intel Nov 6, 2024
07a6441
Revert "Oct 28 rebase" (#466)
kzawora-intel Nov 6, 2024
5812cb6
Oct 28 rebase - attempt 2 (#467)
kzawora-intel Nov 6, 2024
40882f3
Merge commit 'a5fda50a10641e47c0c290907f30ef2add6d4e7a' into HEAD
kzawora-intel Nov 6, 2024
8e62377
format.sh
kzawora-intel Nov 6, 2024
5eb7f3d
Nov 6 rebase (sans vllm-project#6143) (#468)
kzawora-intel Nov 6, 2024
0a17a2e
Fix missed conflict (#469)
kzawora-intel Nov 6, 2024
b91403a
Merge commit 'a02a50e' into HEAD
kzawora-intel Nov 6, 2024
843ae37
Merge commit '6a585a2' into HEAD
kzawora-intel Nov 6, 2024
60b981e
Align fork with HPU upstream code (#465)
michalkuligowski Nov 6, 2024
3c39626
The output tensor from sampling is the input_tokens to the (#471)
tzielinski-habana Nov 6, 2024
66a67fc
gptq hpu support added
maktukmak Oct 23, 2024
f9cf700
row vs column paralel fix
maktukmak Oct 30, 2024
cbcba5d
g_idx check added
maktukmak Nov 1, 2024
a6ab053
exllama state removed
maktukmak Nov 4, 2024
9b323b5
removed custom ops check
maktukmak Nov 4, 2024
7077a99
format fixes
maktukmak Nov 4, 2024
4ef6b7e
Merge branch 'gptq_hpu' of https://github.com/maktukmak/vllm-fork int…
maktukmak Nov 6, 2024
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.356
- name: "exact_match,flexible-extract"
value: 0.358
limit: 1000
num_fewshot: 5
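(Illustrative note, not part of this PR's diff: configs like the one above record the expected gsm8k scores that the baseline script in the first comment line is checked against. Assuming the harness loads the YAML and compares measured metrics within a relative tolerance, the check could look roughly like the sketch below; the function name and RTOL value are assumptions.)

import yaml
import numpy as np

RTOL = 0.05  # assumed relative tolerance

def check_lm_eval_results(config_path, measured):
    # measured maps metric names such as "exact_match,strict-match" to measured scores
    with open(config_path) as f:
        config = yaml.safe_load(f)
    for task in config["tasks"]:
        for metric in task["metrics"]:
            expected = metric["value"]
            got = measured[metric["name"]]
            assert np.isclose(expected, got, rtol=RTOL), (
                f"{task['name']} {metric['name']}: expected ~{expected}, got {got}")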
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,6 +1,6 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
@@ -56,7 +56,7 @@

def read_markdown(file):
if os.path.exists(file):
with open(file, "r") as f:
with open(file) as f:
return f.read() + "\n"
else:
return f"{file} not found.\n"
@@ -75,14 +75,14 @@ def results_to_json(latency, throughput, serving):
# collect results
for test_file in results_folder.glob("*.json"):

with open(test_file, "r") as f:
with open(test_file) as f:
raw_result = json.loads(f.read())

if "serving" in str(test_file):
# this result is generated via `benchmark_serving.py`

# attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands"), "r") as f:
with open(test_file.with_suffix(".commands")) as f:
command = json.loads(f.read())
raw_result.update(command)

@@ -97,7 +97,7 @@ def results_to_json(latency, throughput, serving):
# this result is generated via `benchmark_latency.py`

# attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands"), "r") as f:
with open(test_file.with_suffix(".commands")) as f:
command = json.loads(f.read())
raw_result.update(command)

@@ -119,7 +119,7 @@ def results_to_json(latency, throughput, serving):
# this result is generated via `benchmark_throughput.py`

# attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands"), "r") as f:
with open(test_file.with_suffix(".commands")) as f:
command = json.loads(f.read())
raw_result.update(command)

@@ -72,15 +72,15 @@ def main(args):

# collect results
for test_file in results_folder.glob("*_nightly_results.json"):
with open(test_file, "r") as f:
with open(test_file) as f:
results = results + json.loads(f.read())

# generate markdown table
df = pd.DataFrame.from_dict(results)

md_table = tabulate(df, headers='keys', tablefmt='pipe', showindex=False)

with open(args.description, "r") as f:
with open(args.description) as f:
description = f.read()

description = description.format(
@@ -36,11 +36,11 @@
# collect results
for test_file in results_folder.glob("*.json"):

with open(test_file, "r") as f:
with open(test_file) as f:
raw_result = json.loads(f.read())

# attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands"), "r") as f:
with open(test_file.with_suffix(".commands")) as f:
command = json.loads(f.read())
raw_result.update(command)

4 changes: 2 additions & 2 deletions .buildkite/release-pipeline.yaml
@@ -3,7 +3,7 @@ steps:
agents:
queue: cpu_queue
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg buildkite_commit=$BUILDKITE_COMMIT --build-arg USE_SCCACHE=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
# rename the files to change linux -> manylinux1
@@ -22,7 +22,7 @@ steps:
agents:
queue: cpu_queue
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg buildkite_commit=$BUILDKITE_COMMIT --build-arg USE_SCCACHE=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
# rename the files to change linux -> manylinux1
15 changes: 8 additions & 7 deletions .buildkite/run-amd-test.sh
@@ -31,8 +31,8 @@ cleanup_docker() {
echo "Disk usage is above $threshold%. Cleaning up Docker images and volumes..."
# Remove dangling images (those that are not tagged and not used by any container)
docker image prune -f
# Remove unused volumes
docker volume prune -f
# Remove unused volumes / force the system prune for old images as well.
docker volume prune -f && docker system prune --force --filter "until=72h" --all
echo "Docker images and volumes cleanup completed."
else
echo "Disk usage is below $threshold%. No cleanup needed."
@@ -107,11 +107,12 @@ fi
PARALLEL_JOB_COUNT=8
# check if the command contains shard flag, we will run all shards in parallel because the host have 8 GPUs.
if [[ $commands == *"--shard-id="* ]]; then
# assign job count as the number of shards used
commands=${commands//"--num-shards= "/"--num-shards=${PARALLEL_JOB_COUNT} "}
for GPU in $(seq 0 $(($PARALLEL_JOB_COUNT-1))); do
#replace shard arguments
commands=${commands//"--shard-id= "/"--shard-id=${GPU} "}
commands=${commands//"--num-shards= "/"--num-shards=${PARALLEL_JOB_COUNT} "}
echo "Shard ${GPU} commands:$commands"
# assign shard-id for each shard
commands_gpu=${commands//"--shard-id= "/"--shard-id=${GPU} "}
echo "Shard ${GPU} commands:$commands_gpu"
docker run \
--device /dev/kfd --device /dev/dri \
--network host \
Expand All @@ -123,7 +124,7 @@ if [[ $commands == *"--shard-id="* ]]; then
-e HF_HOME=${HF_MOUNT} \
--name ${container_name}_${GPU} \
${image_name} \
/bin/bash -c "${commands}" \
/bin/bash -c "${commands_gpu}" \
|& while read -r line; do echo ">>Shard $GPU: $line"; done &
PIDS+=($!)
done
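(The key fix in the hunk above: the --shard-id= placeholder is substituted into a fresh per-GPU copy, commands_gpu, instead of into commands itself, where the first substitution would consume the placeholder and every later shard would rerun shard 0. A minimal Python sketch of the same pattern, illustrative only and not code from this PR:)

PARALLEL_JOB_COUNT = 8
commands = "pytest -v -s lora --shard-id= --num-shards= "
commands = commands.replace("--num-shards= ", f"--num-shards={PARALLEL_JOB_COUNT} ")
for gpu in range(PARALLEL_JOB_COUNT):
    # substitute into a copy so the placeholder survives for the next GPU
    commands_gpu = commands.replace("--shard-id= ", f"--shard-id={gpu} ")
    print(f"Shard {gpu} commands: {commands_gpu}")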
8 changes: 4 additions & 4 deletions .buildkite/run-cpu-test.sh
@@ -32,10 +32,10 @@ docker exec cpu-test bash -c "
--ignore=tests/models/decoder_only/language/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported

# Run compressed-tensor test
# docker exec cpu-test bash -c "
# pytest -s -v \
# tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
# tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynanmic_per_token"
docker exec cpu-test bash -c "
pytest -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynamic_per_token"

# Run AWQ test
docker exec cpu-test bash -c "
2 changes: 1 addition & 1 deletion .buildkite/run-openvino-test.sh
@@ -11,4 +11,4 @@ trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/vllm/examples/offline_inference.py
docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/examples/offline_inference.py
2 changes: 1 addition & 1 deletion .buildkite/run-tpu-test.sh
@@ -12,4 +12,4 @@ remove_docker_container
# For HF_TOKEN.
source /etc/environment
# Run a simple end-to-end example.
docker run --privileged --net host --shm-size=16G -it -e HF_TOKEN=$HF_TOKEN --name tpu-test vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git && python3 -m pip install pytest && pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py && python3 /workspace/vllm/tests/tpu/test_compilation.py && python3 /workspace/vllm/examples/offline_inference_tpu.py"
docker run --privileged --net host --shm-size=16G -it -e HF_TOKEN=$HF_TOKEN --name tpu-test vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git && python3 -m pip install pytest && python3 -m pip install lm_eval[api]==0.4.4 && pytest -v -s /workspace/vllm/tests/entrypoints/openai/test_accuracy.py && pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py && python3 /workspace/vllm/tests/tpu/test_compilation.py && python3 /workspace/vllm/examples/offline_inference_tpu.py"
73 changes: 47 additions & 26 deletions .buildkite/test-pipeline.yaml
@@ -9,6 +9,7 @@
# label(str): the name of the test. emoji allowed.
# fast_check(bool): whether to run this on each commit on fastcheck pipeline.
# fast_check_only(bool): run this test on fastcheck pipeline only
# nightly(bool): run this test in nightly pipeline only
# optional(bool): never run this test by default (i.e. need to unblock manually)
# command(str): the single command to run for tests. incompatible with commands.
# commands(list): the list of commands to run for test. incompatbile with command.
@@ -77,8 +78,8 @@ steps:
- vllm/
- tests/basic_correctness/test_chunked_prefill
commands:
- VLLM_ATTENTION_BACKEND=XFORMERS VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest -v -s basic_correctness/test_chunked_prefill.py
- VLLM_ATTENTION_BACKEND=FLASH_ATTN VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest -v -s basic_correctness/test_chunked_prefill.py
- VLLM_ATTENTION_BACKEND=XFORMERS pytest -v -s basic_correctness/test_chunked_prefill.py
- VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s basic_correctness/test_chunked_prefill.py

- label: Core Test # 10min
mirror_hardwares: [amd]
@@ -88,11 +89,7 @@ steps:
- vllm/distributed
- tests/core
commands:
- VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest -v -s core/test_scheduler.py
- VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest -v -s core core/test_chunked_prefill_scheduler.py
- VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest -v -s core core/block/e2e/test_correctness.py
- VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest -v -s core core/block/e2e/test_correctness_sliding_window.py
- pytest -v -s core --ignore=core/block/e2e/test_correctness.py --ignore=core/test_scheduler.py --ignore=core/test_chunked_prefill_scheduler.py --ignore=core/block/e2e/test_correctness.py --ignore=core/block/e2e/test_correctness_sliding_window.py
- pytest -v -s core

- label: Entrypoints Test # 40min
working_dir: "/vllm-workspace/tests"
@@ -184,15 +181,15 @@ steps:
- python3 offline_inference_vision_language_multi_image.py
- python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
- python3 offline_inference_encoder_decoder.py
- python3 offline_profile.py --model facebook/opt-125m

- label: Prefix Caching Test # 9min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/prefix_caching
commands:
- VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest -v -s prefix_caching/test_prefix_caching.py
- pytest -v -s prefix_caching --ignore=prefix_caching/test_prefix_caching.py
- pytest -v -s prefix_caching

- label: Samplers Test # 36min
source_file_dependencies:
@@ -216,8 +213,7 @@ steps:
- tests/spec_decode
commands:
- pytest -v -s spec_decode/e2e/test_multistep_correctness.py
- VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest -v -s spec_decode/e2e/test_compatibility.py
- VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s spec_decode --ignore=spec_decode/e2e/test_multistep_correctness.py --ignore=spec_decode/e2e/test_compatibility.py
- VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s spec_decode --ignore=spec_decode/e2e/test_multistep_correctness.py

- label: LoRA Test %N # 15min each
mirror_hardwares: [amd]
@@ -234,15 +230,16 @@ steps:
- tests/compile
commands:
- pytest -v -s compile/test_basic_correctness.py
# these tests need to be separated, cannot combine
- pytest -v -s compile/piecewise/test_simple.py
- pytest -v -s compile/piecewise/test_toy_llama.py

# TODO: re-write in comparison tests, and fix symbolic shape
# for quantization ops.
# - label: "PyTorch Fullgraph Test" # 18min
# source_file_dependencies:
# - vllm/
# - tests/compile
# commands:
# - pytest -v -s compile/test_full_graph.py
- label: "PyTorch Fullgraph Test" # 18min
source_file_dependencies:
- vllm/
- tests/compile
commands:
- pytest -v -s compile/test_full_graph.py

- label: Kernels Test %N # 1h each
mirror_hardwares: [amd]
@@ -317,33 +314,56 @@ steps:
- pytest -v -s models/test_oot_registration.py # it needs a clean process
- pytest -v -s models/*.py --ignore=models/test_oot_registration.py

- label: Decoder-only Language Models Test # 1h36min
- label: Decoder-only Language Models Test (Standard) # 35min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/models/decoder_only/language
commands:
- pytest -v -s models/decoder_only/language
- pytest -v -s models/decoder_only/language/test_models.py

- label: Decoder-only Multi-Modal Models Test # 1h31min
- label: Decoder-only Language Models Test (Extended) # 1h20min
nightly: true
source_file_dependencies:
- vllm/
- tests/models/decoder_only/language
commands:
- pytest -v -s models/decoder_only/language --ignore=models/decoder_only/language/test_models.py

- label: Decoder-only Multi-Modal Models Test (Standard)
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/models/decoder_only/audio_language
- tests/models/decoder_only/vision_language
commands:
- pytest -v -s models/decoder_only/audio_language
- pytest -v -s models/decoder_only/vision_language
- pytest -v -s models/decoder_only/audio_language -m core_model
- pytest -v -s --ignore models/decoder_only/vision_language/test_phi3v.py models/decoder_only/vision_language -m core_model

- label: Decoder-only Multi-Modal Models Test (Extended)
nightly: true
source_file_dependencies:
- vllm/
- tests/models/decoder_only/audio_language
- tests/models/decoder_only/vision_language
commands:
- pytest -v -s models/decoder_only/audio_language -m 'not core_model'
# HACK - run phi3v tests separately to sidestep this transformers bug
# https://github.com/huggingface/transformers/issues/34307
- pytest -v -s models/decoder_only/vision_language/test_phi3v.py
- pytest -v -s --ignore models/decoder_only/vision_language/test_phi3v.py models/decoder_only/vision_language -m 'not core_model'

- label: Other Models Test # 6min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/models/embedding/language
- tests/models/embedding/vision_language
- tests/models/encoder_decoder/language
- tests/models/encoder_decoder/vision_language
commands:
- pytest -v -s models/embedding/language
- pytest -v -s models/embedding/vision_language
- pytest -v -s models/encoder_decoder/language
- pytest -v -s models/encoder_decoder/vision_language

@@ -402,11 +422,11 @@ steps:
- pytest -v -s ./compile/test_basic_correctness.py
- pytest -v -s ./compile/test_wrapper.py
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep -q 'Same node test passed'
- TARGET_TEST_SUITE=L4 VLLM_ALLOW_DEPRECATED_BLOCK_MANAGER_V1=1 pytest basic_correctness/ -v -s -m distributed_2_gpus
- TARGET_TEST_SUITE=L4 pytest basic_correctness/ -v -s -m distributed_2_gpus
# Avoid importing model tests that cause CUDA reinitialization error
- pytest models/encoder_decoder/language/test_bart.py -v -s -m distributed_2_gpus
- pytest models/encoder_decoder/vision_language/test_broadcast.py -v -s -m distributed_2_gpus
- pytest models/decoder_only/vision_language/test_broadcast.py -v -s -m distributed_2_gpus
- pytest models/decoder_only/vision_language/test_models.py -v -s -m distributed_2_gpus
- pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
- pip install -e ./plugins/vllm_add_dummy_model
- pytest -v -s distributed/test_distributed_oot.py
@@ -490,6 +510,7 @@ steps:
# NOTE: don't test llama model here, it seems hf implementation is buggy
# see https://github.com/vllm-project/vllm/pull/5689 for details
- pytest -v -s distributed/test_custom_all_reduce.py
- torchrun --nproc_per_node=2 distributed/test_ca_buffer_sharing.py
- TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m distributed_2_gpus
- pytest -v -s -x lora/test_mixtral.py

31 changes: 29 additions & 2 deletions .dockerignore
@@ -1,6 +1,33 @@
/.github/
/.venv
/build
dist
Dockerfile*
vllm/*.so

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

.mypy_cache

# Distribution / packaging
.Python
/build/
cmake-build-*/
CMakeUserPresets.json
develop-eggs/
/dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST