GPTQ Support #421

Closed
wants to merge 362 commits
Changes from 6 commits
362 commits
29061ed
[Misc] Add an env var VLLM_LOGGING_PREFIX, if set, it will be prepend…
sfc-gh-zhwang Oct 23, 2024
831540c
[Model] Support E5-V (#9576)
DarkLight1337 Oct 23, 2024
51c24c9
[Build] Fix `FetchContent` multiple build issue (#9596)
ProExpertProg Oct 23, 2024
2394962
[Hardware][XPU] using current_platform.is_xpu (#9605)
MengqingCao Oct 23, 2024
3ff57eb
[Model] Initialize Florence-2 language backbone support (#9555)
Isotr0py Oct 23, 2024
c18e1a3
[VLM] Enable overriding whether post layernorm is used in vision enco…
DarkLight1337 Oct 23, 2024
31a08f5
[Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs…
alex-jw-brooks Oct 23, 2024
e7116c0
[Bugfix] Fix `_init_vision_model` in NVLM_D model (#9611)
DarkLight1337 Oct 23, 2024
dbdd3b5
[misc] comment to avoid future confusion about baichuan (#9620)
youkaichao Oct 23, 2024
e5ac6a4
[Bugfix] Fix divide by zero when serving Mamba models (#9617)
tlrmchlsmth Oct 23, 2024
fd0e2cf
[Misc] Separate total and output tokens in benchmark_throughput.py (#…
mgoin Oct 23, 2024
9013e24
[torch.compile] Adding torch compile annotations to some models (#9614)
CRZbulabula Oct 23, 2024
150b779
[Frontend] Enable Online Multi-image Support for MLlama (#9393)
alex-jw-brooks Oct 23, 2024
fc6c274
[Model] Add Qwen2-Audio model support (#9248)
faychu Oct 23, 2024
d1fbc94
gptq hpu support added
maktukmak Oct 23, 2024
b548d7a
[CI/Build] Add bot to close stale issues and PRs (#9436)
russellb Oct 23, 2024
bb01f29
[Bugfix][Model] Fix Mllama SDPA illegal memory access for batched mul…
mgoin Oct 24, 2024
b7df53c
[Bugfix] Use "vision_model" prefix for MllamaVisionModel (#9628)
mgoin Oct 24, 2024
33bab41
[Bugfix]: Make chat content text allow type content (#9358)
vrdn-23 Oct 24, 2024
056a68c
[XPU] avoid triton import for xpu (#9440)
yma11 Oct 24, 2024
836e8ef
[Bugfix] Fix PP for ChatGLM and Molmo (#9422)
DarkLight1337 Oct 24, 2024
3770071
[V1][Bugfix] Clean up requests when aborted (#9629)
WoosukKwon Oct 24, 2024
4fdc581
[core] simplify seq group code (#9569)
youkaichao Oct 24, 2024
8a02cd0
[torch.compile] Adding torch compile annotations to some models (#9639)
CRZbulabula Oct 24, 2024
295a061
[Kernel] add kernel for FATReLU (#9610)
jeejeelee Oct 24, 2024
ad6f780
[torch.compile] expanding support and fix allgather compilation (#9637)
CRZbulabula Oct 24, 2024
b979143
[Doc] Move additional tips/notes to the top (#9647)
DarkLight1337 Oct 24, 2024
f584549
[Bugfix]Disable the post_norm layer of the vision encoder for LLaVA m…
litianjian Oct 24, 2024
de662d3
Increase operation per run limit for "Close inactive issues and PRs" …
hmellor Oct 24, 2024
d27cfbf
[torch.compile] Adding torch compile annotations to some models (#9641)
CRZbulabula Oct 24, 2024
c866e00
[CI/Build] Fix VLM test failures when using transformers v4.46 (#9666)
DarkLight1337 Oct 24, 2024
722d46e
[Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints (#…
alex-jw-brooks Oct 24, 2024
e26d37a
[Log][Bugfix] Fix default value check for `image_url.detail` (#9663)
mgoin Oct 24, 2024
5944909
[Performance][Kernel] Fused_moe Performance Improvement (#9384)
charlifu Oct 24, 2024
c91ed47
[Bugfix] Remove xformers requirement for Pixtral (#9597)
mgoin Oct 24, 2024
9f7b4ba
[ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test #…
khluu Oct 25, 2024
a6f3721
[Model] add a lora module for granite 3.0 MoE models (#9673)
willmj Oct 25, 2024
9645b9f
[V1] Support sliding window attention (#9679)
WoosukKwon Oct 25, 2024
f603353
Update README_GAUDI about fp8 calibration procedure (#423)
afierka-intel Oct 25, 2024
a5136ec
Set vllm-hpu-extension to 341a77f (#428)
madamczykhabana Oct 25, 2024
a926d14
Create scorecard.yml
rozhukov Oct 25, 2024
5b7f685
Contiguous PA (#424)
mfylcek Oct 25, 2024
e3ae2eb
Revert "Contiguous PA" (#432)
madamczykhabana Oct 25, 2024
93609a2
Enable Dynamic MoE for Mixtral on 1.19.0 (#425)
tpawlows Oct 25, 2024
ca0d922
[Bugfix] Fix compressed_tensors_moe bad config.strategy (#9677)
mgoin Oct 25, 2024
228cfbd
[Doc] Improve quickstart documentation (#9256)
rafvasq Oct 25, 2024
6567e13
[Bugfix] Fix crash with llama 3.2 vision models and guided decoding (…
tjohnson31415 Oct 25, 2024
067e77f
[Bugfix] Steaming continuous_usage_stats default to False (#9709)
samos123 Oct 26, 2024
5cbdccd
[Hardware][openvino] is_openvino --> current_platform.is_openvino (#9…
MengqingCao Oct 26, 2024
55137e8
Fix: MI100 Support By Bypassing Custom Paged Attention (#9560)
MErkinSag Oct 26, 2024
07e981f
[Frontend] Bad words sampling parameter (#9717)
Alvant Oct 26, 2024
6650e6a
[Model] Add classification Task with Qwen2ForSequenceClassification …
kakao-kevin-us Oct 26, 2024
67a6882
[Misc] SpecDecodeWorker supports profiling (#9719)
Abatom Oct 27, 2024
8549c82
[core] cudagraph output with tensor weak reference (#9724)
youkaichao Oct 27, 2024
3cb07a3
[Misc] Upgrade to pytorch 2.5 (#9588)
bnellnm Oct 27, 2024
e130c40
Fix cache management in "Close inactive issues and PRs" actions workf…
hmellor Oct 27, 2024
34a9941
[Bugfix] Fix load config when using bools (#9533)
madt2709 Oct 27, 2024
4e2d95e
[Hardware][ROCM] using current_platform.is_rocm (#9642)
wangshuai09 Oct 28, 2024
32176fe
[torch.compile] support moe models (#9632)
youkaichao Oct 28, 2024
feb92fb
Fix beam search eos (#9627)
robertgshaw2-neuralmagic Oct 28, 2024
2adb440
[Bugfix] Fix ray instance detect issue (#9439)
yma11 Oct 28, 2024
3a55e77
Support long contexts with LoRA (#418)
SanjuCSudhakaran Oct 28, 2024
4fd5c4c
Add HPU specific changes to benchmark_latency.py (#436)
kdamaszk Oct 28, 2024
3e06110
Merge remote-tracking branch 'upstream/main' into HEAD
kzawora-intel Oct 28, 2024
96e0d6f
Rebase fix
kzawora-intel Oct 28, 2024
ebebbbb
fix ci fails
kzawora-intel Oct 28, 2024
4c0caa5
fix ci again
kzawora-intel Oct 28, 2024
72a2856
formatting
kzawora-intel Oct 28, 2024
8b0e4f2
[CI/Build] Adopt Mergify for auto-labeling PRs (#9259)
russellb Oct 28, 2024
2a38e6f
sarkar/Add htrandom generator for hpu (#246)
ssarkar2 Oct 28, 2024
5f8d807
[Model][VLM] Add multi-video support for LLaVA-Onevision (#8905)
litianjian Oct 28, 2024
aa0addb
Adding "torch compile" annotations to moe models (#9758)
CRZbulabula Oct 28, 2024
97b61bf
[misc] avoid circular import (#9765)
youkaichao Oct 28, 2024
76ed534
[torch.compile] add deepseek v2 compile (#9775)
youkaichao Oct 28, 2024
c5d7fb9
[Doc] fix third-party model example (#9771)
russellb Oct 29, 2024
7a4df5f
[Model][LoRA]LoRA support added for Qwen (#9622)
jeejeelee Oct 29, 2024
e74f2d4
[Doc] Specify async engine args in docs (#9726)
DarkLight1337 Oct 29, 2024
eae3d48
[Bugfix] Use temporary directory in registry (#9721)
DarkLight1337 Oct 29, 2024
3e135ae
Fix one_hot bug in torch compile mode (#427)
yuwenzho Oct 29, 2024
3203bd9
HPU: offload logits processing to CPU (#358)
madamczykhabana Oct 29, 2024
2fa54e2
Lora layers (#435)
rsshaik1 Oct 29, 2024
1dcdb37
initial works on enabling automatic prefix caching (#162)
huijjj Oct 29, 2024
ef7865b
[Frontend] re-enable multi-modality input in the new beam search impl…
FerdinandZhong Oct 29, 2024
09500f7
[Model] Add BNB quantization support for Mllama (#9720)
Isotr0py Oct 29, 2024
78e947a
Multi step scheduling (#441)
tzielinski-habana Oct 29, 2024
622b7ab
[Hardware] using current_platform.seed_everything (#9785)
wangshuai09 Oct 29, 2024
74fc2d7
[Misc] Add metrics for request queue time, forward time, and execute …
Abatom Oct 29, 2024
08600dd
Fix the log to correct guide user to install modelscope (#9793)
tastelikefeet Oct 29, 2024
0f43387
[Bugfix] Use host argument to bind to interface (#9798)
svenseeberg Oct 29, 2024
0ce7798
[Misc]: Typo fix: Renaming classes (casualLM -> causalLM) (#9801)
yannicks1 Oct 29, 2024
ac3d748
[Model] Add LlamaEmbeddingModel as an embedding Implementation of Ll…
jsato8094 Oct 29, 2024
ab6f981
[CI][Bugfix] Skip chameleon for transformers 4.46.1 (#9808)
mgoin Oct 29, 2024
7585ec9
[CI/Build] mergify: fix rules for ci/build label (#9804)
russellb Oct 29, 2024
0ad216f
[MISC] Set label value to timestamp over 0, to keep track of recent h…
coolkp Oct 29, 2024
67bdf8e
[Bugfix][Frontend] Guard against bad token ids (#9634)
joerunde Oct 29, 2024
882a1ad
[Model] tool calling support for ibm-granite/granite-20b-functioncall…
wseaton Oct 29, 2024
8d77241
[Docs] Add notes about Snowflake Meetup (#9814)
simon-mo Oct 29, 2024
bc73e98
[Bugfix] Fix prefix strings for quantized VLMs (#9772)
mgoin Oct 29, 2024
1ab6f6b
[core][distributed] fix custom allreduce in pytorch 2.5 (#9815)
youkaichao Oct 30, 2024
64cb1cd
Update README.md (#9819)
LiuXiaoxuanPKU Oct 30, 2024
226688b
[Bugfix][VLM] Make apply_fp8_linear work with >2D input (#9812)
mgoin Oct 30, 2024
62fac4b
[ci/build] Pin CI dependencies version with pip-compile (#9810)
khluu Oct 30, 2024
04a3ae0
[Bugfix] Fix multi nodes TP+PP for XPU (#8884)
yma11 Oct 30, 2024
7b0365e
[Doc] Add the DCO to CONTRIBUTING.md (#9803)
russellb Oct 30, 2024
ff5ed6e
[torch.compile] rework compile control with piecewise cudagraph (#9715)
youkaichao Oct 30, 2024
6aa6020
[Misc] Specify minimum pynvml version (#9827)
jeejeelee Oct 30, 2024
211fe91
[TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA (#9438)
WoosukKwon Oct 30, 2024
a821717
Add fp8 test to jenkins CI (#429)
afierka-intel Oct 30, 2024
79dc102
Enable FusedSDPA prefill by default (#447)
kzawora-intel Oct 30, 2024
2f7f963
Contiguous PA (#433)
mfylcek Oct 30, 2024
94858b5
Fix default value for FSDPA (#448)
madamczykhabana Oct 30, 2024
d3257b2
Fix performance of top_p and top_k calculations (#449)
kdamaszk Oct 30, 2024
cc98f1e
[CI/Build] VLM Test Consolidation (#9372)
alex-jw-brooks Oct 30, 2024
81f09cf
[Model] Support math-shepherd-mistral-7b-prm model (#9697)
Went-Liang Oct 30, 2024
9ff4511
[Misc] Add chunked-prefill support on FlashInfer. (#9781)
elfiegg Oct 30, 2024
3b3f1e7
[Bugfix][core] replace heartbeat with pid check (#9818)
joerunde Oct 30, 2024
4272c16
row vs column paralel fix
maktukmak Oct 30, 2024
33d2577
[Doc] link bug for multistep guided decoding (#9843)
joerunde Oct 30, 2024
c787f2d
[Neuron] Update Dockerfile.neuron to fix build failure (#9822)
hbikki Oct 30, 2024
c2cd1a2
[doc] update pp support (#9853)
youkaichao Oct 30, 2024
00d91c8
[CI/Build] Simplify exception trace in api server tests (#9787)
CRZbulabula Oct 30, 2024
64384bb
[torch.compile] upgrade tests (#9858)
youkaichao Oct 30, 2024
abbfb61
[Misc][OpenAI] deprecate max_tokens in favor of new max_completion_to…
gcalmettes Oct 31, 2024
890ca36
Revert "[Bugfix] Use host argument to bind to interface (#9798)" (#9852)
khluu Oct 31, 2024
d087bf8
[Model] Support quantization of Qwen2VisionTransformer (#9817)
mgoin Oct 31, 2024
3ea2dc2
[Misc] Remove deprecated arg for cuda graph capture (#9864)
ywang96 Oct 31, 2024
5608e61
[Doc] Update Qwen documentation (#9869)
jeejeelee Oct 31, 2024
d42c2a2
Reduce block fragmentation (#426)
yangw1234 Oct 31, 2024
16b8f7a
[CI/Build] Add Model Tests for Qwen2-VL (#9846)
alex-jw-brooks Oct 31, 2024
6643aa6
Create scorecard.yml (#431)
rozhukov Oct 31, 2024
77f7ef2
[CI/Build] Adding a forced docker system prune to clean up space (#9849)
Alexei-V-Ivanov-AMD Oct 31, 2024
55650c8
[Bugfix] Fix `illegal memory access` error with chunked prefill, pref…
sasha0552 Oct 31, 2024
9fb12f7
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 (…
mzusman Oct 31, 2024
b63c64d
[ci/build] Configure dependabot to update pip dependencies (#9811)
khluu Oct 31, 2024
031a799
[Bugfix][Frontend] Reject guided decoding in multistep mode (#9892)
joerunde Nov 1, 2024
96e0c9c
[torch.compile] directly register custom op (#9896)
youkaichao Nov 1, 2024
37a4947
[Bugfix] Fix layer skip logic with bitsandbytes (#9887)
mgoin Nov 1, 2024
566cd27
[torch.compile] rework test plans (#9866)
youkaichao Nov 1, 2024
93a76dd
[Model] Support bitsandbytes for MiniCPMV (#9891)
mgoin Nov 1, 2024
2b5bf20
[torch.compile] Adding torch compile annotations to some models (#9876)
CRZbulabula Nov 1, 2024
d3aa2a8
[Doc] Update multi-input support (#9906)
DarkLight1337 Nov 1, 2024
06386a6
[Frontend] Chat-based Embeddings API (#9759)
DarkLight1337 Nov 1, 2024
30a2e80
[CI/Build] Add Model Tests for PixtralHF (#9813)
mgoin Nov 1, 2024
ba0d892
[Frontend] Use a proper chat template for VLM2Vec (#9912)
DarkLight1337 Nov 1, 2024
1dd4cb2
[Bugfix] Fix edge cases for MistralTokenizer (#9625)
tjohnson31415 Nov 1, 2024
4581d2c
[Core] Refactor: Clean up unused argument in Scheduler._preempt (#9696)
andrejonasson Nov 1, 2024
aff1fd8
[torch.compile] use interpreter with stable api from pytorch (#9889)
youkaichao Nov 1, 2024
598b6d7
[Bugfix/Core] Flashinfer k_scale and v_scale (#9861)
pavanimajety Nov 1, 2024
48a90dc
g_idx check added
maktukmak Nov 1, 2024
18bd758
[1/N] pass the complete config from engine to executor (#9933)
youkaichao Nov 1, 2024
27cd36e
[Bugfix] PicklingError on RayTaskError (#9934)
GeneDer Nov 1, 2024
d151fde
[ci/build] Bump the patch-update group with 10 updates (#9897)
dependabot[bot] Nov 1, 2024
6c0b7f5
[Core][VLM] Add precise multi-modal placeholder tracking (#8346)
petersalas Nov 1, 2024
d522034
[ci/build] Have dependabot ignore pinned dependencies (#9935)
khluu Nov 1, 2024
a78dd33
[Encoder Decoder] Add flash_attn kernel support for encoder-decoder m…
sroy745 Nov 2, 2024
af7380d
[torch.compile] fix cpu broken code (#9947)
youkaichao Nov 2, 2024
eed92f1
[Docs] Update Granite 3.0 models in supported models table (#9930)
njhill Nov 2, 2024
1d4cfe2
[Doc] Updated tpu-installation.rst with more details (#9926)
mikegre-google Nov 2, 2024
e893795
[2/N] executor pass the complete config to worker/modelrunner (#9938)
youkaichao Nov 2, 2024
d6459b4
[V1] Fix `EngineArgs` refactor on V1 (#9954)
robertgshaw2-neuralmagic Nov 2, 2024
74b529c
[bugfix] fix chatglm dummy_data_for_glmv (#9955)
youkaichao Nov 2, 2024
cea808f
[3/N] model runner pass the whole config to model (#9958)
youkaichao Nov 2, 2024
1b73ab2
[CI/Build] Quoting around > (#9956)
nokados Nov 2, 2024
ae5279a
[torch.compile] Adding torch compile to vision-language models (#9946)
CRZbulabula Nov 2, 2024
3bb4bef
[bugfix] fix tsts (#9959)
youkaichao Nov 2, 2024
1f1b6d6
[V1] Support per-request seed (#9945)
njhill Nov 3, 2024
5459772
[Model] Add support for H2OVL-Mississippi models (#9747)
cooleel Nov 4, 2024
91c9ebb
[V1] Fix Configs (#9971)
robertgshaw2-neuralmagic Nov 4, 2024
c49f040
[Bugfix] Fix MiniCPMV and Mllama BNB bug (#9917)
jeejeelee Nov 4, 2024
0cc72b9
Enable HPUGraphs for lora long-contexts tests
SanjuCSudhakaran Nov 4, 2024
b67feb1
[Bugfix]Using the correct type hints (#9885)
gshtras Nov 4, 2024
24ba4d4
[CI] Add Llama2 to torch compile tests (#446)
anko-intel Nov 4, 2024
4dbcbbe
[Misc] Compute query_start_loc/seq_start_loc on CPU (#9447)
zhengy001 Nov 4, 2024
ea4aded
[Bugfix] Fix E2EL mean and median stats (#9984)
daitran2k1 Nov 4, 2024
1bb808a
Enable HPUGraphs for lora long-contexts tests (#454)
vivekgoe Nov 4, 2024
ccb5376
[Bugfix][OpenVINO] Fix circular reference #9939 (#9974)
MengqingCao Nov 4, 2024
ac6b8f1
[Frontend] Multi-Modality Support for Loading Local Image Files (#9915)
chaunceyjiang Nov 4, 2024
8d72bb2
[4/N] make quant config first-class citizen (#9978)
youkaichao Nov 4, 2024
fb2716d
[Misc]Reduce BNB static variable (#9987)
jeejeelee Nov 4, 2024
603a661
[Model] factoring out MambaMixer out of Jamba (#8993)
mzusman Nov 4, 2024
1b8e7d4
exllama state removed
maktukmak Nov 4, 2024
c305f09
removed custom ops check
maktukmak Nov 4, 2024
2ea889a
format fixes
maktukmak Nov 4, 2024
1c45f4c
[CI] Basic Integration Test For TPU (#9968)
robertgshaw2-neuralmagic Nov 4, 2024
5208dc7
[Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests ru…
hissu-hyvarinen Nov 4, 2024
6e056bc
[Doc] Update VLM doc about loading from local files (#9999)
ywang96 Nov 4, 2024
04cef2c
[Bugfix] Fix `MQLLMEngine` hanging (#9973)
robertgshaw2-neuralmagic Nov 4, 2024
9a5664d
[Misc] Refactor benchmark_throughput.py (#9779)
lk-chen Nov 4, 2024
ac04a97
[Frontend] Add max_tokens prometheus metric (#9881)
tomeras91 Nov 4, 2024
d93478b
[Bugfix] Upgrade to pytorch 2.5.1 (#10001)
bnellnm Nov 4, 2024
2094062
[4.5/N] bugfix for quant config in speculative decode (#10007)
youkaichao Nov 4, 2024
8f0a9ca
[Bugfix] Respect modules_to_not_convert within awq_marlin (#9895)
mgoin Nov 4, 2024
04bbf38
[Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep (#9994)
tlrmchlsmth Nov 5, 2024
bbc3619
[Core] Make encoder-decoder inputs a nested structure to be more comp…
DarkLight1337 Nov 5, 2024
ad23318
[Bugfix] Fixup Mamba (#10004)
tlrmchlsmth Nov 5, 2024
ac12d53
Fix SchedulerConfig params (#459)
ldurejko Nov 5, 2024
653e56c
Tensor parallelism for multi-step scheduling (#457)
tzielinski-habana Nov 5, 2024
7a83b1a
[BugFix] Lazy import ray (#10021)
GeneDer Nov 5, 2024
93dee88
[Misc] vllm CLI flags should be ordered for better user readability (…
chaunceyjiang Nov 5, 2024
1033c3e
Set tokenizers version to <0.20.2 (#460)
madamczykhabana Nov 5, 2024
5e56d88
Merge remote-tracking branch 'origin/habana_main' into private/kzawor…
kzawora-intel Nov 5, 2024
18f00d7
Merge remote-tracking branch 'upstream/main' into private/kzawora/oct…
kzawora-intel Nov 5, 2024
d397ba5
fix hpu execution
kzawora-intel Nov 5, 2024
4c0647f
format.sh
kzawora-intel Nov 5, 2024
c41788f
fix type checks
kzawora-intel Nov 5, 2024
5952d81
[Frontend] Fix tcp port reservation for api server (#10012)
russellb Nov 5, 2024
cd34029
Refactor TPU requirements file and pin build dependencies (#10010)
richardsliu Nov 5, 2024
09d3550
[Misc] Add logging for CUDA memory (#10027)
yangalan123 Nov 5, 2024
731aec5
[CI/Build] Limit github CI jobs based on files changed (#9928)
russellb Nov 5, 2024
a53046b
[Model] Support quantization of PixtralHFTransformer for PixtralHF (#…
mgoin Nov 5, 2024
d2e8033
[Feature] Update benchmark_throughput.py to support image input (#9851)
lk-chen Nov 5, 2024
b9c64c0
[Misc] Modify BNB parameter name (#9997)
jeejeelee Nov 5, 2024
0246246
[CI] Prune tests/models/decoder_only/language/* tests (#9940)
mgoin Nov 5, 2024
235366f
[CI] Prune back the number of tests in tests/kernels/* (#9932)
mgoin Nov 5, 2024
ca9844b
[bugfix] fix weak ref in piecewise cudagraph and tractable test (#10048)
youkaichao Nov 5, 2024
43300bd
[Bugfix] Properly propagate trust_remote_code settings (#10047)
zifeitong Nov 6, 2024
966e316
[Bugfix] Fix pickle of input when async output processing is on (#9931)
wallashss Nov 6, 2024
0c63c34
[Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode (…
llsj14 Nov 6, 2024
c4cacba
[v1] reduce graph capture time for piecewise cudagraph (#10059)
youkaichao Nov 6, 2024
82bfc38
[Misc] Sort the list of embedding models (#10037)
DarkLight1337 Nov 6, 2024
ffc0f2b
[Model][OpenVINO] Fix regressions from #8346 (#10045)
petersalas Nov 6, 2024
2bcbae7
[Bugfix] Fix edge-case crash when using chat with the Mistral Tekken …
tjohnson31415 Nov 6, 2024
ea928f6
[Bugfix] Gpt-j-6B patch kv_scale to k_scale path (#10063)
arakowsk-amd Nov 6, 2024
9d59b75
[Bugfix] Remove CustomChatCompletionContentPartParam multimodal input…
zifeitong Nov 6, 2024
4089985
[V1] Integrate Piecewise CUDA graphs (#10058)
WoosukKwon Nov 6, 2024
4be3a45
[distributed] add function to create ipc buffers directly (#10064)
youkaichao Nov 6, 2024
21063c1
[CI/Build] drop support for Python 3.8 EOL (#8464)
aarnphm Nov 6, 2024
a5fda50
[CI/Build] Fix large_gpu_mark reason (#10070)
Isotr0py Nov 6, 2024
a02a50e
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend (#6143)
kzawora-intel Nov 6, 2024
6a585a2
[Hotfix] Fix ruff errors (#10073)
WoosukKwon Nov 6, 2024
c3c0e90
[BugFix][Habana_main][Multistep]Fix multistep deepcopy overhead (#452)
xuechendi Nov 6, 2024
dc5cdfb
Set vllm-hpu-extension to 0063520 (#455)
madamczykhabana Nov 6, 2024
7578f3b
Oct 28 rebase (#439)
kzawora-intel Nov 6, 2024
07a6441
Revert "Oct 28 rebase" (#466)
kzawora-intel Nov 6, 2024
5812cb6
Oct 28 rebase - attempt 2 (#467)
kzawora-intel Nov 6, 2024
40882f3
Merge commit 'a5fda50a10641e47c0c290907f30ef2add6d4e7a' into HEAD
kzawora-intel Nov 6, 2024
8e62377
format.sh
kzawora-intel Nov 6, 2024
5eb7f3d
Nov 6 rebase (sans vllm-project#6143) (#468)
kzawora-intel Nov 6, 2024
0a17a2e
Fix missed conflict (#469)
kzawora-intel Nov 6, 2024
b91403a
Merge commit 'a02a50e' into HEAD
kzawora-intel Nov 6, 2024
843ae37
Merge commit '6a585a2' into HEAD
kzawora-intel Nov 6, 2024
60b981e
Align fork with HPU upstream code (#465)
michalkuligowski Nov 6, 2024
3c39626
The output tensor from sampling is the input_tokens to the (#471)
tzielinski-habana Nov 6, 2024
66a67fc
gptq hpu support added
maktukmak Oct 23, 2024
f9cf700
row vs column paralel fix
maktukmak Oct 30, 2024
cbcba5d
g_idx check added
maktukmak Nov 1, 2024
a6ab053
exllama state removed
maktukmak Nov 4, 2024
9b323b5
removed custom ops check
maktukmak Nov 4, 2024
7077a99
format fixes
maktukmak Nov 4, 2024
4ef6b7e
Merge branch 'gptq_hpu' of https://github.com/maktukmak/vllm-fork int…
maktukmak Nov 6, 2024
32 changes: 30 additions & 2 deletions vllm/_core_ext.py
@@ -131,10 +131,38 @@ def is_ieee_754(self) -> bool:
not self._finite_values_only

def __str__(self) -> str:
raise NotImplementedError
"""
naming generally follows: https://github.com/jax-ml/ml_dtypes

Reviewer:

1. Start with a capital letter.
2. This cannot be merged as is, because this class, ScalarType, is widely used and changing it will make it difficult to upstream. Use a derived class.

Author:

I updated this class from the original repo because it blocked me from testing the feature. The class is defined here: https://github.com/vllm-project/vllm/blob/fb2716d64117aaa6c36b97b09765aa10a89e2fe5/vllm/scalar_type.py#L19

Let me know if there is a better way.

michalkuligowski (Nov 6, 2024):

You are right. vLLM was rebased and these methods do have those definitions now (it is now under scalar_type.py). Please rebase.

for floating point types (leading f) the scheme is:
`float<size_bits>_e<exponent_bits>m<mantissa_bits>[flags]`
flags:
- no-flags: means it follows IEEE 754 conventions
- f: means finite values only (no infinities)
- n: means nans are supported (non-standard encoding)
for integer types the scheme is:
`[u]int<size_bits>[b<bias>]`
- if bias is not present it means its zero
"""
if self.is_floating_point():
ret = "float" + str(self.size_bits) + "_e" + str(
self.exponent) + "m" + str(self.mantissa)

if not self.is_ieee_754():
if self._finite_values_only:
ret = ret + "f"
if self.nan_repr != NanRepr.NONE:
ret = ret + "n"

return ret
else:
ret = ("int" if self.is_signed() else "uint") + str(
self.size_bits)
if self.has_bias():
ret = ret + "b" + str(self.bias)
return ret

def __repr__(self) -> str:
raise NotImplementedError
return "ScalarType." + self.__str__()

# __len__ needs to be defined (and has to throw TypeError) for pytorch's
# opcheck to work.
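For reference, the naming scheme the new docstring describes yields names of the following shape (a standalone illustration inferred from the rules above, not output captured from vLLM):

# Inferred examples of the naming scheme, not taken from the PR:
# - unsigned 4-bit integer with bias 8              -> "uint4b8"
# - signed 8-bit integer, zero bias                 -> "int8"
# - fp8 with 4 exponent / 3 mantissa bits, IEEE 754 -> "float8_e4m3"
# - same, finite-only with non-standard NaNs        -> "float8_e4m3fn"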
15 changes: 14 additions & 1 deletion vllm/_custom_ops.py
@@ -12,6 +12,10 @@

logger = init_logger(__name__)

if current_platform.is_hpu():
import habana_frameworks.torch.core as htcore
convert_from_uint4 = torch.ops.hpu.convert_from_uint4

if not current_platform.is_tpu() and not current_platform.is_hpu():
try:
import vllm._C
@@ -266,7 +270,16 @@ def awq_gemm(input: torch.Tensor, qweight: torch.Tensor, qzeros: torch.Tensor,
return torch.ops._C.awq_gemm(input, qweight, qzeros, scales, split_k_iters)


# gptq
def gptq_hpu_gemm(a: torch.Tensor, b_q_weight: torch.Tensor,
b_gptq_qzeros: torch.Tensor, b_gptq_scales: torch.Tensor,
b_g_idx: torch.Tensor, use_exllama: bool,
bit: int) -> torch.Tensor:

weight = convert_from_uint4(b_q_weight, b_gptq_scales, b_gptq_qzeros,
a.dtype)
return torch.matmul(a, weight)


def gptq_gemm(a: torch.Tensor, b_q_weight: torch.Tensor,
b_gptq_qzeros: torch.Tensor, b_gptq_scales: torch.Tensor,
b_g_idx: torch.Tensor, use_exllama: bool,
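Conceptually, gptq_hpu_gemm above dequantizes the packed GPTQ weights via the HPU op and then runs a plain matmul. The sketch below is a CPU-only reference of that idea, mirroring the unpack helpers added in gptq_hpu.py further down; it is not the exact semantics of torch.ops.hpu.convert_from_uint4, and the zero-point and group handling here are assumptions:

import torch

def gptq_dequant_reference(qweight, qzeros, scales, bits=4, group_size=128):
    # For 4-bit: qweight (K // 8, N) int32, qzeros (K // group_size, N // 8) int32,
    # scales (K // group_size, N) float -- the "CUDA old" GPTQ layout.
    shifts = torch.arange(0, 32, bits, dtype=torch.int32)
    w = torch.bitwise_right_shift(qweight.unsqueeze(1), shifts.view(1, -1, 1))
    w = torch.bitwise_and(w, (1 << bits) - 1).reshape(-1, qweight.shape[1])
    z = torch.bitwise_right_shift(qzeros.unsqueeze(2), shifts.view(1, 1, -1))
    z = torch.bitwise_and(z + 1, (1 << bits) - 1).reshape(qzeros.shape[0], -1)
    group = torch.arange(w.shape[0]) // group_size
    return (w.float() - z[group].float()) * scales[group].float()

def gptq_matmul_reference(a, qweight, qzeros, scales, bits=4, group_size=128):
    weight = gptq_dequant_reference(qweight, qzeros, scales, bits, group_size)
    return a @ weight.to(a.dtype)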
4 changes: 2 additions & 2 deletions vllm/model_executor/layers/linear.py
@@ -26,8 +26,8 @@
"CompressedTensorsLinearMethod", "AWQMarlinLinearMethod",
"AWQLinearMethod", "GPTQMarlinLinearMethod", "Fp8LinearMethod",
"MarlinLinearMethod", "QQQLinearMethod", "GPTQMarlin24LinearMethod",
"TPUInt8LinearMethod", "GPTQLinearMethod", "FBGEMMFp8LinearMethod",
"ModelOptFp8LinearMethod", "IPEXAWQLinearMethod"
"TPUInt8LinearMethod", "GPTQLinearMethod", "GPTQHPULinearMethod",
"FBGEMMFp8LinearMethod", "ModelOptFp8LinearMethod", "IPEXAWQLinearMethod"
]


2 changes: 2 additions & 0 deletions vllm/model_executor/layers/quantization/__init__.py
@@ -17,6 +17,7 @@
from vllm.model_executor.layers.quantization.fp8 import Fp8Config
from vllm.model_executor.layers.quantization.gguf import GGUFConfig
from vllm.model_executor.layers.quantization.gptq import GPTQConfig
from vllm.model_executor.layers.quantization.gptq_hpu import GPTQHPUConfig
from vllm.model_executor.layers.quantization.gptq_marlin import (
GPTQMarlinConfig)
from vllm.model_executor.layers.quantization.gptq_marlin_24 import (
@@ -46,6 +47,7 @@
"gptq_marlin": GPTQMarlinConfig,
"awq_marlin": AWQMarlinConfig,
"gptq": GPTQConfig,
"gptq_hpu": GPTQHPUConfig,
"compressed-tensors": CompressedTensorsConfig,
"bitsandbytes": BitsAndBytesConfig,
"inc": INCConfig,
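With "gptq_hpu" registered above, the backend can be selected explicitly through the usual quantization argument. A minimal sketch, assuming an HPU build of vLLM and a placeholder model path:

from vllm import LLM, SamplingParams

# "gptq_hpu" matches the key added to the quantization-method registry above;
# the model path is a placeholder for any GPTQ-quantized checkpoint.
llm = LLM(model="/path/to/gptq-quantized-model",
          quantization="gptq_hpu",
          dtype="bfloat16")  # GPTQHPUConfig only advertises bfloat16 activations
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)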
291 changes: 291 additions & 0 deletions vllm/model_executor/layers/quantization/gptq_hpu.py
@@ -0,0 +1,291 @@
from fractions import Fraction
from typing import Any, Dict, List, Optional

import torch
from torch.nn.parameter import Parameter

from vllm import _custom_ops as ops
from vllm.model_executor.layers.linear import LinearBase, LinearMethodBase
from vllm.model_executor.layers.quantization.base_config import (
QuantizationConfig)
from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead
from vllm.model_executor.parameter import (ChannelQuantScaleParameter,
GroupQuantScaleParameter,
PackedColumnParameter,
PackedvLLMParameter,
RowvLLMParameter)


class GPTQHPUConfig(QuantizationConfig):
"""Config class for GPTQ.

Reference: https://arxiv.org/abs/2210.17323
"""

def __init__(
self,
weight_bits: int,
group_size: int,
desc_act: bool,
lm_head_quantized: bool,
) -> None:
self.weight_bits = weight_bits
self.group_size = group_size
self.desc_act = desc_act
self.lm_head_quantized = lm_head_quantized
self.pack_factor = Fraction(32, self.weight_bits)
if self.weight_bits not in [2, 3, 4, 8]:
raise ValueError(
"Currently, only 2/3/4/8-bit weight quantization is "
f"supported for GPTQ, but got {self.weight_bits} bits.")

def __repr__(self) -> str:
return (f"GPTQHPUConfig(weight_bits={self.weight_bits}, "
f"group_size={self.group_size}, "
f"desc_act={self.desc_act}),"
f"lm_head_quantized={self.lm_head_quantized}")

@classmethod
def get_name(cls) -> str:
return "gptq_hpu"

@classmethod
def get_supported_act_dtypes(cls) -> List[torch.dtype]:
return [torch.bfloat16]

@classmethod
# Need to figure it out
def get_min_capability(cls) -> int:
return 0

@classmethod
def get_config_filenames(cls) -> List[str]:
return ["quantize_config.json"]

@classmethod
def from_config(cls, config: Dict[str, Any]) -> "GPTQHPUConfig":
weight_bits = cls.get_from_keys(config, ["bits"])
group_size = cls.get_from_keys(config, ["group_size"])
desc_act = cls.get_from_keys(config, ["desc_act"])
lm_head_quantized = cls.get_from_keys_or(config, ["lm_head"],
default=False)
return cls(weight_bits, group_size, desc_act, lm_head_quantized)
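# For illustration only (not part of this PR): from_config above would accept a
# quantize_config.json along these lines, with the values being assumptions:
# {
#     "bits": 4,
#     "group_size": 128,
#     "desc_act": false,
#     "lm_head": false
# }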

@classmethod
def override_quantization_method(cls, hf_quant_cfg,
user_quant) -> Optional[str]:

is_valid_user_quant = user_quant == "gptq_hpu"

if is_valid_user_quant:
return cls.get_name()

return None

def get_quant_method(self, layer: torch.nn.Module,
prefix: str) -> Optional["GPTQHPULinearMethod"]:
if (isinstance(layer, LinearBase) or
(isinstance(layer, ParallelLMHead) and self.lm_head_quantized)):
return GPTQHPULinearMethod(self)
return None

def get_scaled_act_names(self) -> List[str]:
return []


class GPTQHPULinearMethod(LinearMethodBase):
"""Linear method for GPTQ.

Args:
quant_config: The GPTQ quantization config.
"""

def __init__(self, quant_config: GPTQHPUConfig):
self.quant_config = quant_config

def create_weights(
self,
layer: torch.nn.Module,
input_size_per_partition: int,
output_partition_sizes: List[int],
input_size: int,
output_size: int,
params_dtype: torch.dtype,
**extra_weight_attrs,
):
del output_size # Unused.
weight_loader = extra_weight_attrs.get("weight_loader")
if input_size_per_partition % self.quant_config.group_size != 0:
raise ValueError(
"The input size is not aligned with the quantized "
"weight shape. This can be caused by too large "
"tensor parallel size.")
output_size_per_partition = sum(output_partition_sizes)
if (output_size_per_partition % self.quant_config.pack_factor.numerator
!= 0):
raise ValueError(
"The output size is not aligned with the quantized "
"weight shape. This can be caused by too large "
"tensor parallel size.")

if self.quant_config.group_size != -1:
group_size = self.quant_config.group_size
else:
group_size = input_size
scale_and_zero_size = input_size // group_size
scale_and_zero_input_dim = None

qweight = PackedvLLMParameter(
data=torch.empty(
input_size_per_partition // self.quant_config.pack_factor,
output_size_per_partition,
dtype=torch.int32,
),
input_dim=0,
output_dim=1,
packed_dim=0,
packed_factor=self.quant_config.pack_factor,
weight_loader=weight_loader)

g_idx = RowvLLMParameter(data=torch.tensor(
[
i // self.quant_config.group_size
for i in range(input_size_per_partition)
],
dtype=torch.int32,
),
input_dim=0,
weight_loader=weight_loader)
qzeros_args = {
"data":
torch.empty(
scale_and_zero_size,
output_size_per_partition // self.quant_config.pack_factor,
dtype=torch.int32,
),
"weight_loader":
weight_loader
}
weight_scale_args = {
"data":
torch.empty(
scale_and_zero_size,
output_size_per_partition,
dtype=params_dtype,
),
"weight_loader":
weight_loader
}
if scale_and_zero_input_dim is None:
scales = ChannelQuantScaleParameter(output_dim=1,
**weight_scale_args)
qzeros = PackedColumnParameter(
output_dim=1,
packed_dim=1,
packed_factor=self.quant_config.pack_factor,
**qzeros_args)

else:
scales = GroupQuantScaleParameter(output_dim=1,
input_dim=0,
**weight_scale_args)
qzeros = PackedvLLMParameter(
input_dim=0,
output_dim=1,
packed_dim=1,
packed_factor=self.quant_config.pack_factor,
**qzeros_args)

layer.register_parameter("qweight", qweight)
layer.register_parameter("g_idx", g_idx)
layer.register_parameter("qzeros", qzeros)
layer.register_parameter("scales", scales)

def process_weights_after_loading(self, layer: torch.nn.Module) -> None:

self.wf = torch.tensor(list(range(0, 32,
self.quant_config.weight_bits)),
dtype=torch.int32).unsqueeze(0)
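# wf above holds the bit offset of each packed element inside an int32 word,
# e.g. [0, 4, 8, ..., 28] for 4-bit weights; the pack/unpack helpers below
# shift by these offsets to interleave or extract eight 4-bit values per word.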
weight = self.unpack_weight_from_cuda_old_format(layer)
layer.qweight.data = self.pack_tensor(weight).to('hpu')

zeros = self.unpack_zeros_from_cuda_old_format(layer).cpu()
layer.qzeros.data = self.pack_tensor(zeros).to('hpu')

# TODO: Support group indexing and remove the check
columns = layer.qweight.shape[0]
if self.quant_config.group_size > 0:
g_idx_trivial = [
i // self.quant_config.group_size for i in range(columns)
]
else:
g_idx_trivial = [0] * columns
g_idx_trivial = torch.tensor(g_idx_trivial, dtype=torch.int32)
assert torch.equal(
layer.g_idx,
g_idx_trivial), "Non-trivial tensor g_idx is not supported"

# for torch.compile
layer.qweight = Parameter(layer.qweight.data, requires_grad=False)
layer.qzeros = Parameter(layer.qzeros.data, requires_grad=False)
layer.g_idx = Parameter(layer.g_idx.data, requires_grad=False)
layer.scales = Parameter(layer.scales.data, requires_grad=False)

def apply(self,
layer: torch.nn.Module,
x: torch.Tensor,
bias: Optional[torch.Tensor] = None) -> torch.Tensor:

out_shape = x.shape[:-1]
if hasattr(layer, 'output_size_per_partition'):
out_shape += (layer.output_size_per_partition, )
else:
out_shape += (layer.output_size, )

reshaped_x = x.reshape(-1, x.shape[-1])

output = ops.gptq_hpu_gemm(reshaped_x, layer.qweight, layer.qzeros,
layer.scales, layer.g_idx, None,
self.quant_config.weight_bits)
if bias is not None:
output.add_(bias)
return output.reshape(out_shape)

def pack_tensor(self, input, bits=4):
normal = input.to(torch.int32)
q = torch.sum(torch.bitwise_left_shift(
normal.reshape(normal.shape[0], -1, (32 // bits)),
self.wf.unsqueeze(0)),
dim=-1).to(torch.int32)

return q

def unpack_zeros_from_cuda_old_format(self, layer):

bits = self.quant_config.weight_bits
zeros = torch.bitwise_right_shift(
torch.unsqueeze(layer.qzeros.to('cpu'),
2).expand(-1, -1, 32 // bits),
self.wf.unsqueeze(0),
).to(torch.int16 if bits == 8 else torch.int8)

zeros = zeros + 1
zeros = torch.bitwise_and(zeros, (2**bits) - 1).to(
layer.scales.dtype) # NOTE: It appears that casting here
# after the `zeros = zeros + 1` is important.
zeros = zeros.reshape(-1, zeros.shape[1] * zeros.shape[2])
return zeros

def unpack_weight_from_cuda_old_format(self, layer):

qweight = layer.qweight.cpu()
bits = self.quant_config.weight_bits

weight = torch.bitwise_right_shift(
torch.unsqueeze(qweight, 1).expand(-1, 32 // bits, -1),
self.wf.unsqueeze(-1),
).to(torch.int16 if bits == 8 else torch.int8)
weight = torch.bitwise_and(weight, (2**bits) - 1)
weight = weight.reshape(
(weight.shape[0] * weight.shape[1], weight.shape[2]))
return weight
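As a sanity check on the packing scheme used by pack_tensor and the unpack helpers above, the same bit layout can be reproduced with a few lines of standalone PyTorch (an illustrative round trip; tensor sizes are arbitrary):

import torch

bits = 4
wf = torch.arange(0, 32, bits, dtype=torch.int32).unsqueeze(0)   # [0, 4, ..., 28]

# Eight 4-bit values per int32 word, as in unpack_weight_from_cuda_old_format.
values = torch.randint(0, 2**bits, (16, 8), dtype=torch.int32)   # (rows, 32 // bits)

# Pack: shift each value to its bit offset and sum into one int32 per row,
# mirroring pack_tensor above.
packed = torch.sum(torch.bitwise_left_shift(values, wf), dim=-1).to(torch.int32)

# Unpack: shift right by the same offsets and mask off the low `bits` bits.
unpacked = torch.bitwise_and(
    torch.bitwise_right_shift(packed.unsqueeze(-1), wf), (1 << bits) - 1)

assert torch.equal(unpacked, values)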