forked from vllm-project/vllm
GPTQ Support #421
Closed

Commits (362 total)
29061ed
[Misc] Add an env var VLLM_LOGGING_PREFIX, if set, it will be prepend…
sfc-gh-zhwang 831540c
[Model] Support E5-V (#9576)
DarkLight1337 51c24c9
[Build] Fix `FetchContent` multiple build issue (#9596)
ProExpertProg 2394962
[Hardware][XPU] using current_platform.is_xpu (#9605)
MengqingCao 3ff57eb
[Model] Initialize Florence-2 language backbone support (#9555)
Isotr0py c18e1a3
[VLM] Enable overriding whether post layernorm is used in vision enco…
DarkLight1337 31a08f5
[Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs…
alex-jw-brooks e7116c0
[Bugfix] Fix `_init_vision_model` in NVLM_D model (#9611)
DarkLight1337 dbdd3b5
[misc] comment to avoid future confusion about baichuan (#9620)
youkaichao e5ac6a4
[Bugfix] Fix divide by zero when serving Mamba models (#9617)
tlrmchlsmth fd0e2cf
[Misc] Separate total and output tokens in benchmark_throughput.py (#…
mgoin 9013e24
[torch.compile] Adding torch compile annotations to some models (#9614)
CRZbulabula 150b779
[Frontend] Enable Online Multi-image Support for MLlama (#9393)
alex-jw-brooks fc6c274
[Model] Add Qwen2-Audio model support (#9248)
faychu d1fbc94
gptq hpu support added
maktukmak b548d7a
[CI/Build] Add bot to close stale issues and PRs (#9436)
russellb bb01f29
[Bugfix][Model] Fix Mllama SDPA illegal memory access for batched mul…
mgoin b7df53c
[Bugfix] Use "vision_model" prefix for MllamaVisionModel (#9628)
mgoin 33bab41
[Bugfix]: Make chat content text allow type content (#9358)
vrdn-23 056a68c
[XPU] avoid triton import for xpu (#9440)
yma11 836e8ef
[Bugfix] Fix PP for ChatGLM and Molmo (#9422)
DarkLight1337 3770071
[V1][Bugfix] Clean up requests when aborted (#9629)
WoosukKwon 4fdc581
[core] simplify seq group code (#9569)
youkaichao 8a02cd0
[torch.compile] Adding torch compile annotations to some models (#9639)
CRZbulabula 295a061
[Kernel] add kernel for FATReLU (#9610)
jeejeelee ad6f780
[torch.compile] expanding support and fix allgather compilation (#9637)
CRZbulabula b979143
[Doc] Move additional tips/notes to the top (#9647)
DarkLight1337 f584549
[Bugfix]Disable the post_norm layer of the vision encoder for LLaVA m…
litianjian de662d3
Increase operation per run limit for "Close inactive issues and PRs" …
hmellor d27cfbf
[torch.compile] Adding torch compile annotations to some models (#9641)
CRZbulabula c866e00
[CI/Build] Fix VLM test failures when using transformers v4.46 (#9666)
DarkLight1337 722d46e
[Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints (#…
alex-jw-brooks e26d37a
[Log][Bugfix] Fix default value check for `image_url.detail` (#9663)
mgoin 5944909
[Performance][Kernel] Fused_moe Performance Improvement (#9384)
charlifu c91ed47
[Bugfix] Remove xformers requirement for Pixtral (#9597)
mgoin 9f7b4ba
[ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test #…
khluu a6f3721
[Model] add a lora module for granite 3.0 MoE models (#9673)
willmj 9645b9f
[V1] Support sliding window attention (#9679)
WoosukKwon f603353
Update README_GAUDI about fp8 calibration procedure (#423)
afierka-intel a5136ec
Set vllm-hpu-extension to 341a77f (#428)
madamczykhabana a926d14
Create scorecard.yml
rozhukov 5b7f685
Contiguous PA (#424)
mfylcek e3ae2eb
Revert "Contiguous PA" (#432)
madamczykhabana 93609a2
Enable Dynamic MoE for Mixtral on 1.19.0 (#425)
tpawlows ca0d922
[Bugfix] Fix compressed_tensors_moe bad config.strategy (#9677)
mgoin 228cfbd
[Doc] Improve quickstart documentation (#9256)
rafvasq 6567e13
[Bugfix] Fix crash with llama 3.2 vision models and guided decoding (…
tjohnson31415 067e77f
[Bugfix] Steaming continuous_usage_stats default to False (#9709)
samos123 5cbdccd
[Hardware][openvino] is_openvino --> current_platform.is_openvino (#9…
MengqingCao 55137e8
Fix: MI100 Support By Bypassing Custom Paged Attention (#9560)
MErkinSag 07e981f
[Frontend] Bad words sampling parameter (#9717)
Alvant 6650e6a
[Model] Add classification Task with Qwen2ForSequenceClassification …
kakao-kevin-us 67a6882
[Misc] SpecDecodeWorker supports profiling (#9719)
Abatom 8549c82
[core] cudagraph output with tensor weak reference (#9724)
youkaichao 3cb07a3
[Misc] Upgrade to pytorch 2.5 (#9588)
bnellnm e130c40
Fix cache management in "Close inactive issues and PRs" actions workf…
hmellor 34a9941
[Bugfix] Fix load config when using bools (#9533)
madt2709 4e2d95e
[Hardware][ROCM] using current_platform.is_rocm (#9642)
wangshuai09 32176fe
[torch.compile] support moe models (#9632)
youkaichao feb92fb
Fix beam search eos (#9627)
robertgshaw2-neuralmagic 2adb440
[Bugfix] Fix ray instance detect issue (#9439)
yma11 3a55e77
Support long contexts with LoRA (#418)
SanjuCSudhakaran 4fd5c4c
Add HPU specific changes to benchmark_latency.py (#436)
kdamaszk 3e06110
Merge remote-tracking branch 'upstream/main' into HEAD
kzawora-intel 96e0d6f
Rebase fix
kzawora-intel ebebbbb
fix ci fails
kzawora-intel 4c0caa5
fix ci again
kzawora-intel 72a2856
formatting
kzawora-intel 8b0e4f2
[CI/Build] Adopt Mergify for auto-labeling PRs (#9259)
russellb 2a38e6f
sarkar/Add htrandom generator for hpu (#246)
ssarkar2 5f8d807
[Model][VLM] Add multi-video support for LLaVA-Onevision (#8905)
litianjian aa0addb
Adding "torch compile" annotations to moe models (#9758)
CRZbulabula 97b61bf
[misc] avoid circular import (#9765)
youkaichao 76ed534
[torch.compile] add deepseek v2 compile (#9775)
youkaichao c5d7fb9
[Doc] fix third-party model example (#9771)
russellb 7a4df5f
[Model][LoRA]LoRA support added for Qwen (#9622)
jeejeelee e74f2d4
[Doc] Specify async engine args in docs (#9726)
DarkLight1337 eae3d48
[Bugfix] Use temporary directory in registry (#9721)
DarkLight1337 3e135ae
Fix one_hot bug in torch compile mode (#427)
yuwenzho 3203bd9
HPU: offload logits processing to CPU (#358)
madamczykhabana 2fa54e2
Lora layers (#435)
rsshaik1 1dcdb37
initial works on enabling automatic prefix caching (#162)
huijjj ef7865b
[Frontend] re-enable multi-modality input in the new beam search impl…
FerdinandZhong 09500f7
[Model] Add BNB quantization support for Mllama (#9720)
Isotr0py 78e947a
Multi step scheduling (#441)
tzielinski-habana 622b7ab
[Hardware] using current_platform.seed_everything (#9785)
wangshuai09 74fc2d7
[Misc] Add metrics for request queue time, forward time, and execute …
Abatom 08600dd
Fix the log to correct guide user to install modelscope (#9793)
tastelikefeet 0f43387
[Bugfix] Use host argument to bind to interface (#9798)
svenseeberg 0ce7798
[Misc]: Typo fix: Renaming classes (casualLM -> causalLM) (#9801)
yannicks1 ac3d748
[Model] Add LlamaEmbeddingModel as an embedding Implementation of Ll…
jsato8094 ab6f981
[CI][Bugfix] Skip chameleon for transformers 4.46.1 (#9808)
mgoin 7585ec9
[CI/Build] mergify: fix rules for ci/build label (#9804)
russellb 0ad216f
[MISC] Set label value to timestamp over 0, to keep track of recent h…
coolkp 67bdf8e
[Bugfix][Frontend] Guard against bad token ids (#9634)
joerunde 882a1ad
[Model] tool calling support for ibm-granite/granite-20b-functioncall…
wseaton 8d77241
[Docs] Add notes about Snowflake Meetup (#9814)
simon-mo bc73e98
[Bugfix] Fix prefix strings for quantized VLMs (#9772)
mgoin 1ab6f6b
[core][distributed] fix custom allreduce in pytorch 2.5 (#9815)
youkaichao 64cb1cd
Update README.md (#9819)
LiuXiaoxuanPKU 226688b
[Bugfix][VLM] Make apply_fp8_linear work with >2D input (#9812)
mgoin 62fac4b
[ci/build] Pin CI dependencies version with pip-compile (#9810)
khluu 04a3ae0
[Bugfix] Fix multi nodes TP+PP for XPU (#8884)
yma11 7b0365e
[Doc] Add the DCO to CONTRIBUTING.md (#9803)
russellb ff5ed6e
[torch.compile] rework compile control with piecewise cudagraph (#9715)
youkaichao 6aa6020
[Misc] Specify minimum pynvml version (#9827)
jeejeelee 211fe91
[TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA (#9438)
WoosukKwon a821717
Add fp8 test to jenkins CI (#429)
afierka-intel 79dc102
Enable FusedSDPA prefill by default (#447)
kzawora-intel 2f7f963
Contiguous PA (#433)
mfylcek 94858b5
Fix default value for FSDPA (#448)
madamczykhabana d3257b2
Fix performance of top_p and top_k calculations (#449)
kdamaszk cc98f1e
[CI/Build] VLM Test Consolidation (#9372)
alex-jw-brooks 81f09cf
[Model] Support math-shepherd-mistral-7b-prm model (#9697)
Went-Liang 9ff4511
[Misc] Add chunked-prefill support on FlashInfer. (#9781)
elfiegg 3b3f1e7
[Bugfix][core] replace heartbeat with pid check (#9818)
joerunde 4272c16
row vs column paralel fix
maktukmak 33d2577
[Doc] link bug for multistep guided decoding (#9843)
joerunde c787f2d
[Neuron] Update Dockerfile.neuron to fix build failure (#9822)
hbikki c2cd1a2
[doc] update pp support (#9853)
youkaichao 00d91c8
[CI/Build] Simplify exception trace in api server tests (#9787)
CRZbulabula 64384bb
[torch.compile] upgrade tests (#9858)
youkaichao abbfb61
[Misc][OpenAI] deprecate max_tokens in favor of new max_completion_to…
gcalmettes 890ca36
Revert "[Bugfix] Use host argument to bind to interface (#9798)" (#9852)
khluu d087bf8
[Model] Support quantization of Qwen2VisionTransformer (#9817)
mgoin 3ea2dc2
[Misc] Remove deprecated arg for cuda graph capture (#9864)
ywang96 5608e61
[Doc] Update Qwen documentation (#9869)
jeejeelee d42c2a2
Reduce block fragmentation (#426)
yangw1234 16b8f7a
[CI/Build] Add Model Tests for Qwen2-VL (#9846)
alex-jw-brooks 6643aa6
Create scorecard.yml (#431)
rozhukov 77f7ef2
[CI/Build] Adding a forced docker system prune to clean up space (#9849)
Alexei-V-Ivanov-AMD 55650c8
[Bugfix] Fix `illegal memory access` error with chunked prefill, pref…
sasha0552 9fb12f7
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 (…
mzusman b63c64d
[ci/build] Configure dependabot to update pip dependencies (#9811)
khluu 031a799
[Bugfix][Frontend] Reject guided decoding in multistep mode (#9892)
joerunde 96e0c9c
[torch.compile] directly register custom op (#9896)
youkaichao 37a4947
[Bugfix] Fix layer skip logic with bitsandbytes (#9887)
mgoin 566cd27
[torch.compile] rework test plans (#9866)
youkaichao 93a76dd
[Model] Support bitsandbytes for MiniCPMV (#9891)
mgoin 2b5bf20
[torch.compile] Adding torch compile annotations to some models (#9876)
CRZbulabula d3aa2a8
[Doc] Update multi-input support (#9906)
DarkLight1337 06386a6
[Frontend] Chat-based Embeddings API (#9759)
DarkLight1337 30a2e80
[CI/Build] Add Model Tests for PixtralHF (#9813)
mgoin ba0d892
[Frontend] Use a proper chat template for VLM2Vec (#9912)
DarkLight1337 1dd4cb2
[Bugfix] Fix edge cases for MistralTokenizer (#9625)
tjohnson31415 4581d2c
[Core] Refactor: Clean up unused argument in Scheduler._preempt (#9696)
andrejonasson aff1fd8
[torch.compile] use interpreter with stable api from pytorch (#9889)
youkaichao 598b6d7
[Bugfix/Core] Flashinfer k_scale and v_scale (#9861)
pavanimajety 48a90dc
g_idx check added
maktukmak 18bd758
[1/N] pass the complete config from engine to executor (#9933)
youkaichao 27cd36e
[Bugfix] PicklingError on RayTaskError (#9934)
GeneDer d151fde
[ci/build] Bump the patch-update group with 10 updates (#9897)
dependabot[bot] 6c0b7f5
[Core][VLM] Add precise multi-modal placeholder tracking (#8346)
petersalas d522034
[ci/build] Have dependabot ignore pinned dependencies (#9935)
khluu a78dd33
[Encoder Decoder] Add flash_attn kernel support for encoder-decoder m…
sroy745 af7380d
[torch.compile] fix cpu broken code (#9947)
youkaichao eed92f1
[Docs] Update Granite 3.0 models in supported models table (#9930)
njhill 1d4cfe2
[Doc] Updated tpu-installation.rst with more details (#9926)
mikegre-google e893795
[2/N] executor pass the complete config to worker/modelrunner (#9938)
youkaichao d6459b4
[V1] Fix `EngineArgs` refactor on V1 (#9954)
robertgshaw2-neuralmagic 74b529c
[bugfix] fix chatglm dummy_data_for_glmv (#9955)
youkaichao cea808f
[3/N] model runner pass the whole config to model (#9958)
youkaichao 1b73ab2
[CI/Build] Quoting around > (#9956)
nokados ae5279a
[torch.compile] Adding torch compile to vision-language models (#9946)
CRZbulabula 3bb4bef
[bugfix] fix tsts (#9959)
youkaichao 1f1b6d6
[V1] Support per-request seed (#9945)
njhill 5459772
[Model] Add support for H2OVL-Mississippi models (#9747)
cooleel 91c9ebb
[V1] Fix Configs (#9971)
robertgshaw2-neuralmagic c49f040
[Bugfix] Fix MiniCPMV and Mllama BNB bug (#9917)
jeejeelee 0cc72b9
Enable HPUGraphs for lora long-contexts tests
SanjuCSudhakaran b67feb1
[Bugfix]Using the correct type hints (#9885)
gshtras 24ba4d4
[CI] Add Llama2 to torch compile tests (#446)
anko-intel 4dbcbbe
[Misc] Compute query_start_loc/seq_start_loc on CPU (#9447)
zhengy001 ea4aded
[Bugfix] Fix E2EL mean and median stats (#9984)
daitran2k1 1bb808a
Enable HPUGraphs for lora long-contexts tests (#454)
vivekgoe ccb5376
[Bugfix][OpenVINO] Fix circular reference #9939 (#9974)
MengqingCao ac6b8f1
[Frontend] Multi-Modality Support for Loading Local Image Files (#9915)
chaunceyjiang 8d72bb2
[4/N] make quant config first-class citizen (#9978)
youkaichao fb2716d
[Misc]Reduce BNB static variable (#9987)
jeejeelee 603a661
[Model] factoring out MambaMixer out of Jamba (#8993)
mzusman 1b8e7d4
exllama state removed
maktukmak c305f09
removed custom ops check
maktukmak 2ea889a
format fixes
maktukmak 1c45f4c
[CI] Basic Integration Test For TPU (#9968)
robertgshaw2-neuralmagic 5208dc7
[Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests ru…
hissu-hyvarinen 6e056bc
[Doc] Update VLM doc about loading from local files (#9999)
ywang96 04cef2c
[Bugfix] Fix `MQLLMEngine` hanging (#9973)
robertgshaw2-neuralmagic 9a5664d
[Misc] Refactor benchmark_throughput.py (#9779)
lk-chen ac04a97
[Frontend] Add max_tokens prometheus metric (#9881)
tomeras91 d93478b
[Bugfix] Upgrade to pytorch 2.5.1 (#10001)
bnellnm 2094062
[4.5/N] bugfix for quant config in speculative decode (#10007)
youkaichao 8f0a9ca
[Bugfix] Respect modules_to_not_convert within awq_marlin (#9895)
mgoin 04bbf38
[Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep (#9994)
tlrmchlsmth bbc3619
[Core] Make encoder-decoder inputs a nested structure to be more comp…
DarkLight1337 ad23318
[Bugfix] Fixup Mamba (#10004)
tlrmchlsmth ac12d53
Fix SchedulerConfig params (#459)
ldurejko 653e56c
Tensor parallelism for multi-step scheduling (#457)
tzielinski-habana 7a83b1a
[BugFix] Lazy import ray (#10021)
GeneDer 93dee88
[Misc] vllm CLI flags should be ordered for better user readability (…
chaunceyjiang 1033c3e
Set tokenizers version to <0.20.2 (#460)
madamczykhabana 5e56d88
Merge remote-tracking branch 'origin/habana_main' into private/kzawor…
kzawora-intel 18f00d7
Merge remote-tracking branch 'upstream/main' into private/kzawora/oct…
kzawora-intel d397ba5
fix hpu execution
kzawora-intel 4c0647f
format.sh
kzawora-intel c41788f
fix type checks
kzawora-intel 5952d81
[Frontend] Fix tcp port reservation for api server (#10012)
russellb cd34029
Refactor TPU requirements file and pin build dependencies (#10010)
richardsliu 09d3550
[Misc] Add logging for CUDA memory (#10027)
yangalan123 731aec5
[CI/Build] Limit github CI jobs based on files changed (#9928)
russellb a53046b
[Model] Support quantization of PixtralHFTransformer for PixtralHF (#…
mgoin d2e8033
[Feature] Update benchmark_throughput.py to support image input (#9851)
lk-chen b9c64c0
[Misc] Modify BNB parameter name (#9997)
jeejeelee 0246246
[CI] Prune tests/models/decoder_only/language/* tests (#9940)
mgoin 235366f
[CI] Prune back the number of tests in tests/kernels/* (#9932)
mgoin ca9844b
[bugfix] fix weak ref in piecewise cudagraph and tractable test (#10048)
youkaichao 43300bd
[Bugfix] Properly propagate trust_remote_code settings (#10047)
zifeitong 966e316
[Bugfix] Fix pickle of input when async output processing is on (#9931)
wallashss 0c63c34
[Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode (…
llsj14 c4cacba
[v1] reduce graph capture time for piecewise cudagraph (#10059)
youkaichao 82bfc38
[Misc] Sort the list of embedding models (#10037)
DarkLight1337 ffc0f2b
[Model][OpenVINO] Fix regressions from #8346 (#10045)
petersalas 2bcbae7
[Bugfix] Fix edge-case crash when using chat with the Mistral Tekken …
tjohnson31415 ea928f6
[Bugfix] Gpt-j-6B patch kv_scale to k_scale path (#10063)
arakowsk-amd 9d59b75
[Bugfix] Remove CustomChatCompletionContentPartParam multimodal input…
zifeitong 4089985
[V1] Integrate Piecewise CUDA graphs (#10058)
WoosukKwon 4be3a45
[distributed] add function to create ipc buffers directly (#10064)
youkaichao 21063c1
[CI/Build] drop support for Python 3.8 EOL (#8464)
aarnphm a5fda50
[CI/Build] Fix large_gpu_mark reason (#10070)
Isotr0py a02a50e
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend (#6143)
kzawora-intel 6a585a2
[Hotfix] Fix ruff errors (#10073)
WoosukKwon c3c0e90
[BugFix][Habana_main][Multistep]Fix multistep deepcopy overhead (#452)
xuechendi dc5cdfb
Set vllm-hpu-extension to 0063520 (#455)
madamczykhabana 7578f3b
Oct 28 rebase (#439)
kzawora-intel 07a6441
Revert "Oct 28 rebase" (#466)
kzawora-intel 5812cb6
Oct 28 rebase - attempt 2 (#467)
kzawora-intel 40882f3
Merge commit 'a5fda50a10641e47c0c290907f30ef2add6d4e7a' into HEAD
kzawora-intel 8e62377
format.sh
kzawora-intel 5eb7f3d
Nov 6 rebase (sans vllm-project#6143) (#468)
kzawora-intel 0a17a2e
Fix missed conflict (#469)
kzawora-intel b91403a
Merge commit 'a02a50e' into HEAD
kzawora-intel 843ae37
Merge commit '6a585a2' into HEAD
kzawora-intel 60b981e
Align fork with HPU upstream code (#465)
michalkuligowski 3c39626
The output tensor from sampling is the input_tokens to the (#471)
tzielinski-habana 66a67fc
gptq hpu support added
maktukmak f9cf700
row vs column paralel fix
maktukmak cbcba5d
g_idx check added
maktukmak a6ab053
exllama state removed
maktukmak 9b323b5
removed custom ops check
maktukmak 7077a99
format fixes
maktukmak 4ef6b7e
Merge branch 'gptq_hpu' of https://github.com/maktukmak/vllm-fork int…
maktukmak
@@ -0,0 +1,291 @@
from fractions import Fraction

from typing import Any, Dict, List, Optional

import torch
from torch.nn.parameter import Parameter

from vllm import _custom_ops as ops
from vllm.model_executor.layers.linear import LinearBase, LinearMethodBase
from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig)
from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead
from vllm.model_executor.parameter import (ChannelQuantScaleParameter,
                                           GroupQuantScaleParameter,
                                           PackedColumnParameter,
                                           PackedvLLMParameter,
                                           RowvLLMParameter)


class GPTQHPUConfig(QuantizationConfig):
    """Config class for GPTQ.

    Reference: https://arxiv.org/abs/2210.17323
    """

    def __init__(
        self,
        weight_bits: int,
        group_size: int,
        desc_act: bool,
        lm_head_quantized: bool,
    ) -> None:
        self.weight_bits = weight_bits
        self.group_size = group_size
        self.desc_act = desc_act
        self.lm_head_quantized = lm_head_quantized
        self.pack_factor = Fraction(32, self.weight_bits)
        if self.weight_bits not in [2, 3, 4, 8]:
            raise ValueError(
                "Currently, only 2/3/4/8-bit weight quantization is "
                f"supported for GPTQ, but got {self.weight_bits} bits.")

    def __repr__(self) -> str:
        return (f"GPTQHPUConfig(weight_bits={self.weight_bits}, "
                f"group_size={self.group_size}, "
                f"desc_act={self.desc_act}, "
                f"lm_head_quantized={self.lm_head_quantized})")

    @classmethod
    def get_name(cls) -> str:
        return "gptq_hpu"

    @classmethod
    def get_supported_act_dtypes(cls) -> List[torch.dtype]:
        return [torch.bfloat16]

    @classmethod
    # Need to figure it out
    def get_min_capability(cls) -> int:
        return 0

    @classmethod
    def get_config_filenames(cls) -> List[str]:
        return ["quantize_config.json"]

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "GPTQHPUConfig":
        weight_bits = cls.get_from_keys(config, ["bits"])
        group_size = cls.get_from_keys(config, ["group_size"])
        desc_act = cls.get_from_keys(config, ["desc_act"])
        lm_head_quantized = cls.get_from_keys_or(config, ["lm_head"],
                                                 default=False)
        return cls(weight_bits, group_size, desc_act, lm_head_quantized)

    @classmethod
    def override_quantization_method(cls, hf_quant_cfg,
                                     user_quant) -> Optional[str]:

        is_valid_user_quant = user_quant == "gptq_hpu"

        if is_valid_user_quant:
            return cls.get_name()

        return None

    def get_quant_method(self, layer: torch.nn.Module,
                         prefix: str) -> Optional["GPTQHPULinearMethod"]:
        if (isinstance(layer, LinearBase) or
                (isinstance(layer, ParallelLMHead) and self.lm_head_quantized)):
            return GPTQHPULinearMethod(self)
        return None

    def get_scaled_act_names(self) -> List[str]:
        return []


class GPTQHPULinearMethod(LinearMethodBase):
    """Linear method for GPTQ.

    Args:
        quant_config: The GPTQ quantization config.
    """

    def __init__(self, quant_config: GPTQHPUConfig):
        self.quant_config = quant_config

    def create_weights(
        self,
        layer: torch.nn.Module,
        input_size_per_partition: int,
        output_partition_sizes: List[int],
        input_size: int,
        output_size: int,
        params_dtype: torch.dtype,
        **extra_weight_attrs,
    ):
        del output_size  # Unused.

        weight_loader = extra_weight_attrs.get("weight_loader")
        if input_size_per_partition % self.quant_config.group_size != 0:
            raise ValueError(
                "The input size is not aligned with the quantized "
                "weight shape. This can be caused by too large "
                "tensor parallel size.")
        output_size_per_partition = sum(output_partition_sizes)
        if (output_size_per_partition % self.quant_config.pack_factor.numerator
                != 0):
            raise ValueError(
                "The output size is not aligned with the quantized "
                "weight shape. This can be caused by too large "
                "tensor parallel size.")

        if self.quant_config.group_size != -1:
            group_size = self.quant_config.group_size
        else:
            group_size = input_size
        scale_and_zero_size = input_size // group_size
        scale_and_zero_input_dim = None

        qweight = PackedvLLMParameter(
            data=torch.empty(
                input_size_per_partition // self.quant_config.pack_factor,
                output_size_per_partition,
                dtype=torch.int32,
            ),
            input_dim=0,
            output_dim=1,
            packed_dim=0,
            packed_factor=self.quant_config.pack_factor,
            weight_loader=weight_loader)

        g_idx = RowvLLMParameter(data=torch.tensor(
            [
                i // self.quant_config.group_size
                for i in range(input_size_per_partition)
            ],
            dtype=torch.int32,
        ),
                                 input_dim=0,
                                 weight_loader=weight_loader)
        qzeros_args = {
            "data":
            torch.empty(
                scale_and_zero_size,
                output_size_per_partition // self.quant_config.pack_factor,
                dtype=torch.int32,
            ),
            "weight_loader":
            weight_loader
        }
        weight_scale_args = {
            "data":
            torch.empty(
                scale_and_zero_size,
                output_size_per_partition,
                dtype=params_dtype,
            ),
            "weight_loader":
            weight_loader
        }
        if scale_and_zero_input_dim is None:
            scales = ChannelQuantScaleParameter(output_dim=1,
                                                **weight_scale_args)
            qzeros = PackedColumnParameter(
                output_dim=1,
                packed_dim=1,
                packed_factor=self.quant_config.pack_factor,
                **qzeros_args)

        else:
            scales = GroupQuantScaleParameter(output_dim=1,
                                              input_dim=0,
                                              **weight_scale_args)
            qzeros = PackedvLLMParameter(
                input_dim=0,
                output_dim=1,
                packed_dim=1,
                packed_factor=self.quant_config.pack_factor,
                **qzeros_args)

        layer.register_parameter("qweight", qweight)
        layer.register_parameter("g_idx", g_idx)
        layer.register_parameter("qzeros", qzeros)
        layer.register_parameter("scales", scales)

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:

        # Bit offsets of each packed value within a 32-bit word.
        self.wf = torch.tensor(list(range(0, 32,
                                          self.quant_config.weight_bits)),
                               dtype=torch.int32).unsqueeze(0)
        # Unpack the checkpoint's "old format" layout (values packed along the
        # input dim) and repack along the output dim for ops.gptq_hpu_gemm.
        weight = self.unpack_weight_from_cuda_old_format(layer)
        layer.qweight.data = self.pack_tensor(weight).to('hpu')

        zeros = self.unpack_zeros_from_cuda_old_format(layer).cpu()
        layer.qzeros.data = self.pack_tensor(zeros).to('hpu')

        # TODO: Support group indexing and remove the check
        columns = layer.qweight.shape[0]
        if self.quant_config.group_size > 0:
            g_idx_trivial = [
                i // self.quant_config.group_size for i in range(columns)
            ]
        else:
            g_idx_trivial = [0] * columns
        g_idx_trivial = torch.tensor(g_idx_trivial, dtype=torch.int32)
        assert torch.equal(
            layer.g_idx,
            g_idx_trivial), "Non-trivial tensor g_idx is not supported"

        # for torch.compile
        layer.qweight = Parameter(layer.qweight.data, requires_grad=False)
        layer.qzeros = Parameter(layer.qzeros.data, requires_grad=False)
        layer.g_idx = Parameter(layer.g_idx.data, requires_grad=False)
        layer.scales = Parameter(layer.scales.data, requires_grad=False)

    def apply(self,
              layer: torch.nn.Module,
              x: torch.Tensor,
              bias: Optional[torch.Tensor] = None) -> torch.Tensor:

        out_shape = x.shape[:-1]
        if hasattr(layer, 'output_size_per_partition'):
            out_shape += (layer.output_size_per_partition, )
        else:
            out_shape += (layer.output_size, )

        reshaped_x = x.reshape(-1, x.shape[-1])

        output = ops.gptq_hpu_gemm(reshaped_x, layer.qweight, layer.qzeros,
                                   layer.scales, layer.g_idx, None,
                                   self.quant_config.weight_bits)
        if bias is not None:
            output.add_(bias)
        return output.reshape(out_shape)

    def pack_tensor(self, input, bits=4):
        normal = input.to(torch.int32)
        q = torch.sum(torch.bitwise_left_shift(
            normal.reshape(normal.shape[0], -1, (32 // bits)),
            self.wf.unsqueeze(0)),
                      dim=-1).to(torch.int32)

        return q

    def unpack_zeros_from_cuda_old_format(self, layer):

        bits = self.quant_config.weight_bits
        zeros = torch.bitwise_right_shift(
            torch.unsqueeze(layer.qzeros.to('cpu'),
                            2).expand(-1, -1, 32 // bits),
            self.wf.unsqueeze(0),
        ).to(torch.int16 if bits == 8 else torch.int8)

        zeros = zeros + 1
        zeros = torch.bitwise_and(zeros, (2**bits) - 1).to(
            layer.scales.dtype)  # NOTE: It appears that casting here
        # after the `zeros = zeros + 1` is important.
        zeros = zeros.reshape(-1, zeros.shape[1] * zeros.shape[2])
        return zeros

    def unpack_weight_from_cuda_old_format(self, layer):

        qweight = layer.qweight.cpu()
        bits = self.quant_config.weight_bits

        weight = torch.bitwise_right_shift(
            torch.unsqueeze(qweight, 1).expand(-1, 32 // bits, -1),
            self.wf.unsqueeze(-1),
        ).to(torch.int16 if bits == 8 else torch.int8)
        weight = torch.bitwise_and(weight, (2**bits) - 1)
        weight = weight.reshape(
            (weight.shape[0] * weight.shape[1], weight.shape[2]))
        return weight
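For readers unfamiliar with the GPTQ "old format" layout that process_weights_after_loading converts, the following is a minimal, self-contained sketch (not part of the PR) of the bit-packing round trip that pack_tensor and unpack_weight_from_cuda_old_format implement: with 4-bit weights, eight values are packed into each int32 at bit offsets 0, 4, ..., 28, and unpacking shifts them back out and masks off the extra bits.

    # Standalone sketch of the (un)packing used above; assumes 4-bit weights.
    import torch

    bits = 4
    wf = torch.arange(0, 32, bits, dtype=torch.int32).unsqueeze(0)  # offsets 0, 4, ..., 28

    vals = torch.randint(0, 2**bits, (16, 8), dtype=torch.int32)  # unpacked values

    # Pack: shift each group of 32 // bits values to its offset and sum
    # (the offsets are disjoint, so the sum acts as a bitwise OR).
    packed = torch.sum(
        torch.bitwise_left_shift(vals.reshape(vals.shape[0], -1, 32 // bits),
                                 wf.unsqueeze(0)),
        dim=-1).to(torch.int32)                      # shape (16, 1)

    # Unpack: shift right by the same offsets and mask back down to `bits` bits.
    unpacked = torch.bitwise_right_shift(
        packed.unsqueeze(-1).expand(-1, -1, 32 // bits), wf)
    unpacked = torch.bitwise_and(unpacked, (1 << bits) - 1).reshape(vals.shape)

    print(torch.equal(vals, unpacked))  # expected: True

The real code additionally changes which dimension the values are packed along (input dim in the checkpoint, output dim after repacking) and applies the +1 zero-point offset for qzeros.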
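For context, a hedged usage sketch (not part of the diff): assuming this config is registered in the fork's quantization-method registry under the "gptq_hpu" name returned by get_name() (that registration is not shown here), selecting it explicitly would look roughly like this. The model path is a placeholder for any GPTQ-quantized checkpoint.

    # Hypothetical usage sketch; assumes "gptq_hpu" is registered as a
    # quantization method in this fork (registration is not part of this file).
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="/path/to/gptq-quantized-model",  # placeholder GPTQ checkpoint
        quantization="gptq_hpu",   # picked up via override_quantization_method()
        dtype="bfloat16",          # the only activation dtype this config reports
    )
    print(llm.generate(["Hello, Gaudi!"], SamplingParams(max_tokens=32)))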
Review comments

1. Start with a capital letter.
2. This cannot be merged as is, because this class, ScalarType, is widely used and will make it difficult to upstream. Use a derived class.

Reply: I updated this class from the original repo because it blocked me from testing the feature. The class is defined here: https://github.com/vllm-project/vllm/blob/fb2716d64117aaa6c36b97b09765aa10a89e2fe5/vllm/scalar_type.py#L19
Let me know if there is a better way.

Reply: You are right. vLLM was rebased and these methods do have those definitions now (it now lives under the scalar_type.py name). Please rebase.