An easy-to-use LLM quantization and inference toolkit based on the GPTQ algorithm (weight-only quantization).
- 2024-06-29 🚀🚀🚀 v0.9.1 released. With 3 new models (DeepSeek-V2, DeepSeek-V2-Lite, DBRX Converted), a new BITBLAS format/kernel, proper batching of the calibration dataset resulting in a > 50% quantization speedup, security hash check of loaded model weights, tons of refactor/usability improvements, bug fixes and much more.
- 2024-06-20 ✨ GPTQModel v0.9.0 released. Thanks to the ModelCloud team and the open-source ML community for their contributions!
We want GPTQModel to be highly focused on GPTQ based quantization and target inference compatibility with HF Transformers, vLLM, and SGLang.
GPTQModel is an opinionated fork/refactor of AutoGPTQ with the latest bug fixes, more model support, faster quant inference, faster quantization, better quants (as measured in PPL), and a pledge from the ModelCloud team that we, along with the open-source ML community, will make every effort to keep the library up to date with the latest advancements, model support, and bug fixes.
We will backport bug fixes to AutoGPTQ on a case-by-case basis.
- 🚀 Added `DeepSeek-V2` Model Support
- 🚀 Added `DeepSeek-V2-Lite` Model Support
- 🚀 Added `ChatGLM` Model Support
- 🚀 Added `MiniCPM` Model Support
- 🚀 Added `Phi-3` Model Support
- 🚀 Added `Qwen2MoE` Model Support
- 🚀 Added `DBRX` Model Support (Converted Model)
- 🚀 BITBLAS format/inference support from Microsoft
- 🚀 `Sym=False` Support. AutoGPTQ has unusable `sym=false`. (Re-quant required)
- 🚀 `lm_head` module quant inference support for further VRAM reduction.
- 🚀 Faster quantization: More than 50% faster for TinyLlama + 4090 with batching and a large calibration dataset.
- 🚀 Better quality quants as measured by PPL. (Test config: defaults + `sym=True` + `FORMAT.GPTQ`, TinyLlama)
- 🚀 Model weights sharding support
- 🚀 Security: hash check of model weights on load
- ✨ Alert users of sub-optimal calibration data. Most new users get this part horribly wrong.
- 👾 Removed non-working, partially working, or fully deprecated features: Peft, ROCM, AWQ Gemm inference, Triton v1 (replaced by v2), Fused Attention (replaced by Marlin/Exllama).
- 👾 Fixed packing performance regression on high core-count systems. Backported to AutoGPTQ.
- 👾 Fixed crash on H100. Backported to AutoGPTQ.
- ✨ Many thousands of lines of refactor/cleanup.
- ✨ Added a CI workflow to validate future PRs and prevent code regressions.
- ✨ Added a perplexity unit test to guard against model quant quality regressions.
- 👾 De-bloated 271K lines, of which 250K were caused by a single dataset used only by an example.
- 👾 De-bloated the number of args presented in the public `.from_quantized()`/`.from_pretrained()` API.
- ✨ Shorter and more concise public API/internal vars. No need to mimic HF style for verbose class names.
- ✨ Everything that did not pass unit tests has been removed from the repo.
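The "hash check of model weights on load" item above can be pictured with a small standard-library sketch. This is not GPTQModel's actual implementation or API: `sha256_of_file` and `verify_weights` are hypothetical helper names, and SHA-256 is an assumed choice of digest.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path, expected_hex):
    """Hypothetical check: raise ValueError if the file digest does not match."""
    actual = sha256_of_file(path)
    if actual != expected_hex.lower():
        raise ValueError(f"hash mismatch for {path}: got {actual}")
    return True
```

Reading in chunks keeps memory use flat even for multi-gigabyte weight files, which is why the sketch avoids loading the whole file at once.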
Planned work (roadmap):
- `lm_head` quantization support by integrating with Intel/AutoRound.
- Customizable callback in per-layer quantization.
- Add Qbits (cpu inference) support from Intel/Qbits.
- Add back ROCM/AMD support once everything is validated.
- Store quant loss stat and apply diffs to new quant for quality control.
- Add Tests for every single supported model.
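Several items above judge quant quality by perplexity (PPL). As a reminder of what that number means, here is a minimal sketch, assuming you already have the natural-log probability the model assigned to each reference token. The `perplexity` function is illustrative and is not part of GPTQModel.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token has a perplexity of 4
# (up to floating-point rounding): it is as uncertain as a uniform choice
# among four options.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

Lower is better, so a quantized model whose PPL stays close to the unquantized baseline has lost little quality.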
Models marked 🚀 are the ones newly added in GPTQModel (see the list of changes above); ✅ marks the remaining supported models.

| Model | | Model | | Model | | Model | |
|---|---|---|---|---|---|---|---|
| Baichuan | ✅ | DeepSeek-V2-Lite | 🚀 | LongLLaMA | ✅ | Phi-3 | 🚀 |
| Bloom | ✅ | Falcon | ✅ | MiniCPM | 🚀 | Qwen | ✅ |
| ChatGLM | 🚀 | GPTBigCode | ✅ | Mistral | ✅ | Qwen2MoE | 🚀 |
| CodeGen | ✅ | GPTNeoX | ✅ | Mixtral | ✅ | RefinedWeb | ✅ |
| Cohere | ✅ | GPT-2 | ✅ | MOSS | ✅ | StableLM | ✅ |
| DBRX Converted | 🚀 | GPT-J | ✅ | MPT | ✅ | StarCoder2 | ✅ |
| Deci | ✅ | InternLM | ✅ | OPT | ✅ | XVERSE | ✅ |
| DeepSeek-V2 | 🚀 | Llama | ✅ | Phi | ✅ | Yi | ✅ |
We aim for 100% compatibility with models quantized by AutoGPTQ <= 0.7.1 and will consider syncing future compatibility on a case-by-case basis.
GPTQModel is currently Linux only and requires an Nvidia GPU with CUDA compute capability >= 6.0.
WSL on Windows should work as well.
ROCM/AMD support will be re-added in a future version after everything on ROCM has been validated. Only fully validated features will be re-added from the original AutoGPTQ repo.
# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel
# compile and install
pip install -vvv --no-build-isolation .
# alternatively, install the published package with pip
pip install gptq-model --no-build-isolation
Warning: this is only a showcase of the basic APIs in GPTQModel. It uses a single sample to quantize a very small model, so the quality of the resulting quantized model may be poor.
Below is the simplest example of using `gptqmodel` to quantize a model and run inference after quantization:
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, QuantizeConfig
pretrained_model_dir = "facebook/opt-125m"
quant_output_dir = "opt-125m-4bit"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
calibration_dataset = [
    tokenizer(
        "The world is a wonderful place full of beauty and love."
    )
]
quant_config = QuantizeConfig(
    bits=4,  # 4-bit
    group_size=128,  # 128 is good balance between quality and performance
)
# load un-quantized model, by default, the model will always be loaded into CPU memory
model = GPTQModel.from_pretrained(pretrained_model_dir, quant_config)
# quantize model, the calibration_dataset should be list of dict whose keys can only be "input_ids" and "attention_mask"
model.quantize(calibration_dataset)
# save quantized model
model.save_quantized(quant_output_dir)
# load quantized model to the first GPU
model = GPTQModel.from_quantized(quant_output_dir)
# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("gptqmodel is", return_tensors="pt").to(model.device))[0]))
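To make the `bits` and `group_size` settings in the example above (and the `sym` option mentioned earlier) more concrete, here is a pure-Python sketch of group-wise weight quantization. It uses plain round-to-nearest, which is an assumption made for brevity: GPTQ itself chooses the rounding with calibration data to compensate for error, and none of the function names below come from GPTQModel.

```python
def quantize_group(weights, bits=4, sym=True):
    """Round-to-nearest quantization of one group of weights.

    Returns (codes, scale, zero) such that weight ~= (code - zero) * scale.
    """
    levels = 2 ** bits - 1  # 15 for 4-bit: codes live in 0..15
    if sym:
        # Symmetric: range is [-max_abs, +max_abs], zero point fixed mid-range.
        max_abs = max(abs(w) for w in weights) or 1.0
        scale = 2 * max_abs / levels
        zero = (levels + 1) // 2
    else:
        # Asymmetric: range follows the actual min/max (always including 0).
        lo = min(min(weights), 0.0)
        hi = max(max(weights), 0.0)
        scale = (hi - lo) / levels or 1.0
        zero = round(-lo / scale)
    codes = [min(levels, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(c - zero) * scale for c in codes]


def quantize_row(row, bits=4, group_size=128, sym=True):
    """Quantize a row in independent groups of `group_size` weights."""
    return [
        quantize_group(row[i:i + group_size], bits, sym)
        for i in range(0, len(row), group_size)
    ]
```

Each group stores its own `scale` and `zero`, so a smaller `group_size` tracks the weights more closely at the cost of more metadata, which is the quality/performance balance the `group_size=128` comment refers to.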
For more advanced model quantization features, please refer to this script.
Read the `gptqmodel/models/llama.py` code, which explains in detail via comments how model support is defined. Use it as a guide when opening a PR for a new model. Most models follow the same pattern.
You can use the tasks defined in `gptqmodel.eval_tasks` to evaluate a model's performance on a specific downstream task before and after quantization.
The predefined tasks support all causal language models implemented in 🤗 transformers and in this project.
Below is an example that evaluates `EleutherAI/gpt-j-6b` on a sequence-classification task using the `cardiffnlp/tweet_sentiment_multilingual` dataset:
from functools import partial
import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from gptqmodel import GPTQModel, QuantizeConfig
from gptqmodel.eval_tasks import SequenceClassificationTask
MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]
    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])
    return new_samples


# model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = GPTQModel.from_pretrained(MODEL, QuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)
task = SequenceClassificationTask(
    model=model,
    tokenizer=tokenizer,
    classes=LABELS,
    data_name_or_path=DATASET,
    prompt_col_name="prompt",
    label_col_name="label",
    **{
        "num_samples": 1000,  # how many samples will be sampled to evaluation
        "sample_max_len": 1024,  # max tokens for each sample
        "block_max_len": 2048,  # max tokens for each data block
        # function to load dataset, one must only accept data_name_or_path as input
        # and return datasets.Dataset
        "load_fn": partial(datasets.load_dataset, name="english"),
        # function to preprocess dataset, which is used for datasets.Dataset.map,
        # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
        "preprocess_fn": ds_refactor_fn,
        # truncate label when sample's length exceed sample_max_len
        "truncate_prompt": False
    }
)
# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())
# self-consistency
print(
    task.run(
        generation_config=GenerationConfig(
            num_beams=3,
            num_return_sequences=3,
            do_sample=True
        )
    )
)
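The "self-consistency" run above samples several sequences per prompt (`num_return_sequences=3`). The usual idea behind self-consistency is to take a majority vote over the sampled answers. The source does not say how `task.run()` aggregates them internally, so the helper below is only an illustrative sketch with a hypothetical name.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among several sampled predictions."""
    if not predictions:
        raise ValueError("need at least one prediction")
    # most_common(1) returns [(label, count)]; on a tie the first-seen label wins.
    label, _count = Counter(predictions).most_common(1)[0]
    return label


# Three sampled answers for one tweet: two agree, so "positive" is chosen.
print(majority_vote(["positive", "neutral", "positive"]))
```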
The tutorials provide step-by-step guidance for integrating `gptqmodel` with your own project, along with some best-practice principles.
The examples provide plenty of example scripts that use `gptqmodel` in different ways.
Currently, `gptqmodel` supports `LanguageModelingTask`, `SequenceClassificationTask` and `TextSummarizationTask`; more tasks will come soon!
GPTQModel will use Marlin, Exllama v2, and Triton/CUDA kernels, in that order, for maximum inference performance.
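The kernel preference described above amounts to a priority-ordered fallback. The sketch below shows that pattern only. The kernel labels, the relative order of Triton and CUDA, and the `pick_kernel` function are assumptions for illustration, not GPTQModel identifiers.

```python
# Highest-priority kernel first. The order of "triton" vs "cuda" is assumed.
KERNEL_PRIORITY = ["marlin", "exllama_v2", "triton", "cuda"]


def pick_kernel(supported):
    """Return the first kernel in priority order that the setup supports."""
    for name in KERNEL_PRIORITY:
        if name in supported:
            return name
    raise RuntimeError("no supported kernel available")


# If Marlin cannot handle a given model/config, the next candidate is used.
print(pick_kernel({"cuda", "exllama_v2"}))
```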
- Elias Frantar, Saleh Ashkboos, Torsten Hoefler and Dan Alistarh: for creating GPTQ and Marlin.
- PanQiWei: for creating AutoGPTQ, on which this project's code is based.
- FXMarty: for maintaining and supporting AutoGPTQ.
- Qwopqwop200: for the quantization code used in this project, adapted from GPTQ-for-LLaMa.
- Turboderp: for releasing the Exllama v1 and Exllama v2 kernels adapted for use in this project.
- FpgaMiner: for the GPTQ-Triton kernels used in GPTQ-for-LLaMa, which are adapted into this project.