Merge branch 'ModelCloud:main' into main
ZX-ModelCloud authored Dec 13, 2024
2 parents 443db44 + 95dedbe commit d529374
Showing 22 changed files with 257 additions and 290 deletions.
55 changes: 26 additions & 29 deletions README.md
@@ -9,23 +9,26 @@
</p>

## News
* 12/12/2024 1.4.1-dev: Added Qwen2-VL model support. `mse` quantization property exposed in `QuantizeConfig`.
* 12/13/2024 [1.4.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.1): Added Qwen2-VL model support. `mse` quantization control exposed in `QuantizeConfig`. Monkey-patch `patch_vllm()` and `patch_hf()` APIs added so Transformers/Optimum/PEFT and vLLM can correctly load GPTQModel-quantized models while the upstream PRs are still pending (a hedged usage sketch follows this list).
* 12/10/2024 [1.4.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.0) `EvalPlus` harness integration merged upstream. We now support both `lm-eval` and `EvalPlus`. Added pure torch `Torch` kernel. Refactored the `Cuda` kernel into the `DynamicCuda` kernel. `Triton` kernel is now auto-padded for max model support. `Dynamic` quantization now supports both positive `+:` (default) and negative `-:` matching, which allows matched modules to be skipped entirely during quantization. Fixed auto-`Marlin` kernel selection. Added auto-kernel fallback for unsupported kernel/module pairs. Lots of internal refactoring and cleanup in preparation for the transformers/optimum/peft upstream PR merge. Deprecated saving in the `Marlin` weight format since `Marlin` supports auto conversion of the `gptq` format to `Marlin` at runtime.

* 11/29/2024 [1.3.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.3.1) Olmo2 model support. Intel XPU acceleration via IPEX. Model sharding Transformers compat fix due to API deprecation in HF. Removed the hard triton dependency; the Triton kernel now optionally depends on the triton pkg.
* 11/26/2024 [1.3.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.3.0) Zero-Day Hymba model support. Removed `tqdm` and `rogue` dependencies.
* 11/24/2024 [1.2.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.2.3) HF GLM model support. ClearML logging integration. Replaced the `gputil` + `psutil` dependencies with `device-smi`. Fixed model unit tests.
* 11/11/2024 🚀 [1.2.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.2.1) Meta MobileLLM model support added. `lm-eval[gptqmodel]` integration merged upstream. Intel/IPEX CPU inference merged, replacing QBits (deprecated). Auto-fix/patch ChatGLM-3/GLM-4 compat with the latest transformers. New `.load()` and `.save()` APIs.
* 10/29/2024 🚀 [1.1.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.1.0) IBM Granite model support. Full auto-buildless wheel install from PyPI. Reduced max CPU memory usage by >20% during quantization. 100% CI model/feature coverage.
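The monkey-patch APIs in the 1.4.1 entry above are only named, not shown. Below is a minimal, hedged sketch of loading a GPTQModel-quantized checkpoint through Transformers after `patch_hf()`; the import path for `patch_hf` and the model id are assumptions, not verified against the released package.

```python
# Hedged sketch: route Transformers' GPTQ loading through GPTQModel while the
# upstream Transformers/Optimum/PEFT PRs are still pending.
# Assumption: patch_hf() is importable from the top-level gptqmodel package.
from gptqmodel import patch_hf

patch_hf()  # apply the patch before the quantized model is loaded

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("GPTQModel is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```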

<details>

<summary>Archived News:</summary>
* 10/12/2024 ✨ [1.0.9](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.9) Moved AutoRound to an optional dependency and fixed the pip install regression in v1.0.8.

* 10/11/2024 ✨ [1.0.8](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.8) Added wheels for Python 3.12 and CUDA 11.8.
* 10/08/2024 ✨ [1.0.7](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.7) Fixed an issue where the Marlin (faster) kernel was not auto-selected for some models.

* 09/26/2024 ✨ [1.0.6](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.6) Fixed the quantized Llama 3.2 Vision model loader.
* 09/26/2024 ✨ [1.0.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.5) Partial Llama 3.2 Vision model support (mllama): only text-layer quantization is supported for now.

<details>

<summary>Archived News:</summary>
* 09/26/2024 ✨ [1.0.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.4) Integrated Liger Kernel support for ~1/2 memory reduction on some models during quantization. Added a control toggle to disable parallel packing.
* 09/18/2024 ✨ [1.0.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.3) Added Microsoft GRIN-MoE and MiniCPM3 support.
* 08/16/2024 ✨ [1.0.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.2) Support Intel/AutoRound v0.3, pre-built whl packages, and PyPI release.
@@ -56,18 +59,19 @@ GPTQModel started out as a major refactor (fork) of AutoGPTQ but has now morphe…
Public tests/papers and ModelCloud's internal tests have shown that GPTQ is on par with and/or exceeds other 4-bit quantization methods in terms of both quality recovery and production-level inference speed, in both token latency and requests per second (rps). GPTQ offers the optimal blend of quality and inference speed you would want in a real-world production system.

## Features
* 🚀 Extensive model support for: `Olmo2`, `Hymba`, `GLM`, `IBM Granite`, `Llama 3.2 Vision`, `MiniCPM3`, `GRIN-Moe`, `Phi 3.5`, `EXAONE 3.0`, `InternLM 2.5`, `Gemma 2`, `DeepSeek-V2`, `DeepSeek-V2-Lite`, `ChatGLM`, `MiniCPM`, `Phi-3`, `Qwen2MoE`, `DBRX`.
* ✨ 100% CI coverage for all supported models including quality/ppl regression.
* 🚀 Extensive model support for: `Llama 1-3.3`, `Qwen2-VL`, `Olmo2`, `Hymba`, `GLM`, `IBM Granite`, `Llama 3.2 Vision`, `MiniCPM3`, `GRIN-Moe`, `Phi 1-4`, `EXAONE 3.0`, `InternLM 2.5`, `Gemma 2`, `DeepSeek-V2`, `DeepSeek-V2-Lite`, `ChatGLM`, `MiniCPM`, `Qwen2MoE`, `DBRX`.
* 💯 100% CI unit-test coverage for all supported models and kernels including post-quantization quality regression.
* `Dynamic`/Mixed quantization control on a per-module basis. Each layer/module can have a unique quantization config or be excluded from quantization altogether (see the sketch after this list).
* 🚀 [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang) inference integration for quantized models where format = `FORMAT.GPTQ`.
* 🚀 [Intel/IPEX](https://github.com/intel/intel-extension-for-pytorch) 4bit quantization/inference support on CPU (`avx512_vnni`) and Intel/XPU.
* 🚀 [Intel/AutoRound](https://github.com/intel/auto-round) QUANT_METHOD support added for a potentially higher quality quantization, with `lm_head` module quantization support for even more VRAM reduction; format export to `FORMAT.GPTQ` for max inference compatibility.
* 🚀 [Microsoft/BITBLAS](https://github.com/microsoft/BitBLAS) format + dynamically compiled inference.
* 🚀 Asymmetric `Sym=False` support via `FORMAT.GPTQ_V2`.
* 🚀 `lm_head` module quant inference support for further VRAM reduction (auto-round only).
* 🚀 [Intel/IPEX](https://github.com/intel/intel-extension-for-pytorch) 4bit quantization/inference support on CPU (recent Intel/AMD) and Intel/XPU.
* 🚀 [Microsoft/BITBLAS](https://github.com/microsoft/BitBLAS) format + dynamically compiled inference.
* [Intel/AutoRound](https://github.com/intel/auto-round) QUANT_METHOD support added for a potentially higher quality quantization.
* Asymmetric `Sym=False` support via `FORMAT.GPTQ_V2`.
* `lm_head` module quant inference support for further VRAM reduction (auto-round only).
* 🚀 Faster quantization: More than 50% faster for TinyLlama + 4090 with batching and large calibration dataset.
* 🚀 Better quality quants as measured by PPL. (Test config: defaults + `sym=True` + `FORMAT.GPTQ`, TinyLlama)
* 🚀 Model weights sharding support
* 🚀 Security: hash check of model weights on load
* Better quality quants as measured by PPL. (Test config: defaults + `sym=True` + `FORMAT.GPTQ`, TinyLlama)
* Model weights sharding support
* Security: hash check of model weights on load
* 🚀 Over 50% faster PPL calculations (OPT)
* 🚀 Over 40% faster `packing` stage in quantization (Llama 3.1 8B)
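To make the `Dynamic` per-module control and the `mse` option above concrete, here is a minimal sketch, assuming `dynamic` maps `+:`/`-:` prefixed regex keys to per-module overrides and that `mse` is a float field on `QuantizeConfig`; the field values, regex patterns, and output path are illustrative only.

```python
# Hedged sketch: per-module ("Dynamic") quantization overrides plus MSE control.
# Assumptions: `dynamic` maps "+:regex"/"-:regex" keys to override dicts and
# `mse` is a float knob on QuantizeConfig; the values below are illustrative.
from gptqmodel import GPTQModel, QuantizeConfig
from transformers import AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    mse=2.4,  # assumed: enables MSE-based clipping during GPTQ
    dynamic={
        "+:model\\.layers\\.0\\..*": {"bits": 8, "group_size": 32},  # widen the first block
        "-:.*\\.mlp\\.gate.*": {},  # negative match: skip these modules entirely
    },
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
calibration = [tokenizer("gptqmodel is an easy-to-use llm quantization toolkit.")] * 32

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration)
model.save("tinyllama-1.1b-gptq-4bit-dynamic")
```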

## Model Support: 🚀 (Added by GPTQModel)
| Model | | | | | | | | |
| ---------------- | --- | -------------- | --- | ---------------- | --- | ---------- | --- | --- |
| Baichuan | ✅ | Falcon | ✅ | Llama 1/2/3 | ✅ | OLMo2 | 🚀 | |
| Bloom | ✅ | Gemma 2 | 🚀 | Llama 3.2 Vision | 🚀 | Phi/Phi-3 | 🚀 | |
| Baichuan | ✅ | Falcon | ✅ | Llama 1-3.3 | ✅ | OLMo2 | 🚀 | |
| Bloom | ✅ | Gemma 2 | 🚀 | Llama 3.2 Vision | 🚀 | Phi 1-4 | 🚀 | |
| ChatGLM | 🚀 | GPTBigCode | ✅ | LongLLaMA | ✅ | Qwen | ✅ | |
| CodeGen | ✅ | GPTNeoX | ✅ | MiniCPM3 | ✅ | Qwen2MoE | 🚀 | |
| Cohere | ✅ | GPT-2 | ✅ | Mistral | ✅ | Qwen2VL | 🚀 | |
| DeepSeek-V2-Lite | 🚀 | Hymba | 🚀 | MPT | ✅ | XVERSE | ✅ | |
| EXAONE 3.0 | 🚀 | InternLM 1/2.5 | 🚀 | OPT | ✅ | Yi | ✅ | |

## HW Accelerator Requirements
## Kernel and HW Accelerator Support

GPTQModel is validated for Linux x86_64 with the following devices:

| Device | | |
| ---------------- | --- | -------------- |
| Nvidia GPU | ✅ | Ampere or Higher |
| Intel/AMD CPU | ✅ | `avx512_vnni` or `amx` |
| Intel XPU | ✅ | Intel Arc + Datacenter Max |
| Device | | Optimized Arch | Kernels |
| ---------------- | --- | -------------- | -------------- |
| Nvidia GPU | ✅ | Ampere or Higher | Marlin, Exllama V2, Exllama V1, Triton, DynamicCuda, Torch |
| Intel/AMD CPU | ✅ | `avx512` or `amx` | IPEX, Torch |
| Intel XPU | ✅ | Intel Arc + Datacenter Max | IPEX, Torch |

## Install

@@ -199,13 +203,6 @@ lm_eval_results = GPTQModel.eval(model_id, framework=EVAL.LM_EVAL, tasks=[EVAL.L…
evalplus_results = GPTQModel.eval(model_id, framework=EVAL.EVALPLUS, tasks=[EVAL.EVALPLUS.HUMAN], output_file='evalplus_result.json')
```


### Which kernel is used by default?

* `GPU`: Marlin, Exllama v2, Exllama v1, DynamicCuda, Torch kernels in that order for maximum inference performance. Optional Microsoft/BITBLAS kernel can be toggled.
* `CPU`: Intel/IPEX kernel
* `XPU`: Intel/IPEX kernel
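Although this subsection is removed by the commit, the kernel order it documents maps naturally onto explicit kernel selection at load time. A hedged sketch follows; the `BACKEND` enum, the `backend` keyword, and the model id are assumptions.

```python
# Hedged sketch: rely on the automatic kernel order described above, or pin one.
# Assumptions: a BACKEND enum and a `backend=` keyword exist on GPTQModel.load();
# the model id below is a placeholder.
from gptqmodel import BACKEND, GPTQModel

model_id = "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit"  # placeholder

# Auto selection on GPU: Marlin -> Exllama v2 -> Exllama v1 -> DynamicCuda -> Torch.
model = GPTQModel.load(model_id)

# Or force a specific kernel instead of walking the fallback chain:
model_marlin = GPTQModel.load(model_id, backend=BACKEND.MARLIN)
```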

## Citation
```
@misc{gptqmodel,
2 changes: 1 addition & 1 deletion format/format.sh
@@ -5,7 +5,7 @@ cd "$(dirname "$0")" || exit
# force ruff/isort to be same version as setup.py
pip install -U ruff==0.4.9 isort==5.13.2

ruff check ../gptqmodel ../examples ../tests ../setup.py --fix
ruff check ../gptqmodel/models ../gptqmodel/nn_modules ../gptqmodel/quantization ../gptqmodel/utils ../gptqmodel/__init__.py ../examples ../tests ../setup.py --fix
ruff_status=$?

isort -l 119 -e ../
1 change: 1 addition & 0 deletions gptqmodel/integration/integration_vllm.py
@@ -1,6 +1,7 @@
def patch_vllm():
from vllm.model_executor.layers import quantization
from vllm.model_executor.layers.quantization import gptq_marlin

from .src.vllm import gptq_marlin as gptqmodel_marlin

quantization.QUANTIZATION_METHODS["gptq_marlin"] = gptqmodel_marlin.GPTQMarlinConfig
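For context on the hunk above: `patch_vllm()` swaps vLLM's registered `gptq_marlin` quantization config for GPTQModel's implementation. A minimal, hedged usage sketch, assuming the module path shown in this diff is also the intended public import and using a placeholder model id:

```python
# Hedged sketch: apply the monkey patch before constructing the vLLM engine so
# "gptq_marlin" resolves to GPTQModel's GPTQMarlinConfig registered above.
from gptqmodel.integration.integration_vllm import patch_vllm  # path assumed from this diff

patch_vllm()

from vllm import LLM, SamplingParams

llm = LLM(model="ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")  # placeholder id
outputs = llm.generate(["GPTQModel makes GPTQ quantization"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```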
23 changes: 9 additions & 14 deletions gptqmodel/integration/src/optimum/gptq/quantizer.py
@@ -21,35 +21,30 @@
from typing import Any, Dict, List, Optional, Tuple, Union

import torch
from gptqmodel.integration.src.optimum.utils.import_utils import is_gptqmodel_available
from optimum.gptq.constants import GPTQ_CONFIG
from optimum.gptq.data import get_dataset, prepare_dataset
from optimum.gptq.utils import get_block_name_with_pattern, get_device, get_layers, get_preceding_modules, get_seqlen
from optimum.utils import is_accelerate_available, is_auto_gptq_available
from optimum.utils.modeling_utils import recurse_getattr
from optimum.version import __version__ as optimum_version
from packaging import version
from torch import nn
from tqdm.auto import tqdm
from transformers import AutoTokenizer
from transformers.pytorch_utils import Conv1D
from transformers.utils.quantization_config import QuantizationMethod

from optimum.utils import is_accelerate_available, is_auto_gptq_available
from optimum.utils.modeling_utils import recurse_getattr
from optimum.gptq.constants import GPTQ_CONFIG
from optimum.gptq.data import get_dataset, prepare_dataset
from optimum.gptq.utils import get_block_name_with_pattern, get_device, get_layers, get_preceding_modules, get_seqlen
from optimum.version import __version__ as optimum_version

from gptqmodel.integration.src.optimum.utils.import_utils import is_gptqmodel_available

if is_accelerate_available():
from accelerate import (
cpu_offload_with_hook,
load_checkpoint_and_dispatch,
)
from accelerate import cpu_offload_with_hook, load_checkpoint_and_dispatch
from accelerate.hooks import remove_hook_from_module

if is_auto_gptq_available():
from auto_gptq import __version__ as autogptq_version
from auto_gptq import exllama_set_max_input_length
from auto_gptq.modeling._utils import autogptq_post_init as gptq_post_init
from auto_gptq.quantization import GPTQ
from auto_gptq.utils.import_utils import dynamically_import_QuantLinear as hf_select_quant_linear
from auto_gptq import __version__ as autogptq_version

if is_gptqmodel_available():
from gptqmodel import exllama_set_max_input_length
10 changes: 2 additions & 8 deletions gptqmodel/integration/src/optimum/utils/import_utils.py
@@ -24,14 +24,8 @@
from typing import Any, Callable, Dict, Iterable, Optional, Tuple

import torch

from optimum.utils import (
is_accelerate_available,
is_auto_gptq_available,
is_diffusers_available,
is_sentence_transformers_available,
is_timm_available,
)
from optimum.utils import (is_accelerate_available, is_auto_gptq_available, is_diffusers_available,
is_sentence_transformers_available, is_timm_available)

# Copyright 2022 The HuggingFace Team. All rights reserved.
#
11 changes: 2 additions & 9 deletions gptqmodel/integration/src/optimum/utils/testing_utils.py
@@ -24,16 +24,9 @@
from typing import Any, Callable, Dict, Iterable, Optional, Tuple

import torch

from optimum.utils import (
is_accelerate_available,
is_auto_gptq_available,
is_diffusers_available,
is_sentence_transformers_available,
is_timm_available,
)

from gptqmodel.integration.src.optimum.utils.import_utils import is_datasets_available, is_gptqmodel_available
from optimum.utils import (is_accelerate_available, is_auto_gptq_available, is_diffusers_available,
is_sentence_transformers_available, is_timm_available)

# Used to test the hub
USER = "__DUMMY_OPTIMUM_USER__"
17 changes: 5 additions & 12 deletions gptqmodel/integration/src/peft/tuners/adalora/model.py
@@ -15,22 +15,16 @@
import warnings

import torch
from peft.tuners.adalora import RankAllocator, AdaLoraLayer, SVDQuantLinear, SVDLinear
from transformers.pytorch_utils import Conv1D

from gptqmodel.integration.src.peft.utils import get_gptqmodel_quant_linear
from peft.import_utils import is_bnb_4bit_available, is_bnb_available
from peft.tuners.adalora import AdaLoraLayer, RankAllocator, SVDLinear, SVDQuantLinear
from peft.tuners.lora import LoraConfig, LoraModel
from peft.tuners.tuners_utils import BaseTunerLayer
from peft.utils import (
TRANSFORMERS_MODELS_TO_ADALORA_TARGET_MODULES_MAPPING,
_freeze_adapter,
_get_submodules,
get_auto_gptq_quant_linear,
get_quantization_config,
)
from peft.utils import (TRANSFORMERS_MODELS_TO_ADALORA_TARGET_MODULES_MAPPING, _freeze_adapter,
_get_submodules, get_auto_gptq_quant_linear, get_quantization_config)
from peft.utils.integrations import gather_params_ctx
from transformers.pytorch_utils import Conv1D

from gptqmodel.integration.src.peft.utils import get_gptqmodel_quant_linear
from ...import_utils import is_gptqmodel_available


@@ -155,7 +149,6 @@ def _create_new_module(lora_config, adapter_name, target, device_map, **kwargs):
# avoid eager bnb import
if is_bnb_available():
import bitsandbytes as bnb

from peft.tuners.adalora.bnb import SVDLinear8bitLt
if is_bnb_4bit_available():
from peft.tuners.adalora.bnb import SVDLinear4bit
6 changes: 2 additions & 4 deletions gptqmodel/integration/src/peft/tuners/lora/gptq.py
@@ -15,14 +15,12 @@
from typing import Any, Optional

import torch

from gptqmodel.integration.src.peft.import_utils import is_gptqmodel_available
from gptqmodel.integration.src.peft.utils import get_gptqmodel_quant_linear
from peft.tuners.lora.layer import LoraLayer
from peft.tuners.tuners_utils import BaseTunerLayer
from peft.utils import get_auto_gptq_quant_linear

from gptqmodel.integration.src.peft.import_utils import is_gptqmodel_available
from gptqmodel.integration.src.peft.utils import get_gptqmodel_quant_linear


class QuantLinear(torch.nn.Module, LoraLayer):
def __init__(
29 changes: 8 additions & 21 deletions gptqmodel/integration/src/peft/tuners/lora/model.py
@@ -23,28 +23,7 @@
from typing import Literal, Optional

import torch
from torch import nn
from tqdm import tqdm

from peft.import_utils import is_bnb_4bit_available, is_bnb_available
from peft.tuners.tuners_utils import (
BaseTuner,
BaseTunerLayer,
check_target_module_exists,
onload_layer,
replicate_layers,
)
from peft.utils import (
TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING,
ModulesToSaveWrapper,
_freeze_adapter,
_get_submodules,
get_peft_model_state_dict,
get_quantization_config,
)
from peft.utils.merge_utils import dare_linear, dare_ties, magnitude_prune, task_arithmetic, ties
from peft.utils.other import get_pattern_key

from peft.tuners.lora.aqlm import dispatch_aqlm
from peft.tuners.lora.awq import dispatch_awq
from peft.tuners.lora.config import LoraConfig
@@ -54,6 +33,14 @@
from peft.tuners.lora.layer import Conv2d, LoraLayer, dispatch_default
from peft.tuners.lora.torchao import dispatch_torchao
from peft.tuners.lora.tp_layer import dispatch_megatron
from peft.tuners.tuners_utils import (BaseTuner, BaseTunerLayer, check_target_module_exists,
onload_layer, replicate_layers)
from peft.utils import (TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING, ModulesToSaveWrapper,
_freeze_adapter, _get_submodules, get_peft_model_state_dict, get_quantization_config)
from peft.utils.merge_utils import dare_linear, dare_ties, magnitude_prune, task_arithmetic, ties
from peft.utils.other import get_pattern_key
from torch import nn
from tqdm import tqdm


def _adapter_names_pre_forward_hook(target, args, kwargs, adapter_names):