Merge branch 'ModelCloud:main' into main
ZX-ModelCloud authored Dec 13, 2024
2 parents 443db44 + 95dedbe commit d529374
Showing 22 changed files with 257 additions and 290 deletions.
55 changes: 26 additions & 29 deletions README.md
@@ -9,23 +9,26 @@
</p>

## News
* 12/12/2024 1.4.1-dev: Added Qwen2-VL model support. `mse` quantization property exposed in `QuantizeConfig`.
* 12/13/2024 [1.4.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.1): Added Qwen2-VL model support. `mse` quantization control exposed in `QuantizeConfig`. Monkey-patch `patch_vllm()` and `patch_hf()` APIs added so Transformers/Optimum/PEFT and vLLM can correctly load GPTQModel-quantized models while the upstream PRs are still pending (a hedged usage sketch follows this list).
* 12/10/2024 [1.4.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.0) `EvalPlus` harness integration merged upstream. We now support both `lm-eval` and `EvalPlus`. Added pure torch `Torch` kernel. Refactored the `Cuda` kernel into the `DynamicCuda` kernel. `Triton` kernel is now auto-padded for max model support. `Dynamic` quantization now supports both positive `+:` (default) and negative `-:` matching, which allows matched modules to be skipped entirely during quantization. Fixed auto-`Marlin` kernel selection. Added auto-kernel fallback for unsupported kernel/module pairs. Lots of internal refactoring and cleanup in preparation for the transformers/optimum/peft upstream PR merge. Deprecated saving in the `Marlin` weight format since `Marlin` supports auto conversion of the `gptq` format to `Marlin` at runtime.

* 11/29/2024 [1.3.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.3.1) Olmo2 model support. Intel XPU acceleration via IPEX. Model sharding Transformers compat fix due to API deprecation in HF. Removed the hard triton dependency; the Triton kernel now optionally depends on the triton pkg.
* 11/26/2024 [1.3.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.3.0) Zero-Day Hymba model support. Removed `tqdm` and `rogue` dependencies.
* 11/24/2024 [1.2.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.2.3) HF GLM model support. ClearML logging integration. Replaced the `gputil` + `psutil` dependencies with `device-smi`. Fixed model unit tests.
* 11/11/2024 🚀 [1.2.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.2.1) Meta MobileLLM model support added. `lm-eval[gptqmodel]` integration merged upstream. Intel/IPEX CPU inference merged, replacing QBits (deprecated). Auto-fix/patch ChatGLM-3/GLM-4 compat with the latest transformers. New `.load()` and `.save()` APIs.
* 10/29/2024 🚀 [1.1.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.1.0) IBM Granite model support. Full auto-buildless wheel install from PyPI. Reduced max CPU memory usage by >20% during quantization. 100% CI model/feature coverage.
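The monkey-patch APIs in the 1.4.1 entry above are only named, not shown. Below is a minimal, hedged sketch of loading a GPTQModel-quantized checkpoint through Transformers after `patch_hf()`; the import path for `patch_hf` and the model id are assumptions, not verified against the released package.

```python
# Hedged sketch: route Transformers' GPTQ loading through GPTQModel while the
# upstream Transformers/Optimum/PEFT PRs are still pending.
# Assumption: patch_hf() is importable from the top-level gptqmodel package.
from gptqmodel import patch_hf

patch_hf()  # apply the patch before the quantized model is loaded

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("GPTQModel is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```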

<details>

<summary>Archived News:</summary>
* 10/12/2024 ✨ [1.0.9](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.9) Moved AutoRound to an optional dependency and fixed the pip install regression in v1.0.8.

* 10/11/2024 ✨ [1.0.8](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.8) Added wheels for Python 3.12 and CUDA 11.8.
* 10/08/2024 ✨ [1.0.7](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.7) Fixed an issue where the Marlin (faster) kernel was not auto-selected for some models.

* 09/26/2024 ✨ [1.0.6](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.6) Fixed the quantized Llama 3.2 Vision model loader.
* 09/26/2024 ✨ [1.0.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.5) Partial Llama 3.2 Vision model support (mllama): only text-layer quantization is supported for now.

<details>

<summary>Archived News:</summary>
* 09/26/2024 ✨ [1.0.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.4) Integrated Liger Kernel support for ~1/2 memory reduction on some models during quantization. Added a control toggle to disable parallel packing.
* 09/18/2024 ✨ [1.0.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.3) Added Microsoft GRIN-MoE and MiniCPM3 support.
* 08/16/2024 ✨ [1.0.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.2) Support Intel/AutoRound v0.3, pre-built whl packages, and PyPI release.
@@ -56,18 +59,19 @@ GPTQModel started out as a major refactor (fork) of AutoGPTQ but has now morphe…
Public tests/papers and ModelCloud's internal tests have shown that GPTQ is on par with and/or exceeds other 4-bit quantization methods in terms of both quality recovery and production-level inference speed, in both token latency and requests per second (rps). GPTQ offers the optimal blend of quality and inference speed you would want in a real-world production system.

## Features
* 🚀 Extensive model support for: `Olmo2`, `Hymba`, `GLM`, `IBM Granite`, `Llama 3.2 Vision`, `MiniCPM3`, `GRIN-Moe`, `Phi 3.5`, `EXAONE 3.0`, `InternLM 2.5`, `Gemma 2`, `DeepSeek-V2`, `DeepSeek-V2-Lite`, `ChatGLM`, `MiniCPM`, `Phi-3`, `Qwen2MoE`, `DBRX`.
* ✨ 100% CI coverage for all supported models including quality/ppl regression.
* 🚀 Extensive model support for: `Llama 1-3.3`, `Qwen2-VL`, `Olmo2`, `Hymba`, `GLM`, `IBM Granite`, `Llama 3.2 Vision`, `MiniCPM3`, `GRIN-Moe`, `Phi 1-4`, `EXAONE 3.0`, `InternLM 2.5`, `Gemma 2`, `DeepSeek-V2`, `DeepSeek-V2-Lite`, `ChatGLM`, `MiniCPM`, `Qwen2MoE`, `DBRX`.
* 💯 100% CI unit-test coverage for all supported models and kernels including post-quantization quality regression.
* `Dynamic`/Mixed quantization control on a per-module basis. Each layer/module can have a unique quantization config or be excluded from quantization altogether (see the sketch after this list).
* 🚀 [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang) inference integration for quantized models where format = `FORMAT.GPTQ`.
* 🚀 [Intel/IPEX](https://github.com/intel/intel-extension-for-pytorch) 4bit quantization/inference support on CPU (`avx512_vnni`) and Intel/XPU.
* 🚀 [Intel/AutoRound](https://github.com/intel/auto-round) QUANT_METHOD support added for a potentially higher quality quantization, with `lm_head` module quantization support for even more VRAM reduction; format export to `FORMAT.GPTQ` for max inference compatibility.
* 🚀 [Microsoft/BITBLAS](https://github.com/microsoft/BitBLAS) format + dynamically compiled inference.
* 🚀 Asymmetric `Sym=False` support via `FORMAT.GPTQ_V2`.
* 🚀 `lm_head` module quant inference support for further VRAM reduction (auto-round only).
* 🚀 [Intel/IPEX](https://github.com/intel/intel-extension-for-pytorch) 4bit quantization/inference support on CPU (recent Intel/AMD) and Intel/XPU.
* 🚀 [Microsoft/BITBLAS](https://github.com/microsoft/BitBLAS) format + dynamically compiled inference.
* [Intel/AutoRound](https://github.com/intel/auto-round) QUANT_METHOD support added for a potentially higher quality quantization.
* Asymmetric `Sym=False` support via `FORMAT.GPTQ_V2`.
* `lm_head` module quant inference support for further VRAM reduction (auto-round only).
* 🚀 Faster quantization: More than 50% faster for TinyLlama + 4090 with batching and large calibration dataset.
* 🚀 Better quality quants as measured by PPL. (Test config: defaults + `sym=True` + `FORMAT.GPTQ`, TinyLlama)
* 🚀 Model weights sharding support
* 🚀 Security: hash check of model weights on load
* Better quality quants as measured by PPL. (Test config: defaults + `sym=True` + `FORMAT.GPTQ`, TinyLlama)
* Model weights sharding support
* Security: hash check of model weights on load
* 🚀 Over 50% faster PPL calculations (OPT)
* 🚀 Over 40% faster `packing` stage in quantization (Llama 3.1 8B)
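To make the `Dynamic` per-module control and the `mse` option above concrete, here is a minimal sketch, assuming `dynamic` maps `+:`/`-:` prefixed regex keys to per-module overrides and that `mse` is a float field on `QuantizeConfig`; the field values, regex patterns, and output path are illustrative only.

```python
# Hedged sketch: per-module ("Dynamic") quantization overrides plus MSE control.
# Assumptions: `dynamic` maps "+:regex"/"-:regex" keys to override dicts and
# `mse` is a float knob on QuantizeConfig; the values below are illustrative.
from gptqmodel import GPTQModel, QuantizeConfig
from transformers import AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    mse=2.4,  # assumed: enables MSE-based clipping during GPTQ
    dynamic={
        "+:model\\.layers\\.0\\..*": {"bits": 8, "group_size": 32},  # widen the first block
        "-:.*\\.mlp\\.gate.*": {},  # negative match: skip these modules entirely
    },
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
calibration = [tokenizer("gptqmodel is an easy-to-use llm quantization toolkit.")] * 32

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration)
model.save("tinyllama-1.1b-gptq-4bit-dynamic")
```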

## Model Support: 🚀 (Added by GPTQModel)
| Model | | | | | | | | |
| ---------------- | --- | -------------- | --- | ---------------- | --- | ---------- | --- | --- |
| Baichuan | ✅ | Falcon | ✅ | Llama 1/2/3 | ✅ | OLMo2 | 🚀 | |
| Bloom | ✅ | Gemma 2 | 🚀 | Llama 3.2 Vision | 🚀 | Phi/Phi-3 | 🚀 | |
| Baichuan | ✅ | Falcon | ✅ | Llama 1-3.3 | ✅ | OLMo2 | 🚀 | |
| Bloom | ✅ | Gemma 2 | 🚀 | Llama 3.2 Vision | 🚀 | Phi 1-4 | 🚀 | |
| ChatGLM | 🚀 | GPTBigCode | ✅ | LongLLaMA | ✅ | Qwen | ✅ | |
| CodeGen | ✅ | GPTNeoX | ✅ | MiniCPM3 | ✅ | Qwen2MoE | 🚀 | |
| Cohere | ✅ | GPT-2 | ✅ | Mistral | ✅ | Qwen2VL | 🚀 | |
| DeepSeek-V2-Lite | 🚀 | Hymba | 🚀 | MPT | ✅ | XVERSE | ✅ | |
| EXAONE 3.0 | 🚀 | InternLM 1/2.5 | 🚀 | OPT | ✅ | Yi | ✅ | |

## HW Accelerator Requirements
## Kernel and HW Accelerator Support

GPTQModel is validated for Linux x86_64 with the following devices:

| Device | | |
| ---------------- | --- | -------------- |
| Nvidia GPU | ✅ | Ampere or Higher |
| Intel/AMD CPU | ✅ | `avx512_vnni` or `amx` |
| Intel XPU | ✅ | Intel Arc + Datacenter Max |
| Device | | Optimized Arch | Kernels |
| ---------------- | --- | -------------- | -------------- |
| Nvidia GPU | ✅ | Ampere or Higher | Marlin, Exllama V2, Exllama V1, Triton, DynamicCuda, Torch |
| Intel/AMD CPU | ✅ | `avx512` or `amx` | IPEX, Torch |
| Intel XPU | ✅ | Intel Arc + Datacenter Max | IPEX, Torch |

## Install

@@ -199,13 +203,6 @@ lm_eval_results = GPTQModel.eval(model_id, framework=EVAL.LM_EVAL, tasks=[EVAL.L…
evalplus_results = GPTQModel.eval(model_id, framework=EVAL.EVALPLUS, tasks=[EVAL.EVALPLUS.HUMAN], output_file='evalplus_result.json')
```


### Which kernel is used by default?

* `GPU`: Marlin, Exllama v2, Exllama v1, DynamicCuda, Torch kernels in that order for maximum inference performance. Optional Microsoft/BITBLAS kernel can be toggled.
* `CPU`: Intel/IPEX kernel
* `XPU`: Intel/IPEX kernel
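Although this subsection is removed by the commit, the kernel order it documents maps naturally onto explicit kernel selection at load time. A hedged sketch follows; the `BACKEND` enum, the `backend` keyword, and the model id are assumptions.

```python
# Hedged sketch: rely on the automatic kernel order described above, or pin one.
# Assumptions: a BACKEND enum and a `backend=` keyword exist on GPTQModel.load();
# the model id below is a placeholder.
from gptqmodel import BACKEND, GPTQModel

model_id = "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit"  # placeholder

# Auto selection on GPU: Marlin -> Exllama v2 -> Exllama v1 -> DynamicCuda -> Torch.
model = GPTQModel.load(model_id)

# Or force a specific kernel instead of walking the fallback chain:
model_marlin = GPTQModel.load(model_id, backend=BACKEND.MARLIN)
```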

## Citation
```
@misc{gptqmodel,
2 changes: 1 addition & 1 deletion format/format.sh
@@ -5,7 +5,7 @@ cd "$(dirname "$0")" || exit
# force ruff/isort to be same version as setup.py
pip install -U ruff==0.4.9 isort==5.13.2

ruff check ../gptqmodel ../examples ../tests ../setup.py --fix
ruff check ../gptqmodel/models ../gptqmodel/nn_modules ../gptqmodel/quantization ../gptqmodel/utils ../gptqmodel/__init__.py ../examples ../tests ../setup.py --fix
ruff_status=$?

isort -l 119 -e ../
1 change: 1 addition & 0 deletions gptqmodel/integration/integration_vllm.py
@@ -1,6 +1,7 @@
def patch_vllm():
from vllm.model_executor.layers import quantization
from vllm.model_executor.layers.quantization import gptq_marlin

from .src.vllm import gptq_marlin as gptqmodel_marlin

quantization.QUANTIZATION_METHODS["gptq_marlin"] = gptqmodel_marlin.GPTQMarlinConfig
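For context on the hunk above: `patch_vllm()` swaps vLLM's registered `gptq_marlin` quantization config for GPTQModel's implementation. A minimal, hedged usage sketch, assuming the module path shown in this diff is also the intended public import and using a placeholder model id:

```python
# Hedged sketch: apply the monkey patch before constructing the vLLM engine so
# "gptq_marlin" resolves to GPTQModel's GPTQMarlinConfig registered above.
from gptqmodel.integration.integration_vllm import patch_vllm  # path assumed from this diff

patch_vllm()

from vllm import LLM, SamplingParams

llm = LLM(model="ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")  # placeholder id
outputs = llm.generate(["GPTQModel makes GPTQ quantization"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```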
23 changes: 9 additions & 14 deletions gptqmodel/integration/src/optimum/gptq/quantizer.py
@@ -21,35 +21,30 @@
from typing import Any, Dict, List, Optional, Tuple, Union

import torch
from gptqmodel.integration.src.optimum.utils.import_utils import is_gptqmodel_available
from optimum.gptq.constants import GPTQ_CONFIG
from optimum.gptq.data import get_dataset, prepare_dataset
from optimum.gptq.utils import get_block_name_with_pattern, get_device, get_layers, get_preceding_modules, get_seqlen
from optimum.utils import is_accelerate_available, is_auto_gptq_available
from optimum.utils.modeling_utils import recurse_getattr
from optimum.version import __version__ as optimum_version
from packaging import version
from torch import nn
from tqdm.auto import tqdm
from transformers import AutoTokenizer
from transformers.pytorch_utils import Conv1D
from transformers.utils.quantization_config import QuantizationMethod

from optimum.utils import is_accelerate_available, is_auto_gptq_available
from optimum.utils.modeling_utils import recurse_getattr
from optimum.gptq.constants import GPTQ_CONFIG
from optimum.gptq.data import get_dataset, prepare_dataset
from optimum.gptq.utils import get_block_name_with_pattern, get_device, get_layers, get_preceding_modules, get_seqlen
from optimum.version import __version__ as optimum_version

from gptqmodel.integration.src.optimum.utils.import_utils import is_gptqmodel_available

if is_accelerate_available():
from accelerate import (
cpu_offload_with_hook,
load_checkpoint_and_dispatch,
)
from accelerate import cpu_offload_with_hook, load_checkpoint_and_dispatch
from accelerate.hooks import remove_hook_from_module

if is_auto_gptq_available():
from auto_gptq import __version__ as autogptq_version
from auto_gptq import exllama_set_max_input_length
from auto_gptq.modeling._utils import autogptq_post_init as gptq_post_init
from auto_gptq.quantization import GPTQ
from auto_gptq.utils.import_utils import dynamically_import_QuantLinear as hf_select_quant_linear
from auto_gptq import __version__ as autogptq_version

if is_gptqmodel_available():
from gptqmodel import exllama_set_max_input_length
10 changes: 2 additions & 8 deletions gptqmodel/integration/src/optimum/utils/import_utils.py
@@ -24,14 +24,8 @@
from typing import Any, Callable, Dict, Iterable, Optional, Tuple

import torch

from optimum.utils import (
is_accelerate_available,
is_auto_gptq_available,
is_diffusers_available,
is_sentence_transformers_available,
is_timm_available,
)
from optimum.utils import (is_accelerate_available, is_auto_gptq_available, is_diffusers_available,
is_sentence_transformers_available, is_timm_available)

# Copyright 2022 The HuggingFace Team. All rights reserved.
#
11 changes: 2 additions & 9 deletions gptqmodel/integration/src/optimum/utils/testing_utils.py
@@ -24,16 +24,9 @@
from typing import Any, Callable, Dict, Iterable, Optional, Tuple

import torch

from optimum.utils import (
is_accelerate_available,
is_auto_gptq_available,
is_diffusers_available,
is_sentence_transformers_available,
is_timm_available,
)

from gptqmodel.integration.src.optimum.utils.import_utils import is_datasets_available, is_gptqmodel_available
from optimum.utils import (is_accelerate_available, is_auto_gptq_available, is_diffusers_available,
is_sentence_transformers_available, is_timm_available)

# Used to test the hub
USER = "__DUMMY_OPTIMUM_USER__"
17 changes: 5 additions & 12 deletions gptqmodel/integration/src/peft/tuners/adalora/model.py
@@ -15,22 +15,16 @@
import warnings

import torch
from peft.tuners.adalora import RankAllocator, AdaLoraLayer, SVDQuantLinear, SVDLinear
from transformers.pytorch_utils import Conv1D

from gptqmodel.integration.src.peft.utils import get_gptqmodel_quant_linear
from peft.import_utils import is_bnb_4bit_available, is_bnb_available
from peft.tuners.adalora import AdaLoraLayer, RankAllocator, SVDLinear, SVDQuantLinear
from peft.tuners.lora import LoraConfig, LoraModel
from peft.tuners.tuners_utils import BaseTunerLayer
from peft.utils import (
TRANSFORMERS_MODELS_TO_ADALORA_TARGET_MODULES_MAPPING,
_freeze_adapter,
_get_submodules,
get_auto_gptq_quant_linear,
get_quantization_config,
)
from peft.utils import (TRANSFORMERS_MODELS_TO_ADALORA_TARGET_MODULES_MAPPING, _freeze_adapter,
_get_submodules, get_auto_gptq_quant_linear, get_quantization_config)
from peft.utils.integrations import gather_params_ctx
from transformers.pytorch_utils import Conv1D

from gptqmodel.integration.src.peft.utils import get_gptqmodel_quant_linear
from ...import_utils import is_gptqmodel_available


@@ -155,7 +149,6 @@ def _create_new_module(lora_config, adapter_name, target, device_map, **kwargs):
# avoid eager bnb import
if is_bnb_available():
import bitsandbytes as bnb

from peft.tuners.adalora.bnb import SVDLinear8bitLt
if is_bnb_4bit_available():
from peft.tuners.adalora.bnb import SVDLinear4bit
6 changes: 2 additions & 4 deletions gptqmodel/integration/src/peft/tuners/lora/gptq.py
@@ -15,14 +15,12 @@
from typing import Any, Optional

import torch

from gptqmodel.integration.src.peft.import_utils import is_gptqmodel_available
from gptqmodel.integration.src.peft.utils import get_gptqmodel_quant_linear
from peft.tuners.lora.layer import LoraLayer
from peft.tuners.tuners_utils import BaseTunerLayer
from peft.utils import get_auto_gptq_quant_linear

from gptqmodel.integration.src.peft.import_utils import is_gptqmodel_available
from gptqmodel.integration.src.peft.utils import get_gptqmodel_quant_linear


class QuantLinear(torch.nn.Module, LoraLayer):
def __init__(
29 changes: 8 additions & 21 deletions gptqmodel/integration/src/peft/tuners/lora/model.py
@@ -23,28 +23,7 @@
from typing import Literal, Optional

import torch
from torch import nn
from tqdm import tqdm

from peft.import_utils import is_bnb_4bit_available, is_bnb_available
from peft.tuners.tuners_utils import (
BaseTuner,
BaseTunerLayer,
check_target_module_exists,
onload_layer,
replicate_layers,
)
from peft.utils import (
TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING,
ModulesToSaveWrapper,
_freeze_adapter,
_get_submodules,
get_peft_model_state_dict,
get_quantization_config,
)
from peft.utils.merge_utils import dare_linear, dare_ties, magnitude_prune, task_arithmetic, ties
from peft.utils.other import get_pattern_key

from peft.tuners.lora.aqlm import dispatch_aqlm
from peft.tuners.lora.awq import dispatch_awq
from peft.tuners.lora.config import LoraConfig
@@ -54,6 +33,14 @@
from peft.tuners.lora.layer import Conv2d, LoraLayer, dispatch_default
from peft.tuners.lora.torchao import dispatch_torchao
from peft.tuners.lora.tp_layer import dispatch_megatron
from peft.tuners.tuners_utils import (BaseTuner, BaseTunerLayer, check_target_module_exists,
onload_layer, replicate_layers)
from peft.utils import (TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING, ModulesToSaveWrapper,
_freeze_adapter, _get_submodules, get_peft_model_state_dict, get_quantization_config)
from peft.utils.merge_utils import dare_linear, dare_ties, magnitude_prune, task_arithmetic, ties
from peft.utils.other import get_pattern_key
from torch import nn
from tqdm import tqdm


def _adapter_names_pre_forward_hook(target, args, kwargs, adapter_names):