Merge remote-tracking branch 'origin/main' into basemodel_add_Modality
ZX-ModelCloud committed Dec 20, 2024
2 parents 1456e59 + d90adde commit e8012bb
Showing 21 changed files with 75 additions and 102 deletions.
1 change: 1 addition & 0 deletions .github/workflows/unit_tests.yml
@@ -38,6 +38,7 @@ on:
default: '10'

env:
DEBUG: 1
CUDA_DEVICE_ORDER: PCI_BUS_ID
CUDA_VISIBLE_DEVICES: 0
TORCH_CUDA_ARCH_LIST: '8.9'
6 changes: 4 additions & 2 deletions LICENSE
@@ -1,4 +1,6 @@
Apache License
Copyright 2024- ModelCloud.ai. All rights reserved.

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

@@ -198,4 +200,4 @@
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
limitations under the License.
7 changes: 4 additions & 3 deletions README.md
@@ -9,7 +9,7 @@
</p>

## News
* 12/16/2024 [1.4.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.5): Windows 11 support added/validated. Ovis VL model support with image dataset calibration. Fixed `dynamic` loading. Reduced quantization vram usage.
* 12/19/2024 [1.4.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.5): Windows 11 support added/validated. Ovis VL model support with image dataset calibration. Fixed `dynamic` loading. Reduced quantization vram usage.
* 12/15/2024 [1.4.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.2): MacOS `gpu` (Metal) and `cpu` (M+) support added/validated for inference and quantization. Cohere 2 model support added.
* 12/13/2024 [1.4.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.1): Added Qwen2-VL model support. `mse` quantization control exposed in `QuantizeConfig`. Monkey patch `patch_vllm()` and `patch_hf()` APIs added to allow Transformers/Optimum/PEFT and vLLM to correctly load GPTQModel quantized models while upstream PRs are pending.
* 12/10/2024 [1.4.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.0) `EvalPlus` harness integration merged upstream. We now support both `lm-eval` and `EvalPlus`. Added pure torch `Torch` kernel. Refactored `Cuda` kernel to be `DynamicCuda` kernel. `Triton` kernel now auto-padded for max model support. `Dynamic` quantization now supports both positive `+:` (default) and negative `-:` matching, which allows matched modules to be skipped entirely for quantization. Fixed auto-`Marlin` kernel selection. Added auto-kernel fallback for unsupported kernel/module pairs. Lots of internal refactoring and cleanup in preparation for the transformers/optimum/peft upstream PR merge. Deprecated the saving of the `Marlin` weight format since `Marlin` supports auto conversion of `gptq` format to `Marlin` during runtime.
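
For the `Dynamic` matching described in the 1.4.0 entry above, here is a minimal sketch of how such per-module rules could be expressed, assuming `QuantizeConfig` takes a `dynamic` dict keyed by regex patterns; the patterns and override values are illustrative and not taken from this commit.

```python
# Hedged sketch of the `Dynamic` rule syntax from the 1.4.0 news entry:
# "+:" (the default) applies per-module overrides; "-:" skips matched modules.
# The regex patterns and override values below are illustrative assumptions.
from gptqmodel.quantization.config import QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    dynamic={
        r"+:model\.layers\.0\..*": {"bits": 8, "group_size": 32},  # widen precision for layer 0
        r"-:.*\.gate.*": {},                                       # exclude gate modules entirely
    },
)
```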
@@ -70,7 +70,7 @@ Public tests/papers and ModelCloud's internal tests have shown that GPTQ is on-p
* 🚀 [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang) inference integration for quantized model where format = `FORMAT.GPTQ`
* 🚀 [Intel/IPEX](https://github.com/intel/intel-extension-for-pytorch) hardware accelerated quantization/inference for CPU [`avx`, `amx`, `xmx`] and Intel GPU [`Arc` + `Datacenter Max`].
* 🚀 [Microsoft/BITBLAS](https://github.com/microsoft/BitBLAS) format + dynamically compiled inference.
* [Intel/AutoRound](https://github.com/intel/auto-round) support for potentially higher quality quantization.
* [Intel/AutoRound](https://github.com/intel/auto-round) alternative gptq-inference compatible quantization method.
* ✨ Asymmetric `Sym=False` support.
* `lm_head` module quant inference support for further VRAM reduction (auto-round only).
* 🚀 Faster quantization: More than 50% faster for TinyLlama + 4090 with batching and large calibration dataset.
@@ -210,12 +210,13 @@ evalplus_results = GPTQModel.eval(model_id, framework=EVAL.EVALPLUS, tasks=[EVAL
## Citation
```
@misc{gptqmodel,
author = {ModelCloud.ai},
author = {ModelCloud.ai and [email protected]},
title = {GPTQModel},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/modelcloud/gptqmodel}},
note = {Contact: [email protected]}
}
@article{frantar-gptq,
3 changes: 2 additions & 1 deletion gptqmodel/integration/integration_vllm.py
@@ -17,7 +17,8 @@ def get_quantization_config(quant: str) -> Type[quantization.QuantizationConfig]
from vllm.model_executor.layers.quantization.awq import AWQConfig
from vllm.model_executor.layers.quantization.awq_marlin import AWQMarlinConfig
from vllm.model_executor.layers.quantization.bitsandbytes import BitsAndBytesConfig
from vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors import CompressedTensorsConfig # noqa: E501
from vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors import \
CompressedTensorsConfig # noqa: E501
from vllm.model_executor.layers.quantization.deepspeedfp import DeepSpeedFPConfig
from vllm.model_executor.layers.quantization.experts_int8 import ExpertsInt8Config
from vllm.model_executor.layers.quantization.fbgemm_fp8 import FBGEMMFp8Config
8 changes: 3 additions & 5 deletions gptqmodel/integration/src/optimum/gptq/quantizer.py
@@ -21,7 +21,6 @@

import torch
from gptqmodel.integration.src.optimum.utils.import_utils import is_gptqmodel_available
from .utils import nested_move_to
from optimum.gptq.constants import GPTQ_CONFIG
from optimum.gptq.data import get_dataset, prepare_dataset
from optimum.gptq.utils import get_block_name_with_pattern, get_device, get_layers, get_preceding_modules, get_seqlen
@@ -35,11 +34,10 @@
from transformers.pytorch_utils import Conv1D
from transformers.utils.quantization_config import QuantizationMethod

from .utils import nested_move_to

if is_accelerate_available():
from accelerate import (
cpu_offload_with_hook,
load_checkpoint_and_dispatch,
)
from accelerate import cpu_offload_with_hook, load_checkpoint_and_dispatch
from accelerate.hooks import remove_hook_from_module

if is_auto_gptq_available():
8 changes: 2 additions & 6 deletions gptqmodel/models/auto.py
@@ -16,6 +16,7 @@
from huggingface_hub import list_repo_files
from transformers import AutoConfig

from ..nn_modules.qlinear.ipex import HAS_IPEX
from ..quantization import QUANT_CONFIG_FILENAME
from ..utils import BACKEND, EVAL
from ..utils.logger import setup_logger
@@ -57,6 +58,7 @@
from .definitions.mpt import MPTGPTQ
from .definitions.olmo2 import Olmo2GPTQ
from .definitions.opt import OPTGPTQ
from .definitions.ovis import OvisGPTQ
from .definitions.phi import PhiGPTQ
from .definitions.phi3 import Phi3GPTQ
from .definitions.qwen import QwenGPTQ
@@ -68,7 +70,6 @@
from .definitions.starcoder2 import Starcoder2GPTQ
from .definitions.xverse import XverseGPTQ
from .definitions.yi import YiGPTQ
from .definitions.ovis import OvisGPTQ


logger = setup_logger()
@@ -125,11 +126,6 @@
"ovis": OvisGPTQ,
}

HAS_IPEX = False
try:
HAS_IPEX = True
except Exception:
pass

class GPTQModel:
def __init__(self):
2 changes: 1 addition & 1 deletion gptqmodel/models/definitions/__init__.py
@@ -32,6 +32,7 @@
from .mpt import MPTGPTQ
from .olmo2 import Olmo2GPTQ
from .opt import OPTGPTQ
from .ovis import OvisGPTQ
from .phi import PhiGPTQ
from .phi3 import Phi3GPTQ
from .qwen import QwenGPTQ
@@ -43,4 +44,3 @@
from .starcoder2 import Starcoder2GPTQ
from .xverse import XverseGPTQ
from .yi import YiGPTQ
from .ovis import OvisGPTQ
6 changes: 3 additions & 3 deletions gptqmodel/models/definitions/ovis.py
@@ -1,12 +1,12 @@
import copy
import logging

from ..base import BaseGPTQModel
import torch

from ...utils.calibration_dataset import batched
from ...utils.image import fetch_image
from ...utils.model import MODALITY
import torch

from ..base import BaseGPTQModel


class OvisGPTQ(BaseGPTQModel):
7 changes: 7 additions & 0 deletions gptqmodel/nn_modules/qlinear/dynamic_cuda.py
@@ -1,4 +1,5 @@
# License: GPTQModel/licenses/LICENSE.apache
from typing import Tuple, Optional

import torch

@@ -67,6 +68,12 @@ def __init__(

assert infeatures % 64 == 0 and outfeatures % 64 == 0

@classmethod
def validate(cls, **args) -> Tuple[bool, Optional[Exception]]:
if gptqmodel_cuda_import_exception is not None:
return False, gptqmodel_cuda_import_exception
return cls._validate(**args)

def forward(self, x: torch.Tensor):
out_shape = x.shape[:-1] + (self.outfeatures,)
x = x.reshape(-1, x.shape[-1])
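
The `validate` override added in this hunk fails fast with the captured CUDA import error and otherwise defers to the shared `_validate`. A hedged sketch of how a caller might probe the kernel before selecting it follows; the class name and keyword arguments are assumptions for illustration, not taken from this commit.

```python
# Hedged sketch: query the DynamicCuda kernel's validate() gate before use.
# Class name and keyword names are assumptions; a failed CUDA extension import
# is surfaced as the returned exception instead of raising at construction time.
from gptqmodel.nn_modules.qlinear.dynamic_cuda import DynamicCudaQuantLinear

ok, err = DynamicCudaQuantLinear.validate(bits=4, group_size=128, sym=True)
if not ok:
    print(f"DynamicCuda kernel unavailable, falling back: {err}")
```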
3 changes: 0 additions & 3 deletions gptqmodel/nn_modules/qlinear/ipex.py
@@ -137,9 +137,6 @@ def __init__(

@classmethod
def validate(cls, **args) -> Tuple[bool, Optional[Exception]]:
if sys.platform != "linux":
return False, Exception("IPEX is only available on Linux platform.")

if not HAS_IPEX:
return False, IPEX_ERROR_LOG
return cls._validate(**args)
1 change: 1 addition & 0 deletions gptqmodel/quantization/config.py
@@ -11,6 +11,7 @@

from ..utils.logger import setup_logger


logger = setup_logger()

FORMAT_FIELD_CODE = "format"
3 changes: 2 additions & 1 deletion gptqmodel/utils/model.py
@@ -29,12 +29,13 @@
from ..nn_modules.qlinear.ipex import IPEXQuantLinear
from ..nn_modules.qlinear.torch import TorchQuantLinear
from ..quantization import FORMAT, QuantizeConfig
from ..quantization.config import dynamic_get
from .backend import BACKEND
from .importer import select_quant_linear
from .logger import setup_logger
from .progress import ProgressBar
from .torch import torch_empty_cache
from ..quantization.config import dynamic_get


logger = setup_logger()

2 changes: 1 addition & 1 deletion gptqmodel/version.py
@@ -1 +1 @@
__version__ = "1.4.5-dev"
__version__ = "1.4.6-dev"
2 changes: 1 addition & 1 deletion requirements.txt
@@ -7,6 +7,6 @@ transformers>=4.46.3
threadpoolctl>=3.5.0
packaging>=24.2
setuptools>=75.5.0
device-smi==0.3.2
device-smi==0.3.3
sentencepiece>=0.2.0
protobuf>=5.29.1
8 changes: 6 additions & 2 deletions tests/models/model_test.py
@@ -8,6 +8,10 @@
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# -- end do not touch
from pathlib import Path # noqa: E402


sys.path.insert(0, f"{str(Path(__file__).resolve().parent.parent)}/models") # noqa: E402
import contextlib # noqa: E402
import shutil # noqa: E402
import tempfile # noqa: E402
@@ -16,14 +20,14 @@
import torch.cuda # noqa: E402
from datasets import load_dataset # noqa: E402
from lm_eval.utils import make_table # noqa: E402
from ovis.image_to_test_dataset import get_calib_dataset # noqa: E402
from transformers import AutoTokenizer # noqa: E402

from gptqmodel import BACKEND, GPTQModel # noqa: E402
from gptqmodel.nn_modules.qlinear import BaseQuantLinear # noqa: E402
from gptqmodel.quantization import FORMAT # noqa: E402
from gptqmodel.quantization.config import QuantizeConfig # noqa: E402
from gptqmodel.utils.eval import lm_eval # noqa: E402
from ovis.image_to_test_dataset import get_calib_dataset # noqa: E402
from gptqmodel.utils.torch import torch_empty_cache # noqa: E402


@@ -132,7 +136,7 @@ def quantModel(self, model_id_or_path, trust_remote_code=False, torch_dtype="aut

is_quantized = model.quantized
if not is_quantized:
model.quantize(calibration_dataset)
model.quantize(calibration_dataset, batch_size=4)

self.check_kernel(model, self.KERNEL_QUANT)

28 changes: 15 additions & 13 deletions tests/models/test_ovis_1_6_llama.py
@@ -1,10 +1,12 @@
import os.path
import tempfile

from PIL import Image
import torch
from gptqmodel import GPTQModel
from model_test import ModelTest
from PIL import Image

from gptqmodel import GPTQModel


class TestOvis1_6_Llama(ModelTest):
NATIVE_MODEL_ID = "/monster/data/model/Ovis1.6-Llama3.2-3B"
@@ -42,17 +44,17 @@ def test_ovis_1_6(self):

# generate output
with torch.inference_mode():
gen_kwargs = dict(
max_new_tokens=1024,
do_sample=False,
top_p=None,
top_k=None,
temperature=None,
repetition_penalty=None,
eos_token_id=model.generation_config.eos_token_id,
pad_token_id=text_tokenizer.pad_token_id,
use_cache=True
)
gen_kwargs = {
"max_new_tokens": 1024,
"do_sample": False,
"top_p": None,
"top_k": None,
"temperature": None,
"repetition_penalty": None,
"eos_token_id": model.generation_config.eos_token_id,
"pad_token_id": text_tokenizer.pad_token_id,
"use_cache": True
}
output_ids = \
model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
output = text_tokenizer.decode(output_ids, skip_special_tokens=True)
2 changes: 1 addition & 1 deletion tests/test_asym_gptq_v1.py
@@ -10,7 +10,7 @@


class Test(ModelTest):
NATIVE_MODEL_ID = "ModelCloud/Llama3.2-1B-Instruct" # "meta-llama/Llama-3.2-1B-Instruct"
NATIVE_MODEL_ID = "/monster/data/model/Llama-3.2-1B-Instruct" # "meta-llama/Llama-3.2-1B-Instruct"
NATIVE_ARC_CHALLENGE_ACC = 0.3567
NATIVE_ARC_CHALLENGE_ACC_NORM = 0.3805
QUANT_ARC_MAX_DELTA_FLOOR_PERCENT = 0.36
2 changes: 1 addition & 1 deletion tests/test_dynamic.py
@@ -127,4 +127,4 @@ def test_skip_module(self):

print(f"generate_str: {generate_str}")

self.assertIn("paris", generate_str.lower())
self.assertIn("paris", generate_str.lower())
1 change: 1 addition & 0 deletions tests/test_evalplus.py
@@ -1,6 +1,7 @@
# -- do not touch
import os


os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# -- end do not touch

2 changes: 1 addition & 1 deletion tests/test_ipex_xpu.py
@@ -25,7 +25,7 @@ def test(self):
)
tokenizer = self.load_tokenizer(self.NATIVE_MODEL_ID)
calibration_dataset = self.load_dataset(tokenizer)
origin_model.quantize(calibration_dataset)
origin_model.quantize(calibration_dataset, backend=BACKEND.IPEX)
with tempfile.TemporaryDirectory() as tmpdir:
origin_model.save(tmpdir)
