From 3c2bf862cefec6625c6b4b49319ff86ed9bb968d Mon Sep 17 00:00:00 2001
From: minhthuc
Date: Tue, 8 Oct 2024 18:20:09 +0200
Subject: [PATCH] update doc AWQ quantization

---
 docs/quantization.md | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/docs/quantization.md b/docs/quantization.md
index 296c57000..ae79ee6b9 100644
--- a/docs/quantization.md
+++ b/docs/quantization.md
@@ -165,18 +165,26 @@ In this mode, all model weights are stored in BF16 and all layers are run with t
 
 ### 4-bit AWQ
 
-The compute type would be `int32_float16`
-
 **Supported on:**
 
 * NVIDIA GPU with Compute Capability >= 7.5
 
+CTranslate2 internally handles the compute type for AWQ quantization. In this mode, all model weights are stored in half precision and all layers are run in half precision. Other parameters, such as the scales and zero points, are stored in ``int32``.
 
-For example,
+**Steps to use AWQ quantization:**
+
+* Download an AWQ-quantized model from Hugging Face, for example [TheBloke/Llama-2-7B-AWQ](https://huggingface.co/TheBloke/Llama-2-7B-AWQ), or quantize your own model using this [AutoAWQ example](https://casper-hansen.github.io/AutoAWQ/examples/).
+* Convert the AWQ-quantized model to the CTranslate2 format:
 
 ```bash
 ct2-transformers-converter --model TheBloke/Llama-2-7B-AWQ --copy_files tokenizer.model --output_dir ct2_model
 ```
 
-We have to quantize the model with AWQ first, then convert it to CT2 format.
\ No newline at end of file
+* Run inference as usual with CTranslate2:
+```python
+model = ctranslate2.Generator('ct2_model', device='cuda')
+outputs = model.generate_batch([tokens])
+```
+
+Currently, CTranslate2 only supports the GEMM and GEMV kernels for AWQ quantization.
\ No newline at end of file
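
For context on the inference step added by this patch, a minimal end-to-end sketch is shown below. It assumes the converted model directory is `ct2_model` and that the Hugging Face tokenizer of the original checkpoint is used to build the token list passed to `generate_batch`; the prompt and the sampling parameters are illustrative, not prescribed by the patch.

```python
import ctranslate2
import transformers

# Load the converted CTranslate2 model on the GPU (AWQ requires CUDA, CC >= 7.5).
generator = ctranslate2.Generator("ct2_model", device="cuda")

# Reuse the tokenizer of the original AWQ checkpoint.
tokenizer = transformers.AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-AWQ")

prompt = "What is AWQ quantization?"  # illustrative prompt
# generate_batch expects token strings, not raw text or ids.
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

# Generation parameters are illustrative.
results = generator.generate_batch([tokens], max_length=128, sampling_topk=10)

output_ids = results[0].sequences_ids[0]
print(tokenizer.decode(output_ids))
```

The round trip through `convert_ids_to_tokens` is shown because `generate_batch` consumes token strings rather than raw text, which is easy to miss when reading the shorter snippet in the patch.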