Update doc AWQ quantization #1795

Merged · 1 commit · Oct 10, 2024
16 changes: 12 additions & 4 deletions docs/quantization.md
@@ -165,18 +165,26 @@ In this mode, all model weights are stored in BF16 and all layers are run with t

### 4-bit AWQ

**Supported on:**

* NVIDIA GPU with Compute Capability >= 7.5

The compute type for AWQ models is `int32_float16`, and CTranslate2 handles it internally.
In this mode, all model weights are stored in half precision and all layers are run in half precision; other parameters, such as the quantization scales and zero points, are stored in ``int32``.
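As a quick sanity check for the hardware requirement above, the device's compute capability can be queried, for example through PyTorch. This is only a sketch assuming PyTorch is installed; the check itself is not part of CTranslate2:

```python
import torch

def supports_awq_kernels(device_index: int = 0) -> bool:
    """Return True if the GPU meets the Compute Capability >= 7.5 requirement."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability(device_index)
    return (major, minor) >= (7, 5)

print(supports_awq_kernels())
```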

**Steps to use AWQ Quantization:**

* Download an AWQ-quantized model from Hugging Face, for example [TheBloke/Llama-2-7B-AWQ](https://huggingface.co/TheBloke/Llama-2-7B-AWQ), or quantize your own model following this [AutoAWQ example](https://casper-hansen.github.io/AutoAWQ/examples/).

* Convert the AWQ-quantized model to the CTranslate2 format (a Python equivalent is sketched right after this list):
```bash
ct2-transformers-converter --model TheBloke/Llama-2-7B-AWQ --copy_files tokenizer.model --output_dir ct2_model
```

Note that the model must be quantized with AWQ first and then converted to the CTranslate2 format.
* Run inference as usual with CTranslate2 (a fuller end-to-end sketch closes this section):
```python
import ctranslate2

# `tokens` is the tokenized prompt: a list of token strings.
model = ctranslate2.Generator('ct2_model', device='cuda')
outputs = model.generate_batch([tokens])
```
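The same conversion can also be run from Python through `ctranslate2.converters.TransformersConverter`; a minimal sketch, assuming the `copy_files` argument mirrors the CLI flag above:

```python
from ctranslate2.converters import TransformersConverter

# Python equivalent of the ct2-transformers-converter command above.
converter = TransformersConverter(
    "TheBloke/Llama-2-7B-AWQ",
    copy_files=["tokenizer.model"],
)
converter.convert("ct2_model")
```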

Currently, CTranslate2 supports only the GEMM and GEMV kernels for AWQ quantization.
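
Putting the steps together, a minimal end-to-end sketch is shown below. It assumes the converted model lives in `ct2_model` and reuses the tokenizer of the original AWQ checkpoint to produce the token strings that `generate_batch` expects; the prompt and generation parameters are illustrative only.

```python
import ctranslate2
import transformers

# Load the converted model and the tokenizer of the original AWQ checkpoint.
generator = ctranslate2.Generator("ct2_model", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-AWQ")

# generate_batch expects token strings, not token ids.
prompt = "What is AWQ quantization?"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch(
    [tokens],
    max_length=128,
    sampling_temperature=0.8,
    include_prompt_in_result=False,
)

# Decode the generated token ids back to text.
print(tokenizer.decode(results[0].sequences_ids[0]))
```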