FMS Model Optimizer supports quantization of models, which enables the use of reduced-precision numerical formats and specialized hardware to accelerate inference performance (i.e., make "calling a model" faster).
This is an example of block-sequential PTQ. Unlike quantization-aware training (QAT), which trains the whole quantized model against the task loss, PTQ tunes only one block at a time. Note that a "block" here could be a single layer, a transformer block, or a residual block; in this example we chose the transformer block, as it provides better accuracy. Furthermore, instead of using the task loss, PTQ relies on an MSE loss measuring the difference between a block's original FP32 output and its quantized output. The benefit of PTQ is that it requires far fewer computational resources and usually a shorter tuning time. One potential drawback is that the accuracy can be lower than what QAT achieves, but in many cases PTQ is comparable with QAT.
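To make the idea concrete, here is a minimal, simplified sketch of block-sequential tuning. It is purely illustrative and not the FMS Model Optimizer implementation; the helper name, the Adam optimizer, and the single-tensor block interface are assumptions:

```python
import torch

def block_sequential_ptq(fp32_blocks, quant_blocks, calib_inputs, n_iters=1000, lr=1e-5):
    """Tune each quantized block against its FP32 counterpart, one block at a time (sketch)."""
    mse = torch.nn.MSELoss()
    x_fp32 = [x.clone() for x in calib_inputs]   # inputs seen by the FP32 reference blocks
    x_quant = [x.clone() for x in calib_inputs]  # inputs seen by the quantized blocks
    for fp_blk, q_blk in zip(fp32_blocks, quant_blocks):
        opt = torch.optim.Adam(q_blk.parameters(), lr=lr)
        for step in range(n_iters):
            j = step % len(x_quant)                # cycle through the calibration batches
            with torch.no_grad():
                target = fp_blk(x_fp32[j])         # original FP32 output of this block
            loss = mse(q_blk(x_quant[j]), target)  # MSE between quantized and FP32 outputs
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Propagate both sets of activations so the next block sees realistic inputs.
        with torch.no_grad():
            x_fp32 = [fp_blk(x) for x in x_fp32]
            x_quant = [q_blk(x) for x in x_quant]
    return quant_blocks
```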
- FMS Model Optimizer requirements
- The inferencing step requires NVIDIA GPUs with compute capability > 8.0 (A100 family or higher)
- NVIDIA cutlass package (clone the source; do not `pip install`). Preferably place it in the user's home directory:
  `cd ~ && git clone https://github.com/NVIDIA/cutlass.git`
- Ninja
- PyTorch 2.3.1 (newer versions will cause issues with the custom CUDA kernel; a quick environment check is sketched after this list)
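A quick sanity check of the environment can be done in Python before running the example. This is a minimal sketch using standard PyTorch calls and is not part of the example scripts:

```python
import torch

# Verify the requirements listed above: a CUDA GPU of sufficient compute capability
# and the PyTorch version expected by the custom CUDA kernel.
assert torch.cuda.is_available(), "A CUDA-capable GPU is required for the inferencing step"
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")      # A100-family GPUs report 8.0
print(f"PyTorch version:    {torch.__version__}")  # the INT8 kernel expects 2.3.1
```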
Note
This example is based on the HuggingFace Transformers Question answering example. Unlike our QAT example, which utilizes the training loop of the original code, our PTQ function will control the loop and the program will end before entering the original loop. Make sure the model doesn't get "tuned" twice!
There are three main steps to try out the example:

1. Fine-tune a model with 16-bit floating point (FP16) precision:
```bash
export CUDA_VISIBLE_DEVICES=0

python run_qa_no_trainer_ptq.py \
    --model_name_or_path google-bert/bert-base-uncased \
    --dataset_name squad \
    --per_device_train_batch_size 12 \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ./fp16_ft_squad/ \
    --with_tracking \
    --report_to tensorboard \
    --attn_impl eager
```
Tip

The script can take up to 20 minutes to run (on a single A100). By default, it is configured for detailed logging. You can disable the logging by removing the `with_tracking` and `report_to` flags in the script.

2. Apply quantization using PTQ:
```bash
python run_qa_no_trainer_ptq.py \
    --model_name_or_path ./fp16_ft_squad \
    --dataset_name squad \
    --per_device_train_batch_size 12 \
    --seed 0 \
    --do_ptq \
    --ptq_nbatch 128 \
    --ptq_batchsize 12 \
    --ptq_nouterloop 1000 \
    --ptq_coslr WA \
    --ptq_lrw 1e-05 \
    --ptq_lrcv_w 0.001 \
    --ptq_lrcv_a 0.001 \
    --output_dir ./ptq_on_fp16ft \
    --with_tracking \
    --report_to tensorboard
```
Tip

The `model_name_or_path` from this section should match the `output_dir` from the previous section (step 1).
3. Compare the accuracy and inference speed of 16-bit floating point (FP16) and 8-bit integer (INT8) precision models:
Note

All parameters are default, except for the batch size and `do_lowering`.
```bash
export TOKENIZERS_PARALLELISM=false

python run_qa_no_trainer_ptq.py \
    --model_name_or_path ./ptq_on_fp16ft \
    --dataset_name squad \
    --per_device_train_batch_size 128 \
    --per_device_eval_batch_size 128 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --attn_impl eager \
    --do_lowering
```
Check out the Example Test Results to compare against your results.
The table below shows results obtained for the conditions listed:
| model | ptq_nbatch | Nouterloop | F1 score | PTQ tuning time (min.) |
|---|---|---|---|---|
| BERT | 128 | 500 | 81.5 | ~10 |
| BERT | 128 | 1000 | 85.08 | ~16 |
| BERT | 128 | 2000 | 86.78 | ~25 |
| BERT | 128 | 3000 | 87.63 | ~35 |
| BERT | 1000 | 2000 | 86.82 | ~44 |
| BERT | 1000 | 3000 | 87.50 | ~54 |
`Nouterloop` and `ptq_nbatch` are PTQ-specific hyper-parameters. The above experiments were run on a V100 machine.
In this section, we will take a deep dive into what happens during the example steps.
There are three parts to the example:
1. Fine-tune a model with 16-bit floating point (FP16) precision
Fine-tunes a BERT model on the question-answering dataset SQuAD. This step is based on the HuggingFace Transformers Question answering example, modified to collect additional training information in case we want to tweak the hyper-parameters later.
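The extra logging goes through HuggingFace Accelerate's tracking API (the `accelerator.get_tracker(...)` call in step 2 relies on it). Below is a minimal sketch of that pattern; the project name is illustrative, and the config values are simply taken from the step-1 command:

```python
from accelerate import Accelerator

# Enable TensorBoard tracking so training metrics can be inspected later.
accelerator = Accelerator(log_with="tensorboard", project_dir="./fp16_ft_squad/")
accelerator.init_trackers("bert_squad_fp16", config={"learning_rate": 3e-5, "num_train_epochs": 2})

# Inside the training loop, metrics would be logged per step, e.g.:
# accelerator.log({"train_loss": loss.item()}, step=completed_steps)

accelerator.end_training()  # flush and close the trackers when done
```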
2. Apply Quantization using PTQ
For `INT8` quantization, we can achieve accuracy comparable to `FP16` by using quantization-aware training (QAT) or post-training quantization (PTQ) techniques. In this example we use PTQ.

In a nutshell, PTQ quantizes the weight and activation tensors in a block-sequential manner, optimizing the quantization error at each block (i.e., quantization and optimization happen block by block, one block at a time, starting from the first block and proceeding sequentially).
```python
from fms_mo import qmodel_prep, qconfig_init

# Create a config dict using a default recipe and CLI args.
# If the same item exists in both, args take precedence over the recipe.
qcfg = qconfig_init(recipe="ptq_int8", args=args)
qcfg["tb_writer"] = accelerator.get_tracker("tensorboard", unwrap=True)
qcfg["loader.batchsize"] = args.per_device_train_batch_size

# Prepare a list of "ready-to-run" data for calibration.
exam_inp = [
    {k: v for k, v in next(iter(train_dataloader)).items() if "position" not in k}
    for _ in range(qcfg["qmodel_calibration"])
]

ptq_mod_candidates = list(model.bert.encoder.layer)
qmodel_prep(model, exam_inp, qcfg, optimizer, use_dynamo=True)
calib_PTQ_lm(qcfg, model, train_dataloader, ptq_mod_candidates)

logger.info(f"--- Accuracy of {args.model_name_or_path} before QAT/PTQ")
```
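Optionally, you can confirm which `Linear` layers were swapped to quantized modules by `qmodel_prep`. A quick check, using the same `QLinear` class that the evaluation snippet below imports:

```python
from fms_mo.modules.linear import QLinear

# List the layers that were replaced with quantized QLinear modules.
qlinear_names = [n for n, m in model.named_modules() if isinstance(m, QLinear)]
print(f"{len(qlinear_names)} QLinear modules found, e.g. {qlinear_names[:3]}")
```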
3. Evaluate Inference Accuracy and Speed
Note

This step will compile an external kernel for INT matmul, which currently only works with PyTorch `2.3.1`.
Here is an example code snippet used for evaluation:
```python
from fms_mo.modules.linear import QLinear, QLinearINT8Deploy
# ...
# Only need 1 batch (not a list) this time; it will be used by `torch.compile` as well.
exam_inp = next(iter(train_dataloader))
qcfg = qconfig_init(recipe="qat_int8", args=args)
qcfg["qmodel_calibration"] = 0  # <----------- NOTE 1
qmodel_prep(model, exam_inp, qcfg, optimizer, use_dynamo=True,
            ckpt_reload=args.model_name_or_path)  # <----------- NOTE 2

# ----------- NOTE 3
mod2swap = [n for n, m in model.named_modules() if isinstance(m, QLinear)]
for name in mod2swap:
    parent_name, module_name = _parent_name(name)
    parent_mod = model.get_submodule(parent_name)
    qmod = getattr(parent_mod, module_name)
    setattr(parent_mod, module_name, QLinearINT8Deploy.from_fms_mo(qmod))
# ...
with torch.no_grad():
    model = torch.compile(model)  # , mode='reduce-overhead') # <----- NOTE 4
    model(**exam_inp)
# ...
return  # Stop the run here, no further training loop
```
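To put a number on the inference-speed comparison, a simple wall-clock timing loop around the compiled model works. This is a minimal sketch; the warm-up and iteration counts are arbitrary choices, not taken from the example script:

```python
import time
import torch

def time_inference(model, batch, n_warmup=5, n_iters=20):
    """Rough average latency per batch; assumes model and batch are on a CUDA device."""
    with torch.no_grad():
        for _ in range(n_warmup):   # warm-up also triggers torch.compile compilation
            model(**batch)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(**batch)
        torch.cuda.synchronize()    # wait for all GPU work before stopping the clock
    return (time.perf_counter() - start) / n_iters

# Example: run once on the FP16 baseline and once on the INT8-lowered model with the same batch.
# print(f"avg latency: {time_inference(model, exam_inp) * 1e3:.2f} ms/batch")
```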