In this example, we provide the inference benchmarking script `run_llm.py` for models such as EleutherAI/gpt-j-6B, decapoda-research/llama-7b-hf, EleutherAI/gpt-neox-20b, and databricks/dolly-v2-3b. You can also run LLM inference with the cpp graph (see the corresponding example) to get better performance, but it may have constraints on batched inference.
Note: The default search algorithm is beam search with `num_beams=4`.
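For context, these decoding settings map onto a standard Hugging Face `generate()` call. A minimal sketch, assuming the stock transformers API (the prompt is a placeholder and this is not part of `run_llm.py`):

```python
# Minimal sketch of the default decoding settings (beam search, num_beams=4).
# The prompt below is an illustrative placeholder, not part of run_llm.py.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=32,   # matches --max-new-tokens 32 used below
    num_beams=4,         # the default search algorithm in this example
    do_sample=False,     # beam search is deterministic
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```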
# Create Environment (conda)
conda create -n llm python=3.9 -y
conda install mkl mkl-include -y
conda install gperftools jemalloc==5.2.1 -c conda-forge -y
pip install -r requirements.txt
# If you want to run the gpt-j model, install transformers==4.27.4
pip install transformers==4.27.4
# For other models, install transformers==4.34.1
pip install transformers==4.34.1
Note: We suggest using a transformers version no higher than 4.34.1.
export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
export KMP_AFFINITY=granularity=fine,compact,1,0
# IOMP
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so
# Tcmalloc is a recommended malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support.
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
The fp32 models are from Hugging Face: EleutherAI/gpt-j-6B, decapoda-research/llama-7b-hf, decapoda-research/llama-13b-hf, databricks/dolly-v2-3b, and [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b); the gpt-j int8 model has been published at [Intel/gpt-j-6B-pytorch-int8-static](https://huggingface.co/Intel/gpt-j-6B-pytorch-int8-static).
python optimize_llm.py --model=EleutherAI/gpt-j-6B --dtype=(fp32|bf16) --output_model=<path to engine model>
# int8
wget https://huggingface.co/Intel/gpt-j-6B-pytorch-int8-static/resolve/main/pytorch_model.bin -O <path to int8_model.pt>
python optimize_llm.py --model=EleutherAI/gpt-j-6B --dtype=int8 --output_model=<path to ir> --pt_file=<path to int8_model.pt>
- When the input dtype is fp32 or bf16, the model will be downloaded automatically if it does not already exist locally.
- When the input dtype is int8, the traced int8 model (the `--pt_file` above) must already exist; a loading sketch is shown below.
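For illustration, the file passed via `--pt_file` is expected to be a traced (TorchScript) int8 model. A minimal sketch of sanity-checking it before conversion, assuming the `pytorch_model.bin` downloaded above was saved as `int8_model.pt` (the path is a placeholder):

```python
# Sanity-check the traced int8 model before handing it to optimize_llm.py.
# "int8_model.pt" is a placeholder path for the file downloaded with wget above.
import torch

int8_model = torch.jit.load("int8_model.pt", map_location="cpu")
int8_model.eval()
print(int8_model.graph)  # inspect the traced graph that will be converted
```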
We support inference with FP32/BF16/INT8 Neural Engine models.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_llm.py --max-new-tokens 32 --input-tokens 32 --batch-size 1 --model <model name> --model_path <path to engine model>
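As a rough guide to what the benchmark measures, the sketch below times an end-to-end `generate()` call and derives a per-token latency. It uses the stock transformers model rather than the Neural Engine graph, purely to illustrate the metric (the model name is a placeholder):

```python
# Sketch of a latency measurement: time an end-to-end generate() call and
# derive per-token latency. Uses the plain transformers model for clarity,
# not the Neural Engine graph that run_llm.py benchmarks.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "databricks/dolly-v2-3b"   # placeholder; any supported model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Once upon a time", return_tensors="pt")
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=32, num_beams=4)
elapsed = time.perf_counter() - start
new_tokens = out.shape[1] - inputs.input_ids.shape[1]
print(f"total {elapsed:.2f}s, {elapsed / new_tokens * 1000:.1f} ms/token")
```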
Neural Engine also supports weight compression to fp8_4e3m, fp8_5e2m, and int8, but only when running the bf16 graph. If you want to try it, add the `--weight_type` argument, for example:
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_llm.py --max-new-tokens 32 --input-tokens 32 --batch-size 1 --model_path <path to bf16 engine model> --model <model name> --weight_type=fp8_5e2m
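To get a feel for what the fp8 storage formats imply for accuracy, the sketch below round-trips a random bf16 weight tensor through `torch.float8_e5m2` (the layout corresponding to fp8_5e2m: 1 sign, 5 exponent, 2 mantissa bits). This assumes PyTorch >= 2.1 and only illustrates the storage precision; it is not the Neural Engine's own compression path:

```python
# Illustrative only: round-trip a bf16 weight tensor through fp8 (e5m2) storage
# to see the quantization error implied by --weight_type=fp8_5e2m.
# Requires PyTorch >= 2.1 for torch.float8_e5m2.
import torch

w = torch.randn(4096, 4096, dtype=torch.bfloat16)   # a bf16 weight tensor
w_fp8 = w.float().to(torch.float8_e5m2)              # 1 byte per weight instead of 2
w_back = w_fp8.float().to(torch.bfloat16)            # decompressed for bf16 compute
rel_err = ((w_back.float() - w.float()).abs() / w.float().abs().clamp_min(1e-6)).mean()
print(f"mean relative error after an fp8_5e2m round-trip: {rel_err.item():.4f}")
```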