From 24205ba234270e9c9cf3e3a526f3295fd0188dfd Mon Sep 17 00:00:00 2001
From: Nir David
Date: Thu, 19 Dec 2024 15:29:47 +0200
Subject: [PATCH] add more documentation changes

---
 docs/source/quantization/inc.rst | 63 +++++++++++++++-----------------
 1 file changed, 29 insertions(+), 34 deletions(-)

diff --git a/docs/source/quantization/inc.rst b/docs/source/quantization/inc.rst
index 4996fe0b6ad47..72eeb84e1b352 100644
--- a/docs/source/quantization/inc.rst
+++ b/docs/source/quantization/inc.rst
@@ -6,44 +6,40 @@ FP8 INC
 vLLM supports FP8 (8-bit floating point) weight and activation quantization using INC (Intel Neural Compressor) on hardware acceleration of Intel Gaudi (HPU).
 Currently, quantization is supported only for Llama models.
 
-Please visit the Intel Gaudi documentation of `Run Inference Using FP8 `_.
+Intel Gaudi supports quantization of various modules and functions, including, but not limited to ``Linear``, ``KVCache``, ``Matmul`` and ``Softmax``. For more information, please refer to:
+`Supported Modules\\Supported Functions\\Custom Patched Modules `_.
 
-In order to run inference it is required to have measurements/scales files:
+.. note::
+    Measurement files are required to run quantized models with vLLM on Gaudi accelerators. The FP8 model calibration procedure is described in the `vllm-hpu-extension `_ package.
 
-Obtain Measurements
--------------------
+.. note::
+    ``QUANT_CONFIG`` is an environment variable that points to the measurement or quantization `JSON config file `_.
+    The measurement configuration file is used during the calibration procedure to collect measurements for a given model. The quantization configuration is used during inference.
 
-To obtain measurement files:
-* Set the "QUANT_CONFIG" environment variable which points to the `JSON config file `_ with MEASURE mode.
-* Pass ``quantization=inc`` as parameter to the ``LLM`` object.
-* Call ``shutdown_inc`` and ``shutdown`` methods of the ``model_executor`` at the end of the run.
+Run Online Inference Using FP8
+-------------------------------
 
-.. code-block:: python
+Once you've completed the model calibration process and collected the measurements, you can run FP8 inference with vLLM using the following command:
 
-    from vllm import LLM
-    llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct", quantization="inc")
-    ...
-    # Call llm.generate on the required prompts and sampling params.
-    ...
-    llm.llm_engine.model_executor.shutdown_inc()
-    llm.llm_engine.model_executor.shutdown()
+.. code-block:: bash
 
-Run Inference Using FP8
------------------------
+    export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_measure_g3.json
+    vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --weights-load-device cpu --tensor-parallel-size 8
 
-Intel Gaudi supports quantization of various modules and functions, including, but not limited to ``Linear``, ``KVCache``, ``Matmul`` and ``Softmax``. For more information, please refer to:
-`Supported Modules `_.
-`Supported Functions `_.
-
-In order to run inference it requires to have Scales which located in scale files according to the `JSON config file `_ ``dump_stats_path``.
-If none exist, they can be generated during inference run using the measurement files (should be located in the same folder).
-
-To run inference (and obtain scale files):
-* Set the "QUANT_CONFIG" environment variable which points to the `JSON config file `_ with QUANTIZE mode.
-* Pass ``quantization=inc`` as parameter to the ``LLM`` object.
-* Pass ``fp8_inc`` as KV cache data type:
- * Offline inference: pass ``kv_cache_dtype=fp8_inc`` as parameter to the ``LLM`` object.
- * Online inference: pass ``--kv-cache-dtype=fp8_inc`` as command line parameter.
+.. tip::
+    If you are just prototyping or testing your model with FP8, you can use the ``VLLM_SKIP_WARMUP=true`` environment variable to disable the warmup stage, which can take a long time. However, we do not recommend disabling this feature in production environments, as it causes a dramatic performance drop.
+
+.. tip::
+    When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this problem, you can use these two environment variables:
+    ``VLLM_ENGINE_ITERATION_TIMEOUT_S`` - to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.
+    ``VLLM_RPC_TIMEOUT`` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in milliseconds, e.g., 600000 equals 10 minutes.
+
+Run Offline Inference Using FP8
+-------------------------------
+
+To run offline inference (after completing the model calibration process):
+* Set the "QUANT_CONFIG" environment variable to point to a JSON configuration file with QUANTIZE mode.
+* Pass ``quantization=inc`` and ``kv_cache_dtype=fp8_inc`` as parameters to the ``LLM`` object.
 * Call shutdown method of the model_executor at the end of the run.
 
 .. code-block:: python
@@ -58,12 +54,11 @@ To run inference (and obtain scale files):
 Specifying Device for the Model's Weights Uploading
 ---------------------------------------------------
 
-It is possible to load the unquantized weights on a different device before quantizing them,
-and moving to the device on which the model will run. This reduces the device memory footprint of model weights, as only quantized weights are stored in device memory.
+It is possible to load the unquantized weights on a different device before quantizing them, then moving them to the device on which the model will run.
+This reduces the device memory footprint of model weights, as only quantized weights are stored in device memory.
 To set the load device, use the ``weights_load_device`` parameter for the ``LLM`` object, or ``--weights-load-device`` command line parameter in online mode.
 
 .. code-block:: python
 
     from vllm import LLM
     llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct", quantization="inc", kv_cache_dtype="fp8_inc", weights_load_device="cpu")
-
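For reference, the offline flow described by the new ``Run Offline Inference Using FP8`` section looks roughly like the sketch below. The prompt, sampling settings, and the ``QUANT_CONFIG`` path are illustrative placeholders; the model path follows the 8B example used elsewhere in this document, and exporting ``QUANT_CONFIG`` in the shell before launching works just as well as setting it from Python.

.. code-block:: python

    import os

    from vllm import LLM, SamplingParams

    # Point INC at a QUANTIZE-mode JSON config (placeholder path).
    os.environ["QUANT_CONFIG"] = "/path/to/quant/config/inc/maxabs_quant.json"

    # Load the model with INC quantization and the FP8 KV cache dtype.
    llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct",
              quantization="inc",
              kv_cache_dtype="fp8_inc")

    # Generate with whatever prompts and sampling params you need.
    outputs = llm.generate("Hello, my name is",
                           SamplingParams(temperature=0.0, max_tokens=32))
    print(outputs[0].outputs[0].text)

    # Call the shutdown method of the model executor at the end of the run.
    llm.llm_engine.model_executor.shutdown()

In online mode, the same settings map to the ``--quantization inc`` and ``--kv-cache-dtype fp8_inc`` flags shown in the ``vllm serve`` command above.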