From 24205ba234270e9c9cf3e3a526f3295fd0188dfd Mon Sep 17 00:00:00 2001
From: Nir David
Date: Thu, 19 Dec 2024 15:29:47 +0200
Subject: [PATCH] add more documentation changes

---
 docs/source/quantization/inc.rst | 63 +++++++++++++++-----------------
 1 file changed, 29 insertions(+), 34 deletions(-)

diff --git a/docs/source/quantization/inc.rst b/docs/source/quantization/inc.rst
index 4996fe0b6ad47..72eeb84e1b352 100644
--- a/docs/source/quantization/inc.rst
+++ b/docs/source/quantization/inc.rst
@@ -6,44 +6,40 @@ FP8 INC
 vLLM supports FP8 (8-bit floating point) weight and activation quantization using INC (Intel Neural Compressor) on hardware acceleration of Intel Gaudi (HPU).
 Currently, quantization is supported only for Llama models.
 
-Please visit the Intel Gaudi documentation of `Run Inference Using FP8 `_.
+Intel Gaudi supports quantization of various modules and functions, including, but not limited to ``Linear``, ``KVCache``, ``Matmul`` and ``Softmax``. For more information, please refer to:
+`Supported Modules\\Supported Functions\\Custom Patched Modules `_.
 
-In order to run inference it is required to have measurements/scales files:
+.. note::
+    Measurement files are required to run quantized models with vLLM on Gaudi accelerators. The FP8 model calibration procedure is described in the `vllm-hpu-extension `_ package.
 
-Obtain Measurements
--------------------
+.. note::
+    ``QUANT_CONFIG`` is an environment variable that points to the measurement or quantization `JSON config file `_.
+    The measurement configuration file is used during the calibration procedure to collect measurements for a given model. The quantization configuration is used during inference.
 
-To obtain measurement files:
-* Set the "QUANT_CONFIG" environment variable which points to the `JSON config file `_ with MEASURE mode.
-* Pass ``quantization=inc`` as parameter to the ``LLM`` object.
-* Call ``shutdown_inc`` and ``shutdown`` methods of the ``model_executor`` at the end of the run.
+Run Online Inference Using FP8
+-------------------------------
 
-.. code-block:: python
+Once you've completed the model calibration process and collected the measurements, you can run FP8 inference with vLLM using the following command:
 
-    from vllm import LLM
-    llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct", quantization="inc")
-    ...
-    # Call llm.generate on the required prompts and sampling params.
-    ...
-    llm.llm_engine.model_executor.shutdown_inc()
-    llm.llm_engine.model_executor.shutdown()
+.. code-block:: bash
 
-Run Inference Using FP8
------------------------
+    export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_measure_g3.json
+    vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --weights-load-device cpu --tensor-parallel-size 8
 
-Intel Gaudi supports quantization of various modules and functions, including, but not limited to ``Linear``, ``KVCache``, ``Matmul`` and ``Softmax``. For more information, please refer to:
-`Supported Modules `_.
-`Supported Functions `_.
-
-In order to run inference it requires to have Scales which located in scale files according to the `JSON config file `_ ``dump_stats_path``.
-If none exist, they can be generated during inference run using the measurement files (should be located in the same folder).
-
-To run inference (and obtain scale files):
-* Set the "QUANT_CONFIG" environment variable which points to the `JSON config file `_ with QUANTIZE mode.
-* Pass ``quantization=inc`` as parameter to the ``LLM`` object.
-* Pass ``fp8_inc`` as KV cache data type:
- * Offline inference: pass ``kv_cache_dtype=fp8_inc`` as parameter to the ``LLM`` object.
- * Online inference: pass ``--kv-cache-dtype=fp8_inc`` as command line parameter.
+.. tip::
+    If you are just prototyping or testing your model with FP8, you can use the ``VLLM_SKIP_WARMUP=true`` environment variable to disable the warmup stage, which can take a long time. However, we do not recommend disabling this feature in production environments, as it causes a dramatic performance drop.
+
+.. tip::
+    When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this problem, you can use these two environment variables:
+    ``VLLM_ENGINE_ITERATION_TIMEOUT_S`` - to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.
+    ``VLLM_RPC_TIMEOUT`` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in milliseconds, e.g., 600000 equals 10 minutes.
+
+Run Offline Inference Using FP8
+-------------------------------
+
+To run offline inference (after completing the model calibration process):
+* Set the "QUANT_CONFIG" environment variable to point to a JSON configuration file with QUANTIZE mode.
+* Pass ``quantization=inc`` and ``kv_cache_dtype=fp8_inc`` as parameters to the ``LLM`` object.
 * Call shutdown method of the model_executor at the end of the run.
 
 .. code-block:: python
@@ -58,12 +54,11 @@ To run inference (and obtain scale files):
 Specifying Device for the Model's Weights Uploading
 ---------------------------------------------------
 
-It is possible to load the unquantized weights on a different device before quantizing them,
-and moving to the device on which the model will run. This reduces the device memory footprint of model weights, as only quantized weights are stored in device memory.
+It is possible to load the unquantized weights on a different device before quantizing them, then moving them to the device on which the model will run.
+This reduces the device memory footprint of model weights, as only quantized weights are stored in device memory.
 To set the load device, use the ``weights_load_device`` parameter for the ``LLM`` object, or ``--weights-load-device`` command line parameter in online mode.
 
 .. code-block:: python
 
     from vllm import LLM
     llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct", quantization="inc", kv_cache_dtype="fp8_inc", weights_load_device="cpu")
-
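For reference, the offline flow described by the new ``Run Offline Inference Using FP8`` section looks roughly like the sketch below. The prompt, sampling settings, and the ``QUANT_CONFIG`` path are illustrative placeholders; the model path follows the 8B example used elsewhere in this document, and exporting ``QUANT_CONFIG`` in the shell before launching works just as well as setting it from Python.

.. code-block:: python

    import os

    from vllm import LLM, SamplingParams

    # Point INC at a QUANTIZE-mode JSON config (placeholder path).
    os.environ["QUANT_CONFIG"] = "/path/to/quant/config/inc/maxabs_quant.json"

    # Load the model with INC quantization and the FP8 KV cache dtype.
    llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct",
              quantization="inc",
              kv_cache_dtype="fp8_inc")

    # Generate with whatever prompts and sampling params you need.
    outputs = llm.generate("Hello, my name is",
                           SamplingParams(temperature=0.0, max_tokens=32))
    print(outputs[0].outputs[0].text)

    # Call the shutdown method of the model executor at the end of the run.
    llm.llm_engine.model_executor.shutdown()

In online mode, the same settings map to the ``--quantization inc`` and ``--kv-cache-dtype fp8_inc`` flags shown in the ``vllm serve`` command above.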