Commit

some more CR fixes
nirda7 committed Dec 23, 2024
1 parent eb72885 commit 5dde45e
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions docs/source/quantization/inc.rst
@@ -27,10 +27,10 @@ Once you've completed the model calibration process and collected the measuremen
vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --weights-load-device cpu --tensor-parallel-size 8
.. tip::
- If you are just prototyping or testing your model with FP8, you can use the ``VLLM_SKIP_WARMUP=true`` environment variable to disable the warmup stage, which can take a long time. However, we do not recommend disabling this feature in production environments, as it causes a dramatic performance drop.
+ If you are just prototyping or testing your model with FP8, you can use the ``VLLM_SKIP_WARMUP=true`` environment variable to disable the warmup stage, which can take a long time. However, we do not recommend disabling this feature in production environments as it causes a significant performance drop.
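For instance, a warmup-skipping launch for quick experiments might look like the sketch below (it reuses the model and flags from the serve example above; not for production use):

.. code-block:: bash

   # Prototyping only: skipping warmup trades startup time for runtime performance
   VLLM_SKIP_WARMUP=true vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc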

.. tip::
- When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this problem, you can use these two environment variables:
+ When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this problem, you can use the following environment variables:
``VLLM_ENGINE_ITERATION_TIMEOUT_S`` - to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.
``VLLM_RPC_TIMEOUT`` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in milliseconds, e.g., 600000 equals 10 minutes.
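As an illustration, both timeouts could be raised to 10 minutes before launching the server (the values and the model name here are examples only, mirroring the serve command above):

.. code-block:: bash

   export VLLM_ENGINE_ITERATION_TIMEOUT_S=600   # seconds: 10 minutes
   export VLLM_RPC_TIMEOUT=600000               # milliseconds: 10 minutes
   vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc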

@@ -56,7 +56,7 @@ Specifying Device for the Model's Weights Uploading

It is possible to load the unquantized weights on a different device before quantizing them, and then move them to the device on which the model will run.
This reduces the device memory footprint of model weights, as only quantized weights are stored in device memory.
- To set the load device, use the ``weights_load_device`` parameter for the ``LLM`` object, or ``--weights-load-device`` command line parameter in online mode.
+ To set the device used for uploading weights, use the ``weights_load_device`` parameter of the ``LLM`` object, or the ``--weights-load-device`` command-line parameter when running online inference:

.. code-block:: python
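
   # A minimal illustrative sketch, not the file's actual snippet (collapsed in this hunk).
   # Assumptions: the ``LLM`` constructor accepts the parameters named in the text above;
   # the model and values mirror the serve command earlier in this section.
   from vllm import LLM

   # Load unquantized weights on the CPU, quantize them, then run on the accelerator
   llm = LLM(
       model="meta-llama/Llama-3.1-405B-Instruct",
       quantization="inc",
       kv_cache_dtype="fp8_inc",
       weights_load_device="cpu",
       tensor_parallel_size=8,
   )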
