From 41953d5a308b0dbb6f0f799a5e2593037b383a1d Mon Sep 17 00:00:00 2001
From: Nir David
Date: Mon, 23 Dec 2024 11:50:39 +0200
Subject: [PATCH] some more CR fixes

---
 docs/source/quantization/inc.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/quantization/inc.rst b/docs/source/quantization/inc.rst
index b1b68321ea656..4d9020f3186c1 100644
--- a/docs/source/quantization/inc.rst
+++ b/docs/source/quantization/inc.rst
@@ -27,10 +27,10 @@ Once you've completed the model calibration process and collected the measuremen
     vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --weights-load-device cpu --tensor-parallel-size 8

 .. tip::
-   If you are just prototyping or testing your model with FP8, you can use the ``VLLM_SKIP_WARMUP=true`` environment variable to disable the warmup stage, which can take a long time. However, we do not recommend disabling this feature in production environments, as it causes a dramatic performance drop.
+   If you are just prototyping or testing your model with FP8, you can use the ``VLLM_SKIP_WARMUP=true`` environment variable to disable the warmup stage, which can take a long time. However, we do not recommend disabling this feature in production environments as it causes a significant performance drop.

 .. tip::
-   When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this problem, you can use these two environment variables:
+   When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this problem, you can use the following environment variables:
    ``VLLM_ENGINE_ITERATION_TIMEOUT_S`` - to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.
    ``VLLM_RPC_TIMEOUT`` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in milliseconds, e.g., 600000 equals 10 minutes.

@@ -56,7 +56,7 @@ Specifying Device for the Model's Weights Uploading

 It is possible to load the unquantized weights on a different device before quantizing them, then moving them to the device on which the model will run.
 This reduces the device memory footprint of model weights, as only quantized weights are stored in device memory.

-To set the load device, use the ``weights_load_device`` parameter for the ``LLM`` object, or ``--weights-load-device`` command line parameter in online mode.
+To set the device used to upload the weights, use the ``weights_load_device`` parameter for the ``LLM`` object, or the ``--weights-load-device`` command-line parameter when running online inference:

 .. code-block:: python