Tests in this CI directory can be run manually to provide extensive testing.

Before the Triton 23.10 release, you can launch the Triton 23.09 container
`nvcr.io/nvidia/tritonserver:23.09-py3` and add the directory
`/opt/tritonserver/backends/tensorrtllm` within the container by following the
instructions in Option 3 Build via CMake.

Run the tests inside the Triton container:
```bash
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v /path/to/tensorrtllm_backend:/opt/tritonserver/tensorrtllm_backend \
    nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash

# Change directory to the test and run the test.sh script
cd /opt/tritonserver/tensorrtllm_backend/ci/<test directory>
bash -x ./test.sh
```
These two tests are run in the L0_backend_trtllm test. Below are the instructions to run the tests manually.

- Follow the instructions in the Create the model repository section to prepare the model repository.
- Follow the instructions in the Modify the model configuration section to modify the model configuration based on your needs.
The end-to-end test script sends requests to the deployed `ensemble` model. The ensemble is composed of three models: `preprocessing`, `tensorrt_llm`, and `postprocessing`:

- "preprocessing": This model is used for tokenizing, meaning the conversion from prompts (string) to input_ids (list of ints).
- "tensorrt_llm": This model is a wrapper around your TensorRT-LLM model and is used for inference.
- "postprocessing": This model is used for de-tokenizing, meaning the conversion from output_ids (list of ints) to outputs (string).

The end-to-end latency includes the total latency of the three parts of the ensemble model.
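To illustrate the roles of the preprocessing and postprocessing steps, here is a minimal sketch using a toy whitespace tokenizer. The real models typically wrap a proper tokenizer (e.g. one from HuggingFace); the vocabulary and function names below are purely illustrative and not part of the actual backend.

```python
# Toy illustration of the preprocessing/postprocessing roles in the ensemble.
# The vocabulary and functions here are invented for illustration only; the
# real models use a full tokenizer such as one provided by HuggingFace.

VOCAB = {"hello": 0, "world": 1, "triton": 2}
INV_VOCAB = {i: w for w, i in VOCAB.items()}

def preprocess(prompt: str) -> list[int]:
    """Tokenize: convert a prompt (string) to input_ids (list of ints)."""
    return [VOCAB[word] for word in prompt.split()]

def postprocess(output_ids: list[int]) -> str:
    """De-tokenize: convert output_ids (list of ints) back to a string."""
    return " ".join(INV_VOCAB[i] for i in output_ids)

input_ids = preprocess("hello triton")
print(input_ids)               # [0, 2]
print(postprocess(input_ids))  # hello triton
```

In the ensemble, the `tensorrt_llm` model sits between these two steps, consuming `input_ids` and producing `output_ids`.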
```bash
cd tools/inflight_batcher_llm
python3 end_to_end_test.py --dataset <dataset path>
```
Expected outputs:

```
[INFO] Functionality test succeed.
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 125 prompts.
[INFO] Total Latency: 11099.243 ms
```
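As a quick sanity check on such benchmark output, the average per-prompt latency can be derived from the reported total latency and the prompt count. The numbers below are taken from the example output above and are illustrative only; your figures will differ by GPU and model.

```python
# Average latency per prompt, derived from the example benchmark output above.
# These figures are illustrative; real numbers depend on your GPU and model.
total_latency_ms = 11099.243
num_prompts = 125

avg_latency_ms = total_latency_ms / num_prompts
print(f"{avg_latency_ms:.2f} ms per prompt")  # 88.79 ms per prompt
```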
The `benchmark_core_model` script sends requests directly to the deployed `tensorrt_llm` model. The `benchmark_core_model` latency therefore reflects the inference latency of TensorRT-LLM alone, excluding the pre/post-processing latency, which is usually handled by a third-party library such as HuggingFace.
```bash
cd tools/inflight_batcher_llm
python3 benchmark_core_model.py dataset --dataset <dataset path>
```
Expected outputs:

```
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 125 prompts.
[INFO] Total Latency: 10213.462 ms
```
Please note that the expected outputs in this document are for reference only; the specific performance numbers depend on the GPU you are using.
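Comparing the two example totals gives a rough sense of the pre/post-processing overhead added by the ensemble. This is a back-of-the-envelope sketch using the illustrative numbers above, and it assumes both runs used the same 125 prompts on the same hardware.

```python
# Rough pre/post-processing overhead, derived from the example outputs above.
# Illustrative only: real numbers depend on your GPU, model, and dataset, and
# the two runs must use the same prompts for the comparison to be meaningful.
end_to_end_ms = 11099.243  # ensemble: preprocessing + tensorrt_llm + postprocessing
core_ms = 10213.462        # tensorrt_llm only

overhead_ms = end_to_end_ms - core_ms
print(f"pre/post-processing overhead: {overhead_ms:.3f} ms")  # 885.781 ms
```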