Note: currently, this doc is for onboarding the new developers. Will have a separate README in the future for general audiences.
It's recommended to create a new folder before cloning the repository. The final file structure will look like as follows:
<parent-folder>/
|--- lmcache-test/
|--- LMCache/
|--- lmcache-vllm/
# Create conda environment
conda create -n lmcache python=3.10
conda activate lmcache
# Clone github repository
git clone [email protected]:LMCache/lmcache-tests.git
cd lmcache-tests
# Run the installation script
bash prepare_environment.sh
The following command line runs the test test_lmcache_local_cpu
defined in tests/tests.py
and write the output results to the output folder (outputs/test_lmcache_local_cpu.csv
).
python3 main.py tests/tests.py -f test_lmcache_local_cpu -o outputs/
To process the result, please run
cd outputs/
python3 process_result.py
Then, a pdf file test_lmcache_local_cpu.pdf
will be created.
You can also monitor the following files to check the status of the bootstrapped vllm process.
For stderr:
tail -f /tmp/{user_name}-8000-stderr.log
For stdout:
tail -f /tmp/{user_name}-8000-stdout.log
main.py
is the entrypoint to execute the test functions:
usage: main.py [-h] [-f FILTER] [-l] [-o OUTPUT_DIR] [-m MODEL] [-p PORT1 PORT2] filepath
Execute all functions in a given Python file.
positional arguments:
filepath The Python file to execute functions from (include subfolders if any).
options:
-h, --help show this help message and exit
-f FILTER, --filter FILTER Pattern to filter which functions to execute.
-l, --list List all functions in the module without executing them.
-o OUTPUT_DIR, --output-dir OUTPUT_DIR The directory to put the output file.
-m MODEL, --model MODEL The models of vllm for every functions.
-p PORT1 PORT2, --ports PORT1 PORT2 Two ports required for the experiment.
Here are some examples:
# Run all the test functions defined in 'tests/tests.py' and save the output to 'outputs/'
python3 main.py tests/tests.py -o outputs/
# List the tests in 'tests/tests.py'
python3 main.py tests/tests.py -l
# Run some specific tests that match the given pattern (e.g., containing 'cachegen')
python3 main.py tests/tests.py -f cachegen
# Run all the test functions defined in 'tests/tests.py' with llama
python3 main.py tests/tests.py -m "meta-llama/Llama-3.1-8B-Instruct"
In general, each test function should output the results as a csv file, where the file name is the same as function name but with a csv
suffix.
There should be multiple columns in the CSV:
expr_id
: the id of the experimenttimestamp
: the timestamp of the request in the workloadengine_id
: the id of the serving engine instance to which the request is sentrequest_id
: the id of the request in the workloadTTFT
: the time-to-first-token of this requestthroughput
: the number of output tokens per second of this requestlatency
: the time the whole response completescontext_len
: the context length of this requestquery_len
: the query length of this requestgpu_memory
: the gpu memory used before this experiment is finished
Some example codes of how to parse the output CSV can be found in outputs/process_result.py
.
There are various test functions available that can be specified using -f / --filter. These functions serve the following purposes:
- Local cpu storage backend
test_lmcache_local_cpu
: compares scenarios with and without lmcache.test_multi_turn
: tests the performance of saving decode KV Cache with a multi-turn conversation.test_vary_length_workload
: tests the performance of partial prefill by changing the workload length of different requests.test_chunk_prefill
: tests the performance of chunked prefill.test_cache_compatibility
: tests the compatibility of prefix caching between lmcache and vllm.test_lmcache_local_distributed
: tests the performance of tensor parallelism.test_local_cpu_experimental
: tests the codes inLMCache/lmcache/experimental/
.
- Remote cpu storage backend
test_lmcache_remote_cachegen
: compares scenarios whether retrieval is pipelined or not with cachegen for transmission.test_lmcache_cachegen_distributed
: tests the performance of tensor parallelism with cachegen for transmission.test_lmcache_remote_safetensor
: compares scenarios whether retrieval is pipelined or not with safetensor for transmission.test_lmcache_safetensor_distributed
: tests the performance of tensor parallelism with safetensor for transmission.
- Other storage backend
test_lmcache_local_gpu
: local gpu storage backend.test_lmcache_local_disk
: local disk storage backend.test_local_disk_experimental
: tests the codes inLMCache/lmcache/experimental/
with local disk storage backend.test_lmcache_remote_disk
: remote disk storage backend.test_lmcache_redis_sentinel
: Redis instance storage backend.
- Offline tests
offline_test
: tests partial prefll and full prefill in a single batch.
Request
: A request is a single prompt containing some context and a user query.Engine
: Means "serving engine". An engine is an LLM serving engine process that can process requests through the OpenAI API interface.Workload
: A workload is a list of requests at different timestamps. Usually, a workload is associated with only one engine.Experiment
: An experiment runs multiple engines simultaneously and sends the workloads generated from the same configuration to the serving engines.Test case
: A test case contains a list of experiments. The goal is to compare the performance of different engines under different workloads.Test function
: A test function wraps the test cases and saves the output CSV. In most cases, a single test function only contains one test case. Different test functions aim to test different functionalities of the system.
Test case configuration:
Test case configuration controls the experiments to run. The configuration-related code can be found in config.py.
Currently, we support the following configurations:
- Workload configuration: Specifies the QPS, context length, query length, and total duration of the workload.
- vLLM configuration: Controls the command line arguments used to start vLLM.
- LMCache configuration: Points to a LMCache configuration file.
During the experiment, workload configuration will be used to generate the workloads, and vLLM configuration + LMCache configuration will be used to start the engine.
Workload generator:
The workload generator takes in a workload configuration and generates the workload (i.e., a list of requests at different timestamps) as the output. The code for the workload generator can be found in workload.py.
By design, there could be multiple different kinds of workload generators for different use cases, such as chatbot, QA, or RAG.
The class Usecase
is used to specify which workload generator to create during runtime.
Currently, we only support a DUMMY
use case where the requests in the generated workload only contain dummy texts and questions.
The workload generator, once initialized with a configuration, only provides a single method: generate(self) -> Workload
.
Engine bootstrapper:
The engine bootstrapper pulls up the serving engine based on the configurations (vLLM configuration + LMCache configuration). Currently, we only support starting vLLM (with or without LMCache) from the terminal. We will support docker-based engines in the future. The code can be found in bootstrapper.py
The engine bootstrapper supports the following methods:
start(self)
: Non-blocking function to start the serving engine.wait_until_ready(self, timeout=60) -> bool
: Block until the serving engine is ready or it's dead or timeout is reached. Returns true if it's ready, otherwise false.is_healthy(self) -> bool
: Non-block function to check if the engine is alive or not.close(self)
: blocking function to shutdown the serving engine
Experiment runner:
The experiment runner takes in one workload config and
- Create the
$N$ workload generators from the workload config. - Generate
$N$ workloads. - Bootstrap all the serving engines. (This requires that the engine config is correct in a way that multiple serving engines can run at the same time).
- Send the requests at the given timestamps to the engines.
- Collect the TTFT, throughput, and the GPU memory usage.
- Return the results in a pandas dataframe.
The code can be found in driver.py
(WIP)