This repository has been archived by the owner on May 28, 2024. It is now read-only.

Update toplevel readme
Signed-off-by: Sihan Wang <[email protected]>
sihanwang41 committed Jan 5, 2024
1 parent b98c024 commit b70d33f
Showing 1 changed file (README.md) with 3 additions and 2 deletions.
@@ -15,10 +15,11 @@ a variety of open source LLMs, built on [Ray Serve](https://docs.ray.io/en/lates
- Fully supporting multi-GPU & multi-node model deployments.
- Offering high performance features like continuous batching, quantization and streaming.
- Providing a REST API that is similar to OpenAI's to make it easy to migrate and cross-test models (see the sketch after this list).
- Supporting multiple LLM backends out of the box, including [vLLM](https://github.com/vllm-project/vllm) and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).
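
As a rough illustration of that OpenAI-style API, the sketch below queries a running RayLLM deployment with the `openai` Python client. The base URL, the placeholder API key, and the model id are assumptions for illustration, not values defined in this README; substitute the ones from your own deployment.

```python
# Hypothetical example: query a RayLLM endpoint through its OpenAI-compatible API.
# Assumes the service is reachable at http://localhost:8000/v1 and serves the
# model id below -- adjust both to match your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",    # assumed local RayLLM endpoint
    api_key="not-needed-for-local-testing", # placeholder; the client requires a non-empty key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # assumed model id from your model config
    messages=[{"role": "user", "content": "What are the best restaurants in San Francisco?"}],
)
print(response.choices[0].message.content)
```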

In addition to LLM serving, RayLLM includes a CLI and a web frontend (Aviary Explorer) that you can use to compare the outputs of different models directly, rank them by quality, get a cost and latency estimate, and more.

RayLLM supports continuous batching and quantization by integrating with [vLLM](https://github.com/vllm-project/vllm). Continuous batching allows you to get much better throughput and latency than static batching. Quantization allows you to deploy compressed models with cheaper hardware requirements and lower inference costs. See the [quantization guide](models/continuous_batching/quantization/README.md) for more details on running quantized models on RayLLM.
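
To make the quantization point concrete, here is a minimal, hypothetical sketch that loads a pre-quantized checkpoint with vLLM's offline `LLM` API directly; the AWQ model id is an assumption, and in RayLLM these options are set through the model configuration files (see the quantization guide linked above) rather than in Python code like this.

```python
# Hypothetical sketch: loading a pre-quantized (AWQ) checkpoint with vLLM,
# the engine RayLLM integrates with. The model id is an assumption.
from vllm import LLM, SamplingParams

# Quantized weights shrink the GPU memory footprint versus the fp16 original.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")

# The engine schedules prompts with continuous batching internally, so multiple
# requests share the GPU instead of waiting for a fixed-size static batch.
outputs = llm.generate(
    ["What is continuous batching?", "Why quantize a model?"],
    SamplingParams(max_tokens=64),
)
for output in outputs:
    print(output.outputs[0].text)
```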

RayLLM leverages [Ray Serve](https://docs.ray.io/en/latest/serve/index.html), which has native support for autoscaling
and multi-node deployments. RayLLM can scale to zero and create
@@ -368,4 +369,4 @@ Feel free to post an issue first to get our feedback on a proposal first, or jus

We use `pre-commit` hooks to ensure that all code is formatted correctly.
Make sure to `pip install pre-commit` and then run `pre-commit install`.
You can also run `./format` to run the hooks manually.
