This repository has been archived by the owner on May 28, 2024. It is now read-only.

Commit 445ad9b: Merge pull request #67 from YQ-Wang/ray-doc-1
Authored by richardliaw on Oct 4, 2023 (2 parents: 5d220c6 + daedb66)
Showing 2 changed files with 32 additions and 49 deletions.
README.md: 61 changes (24 additions, 37 deletions)

Try it now: [🦜🔍 Ray Aviary Explorer 🦜🔍](http://aviary.anyscale.com/)

RayLLM (formerly known as Aviary) is an LLM serving solution that makes it easy to deploy and manage
a variety of open source LLMs, built on [Ray Serve](https://docs.ray.io/en/latest/serve/index.html). It does this by:

- Providing an extensive suite of pre-configured open source LLMs, with defaults that work out of the box.
- Supporting Transformer models hosted on [Hugging Face Hub](http://hf.co) or present on local disk.
- Offering high performance features like continuous batching, quantization and streaming.
- Providing a REST API that is similar to OpenAI's, making it easy to migrate and cross-test them.

In addition to LLM serving, it also includes a CLI and a web frontend (Aviary Explorer) that you can use to compare the outputs of different models directly, rank them by quality, get a cost and latency estimate, and more.

RayLLM supports continuous batching by integrating with [vLLM](https://github.com/vllm-project/vllm). Continuous batching allows you to get much better throughput and latency than static batching.

RayLLM leverages [Ray Serve](https://docs.ray.io/en/latest/serve/index.html), which has native support for autoscaling
and multi-node deployments. RayLLM can scale to zero and create
new model replicas (each composed of multiple GPU workers) in response to demand.
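For illustration, autoscaling behavior is typically controlled through the Ray Serve deployment options in a model's YAML. A minimal sketch (the `deployment_config` wrapper is an assumption about the model file layout; the field names follow Ray Serve's `autoscaling_config`):

```yaml
deployment_config:
  autoscaling_config:
    min_replicas: 0            # allow the model to scale down to zero replicas
    initial_replicas: 1
    max_replicas: 8
    target_num_ongoing_requests_per_replica: 16   # scale up once replicas are this busy
```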


# Getting started

## Deploying RayLLM

The guide below walks you through the steps required for deployment of RayLLM on Ray Serve.

### Locally

We highly recommend using the official `anyscale/aviary` Docker image to run RayLLM. Manually installing RayLLM is currently not a supported use-case due to specific dependencies required, some of which are not available on pip.


```shell
cache_dir=${XDG_CACHE_HOME:-$HOME/.cache}
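# The exact `docker run` invocation depends on your setup; the lines below are a rough,
# illustrative sketch (the mount target, port, and image tag are assumptions):
docker run -it --gpus all --shm-size 1g -p 8000:8000 \
  -v "$cache_dir":/home/ray/.cache anyscale/aviary:latest bash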

```

### On a Ray Cluster

```shell
ray attach deploy/ray/aviary-cluster.yaml
serve run serve_configs/amazon--LightGPT.yaml
```

You can deploy any model in the `models` directory of this repo,
or define your own model YAML file and run that instead.

### On Kubernetes

For Kubernetes deployments, please see our extensive documentation for [deploying Ray Serve on KubeRay](https://docs.ray.io/en/latest/serve/production-guide/kubernetes.html).

## Query your models

Once the models are deployed, you can install a client outside of the Docker container to query the backend.
```shell
pip install "aviary @ git+https://github.com/ray-project/ray-llm.git"
```

You can query your RayLLM deployment in many ways.

In all cases, start by doing:

```shell
export ENDPOINT_URL="http://localhost:8000/v1"
```

This is because your deployment is running locally, but you can also access remote deployments.

### Using curl

You can use curl at the command line to query your deployed LLM:

```shell
% curl $ENDPOINT_URL/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Hello!"}],
"temperature": 0.7
}'
```

```text
{
"id":"meta-llama/Llama-2-7b-chat-hf-308fc81f-746e-4682-af70-05d35b2ee17d",
```
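You can also stream responses from Python with the `requests` library. A minimal sketch (the request body mirrors the curl example above; the exact format of the streamed chunks is an assumption):

```python
import os

import requests

s = requests.Session()
url = f"{os.environ['ENDPOINT_URL']}/chat/completions"
body = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "stream": True,
}

# Stream the response line by line; each non-empty line is a chunk of the reply.
with s.post(url, json=body, stream=True) as response:
    for chunk in response.iter_lines(decode_unicode=True):
        if chunk:
            print(chunk)
```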
### Using the OpenAI SDK

RayLLM uses an OpenAI-compatible API, allowing us to use the OpenAI
SDK to access our deployments. To do so, we need to set the `OPENAI_API_BASE` env var.

```shell
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY="not_a_real_key"  # assumed placeholder; the OpenAI SDK requires some key to be set
```

```python
import openai

# The model and message below are illustrative; substitute any model you have deployed.
chat_completion = openai.ChatCompletion.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)
print(chat_completion)
```


# RayLLM Reference

## Installing RayLLM
## Running Aviary Explorer (Frontend)

To run the Aviary Explorer frontend with Ray Serve:

```shell
serve run aviary.frontend.app:app --non-blocking
```

You will be able to access it at `http://localhost:8000/frontend` in your browser.

To just use the Gradio frontend without Ray Serve, you can start it
with `python aviary/frontend/app.py`. In that case, the Gradio interface should be accessible at `http://localhost:7860` in your browser.
If running the frontend yourself is not an option, you can still use
[our hosted version](http://aviary.anyscale.com/) for your experiments.

Note that the frontend will not dynamically update the list of models should they change in the backend. In order for the frontend to update, you will need to restart it.

RayLLM uses the Ray Serve CLI, which allows you to interact with deployed models.


```shell
# Start a new model in Ray Serve from provided configuration
serve run serve_configs/<model_config_path>
# Get the current config of the running Serve applications
serve config

# Shut down Ray Serve and all deployed models
serve shutdown
```
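The Ray Serve CLI also provides a `serve status` command, which is handy for checking on the applications started above:

```shell
# Show the status of the running Serve applications and their deployments
serve status
```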


## RayLLM Model Registry

You can easily add new models by adding two configuration files.
To learn more about how to customize or add new models,
see the [Model Registry](models/README.md).

# Frequently Asked Questions
## How do I deploy multiple models at once?

Run multiple models at once by aggregating the Serve configs for different models:

```yaml
applications:
- name: router
  import_path: aviary.backend:router_application
  route_prefix: /
  args:
    models:
      - ./models/continuous_batching/amazon--LightGPT.yaml
      - ./models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml
```
The config includes both models in the `models` argument for the `router`. Save this unified config file to the `serve_configs/` folder.
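You can then start both models with a single `serve run` command (the file name below is illustrative):

```shell
serve run serve_configs/multi_model.yaml
```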
## My deployment isn't starting/working correctly, how can I debug?

There can be several reasons for the deployment not starting or not working correctly. Here are some things to check:

1. You might have specified an invalid model id.
2. Your model may require resources that are not available on the cluster. A common issue is that the model requires Ray custom resources (e.g. `accelerator_type_a10`) in order to be scheduled on the right node type, while your cluster is missing those custom resources. You can either modify the model configuration to remove those custom resources or, better yet, add them to the node configuration of your Ray cluster. You can debug this issue by looking at Ray Autoscaler logs ([monitor.log](https://docs.ray.io/en/latest/ray-observability/user-guides/configure-logging.html#system-component-logs)).
3. Your model is a gated Hugging Face model (e.g. meta-llama). In that case, you need to set the `HUGGING_FACE_HUB_TOKEN` environment variable cluster-wide. You can do that either in the Ray cluster configuration or by setting it before running `serve run` (see the sketch after this list).
4. Your model may be running out of memory. You can usually spot this issue by looking for keywords related to "CUDA", "memory" and "NCCL" in the replica logs or `serve run` output. In that case, consider reducing the `max_batch_prefill_tokens` and `max_batch_total_tokens` (if applicable). See models/README.md for more information on those parameters.
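For example, for a gated model you might export the token right before starting the deployment (the token value and config file name below are placeholders):

```shell
export HUGGING_FACE_HUB_TOKEN="hf_..."  # placeholder; use your own token
serve run serve_configs/meta-llama--Llama-2-7b-chat-hf.yaml
```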

In general, [Ray Dashboard](https://docs.ray.io/en/latest/serve/monitoring.html#ray-dashboard) is a useful debugging tool, letting you monitor your Ray Serve / LLM application and access Ray logs.

A good sanity check is deploying the test model in tests/models/. If that works, you know you can deploy _a_ model.

### How do I write a program that accesses both OpenAI and your hosted model at the same time?

The OpenAI `create()` commands allow you to specify the `API_KEY` and `API_BASE`. So you can do something like this.

```python
# Call your self-hosted model running on the local host:
openai.ChatCompletion.create(api_base="http://localhost:8000/v1", api_key="not_a_real_key", ...)  # placeholder key; arguments shown are illustrative

# Call OpenAI. Set OPENAI_API_KEY to your OpenAI key:
openai.ChatCompletion.create(api_key="OPENAI_API_KEY", ...)
```

## Getting Help and Filing Bugs / Feature Requests

We are eager to help you get started with RayLLM. You can get help on:

- Via Slack -- fill in [this form](https://docs.google.com/forms/d/e/1FAIpQLSfAcoiLCHOguOm8e7Jnn-JJdZaCxPGjgVCvFijHB5PLaQLeig/viewform) to sign up.
- Via [Discuss](https://discuss.ray.io/c/llms-generative-ai/27).

For bugs or for feature requests, please submit them [here](https://github.com/ray-project/ray-llm/issues/new).

# Contributing

Feel free to post an issue first to get our feedback on a proposal, or just open a PR!

We use `pre-commit` hooks to ensure that all code is formatted correctly.
Make sure to `pip install pre-commit` and then run `pre-commit install`.
You can also run `./format` to run the hooks manually.
docs/kuberay/deploy-on-eks.md: 20 changes (8 additions, 12 deletions)

# Deploy RayLLM on Amazon EKS using KubeRay

* Note that this document will be extended to include Ray autoscaling and the deployment of multiple models in the near future.

# Part 1: Set up a Kubernetes cluster on Amazon EKS

## Step 1: Create a Kubernetes cluster on Amazon EKS

Follow the first two steps in this [AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#)
```shell
kubectl apply -f ray-cluster.aviary-eks.yaml
```
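You can then verify that the head and worker Pods come up (standard kubectl; the exact Pod names will vary):

```shell
# List the RayCluster Pods; all of them should eventually reach the Running state
kubectl get pods
```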

A few things are worth noting:

* The `tolerations` for workers must match the taints on the GPU node group.

```yaml
# Please add the following taints to the GPU node.
tolerations:
  - key: "ray.io/node-type"   # assumed key; it must match the taint applied to your GPU node group
    operator: "Equal"
    value: "worker"
    effect: "NoSchedule"
```
* Update `rayStartParams.resources` for Ray scheduling. The `OpenAssistant--falcon-7b-sft-top1-696.yaml` file uses both `accelerator_type_cpu` and `accelerator_type_a10`.

```yaml
# Ray head: The Ray head has a Pod resource limit of 2 CPUs.
rayStartParams:
  resources: '"{\"accelerator_type_cpu\": 2}"'  # illustrative value; set the custom resources your model YAML requires
```

If this process takes longer, follow the instructions in the RayService troubleshooting guide.
```yaml
serveConfigV2: |
  applications:
  - name: router
    import_path: aviary.backend:router_application
    route_prefix: /
    args:
      models:
        - ./models/continuous_batching/amazon--LightGPT.yaml
        - ./models/continuous_batching/OpenAssistant--falcon-7b-sft-top1-696.yaml
```
In the YAML file, we use the `serveConfigV2` field to configure a router application that serves two LLMs: LightGPT and Falcon-7B.
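To deploy it, apply the RayService manifest with `kubectl` (the file name below is illustrative):

```shell
kubectl apply -f ray-service.aviary-eks.yaml
```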