[DOC] Add instructions to install and run RayLLM backend locally #151

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view

New file: `docs/run-backend-locally.md` (111 additions, 0 deletions)

# Install and Run RayLLM Backend Locally

## Install latest Ray Serve

```bash
pip install ray[serve]
```
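
To verify the installation, a quick optional sanity check is to import Ray Serve and print the Ray version:

```bash
# Should print the installed Ray version without import errors.
python -c "import ray; from ray import serve; print(ray.__version__)"
```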

## Install RayLLM Backend from the PR #149 branch (TODO: update the link once the PR is merged)

```bash
git clone https://github.com/xwu99/ray-llm && cd ray-llm && git checkout support-vllm-cpu
```

Install for running on a CPU device:
```bash
pip install -e .[backend] --extra-index-url https://download.pytorch.org/whl/cpu
```

Install for running on a GPU device:
```bash
pip install -e .[backend]
```
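
Optionally, check that the editable install is importable (assuming the package is exposed as the `rayllm` module, matching the CLI used later):

```bash
# The import should succeed without errors if the backend installed correctly.
python -c "import rayllm; print('rayllm imported')"
```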

## (Optional) Additional steps to install vLLM from source for CPU device

### Install GCC (>=12.3)

```bash
conda install -y -c conda-forge gxx=12.3 gxx_linux-64=12.3 libxcrypt
```

### Install latest vLLM (>= 0.4.1) on CPU

```bash
MAX_JOBS=8 VLLM_TARGET_DEVICE=cpu pip install -v git+https://github.com/vllm-project/vllm --extra-index-url https://download.pytorch.org/whl/cpu
```
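
To confirm the build, print the installed vLLM version (it should report >= 0.4.1):

```bash
# Print the vLLM version reported by the freshly built package.
python -c "import vllm; print(vllm.__version__)"
```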

## Test Run

### Run on CPU device

Start Ray from the root of the cloned repository:

```bash
OMP_NUM_THREADS=32 ray start --head
```

To start serving:

__Notice: For better performance, set `dtype` to `"bfloat16"` in the serve config if you are running on a 4th Gen Intel Xeon Scalable CPU (codename "Sapphire Rapids") or newer; otherwise use `"float32"` for compatibility. A quick way to check is shown after the command below.__

```bash
serve run ./serve_configs/cpu/meta-llama--Llama-2-7b-chat-hf.yaml
```
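
If you are unsure whether your CPU supports bfloat16, one way to check (assuming a Linux host) is to look for the AMX / AVX-512 BF16 CPU flags:

```bash
# Sapphire Rapids and newer report amx_bf16 (and avx512_bf16); if neither flag
# is present, keep dtype as "float32" in the serve config.
grep -m1 -o -E 'amx_bf16|avx512_bf16' /proc/cpuinfo || echo "no bfloat16 support detected"
```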

### Run on GPU device

Start Ray from the root of the cloned repository:

```bash
ray start --head
```

To start serving:

__Notice: Please change "accelerator_type_a10" to match your GPU type__

```bash
serve run ./serve_configs/meta-llama--Llama-2-7b-chat-hf.yaml
```
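
If you are not sure which accelerator type applies, the GPU model can be listed (assuming an NVIDIA GPU with `nvidia-smi` available) and matched to the corresponding `accelerator_type_*` value:

```bash
# Prints the GPU model name(s), e.g. "NVIDIA A10G".
nvidia-smi --query-gpu=name --format=csv,noheader
```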

### Query

Export the endpoint URL:

```bash
export ENDPOINT_URL="http://localhost:8000/v1"
```

Send a POST request for chat completions:

```bash
curl -X POST $ENDPOINT_URL/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}],
"temperature": 0.7
}'
```
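
The endpoint can typically also stream tokens back by adding the standard OpenAI `stream` flag (a sketch assuming the backend supports streaming, as the OpenAI-compatible API does):

```bash
curl -X POST $ENDPOINT_URL/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```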

### List Models

Set the API base and key:

```bash
export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="not-a-key"
```

List available RayLLM models:

```bash
rayllm models
```
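
If the `rayllm` CLI is not available in your environment, the same list should be retrievable directly from the OpenAI-compatible endpoint (assuming the standard `/v1/models` route is served):

```bash
curl $OPENAI_API_BASE/models
```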

## Caveats

- The current working directory is the one where `ray start --head` was run, so any relative paths used in the `models/*` files must be relative to the directory Ray was started from.
- When switching Conda environments, restart Ray so that the new environment is picked up (see the sketch below).
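
A minimal sketch of switching environments (the environment name is illustrative):

```bash
# Stop the running Ray instance, activate the other Conda environment, then restart Ray.
ray stop
conda activate my-other-env
ray start --head
```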