This repository has been archived by the owner on May 28, 2024. It is now read-only.

Commit 445ad9b: Merge pull request #67 from YQ-Wang/ray-doc-1
Authored by richardliaw on Oct 4, 2023 (2 parents: 5d220c6 + daedb66)
Showing 2 changed files with 32 additions and 49 deletions.
README.md: 61 changes (24 additions, 37 deletions)

Try it now: [🦜🔍 Ray Aviary Explorer 🦜🔍](http://aviary.anyscale.com/)

RayLLM (formerly known as Aviary) is an LLM serving solution that makes it easy to deploy and manage
a variety of open source LLMs, built on [Ray Serve](https://docs.ray.io/en/latest/serve/index.html). It does this by:

- Providing an extensive suite of pre-configured open source LLMs, with defaults that work out of the box.
- Supporting Transformer models hosted on [Hugging Face Hub](http://hf.co) or present on local disk.
- Offering high performance features like continuous batching, quantization and streaming.
- Providing a REST API that is similar to OpenAI's, making it easy to migrate and cross-test them.

In addition to LLM serving, it also includes a CLI and a web frontend (Aviary Explorer) that you can use to compare the outputs of different models directly, rank them by quality, get a cost and latency estimate, and more.

RayLLM supports continuous batching by integrating with [vLLM](https://github.com/vllm-project/vllm). Continuous batching allows you to get much better throughput and latency than static batching.

RayLLM leverages [Ray Serve](https://docs.ray.io/en/latest/serve/index.html), which has native support for autoscaling
and multi-node deployments. RayLLM can scale to zero and create
new model replicas (each composed of multiple GPU workers) in response to demand.
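For illustration, autoscaling behavior is typically controlled through the Ray Serve deployment options in a model's YAML. A minimal sketch (the `deployment_config` wrapper is an assumption about the model file layout; the field names follow Ray Serve's `autoscaling_config`):

```yaml
deployment_config:
  autoscaling_config:
    min_replicas: 0            # allow the model to scale down to zero replicas
    initial_replicas: 1
    max_replicas: 8
    target_num_ongoing_requests_per_replica: 16   # scale up once replicas are this busy
```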


# Getting started

## Deploying RayLLM

The guide below walks you through the steps required for deployment of RayLLM on Ray Serve.

### Locally

We highly recommend using the official `anyscale/aviary` Docker image to run RayLLM. Manually installing RayLLM is currently not a supported use-case due to specific dependencies required, some of which are not available on pip.


```shell
cache_dir=${XDG_CACHE_HOME:-$HOME/.cache}
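# The exact `docker run` invocation depends on your setup; the lines below are a rough,
# illustrative sketch (the mount target, port, and image tag are assumptions):
docker run -it --gpus all --shm-size 1g -p 8000:8000 \
  -v "$cache_dir":/home/ray/.cache anyscale/aviary:latest bash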

```

### On a Ray Cluster

```shell
ray attach deploy/ray/aviary-cluster.yaml
serve run serve_configs/amazon--LightGPT.yaml
```

You can deploy any model in the `models` directory of this repo,
or define your own model YAML file and run that instead.

### On Kubernetes

For Kubernetes deployments, please see our extensive documentation for [deploying Ray Serve on KubeRay](https://docs.ray.io/en/latest/serve/production-guide/kubernetes.html).

## Query your models

Once the models are deployed, you can install a client outside of the Docker container to query the backend.
```shell
pip install "aviary @ git+https://github.com/ray-project/ray-llm.git"
```

You can query your RayLLM deployment in many ways.

In all cases, start by doing:

```shell
export ENDPOINT_URL="http://localhost:8000/v1"
```

This is because your deployment is running locally, but you can also access remote deployments.

### Using curl

You can use curl at the command line to query your deployed LLM:

```shell
% curl $ENDPOINT_URL/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Hello!"}],
"temperature": 0.7
}'
```

```text
{
"id":"meta-llama/Llama-2-7b-chat-hf-308fc81f-746e-4682-af70-05d35b2ee17d",
```
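You can also stream responses from Python with the `requests` library. A minimal sketch (the request body mirrors the curl example above; the exact format of the streamed chunks is an assumption):

```python
import os

import requests

s = requests.Session()
url = f"{os.environ['ENDPOINT_URL']}/chat/completions"
body = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "stream": True,
}

# Stream the response line by line; each non-empty line is a chunk of the reply.
with s.post(url, json=body, stream=True) as response:
    for chunk in response.iter_lines(decode_unicode=True):
        if chunk:
            print(chunk)
```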
### Using the OpenAI SDK

RayLLM uses an OpenAI-compatible API, allowing us to use the OpenAI
SDK to access our deployments. To do so, we need to set the `OPENAI_API_BASE` env var.

```shell
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY="not_a_real_key"  # assumed placeholder; the OpenAI SDK requires some key to be set
```

```python
import openai

# The model and message below are illustrative; substitute any model you have deployed.
chat_completion = openai.ChatCompletion.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)
print(chat_completion)
```


# RayLLM Reference

## Installing RayLLM
## Running Aviary Explorer (Frontend)

To run the Aviary Explorer frontend with Ray Serve:

```shell
serve run aviary.frontend.app:app --non-blocking
```

You will be able to access it at `http://localhost:8000/frontend` in your browser.

To just use the Gradio frontend without Ray Serve, you can start it
with `python aviary/frontend/app.py`. In that case, the Gradio interface should be accessible at `http://localhost:7860` in your browser.
If running the frontend yourself is not an option, you can still use
[our hosted version](http://aviary.anyscale.com/) for your experiments.

Note that the frontend will not dynamically update the list of models should they change in the backend. In order for the frontend to update, you will need to restart it.

RayLLM uses the Ray Serve CLI, which allows you to interact with deployed models.


```shell
# Start a new model in Ray Serve from provided configuration
serve run serve_configs/<model_config_path>
# Get the current config of the running Serve applications
serve config

# Shut down Ray Serve and all deployed models
serve shutdown
```
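The Ray Serve CLI also provides a `serve status` command, which is handy for checking on the applications started above:

```shell
# Show the status of the running Serve applications and their deployments
serve status
```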


## RayLLM Model Registry

You can easily add new models by adding two configuration files.
To learn more about how to customize or add new models,
see the [Model Registry](models/README.md).

# Frequently Asked Questions
## How do I deploy multiple models at once?

Run multiple models at once by aggregating the Serve configs for different models:

```yaml
applications:
- name: router
  import_path: aviary.backend:router_application
  route_prefix: /
  args:
    models:
      - ./models/continuous_batching/amazon--LightGPT.yaml
      - ./models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml
```
The config includes both models in the `models` argument for the `router`. Save this unified config file to the `serve_configs/` folder.
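You can then start both models with a single `serve run` command (the file name below is illustrative):

```shell
serve run serve_configs/multi_model.yaml
```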
## My deployment isn't starting/working correctly, how can I debug?

There can be several reasons for the deployment not starting or not working correctly. Here are some things to check:

1. You might have specified an invalid model id.
2. Your model may require resources that are not available on the cluster. A common issue is that the model requires Ray custom resources (e.g. `accelerator_type_a10`) in order to be scheduled on the right node type, while your cluster is missing those custom resources. You can either modify the model configuration to remove those custom resources or, better yet, add them to the node configuration of your Ray cluster. You can debug this issue by looking at Ray Autoscaler logs ([monitor.log](https://docs.ray.io/en/latest/ray-observability/user-guides/configure-logging.html#system-component-logs)).
3. Your model is a gated Hugging Face model (e.g. meta-llama). In that case, you need to set the `HUGGING_FACE_HUB_TOKEN` environment variable cluster-wide. You can do that either in the Ray cluster configuration or by setting it before running `serve run` (see the sketch after this list).
4. Your model may be running out of memory. You can usually spot this issue by looking for keywords related to "CUDA", "memory" and "NCCL" in the replica logs or `serve run` output. In that case, consider reducing the `max_batch_prefill_tokens` and `max_batch_total_tokens` (if applicable). See models/README.md for more information on those parameters.
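For example, for a gated model you might export the token right before starting the deployment (the token value and config file name below are placeholders):

```shell
export HUGGING_FACE_HUB_TOKEN="hf_..."  # placeholder; use your own token
serve run serve_configs/meta-llama--Llama-2-7b-chat-hf.yaml
```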

In general, [Ray Dashboard](https://docs.ray.io/en/latest/serve/monitoring.html#ray-dashboard) is a useful debugging tool, letting you monitor your Ray Serve / LLM application and access Ray logs.

A good sanity check is deploying the test model in tests/models/. If that works, you know you can deploy _a_ model.

### How do I write a program that accesses both OpenAI and your hosted model at the same time?

The OpenAI `create()` commands allow you to specify the `API_KEY` and `API_BASE`. So you can do something like this.

```python
# Call your self-hosted model running on the local host:
openai.ChatCompletion.create(api_base="http://localhost:8000/v1", api_key="not_a_real_key", ...)  # placeholder key; arguments shown are illustrative

# Call OpenAI. Set OPENAI_API_KEY to your OpenAI key:
openai.ChatCompletion.create(api_key="OPENAI_API_KEY", ...)
```

## Getting Help and Filing Bugs / Feature Requests

We are eager to help you get started with RayLLM. You can get help on:

- Via Slack -- fill in [this form](https://docs.google.com/forms/d/e/1FAIpQLSfAcoiLCHOguOm8e7Jnn-JJdZaCxPGjgVCvFijHB5PLaQLeig/viewform) to sign up.
- Via [Discuss](https://discuss.ray.io/c/llms-generative-ai/27).

For bugs or for feature requests, please submit them [here](https://github.com/ray-project/ray-llm/issues/new).

# Contributing

Feel free to post an issue first to get our feedback on a proposal, or just open a PR!

We use `pre-commit` hooks to ensure that all code is formatted correctly.
Make sure to `pip install pre-commit` and then run `pre-commit install`.
You can also run `./format` to run the hooks manually.
docs/kuberay/deploy-on-eks.md: 20 changes (8 additions, 12 deletions)

# Deploy RayLLM on Amazon EKS using KubeRay

* Note that this document will be extended to include Ray autoscaling and the deployment of multiple models in the near future.

# Part 1: Set up a Kubernetes cluster on Amazon EKS

## Step 1: Create a Kubernetes cluster on Amazon EKS

Follow the first two steps in this [AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#)
```shell
kubectl apply -f ray-cluster.aviary-eks.yaml
```
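You can then verify that the head and worker Pods come up (standard kubectl; the exact Pod names will vary):

```shell
# List the RayCluster Pods; all of them should eventually reach the Running state
kubectl get pods
```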

A few things are worth noting:

* The `tolerations` for workers must match the taints on the GPU node group.

```yaml
# Please add the following taints to the GPU node.
tolerations:
  - key: "ray.io/node-type"   # assumed key; it must match the taint applied to your GPU node group
    operator: "Equal"
    value: "worker"
    effect: "NoSchedule"
```
* Update `rayStartParams.resources` for Ray scheduling. The `OpenAssistant--falcon-7b-sft-top1-696.yaml` file uses both `accelerator_type_cpu` and `accelerator_type_a10`.

```yaml
# Ray head: The Ray head has a Pod resource limit of 2 CPUs.
rayStartParams:
  resources: '"{\"accelerator_type_cpu\": 2}"'  # illustrative value; set the custom resources your model YAML requires
```

If this process takes longer, follow the instructions in the RayService troubleshooting guide.
```yaml
serveConfigV2: |
  applications:
  - name: router
    import_path: aviary.backend:router_application
    route_prefix: /
    args:
      models:
        - ./models/continuous_batching/amazon--LightGPT.yaml
        - ./models/continuous_batching/OpenAssistant--falcon-7b-sft-top1-696.yaml
```
In the YAML file, we use the `serveConfigV2` field to configure a router application that serves two LLMs: LightGPT and Falcon-7B.
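To deploy it, apply the RayService manifest with `kubectl` (the file name below is illustrative):

```shell
kubectl apply -f ray-service.aviary-eks.yaml
```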