# FastAPI + Ray = <3
Let's take a FastAPI app and supercharge it with raycraft:

```python
from fastapi import FastAPI

simple_service = FastAPI()


@simple_service.post("/")
async def read_root() -> dict[str, str]:
    return {"Hello": "World"}
```
You can now run it with raycraft by using `RayCraftAPI` instead of `FastAPI`, a change of only two lines:

```diff
- from fastapi import FastAPI
+ from raycraft import RayCraftAPI

- simple_service = FastAPI()
+ simple_service = RayCraftAPI()

@simple_service.post("/")
async def read_root() -> dict[str, str]:
    return {"Hello": "World"}
```
An endpoint returning `{"Hello": "World"}` isn't enough to show why you might want raycraft, so let's try something more interesting.

Say you build a translation service using the following FastAPI code:
```python
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()


def load_model():
    return pipeline("translation_en_to_fr", model="t5-small")


@app.post("/")
async def translate(text: str):
    model = load_model()
    translated = model(text)[0]["translation_text"]
    return {"translation": translated}
```
We can now port this app to raycraft with the same swap of `FastAPI` for `RayCraftAPI`:
```python
from raycraft import RayCraftAPI
from transformers import pipeline

app = RayCraftAPI()


def load_model():
    return pipeline("translation_en_to_fr", model="t5-small")


def run_translation(text: str):
    model = load_model()
    translated = model(text)[0]["translation_text"]
    return translated


@app.post("/")
async def translate(text: str):
    return run_translation(text)
```

Note the helper is named `run_translation` so the endpoint doesn't shadow it and call itself.
We then run the app with the following command (assuming the file is saved as `demo.py`):

```bash
raycraft run demo:app
```
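To sanity-check the service, hit the endpoint directly. A minimal sketch, assuming raycraft serves on Ray Serve's default HTTP port 8000, and noting that FastAPI exposes a bare `str` parameter like `text` as a query parameter:

```bash
curl -X POST "http://localhost:8000/?text=Hello%20world"
# should return the translated string, e.g. "Bonjour le monde"
```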
Now for the distributed part. Say we want to run this app on two replicas, each replica taking half a GPU, with requests properly load-balanced between them. We can do this by passing deployment options to the constructor:
```python
from raycraft import RayCraftAPI
from transformers import pipeline

app = RayCraftAPI(ray_actor_options={"num_gpus": 0.5}, num_replicas=2)


def load_model():
    return pipeline("translation_en_to_fr", model="t5-small")


def run_translation(text: str):
    model = load_model()
    translated = model(text)[0]["translation_text"]
    return translated


@app.post("/")
async def translate(text: str):
    return run_translation(text)
```
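One quick way to see both replicas share the load is to fire concurrent requests at the service. A minimal sketch, assuming the app is running locally on Ray Serve's default port 8000:

```python
import concurrent.futures

import requests


def call(text: str) -> str:
    # The bare `str` parameter is exposed as a query parameter.
    resp = requests.post("http://localhost:8000/", params={"text": text})
    resp.raise_for_status()
    return resp.json()


# Concurrent requests get spread across the two replicas.
texts = [f"Sentence number {i}" for i in range(16)]
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for translation in pool.map(call, texts):
        print(translation)
```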
To avoid loading the model on every request, we can load the model in the constructor of the app:
```python
from raycraft import RayCraftAPI, App
from transformers import pipeline

app = RayCraftAPI(ray_actor_options={"num_gpus": 0.5}, num_replicas=2)


@app.init
def model():
    return pipeline("translation_en_to_fr", model="t5-small")


def run_translation(app: App, text: str):
    translated = app.model(text)[0]["translation_text"]
    return translated


@app.post("/")
async def translate(app: App, text: str):
    return run_translation(app, text)
```
RayCraft is a thin layer built on top of Ray Serve that adopts a functional interface to ease the migration of FastAPI apps.

With Ray Serve, you can now:

- Scale your app deployment to multiple replicas running on different machines
- Define the resources allocated to each replica, including fractional GPUs
- Batch requests together to improve throughput (see the sketch after this list)
- Get fault tolerance and automatic retries
- Stream responses using websockets
- Compose different services together using RPC calls that are strictly typed and faster than HTTP requests
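To make the "thin layer" claim concrete, here is roughly what such a deployment looks like written directly against Ray Serve, including request batching. This is an illustrative sketch using the underlying Ray Serve APIs (`serve.deployment`, `serve.batch`), not raycraft's actual generated code:

```python
from ray import serve
from starlette.requests import Request
from transformers import pipeline


@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 0.5})
class Translator:
    def __init__(self):
        # Loaded once per replica, like raycraft's @app.init hook.
        self.model = pipeline("translation_en_to_fr", model="t5-small")

    # @serve.batch transparently gathers concurrent calls into one list,
    # so a single model invocation serves up to 8 requests at a time.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def translate_batch(self, texts: list[str]) -> list[str]:
        return [out["translation_text"] for out in self.model(texts)]

    async def __call__(self, request: Request) -> str:
        # Each caller passes one string and gets back its own result.
        return await self.translate_batch(request.query_params["text"])


translator_app = Translator.bind()  # deploy with: serve run demo:translator_app
```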
To install raycraft, using poetry:

```bash
poetry add raycraft
```

Using pip:

```bash
pip install raycraft
```
On the roadmap:

- Streaming support using websockets
- Deployment guide