This repo provides the FastAPI server code for hosting Hugging Face models on a local machine or on a cluster. To get started, clone the repo and set up the environment:
git clone https://github.com/maharshi95/hf-fastapi.git
cd hf-fastapi
bash setup_env.sh
conda activate hf-fastapi
To serve a model locally:

python -m hf_fastapi.serve --model-name {MODEL_NAME} --port {PORT}
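For example, to serve a Mistral-7B-Instruct model on port 8000 (the model alias here is borrowed from the cluster example below; substitute whatever identifier your setup expects):

python -m hf_fastapi.serve --model-name "mistral-7b-inst" --port 8000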
To launch the server on a SLURM cluster with slaunch:

conda activate hf-fastapi
slaunch --exp-name="hf-serve" --config="slurm_configs/med_gpu_nexus.json" \
    hf_fastapi/serve.py -m "mistral-7b-inst" -p 8000
You can add a custom SLURM config file to the slurm_configs directory and use it to submit the job. An example SLURM config file is given below:
{
  "account": "$SLURM_ACCOUNT",
  "partition": "$SLURM_PARTITION",
  "qos": "default",
  "gres": "gpu:rtxa5000:1",
  "time": "10:00:00",
  "mem": "30G",
  "ntasks-per-node": 1,
  "cpus-per-task": 4
}
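For instance, assuming you saved the config above as slurm_configs/my_cluster.json (the filename is hypothetical), the job could be submitted the same way as before, just pointing --config at the new file:

slaunch --exp-name="hf-serve" --config="slurm_configs/my_cluster.json" \
    hf_fastapi/serve.py -m "mistral-7b-inst" -p 8000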
The file client/example.py contains a complete example of how to use the API:
from hf_client.client import HFClient

# Point the client at wherever the server is running
# (localhost:8000 is an assumption matching the local serve example above)
HOST, PORT = "localhost", 8000
client = HFClient(host=HOST, port=PORT)
# Health check
resp = client.get_heartbeat()
print("Is alive?", resp.is_alive)
# Generate API
prompt = "Question: What is the meaning of life, the universe, and everything? Answer:"
resp = client.generate(prompt=prompt, max_new_tokens=50)
print(f'Input: "{resp.input_text}"')
print("Model:", resp.model_name)
print(f'Output: "{resp.generated_text.strip()}"')
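With a server from the earlier steps running, the example can be tried directly (assuming the script takes no command-line arguments):

python client/example.py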