No content returned with OpenAI-Compatible Frontend Beta (ensemble & bls) #7868
Hi @njaramish, thanks for raising this issue. Is this consistently reproducible with every request and every model? Or intermittent based on certain models, certain queries, etc.? Any details on the exact scenarios that cause this behavior would be great.
Hi @rmccorm4, please find a reprex below (apologies in advance for how long it is). This is consistently reproducible for me, for every request and every model/container combination I tried -- I have yet to get a non-empty response from the OpenAI-Compatible Frontend.
docker run -it --net=host --gpus '"device=0"' --rm \
-e TRTLLM_ORCHESTRATOR=1 \
nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
pip install /opt/tritonserver/python/triton*.whl
mv server server1
git clone https://github.com/triton-inference-server/server.git
cd server/python/openai/
pip install -r requirements.txt
cd ../../..
git clone --depth 1 --branch v0.16.0 https://github.com/NVIDIA/TensorRT-LLM.git
git clone --depth 1 --branch v0.16.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm/ model_repo/
mkdir raw_models
mkdir checkpoints
git config --global credential.helper store
huggingface-cli login
cd raw_models && git clone https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
cd ..
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir raw_models/Llama-3.1-8B-Instruct --output_dir checkpoints/llama-3.1-8b-instruct-fp8-streaming --dtype auto --tp_size 1
trtllm-build --checkpoint_dir checkpoints/llama-3.1-8b-instruct-fp8-streaming --output_dir model_repo/tensorrt_llm/1 --gemm_plugin auto --workers 1 --gpt_attention_plugin auto --context_fmha enable --remove_input_padding enable --kv_cache_type paged --use_paged_context_fmha enable --max_beam_width 1 --max_seq_len 32768 --max_batch_size 8 --gather_all_token_logits
/opt/tritonserver/model_repo/postprocessing/config.pbtxt:
/opt/tritonserver/model_repo/tensorrt_llm/config.pbtxt:
/opt/tritonserver/model_repo/ensemble/config.pbtxt:
/opt/tritonserver/model_repo/tensorrt_llm_bls/config.pbtxt:
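For reference, these templates are usually filled in with tensorrtllm_backend's tools/fill_template.py rather than edited by hand. The commands below are only a sketch; the exact substitution keys and values depend on the backend version and should match the build above:
# Placeholder values (batch size, decoupled mode, instance count) -- adjust to your deployment.
python3 tensorrtllm_backend/tools/fill_template.py -i model_repo/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:True,engine_dir:model_repo/tensorrt_llm/1,batching_strategy:inflight_fused_batching
python3 tensorrtllm_backend/tools/fill_template.py -i model_repo/postprocessing/config.pbtxt tokenizer_dir:raw_models/Llama-3.1-8B-Instruct,triton_max_batch_size:8,postprocessing_instance_count:1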
python3 /opt/tritonserver/server/python/openai/openai_frontend/main.py --model-repository /opt/tritonserver/model_repo --tokenizer /opt/tritonserver/raw_models/Llama-3.1-8B-Instruct --openai-port 9000 --enable-kserve-frontends
curl \
-w '\nTotal: %{time_total}s\n' \
-s http://localhost:9000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model": "ensemble", "messages": [{"role": "user", "content": "What is machine learning?"}]}'
{"id":"cmpl-0c45b290-b7d6-11ef-b4ec-5ecbd0277ad4","choices":[{"finish_reason":"stop","index":0,"message":{"content":"","tool_calls":null,"role":"assistant","function_call":null},"logprobs":null}],"created":1733931542,"model":"ensemble","system_fingerprint":null,"object":"chat.completion","usage":null}
Total: 0.372138s
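For comparison, the KServe frontend that --enable-kserve-frontends starts alongside the OpenAI endpoint can be queried directly through Triton's generate endpoint. A sketch, assuming the default KServe HTTP port 8000 and the stock text_input/max_tokens inputs of the tensorrtllm_backend ensemble:
curl -s http://localhost:8000/v2/models/ensemble/generate \
  -H 'Content-Type: application/json' \
  -d '{"text_input": "What is machine learning?", "max_tokens": 128}'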
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:9000/v1",
api_key="EMPTY",
)
model = "ensemble"
completion = client.chat.completions.create(
model=model,
messages=[
{
"role": "system",
"content": "You are a helpful assistant.",
},
{"role": "user", "content": "Why is the sky blue?"},
],
stop = '<|eot_id|>',
)
print(completion.choices)
I faced the same problem described by @njaramish. @rmccorm4, do you have any guess about what the root cause might be? Thanks for the support.
@rmccorm4 I've updated my reprex above for the 24.12 container and TensorRT-LLM v0.16.0 (including config files) -- still running into the same issue with the OpenAI frontend, where the content field of the response is empty. With the latest version of the container and frontend code, the KServe frontend now works. Please note #9 above (which is new), showing that the KServe frontend produces the correct response, while the OpenAI frontend returns an empty response (same deployment). Please let me know if there is anything else I can do to assist in debugging. Happy Holidays!
@rmccorm4 Thank you very much for your hard work on the recently released OpenAI-Compatible Frontend! I hope it's ok for me to tag you in this issue for visibility.
I am encountering a nearly identical issue to #7724 using the OpenAI-Compatible Frontend available on main today. Unfortunately, using a BLS model (which solved the original user's issue) still returns no content. I've tried different versions of nvcr.io/nvidia/tritonserver:24.XX-trtllm-python-py3, as well as different versions of TensorRT-LLM (0.15.0 & 0.16.0.dev.20240603) and different models (Qwen & Llama). In all cases, the OpenAI-Compatible Frontend returns no content, while generating text using Tritonserver's Python bindings or a standalone Triton server launched with this script works as expected.
In the ensemble model specifically, it appears that the preprocessing model receives the correct prompt from the OpenAI-Compatible Frontend and passes the tokenized prompt to the tensorrt_llm model, but the tensorrt_llm model never passes any content to the postprocessing model. Please let me know if any additional info would be helpful, or if there are any additional steps you'd like me to take to debug.
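One additional step I can take on my side, if useful: run a standalone tritonserver against the same model repository with verbose logging and send the same prompt, to get more visibility into the per-model requests and responses inside the ensemble. A sketch, using the standard tritonserver flags and the paths from the reprex above:
# Verbose logging to watch how requests move between the composing models.
tritonserver --model-repository=/opt/tritonserver/model_repo --log-verbose=1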