
No content returned with OpenAI-Compatible Frontend Beta (ensemble & bls) #7868

Open
njaramish opened this issue Dec 11, 2024 · 4 comments

@njaramish

@rmccorm4 Thank you very much for your hard work on the recently released OpenAI-Compatible Frontend! I hope it's okay for me to tag you in this issue for visibility.

I am encountering an issue nearly identical to #7724 using the OpenAI-Compatible Frontend available today on main. Unfortunately, using a BLS model (which solved the original reporter's issue) still returns no content.

I've tried different versions of nvcr.io/nvidia/tritonserver:24.XX-trtllm-python-py3, as well as different versions of TensorRT-LLM (0.15.0 & 0.16.0.dev.20240603) and different models (Qwen & Llama).

In all cases, the OpenAI-Compatible Frontend returns no content, while generating text using Triton Server's Python bindings or a standalone Triton server launched with this script works as expected.

In the ensemble model specifically, it appears that the preprocessing model receives the correct prompt from the OpenAI-Compatible Frontend and passes the tokenized prompt to the tensorrt_llm model, but the tensorrt_llm model never passes any content to the postprocessing model.

Please let me know if any additional info would be helpful, or if there are any additional steps you'd like me to take to debug.

@rmccorm4
Contributor

Hi @njaramish, thanks for raising this issue.

Is this consistently reproducible with every request and every model, or is it intermittent depending on certain models, queries, etc.? Any details on the exact scenarios that cause this behavior would be great.

rmccorm4 self-assigned this Dec 11, 2024
@njaramish
Author

njaramish commented Dec 11, 2024

Hi @rmccorm4, please find a reprex below (apologies in advance for how long it is). This is consistently reproducible for me, for every request and every model/container combination I have tried -- I have yet to get a non-empty response from the OpenAI-Compatible Frontend.

  1. Set up container and build TRTLLM engine
docker run -it --net=host --gpus '"device=0"' --rm \
  -e TRTLLM_ORCHESTRATOR=1 \
  nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3

pip install /opt/tritonserver/python/triton*.whl
mv server server1
git clone https://github.com/triton-inference-server/server.git
cd server/python/openai/
pip install -r requirements.txt
cd ../../..
git clone --depth 1 --branch v0.16.0 https://github.com/NVIDIA/TensorRT-LLM.git
git clone --depth 1 --branch v0.16.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm/ model_repo/
mkdir raw_models
mkdir checkpoints
git config --global credential.helper store
huggingface-cli login
cd raw_models && git clone https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
cd ..
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir raw_models/Llama-3.1-8B-Instruct \
    --output_dir checkpoints/llama-3.1-8b-instruct-fp8-streaming \
    --dtype auto \
    --tp_size 1
trtllm-build \
    --checkpoint_dir checkpoints/llama-3.1-8b-instruct-fp8-streaming \
    --output_dir model_repo/tensorrt_llm/1 \
    --gemm_plugin auto \
    --workers 1 \
    --gpt_attention_plugin auto \
    --context_fmha enable \
    --remove_input_padding enable \
    --kv_cache_type paged \
    --use_paged_context_fmha enable \
    --max_beam_width 1 \
    --max_seq_len 32768 \
    --max_batch_size 8 \
    --gather_all_token_logits
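For reference, after the copy and build steps above the model repository should end up looking roughly like this (a sketch based on the default inflight_batcher_llm templates; the Python models each ship a 1/model.py next to their config.pbtxt):

model_repo/
├── ensemble/config.pbtxt
├── preprocessing/config.pbtxt       (+ 1/model.py)
├── postprocessing/config.pbtxt      (+ 1/model.py)
├── tensorrt_llm/config.pbtxt        (+ 1/ holding the engine from trtllm-build)
└── tensorrt_llm_bls/config.pbtxt    (+ 1/model.py)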
  2. Populate the config.pbtxt files in the model repo:
    /opt/tritonserver/model_repo/preprocessing/config.pbtxt:
name: "preprocessing"
backend: "python"
max_batch_size: 8
input [
    {
        name: "QUERY"
        data_type: TYPE_STRING
        dims: [ 1 ]
    },
    {
        name: "DECODER_QUERY"
        data_type: TYPE_STRING
        dims: [ 1 ]
        optional: true
    },
    {
        name: "IMAGE_BYTES"
        data_type: TYPE_UINT8
        dims: [ -1, -1, -1, -1 ]
        optional: true
    },
    {
        name: "IMAGE_URL"
        data_type: TYPE_STRING
        dims: [ 1 ]
        optional: true
    },
    {
        name: "VIDEO_BYTES"
        data_type: TYPE_UINT8
        dims: [ -1, -1, -1, -1 ]
        optional: true
    },
    {
        name: "REQUEST_OUTPUT_LEN"
        data_type: TYPE_INT32
        dims: [ 1 ]
    },
    {
        name: "BAD_WORDS_DICT"
        data_type: TYPE_STRING
        dims: [ -1 ]
        optional: true
    },
    {
        name: "STOP_WORDS_DICT"
        data_type: TYPE_STRING
        dims: [ -1 ]
        optional: true
    },
    {
        name: "EMBEDDING_BIAS_WORDS"
        data_type: TYPE_STRING
        dims: [ -1 ]
        optional: true
    },
    {
        name: "EMBEDDING_BIAS_WEIGHTS"
        data_type: TYPE_FP32
        dims: [ -1 ]
        optional: true
    },
    {
        name: "END_ID"
        data_type: TYPE_INT32
        dims: [ 1 ]
        optional: true
    },
    {
        name: "PAD_ID"
        data_type: TYPE_INT32
        dims: [ 1 ]
        optional: true
    },
    {
        name: "PROMPT_TABLE_EXTRA_ID"
        data_type: TYPE_UINT64
        dims: [ 1 ]
        optional: true
    }
]
output [
    {
        name: "INPUT_ID"
        data_type: TYPE_INT32
        dims: [ -1 ]
    },
    {
        name: "REQUEST_INPUT_LEN"
        data_type: TYPE_INT32
        dims: [ 1 ]
    },
    {
        name: "DECODER_INPUT_ID"
        data_type: TYPE_INT32
        dims: [ -1 ]
    },
    {
        name: "REQUEST_DECODER_INPUT_LEN"
        data_type: TYPE_INT32
        dims: [ 1 ]
    },
    {
        name: "BAD_WORDS_IDS"
        data_type: TYPE_INT32
        dims: [ 2, -1 ]
    },
    {
        name: "STOP_WORDS_IDS"
        data_type: TYPE_INT32
        dims: [ 2, -1 ]
    },
    {
        name: "EMBEDDING_BIAS"
        data_type: TYPE_FP32
        dims: [ -1 ]
    },
    {
        name: "REQUEST_OUTPUT_LEN"
        data_type: TYPE_INT32
        dims: [ -1 ]
    },
    {
        name: "OUT_END_ID"
        data_type: TYPE_INT32
        dims: [ 1 ]
    },
    {
        name: "OUT_PAD_ID"
        data_type: TYPE_INT32
        dims: [ 1 ]
    },
    {
        name: "OUT_PROMPT_TABLE_EXTRA_IDS"
        data_type: TYPE_UINT64
        dims: [ -1 ]
    },
    {
        name: "PIXEL_VALUES"
        data_type: TYPE_FP16
        dims: [ -1, -1, -1, -1 ]
    },
    {
        name: "ASPECT_RATIO_IDS"
        data_type: TYPE_INT64
        dims: [ -1 ]
    },
    {
        name: "ASPECT_RATIO_MASK"
        data_type: TYPE_INT64
        dims: [ -1, -1 ]
    },
    {
        name: "CROSS_ATTENTION_MASK"
        data_type: TYPE_INT64
        dims: [ -1, -1, -1 ]
    },
    # Required for image postprocessing in the llava_onevision model
    {
        name: "IMAGE_SIZES"
        data_type: TYPE_INT64
        dims: [ 2 ]
    },
    # Indicates if the input is video in the llava_onevision model
    {
        name: "IS_VIDEO_INPUT"
        data_type: TYPE_BOOL
        dims: [ 1 ]
    }
]
parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "/opt/tritonserver/raw_models/Llama-3.1-8B-Instruct"
  }
}

parameters {
  key: "add_special_tokens"
  value: {
    string_value: "True"
  }
}

parameters {
  key: "visual_model_path"
  value: {
    string_value: "${visual_model_path}"
  }
}

parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/opt/tritonserver/model_repo/tensorrt_llm/1"
  }
}

instance_group [
    {
        count: 1
        kind: KIND_CPU
    }
]

/opt/tritonserver/model_repo/postprocessing/config.pbtxt:

name: "postprocessing"
backend: "python"
max_batch_size: 8
dynamic_batching {}
input [
  {
    name: "TOKENS_BATCH"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "SEQUENCE_LENGTH"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "/opt/tritonserver/raw_models/Llama-3.1-8B-Instruct"
  }
}

parameters {
  key: "skip_special_tokens"
  value: {
    string_value: "True"
  }
}

instance_group [
    {
        count: 1
        kind: KIND_CPU
    }
]

/opt/tritonserver/model_repo/tensorrt_llm/config.pbtxt:

name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 8

model_transaction_policy {
  decoupled: True
}

dynamic_batching {
    preferred_batch_size: [ 8 ]
    max_queue_delay_microseconds: 100
    default_queue_policy: { max_queue_size: 1000 }
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    allow_ragged_batch: true
    optional: true
  },
  {
    name: "encoder_input_features"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    allow_ragged_batch: true
    optional: true
  },
  {
    name: "encoder_output_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "num_return_sequences"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "draft_input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "decoder_input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "decoder_input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
    reshape: { shape: [ ] }
  },
  {
    name: "draft_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "draft_acceptance_threshold"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "embedding_bias"
    data_type: TYPE_FP32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p_min"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p_decay"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p_reset_ids"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "early_stopping"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_search_diversity_rate"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "frequency_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_context_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_generation_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "exclude_input_in_output"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "streaming"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "prompt_embedding_table"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "prompt_table_extra_ids"
    data_type: TYPE_UINT64
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "prompt_vocab_size"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  # cross_attention_mask shape `[bs, seq_len, num_images*num_tiles]`
  {
    name: "cross_attention_mask"
    data_type: TYPE_BOOL
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  # the unique task ID for the given LoRA.
  # To perform inference with a specific LoRA for the first time `lora_task_id` `lora_weights` and `lora_config` must all be given.
  # The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`.
  # If the cache is full the oldest LoRA will be evicted to make space for new ones.  An error is returned if `lora_task_id` is not cached.
  {
    name: "lora_task_id"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  # weights for a lora adapter shape [ num_lora_modules_layers, D x Hi + Ho x D ]
  # where the last dimension holds the in / out adapter weights for the associated module (e.g. attn_qkv) and model layer
  # each of the in / out tensors are first flattened and then concatenated together in the format above.
  # D=adapter_size (R value), Hi=hidden_size_in, Ho=hidden_size_out.
  {
    name: "lora_weights"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  # module identifier (same size a first dimension of lora_weights)
  # See LoraModule::ModuleType for model id mapping
  #
  # "attn_qkv": 0     # compbined qkv adapter
  # "attn_q": 1       # q adapter
  # "attn_k": 2       # k adapter
  # "attn_v": 3       # v adapter
  # "attn_dense": 4   # adapter for the dense layer in attention
  # "mlp_h_to_4h": 5  # for llama2 adapter for gated mlp layer after attention / RMSNorm: up projection
  # "mlp_4h_to_h": 6  # for llama2 adapter for gated mlp layer after attention / RMSNorm: down projection
  # "mlp_gate": 7     # for llama2 adapter for gated mlp later after attention / RMSNorm: gate
  #
  # last dim holds [ module_id, layer_idx, adapter_size (D aka R value) ]
  {
    name: "lora_config"
    data_type: TYPE_INT32
    dims: [ -1, 3 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "context_phase_params"
    data_type: TYPE_UINT8
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  # skip_cross_attn_blocks shape `[bs, 1]`, only used in mllama
  {
    name: "skip_cross_attn_blocks"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
    allow_ragged_batch: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "context_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "generation_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  },
  {
    name: "batch_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "sequence_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "context_phase_params"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  },
  {
    name: "kv_cache_alloc_new_blocks"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "kv_cache_reused_blocks"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "kv_cache_alloc_total_blocks"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
instance_group [
  {
    count: 1
    kind : KIND_CPU
  }
]
parameters: {
  key: "max_beam_width"
  value: {
    string_value: "1"
  }
}
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "no"
  }
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/opt/tritonserver/model_repo/tensorrt_llm/1"
  }
}
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "unspecified"
  }
}
parameters: {
  key: "max_attention_window_size"
  value: {
    string_value: "max_sequence_length"
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "guarantee_no_evict"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.9"
  }
}
parameters: {
  key: "cross_kv_cache_fraction"
  value: {
    string_value: "${cross_kv_cache_fraction}"
  }
}
parameters: {
  key: "kv_cache_host_memory_bytes"
  value: {
    string_value: "0"
  }
}
# kv_cache_onboard_blocks is for internal implementation.
parameters: {
  key: "kv_cache_onboard_blocks"
  value: {
    string_value: "True"
  }
}
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "normalize_log_probs"
  value: {
    string_value: "false"
  }
}
parameters: {
  key: "enable_chunked_context"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "gpu_device_ids"
  value: {
    string_value: "0"
  }
}
parameters: {
  key: "decoding_mode"
  value: {
    string_value: "top_k"
  }
}
parameters: {
  key: "executor_worker_path"
  value: {
    string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker"
  }
}
parameters: {
  key: "gpu_weights_percent"
    value: {
      string_value: "1.0"
  }
}

/opt/tritonserver/model_repo/ensemble/config.pbtxt:

name: "ensemble"
platform: "ensemble"
max_batch_size: 8
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "decoder_text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
    optional: true
  },
  {
    name: "max_tokens"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "num_return_sequences"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
   name: "bad_words"
   data_type: TYPE_STRING
   dims: [ -1 ]
   optional: true
  },
  {
   name: "stop_words"
   data_type: TYPE_STRING
   dims: [ -1 ]
   optional: true
  },
  {
    name: "exclude_input_in_output"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_k"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "length_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "frequency_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    optional: true
  },
  {
    name: "return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "return_context_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "return_generation_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "stream"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "prompt_embedding_table"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
  },
  {
    name: "prompt_table_extra_id"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    optional: true
  },
  {
    name: "prompt_vocab_size"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "embedding_bias_words"
    data_type: TYPE_STRING
    dims: [ -1 ]
    optional: true
  },
  {
    name: "embedding_bias_weights"
    data_type: TYPE_FP32
    dims: [ -1 ]
    optional: true
  },
  # the unique task ID for the given LoRA.
  # To perform inference with a specific LoRA for the first time `lora_task_id` `lora_weights` and `lora_config` must all be given.
  # The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`.
  # If the cache is full the oldest LoRA will be evicted to make space for new ones.  An error is returned if `lora_task_id` is not cached.
  {
    name: "lora_task_id"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    optional: true
  },
  # weights for a lora adapter shape [ num_lora_modules_layers, D x Hi + Ho x D ]
  # where the last dimension holds the in / out adapter weights for the associated module (e.g. attn_qkv) and model layer
  # each of the in / out tensors are first flattened and then concatenated together in the format above.
  # D=adapter_size (R value), Hi=hidden_size_in, Ho=hidden_size_out.
  {
    name: "lora_weights"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  # module identifier (same size a first dimension of lora_weights)
  # See LoraModule::ModuleType for model id mapping
  #
  # "attn_qkv": 0     # compbined qkv adapter
  # "attn_q": 1       # q adapter
  # "attn_k": 2       # k adapter
  # "attn_v": 3       # v adapter
  # "attn_dense": 4   # adapter for the dense layer in attention
  # "mlp_h_to_4h": 5  # for llama2 adapter for gated mlp layer after attention / RMSNorm: up projection
  # "mlp_4h_to_h": 6  # for llama2 adapter for gated mlp layer after attention / RMSNorm: down projection
  # "mlp_gate": 7     # for llama2 adapter for gated mlp later after attention / RMSNorm: gate
  #
  # last dim holds [ module_id, layer_idx, adapter_size (D aka R value) ]
  {
    name: "lora_config"
    data_type: TYPE_INT32
    dims: [ -1, 3 ]
    optional: true
    allow_ragged_batch: true
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "context_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "generation_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  },
  {
    name: "batch_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "sequence_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map {
        key: "QUERY"
        value: "text_input"
      }
      input_map {
        key: "DECODER_QUERY"
        value: "decoder_text_input"
      }
      input_map {
        key: "REQUEST_OUTPUT_LEN"
        value: "max_tokens"
      }
      input_map {
        key: "BAD_WORDS_DICT"
        value: "bad_words"
      }
      input_map {
        key: "STOP_WORDS_DICT"
        value: "stop_words"
      }
      input_map {
        key: "EMBEDDING_BIAS_WORDS"
        value: "embedding_bias_words"
      }
      input_map {
        key: "EMBEDDING_BIAS_WEIGHTS"
        value: "embedding_bias_weights"
      }
      input_map {
        key: "END_ID"
        value: "end_id"
      }
      input_map {
        key: "PAD_ID"
        value: "pad_id"
      }
      input_map {
        key: "PROMPT_TABLE_EXTRA_ID"
        value: "prompt_table_extra_id"
      }
      output_map {
        key: "REQUEST_INPUT_LEN"
        value: "_REQUEST_INPUT_LEN"
      }
      output_map {
        key: "INPUT_ID"
        value: "_INPUT_ID"
      }
      output_map {
        key: "REQUEST_DECODER_INPUT_LEN"
        value: "_REQUEST_DECODER_INPUT_LEN"
      }
      output_map {
        key: "DECODER_INPUT_ID"
        value: "_DECODER_INPUT_ID"
      }
      output_map {
        key: "REQUEST_OUTPUT_LEN"
        value: "_REQUEST_OUTPUT_LEN"
      }
      output_map {
        key: "STOP_WORDS_IDS"
        value: "_STOP_WORDS_IDS"
      }
      output_map {
        key: "BAD_WORDS_IDS"
        value: "_BAD_WORDS_IDS"
      }
      output_map {
        key: "EMBEDDING_BIAS"
        value: "_EMBEDDING_BIAS"
      }
      output_map {
        key: "OUT_END_ID"
        value: "_PREPROCESSOR_END_ID"
      }
      output_map {
        key: "OUT_PAD_ID"
        value: "_PREPROCESSOR_PAD_ID"
      }
      output_map {
        key: "OUT_PROMPT_TABLE_EXTRA_IDS"
        value: "_OUT_PROMPT_TABLE_EXTRA_IDS"
      }
    },
    {
      model_name: "tensorrt_llm"
      model_version: -1
      input_map {
        key: "input_ids"
        value: "_INPUT_ID"
      }
      input_map {
        key: "decoder_input_ids"
        value: "_DECODER_INPUT_ID"
      }
      input_map {
        key: "input_lengths"
        value: "_REQUEST_INPUT_LEN"
      }
      input_map {
        key: "decoder_input_lengths"
        value: "_REQUEST_DECODER_INPUT_LEN"
      }
      input_map {
        key: "exclude_input_in_output"
        value: "exclude_input_in_output"
      }
      input_map {
        key: "request_output_len"
        value: "_REQUEST_OUTPUT_LEN"
      }
      input_map {
          key: "end_id"
          value: "_PREPROCESSOR_END_ID"
      }
      input_map {
          key: "pad_id"
          value: "_PREPROCESSOR_PAD_ID"
      }
      input_map {
          key: "embedding_bias"
          value: "_EMBEDDING_BIAS"
      }
      input_map {
          key: "runtime_top_k"
          value: "top_k"
      }
      input_map {
          key: "runtime_top_p"
          value: "top_p"
      }
      input_map {
          key: "temperature"
          value: "temperature"
      }
      input_map {
          key: "len_penalty"
          value: "length_penalty"
      }
      input_map {
          key: "repetition_penalty"
          value: "repetition_penalty"
      }
      input_map {
          key: "min_length"
          value: "min_length"
      }
      input_map {
          key: "presence_penalty"
          value: "presence_penalty"
      }
      input_map {
          key: "frequency_penalty"
          value: "frequency_penalty"
      }
      input_map {
          key: "random_seed"
          value: "random_seed"
      }
      input_map {
          key: "return_log_probs"
          value: "return_log_probs"
      }
      input_map {
          key: "return_context_logits"
          value: "return_context_logits"
      }
      input_map {
          key: "return_generation_logits"
          value: "return_generation_logits"
      }
      input_map {
          key: "num_return_sequences"
          value: "num_return_sequences"
      }
      input_map {
          key: "beam_width"
          value: "beam_width"
      }
      input_map {
          key: "streaming"
          value: "stream"
      }
      input_map {
        key: "prompt_embedding_table"
        value: "prompt_embedding_table"
      }
      input_map {
        key: "prompt_vocab_size"
        value: "prompt_vocab_size"
      }
      input_map {
        key: "stop_words_list"
        value: "_STOP_WORDS_IDS"
      }
      input_map {
        key: "bad_words_list"
        value: "_BAD_WORDS_IDS"
      }
      input_map {
        key: "prompt_table_extra_ids"
        value: "_OUT_PROMPT_TABLE_EXTRA_IDS"
      },
      input_map {
        key: "lora_task_id",
        value: "lora_task_id"
      },
      input_map {
        key: "lora_weights",
        value: "lora_weights"
      },
      input_map {
        key: "lora_config",
        value: "lora_config"
      },
      output_map {
        key: "output_ids"
        value: "_TOKENS_BATCH"
      }
      output_map {
        key: "sequence_length"
        value: "_SEQUENCE_LENGTH"
      },
      output_map {
        key: "cum_log_probs"
        value: "cum_log_probs"
      }
      output_map {
        key: "output_log_probs"
        value: "output_log_probs"
      },
      output_map {
        key: "context_logits"
        value: "context_logits"
      },
      output_map {
        key: "generation_logits"
        value: "generation_logits"
      },
      output_map {
        key: "batch_index"
        value: "batch_index"
      },
      output_map {
        key: "sequence_index"
        value: "sequence_index"
      }
    },
    {
      model_name: "postprocessing"
      model_version: -1
      input_map {
        key: "TOKENS_BATCH"
        value: "_TOKENS_BATCH"
      }
      input_map {
        key: "SEQUENCE_LENGTH"
        value: "_SEQUENCE_LENGTH"
      }
      output_map {
        key: "OUTPUT"
        value: "text_output"
      }
    }
  ]
}

/opt/tritonserver/model_repo/tensorrt_llm_bls/config.pbtxt:

name: "tensorrt_llm_bls"
backend: "python"
max_batch_size: 8

model_transaction_policy {
  decoupled: True
}

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "decoder_text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
    optional: true
  },
  {
    name: "image_input"
    data_type: TYPE_FP16
    dims: [ -1, 3, -1, -1 ]
    optional: true
  },
  {
    name: "max_tokens"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
   name: "bad_words"
   data_type: TYPE_STRING
   dims: [ -1 ]
   optional: true
  },
  {
   name: "stop_words"
   data_type: TYPE_STRING
   dims: [ -1 ]
   optional: true
  },
  {
    name: "exclude_input_in_output"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_k"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "length_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "frequency_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    optional: true
  },
  {
    name: "return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_context_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_generation_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "num_return_sequences"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "stream"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "prompt_embedding_table"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
  },
  {
    name: "prompt_vocab_size"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "prompt_table_extra_id"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    optional: true
  },
  {
      name: "embedding_bias_words"
      data_type: TYPE_STRING
      dims: [ -1 ]
      optional: true
  },
  {
      name: "embedding_bias_weights"
      data_type: TYPE_FP32
      dims: [ -1 ]
      optional: true
  },
  {
      name: "num_draft_tokens",
      data_type: TYPE_INT32,
      dims: [ 1 ]
      optional: true
  },
  {
      name: "use_draft_logits",
      data_type: TYPE_BOOL,
      dims: [ 1 ]
      reshape: { shape: [ ] }
      optional: true
  },
  # the unique task ID for the given LoRA.
  # To perform inference with a specific LoRA for the first time `lora_task_id` `lora_weights` and `lora_config` must all be given.
  # The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`.
  # If the cache is full the oldest LoRA will be evicted to make space for new ones.  An error is returned if `lora_task_id` is not cached.
  {
    name: "lora_task_id"
        data_type: TYPE_UINT64
        dims: [ 1 ]
    reshape: { shape: [ ] }
        optional: true
  },
  # weights for a lora adapter shape [ num_lora_modules_layers, D x Hi + Ho x D ]
  # where the last dimension holds the in / out adapter weights for the associated module (e.g. attn_qkv) and model layer
  # each of the in / out tensors are first flattened and then concatenated together in the format above.
  # D=adapter_size (R value), Hi=hidden_size_in, Ho=hidden_size_out.
  {
    name: "lora_weights"
        data_type: TYPE_FP16
        dims: [ -1, -1 ]
        optional: true
        allow_ragged_batch: true
  },
  # module identifier (same size a first dimension of lora_weights)
  # See LoraModule::ModuleType for model id mapping
  #
  # "attn_qkv": 0     # compbined qkv adapter
  # "attn_q": 1       # q adapter
  # "attn_k": 2       # k adapter
  # "attn_v": 3       # v adapter
  # "attn_dense": 4   # adapter for the dense layer in attention
  # "mlp_h_to_4h": 5  # for llama2 adapter for gated mlp layer after attention / RMSNorm: up projection
  # "mlp_4h_to_h": 6  # for llama2 adapter for gated mlp layer after attention / RMSNorm: down projection
  # "mlp_gate": 7     # for llama2 adapter for gated mlp later after attention / RMSNorm: gate
  #
  # last dim holds [ module_id, layer_idx, adapter_size (D aka R value) ]
  {
    name: "lora_config"
        data_type: TYPE_INT32
        dims: [ -1, 3 ]
        optional: true
        allow_ragged_batch: true
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "context_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "generation_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  },
  {
    name: "batch_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "sequence_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]

parameters: {
  key: "accumulate_tokens"
  value: {
    string_value: "${accumulate_tokens}"
  }
}
parameters: {
  key: "tensorrt_llm_model_name"
  value: {
    string_value: "tensorrt_llm"
  }
}
parameters: {
  key: "tensorrt_llm_draft_model_name"
  value: {
    string_value: "${tensorrt_llm_draft_model_name}"
  }
}
parameters: {
  key: "multimodal_encoders_name"
  value: {
    string_value: "${multimodal_encoders_name}"
  }
}

instance_group [
  {
    count: 1
    kind : KIND_CPU
  }
]
  3. Populate test_bindings.py:
import tritonserver
import numpy as np

# Constructing path to Model Repository
model_path = f"/opt/tritonserver/model_repo"

server_options = tritonserver.Options(
    server_id="ExampleServer",
    model_repository=model_path,
    log_error=True,
    log_warn=True,
    log_info=True,
    )

server = tritonserver.Server(server_options).start(wait_until_ready=True)

model = server.model("ensemble")

responses = model.infer(inputs={"text_input": [["What is machine learning?"]], \
        "return_context_logits": [[False]], \
        "return_log_probs": [[False]], \
        "max_tokens": np.array([[100]], dtype=np.int32)})

for response in responses:
    print(response.outputs['text_output'].to_bytes_array()[0].decode("utf-8"))
  4. Testing with test_bindings.py produces good results:
python3 /opt/tritonserver/test_bindings.py
...
 Machine learning is a subset of artificial intelligence (AI) that involves training algorithms to learn from data and make predictions or decisions based on that data. Unlike traditional programming, which relies on explicit rules and instructions, machine learning algorithms can improve their performance over time by analyzing and adapting to new data.
Machine learning is a key technology behind many modern applications, including:
1. Image and speech recognition
2. Natural language processing (NLP)
3. Predictive maintenance
4. Recommendation systems
....
  5. Launch the OpenAI frontend with --enable-kserve-frontends (this used to throw an error but works now):
/opt/tritonserver/server/python/openai/openai_frontend/main.py --model-repository /opt/tritonserver/model_repo --tokenizer /opt/tritonserver/raw_models/Llama-3.1-8B-Instruct --openai-port 9000 --enable-kserve-frontends
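Once the frontend is up, a quick sanity check is to list the models it exposes (a sketch; assumes the standard OpenAI model-listing route is available):
# Hypothetical sanity check: list the models visible to the OpenAI-compatible frontend
curl -s http://localhost:9000/v1/models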
  6. Test the endpoint in a separate shell. The content in the response is always empty when querying the "ensemble" and "tensorrt_llm_bls" models:
curl \
  -w '\nTotal: %{time_total}s\n' \
  -s http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "ensemble", "messages": [{"role": "user", "content": "What is machine learning?"}]}'
{"id":"cmpl-0c45b290-b7d6-11ef-b4ec-5ecbd0277ad4","choices":[{"finish_reason":"stop","index":0,"message":{"content":"","tool_calls":null,"role":"assistant","function_call":null},"logprobs":null}],"created":1733931542,"model":"ensemble","system_fingerprint":null,"object":"chat.completion","usage":null}
Total: 0.372138s
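For completeness, the same request can also be issued with streaming enabled to check whether the SSE chunks carry empty deltas as well (a sketch; assumes the frontend honors the standard "stream" field):
# Hypothetical check: stream the same chat completion and inspect the delta chunks
curl -N \
  -s http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "ensemble", "stream": true, "messages": [{"role": "user", "content": "What is machine learning?"}]}'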
  7. Testing with the OpenAI client also produces an empty response:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9000/v1",
    api_key="EMPTY",
)

model = "ensemble"
completion = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {"role": "user", "content": "Why is the sky blue?"},
    ],
    stop = '<|eot_id|>',
)

print(completion.choices)
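The plain completions endpoint can be probed the same way (a sketch; assumes the frontend also exposes the standard /v1/completions route):
# Hypothetical check: legacy completions endpoint with the same model
curl -s http://localhost:9000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "ensemble", "prompt": "What is machine learning?", "max_tokens": 50}'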
  8. Testing the KServe frontend produces good results:
curl \
  -w '\nTotal: %{time_total}s\n' \
  -s http://localhost:8000/v2/models/ensemble/generate \
  -H 'Content-Type: application/json' \
  -d '{"text_input": "What is machine learning?", "max_tokens": 10, "bad_words": [],"stop_words": []}'

{"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" Machine learning is a subset of artificial intelligence (AI"}
Total: 0.071636s
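Since the tensorrt_llm model runs decoupled, the streaming counterpart of the generate endpoint can be exercised in the same way (a sketch; generate_stream is the SSE variant of the generate extension):
# Hypothetical check: streaming generate endpoint on the KServe frontend
curl -N \
  -s http://localhost:8000/v2/models/ensemble/generate_stream \
  -H 'Content-Type: application/json' \
  -d '{"text_input": "What is machine learning?", "max_tokens": 10, "bad_words": [], "stop_words": []}'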

@MatteoPagliani

I have faced the same problems described by @njaramish. @rmccorm4, do you have any guess as to what the root cause might be? Thanks for the support.

@njaramish
Author

@rmccorm4 I've updated my reprex above for the 24.12 container and TensorRT-LLM v0.16.0 (including config files) -- still running into the same issue with the OpenAI frontend, where the content field of the response is empty.

With the latest version of the container and frontend code, the KServe frontend now works. Please note the last step above (testing the KServe frontend), which is new: it shows the KServe frontend producing the correct response while the OpenAI frontend returns an empty response (same deployment).

Please let me know if there is anything else I can do to assist in debugging. Happy Holidays!
