
No content returned with OpenAI-Compatible Frontend Beta (ensemble & bls) #7868

Open
njaramish opened this issue Dec 11, 2024 · 4 comments

@njaramish

@rmccorm4 Thank you very much for your hard work on the recently released OpenAI-Compatible Frontend! I hope it's okay for me to tag you in this issue for visibility.

I am encountering an issue nearly identical to #7724 using the OpenAI-Compatible Frontend available today on main. Unfortunately, using a BLS model (which solved the original reporter's issue) still returns no content.

I've tried different versions of nvcr.io/nvidia/tritonserver:24.XX-trtllm-python-py3, as well as different versions of TensorRT-LLM (0.15.0 & 0.16.0.dev.20240603) and different models (Qwen & Llama).

In all cases, the OpenAI-Compatible Frontend returns no content, while generating text using Triton Server's Python bindings or a standalone Triton server launched with this script works as expected.

In the ensemble model specifically, it appears that the preprocessing model receives the correct prompt from the OpenAI-Compatible Frontend and passes the tokenized prompt to the tensorrt_llm model, but the tensorrt_llm model never passes any content to the postprocessing model.

Please let me know if any additional info would be helpful, or if there are any additional steps you'd like me to take to debug.

@rmccorm4
Contributor

Hi @njaramish, thanks for raising this issue.

Is this consistently reproducible with every request and every model, or is it intermittent depending on certain models, queries, etc.? Any details on the exact scenarios that cause this behavior would be great.

rmccorm4 self-assigned this Dec 11, 2024
@njaramish
Author

njaramish commented Dec 11, 2024

Hi @rmccorm4, please find a reprex below (apologies in advance for how long it is). This is consistently reproducible for me, for every request and every model/container combination I have tried -- I have yet to get a non-empty response from the OpenAI-Compatible Frontend.

  1. Set up container and build TRTLLM engine
docker run -it --net=host --gpus '"device=0"' --rm \
  -e TRTLLM_ORCHESTRATOR=1 \
  nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3

pip install /opt/tritonserver/python/triton*.whl
mv server server1
git clone https://github.com/triton-inference-server/server.git
cd server/python/openai/
pip install -r requirements.txt
cd ../../..
git clone --depth 1 --branch v0.16.0 https://github.com/NVIDIA/TensorRT-LLM.git
git clone --depth 1 --branch v0.16.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm/ model_repo/
mkdir raw_models
mkdir checkpoints
git config --global credential.helper store
huggingface-cli login
cd raw_models && git clone https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
cd ..
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir raw_models/Llama-3.1-8B-Instruct \
    --output_dir checkpoints/llama-3.1-8b-instruct-fp8-streaming \
    --dtype auto \
    --tp_size 1
trtllm-build \
    --checkpoint_dir checkpoints/llama-3.1-8b-instruct-fp8-streaming \
    --output_dir model_repo/tensorrt_llm/1 \
    --gemm_plugin auto \
    --workers 1 \
    --gpt_attention_plugin auto \
    --context_fmha enable \
    --remove_input_padding enable \
    --kv_cache_type paged \
    --use_paged_context_fmha enable \
    --max_beam_width 1 \
    --max_seq_len 32768 \
    --max_batch_size 8 \
    --gather_all_token_logits
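For reference, after the copy and build steps above the model repository should end up looking roughly like this (a sketch based on the default inflight_batcher_llm templates; the Python models each ship a 1/model.py next to their config.pbtxt):

model_repo/
├── ensemble/config.pbtxt
├── preprocessing/config.pbtxt       (+ 1/model.py)
├── postprocessing/config.pbtxt      (+ 1/model.py)
├── tensorrt_llm/config.pbtxt        (+ 1/ holding the engine from trtllm-build)
└── tensorrt_llm_bls/config.pbtxt    (+ 1/model.py)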
  2. Populate the config.pbtxt files in the model repo:
    /opt/tritonserver/model_repo/preprocessing/config.pbtxt:
name: "preprocessing"
backend: "python"
max_batch_size: 8
input [
    {
        name: "QUERY"
        data_type: TYPE_STRING
        dims: [ 1 ]
    },
    {
        name: "DECODER_QUERY"
        data_type: TYPE_STRING
        dims: [ 1 ]
        optional: true
    },
    {
        name: "IMAGE_BYTES"
        data_type: TYPE_UINT8
        dims: [ -1, -1, -1, -1 ]
        optional: true
    },
    {
        name: "IMAGE_URL"
        data_type: TYPE_STRING
        dims: [ 1 ]
        optional: true
    },
    {
        name: "VIDEO_BYTES"
        data_type: TYPE_UINT8
        dims: [ -1, -1, -1, -1 ]
        optional: true
    },
    {
        name: "REQUEST_OUTPUT_LEN"
        data_type: TYPE_INT32
        dims: [ 1 ]
    },
    {
        name: "BAD_WORDS_DICT"
        data_type: TYPE_STRING
        dims: [ -1 ]
        optional: true
    },
    {
        name: "STOP_WORDS_DICT"
        data_type: TYPE_STRING
        dims: [ -1 ]
        optional: true
    },
    {
        name: "EMBEDDING_BIAS_WORDS"
        data_type: TYPE_STRING
        dims: [ -1 ]
        optional: true
    },
    {
        name: "EMBEDDING_BIAS_WEIGHTS"
        data_type: TYPE_FP32
        dims: [ -1 ]
        optional: true
    },
    {
        name: "END_ID"
        data_type: TYPE_INT32
        dims: [ 1 ]
        optional: true
    },
    {
        name: "PAD_ID"
        data_type: TYPE_INT32
        dims: [ 1 ]
        optional: true
    },
    {
        name: "PROMPT_TABLE_EXTRA_ID"
        data_type: TYPE_UINT64
        dims: [ 1 ]
        optional: true
    }
]
output [
    {
        name: "INPUT_ID"
        data_type: TYPE_INT32
        dims: [ -1 ]
    },
    {
        name: "REQUEST_INPUT_LEN"
        data_type: TYPE_INT32
        dims: [ 1 ]
    },
    {
        name: "DECODER_INPUT_ID"
        data_type: TYPE_INT32
        dims: [ -1 ]
    },
    {
        name: "REQUEST_DECODER_INPUT_LEN"
        data_type: TYPE_INT32
        dims: [ 1 ]
    },
    {
        name: "BAD_WORDS_IDS"
        data_type: TYPE_INT32
        dims: [ 2, -1 ]
    },
    {
        name: "STOP_WORDS_IDS"
        data_type: TYPE_INT32
        dims: [ 2, -1 ]
    },
    {
        name: "EMBEDDING_BIAS"
        data_type: TYPE_FP32
        dims: [ -1 ]
    },
    {
        name: "REQUEST_OUTPUT_LEN"
        data_type: TYPE_INT32
        dims: [ -1 ]
    },
    {
        name: "OUT_END_ID"
        data_type: TYPE_INT32
        dims: [ 1 ]
    },
    {
        name: "OUT_PAD_ID"
        data_type: TYPE_INT32
        dims: [ 1 ]
    },
    {
        name: "OUT_PROMPT_TABLE_EXTRA_IDS"
        data_type: TYPE_UINT64
        dims: [ -1 ]
    },
    {
        name: "PIXEL_VALUES"
        data_type: TYPE_FP16
        dims: [ -1, -1, -1, -1 ]
    },
    {
        name: "ASPECT_RATIO_IDS"
        data_type: TYPE_INT64
        dims: [ -1 ]
    },
    {
        name: "ASPECT_RATIO_MASK"
        data_type: TYPE_INT64
        dims: [ -1, -1 ]
    },
    {
        name: "CROSS_ATTENTION_MASK"
        data_type: TYPE_INT64
        dims: [ -1, -1, -1 ]
    },
    # Required for image postprocessing in the llava_onevision model
    {
        name: "IMAGE_SIZES"
        data_type: TYPE_INT64
        dims: [ 2 ]
    },
    # Indicates if the input is video in the llava_onevision model
    {
        name: "IS_VIDEO_INPUT"
        data_type: TYPE_BOOL
        dims: [ 1 ]
    }
]
parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "/opt/tritonserver/raw_models/Llama-3.1-8B-Instruct"
  }
}

parameters {
  key: "add_special_tokens"
  value: {
    string_value: "True"
  }
}

parameters {
  key: "visual_model_path"
  value: {
    string_value: "${visual_model_path}"
  }
}

parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/opt/tritonserver/model_repo/tensorrt_llm/1"
  }
}

instance_group [
    {
        count: 1
        kind: KIND_CPU
    }
]

/opt/tritonserver/model_repo/postprocessing/config.pbtxt:

name: "postprocessing"
backend: "python"
max_batch_size: 8
dynamic_batching {}
input [
  {
    name: "TOKENS_BATCH"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "SEQUENCE_LENGTH"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "/opt/tritonserver/raw_models/Llama-3.1-8B-Instruct"
  }
}

parameters {
  key: "skip_special_tokens"
  value: {
    string_value: "True"
  }
}

instance_group [
    {
        count: 1
        kind: KIND_CPU
    }
]

/opt/tritonserver/model_repo/tensorrt_llm/config.pbtxt:

name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 8

model_transaction_policy {
  decoupled: True
}

dynamic_batching {
    preferred_batch_size: [ 8 ]
    max_queue_delay_microseconds: 100
    default_queue_policy: { max_queue_size: 1000 }
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    allow_ragged_batch: true
    optional: true
  },
  {
    name: "encoder_input_features"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    allow_ragged_batch: true
    optional: true
  },
  {
    name: "encoder_output_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "num_return_sequences"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "draft_input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "decoder_input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "decoder_input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
    reshape: { shape: [ ] }
  },
  {
    name: "draft_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "draft_acceptance_threshold"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "embedding_bias"
    data_type: TYPE_FP32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p_min"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p_decay"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p_reset_ids"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "early_stopping"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_search_diversity_rate"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "frequency_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_context_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_generation_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "exclude_input_in_output"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "streaming"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "prompt_embedding_table"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "prompt_table_extra_ids"
    data_type: TYPE_UINT64
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "prompt_vocab_size"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  # cross_attention_mask shape `[bs, seq_len, num_images*num_tiles]`
  {
    name: "cross_attention_mask"
    data_type: TYPE_BOOL
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  # the unique task ID for the given LoRA.
  # To perform inference with a specific LoRA for the first time `lora_task_id` `lora_weights` and `lora_config` must all be given.
  # The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`.
  # If the cache is full the oldest LoRA will be evicted to make space for new ones.  An error is returned if `lora_task_id` is not cached.
  {
    name: "lora_task_id"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  # weights for a lora adapter shape [ num_lora_modules_layers, D x Hi + Ho x D ]
  # where the last dimension holds the in / out adapter weights for the associated module (e.g. attn_qkv) and model layer
  # each of the in / out tensors are first flattened and then concatenated together in the format above.
  # D=adapter_size (R value), Hi=hidden_size_in, Ho=hidden_size_out.
  {
    name: "lora_weights"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  # module identifier (same size a first dimension of lora_weights)
  # See LoraModule::ModuleType for model id mapping
  #
  # "attn_qkv": 0     # compbined qkv adapter
  # "attn_q": 1       # q adapter
  # "attn_k": 2       # k adapter
  # "attn_v": 3       # v adapter
  # "attn_dense": 4   # adapter for the dense layer in attention
  # "mlp_h_to_4h": 5  # for llama2 adapter for gated mlp layer after attention / RMSNorm: up projection
  # "mlp_4h_to_h": 6  # for llama2 adapter for gated mlp layer after attention / RMSNorm: down projection
  # "mlp_gate": 7     # for llama2 adapter for gated mlp later after attention / RMSNorm: gate
  #
  # last dim holds [ module_id, layer_idx, adapter_size (D aka R value) ]
  {
    name: "lora_config"
    data_type: TYPE_INT32
    dims: [ -1, 3 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "context_phase_params"
    data_type: TYPE_UINT8
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  # skip_cross_attn_blocks shape `[bs, 1]`, only used in mllama
  {
    name: "skip_cross_attn_blocks"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
    allow_ragged_batch: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "context_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "generation_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  },
  {
    name: "batch_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "sequence_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "context_phase_params"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  },
  {
    name: "kv_cache_alloc_new_blocks"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "kv_cache_reused_blocks"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "kv_cache_alloc_total_blocks"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
instance_group [
  {
    count: 1
    kind : KIND_CPU
  }
]
parameters: {
  key: "max_beam_width"
  value: {
    string_value: "1"
  }
}
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "no"
  }
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/opt/tritonserver/model_repo/tensorrt_llm/1"
  }
}
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "unspecified"
  }
}
parameters: {
  key: "max_attention_window_size"
  value: {
    string_value: "max_sequence_length"
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "guarantee_no_evict"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.9"
  }
}
parameters: {
  key: "cross_kv_cache_fraction"
  value: {
    string_value: "${cross_kv_cache_fraction}"
  }
}
parameters: {
  key: "kv_cache_host_memory_bytes"
  value: {
    string_value: "0"
  }
}
# kv_cache_onboard_blocks is for internal implementation.
parameters: {
  key: "kv_cache_onboard_blocks"
  value: {
    string_value: "True"
  }
}
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "normalize_log_probs"
  value: {
    string_value: "false"
  }
}
parameters: {
  key: "enable_chunked_context"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "gpu_device_ids"
  value: {
    string_value: "0"
  }
}
parameters: {
  key: "decoding_mode"
  value: {
    string_value: "top_k"
  }
}
parameters: {
  key: "executor_worker_path"
  value: {
    string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker"
  }
}
parameters: {
  key: "gpu_weights_percent"
    value: {
      string_value: "1.0"
  }
}

/opt/tritonserver/model_repo/ensemble/config.pbtxt:

name: "ensemble"
platform: "ensemble"
max_batch_size: 8
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "decoder_text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
    optional: true
  },
  {
    name: "max_tokens"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "num_return_sequences"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
   name: "bad_words"
   data_type: TYPE_STRING
   dims: [ -1 ]
   optional: true
  },
  {
   name: "stop_words"
   data_type: TYPE_STRING
   dims: [ -1 ]
   optional: true
  },
  {
    name: "exclude_input_in_output"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_k"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "length_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "frequency_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    optional: true
  },
  {
    name: "return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "return_context_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "return_generation_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "stream"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "prompt_embedding_table"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
  },
  {
    name: "prompt_table_extra_id"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    optional: true
  },
  {
    name: "prompt_vocab_size"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "embedding_bias_words"
    data_type: TYPE_STRING
    dims: [ -1 ]
    optional: true
  },
  {
    name: "embedding_bias_weights"
    data_type: TYPE_FP32
    dims: [ -1 ]
    optional: true
  },
  # the unique task ID for the given LoRA.
  # To perform inference with a specific LoRA for the first time `lora_task_id` `lora_weights` and `lora_config` must all be given.
  # The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`.
  # If the cache is full the oldest LoRA will be evicted to make space for new ones.  An error is returned if `lora_task_id` is not cached.
  {
    name: "lora_task_id"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    optional: true
  },
  # weights for a lora adapter shape [ num_lora_modules_layers, D x Hi + Ho x D ]
  # where the last dimension holds the in / out adapter weights for the associated module (e.g. attn_qkv) and model layer
  # each of the in / out tensors are first flattened and then concatenated together in the format above.
  # D=adapter_size (R value), Hi=hidden_size_in, Ho=hidden_size_out.
  {
    name: "lora_weights"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  # module identifier (same size a first dimension of lora_weights)
  # See LoraModule::ModuleType for model id mapping
  #
  # "attn_qkv": 0     # compbined qkv adapter
  # "attn_q": 1       # q adapter
  # "attn_k": 2       # k adapter
  # "attn_v": 3       # v adapter
  # "attn_dense": 4   # adapter for the dense layer in attention
  # "mlp_h_to_4h": 5  # for llama2 adapter for gated mlp layer after attention / RMSNorm: up projection
  # "mlp_4h_to_h": 6  # for llama2 adapter for gated mlp layer after attention / RMSNorm: down projection
  # "mlp_gate": 7     # for llama2 adapter for gated mlp later after attention / RMSNorm: gate
  #
  # last dim holds [ module_id, layer_idx, adapter_size (D aka R value) ]
  {
    name: "lora_config"
    data_type: TYPE_INT32
    dims: [ -1, 3 ]
    optional: true
    allow_ragged_batch: true
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "context_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "generation_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  },
  {
    name: "batch_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "sequence_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map {
        key: "QUERY"
        value: "text_input"
      }
      input_map {
        key: "DECODER_QUERY"
        value: "decoder_text_input"
      }
      input_map {
        key: "REQUEST_OUTPUT_LEN"
        value: "max_tokens"
      }
      input_map {
        key: "BAD_WORDS_DICT"
        value: "bad_words"
      }
      input_map {
        key: "STOP_WORDS_DICT"
        value: "stop_words"
      }
      input_map {
        key: "EMBEDDING_BIAS_WORDS"
        value: "embedding_bias_words"
      }
      input_map {
        key: "EMBEDDING_BIAS_WEIGHTS"
        value: "embedding_bias_weights"
      }
      input_map {
        key: "END_ID"
        value: "end_id"
      }
      input_map {
        key: "PAD_ID"
        value: "pad_id"
      }
      input_map {
        key: "PROMPT_TABLE_EXTRA_ID"
        value: "prompt_table_extra_id"
      }
      output_map {
        key: "REQUEST_INPUT_LEN"
        value: "_REQUEST_INPUT_LEN"
      }
      output_map {
        key: "INPUT_ID"
        value: "_INPUT_ID"
      }
      output_map {
        key: "REQUEST_DECODER_INPUT_LEN"
        value: "_REQUEST_DECODER_INPUT_LEN"
      }
      output_map {
        key: "DECODER_INPUT_ID"
        value: "_DECODER_INPUT_ID"
      }
      output_map {
        key: "REQUEST_OUTPUT_LEN"
        value: "_REQUEST_OUTPUT_LEN"
      }
      output_map {
        key: "STOP_WORDS_IDS"
        value: "_STOP_WORDS_IDS"
      }
      output_map {
        key: "BAD_WORDS_IDS"
        value: "_BAD_WORDS_IDS"
      }
      output_map {
        key: "EMBEDDING_BIAS"
        value: "_EMBEDDING_BIAS"
      }
      output_map {
        key: "OUT_END_ID"
        value: "_PREPROCESSOR_END_ID"
      }
      output_map {
        key: "OUT_PAD_ID"
        value: "_PREPROCESSOR_PAD_ID"
      }
      output_map {
        key: "OUT_PROMPT_TABLE_EXTRA_IDS"
        value: "_OUT_PROMPT_TABLE_EXTRA_IDS"
      }
    },
    {
      model_name: "tensorrt_llm"
      model_version: -1
      input_map {
        key: "input_ids"
        value: "_INPUT_ID"
      }
      input_map {
        key: "decoder_input_ids"
        value: "_DECODER_INPUT_ID"
      }
      input_map {
        key: "input_lengths"
        value: "_REQUEST_INPUT_LEN"
      }
      input_map {
        key: "decoder_input_lengths"
        value: "_REQUEST_DECODER_INPUT_LEN"
      }
      input_map {
        key: "exclude_input_in_output"
        value: "exclude_input_in_output"
      }
      input_map {
        key: "request_output_len"
        value: "_REQUEST_OUTPUT_LEN"
      }
      input_map {
          key: "end_id"
          value: "_PREPROCESSOR_END_ID"
      }
      input_map {
          key: "pad_id"
          value: "_PREPROCESSOR_PAD_ID"
      }
      input_map {
          key: "embedding_bias"
          value: "_EMBEDDING_BIAS"
      }
      input_map {
          key: "runtime_top_k"
          value: "top_k"
      }
      input_map {
          key: "runtime_top_p"
          value: "top_p"
      }
      input_map {
          key: "temperature"
          value: "temperature"
      }
      input_map {
          key: "len_penalty"
          value: "length_penalty"
      }
      input_map {
          key: "repetition_penalty"
          value: "repetition_penalty"
      }
      input_map {
          key: "min_length"
          value: "min_length"
      }
      input_map {
          key: "presence_penalty"
          value: "presence_penalty"
      }
      input_map {
          key: "frequency_penalty"
          value: "frequency_penalty"
      }
      input_map {
          key: "random_seed"
          value: "random_seed"
      }
      input_map {
          key: "return_log_probs"
          value: "return_log_probs"
      }
      input_map {
          key: "return_context_logits"
          value: "return_context_logits"
      }
      input_map {
          key: "return_generation_logits"
          value: "return_generation_logits"
      }
      input_map {
          key: "num_return_sequences"
          value: "num_return_sequences"
      }
      input_map {
          key: "beam_width"
          value: "beam_width"
      }
      input_map {
          key: "streaming"
          value: "stream"
      }
      input_map {
        key: "prompt_embedding_table"
        value: "prompt_embedding_table"
      }
      input_map {
        key: "prompt_vocab_size"
        value: "prompt_vocab_size"
      }
      input_map {
        key: "stop_words_list"
        value: "_STOP_WORDS_IDS"
      }
      input_map {
        key: "bad_words_list"
        value: "_BAD_WORDS_IDS"
      }
      input_map {
        key: "prompt_table_extra_ids"
        value: "_OUT_PROMPT_TABLE_EXTRA_IDS"
      },
      input_map {
        key: "lora_task_id",
        value: "lora_task_id"
      },
      input_map {
        key: "lora_weights",
        value: "lora_weights"
      },
      input_map {
        key: "lora_config",
        value: "lora_config"
      },
      output_map {
        key: "output_ids"
        value: "_TOKENS_BATCH"
      }
      output_map {
        key: "sequence_length"
        value: "_SEQUENCE_LENGTH"
      },
      output_map {
        key: "cum_log_probs"
        value: "cum_log_probs"
      }
      output_map {
        key: "output_log_probs"
        value: "output_log_probs"
      },
      output_map {
        key: "context_logits"
        value: "context_logits"
      },
      output_map {
        key: "generation_logits"
        value: "generation_logits"
      },
      output_map {
        key: "batch_index"
        value: "batch_index"
      },
      output_map {
        key: "sequence_index"
        value: "sequence_index"
      }
    },
    {
      model_name: "postprocessing"
      model_version: -1
      input_map {
        key: "TOKENS_BATCH"
        value: "_TOKENS_BATCH"
      }
      input_map {
        key: "SEQUENCE_LENGTH"
        value: "_SEQUENCE_LENGTH"
      }
      output_map {
        key: "OUTPUT"
        value: "text_output"
      }
    }
  ]
}

/opt/tritonserver/model_repo/tensorrt_llm_bls/config.pbtxt:

name: "tensorrt_llm_bls"
backend: "python"
max_batch_size: 8

model_transaction_policy {
  decoupled: True
}

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "decoder_text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
    optional: true
  },
  {
    name: "image_input"
    data_type: TYPE_FP16
    dims: [ -1, 3, -1, -1 ]
    optional: true
  },
  {
    name: "max_tokens"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
   name: "bad_words"
   data_type: TYPE_STRING
   dims: [ -1 ]
   optional: true
  },
  {
   name: "stop_words"
   data_type: TYPE_STRING
   dims: [ -1 ]
   optional: true
  },
  {
    name: "exclude_input_in_output"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_k"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "length_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "frequency_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    optional: true
  },
  {
    name: "return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_context_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_generation_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "num_return_sequences"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "stream"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "prompt_embedding_table"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
  },
  {
    name: "prompt_vocab_size"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "prompt_table_extra_id"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    optional: true
  },
  {
      name: "embedding_bias_words"
      data_type: TYPE_STRING
      dims: [ -1 ]
      optional: true
  },
  {
      name: "embedding_bias_weights"
      data_type: TYPE_FP32
      dims: [ -1 ]
      optional: true
  },
  {
      name: "num_draft_tokens",
      data_type: TYPE_INT32,
      dims: [ 1 ]
      optional: true
  },
  {
      name: "use_draft_logits",
      data_type: TYPE_BOOL,
      dims: [ 1 ]
      reshape: { shape: [ ] }
      optional: true
  },
  # the unique task ID for the given LoRA.
  # To perform inference with a specific LoRA for the first time `lora_task_id` `lora_weights` and `lora_config` must all be given.
  # The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`.
  # If the cache is full the oldest LoRA will be evicted to make space for new ones.  An error is returned if `lora_task_id` is not cached.
  {
    name: "lora_task_id"
        data_type: TYPE_UINT64
        dims: [ 1 ]
    reshape: { shape: [ ] }
        optional: true
  },
  # weights for a lora adapter shape [ num_lora_modules_layers, D x Hi + Ho x D ]
  # where the last dimension holds the in / out adapter weights for the associated module (e.g. attn_qkv) and model layer
  # each of the in / out tensors are first flattened and then concatenated together in the format above.
  # D=adapter_size (R value), Hi=hidden_size_in, Ho=hidden_size_out.
  {
    name: "lora_weights"
        data_type: TYPE_FP16
        dims: [ -1, -1 ]
        optional: true
        allow_ragged_batch: true
  },
  # module identifier (same size a first dimension of lora_weights)
  # See LoraModule::ModuleType for model id mapping
  #
  # "attn_qkv": 0     # compbined qkv adapter
  # "attn_q": 1       # q adapter
  # "attn_k": 2       # k adapter
  # "attn_v": 3       # v adapter
  # "attn_dense": 4   # adapter for the dense layer in attention
  # "mlp_h_to_4h": 5  # for llama2 adapter for gated mlp layer after attention / RMSNorm: up projection
  # "mlp_4h_to_h": 6  # for llama2 adapter for gated mlp layer after attention / RMSNorm: down projection
  # "mlp_gate": 7     # for llama2 adapter for gated mlp later after attention / RMSNorm: gate
  #
  # last dim holds [ module_id, layer_idx, adapter_size (D aka R value) ]
  {
    name: "lora_config"
        data_type: TYPE_INT32
        dims: [ -1, 3 ]
        optional: true
        allow_ragged_batch: true
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "context_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "generation_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  },
  {
    name: "batch_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "sequence_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]

parameters: {
  key: "accumulate_tokens"
  value: {
    string_value: "${accumulate_tokens}"
  }
}
parameters: {
  key: "tensorrt_llm_model_name"
  value: {
    string_value: "tensorrt_llm"
  }
}
parameters: {
  key: "tensorrt_llm_draft_model_name"
  value: {
    string_value: "${tensorrt_llm_draft_model_name}"
  }
}
parameters: {
  key: "multimodal_encoders_name"
  value: {
    string_value: "${multimodal_encoders_name}"
  }
}

instance_group [
  {
    count: 1
    kind : KIND_CPU
  }
]
  3. Populate test_bindings.py:
import tritonserver
import numpy as np

# Constructing path to Model Repository
model_path = f"/opt/tritonserver/model_repo"

server_options = tritonserver.Options(
    server_id="ExampleServer",
    model_repository=model_path,
    log_error=True,
    log_warn=True,
    log_info=True,
    )

server = tritonserver.Server(server_options).start(wait_until_ready=True)

model = server.model("ensemble")

responses = model.infer(inputs={"text_input": [["What is machine learning?"]], \
        "return_context_logits": [[False]], \
        "return_log_probs": [[False]], \
        "max_tokens": np.array([[100]], dtype=np.int32)})

for response in responses:
    print(response.outputs['text_output'].to_bytes_array()[0].decode("utf-8"))
  4. Testing with test_bindings.py produces good results:
python3 /opt/tritonserver/test_bindings.py
...
 Machine learning is a subset of artificial intelligence (AI) that involves training algorithms to learn from data and make predictions or decisions based on that data. Unlike traditional programming, which relies on explicit rules and instructions, machine learning algorithms can improve their performance over time by analyzing and adapting to new data.
Machine learning is a key technology behind many modern applications, including:
1. Image and speech recognition
2. Natural language processing (NLP)
3. Predictive maintenance
4. Recommendation systems
....
  5. Launch the OpenAI frontend with --enable-kserve-frontends (this used to throw an error but works now):
/opt/tritonserver/server/python/openai/openai_frontend/main.py --model-repository /opt/tritonserver/model_repo --tokenizer /opt/tritonserver/raw_models/Llama-3.1-8B-Instruct --openai-port 9000 --enable-kserve-frontends
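Once the frontend is up, a quick sanity check is to list the models it exposes (a sketch; assumes the standard OpenAI model-listing route is available):
# Hypothetical sanity check: list the models visible to the OpenAI-compatible frontend
curl -s http://localhost:9000/v1/models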
  6. Test the endpoint in a separate shell. The content in the response is always empty when querying the "ensemble" and "tensorrt_llm_bls" models:
curl \
  -w '\nTotal: %{time_total}s\n' \
  -s http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "ensemble", "messages": [{"role": "user", "content": "What is machine learning?"}]}'
{"id":"cmpl-0c45b290-b7d6-11ef-b4ec-5ecbd0277ad4","choices":[{"finish_reason":"stop","index":0,"message":{"content":"","tool_calls":null,"role":"assistant","function_call":null},"logprobs":null}],"created":1733931542,"model":"ensemble","system_fingerprint":null,"object":"chat.completion","usage":null}
Total: 0.372138s
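For completeness, the same request can also be issued with streaming enabled to check whether the SSE chunks carry empty deltas as well (a sketch; assumes the frontend honors the standard "stream" field):
# Hypothetical check: stream the same chat completion and inspect the delta chunks
curl -N \
  -s http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "ensemble", "stream": true, "messages": [{"role": "user", "content": "What is machine learning?"}]}'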
  7. Testing with the OpenAI client also produces an empty response:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9000/v1",
    api_key="EMPTY",
)

model = "ensemble"
completion = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {"role": "user", "content": "Why is the sky blue?"},
    ],
    stop = '<|eot_id|>',
)

print(completion.choices)
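The plain completions endpoint can be probed the same way (a sketch; assumes the frontend also exposes the standard /v1/completions route):
# Hypothetical check: legacy completions endpoint with the same model
curl -s http://localhost:9000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "ensemble", "prompt": "What is machine learning?", "max_tokens": 50}'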
  8. Testing the KServe frontend produces good results:
curl \
  -w '\nTotal: %{time_total}s\n' \
  -s http://localhost:8000/v2/models/ensemble/generate \
  -H 'Content-Type: application/json' \
  -d '{"text_input": "What is machine learning?", "max_tokens": 10, "bad_words": [],"stop_words": []}'

{"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" Machine learning is a subset of artificial intelligence (AI"}
Total: 0.071636s
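Since the tensorrt_llm model runs decoupled, the streaming counterpart of the generate endpoint can be exercised in the same way (a sketch; generate_stream is the SSE variant of the generate extension):
# Hypothetical check: streaming generate endpoint on the KServe frontend
curl -N \
  -s http://localhost:8000/v2/models/ensemble/generate_stream \
  -H 'Content-Type: application/json' \
  -d '{"text_input": "What is machine learning?", "max_tokens": 10, "bad_words": [], "stop_words": []}'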

@MatteoPagliani

I have faced the same problems described by @njaramish. @rmccorm4, do you have any guess as to what the root cause might be? Thanks for the support.

@njaramish
Author

@rmccorm4 I've updated my reprex above for the 24.12 container and TensorRT-LLM v0.16.0 (including config files) -- still running into the same issue with the OpenAI frontend, where the content field of the response is empty.

With the latest version of the container and frontend code, the KServe frontend now works. Please note the last step above (testing the KServe frontend), which is new: it shows the KServe frontend producing the correct response while the OpenAI frontend returns an empty response (same deployment).

Please let me know if there is anything else I can do to assist in debugging. Happy Holidays!
