langchain_nvidia_trt not working #108

rbgo404 · 2024-04-19T10:23:53Z

I have gone through the notebooks but was not able to stream tokens from TensorRT-LLM.
Here's the issue:

Code used:

from langchain_nvidia_trt.llms import TritonTensorRTLLM
import time
import random

triton_url = "localhost:8001"
pload = {
    'tokens': 300,
    'server_url': triton_url,
    'model_name': "ensemble",
    'temperature': 1.0,
    'top_k': 1,
    'top_p': 0,
    'beam_width': 1,
    'repetition_penalty': 1.0,
    'length_penalty': 1.0,
}
client = TritonTensorRTLLM(**pload)

LLAMA_PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>"
    "{system_prompt}"
    "<</SYS>>"
    "[/INST] {context} </s><s>[INST] {question} [/INST]"
)
system_prompt = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Please ensure that your responses are positive in nature."
context=""
question='What is the fastest land animal?'
prompt = LLAMA_PROMPT_TEMPLATE.format(system_prompt=system_prompt, context=context, question=question)

start_time = time.time()
tokens_generated = 0

for val in client._stream(prompt):
    tokens_generated += 1
    print(val, end="", flush=True)

total_time = time.time() - start_time
print(f"\n--- Generated {tokens_generated} tokens in {total_time} seconds ---")
print(f"--- {tokens_generated/total_time} tokens/sec")
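
To separate problems in the timing loop from problems on the server side, the loop above can be exercised without a running Triton server. The following is only a minimal sketch under stated assumptions: `measure_stream` and `fake_stream` are hypothetical helpers (not part of `langchain_nvidia_trt`), and the handling of chunks that expose a `.text` attribute is an assumption about what the streaming call yields.

```python
import time
from typing import Iterable, Tuple


def measure_stream(chunks: Iterable[object]) -> Tuple[int, float, str]:
    """Consume a stream of chunks, echo each one, and return
    (chunk_count, elapsed_seconds, full_text).

    Hypothetical helper for illustration only.
    """
    start = time.perf_counter()
    count = 0
    parts = []
    for chunk in chunks:
        # Assumption: a chunk is either a plain string or an object
        # carrying its text in a `.text` attribute.
        text = str(getattr(chunk, "text", chunk))
        print(text, end="", flush=True)
        parts.append(text)
        count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed, "".join(parts)


def fake_stream():
    """Stand-in for a real model stream, so the loop runs offline."""
    for piece in ["The ", "cheetah ", "is ", "fastest."]:
        yield piece


if __name__ == "__main__":
    count, elapsed, _ = measure_stream(fake_stream())
    print(f"\n--- Generated {count} chunks in {elapsed:.6f} seconds ---")
    if elapsed > 0:
        print(f"--- {count / elapsed:.1f} chunks/sec")
```

If this runs cleanly with the fake stream, the same helper could in principle be handed the iterator from the client instead (for example `measure_stream(client._stream(prompt))`), which would point any remaining failure at the server or its configuration rather than at the loop. Note that it counts chunks, which may not correspond one-to-one with tokens.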

rbgo404 · 2024-04-19T10:24:47Z

Please share the configuration needed on the TensorRT-LLM end. What parameter modifications are required in the model's config.pbtxt?

shubhadeepd · 2024-04-22T13:15:27Z

Hey @rbgo404,
You can deploy the TensorRT-based LLM model by following the steps here:
https://nvidia.github.io/GenerativeAIExamples/latest/local-gpu.html#using-local-gpus-for-a-q-a-chatbot

This notebook interacts with the model deployed behind the llm-inference-server container, which should start up if you follow the steps above.

Let me know if you have any questions once you go through these steps!

ChiBerkeley · 2024-05-01T06:11:03Z

Hi, I followed the instructions but still have a problem starting llm-inference-server. I'm currently using a Tesla M60 and llama-2-13b-chat.

shubhadeepd self-assigned this Apr 22, 2024

shubhadeepd added the "question" label (Further information is requested) Apr 22, 2024
