Apparent problem with Llama3 chat template #330

Open

Lue-C opened this issue Sep 23, 2024 · 0 comments

Comments
Lue-C commented Sep 23, 2024

Hi,

I am sending requests to a llama-server using the OpenAI API. I also wrote the code in PyTorch without a server to compare the results. I noticed that in the first case the text generation does not stop after giving an answer and keeps telling me about climate change. When running the corresponding PyTorch code, the generation stops appropriately and the quality of the answer is much better.
This is the behaviour I would expect if there were an issue with the chat template, but I am using the exact same format I found in the examples. This is the code in PyTorch:

# tokenizer and model are loaded elsewhere (transformers AutoTokenizer / AutoModelForCausalLM)
from threading import Thread

from transformers import TextIteratorStreamer


def respond() -> str:

    user_prompt = ""

    messages = [{"role": "system", "content": ""},
                {"role": "user", "content": user_prompt}]

    # Build the prompt string from the model's chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # terminators was defined outside this snippet; the usual Llama 3 stop tokens are assumed here
    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    generate_kwargs = dict(
        input_ids=model_inputs.input_ids,
        streamer=streamer,
        max_new_tokens=1024,
        # do_sample=True,
        temperature=0.01,
        eos_token_id=terminators,
    )

    # Run generation in a background thread so the streamer can be consumed here
    thread = Thread(target=model.generate, kwargs=generate_kwargs)
    thread.start()

    generated_text = ""
    for chunk in streamer:
        generated_text += chunk
        print(chunk, end='', flush=True)

    return generated_text
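For reference, a rough sketch of what apply_chat_template should produce here, assuming the stock Llama 3 instruct template (the placeholder contents are mine):

print(text)
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
#
# <system content><|eot_id|><|start_header_id|>user<|end_header_id|>
#
# <user content><|eot_id|><|start_header_id|>assistant<|end_header_id|>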

And this is the code using the OpenAI API:

import openai


def stream_document():
    # OpenAI-compatible client pointed at the llama-server
    client = openai.OpenAI(
        base_url="",  # API server URL
        api_key="sk-no-key-required"
    )
    user_prompt = ""
    messages = [{"role": "system", "content": ""},
                {"role": "user", "content": user_prompt}]

    response = client.chat.completions.create(
        # model="gpt-3.5-turbo",
        model="Llama3",
        messages=messages,
        stream=True,  # Enable streaming
        temperature=0.01,
        max_completion_tokens=1024
    )
    # Process each chunk of data as it comes in
    for chunk in response:
        # Each chunk carries the newly generated delta content
        for choice in chunk.choices:
            if choice.delta and choice.delta.content:
                print(choice.delta.content, end='', flush=True)  # Print content without newline
    print("\nStream finished.")

Is there some way to pass special tokens or specify the chat template through the client?
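For example, I am not sure whether something along these lines would even be honoured by llama-server (the stop parameter is part of the OpenAI API, but the Llama 3 stop strings here are just my guess):

# Sketch: passing explicit stop sequences through the OpenAI client.
# Whether llama-server applies these for Llama 3 is exactly what I am unsure about.
response = client.chat.completions.create(
    model="Llama3",
    messages=messages,
    stream=True,
    temperature=0.01,
    max_completion_tokens=1024,
    stop=["<|eot_id|>", "<|end_of_text|>"]  # Llama 3 end-of-turn / end-of-text markers
)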

Regards
