Possible Error Leading to Inaccurate Evaluation Outcomes #54

Closed
NPap0 opened this issue Dec 1, 2023 · 4 comments

NPap0 commented Dec 1, 2023

Hey there, I'm no expert so maybe I'm mistaken (so please verify twice :P), but I think the way you evaluate OpenAI models is not the way you intended. I tried to replicate the evaluation and came across this issue:
This is the prompt
https://github.com/defog-ai/sql-eval/blob/main/prompts/prompt_openai.md

But
https://github.com/defog-ai/sql-eval/blob/024ff013d02d5fac248fb56b99279cdb16d70aa0/query_generators/openai.py#L123C4-L123C4

When changing the prompt for each question, you call .format on the user prompt text only, leaving the assistant prompt untouched. Then you call the generate function, but your assistant_prompt part is:

````
Given the database schema, here is the SQL query that answers `{user_question}`:
```sql
````

without any user_question parameter filled in. That means the literal text above, `{user_question}` placeholder included, is exactly what the model receives as input, which changes the input and can therefore change the results, outcomes, and insights.
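
Concretely, here's a minimal sketch of what I think happens; the prompt strings and variable names are illustrative, not the exact ones in query_generators/openai.py:

```python
# Sketch of the behaviour described above (illustrative names, not the repo's exact code).
user_prompt = (
    "Generate a SQL query that answers `{user_question}`.\n"
    "Schema:\n{table_metadata_string}"
)
assistant_prompt = (
    "Given the database schema, here is the SQL query that answers "
    "`{user_question}`:\n```sql\n"
)

question = "How many users signed up in 2023?"
metadata = "CREATE TABLE users (id int, created_at date);"

# Only the user prompt is formatted with the actual question...
user_prompt = user_prompt.format(
    user_question=question, table_metadata_string=metadata
)

# ...so the assistant prompt still contains the literal `{user_question}` placeholder
# when the messages are sent to the ChatCompletion endpoint.
messages = [
    {"role": "user", "content": user_prompt},
    {"role": "assistant", "content": assistant_prompt},
]
print(messages[1]["content"])
# -> Given the database schema, here is the SQL query that answers `{user_question}`:
#    ```sql
```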

@rishsriv @wongjingping

rishsriv (Member) commented Dec 1, 2023

The question is included in the user prompt. Please let us know if you’re able to get better results with other prompts.

rishsriv closed this as completed Dec 1, 2023
NPap0 (Author) commented Dec 1, 2023

> The question is included in the user prompt. Please let us know if you’re able to get better results with other prompts.

I'm aware; I just thought your intention was to include the start of the assistant's response, with the question repeated, so that the LLM continues predicting from there. At least, that is how I understood the code. Is my understanding incorrect?

rishsriv (Member) commented Dec 1, 2023

That's a fair hypothesis :) In practice, we found that the assistant prompt matters very little and is generally ignored by OpenAI's ChatCompletion endpoint. We included it to maintain comparability with the other prompts.

If we add the line `assistant_prompt = assistant_prompt.format(user_question=question)`, the results are the same: gpt-4-turbo accuracy is 83% with our current code, and remains 83% with the line above added in.

Having said that, we should just fix it in the code anyway. Have opened a PR here, and we'll merge it on Monday. Thanks for reporting!
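
For reference, a rough sketch of what that one-line change looks like in context (again with illustrative names rather than the exact code in query_generators/openai.py):

```python
# Sketch of the fix: format the assistant prompt as well, so the placeholder
# is filled in before the messages are sent (names are illustrative).
assistant_prompt = assistant_prompt.format(user_question=question)

messages = [
    {"role": "user", "content": user_prompt},
    {"role": "assistant", "content": assistant_prompt},
]
```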

NPap0 (Author) commented Dec 1, 2023

> That's a fair hypothesis :) In practice, we found that the assistant prompt matters very little and is generally ignored by OpenAI's ChatCompletion endpoint. We included it to maintain comparability with the other prompts.
>
> If we add the line `assistant_prompt = assistant_prompt.format(user_question=question)`, the results are the same: gpt-4-turbo accuracy is 83% with our current code, and remains 83% with the line above added in.
>
> Having said that, we should just fix it in the code anyway. Have opened a PR here, and we'll merge it on Monday. Thanks for reporting!

Thank you for taking the time to explain!
