Possible Error Leading to Inaccurate Evaluation Outcomes #54

Closed
NPap0 opened this issue Dec 1, 2023 · 4 comments

NPap0 commented Dec 1, 2023

Hey there, I'm no expert so maybe I'm mistaken (so please verify twice :P), but I think the way you evaluate OpenAI models is not the way you intended. I tried to replicate the evaluation and came across this issue:
This is the prompt
https://github.com/defog-ai/sql-eval/blob/main/prompts/prompt_openai.md

But
https://github.com/defog-ai/sql-eval/blob/024ff013d02d5fac248fb56b99279cdb16d70aa0/query_generators/openai.py#L123C4-L123C4

When changing the prompt for each question, you call .format on the user prompt text only, leaving the assistant prompt untouched. Then you call the generate function, but your assistant_prompt part is:

````
Given the database schema, here is the SQL query that answers `{user_question}`:
```sql
````

without any user_question parameter filled in. That means the literal text above, `{user_question}` placeholder included, is exactly what the model receives as input, which changes the input and can therefore change the results, outcomes, and insights.
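
Concretely, here's a minimal sketch of what I think happens; the prompt strings and variable names are illustrative, not the exact ones in query_generators/openai.py:

```python
# Sketch of the behaviour described above (illustrative names, not the repo's exact code).
user_prompt = (
    "Generate a SQL query that answers `{user_question}`.\n"
    "Schema:\n{table_metadata_string}"
)
assistant_prompt = (
    "Given the database schema, here is the SQL query that answers "
    "`{user_question}`:\n```sql\n"
)

question = "How many users signed up in 2023?"
metadata = "CREATE TABLE users (id int, created_at date);"

# Only the user prompt is formatted with the actual question...
user_prompt = user_prompt.format(
    user_question=question, table_metadata_string=metadata
)

# ...so the assistant prompt still contains the literal `{user_question}` placeholder
# when the messages are sent to the ChatCompletion endpoint.
messages = [
    {"role": "user", "content": user_prompt},
    {"role": "assistant", "content": assistant_prompt},
]
print(messages[1]["content"])
# -> Given the database schema, here is the SQL query that answers `{user_question}`:
#    ```sql
```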

@rishsriv @wongjingping

rishsriv (Member) commented Dec 1, 2023

The question is included in the user prompt. Please let us know if you’re able to get better results with other prompts.

rishsriv closed this as completed Dec 1, 2023
NPap0 (Author) commented Dec 1, 2023

> The question is included in the user prompt. Please let us know if you’re able to get better results with other prompts.

I'm aware; I just thought your intention was to include the start of the assistant's response, with the question repeated, so that the LLM continues predicting from there. At least, that is how I understood the code. Is my understanding incorrect?

rishsriv (Member) commented Dec 1, 2023

That's a fair hypothesis :) In practice, we found that the assistant prompt matters very little and is generally ignored by OpenAI's ChatCompletion endpoint. We included it to maintain comparability with the other prompts.

If we add the line `assistant_prompt = assistant_prompt.format(user_question=question)`, the results are the same: gpt-4-turbo accuracy is 83% with our current code, and remains 83% with the line above added in.

Having said that, we should just fix it in the code anyway. Have opened a PR here, and we'll merge it on Monday. Thanks for reporting!
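
For reference, a rough sketch of what that one-line change looks like in context (again with illustrative names rather than the exact code in query_generators/openai.py):

```python
# Sketch of the fix: format the assistant prompt as well, so the placeholder
# is filled in before the messages are sent (names are illustrative).
assistant_prompt = assistant_prompt.format(user_question=question)

messages = [
    {"role": "user", "content": user_prompt},
    {"role": "assistant", "content": assistant_prompt},
]
```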

NPap0 (Author) commented Dec 1, 2023

> That's a fair hypothesis :) In practice, we found that the assistant prompt matters very little and is generally ignored by OpenAI's ChatCompletion endpoint. We included it to maintain comparability with the other prompts.
>
> If we add the line `assistant_prompt = assistant_prompt.format(user_question=question)`, the results are the same: gpt-4-turbo accuracy is 83% with our current code, and remains 83% with the line above added in.
>
> Having said that, we should just fix it in the code anyway. Have opened a PR here, and we'll merge it on Monday. Thanks for reporting!

Thank you for taking the time to explain!
