fix: accidental loop on client-side validation fail #61

Draft · codefromthecrypt wants to merge 1 commit into main
Conversation

@codefromthecrypt (Contributor) commented Sep 25, 2024

Before, if a tool-use response failed validation (e.g. due to being too large), we accidentally fed the validation error back into the LLM. Most models would fail on that, but llama3.1 instruct, for example, would ask to retry the tool, and that could end up recursing potentially 128 times (max_tool_use).

I think the intent was that, when a response failed validation, we would actually break the loop that calls the LLM. So, this change breaks the loop in a way that retains the error and the stack trace.

The reason this went unnoticed was twofold. First, as described above, most LLMs would only recurse once before failing on a ToolCallResult (due to being passed an error as input). Second, we didn't check the type of response we received on such an error. If we had, we would have expected a user-side validation error (that is, a pseudo-user reply), not an assistant error caused by passing the error literal as bad input to a subsequent tool call.

@codefromthecrypt (author)

so before this change, the error content was sent to the LLM by mistake as the message after the tool call:

{'role': 'assistant', 'tool_calls': [{'function': {'arguments': '{}', 'name': 'get_password'}, 'id': 'call_a2yqv1bu', 'type': 'function'}]}
{'content': 'Traceback (most recent call last):\n  File "/Users/adriancole/oss/exchange/src/exchange/exchange.py", line 147, in call_funct...rd argument \'password\'\n\ntest_tool_use_output_chars.<locals>.get_password() got an unexpected keyword argument \'password\'', 'role': 'tool', 'tool_call_id': 'call_mnfsw5w8'}

With ollama and mistral-nemo, for example, the LLM would try to process the content (which had an error in it) and return text like this:

I apologize for the inconvenience. It seems there's been an error processing your request due to a large response from the tool. Could you please try again with a different query?

I found this with "llama3.1:8b-instruct-q4_0" because it seems to respond by asking to call the tool yet again:

Message(role='assistant', id='msg_58b3a92b8a3342ce87157bcd', created=1727267372, content=[ToolUse(id='call_y04fbpsn', name='get_password', parameters={}, is_error=False, error_message=None)])

So, the TL;DR is that we never should have passed an error literal as content for a tool call. This was a mistake, but it only caused an extra completion in most LLMs. Luckily, it caused a loop in llama3.1, so I could notice the root problem.

@@ -111,6 +111,12 @@ def reply(self, max_tool_use: int = 128) -> Message:
for tool_use in response.tool_use:
tool_result = self.call_function(tool_use)
content.append(tool_result)
if tool_result.is_error:

@codefromthecrypt (author)

ps: in this case .is_error is true because in our validation we catch an exception and then set this field. I don't know why we don't propagate it instead; if we did, this bug wouldn't have been possible.

If someone has an idea why, I'll go back and add a comment. If not, maybe in a later PR we can change this to not swallow the error, and let it break the request instead of tunneling it backwards through a result field.

@codefromthecrypt (author) commented Sep 25, 2024

specifically here, if there is a validation error, we swallow the exception, which before had us retry; with this change we break on it instead.

https://github.com/square/exchange/blob/main/src/exchange/exchange.py#L158-L161

I can think of some reasons to try it that way, but if we did, we'd need to make sure we don't add the error into the LLM context.
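
To make that concrete, here is a stand-alone sketch of the tunneling behavior described above. It is illustrative only: ToolResult is a stand-in for the real content type, the invoke parameter and MAX_OUTPUT_CHARS limit are placeholders, and call_function is simplified from the actual exchange source.

    import traceback
    from dataclasses import dataclass

    @dataclass
    class ToolResult:              # stand-in for the real ToolResult content type
        tool_use_id: str
        output: str
        is_error: bool = False

    MAX_OUTPUT_CHARS = 16000       # illustrative limit, not the library's actual value

    def validate_tool_output(output: str) -> None:
        if len(output) > MAX_OUTPUT_CHARS:
            raise ValueError("This tool call created an output that was too long to handle!")

    def call_function(tool_use, invoke) -> ToolResult:
        # Sketch of the current behavior: run the tool, validate the output, and on
        # any failure tunnel the traceback through the result instead of raising.
        try:
            output = invoke(tool_use)       # run the tool
            validate_tool_output(output)    # raises if the output is too long
            return ToolResult(tool_use.id, output)
        except Exception:
            # The traceback string becomes the tool "output" and is_error is set,
            # so the error text re-enters the LLM context as if it were a result.
            return ToolResult(tool_use.id, traceback.format_exc(), is_error=True)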

@michaelneale (Collaborator)

ok I think I understand - but in general we do want the errors to feed back, right? but just not this way?

@codefromthecrypt (author)

> ok I think I understand - but in general we do want the errors to feed back, right? but just not this way?

yep, we should give the error directly to the user (response was too long) vs accidentally washing it through the LLM and getting a reply like this instead:

'it seems like the tool call i received is not in the expected format. could '
 'you please provide more context or clarify what type of authentication is '
 "required? i'll do my best to assist you once i have the correct "
 'information. \n'
 '\n'
 'if you need to authenticate, i can try calling the `get_password` tool again '
 'with the correct parameters. please let me know how to proceed.'

@baxen (Collaborator) commented Sep 26, 2024

This was actually behaving as intended, I think, although it's underdocumented. Here was the goal with this code path:

We're trying to improve the performance of the agent loop by letting it self-correct when it runs into an error. That's also why, as you noticed, we catch tracebacks in tool calls and set them as a tool result error.

For example, if I ask goose to check a logs file that is very long, it might first try to directly cat logs.txt. Then it can see an error about the output being too large, and it can often recover to try tail -n 100 logs.txt or similar.

That's one of the more complex modes, though. This same code path is what allows an agent loop like goose to see that it tried an invalid shell command, tried to read a file that doesn't exist yet, etc. It can recover from those without needing new user intervention.

So I do think we want to keep this, but it's unfortunate that some of the models are looping here, for sure. It'd be a more complex fix, but maybe we could observe the number of consecutive is_error results and add another max_errors parameter that is lower (e.g. 3 or 4 consecutive errors seems like enough to give up). WDYT?
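
Roughly, that idea might look like the sketch below. max_errors is hypothetical, ToolResult and run_tool_uses are stand-ins, and in the real reply() loop the counter would live across turns rather than per response:

    from dataclasses import dataclass

    @dataclass
    class ToolResult:              # stand-in for the real ToolResult content type
        output: str
        is_error: bool = False

    def run_tool_uses(tool_uses, call_function, max_errors: int = 3):
        # Collect tool results, but give up after max_errors failures in a row
        # instead of waiting for max_tool_use (128) iterations.
        content, consecutive_errors = [], 0
        for tool_use in tool_uses:
            result = call_function(tool_use)
            content.append(result)
            if result.is_error:
                consecutive_errors += 1
                if consecutive_errors >= max_errors:
                    raise RuntimeError(result.output)   # surface the original error
            else:
                consecutive_errors = 0                  # reset on any successful call
        return content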

@codefromthecrypt (author) commented Sep 26, 2024

@baxen thanks for the summary. The part I believe this isn't going to solve may have been missed in my over-explaining above.

Right now, we literally put the following into the context for the LLM to interpret - the error message starting with Traceback (most recent call last)...

{'content': 'Traceback (most recent call last):\n  File "/Users/adriancole/oss/exchange/src/exchange/exchange.py", line 147, in call_funct...rd argument \'password\'\n\ntest_tool_use_output_chars.<locals>.get_password() got an unexpected keyword argument \'password\'', 'role': 'tool', 'tool_call_id': 'call_mnfsw5w8'}

This seems unintentional, unless you are saying it is one of:

  • we expect it to interpret our stack trace about a validation failure and behave differently when the error is in the context for a tool call? (note: it doesn't at the moment in any LLM I've tried)
  • this is acceptable collateral damage, and we are ok sending something unresolvable back to the LLM because it makes the code simpler

Am I understanding it right?

@codefromthecrypt (author)

separately on ..

> maybe we could observe the number of consecutive is_error results and add another max_errors parameter that is lower (e.g. 3 or 4 consecutive errors seems like enough to give up). WDYT?

If we keep the same logic, definitely consecutive errors should break out so we don't have to wait for 128.

This could also buy time to ponder a cleaner overall solution, if one exists, for exchange-side validation failures. For example, I think we might need to rewrite the stack trace, or introduce it better, if we expect the tool to act differently. In this case, returning the password is not likely to change in length. If we meant to push the error back into the context, I would expect the LLM to say "sorry, I can't return a shorter valid password".

@baxen (Collaborator) commented Sep 26, 2024

> This seems unintentional, unless you are saying it is one of:
>
> • we expect it to interpret our stack trace about a validation failure and behave differently when the error is in the context for a tool call? (note: it doesn't at the moment in any LLM I've tried)

Yes, this first one - that error you shared

...get_password() got an unexpected keyword argument \'password\'...

is enough info for most of the models to recover and send a new tool call without a password argument.

Here's a real session I just ran where it hit the max-size error mode. It gets the length error traceback and switches from a direct read to using head:

{"role": "assistant", "id": "msg_96921841e1fe4ed380851d9b", "created": 1727322555, "content": [{"id": "call_CoR6Jea7e6M3Xuv6sGOGmUxA", "name": "read_file", "parameters": {"path": "logs.txt"}, "is_error": false, "error_message": null, "type": "ToolUse"}]}
{"role": "user", "id": "msg_7ac8e549020645c8872720bd", "created": 1727322555, "content": [{"tool_use_id": "call_CoR6Jea7e6M3Xuv6sGOGmUxA", "output": "Traceback (most recent call last):\n  File \"/Users/baxen/.cache/uv/archive-v0/LmY1VP9HUZMxhCvZKHB0A/lib/python3.12/site-packages/exchange/exchange.py\", line 155, in call_function\n    validate_tool_output(output)\n  File \"/Users/baxen/.cache/uv/archive-v0/LmY1VP9HUZMxhCvZKHB0A/lib/python3.12/site-packages/exchange/exchange.py\", line 25, in validate_tool_output\n    raise ValueError(\"This tool call created an output that was too long to handle!\")\nValueError: This tool call created an output that was too long to handle!\n\nThis tool call created an output that was too long to handle!", "is_error": true, "type": "ToolResult"}]}
{"role": "assistant", "id": "msg_197ab71d00b8476ba8672b9a", "created": 1727322558, "content": [{"id": "call_FLSPTSSlb1NHRpmwonXT4bMX", "name": "update_plan", "parameters": {"tasks": [{"description": "Read the content of logs.txt file", "status": "failed"}, {"description": "Summarize the content of logs.txt file", "status": "failed"}, {"description": "Using the first 1000 lines of logs.txt for summarization process", "status": "planned"}, {"description": "Read the first 1000 lines of logs.txt", "status": "planned"}, {"description": "Summarize the first 1000 lines of logs.txt file", "status": "planned"}]}, "is_error": false, "error_message": null, "type": "ToolUse"}]}
{"role": "user", "id": "msg_cbd376ac2d404224956ebad1", "created": 1727322558, "content": [{"tool_use_id": "call_FLSPTSSlb1NHRpmwonXT4bMX", "output": "[{\"description\": \"Read the content of logs.txt file\", \"status\": \"failed\"}, {\"description\": \"Summarize the content of logs.txt file\", \"status\": \"failed\"}, {\"description\": \"Using the first 1000 lines of logs.txt for summarization process\", \"status\": \"planned\"}, {\"description\": \"Read the first 1000 lines of logs.txt\", \"status\": \"planned\"}, {\"description\": \"Summarize the first 1000 lines of logs.txt file\", \"status\": \"planned\"}]", "is_error": false, "type": "ToolResult"}]}
{"role": "assistant", "id": "msg_eeec00c91d0d46e08bd179a9", "created": 1727322559, "content": [{"id": "call_zkpvXxdlbkFLeW4pAiBhB8Lk", "name": "shell", "parameters": {"command": "head -n 1000 logs.txt"}, "is_error": false, "error_message": null, "type": "ToolUse"}]}

There is a good argument, though, that we could skip the full traceback and just show the error message. If the error messages are good, that's enough, but some errors really don't say enough without more context. It's a bit debatable which is better - this would be a great thing to performance test.
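
If we wanted to try that, a small toggle along these lines (an illustrative helper, not something that exists in the library today) would make it easy to compare the two behaviors:

    import traceback

    def format_tool_error(exc: Exception, include_traceback: bool = False) -> str:
        # Illustrative helper: keep the full traceback available for logs, and
        # optionally send only the final error message back to the model.
        if include_traceback:
            return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
        return f"{type(exc).__name__}: {exc}"

    try:
        raise ValueError("This tool call created an output that was too long to handle!")
    except ValueError as e:
        print(format_tool_error(e))   # message only, no stack frames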

@michaelneale (Collaborator)

yeah, I was worried we would lose something - we kind of need to soak test this a bit, or consider stripping some of it back. But that error content isn't necessarily a problem - any more than it would be for me to see it (annoying, but also annoying if I didn't see it and then missed some content).

@michaelneale (Collaborator)

so given this isn't entirely accidental - is there something to do here for some models that won't work with this much feedback?

@codefromthecrypt (author)

yeah this isn't an accident, as it turns out. I do think a few things would help:

  • be clear in assertions so that it doesn't appear an accident ;)
  • have a break, because if a model doesn't understand, it could loop 128 times, which feels like forever

sg?

@codefromthecrypt marked this pull request as draft September 26, 2024 10:53

@codefromthecrypt (author)

so I have an idea how to implement the above, and I'll also revise the comments so future me doesn't get confused. really appreciate y'all walking me through this!
