fix: accidental loop on client-side validation fail #61
base: main
Conversation
Before, if a tool use response failed validation (e.g. due to being too large), we accidentally fed the error about the validation back into the LLM. Most models would fail on that, but llama3.1 instruct, for example, would ask to retry the tool, and that could end up recursing up to 128 times (max_tool_use).

The intent, I think, was that when there was a response we considered invalid, we would break out of the loop calling the LLM. So, this change breaks the loop in a way that retains the error and the stack trace.

The reason this went unnoticed was twofold. First, as described above, most LLMs would only recurse once before failing on a ToolCallResult (due to being passed an error as input). Second, we didn't check the type of response we received on such an error. If we had, we would expect a user-side validation error (so, a pseudo-user reply), not an assistant error caused by passing an error literal as input to a subsequent tool call.

Signed-off-by: Adrian Cole <[email protected]>
So before this change, the error content would end up, by mistake, in the message sent to the LLM after the tool call:
Say you were running ollama with mistral-nemo: the LLM would try to process the content, which had an error in it, and return text like this.
I found this with "llama3.1:8b-instruct-q4_0" because it seems to return a response asking to call the tool yet again.
So, the TL;DR is that we never should have passed an error literal as content for a tool call. This was a mistake, but it only caused an extra completion in most LLMs. Luckily, it caused a loop in llama3.1, so I could notice the root problem.
@@ -111,6 +111,12 @@ def reply(self, max_tool_use: int = 128) -> Message:
        for tool_use in response.tool_use:
            tool_result = self.call_function(tool_use)
            content.append(tool_result)
            if tool_result.is_error:
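(The excerpt cuts off at the new if statement. As a rough sketch of the direction the PR description implies, and not the actual diff body, the branch might continue roughly like this; everything below the condition is an assumption:)

```python
# Sketch only: the real body of the added branch isn't shown in this excerpt.
# Per the PR description, the idea is to stop re-invoking the LLM and surface
# the validation error (with its traceback) instead of feeding it back as content.
if tool_result.is_error:
    raise RuntimeError(
        f"tool {tool_use.name} failed client-side validation: {tool_result.output}"
    )
```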
ps: in this case .is_error is true because, in our validation, we catch an exception and then set this field. I don't know why we don't propagate it instead; if we did, this bug wouldn't have been possible.
If someone has an idea, I'll go back and add a comment. If not, maybe in a later PR we can change this to not swallow the error and let it break the request, instead of tunneling it backwards through a result field.
Specifically here: if there was a validation error, we swallowed the exception, which previously caused a retry; with this change we break on it instead.
https://github.com/square/exchange/blob/main/src/exchange/exchange.py#L158-L161
I could think of some reasons to try it that way, but if we did, we'd need to make sure we don't add the error into the LLM context.
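For what it's worth, here's a self-contained sketch of the two options being weighed here; the type and function names are stand-ins modeled on the linked lines and the session logs below, not the actual exchange source:

```python
from dataclasses import dataclass

# Stand-in for exchange's tool result type (field names assumed from the session logs below).
@dataclass
class ToolResult:
    tool_use_id: str
    output: str
    is_error: bool = False

def validate_tool_output(output: str, max_len: int = 16000) -> None:
    # Mirrors the check referenced above: reject outputs that are too large.
    if len(output) > max_len:
        raise ValueError("This tool call created an output that was too long to handle!")

# Option A (current behavior): swallow the exception and tunnel it through the result field.
def result_tunneling_error(tool_use_id: str, output: str) -> ToolResult:
    try:
        validate_tool_output(output)
    except ValueError as e:
        return ToolResult(tool_use_id, str(e), is_error=True)
    return ToolResult(tool_use_id, output)

# Option B (what this comment floats): propagate, so the request breaks instead of retrying.
def result_or_raise(tool_use_id: str, output: str) -> ToolResult:
    validate_tool_output(output)  # raises ValueError on oversized output
    return ToolResult(tool_use_id, output)
```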
ok I think I understand - but in general we do want the errors to feed back, right? Just not this way?
yeap we should give the error directly to the user (response was too long) vs accidentally washing it through the LLM and getting an error like this instead:
This was actually behaving as intended, I think? Although it's underdocumented. Here was the goal with this code path:

We're trying to improve performance of the agent loop by letting it self-correct if it runs into an error. That's also why, as you noticed, we catch tracebacks in tool calls and set them as a tool result error. For example, if I ask goose to check a logs file that is very long, it might first try to read it directly.

That's one of the more complex modes though. This same code path is what allows an agent loop like goose to see that it tried an invalid shell command, tried to read a file that doesn't exist yet, etc. It can recover from those without needing new user intervention.

So I do think we want to keep this, but it's unfortunate that some of the models are looping here, for sure. It'd be a more complex fix, but maybe we try to observe the number of consecutive is_error calls and have another max_errors parameter that is lower (e.g. 3 or 4 consecutive errors seems like enough to give up). WDYT?
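To make the max_errors idea concrete, here's a rough sketch of a consecutive-error guard. The exchange methods it calls (generate, call_function, add_tool_result) are assumptions for illustration, not the real API:

```python
def reply_with_error_budget(exchange, max_tool_use: int = 128, max_errors: int = 3):
    """Sketch of the consecutive-error guard proposed above (not real exchange code)."""
    consecutive_errors = 0
    for _ in range(max_tool_use):
        response = exchange.generate()           # assumed: ask the LLM for the next message
        if not response.tool_use:
            return response                       # no tool calls left, we have a final reply
        for tool_use in response.tool_use:
            tool_result = exchange.call_function(tool_use)
            exchange.add_tool_result(tool_result)  # assumed: appends the result to the context
            # Count errors only when they happen back to back; any success resets the budget.
            consecutive_errors = consecutive_errors + 1 if tool_result.is_error else 0
        if consecutive_errors >= max_errors:
            raise RuntimeError(f"giving up after {consecutive_errors} consecutive tool errors")
    raise RuntimeError("exceeded max_tool_use without reaching a final reply")
```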
@baxen thanks for the summary. The thing I believe this doesn't address may have been lost in my over-explaining above. Right now, we literally put the following into the context for the LLM to interpret - literally, the error message:
This seems unintentional, unless you are saying it is one of..
Am I understanding it right?
separately on ..
If we keep the same logic, consecutive errors should definitely break out, so we don't have to wait for 128. This could also buy time to ponder a cleaner overall solution, if one exists, for exchange-side validation failures. For example, I think we might need to rewrite the stack trace, or introduce it better, if we think the tool will act differently. In a case like this, the returned password is not likely to change in length; I would expect that if we meant to push the error back into the context, we'd be expecting the LLM to say "sorry, I can't return a shorter valid password".
Yes this first one - that error you shared
is enough info for most of the models to recover and send a new tool call without a password argument. Here's a real session I just ran where it ran into the max size error mode. It gets the length error traceback, and switches from a direct read to using head on the first 1000 lines:

{"role": "assistant", "id": "msg_96921841e1fe4ed380851d9b", "created": 1727322555, "content": [{"id": "call_CoR6Jea7e6M3Xuv6sGOGmUxA", "name": "read_file", "parameters": {"path": "logs.txt"}, "is_error": false, "error_message": null, "type": "ToolUse"}]}
{"role": "user", "id": "msg_7ac8e549020645c8872720bd", "created": 1727322555, "content": [{"tool_use_id": "call_CoR6Jea7e6M3Xuv6sGOGmUxA", "output": "Traceback (most recent call last):\n File \"/Users/baxen/.cache/uv/archive-v0/LmY1VP9HUZMxhCvZKHB0A/lib/python3.12/site-packages/exchange/exchange.py\", line 155, in call_function\n validate_tool_output(output)\n File \"/Users/baxen/.cache/uv/archive-v0/LmY1VP9HUZMxhCvZKHB0A/lib/python3.12/site-packages/exchange/exchange.py\", line 25, in validate_tool_output\n raise ValueError(\"This tool call created an output that was too long to handle!\")\nValueError: This tool call created an output that was too long to handle!\n\nThis tool call created an output that was too long to handle!", "is_error": true, "type": "ToolResult"}]}
{"role": "assistant", "id": "msg_197ab71d00b8476ba8672b9a", "created": 1727322558, "content": [{"id": "call_FLSPTSSlb1NHRpmwonXT4bMX", "name": "update_plan", "parameters": {"tasks": [{"description": "Read the content of logs.txt file", "status": "failed"}, {"description": "Summarize the content of logs.txt file", "status": "failed"}, {"description": "Using the first 1000 lines of logs.txt for summarization process", "status": "planned"}, {"description": "Read the first 1000 lines of logs.txt", "status": "planned"}, {"description": "Summarize the first 1000 lines of logs.txt file", "status": "planned"}]}, "is_error": false, "error_message": null, "type": "ToolUse"}]}
{"role": "user", "id": "msg_cbd376ac2d404224956ebad1", "created": 1727322558, "content": [{"tool_use_id": "call_FLSPTSSlb1NHRpmwonXT4bMX", "output": "[{\"description\": \"Read the content of logs.txt file\", \"status\": \"failed\"}, {\"description\": \"Summarize the content of logs.txt file\", \"status\": \"failed\"}, {\"description\": \"Using the first 1000 lines of logs.txt for summarization process\", \"status\": \"planned\"}, {\"description\": \"Read the first 1000 lines of logs.txt\", \"status\": \"planned\"}, {\"description\": \"Summarize the first 1000 lines of logs.txt file\", \"status\": \"planned\"}]", "is_error": false, "type": "ToolResult"}]}
{"role": "assistant", "id": "msg_eeec00c91d0d46e08bd179a9", "created": 1727322559, "content": [{"id": "call_zkpvXxdlbkFLeW4pAiBhB8Lk", "name": "shell", "parameters": {"command": "head -n 1000 logs.txt"}, "is_error": false, "error_message": null, "type": "ToolUse"}]} There is a good argument though that we could skip the full traceback and just show the error message. If the error messages are good, that's enough, but some errors really don't say enough without more context, it's a bit debatable which is better - this would be a great thing to performance test. |
yeah I was worried we would lose something - we kind of need to soak test this a bit, or consider stripping some of it back. But that error content isn't necessarily a problem - no more than it would be for me to see it (annoying, but it would also be annoying if I didn't see it and then missed some content).
so given this isn't entirely accidental - is there something to do here for some models that won't work with this much feedback?
yeah this isn't an accident, as it turns out. I do think a few things would help:
sg?
so I've an idea how to implement the above, and I'll also revise comments so future me doesn't get confused. Really appreciate y'all walking me through this!