fix: accidental loop on client-side validation fail #61
base: main
Conversation
Before, if a tool use response failed validation (e.g. due to being too large), we accidentally fed the error about the validation back into the LLM. Most models would fail on that, but llama3.1 instruct, for example, would ask to retry the tool, and that could end up recursing up to 128 times (max_tool_use).

The intent, I think, was that when there was a response we considered invalid, we would break out of the loop calling the LLM. So, this change breaks the loop in a way that retains the error and the stack trace.

The reason this went unnoticed was twofold. First, as described above, most LLMs would only recurse once before failing on a ToolCallResult (due to being passed an error as input). Second, we didn't check the type of response we received on such an error. If we had, we would expect a user-side validation error (so, a pseudo-user reply), not an assistant error caused by passing an error literal as input to a subsequent tool call.

Signed-off-by: Adrian Cole <[email protected]>
So before this change, the error content would end up, by mistake, in the message sent to the LLM after the tool call:
Say you were running ollama with mistral-nemo: the LLM would try to process the content, which had an error in it, and return text like this.
I found this with "llama3.1:8b-instruct-q4_0" because it seems to return a response asking to call the tool yet again.
So, the TL;DR is that we never should have passed an error literal as content for a tool call. This was a mistake, but it only caused an extra completion in most LLMs. Luckily, it caused a loop in llama3.1, so I could notice the root problem.
@@ -111,6 +111,12 @@ def reply(self, max_tool_use: int = 128) -> Message:
        for tool_use in response.tool_use:
            tool_result = self.call_function(tool_use)
            content.append(tool_result)
            if tool_result.is_error:
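(The excerpt cuts off at the new if statement. As a rough sketch of the direction the PR description implies, and not the actual diff body, the branch might continue roughly like this; everything below the condition is an assumption:)

```python
# Sketch only: the real body of the added branch isn't shown in this excerpt.
# Per the PR description, the idea is to stop re-invoking the LLM and surface
# the validation error (with its traceback) instead of feeding it back as content.
if tool_result.is_error:
    raise RuntimeError(
        f"tool {tool_use.name} failed client-side validation: {tool_result.output}"
    )
```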
ps: in this case .is_error is true because, in our validation, we catch an exception and then set this field. I don't know why we don't propagate it instead; if we did, this bug wouldn't have been possible.
If someone has an idea, I'll go back and add a comment. If not, maybe in a later PR we can change this to not swallow the error and let it break the request, instead of tunneling it backwards through a result field.
Specifically here: if there was a validation error, we swallowed the exception, which previously caused a retry; with this change we break on it instead.
https://github.com/square/exchange/blob/main/src/exchange/exchange.py#L158-L161
I could think of some reasons to try it that way, but if we did, we'd need to make sure we don't add the error into the LLM context.
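For what it's worth, here's a self-contained sketch of the two options being weighed here; the type and function names are stand-ins modeled on the linked lines and the session logs below, not the actual exchange source:

```python
from dataclasses import dataclass

# Stand-in for exchange's tool result type (field names assumed from the session logs below).
@dataclass
class ToolResult:
    tool_use_id: str
    output: str
    is_error: bool = False

def validate_tool_output(output: str, max_len: int = 16000) -> None:
    # Mirrors the check referenced above: reject outputs that are too large.
    if len(output) > max_len:
        raise ValueError("This tool call created an output that was too long to handle!")

# Option A (current behavior): swallow the exception and tunnel it through the result field.
def result_tunneling_error(tool_use_id: str, output: str) -> ToolResult:
    try:
        validate_tool_output(output)
    except ValueError as e:
        return ToolResult(tool_use_id, str(e), is_error=True)
    return ToolResult(tool_use_id, output)

# Option B (what this comment floats): propagate, so the request breaks instead of retrying.
def result_or_raise(tool_use_id: str, output: str) -> ToolResult:
    validate_tool_output(output)  # raises ValueError on oversized output
    return ToolResult(tool_use_id, output)
```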
ok I think I understand - but in general we do want the errors to feed back, right? Just not this way?
yeap we should give the error directly to the user (response was too long) vs accidentally washing it through the LLM and getting an error like this instead:
This was actually behaving as intended, I think? Although it's underdocumented. Here was the goal with this code path:

We're trying to improve performance of the agent loop by letting it self-correct if it runs into an error. That's also why, as you noticed, we catch tracebacks in tool calls and set them as a tool result error. For example, if I ask goose to check a logs file that is very long, it might first try to read it directly.

That's one of the more complex modes though. This same code path is what allows an agent loop like goose to see that it tried an invalid shell command, tried to read a file that doesn't exist yet, etc. It can recover from those without needing new user intervention.

So I do think we want to keep this, but it's unfortunate that some of the models are looping here, for sure. It'd be a more complex fix, but maybe we try to observe the number of consecutive is_error calls and have another max_errors parameter that is lower (e.g. 3 or 4 consecutive errors seems like enough to give up). WDYT?
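To make the max_errors idea concrete, here's a rough sketch of a consecutive-error guard. The exchange methods it calls (generate, call_function, add_tool_result) are assumptions for illustration, not the real API:

```python
def reply_with_error_budget(exchange, max_tool_use: int = 128, max_errors: int = 3):
    """Sketch of the consecutive-error guard proposed above (not real exchange code)."""
    consecutive_errors = 0
    for _ in range(max_tool_use):
        response = exchange.generate()           # assumed: ask the LLM for the next message
        if not response.tool_use:
            return response                       # no tool calls left, we have a final reply
        for tool_use in response.tool_use:
            tool_result = exchange.call_function(tool_use)
            exchange.add_tool_result(tool_result)  # assumed: appends the result to the context
            # Count errors only when they happen back to back; any success resets the budget.
            consecutive_errors = consecutive_errors + 1 if tool_result.is_error else 0
        if consecutive_errors >= max_errors:
            raise RuntimeError(f"giving up after {consecutive_errors} consecutive tool errors")
    raise RuntimeError("exceeded max_tool_use without reaching a final reply")
```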
@baxen thanks for the summary. The thing I believe this doesn't address may have been lost in my over-explaining above. Right now, we literally put the following into the context for the LLM to interpret - literally, the error message:
This seems unintentional, unless you are saying it is one of..
Am I understanding it right?
separately on ..
If we keep the same logic, consecutive errors should definitely break out, so we don't have to wait for 128. This could also buy time to ponder a cleaner overall solution, if one exists, for exchange-side validation failures. For example, I think we might need to rewrite the stack trace, or introduce it better, if we think the tool will act differently. In a case like this, the returned password is not likely to change in length; I would expect that if we meant to push the error back into the context, we'd be expecting the LLM to say "sorry, I can't return a shorter valid password".
Yes this first one - that error you shared
is enough info for most of the models to recover and send a new tool call without a password argument. Here's a real session I just ran where it ran into the max size error mode. It gets the length error traceback, and switches from a direct read to using head on the first 1000 lines:

{"role": "assistant", "id": "msg_96921841e1fe4ed380851d9b", "created": 1727322555, "content": [{"id": "call_CoR6Jea7e6M3Xuv6sGOGmUxA", "name": "read_file", "parameters": {"path": "logs.txt"}, "is_error": false, "error_message": null, "type": "ToolUse"}]}
{"role": "user", "id": "msg_7ac8e549020645c8872720bd", "created": 1727322555, "content": [{"tool_use_id": "call_CoR6Jea7e6M3Xuv6sGOGmUxA", "output": "Traceback (most recent call last):\n File \"/Users/baxen/.cache/uv/archive-v0/LmY1VP9HUZMxhCvZKHB0A/lib/python3.12/site-packages/exchange/exchange.py\", line 155, in call_function\n validate_tool_output(output)\n File \"/Users/baxen/.cache/uv/archive-v0/LmY1VP9HUZMxhCvZKHB0A/lib/python3.12/site-packages/exchange/exchange.py\", line 25, in validate_tool_output\n raise ValueError(\"This tool call created an output that was too long to handle!\")\nValueError: This tool call created an output that was too long to handle!\n\nThis tool call created an output that was too long to handle!", "is_error": true, "type": "ToolResult"}]}
{"role": "assistant", "id": "msg_197ab71d00b8476ba8672b9a", "created": 1727322558, "content": [{"id": "call_FLSPTSSlb1NHRpmwonXT4bMX", "name": "update_plan", "parameters": {"tasks": [{"description": "Read the content of logs.txt file", "status": "failed"}, {"description": "Summarize the content of logs.txt file", "status": "failed"}, {"description": "Using the first 1000 lines of logs.txt for summarization process", "status": "planned"}, {"description": "Read the first 1000 lines of logs.txt", "status": "planned"}, {"description": "Summarize the first 1000 lines of logs.txt file", "status": "planned"}]}, "is_error": false, "error_message": null, "type": "ToolUse"}]}
{"role": "user", "id": "msg_cbd376ac2d404224956ebad1", "created": 1727322558, "content": [{"tool_use_id": "call_FLSPTSSlb1NHRpmwonXT4bMX", "output": "[{\"description\": \"Read the content of logs.txt file\", \"status\": \"failed\"}, {\"description\": \"Summarize the content of logs.txt file\", \"status\": \"failed\"}, {\"description\": \"Using the first 1000 lines of logs.txt for summarization process\", \"status\": \"planned\"}, {\"description\": \"Read the first 1000 lines of logs.txt\", \"status\": \"planned\"}, {\"description\": \"Summarize the first 1000 lines of logs.txt file\", \"status\": \"planned\"}]", "is_error": false, "type": "ToolResult"}]}
{"role": "assistant", "id": "msg_eeec00c91d0d46e08bd179a9", "created": 1727322559, "content": [{"id": "call_zkpvXxdlbkFLeW4pAiBhB8Lk", "name": "shell", "parameters": {"command": "head -n 1000 logs.txt"}, "is_error": false, "error_message": null, "type": "ToolUse"}]} There is a good argument though that we could skip the full traceback and just show the error message. If the error messages are good, that's enough, but some errors really don't say enough without more context, it's a bit debatable which is better - this would be a great thing to performance test. |
yeah I was worried we would lose something - we kind of need to soak test this a bit, or consider stripping some of it back. But that error content isn't necessarily a problem - no more than it would be for me to see it (annoying, but it would also be annoying if I didn't see it and then missed some content).
so given this isn't entirely accidental - is there something to do here for some models that won't work with this much feedback?
yeah this isn't an accident, as it turns out. I do think a few things would help:
sg?
so I've an idea how to implement the above, and I'll also revise comments so future me doesn't get confused. Really appreciate y'all walking me through this!