Allow agent loop to recover from tool call errors#592
Conversation
Co-authored-by: Nick Misasi <nick13misasi@gmail.com>
🤖 LLM Evaluation ResultsOpenAI
❌ Failed EvaluationsShow 1 failuresOPENAI1. TestReactEval/[openai]_react_cat_message
Anthropic
❌ Failed EvaluationsShow 1 failuresANTHROPIC1. TestReactEval/[anthropic]_react_cat_message
This comment was automatically generated by the eval CI pipeline. |
There was a problem hiding this comment.
Stale comment
Security Review: No Issues Found
Reviewed the changes for tool error retry/recovery logic. The PR is well-bounded and does not introduce security vulnerabilities:
- Resource exhaustion: Tool retries are capped at
MaxConsecutiveToolCallFailures = 3, after which tools are disabled at the API level viaWithToolsDisabled()(not just a prompt instruction). The existingMaxToolResolutionDepth = 10also applies.- Permission model: The continuation path reuses the same already-validated
user,channel, andllmContext. No new data access paths or endpoints are introduced.- Channel isolation: No changes to channel membership checks or data scoping. Tool error results were already stored on post props; this PR only allows the LLM to respond to them.
- Prompt injection surface: The injected system message (
ToolRetryLimitSystemMessage) is a hardcoded constant — no user-controlled data flows into it.- Auto-run wrapper hardening: The new
cfg.ToolsDisabledearly-return inAutoRunToolsWrapper.ChatCompletionis a defensive improvement that prevents entering the auto-run loop when tools are already disabled.Sent by Cursor Automation: Find vulnerabilities
📝 WalkthroughWalkthroughChange introduces tooling to detect repeated failed tool-call executions, inject a retry-limit system message, and disable tools for subsequent LLM invocations. Adjustments span conversation tool handling, the auto-run loop, new retry utilities, and tests to validate continuation and KV cleanup. Changes
Sequence Diagram(s)sequenceDiagram
participant User as User
participant LLM as LLM
participant Handler as Tool Handler
participant Tools as Tool Resolver
participant KV as KV Store
rect rgba(100,150,200,0.5)
User->>LLM: user prompt
LLM->>Handler: emits tool call events
Handler->>Tools: resolve/execute tool
Tools-->>Handler: result (success/error)
Handler->>LLM: append tool-result post and re-invoke LLM
end
rect rgba(200,100,100,0.5)
Handler->>Handler: CountTrailingFailedToolCalls(posts)
Handler->>Handler: EnsureToolRetryLimitSystemMessage(posts)
Handler->>Handler: set toolsDisabled = true
end
rect rgba(100,200,100,0.5)
Handler->>LLM: re-invoke with tools disabled
LLM-->>User: assistant explains error / requests guidance
KV->>Handler: deferred KV deletions occur regardless of continuation outcome
end
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
llm/auto_run_tools_test.go (1)
436-496: Assert the fourth invocation actually had tools disabled.This proves the retry-limit prompt was injected, but it doesn't prove
WithToolsDisabled()reached the wrapped model. If that option were dropped and the stubbed fourth response stayed text, this test would still pass. Capturing the per-callLanguageModelConfigand assertingToolsDisabled == trueon the last call would make the regression this test is targeting explicit.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm/auto_run_tools_test.go` around lines 436 - 496, In TestAutoRunToolsWrapper_DisablesToolsAfterThreeConsecutiveFailures capture the per-call LanguageModelConfig passed into the wrapped model (extend or use capturingLLM to record each call's config), then after calling wrapper.ChatCompletion assert that the last recorded config has ToolsDisabled == true (verifying WithToolsDisabled() reached the inner model); reference the capturingLLM used to wrap inner and the WithToolsDisabled option so you validate the fourth invocation (inner.callCount / final recorded call) had ToolsDisabled set.conversations/tool_handling_test.go (1)
1063-1066: Verify the continuation request actually contains the tool error.This test only checks that a follow-up post was streamed. Because
capturingLanguageModel.ChatCompletionignoresCompletionRequest, it won't catch a regression where the follow-up turn runs with redacted or empty tool data instead of"tool execution failed". Capturing the request posts and asserting the last bot post carries the error status/result would cover the core behavior directly.Also applies to: 1129-1140
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@conversations/tool_handling_test.go` around lines 1063 - 1066, The test currently only verifies a follow-up post was streamed but doesn't assert that the continuation request actually includes the tool error because capturingLanguageModel.ChatCompletion ignores the CompletionRequest; update the test to capture the outgoing CompletionRequest (or the streamed posts from fakeStreamingService) and assert the final bot continuation contains the tool error text "tool execution failed" (or the expected error status/result). Specifically, modify or extend capturingLanguageModel.ChatCompletion to record the passed CompletionRequest (or inspect fakeStreamingService's captured posts) and add an assertion that the last bot post / continuation request includes the tool error, ensuring the check covers bots.NewBot-created bot behavior during continuation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@conversations/tool_handling.go`:
- Around line 437-443: completeAndStreamToolResponse currently treats
ToolCallStatusError as actionable, which can cause an early return before
cleaning up private KV entries; ensure c.deleteToolCallKVEntries(post.Id,
resultKVKey, toolCallKVKey) always runs even on error-only batches by invoking
the cleanup unconditionally (move the call before any early return or use defer)
after computing hasActionableResult; reference symbols:
completeAndStreamToolResponse, hasActionableResult, llm.ToolCallStatusError, and
c.deleteToolCallKVEntries to locate and update the code so tool_result_private /
tool_call_private are always removed.
---
Nitpick comments:
In `@conversations/tool_handling_test.go`:
- Around line 1063-1066: The test currently only verifies a follow-up post was
streamed but doesn't assert that the continuation request actually includes the
tool error because capturingLanguageModel.ChatCompletion ignores the
CompletionRequest; update the test to capture the outgoing CompletionRequest (or
the streamed posts from fakeStreamingService) and assert the final bot
continuation contains the tool error text "tool execution failed" (or the
expected error status/result). Specifically, modify or extend
capturingLanguageModel.ChatCompletion to record the passed CompletionRequest (or
inspect fakeStreamingService's captured posts) and add an assertion that the
last bot post / continuation request includes the tool error, ensuring the check
covers bots.NewBot-created bot behavior during continuation.
In `@llm/auto_run_tools_test.go`:
- Around line 436-496: In
TestAutoRunToolsWrapper_DisablesToolsAfterThreeConsecutiveFailures capture the
per-call LanguageModelConfig passed into the wrapped model (extend or use
capturingLLM to record each call's config), then after calling
wrapper.ChatCompletion assert that the last recorded config has ToolsDisabled ==
true (verifying WithToolsDisabled() reached the inner model); reference the
capturingLLM used to wrap inner and the WithToolsDisabled option so you validate
the fourth invocation (inner.callCount / final recorded call) had ToolsDisabled
set.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: eecef550-0b71-454f-aa44-26b2bf9a5b0b
📒 Files selected for processing (5)
conversations/tool_handling.goconversations/tool_handling_test.gollm/auto_run_tools.gollm/auto_run_tools_test.gollm/tool_retry.go
Co-authored-by: Nick Misasi <nick13misasi@gmail.com>
There was a problem hiding this comment.
Stale comment
Security Review: No Issues Found
Reviewed the latest revision (commits
8f237beand3e77f92) for tool error retry/recovery logic. The PR is well-bounded and does not introduce security vulnerabilities:
- Resource exhaustion: Tool retries are capped at
MaxConsecutiveToolCallFailures = 3, after which tools are disabled at the API level viaWithToolsDisabled()(not just a prompt instruction). The existingMaxToolResolutionDepth = 10loop bound also applies in the auto-run path, limiting total tool executions.- Permission model: The continuation path reuses the same already-validated
user,channel, andllmContext. No new data access paths or endpoints are introduced.- Channel isolation: No changes to channel membership checks or data scoping. Tool error results were already stored on post props; this PR only allows the LLM to respond to them.
- Prompt injection surface: The injected system message (
ToolRetryLimitSystemMessage) is a hardcoded constant — no user-controlled data flows into it. Tool error text flowing back to the LLM is the same trust level as tool success results, which is pre-existing behavior.- KV cleanup (new commit
3e77f92): Moving KV cleanup todeferensurestool_result_private/tool_call_privateentries are always removed even when the follow-up LLM call fails, preventing stale KV entry accumulation. This is a correctness improvement.- Auto-run wrapper hardening: The
cfg.ToolsDisabledearly-return inAutoRunToolsWrapper.ChatCompletionprevents entering the auto-run tool loop when tools are already disabled — a defensive improvement.Sent by Cursor Automation: Find vulnerabilities
crspeller
left a comment
There was a problem hiding this comment.
Looks good. One non-blocking observation.
| // CountTrailingFailedToolCalls counts consecutive trailing tool executions that | ||
| // failed. A successful tool execution resets the streak. Posts without executed | ||
| // tool results stop the scan because they represent a new agent turn. | ||
| func CountTrailingFailedToolCalls(posts []Post) int { |
There was a problem hiding this comment.
These look like they could be easily unit tested, but have not unit tests.
There was a problem hiding this comment.
@cursor add unit test coverage for this file. Only write valuable tests, don't overly rely on mocks or go crazy with edge case coverage
There was a problem hiding this comment.
Summary
- Added focused table-driven unit tests for
CountTrailingFailedToolCalls()covering consecutive trailing failures, reset-on-success behavior, and stopping on non-executed tool batches.tool_retry_test.go:12-69 - Added focused table-driven unit tests for
EnsureToolRetryLimitSystemMessage()covering prepend, append-to-existing-system-prompt, and no-op behavior when the retry message already exists.tool_retry_test.go:71-125
Testing
- ✅
go test -v ./llm/... -run '^(TestCountTrailingFailedToolCalls|TestEnsureToolRetryLimitSystemMessage)$'— targeted just the new helper coverage so the tests stay valuable and directly exercise the behavior requested in review.
Co-authored-by: Nick Misasi <nick13misasi@gmail.com>
There was a problem hiding this comment.
Security Review: No Issues Found
Reviewed the full PR (through 03e2753) for the tool error retry/recovery logic. No security vulnerabilities identified:
- Resource exhaustion: Tool retries are hard-capped at
MaxConsecutiveToolCallFailures = 3, enforced server-side viaWithToolsDisabled()(not just a prompt instruction). The existingMaxToolResolutionDepth = 10loop bound also applies in the auto-run path. - Permission model: The continuation path reuses the same already-validated
user,channel, andllmContext. No new data access paths or endpoints are introduced. - Channel isolation: No changes to channel membership checks or data scoping. Tool error results were already stored on post props; this PR only allows the LLM to respond to them.
- Prompt injection surface: The injected system message (
ToolRetryLimitSystemMessage) is a hardcoded constant — no user-controlled data flows into it. Tool error text flowing back to the LLM has the same trust level as tool success results (pre-existing behavior). - System prompt growth:
EnsureToolRetryLimitSystemMessagehas an idempotency check, and both call sites disable tools after injection, preventing recursive prompt growth across iterations. - KV cleanup: Moving KV cleanup to
deferensurestool_result_private/tool_call_privateentries are always removed even when the follow-up LLM call fails — a correctness improvement. - Auto-run wrapper hardening: The
cfg.ToolsDisabledearly-return inAutoRunToolsWrapper.ChatCompletionprevents entering the auto-run tool loop when tools are already disabled.
Sent by Cursor Automation: Find vulnerabilities
There was a problem hiding this comment.
🧹 Nitpick comments (1)
llm/tool_retry_test.go (1)
103-113: Add an idempotency case for prompts that already include the retry message.Current coverage only checks the exact-equality path (
Message == ToolRetryLimitSystemMessage). Add a case where the system prompt already contains the retry message alongside other prompt text to prevent duplicate appends on repeated calls.✅ Proposed test case addition
@@ { name: "returns posts unchanged when retry message already exists", posts: []Post{ {Role: PostRoleSystem, Message: ToolRetryLimitSystemMessage}, {Role: PostRoleUser, Message: "hello"}, }, expected: []Post{ {Role: PostRoleSystem, Message: ToolRetryLimitSystemMessage}, {Role: PostRoleUser, Message: "hello"}, }, }, + { + name: "returns posts unchanged when retry message is already appended to system prompt", + posts: []Post{ + {Role: PostRoleSystem, Message: "base prompt\n\n" + ToolRetryLimitSystemMessage}, + {Role: PostRoleUser, Message: "hello"}, + }, + expected: []Post{ + {Role: PostRoleSystem, Message: "base prompt\n\n" + ToolRetryLimitSystemMessage}, + {Role: PostRoleUser, Message: "hello"}, + }, + }, }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm/tool_retry_test.go` around lines 103 - 113, Add an idempotency test case in llm/tool_retry_test.go that covers a system prompt which already contains ToolRetryLimitSystemMessage as part of a larger string (not exact-equal); create a new test entry in the existing cases array using the Post struct (fields Role and Message) where a system Post.Message contains extra text plus ToolRetryLimitSystemMessage and assert the expected output is unchanged (no duplicate append), referencing ToolRetryLimitSystemMessage and the Post/PostRoleSystem names so the test targets the same code path that appends the retry message.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@llm/tool_retry_test.go`:
- Around line 103-113: Add an idempotency test case in llm/tool_retry_test.go
that covers a system prompt which already contains ToolRetryLimitSystemMessage
as part of a larger string (not exact-equal); create a new test entry in the
existing cases array using the Post struct (fields Role and Message) where a
system Post.Message contains extra text plus ToolRetryLimitSystemMessage and
assert the expected output is unchanged (no duplicate append), referencing
ToolRetryLimitSystemMessage and the Post/PostRoleSystem names so the test
targets the same code path that appends the retry message.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 19be01c7-cff3-4285-bd79-5daa61719827
📒 Files selected for processing (1)
llm/tool_retry_test.go



Summary
This change feeds tool execution errors back into the agent loop instead of terminating immediately when every tool call in a batch fails. It also tracks consecutive failed tool executions and, after the third failure in a row, disables further tool use for that follow-up turn while instructing the agent to explain the latest error and ask the user for guidance.
This follow-up also ensures private tool-result KV entries are always cleaned up, even if the continuation turn itself fails.
Ticket Link
NONE
Screenshots
NONE
Release Note
Summary by CodeRabbit
New Features
Bug Fixes
Tests