Allow agent loop to recover from tool call errors by nickmisasi · Pull Request #592 · mattermost/mattermost-plugin-agents

nickmisasi · 2026-04-03T02:03:20Z

Summary

This change feeds tool execution errors back into the agent loop instead of terminating immediately when every tool call in a batch fails. It also tracks consecutive failed tool executions and, after the third failure in a row, disables further tool use for that follow-up turn while instructing the agent to explain the latest error and ask the user for guidance.

This follow-up also ensures private tool-result KV entries are always cleaned up, even if the continuation turn itself fails.

Ticket Link

NONE

Screenshots

NONE

Release Note

Improved agent tool-calling recovery by returning tool execution errors to the model for up to three consecutive retries before asking the user for guidance.

Summary by CodeRabbit

New Features
- Added a tool retry limit: after 3 consecutive tool failures, tools are disabled and the system asks for guidance.
Bug Fixes
- Treat tool errors as actionable so continuations occur when only errors exist.
- Ensure temporary data is cleaned up even if follow-up processing fails.
- Prevent runaway tool loops by inserting a retry-limit system instruction and disabling tools.
Tests
- Expanded tests for retry-limit behavior, continuation on errors, and cleanup on failure.

Co-authored-by: Nick Misasi <nick13misasi@gmail.com>

github-actions · 2026-04-03T02:05:10Z

🤖 LLM Evaluation Results

OpenAI

⚠️ Overall: 18/19 tests passed (94.7%)

Provider	Total	Passed	Failed	Pass Rate
⚠️ OPENAI	19	18	1	94.7%

❌ Failed Evaluations

Show 1 failures

OPENAI

1. TestReactEval/[openai]_react_cat_message

Score: 0.00
Rubric: The word/emoji is a cat emoji or a heart/love emoji
Reason: The output is the text "heart_eyes_cat", which is not an actual cat emoji or a heart/love emoji character.

Anthropic

⚠️ Overall: 18/19 tests passed (94.7%)

Provider	Total	Passed	Failed	Pass Rate
⚠️ ANTHROPIC	19	18	1	94.7%

❌ Failed Evaluations

Show 1 failures

ANTHROPIC

1. TestReactEval/[anthropic]_react_cat_message

Score: 0.00
Rubric: The word/emoji is a cat emoji or a heart/love emoji
Reason: The output is the text string "heart_eyes_cat", not an actual cat emoji (e.g., 😺) or a heart/love emoji (e.g., ❤️).

This comment was automatically generated by the eval CI pipeline.

cursor

Stale comment

Security Review: No Issues Found

Reviewed the changes for tool error retry/recovery logic. The PR is well-bounded and does not introduce security vulnerabilities:

Resource exhaustion: Tool retries are capped at MaxConsecutiveToolCallFailures = 3, after which tools are disabled at the API level via WithToolsDisabled() (not just a prompt instruction). The existing MaxToolResolutionDepth = 10 also applies.

Permission model: The continuation path reuses the same already-validated user, channel, and llmContext. No new data access paths or endpoints are introduced.

Channel isolation: No changes to channel membership checks or data scoping. Tool error results were already stored on post props; this PR only allows the LLM to respond to them.

Prompt injection surface: The injected system message (ToolRetryLimitSystemMessage) is a hardcoded constant — no user-controlled data flows into it.

Auto-run wrapper hardening: The new cfg.ToolsDisabled early-return in AutoRunToolsWrapper.ChatCompletion is a defensive improvement that prevents entering the auto-run loop when tools are already disabled.

_{Sent by Cursor Automation: Find vulnerabilities}

coderabbitai · 2026-04-06T15:44:58Z

📝 Walkthrough

Walkthrough

Change introduces tooling to detect repeated failed tool-call executions, inject a retry-limit system message, and disable tools for subsequent LLM invocations. Adjustments span conversation tool handling, the auto-run loop, new retry utilities, and tests to validate continuation and KV cleanup.

Changes

Cohort / File(s)	Summary
Tool Retry Module `llm/tool_retry.go`	New file adding `MaxConsecutiveToolCallFailures`, `ToolRetryLimitSystemMessage`, `CountTrailingFailedToolCalls(posts []Post) int`, and `EnsureToolRetryLimitSystemMessage(posts []Post) []Post` to detect trailing tool-call errors and manage the retry-limit system message.
Auto-run / Run Loop `llm/auto_run_tools.go`, `llm/auto_run_tools_test.go`	`ChatCompletion` now also delegates when `cfg.ToolsDisabled` is set. `runToolLoop` uses per-iteration mutable copies of opts and auto-run-tools; when trailing failures reach the max, it injects the retry-limit system message, appends `WithToolsDisabled()` to options, and stops further auto-run resolution. New test verifies behavior after three consecutive failures.
Conversation Tool Handling `conversations/tool_handling.go`, `conversations/tool_handling_test.go`	Continuation gating broadened to treat `ToolCallStatusError` like `ToolCallStatusSuccess` for deciding to continue. `HandleToolResult` defers KV cleanup to ensure deletes on early returns. `completeAndStreamToolResponse` rewrites posts with the retry-limit system message and sets `toolsDisabled = true` when trailing failures hit the limit. Tests updated to assert continuation on errored tool calls and verify KV cleanup when continuation fails; test doubles now capture full LLM completion requests.

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant LLM as LLM
    participant Handler as Tool Handler
    participant Tools as Tool Resolver
    participant KV as KV Store

    rect rgba(100,150,200,0.5)
    User->>LLM: user prompt
    LLM->>Handler: emits tool call events
    Handler->>Tools: resolve/execute tool
    Tools-->>Handler: result (success/error)
    Handler->>LLM: append tool-result post and re-invoke LLM
    end

    rect rgba(200,100,100,0.5)
    Handler->>Handler: CountTrailingFailedToolCalls(posts)
    Handler->>Handler: EnsureToolRetryLimitSystemMessage(posts)
    Handler->>Handler: set toolsDisabled = true
    end

    rect rgba(100,200,100,0.5)
    Handler->>LLM: re-invoke with tools disabled
    LLM-->>User: assistant explains error / requests guidance
    KV->>Handler: deferred KV deletions occur regardless of continuation outcome
    end

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Setup Cloud Test Server

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Allow agent loop to recover from tool call errors' accurately captures the main objective: enabling the agent to handle and recover from tool execution failures rather than terminating immediately.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch cursor/agent-tool-error-retries-e7d3

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

llm/auto_run_tools_test.go (1)
436-496: Assert the fourth invocation actually had tools disabled.

This proves the retry-limit prompt was injected, but it doesn't prove WithToolsDisabled() reached the wrapped model. If that option were dropped and the stubbed fourth response stayed text, this test would still pass. Capturing the per-call LanguageModelConfig and asserting ToolsDisabled == true on the last call would make the regression this test is targeting explicit.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@llm/auto_run_tools_test.go` around lines 436 - 496, In
TestAutoRunToolsWrapper_DisablesToolsAfterThreeConsecutiveFailures capture the
per-call LanguageModelConfig passed into the wrapped model (extend or use
capturingLLM to record each call's config), then after calling
wrapper.ChatCompletion assert that the last recorded config has ToolsDisabled ==
true (verifying WithToolsDisabled() reached the inner model); reference the
capturingLLM used to wrap inner and the WithToolsDisabled option so you validate
the fourth invocation (inner.callCount / final recorded call) had ToolsDisabled
set.
conversations/tool_handling_test.go (1)
1063-1066: Verify the continuation request actually contains the tool error.

This test only checks that a follow-up post was streamed. Because capturingLanguageModel.ChatCompletion ignores CompletionRequest, it won't catch a regression where the follow-up turn runs with redacted or empty tool data instead of "tool execution failed". Capturing the request posts and asserting the last bot post carries the error status/result would cover the core behavior directly.

Also applies to: 1129-1140
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@conversations/tool_handling_test.go` around lines 1063 - 1066, The test
currently only verifies a follow-up post was streamed but doesn't assert that
the continuation request actually includes the tool error because
capturingLanguageModel.ChatCompletion ignores the CompletionRequest; update the
test to capture the outgoing CompletionRequest (or the streamed posts from
fakeStreamingService) and assert the final bot continuation contains the tool
error text "tool execution failed" (or the expected error status/result).
Specifically, modify or extend capturingLanguageModel.ChatCompletion to record
the passed CompletionRequest (or inspect fakeStreamingService's captured posts)
and add an assertion that the last bot post / continuation request includes the
tool error, ensuring the check covers bots.NewBot-created bot behavior during
continuation.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@conversations/tool_handling.go`:
- Around line 437-443: completeAndStreamToolResponse currently treats
ToolCallStatusError as actionable, which can cause an early return before
cleaning up private KV entries; ensure c.deleteToolCallKVEntries(post.Id,
resultKVKey, toolCallKVKey) always runs even on error-only batches by invoking
the cleanup unconditionally (move the call before any early return or use defer)
after computing hasActionableResult; reference symbols:
completeAndStreamToolResponse, hasActionableResult, llm.ToolCallStatusError, and
c.deleteToolCallKVEntries to locate and update the code so tool_result_private /
tool_call_private are always removed.

---

Nitpick comments:
In `@conversations/tool_handling_test.go`:
- Around line 1063-1066: The test currently only verifies a follow-up post was
streamed but doesn't assert that the continuation request actually includes the
tool error because capturingLanguageModel.ChatCompletion ignores the
CompletionRequest; update the test to capture the outgoing CompletionRequest (or
the streamed posts from fakeStreamingService) and assert the final bot
continuation contains the tool error text "tool execution failed" (or the
expected error status/result). Specifically, modify or extend
capturingLanguageModel.ChatCompletion to record the passed CompletionRequest (or
inspect fakeStreamingService's captured posts) and add an assertion that the
last bot post / continuation request includes the tool error, ensuring the check
covers bots.NewBot-created bot behavior during continuation.

In `@llm/auto_run_tools_test.go`:
- Around line 436-496: In
TestAutoRunToolsWrapper_DisablesToolsAfterThreeConsecutiveFailures capture the
per-call LanguageModelConfig passed into the wrapped model (extend or use
capturingLLM to record each call's config), then after calling
wrapper.ChatCompletion assert that the last recorded config has ToolsDisabled ==
true (verifying WithToolsDisabled() reached the inner model); reference the
capturingLLM used to wrap inner and the WithToolsDisabled option so you validate
the fourth invocation (inner.callCount / final recorded call) had ToolsDisabled
set.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: eecef550-0b71-454f-aa44-26b2bf9a5b0b

📥 Commits

Reviewing files that changed from the base of the PR and between 605e0db and 8f237be.

📒 Files selected for processing (5)

conversations/tool_handling.go
conversations/tool_handling_test.go
llm/auto_run_tools.go
llm/auto_run_tools_test.go
llm/tool_retry.go

Co-authored-by: Nick Misasi <nick13misasi@gmail.com>

cursor

Stale comment

Security Review: No Issues Found

Reviewed the latest revision (commits 8f237be and 3e77f92) for tool error retry/recovery logic. The PR is well-bounded and does not introduce security vulnerabilities:

Resource exhaustion: Tool retries are capped at MaxConsecutiveToolCallFailures = 3, after which tools are disabled at the API level via WithToolsDisabled() (not just a prompt instruction). The existing MaxToolResolutionDepth = 10 loop bound also applies in the auto-run path, limiting total tool executions.

Permission model: The continuation path reuses the same already-validated user, channel, and llmContext. No new data access paths or endpoints are introduced.

Channel isolation: No changes to channel membership checks or data scoping. Tool error results were already stored on post props; this PR only allows the LLM to respond to them.

Prompt injection surface: The injected system message (ToolRetryLimitSystemMessage) is a hardcoded constant — no user-controlled data flows into it. Tool error text flowing back to the LLM is the same trust level as tool success results, which is pre-existing behavior.

KV cleanup (new commit 3e77f92): Moving KV cleanup to defer ensures tool_result_private / tool_call_private entries are always removed even when the follow-up LLM call fails, preventing stale KV entry accumulation. This is a correctness improvement.

Auto-run wrapper hardening: The cfg.ToolsDisabled early-return in AutoRunToolsWrapper.ChatCompletion prevents entering the auto-run tool loop when tools are already disabled — a defensive improvement.

_{Sent by Cursor Automation: Find vulnerabilities}

crspeller

Looks good. One non-blocking observation.

crspeller · 2026-04-06T17:42:25Z

+// CountTrailingFailedToolCalls counts consecutive trailing tool executions that
+// failed. A successful tool execution resets the streak. Posts without executed
+// tool results stop the scan because they represent a new agent turn.
+func CountTrailingFailedToolCalls(posts []Post) int {


These look like they could be easily unit tested, but have not unit tests.

@cursor add unit test coverage for this file. Only write valuable tests, don't overly rely on mocks or go crazy with edge case coverage

Summary

Added focused table-driven unit tests for CountTrailingFailedToolCalls() covering consecutive trailing failures, reset-on-success behavior, and stopping on non-executed tool batches. tool_retry_test.go:12-69

Added focused table-driven unit tests for EnsureToolRetryLimitSystemMessage() covering prepend, append-to-existing-system-prompt, and no-op behavior when the retry message already exists. tool_retry_test.go:71-125

Testing

✅ go test -v ./llm/... -run '^(TestCountTrailingFailedToolCalls|TestEnsureToolRetryLimitSystemMessage)$' — targeted just the new helper coverage so the tests stay valuable and directly exercise the behavior requested in review.

Co-authored-by: Nick Misasi <nick13misasi@gmail.com>

cursor

Security Review: No Issues Found

Reviewed the full PR (through 03e2753) for the tool error retry/recovery logic. No security vulnerabilities identified:

Resource exhaustion: Tool retries are hard-capped at MaxConsecutiveToolCallFailures = 3, enforced server-side via WithToolsDisabled() (not just a prompt instruction). The existing MaxToolResolutionDepth = 10 loop bound also applies in the auto-run path.
Permission model: The continuation path reuses the same already-validated user, channel, and llmContext. No new data access paths or endpoints are introduced.
Channel isolation: No changes to channel membership checks or data scoping. Tool error results were already stored on post props; this PR only allows the LLM to respond to them.
Prompt injection surface: The injected system message (ToolRetryLimitSystemMessage) is a hardcoded constant — no user-controlled data flows into it. Tool error text flowing back to the LLM has the same trust level as tool success results (pre-existing behavior).
System prompt growth: EnsureToolRetryLimitSystemMessage has an idempotency check, and both call sites disable tools after injection, preventing recursive prompt growth across iterations.
KV cleanup: Moving KV cleanup to defer ensures tool_result_private / tool_call_private entries are always removed even when the follow-up LLM call fails — a correctness improvement.
Auto-run wrapper hardening: The cfg.ToolsDisabled early-return in AutoRunToolsWrapper.ChatCompletion prevents entering the auto-run tool loop when tools are already disabled.

_{Sent by Cursor Automation: Find vulnerabilities}

coderabbitai

🧹 Nitpick comments (1)

llm/tool_retry_test.go (1)

103-113: Add an idempotency case for prompts that already include the retry message.

Current coverage only checks the exact-equality path (Message == ToolRetryLimitSystemMessage). Add a case where the system prompt already contains the retry message alongside other prompt text to prevent duplicate appends on repeated calls.

✅ Proposed test case addition

@@
 		{
 			name: "returns posts unchanged when retry message already exists",
 			posts: []Post{
 				{Role: PostRoleSystem, Message: ToolRetryLimitSystemMessage},
 				{Role: PostRoleUser, Message: "hello"},
 			},
 			expected: []Post{
 				{Role: PostRoleSystem, Message: ToolRetryLimitSystemMessage},
 				{Role: PostRoleUser, Message: "hello"},
 			},
 		},
+		{
+			name: "returns posts unchanged when retry message is already appended to system prompt",
+			posts: []Post{
+				{Role: PostRoleSystem, Message: "base prompt\n\n" + ToolRetryLimitSystemMessage},
+				{Role: PostRoleUser, Message: "hello"},
+			},
+			expected: []Post{
+				{Role: PostRoleSystem, Message: "base prompt\n\n" + ToolRetryLimitSystemMessage},
+				{Role: PostRoleUser, Message: "hello"},
+			},
+		},
 	}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@llm/tool_retry_test.go` around lines 103 - 113, Add an idempotency test case
in llm/tool_retry_test.go that covers a system prompt which already contains
ToolRetryLimitSystemMessage as part of a larger string (not exact-equal); create
a new test entry in the existing cases array using the Post struct (fields Role
and Message) where a system Post.Message contains extra text plus
ToolRetryLimitSystemMessage and assert the expected output is unchanged (no
duplicate append), referencing ToolRetryLimitSystemMessage and the
Post/PostRoleSystem names so the test targets the same code path that appends
the retry message.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@llm/tool_retry_test.go`:
- Around line 103-113: Add an idempotency test case in llm/tool_retry_test.go
that covers a system prompt which already contains ToolRetryLimitSystemMessage
as part of a larger string (not exact-equal); create a new test entry in the
existing cases array using the Post struct (fields Role and Message) where a
system Post.Message contains extra text plus ToolRetryLimitSystemMessage and
assert the expected output is unchanged (no duplicate append), referencing
ToolRetryLimitSystemMessage and the Post/PostRoleSystem names so the test
targets the same code path that appends the retry message.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 19be01c7-cff3-4285-bd79-5daa61719827

📥 Commits

Reviewing files that changed from the base of the PR and between 3e77f92 and 03e2753.

📒 Files selected for processing (1)

llm/tool_retry_test.go

Allow agent tool retries on tool errors

8f237be

Co-authored-by: Nick Misasi <nick13misasi@gmail.com>

nickmisasi marked this pull request as ready for review April 6, 2026 15:32

cursor Bot reviewed Apr 6, 2026

View reviewed changes

coderabbitai Bot requested changes Apr 6, 2026

View reviewed changes

Comment thread conversations/tool_handling.go

Ensure tool result KV cleanup on follow-up errors

3e77f92

Co-authored-by: Nick Misasi <nick13misasi@gmail.com>

coderabbitai Bot approved these changes Apr 6, 2026

View reviewed changes

cursor Bot reviewed Apr 6, 2026

View reviewed changes

nickmisasi requested a review from crspeller April 6, 2026 17:34

crspeller approved these changes Apr 6, 2026

View reviewed changes

Add unit tests for tool retry helpers

03e2753

Co-authored-by: Nick Misasi <nick13misasi@gmail.com>

cursor Bot reviewed Apr 6, 2026

View reviewed changes

coderabbitai Bot reviewed Apr 6, 2026

View reviewed changes

nickmisasi merged commit 6ddcfde into master Apr 6, 2026
35 checks passed

nickmisasi deleted the cursor/agent-tool-error-retries-e7d3 branch April 6, 2026 18:14

coderabbitai Bot mentioned this pull request Apr 10, 2026

Restrict bot-triggered channel tools to auto-run everywhere #611

Merged

Conversation

nickmisasi commented Apr 3, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Ticket Link

Screenshots

Release Note

Summary by CodeRabbit

Uh oh!

github-actions Bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🤖 LLM Evaluation Results

OpenAI

❌ Failed Evaluations

OPENAI

Anthropic

❌ Failed Evaluations

ANTHROPIC

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated Code Review Effort

Suggested labels

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

crspeller left a comment

Choose a reason for hiding this comment

Uh oh!

crspeller Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

nickmisasi Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Summary

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nickmisasi commented Apr 3, 2026 •

edited by coderabbitai Bot

Loading

github-actions Bot commented Apr 3, 2026 •

edited

Loading

cursor Bot left a comment •

edited

Loading

coderabbitai Bot commented Apr 6, 2026 •

edited

Loading

cursor Bot left a comment •

edited

Loading

cursor Bot Apr 6, 2026 •

edited

Loading