fix(hook): PendingToolRecoveryHook false positive on historical ToolResultBlocks in multi-turn (#1555)#1566
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Design note: why this approach vs suggested alternativesThe issue description proposed two options: Option A (filter by role): filter out ASSISTANT/TOOL messages, then check for ToolResultBlock. This is fragile — in HITL scenarios, user-provided ToolResultBlocks also carry TOOL roles, so the filter does not reliably distinguish historical results from user-provided results. Option B (expose snapshotSize on PreCallEvent): requires an API change to PreCallEvent, which is more invasive. This PR approach (match pendingIds): directly checks whether any ToolResultBlock ID matches a pending tool call. Historical results are never in pendingIds, so no false positive. HITL results for a pending call DO match, so they correctly defer to doCall(). No API change, 3 lines modified. |
AgentScopeJavaBot
left a comment
There was a problem hiding this comment.
🤖 AI Review
The fix is precise and correctly addresses the multi-turn false positive described in #1555. By narrowing the HITL guard from "any ToolResultBlock present" to "ToolResultBlock id ∈ pendingIds", the hook now correctly distinguishes user-supplied results for the currently pending tool call from historical results already resolved in earlier turns. Since pendingIds is built from ToolUseBlock ids on the last assistant message that have no matching result, historical (already-resolved) ids cannot collide with it, so the new condition is logically tight. The HITL deferral path remains intact: when the user sends a matching id the hook still returns the event unmodified for ReActAgent.doCall() to consume. No new null-safety risk of practical concern — pendingIds is sourced from non-null ToolUseBlock::getId, and getContentBlocks always returns a non-null list. Test coverage is thorough and well-partitioned (core bug, HITL, unrelated id, single-turn regression, multi-pending). No issues found that would block merge.
| this.executed = executed; | ||
| } | ||
|
|
||
| @Tool(name = "test_tool", description = "A test tool") |
There was a problem hiding this comment.
[nit] createConditionalStopHook(counter, stopAt) couples the test to internal PostReasoningEvent firing order (the stopAt=3/stopAt=7 magic numbers). It works today but will be brittle if reasoning-event semantics evolve. Consider stopping based on a structural signal (e.g. once a specific ToolUseBlock id appears) in a follow-up to make these tests more refactor-resilient. Non-blocking.
| inputMessages.stream().anyMatch(m -> m.hasContentBlocks(ToolResultBlock.class)); | ||
| inputMessages.stream() | ||
| .flatMap(m -> m.getContentBlocks(ToolResultBlock.class).stream()) | ||
| .anyMatch(tr -> pendingIds.contains(tr.getId())); |
There was a problem hiding this comment.
[nit] Defensive consideration only (not a real bug today): ToolResultBlock.getId() is documented as nullable, and Set#contains(null) is a legal lookup. In normal flows pendingIds is sourced from non-null ToolUseBlock ids so a null result-id cannot match, but if a malformed user message ever ships a null-id ToolResultBlock the hook will silently ignore it (treat as not-HITL and auto-patch). That is arguably the safer behavior, so no change required — flagging only so the assumption is explicit.
|
FYI #1409 |
|
Noticed that |
|
Checked the main branch — If this file is still used somewhere (e.g. on a release branch), happy to rebase. Otherwise feel free to close. |
|
PTAL @chickenlj |
Summary
PendingToolRecoveryHookis silently non-functional in multi-turn conversations.Root Cause
AgentBase.notifyPreCall()(line 672-678) merges the full memory snapshot with new callArgs intoPreCallEvent.inputMessages. The hook's guard condition at line 110 checks this merged list for anyToolResultBlock:In multi-turn scenarios, memory always contains ToolResultBlocks from earlier successful tool calls, so userProvidedResults is always true — the hook always skips recovery, and ReActAgent.doCall() throws IllegalStateException.
Fix
Narrow the check to only match ToolResultBlock IDs present in the pendingIds set:
Historical ToolResultBlocks (e.g. call_1) are for already-resolved tool calls, so their IDs are never in pendingIds — no false positive. HITL results (user-provided ToolResultBlock for a pending call) do match pendingIds, so the hook correctly defers to doCall().
Why existing tests pass
HookStopAgentTest.testNewMsgWithPendingToolUseContinuesActing only covers single-turn: memory has no historical ToolResultBlock, so the false positive never triggers.
Closes #1555
Test plan