
fix(core,cli): close tool_use↔tool_result invariant across all failure paths#4176

Open
wenshao wants to merge 4 commits into QwenLM:main from wenshao:fix/persist-partial-tool-use-on-stream-error

Conversation

@wenshao
Collaborator

@wenshao wenshao commented May 15, 2026

Summary

Fixes an unrecoverable wedge on weak-network connections (e.g. on a train) when calling DeepSeek-V4-Pro (or any Anthropic-compatible backend) through the Anthropic protocol. The same wedge reproduces under three additional failure modes surfaced in review; all are closed in this PR.

Symptom — the model 400s with:

unexpected messages.N.content.0: tool_use_id found in tool_result blocks:
call_00_xxx. Each tool_result block must have a corresponding tool_use
block in the previous message.

and pressing Ctrl+Y to retry yields the same error indefinitely. The chat session is permanently wedged.

Root cause — mid-stream tool_use drop

The Anthropic-compatible SSE emits tool_use content blocks as content_block_start → input_json_delta* → content_block_stop → message_delta → message_stop. In anthropicContentGenerator.processStream, the functionCall chunk is yielded at content_block_stop.

When the SSE drops between content_block_stop of a tool_use and the terminal message_stop (typical for flaky train/airplane WiFi):

  1. The functionCall chunk has already been yielded → Turn.run registers a ToolCallRequest event → useGeminiStream schedules the tool after the for-await exits.
  2. But processStreamResponse's this.history.push({role:'model'…}) never runs — the for-await throws before reaching it.
  3. The tool finishes asynchronously → handleCompletedTools → submitQuery(…, ToolResult) → geminiChat.sendMessageStream pushes user:[functionResponse] into history.
  4. History is now […, user:"query", user:[tool_result]]. The converter emits two consecutive user messages where the second carries tool_result blocks but the preceding message has no matching tool_use → 400.
  5. stripOrphanedUserEntriesFromHistory only pops trailing user entries. The lost tool_use is unrecoverable, and the chat is wedged for the rest of the session.
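The broken shape from step 4 can be sketched with simplified stand-in types (the real `Content`/`Part` types come from the SDK; the checker below is a hypothetical mirror of the server-side invariant, not code from this repo):

```typescript
// Trimmed stand-ins for the SDK's Content/Part shapes (assumption:
// the real types carry more fields; only what matters here is kept).
type Part = {
  text?: string;
  functionCall?: { id: string; name: string };
  functionResponse?: { id: string; name: string };
};
type Content = { role: 'user' | 'model'; parts: Part[] };

// Mirrors the server-side rule: every functionResponse id in a user
// turn must appear as a functionCall id in the previous model turn.
function findOrphanToolResults(history: Content[]): string[] {
  const orphans: string[] = [];
  history.forEach((turn, i) => {
    if (turn.role !== 'user') return;
    const prev = history[i - 1];
    const priorIds = new Set(
      prev?.role === 'model'
        ? prev.parts.flatMap((p) =>
            p.functionCall?.id ? [p.functionCall.id] : [],
          )
        : [],
    );
    for (const p of turn.parts) {
      if (p.functionResponse && !priorIds.has(p.functionResponse.id)) {
        orphans.push(p.functionResponse.id);
      }
    }
  });
  return orphans;
}

// The wedged history: the model turn carrying the tool_use was never
// pushed, so the tool_result's id has no match → the API returns 400.
const wedged: Content[] = [
  { role: 'user', parts: [{ text: 'query' }] },
  {
    role: 'user',
    parts: [{ functionResponse: { id: 'call_00_xxx', name: 'read_file' } }],
  },
];
```

Running the checker on `wedged` flags `call_00_xxx`; inserting the lost model turn with the matching functionCall makes the history clean again.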

Fixes

1. Persist partial assistant turn when stream errors mid tool_use

processStreamResponse (packages/core/src/core/geminiChat.ts) wraps its for-await in a try/catch. On stream error, if any functionCall chunk was already yielded (hasToolCall === true) and we accumulated parts, persist the partial assistant turn into history before re-throwing — so the eventual tool_result submission has its matching tool_use.

Plain-text partial turns (no functionCall) are intentionally not persisted: the Retry path pops the trailing user prompt and re-issues it; a stale partial-text model turn between them would either bias the retry or surface as duplicate output.

The shared processStreamResponse covers both Anthropic and OpenAI content generators, so the fix applies to all anthropic-compatible and openai-compatible providers — not just DeepSeek.
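The persist-on-error pattern can be sketched as follows. This is a minimal simulation, not the real processStreamResponse — the chunk/turn shapes and the `consume`/`droppedStream` names are simplified assumptions for illustration:

```typescript
// Simplified shapes; the real code consolidates parts and thinking
// blocks before pushing.
type Part = { text?: string; functionCall?: { id: string; name: string } };
type Turn = { role: 'user' | 'model'; parts: Part[] };

async function consume(
  stream: AsyncGenerator<Part>,
  history: Turn[],
): Promise<void> {
  let hasToolCall = false;
  const parts: Part[] = [];
  let streamError: unknown = null;
  try {
    for await (const chunk of stream) {
      if (chunk.functionCall) hasToolCall = true;
      parts.push(chunk);
    }
  } catch (e) {
    streamError = e; // captured, not swallowed — re-thrown below
  }
  if (streamError !== null) {
    // Persist the partial model turn only when a tool_use already
    // escaped to consumers; text-only partials are dropped on purpose.
    if (hasToolCall && parts.length > 0) {
      history.push({ role: 'model', parts });
    }
    throw streamError;
  }
  history.push({ role: 'model', parts });
}

// SSE drop between content_block_stop and message_stop: the
// functionCall chunk is yielded, then the connection dies.
async function* droppedStream(): AsyncGenerator<Part> {
  yield { functionCall: { id: 'call_00_xxx', name: 'read_file' } };
  throw new Error('ECONNRESET');
}
```

After `consume(droppedStream(), history)` rejects, the history still ends with a model turn carrying the tool_use, so the scheduler's eventual tool_result submission has its pair.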

2. Close tool_use ↔ tool_result invariant at every failure point

Surfaced in @yiliang114's review: the partial-history push above relies on the React tool scheduler eventually producing a matching tool_result. Three races break that contract:

  • Race A — Ctrl+Y while the in-flight tool hasn't finished. Trailing entry is model[tool_use], not user, so stripOrphanedUserEntriesFromHistory doesn't pop it. Retry pushes a fresh user turn after the orphan tool_use. Meanwhile the scheduler's onAllToolCallsComplete is single-shot AND gated on isResponding (useGeminiStream.ts:1971), so the eventual real tool_result is silently swallowed.
  • Race B — process crash / OOM / SIGKILL between the partial-tool_use push and tool completion. The dangling model[tool_use] wedges the first API call after --resume.
  • Race C — manual JSONL edits / external tooling leaving the same dangling shape.

Three pieces working together close every race:

  1. repairOrphanedToolUseTurns(history) in geminiChat.ts — walks history left-to-right and synthesizes an error-typed functionResponse for every functionCall whose id isn't echoed back in the next user turn. Appends to an existing user turn when present, otherwise inserts a new one.
  2. Wiring at two points:
    • client.startChat() after loading the transcript — covers Race B/C (--resume).
    • chat.sendMessageStream immediately after this.history.push(userContent) — covers Race A in-session. The user-supplied turn gets the first chance to close the pair before the synthesizer sees it as dangling, so a Retry of a previous ToolResult submission (lastPrompt is a functionResponse-parts array) lands the real functionResponse instead of a synthetic error.
  3. History-based dedup in handleCompletedTools (useGeminiStream.ts) — before submitting tool results, scan chat.history for any functionResponse.id already present (planted by the repair pass) and drop those toolCalls' responseParts from the wire payload while still markToolsAsSubmitted so the UI/scheduler advances. The dedup runs before the isResponding early-return so the scheduler's single-shot onAllToolCallsComplete doesn't leave the tool stuck in completed-but-not-submitted when a Retry stream is in flight.
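The dedup in piece 3 is essentially a two-way split over completed tools. A minimal sketch (shapes simplified; `splitForSubmission` is a hypothetical name — the real logic lives inline in handleCompletedTools):

```typescript
// Simplified stand-in for the scheduler's completed-tool records.
type ToolCall = { request: { callId: string }; responseParts: unknown };

// callIds already paired in history (planted by the repair pass) are
// dropped from the wire payload but still reported so the caller can
// markToolsAsSubmitted and let the single-shot scheduler advance.
function splitForSubmission(
  completedTools: ToolCall[],
  historyResponseIds: Set<string>,
): { toSubmit: ToolCall[]; toMarkOnly: string[] } {
  const toMarkOnly = completedTools
    .filter((tc) => historyResponseIds.has(tc.request.callId))
    .map((tc) => tc.request.callId);
  const toSubmit = completedTools.filter(
    (tc) => !historyResponseIds.has(tc.request.callId),
  );
  return { toSubmit, toMarkOnly };
}
```

With a synthetic functionResponse for `call_a` already in history, only `call_b` goes on the wire while `call_a` is merely marked submitted.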

This is the qwen-code analogue of Claude Code's yieldMissingToolResultBlocks (query.ts:123-149), split across the core/cli boundary because our React scheduler runs out-of-band from the stream loop. Upstream's in-band StreamingToolExecutor can atomically .discard() live tools at the synthesis point; we use history-dedup at handleCompletedTools instead, so the in-flight scheduler's late real result cannot collide with the synthesized error.

Test plan

  • npx vitest run packages/core/src/core/geminiChat.test.ts — 90 tests pass (3 partial-history-push cases + 8 repair-helper unit cases + 2 chat.sendMessageStream integration cases)
  • npx vitest run packages/core/src/core/client.test.ts — 131 tests pass (Retry path mocks updated for new method)
  • npx vitest run packages/cli/src/ui/hooks/useGeminiStream.test.tsx — 92 tests pass (Race A end-to-end dedup integration test)
  • npx vitest run --project '@qwen-code/qwen-code-core' — all 8186 core tests pass
  • tsc --noEmit -p packages/core/tsconfig.json clean
  • CI green on all four commits (Lint + Test×3 platforms + CodeQL)
  • Manual: throttle network, trigger a tool_use response, kill the connection mid-stream, confirm the next user/tool_result submission succeeds instead of 400ing. Also verified Ctrl+Y mid-flight and --resume of a session that crashed mid-flight.

Related

Closes #4178

…_use

Weak-network failures during an Anthropic-compatible stream (DeepSeek,
api.anthropic.com, etc.) can drop the SSE between a tool_use
`content_block_stop` and the terminal `message_stop`. The functionCall
chunk is already yielded at content_block_stop, so:

  - Turn.run records a ToolCallRequest event.
  - useGeminiStream's for-await exits and schedules the tool.
  - handleCompletedTools eventually fires submitQuery(..., ToolResult)
    and pushes a user[functionResponse] into history.
  - But processStreamResponse's history.push for the model turn never
    ran (the for-await threw first), so the matching tool_use is gone.

The next request body has `user → user[tool_result]` with no tool_use
in between, and the server rejects with HTTP 400:
"tool_use_id ... must have a corresponding tool_use block in the
previous message". Ctrl+Y can't recover because
stripOrphanedUserEntriesFromHistory only strips trailing user
entries — the lost tool_use is unrecoverable, and the session is
wedged.

Wrap the for-await loop in processStreamResponse with try/catch.
When the stream throws AND any functionCall chunk was already
yielded (hasToolCall=true), persist the partial assistant turn to
history before re-throwing. The eventual tool_result submission
then has a matching tool_use and the session can continue.

Plain-text partial turns (no functionCall yielded) are intentionally
NOT persisted: the Retry path pops the trailing user prompt and
re-issues it, so a stale partial-text model turn between them would
either bias the retry or surface as duplicate output.
@github-actions
Contributor

📋 Review Summary

This PR fixes a critical unrecoverable wedge that occurs when streaming connections drop mid-tool_use on weak networks. The fix ensures partial assistant turns containing tool calls are persisted to history before re-throwing stream errors, maintaining the tool_use/tool_result pairing invariant required by Anthropic-compatible APIs. The implementation is well-reasoned with comprehensive tests and excellent documentation.

🔍 General Feedback

  • Exceptional documentation: The code comments and PR description provide outstanding context about the root cause, including the specific Anthropic SSE emission pattern (content_block_start → input_json_delta* → content_block_stop → message_delta → message_stop)
  • Targeted fix: The change is surgical—only persisting partial turns when hasToolCall === true, deliberately skipping text-only partials to avoid biasing retry behavior
  • Shared code path benefit: The fix in processStreamResponse automatically covers both Anthropic and OpenAI content generators
  • Test coverage: Two new test cases precisely validate both branches of the new error handling logic

🎯 Specific Feedback

🟡 High

  • File: packages/core/src/core/geminiChat.ts:1320 - The streamError capture pattern is sound, but consider whether the caught error should be logged before re-throwing for debugging purposes. Weak-network issues are notoriously hard to reproduce, and having structured logs when this recovery path triggers would aid future debugging.

    Suggested fix:

      } catch (e) {
        streamError = e;
        // Log for debugging weak-network recovery path
        if (hasToolCall) {
          debugLogger.log('Persisting partial assistant turn due to stream error');
        }
      }

🟢 Medium

  • File: packages/core/src/core/geminiChat.ts:1257-1268 - The comment explaining streamError is extremely detailed (which is good), but it's positioned before the variable is actually used. Consider moving this extensive explanation to the recovery branch at line 1383 where the actual decision logic lives, and keep a shorter note here.

    Suggested restructuring:

      let streamError: unknown = null;  // Captured mid-stream; see recovery logic below
    
      try {
        // ... loop unchanged ...
      } catch (e) {
        streamError = e;
      }
    
      // ... later, at the recovery branch ...
      /**
       * Mid-stream failure recovery: [full explanation currently at line 1257]
       */
      if (streamError !== null) { ... }
  • File: packages/core/src/core/geminiChat.test.ts:702-760 - The test "persists partial assistant turn when stream throws after a tool_use chunk" is thorough but could benefit from explicitly asserting that the history length is exactly 2 (user + model) after the error, making the test's intent even clearer to future readers.

    Suggested addition (after line 758):

        expect(history.length).toBe(2);  // user turn + model turn with tool_use

🔵 Low

  • File: packages/core/src/core/geminiChat.ts:1383 - The recovery comment block is ~20 lines. While the detail is valuable, consider extracting this into a private method like shouldPersistPartialTurn() with an inline JSDoc. This would make the main flow easier to scan and the recovery logic independently testable.

  • File: packages/core/src/core/geminiChat.test.ts - Consider adding a test case that simulates the full retry flow (Ctrl+Y scenario) to verify that the persisted tool_use allows the subsequent ToolResult submission to succeed rather than 400ing. This would be an integration-style test complementing the current unit tests.

✅ Highlights

  • Brilliant root cause analysis: The PR description's explanation of the Anthropic SSE emission pattern and how the drop between content_block_stop and message_stop causes the wedge is exemplary technical writing
  • Discriminating fix: The deliberate choice to persist tool_use partials but NOT text-only partials shows deep understanding of the retry path's behavior and avoids introducing duplicate output issues
  • Comprehensive test coverage: The two new test cases cover both branches of the new logic with realistic stream simulations
  • Cross-provider benefit: Fixing the shared processStreamResponse method means Anthropic, OpenAI, and DeepSeek all benefit without duplicate changes
  • Clear failure mode documentation: The comment explaining why stripOrphanedUserEntries can't recover from this scenario (it only strips trailing user entries, not resurrect lost tool_use blocks) is invaluable for future maintainers

…ry fix

Adds a third case to the partial-history persistence test suite for
reasoning-mode providers (DeepSeek thinking, Claude 4.6+ adaptive): when
the assistant turn streams a thinking block AND a tool_use before the
SSE drops, the partial push must keep the thinking part before the
functionCall so DeepSeek's `injectThinkingOnToolUseTurns` converter
pass sees an existing block on the replayed turn and does not prepend
a synthetic empty one (which would discard the model's original
reasoning text).
Contributor

Copilot AI left a comment


Pull request overview

Fixes a wedge state on weak networks where Anthropic-compatible streams (DeepSeek, Anthropic, etc.) drop after a tool_use content_block_stop but before message_stop. The yielded functionCall chunk causes downstream tool execution and a follow-up tool_result user turn, but no matching tool_use was ever pushed into history, leading to permanent 400s. The fix wraps the stream loop in processStreamResponse with try/catch and persists the partial assistant turn into history when a tool call has already been yielded before re-throwing.

Changes:

  • Capture mid-stream errors in processStreamResponse instead of letting them abort post-loop processing.
  • When hasToolCall is true on error, push the partial assistant turn (thinking + consolidated parts) into history before re-throwing, preserving tool_use/tool_result pairing.
  • Add three vitest cases covering tool-only, thinking+tool, and text-only mid-stream failures.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
packages/core/src/core/geminiChat.ts Adds try/catch around stream iteration and a recovery branch that conditionally persists the partial model turn before re-throwing.
packages/core/src/core/geminiChat.test.ts Adds three new tests verifying partial-turn persistence with tool_use, with thinking+tool, and non-persistence for text-only failures.


@@ -1369,6 +1383,46 @@ export class GeminiChat {
});
Collaborator


Non-blocking follow-up.

This persists the partial tool_use and relies on handleCompletedTools to eventually push a matching tool_result. Two paths re-create the same wedge one layer down:

  1. Ctrl+Y before the tool finishes — history is [user, model(tool_use)], no trailing user entry for stripOrphanedUserEntriesFromHistory to strip. The retry sends the orphan tool_use and 400s with the same error this PR is trying to escape.
  2. Scheduler aborts / process killed before submitQuery(ToolResult) fires — the persisted tool_use becomes a permanent orphan. Same wedge.

The stronger invariant is to close the pair at the failure point itself: for every already-yielded tool_use, synthesize an is_error: true tool_result before re-throwing, instead of betting on a downstream consumer. anthropics/claude-code does this in src/query.ts (yieldMissingToolResultBlocks, ~L123) and calls it from both the stream-error and the user-abort paths — the pair is never out of sync.

Current fix recovers the common case, so not blocking. Worth a follow-up to also cover the retry race and the abort path symmetrically.

Collaborator Author


Confirmed both paths reproduce — handleCompletedTools is gated on isResponding (useGeminiStream.ts:1971) and useReactToolScheduler.allToolCallsCompleteHandler is single-shot (useReactToolScheduler.ts:127-132), so a Ctrl+Y that lands between the partial-push and tool completion does silently swallow the eventual tool_result. The abort-before-submit path is the same shape.

(Worth noting: this race exists pre-#4176 too — any Ctrl+Y while a tool is still running on a normally-completed stream hits the same wedge. This PR moves the partial-tool_use case into the same race class, but doesn't widen it.)

On yieldMissingToolResultBlocks shape — straight port has a corner: if we synthesize is_error: true at the failure point AND the live scheduler later completes, handleCompletedTools submits a second tool_result with the same tool_use_id, producing two consecutive user turns (orphan tool_result in the trailing one). Upstream sidesteps this because their StreamingToolExecutor is owned by the same loop that does the synthesis — it can discard() in-flight tools atomically (query.ts:733). Qwen-code's React scheduler runs out-of-band, so the synthesis path needs to coordinate cancellation/dedup with it.

Folded both races + the cancel-coordination requirement into #4178 — that's where the closing-at-failure-point work belongs. Will tackle it as the next PR; #4176 stays minimal to ship the common-case recovery now.

Collaborator Author


Folded both races + the cancel-coordination requirement into this PR in 3de3241a2 rather than splitting to #4178. Three pieces:

  • repairOrphanedToolUseTurns(history) synthesizes error functionResponse for every dangling functionCall.id — close to upstream's yieldMissingToolResultBlocks shape.
  • Wired into startChat() (Race B/C — --resume of a crashed session) and the sendMessageStream Retry branch after stripOrphanedUserEntriesFromHistory (Race A — Ctrl+Y mid-flight).
  • handleCompletedTools dedupes against chat.history before submitting tool_results — that's the cross-layer coordination piece you flagged. Since our React scheduler runs out-of-band (vs. upstream's in-band StreamingToolExecutor that can .discard() atomically at synthesis), the synthesizer plants the synthetic functionResponse first and the scheduler drops its late real result when it sees a matching callId already in history. Same trade-off upstream's .discard() + yieldMissingToolResultBlocks makes — the model sees the synthetic error and can retry the tool if it still wants the result.

Tests: 8 new repair-helper cases in geminiChat.test.ts (Race A merge-into-existing-user-turn, Race B trailing-tool_use-only, parallel tool_use with partial responses, no-op on already-paired history, multiple non-adjacent dangling rounds, caller-supplied reason text, no-op on tool-free history). 131/131 client.test.ts and 91/91 useGeminiStream.test.tsx still passing.

#4178 will be closed once this PR merges (its scope is now this PR). #4177 (idle watchdog) stays as the independent follow-up.

Collaborator

@yiliang114 yiliang114 left a comment


The fix recovers the common SSE-drop case correctly. Left one non-blocking follow-up inline about closing the tool_use ↔ tool_result invariant at the failure point rather than handing off to the scheduler.

@wenshao
Collaborator Author

wenshao commented May 15, 2026

For reviewers — this PR is the minimal, surgical fix. The broader weak-network resilience story has two follow-ups filed so they can be reviewed independently:

Together (#4176 + #4177 + #4178) qwen-code's behavior on weak networks would be comparable to upstream Claude Code, addressing the same class of bug as anthropics/claude-code#6836 (open meta-issue with ~155 duplicate reports, oncall label).

Extends the partial-history fix in fe35e37 to cover the residual race
paths surfaced in PR QwenLM#4176 review:

  - Race A: Ctrl+Y while in-flight tool hasn't finished. History is
    [user, model(tool_use)] — `stripOrphanedUserEntriesFromHistory` only
    pops trailing user entries, so the retry payload lands as a fresh
    user turn after the orphan tool_use and API rejects. Meanwhile the
    scheduler's `onAllToolCallsComplete` is single-shot and gated on
    `isResponding`, so the eventual tool_result is silently swallowed.
  - Race B: process crash / OOM / SIGKILL between the partial-tool_use
    push and the React scheduler's tool_result submission. On `--resume`
    the dangling model(tool_use) wedges the first API call.
  - Race C: external tooling / manual JSONL edits leaving the same
    dangling shape.

The fix has three pieces working together:

1. `repairOrphanedToolUseTurns(history)` in geminiChat.ts walks history
   left-to-right and synthesizes an `error`-typed functionResponse for
   every functionCall whose id is not echoed back in the next user
   turn. Appends to an existing user turn when present, otherwise
   inserts a new one. Returns the injected (callId, name) list.

2. `GeminiClient.repairOrphanedToolUseTurnsInHistory()` wraps the helper
   and is called from three points:
     - `startChat()` after loading the transcript (Race B/C, --resume).
     - `sendMessageStream` Retry branch after stripOrphans (Race A).
     - `sendMessageStream` UserQuery/Cron branch (defensive belt-and-
       suspenders for anything that slipped past 1 and 2).

3. `handleCompletedTools` in useGeminiStream.ts dedupes against
   chat.history before submitting tool_results — if a synthetic
   functionResponse for the same callId is already present (planted by
   the repair pass), the in-flight scheduler's late result is dropped
   and the call is `markToolsAsSubmitted` so the UI advances. Same
   trade-off upstream Claude Code's `StreamingToolExecutor.discard()`
   makes — late real results are dropped on the wire after synthesis,
   the model sees the synthetic error and can retry the tool if it
   still wants the result.

Together with the partial-history push from fe35e37, every tool_use
that ever streamed to the consumer is guaranteed to have a matching
tool_result on the wire — regardless of whether the stream errored,
the user retried mid-flight, the process crashed, or the session is
later resumed. This is the qwen-code analogue of upstream Claude Code's
`yieldMissingToolResultBlocks` (query.ts:123-149), but split across
the core/cli boundary because the React tool scheduler runs out-of-band
from the stream loop (so the synthesis path can't atomically discard
in-flight tools the way upstream's StreamingToolExecutor can; the
history-dedup at handleCompletedTools fills that gap instead).

Tests:
  - 8 new repair-helper tests in geminiChat.test.ts cover Race A,
    Race B, partial coverage of parallel tool_use, idempotence on
    already-paired history, no-op on tool-free history, caller-
    supplied reason text, multiple non-adjacent dangling rounds, and
    routes through the GeminiChat instance-method wrapper.
  - client.test.ts mocks updated for the new GeminiChat method.
  - All 88 geminiChat tests pass; 131 client tests pass; 91
    useGeminiStream tests pass.
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment on lines +417 to +474
export function repairOrphanedToolUseTurns(
  history: Content[],
  reason: string = ORPHAN_TOOL_USE_REPAIR_REASON,
): { injected: Array<{ callId: string; name: string }> } {
  const injected: Array<{ callId: string; name: string }> = [];

  // Forward walk: i mutates as we splice, so use index-based iteration
  // and skip the freshly-inserted user turn to avoid re-scanning it.
  for (let i = 0; i < history.length; i++) {
    const turn = history[i];
    if (turn.role !== 'model') continue;

    // Collect (id → name) for every functionCall in this model turn.
    const expected = new Map<string, string>();
    for (const part of turn.parts ?? []) {
      const fc = part.functionCall;
      if (fc?.id) {
        expected.set(fc.id, fc.name ?? 'unknown');
      }
    }
    if (expected.size === 0) continue;

    const next = history[i + 1];
    const matched = new Set<string>();
    if (next?.role === 'user') {
      for (const part of next.parts ?? []) {
        const id = part.functionResponse?.id;
        if (id) matched.add(id);
      }
    }

    const missing = [...expected.entries()].filter(([id]) => !matched.has(id));
    if (missing.length === 0) continue;

    const syntheticParts: Part[] = missing.map(([callId, name]) => ({
      functionResponse: {
        id: callId,
        name,
        response: { error: reason },
      },
    }));

    if (next?.role === 'user') {
      next.parts = [...(next.parts ?? []), ...syntheticParts];
    } else {
      history.splice(i + 1, 0, { role: 'user', parts: syntheticParts });
      // Skip the freshly-inserted user turn so the outer loop doesn't
      // visit it as a model turn (it isn't) and stays linear-time.
      i++;
    }

    for (const [callId, name] of missing) {
      injected.push({ callId, name });
    }
  }

  return { injected };
}
Comment on lines 1472 to +1535
// Record assistant turn with raw Content and metadata
if (thoughtContentPart || contentText || hasToolCall || usageMetadata) {
  const contextWindowSize =
    this.config.getContentGeneratorConfig()?.contextWindowSize;
  this.chatRecordingService?.recordAssistantTurn({
    model,
    message: [
      ...(thoughtContentPart ? [thoughtContentPart] : []),
      ...(contentText ? [{ text: contentText }] : []),
      ...(hasToolCall
        ? contentParts
            .map(redactStructuredOutputArgsForRecording)
            .filter(
              (p): p is { functionCall: NonNullable<Part['functionCall']> } =>
                p !== null,
            )
        : []),
    ],
    tokens: usageMetadata,
    contextWindowSize,
  });
}

// Mid-stream failure recovery: if the upstream stream threw (typical on
// weak networks — SSE cut between a tool_use `content_block_stop` and
// the terminal `message_stop`) AND any `functionCall` chunk was already
// yielded to consumers, we must persist the partial assistant turn here.
//
// The content generator (Anthropic / OpenAI) emits a `functionCall` part
// only at the end of a tool_use block. Once yielded, `Turn.run` registers
// a `ToolCallRequest` event, the React tool scheduler queues the call,
// and `handleCompletedTools` will fire `submitQuery(..., ToolResult)` —
// pushing a user message with `functionResponse` into history — even
// though the parent stream errored. Without preserving the matching
// tool_use on the model side, the next request body would have
// `user → user[tool_result]` with no tool_use in between, and the
// Anthropic-compatible API (DeepSeek, Anthropic, etc.) rejects with
// "tool_use_id ... must have a corresponding tool_use block in the
// previous message" — an unrecoverable state because Ctrl+Y's
// `stripOrphanedUserEntries` only strips trailing user entries; the
// lost tool_use can't be resurrected.
//
// Plain-text partial turns (no functionCall yielded) are deliberately
// NOT persisted — the Retry path pops the trailing user prompt and
// re-issues it; a stale partial-text model turn between them would
// either bias the retry or surface as a duplicate.
if (streamError !== null) {
  if (
    hasToolCall &&
    (thoughtContentPart || consolidatedHistoryParts.length > 0)
  ) {
    this.history.push({
      role: 'model',
      parts: [
        ...(thoughtContentPart ? [thoughtContentPart] : []),
        ...consolidatedHistoryParts,
      ],
    });
  }
  throw streamError;
}
Post-review audit of 3de3241 surfaced two real races the earlier fix
did not actually close:

  - Race A (the one yiliang114 originally flagged): handleCompletedTools
    is gated on `isResponding` (useGeminiStream:1971) BEFORE the dedup
    branch ran, so when the user Ctrl+Y'd mid-tool, the dedup never
    fired and the in-flight tool stayed permanently stuck in
    `completed-but-not-submitted` (the scheduler's
    `allToolCallsCompleteHandler` is single-shot). The dedup branch was
    structurally unreachable on the very race it was meant to defend
    against.

  - Race in the Retry path: the repair pass in
    `client.sendMessageStream` Retry branch ran BEFORE the chat-internal
    `push(userContent)`, so a Retry of a previous ToolResult submission
    (lastPrompt is a functionResponse part array) raced its own real
    `functionResponse` against the synthesized error one. The synthesis
    was winning and pre-empting the real result, producing two
    `functionResponse` entries with the same callId on the wire.

This commit relocates both pieces to where the contract actually holds:

1. Move the dedup branch in `handleCompletedTools` to BEFORE the
   `isResponding` early-return so `markToolsAsSubmitted` always runs
   for callIds already paired in history, unblocking the UI/scheduler
   even when a new stream is in flight. The submission-side `geminiTools
   = completedAndReadyToSubmitTools.filter(... && !inHistory ...)` then
   covers the no-double-submit case in one expression.

2. Move the repair call from `client.sendMessageStream` Retry branch
   into `chat.sendMessageStream` immediately AFTER the user-supplied
   turn is pushed. The user's own tool_result (when the Retry payload
   carries one) gets the first chance to close the pair before the
   synthesizer sees it as dangling. `client.startChat()` keeps a
   belt-and-suspenders pass at session-load time so any pre-send code
   reading `chat.history` sees a well-formed shape.

Adds three integration tests covering the new wiring:

  - geminiChat.test.ts: chat.sendMessageStream synthesizes a
    functionResponse when history carries a dangling tool_use AND the
    user-supplied content doesn't already close it (Race B/C plus
    Race A from the chat side).
  - geminiChat.test.ts: chat.sendMessageStream does NOT synthesize
    when the user-supplied content IS a matching functionResponse
    (the Retry-of-ToolResult race the previous wiring tripped on).
  - useGeminiStream.test.tsx: handleCompletedTools dedups a late real
    result whose callId already has a functionResponse in chat.history
    (Race A end-to-end: markToolsAsSubmitted fires but no
    submitQuery is dispatched).

Test summary: 90/90 geminiChat, 131/131 client, 92/92 useGeminiStream;
all 8186 core tests pass; tsc clean.

Known limitation logged for follow-up: a Retry of ToolResult whose
intervening stream itself partial-pushed a SECOND tool_use produces a
trailing user turn with both the stale re-pushed `fr_A` and the
synthesized `fr_B` for the second call — `fr_A` retry lands as an
orphan because `fc_A` was already paired earlier in history. Triggered
only when partial-push fires on the second stream AND user retries the
tool_result rather than the user prompt; extremely low frequency in
practice. Cleanest fix is in the retryLastPrompt layer (don't re-push
an already-paired tool_result), which is out of scope for this PR.
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment on lines +1993 to +2034
```typescript
// History-based dedup MUST run before the `isResponding` early-return.
// If a synthetic `functionResponse` for this callId is already in
// chat.history (planted on session-load by
// `client.repairOrphanedToolUseTurnsInHistory` or on every
// `chat.sendMessageStream` push by the inline repair pass), the
// in-flight scheduler result must be marked submitted NOW —
// `useReactToolScheduler.allToolCallsCompleteHandler` is single-shot
// per batch, so a later isResponding=true early-return would leave
// the tool stuck in `completed-but-not-submitted` forever (Race A
// surfaced in PR #4176 review). The real result is dropped on the
// wire — same trade-off upstream Claude Code makes when its
// `StreamingToolExecutor.discard()` follows a
// `yieldMissingToolResultBlocks` synthesis (`query.ts:733` + `:984`).
const historyCallIdsWithResponse = new Set<string>();
// Guard the call: some test harnesses build a partial GeminiClient
// mock without `getHistory`. Skipping dedup in that case is safe —
// it just means tests that never set up the repair pre-condition
// run with the original (pre-dedup) submission shape.
if (geminiClient && typeof geminiClient.getHistory === 'function') {
  for (const entry of geminiClient.getHistory()) {
    if (entry.role !== 'user') continue;
    for (const part of entry.parts ?? []) {
      const id = part.functionResponse?.id;
      if (id) historyCallIdsWithResponse.add(id);
    }
  }
}
const dedupedCallIds = completedAndReadyToSubmitTools
  .filter((tc) => historyCallIdsWithResponse.has(tc.request.callId))
  .map((tc) => tc.request.callId);
if (dedupedCallIds.length > 0) {
  debugLogger.warn(
    `[REPAIR] Dropping ${dedupedCallIds.length} late tool result(s) ` +
      `whose callId already has a functionResponse in history: ` +
      `${dedupedCallIds.join(', ')}`,
  );
  markToolsAsSubmitted(dedupedCallIds);
}

if (isResponding) {
  return;
}
```
Comment on lines 1493 to +1555
```typescript
if (thoughtContentPart || contentText || hasToolCall || usageMetadata) {
  const contextWindowSize =
    this.config.getContentGeneratorConfig()?.contextWindowSize;
  this.chatRecordingService?.recordAssistantTurn({
    model,
    message: [
      ...(thoughtContentPart ? [thoughtContentPart] : []),
      ...(contentText ? [{ text: contentText }] : []),
      ...(hasToolCall
        ? contentParts
            .map(redactStructuredOutputArgsForRecording)
            .filter(
              (
                p,
              ): p is { functionCall: NonNullable<Part['functionCall']> } =>
                p !== null,
            )
        : []),
    ],
    tokens: usageMetadata,
    contextWindowSize,
  });
}

// Mid-stream failure recovery: if the upstream stream threw (typical on
// weak networks — SSE cut between a tool_use `content_block_stop` and
// the terminal `message_stop`) AND any `functionCall` chunk was already
// yielded to consumers, we must persist the partial assistant turn here.
//
// The content generator (Anthropic / OpenAI) emits a `functionCall` part
// only at the end of a tool_use block. Once yielded, `Turn.run` registers
// a `ToolCallRequest` event, the React tool scheduler queues the call,
// and `handleCompletedTools` will fire `submitQuery(..., ToolResult)` —
// pushing a user message with `functionResponse` into history — even
// though the parent stream errored. Without preserving the matching
// tool_use on the model side, the next request body would have
// `user → user[tool_result]` with no tool_use in between, and the
// Anthropic-compatible API (DeepSeek, Anthropic, etc.) rejects with
// "tool_use_id ... must have a corresponding tool_use block in the
// previous message"
// — an unrecoverable state because Ctrl+Y's `stripOrphanedUserEntries`
// only strips trailing user entries; the lost tool_use can't be
// resurrected.
//
// Plain-text partial turns (no functionCall yielded) are deliberately
// NOT persisted — the Retry path pops the trailing user prompt and
// re-issues it; a stale partial-text model turn between them would
// either bias the retry or surface as a duplicate.
if (streamError !== null) {
  if (
    hasToolCall &&
    (thoughtContentPart || consolidatedHistoryParts.length > 0)
  ) {
    this.history.push({
      role: 'model',
      parts: [
        ...(thoughtContentPart ? [thoughtContentPart] : []),
        ...consolidatedHistoryParts,
      ],
    });
  }
  throw streamError;
}
```
@wenshao wenshao changed the title fix(core): persist partial assistant turn when stream errors mid tool_use fix(core,cli): close tool_use↔tool_result invariant across all failure paths May 15, 2026
@wenshao wenshao requested a review from yiliang114 May 15, 2026 15:19

@wenshao wenshao left a comment


⚠️ Downgraded from Approve to Comment: self-PR.

Review Summary: 2 files, +270/-36. Fixes unrecoverable wedge when streaming drops between tool_use content_block_stop and message_stop. Core logic is correct; 606 tests pass; tsc + eslint clean.

No high-confidence Critical issues found. 5 low-confidence suggestions identified (not posted as inline comments — needs human review):

  1. AbortSignal mid-tool_use persists partial turn — the catch block does not distinguish abort from network errors, so a user-cancelled stream also leaves an orphaned tool_use in history. Consider adding a guard.
  2. Zero logging on partial turn persistence — history is modified silently, which makes debugging hard. Consider adding a log line.
  3. recordAssistantTurn is called unconditionally on the error path — partial data from a failed stream is recorded as a normal turn. Consider moving the call into a guarded block.
  4. Missing thinking-only test — the reasoning-model scenario (thought:true with no functionCall) has no regression coverage.
  5. Duplicate history.push code — the error and success paths each carry an identical implementation; consider extracting a shared method.

— DeepSeek/deepseek-v4-pro via Qwen Code /review


@wenshao wenshao left a comment


⚠️ Downgraded from Request changes to Comment: self-PR.

```typescript
// The dedup MUST run regardless of `isResponding`, because the
// scheduler's `onAllToolCallsComplete` is single-shot and would
// otherwise leave the tool stuck in `completed-but-not-submitted`.
const lateRealResult: TrackedCompletedToolCall = {
```


[Critical] This fixture does not structurally satisfy TrackedCompletedToolCall, and the changed-file typecheck fails here with TS2352. As written, the PR cannot pass tsc on the changed files.

Please update the mock object to match the real TrackedCompletedToolCall shape (or choose a narrower test type that actually reflects what this test needs) instead of forcing the incompatible object through a direct assertion.

— gpt-5.5 via Qwen Code /review

```typescript
  },
}));

if (next?.role === 'user') {
```


[Critical] Appending the synthetic functionResponse after existing user parts means the Ctrl+Y retry shape becomes user: text, tool_result. The Anthropic converter preserves part order, so it serializes the retry prompt before the tool_result that is supposed to answer the immediately preceding tool_use. Anthropic-compatible backends require those tool_result blocks to come first in the following user message, so this can still reject the retry and leave the session wedged.

Insert the synthetic responses before the first non-functionResponse part when merging into an existing user turn, while preserving any real tool results that are already at the front.

Suggested change
```typescript
if (next?.role === 'user') {
  const existingParts = next.parts ?? [];
  const firstNonFunctionResponseIndex = existingParts.findIndex(
    (part) => !part.functionResponse,
  );
  const insertAt =
    firstNonFunctionResponseIndex === -1
      ? existingParts.length
      : firstNonFunctionResponseIndex;
  next.parts = [
    ...existingParts.slice(0, insertAt),
    ...syntheticParts,
    ...existingParts.slice(insertAt),
  ];
```

— gpt-5.5 via Qwen Code /review

```typescript
// subsystem is trying to escape. (Belt-and-suspenders: the same
// helper runs again inside `chat.sendMessageStream` after the user
// content is pushed, so a dangling left here by setHistory /
// compaction reordering is also caught — but doing it here keeps
```


[Suggestion] This new startChat() wiring is the only resume-time integration point that repairs a dangling model[functionCall] before the first resumed send, but the tests only cover the helper directly and mock this method in client.test.ts. A future refactor could remove or reorder this call while the helper tests still pass, regressing the --resume recovery path this PR is meant to protect.

Please add a client-level test that starts a chat with extraHistory ending in a model turn containing a functionCall, then asserts the initialized chat history contains the synthetic user[functionResponse] before any send occurs.

— gpt-5.5 via Qwen Code /review

Development

Successfully merging this pull request may close these issues.

Close tool_use↔tool_result invariant at failure point (defense in depth)
