Skip to content

Commit 2985881

Browse files
committed
refactor(core): consolidate partial-tool_use repair docs into one design block (PR #4176)
Addresses the pomelo-nwu review observation on c30bba6: ~60% of the +918 lines added to geminiChat.ts were comments, with several 20–40 line prose blocks repeating the same race-class / tool_use_id wedge analysis at every call site. Net –139 / +117 (net delete ~22 lines but mostly redistributing weight from per-site blocks to a single canonical note). Changes: * Add one canonical design block above ORPHAN_TOOL_USE_REPAIR_REASON covering: the tool_use_id 400 wedge, the three race classes (A Ctrl+Y mid-flight, B process crash, C SSE drop), the two-layer fix (partial-push + repair), and the partial-push marker lifecycle. Every per-site comment that used to repeat this now points back. * Trim popPartialIfPushed COMPRESSED-branch comment from 14 lines to 4 (no-op-today + reason for keeping). * Trim recovery-catch comment block from ~38 lines to ~7 (pop-order matters + index-checked, pointer to canonical note). * Trim addHistory invariant comment from ~22 lines to ~5 (when this fires + why error-level, pointer to canonical note). * Trim processStreamResponse partial-push comment from ~22 lines to ~8 (Race C name + signal + plain-text-not-persisted rule). * Replace tool_use_id wedge re-explanations at 5 sites (repair doc, scanModelTurn doc, applyRepair doc, popPartialIfPushed comment, processStreamResponse comment) with single-line pointers. Canonical block at line 411 is now the only authoritative copy. Reviewer marked this as non-blocking ("shouldn't gate the merge"); done as cleanup polish while LGTM is in. Tests: 375/375 (no test changes). tsc + eslint + prettier clean.
1 parent a8ac579 commit 2985881

1 file changed

Lines changed: 117 additions & 139 deletions

File tree

packages/core/src/core/geminiChat.ts

Lines changed: 117 additions & 139 deletions
Original file line numberDiff line numberDiff line change
@@ -397,6 +397,75 @@ const ORPHAN_TOOL_USE_REPAIR_REASON =
397397
'Tool execution result was not recorded — likely interrupted by network ' +
398398
'failure, abort, or process exit. Treat as failure and retry if needed.';
399399

400+
/*
401+
* ============================================================================
402+
* Partial-tool_use repair subsystem — canonical design note.
403+
* ============================================================================
404+
*
405+
* Every comment block elsewhere in this file that mentions one of the
406+
* concepts below points back here. Per-site comments should be one or two
407+
* lines stating WHAT the local code does; the WHY lives here.
408+
*
409+
* --- The wedge ----------------------------------------------------------
410+
*
411+
* Anthropic-compatible backends (Anthropic, DeepSeek, …) reject a request
412+
* whose `user[tool_result]` blocks are not at the HEAD of the user message
413+
* immediately following the `model[tool_use]` they answer:
414+
*
415+
* "tool_use_id ... must have a corresponding tool_use block in the
416+
* previous message"
417+
*
418+
* Without a matching pair the session is unrecoverable — `stripOrphanedUser
419+
* EntriesFromHistory` only strips trailing user entries, so a lost tool_use
420+
* cannot be resurrected and the next send 400s repeatedly.
421+
*
422+
* --- The race classes that produce dangling tool_uses --------------------
423+
*
424+
* Race A (Ctrl+Y mid-flight): user retries before the in-flight tool
425+
* finishes. The scheduler's `onAllToolCallsComplete` is single-shot
426+
* per batch and would otherwise leave the tool stuck in
427+
* `completed-but-not-submitted` forever.
428+
* Race B (process crash / OOM mid-flight): the JSONL transcript captures
429+
* the dangling `model[fc]` and `--resume` rehydrates it.
430+
* Race C (network drop between `content_block_stop` of a tool_use and
431+
* the terminal `message_stop`): `processStreamResponse` re-throws
432+
* after we have already yielded a `functionCall` chunk, so the React
433+
* scheduler is on its way to submit a real `functionResponse` while
434+
* in-memory history has no matching `model[fc]`.
435+
*
436+
* --- The two-layer fix ---------------------------------------------------
437+
*
438+
* (1) Persist the partial assistant turn at the failure point in
439+
* `processStreamResponse` (`this.history.push({role: 'model', parts:
440+
* [...]})` plus the `pendingPartialAssistantTurnIndex` /
441+
* `pendingPartialAssistantRecord` markers) so the matching
442+
* `model[fc]` is on disk and in memory when the late `user[fr]`
443+
* arrives.
444+
* (2) Repair any remaining dangling `model[fc]` whose
445+
* `user[fr]` never landed (`repairOrphanedToolUseTurns`):
446+
* - SYNTHESIZE an `error` fr for ids with no matching response;
447+
* - HOIST the real fr into the immediately-adjacent user turn
448+
* when it landed in a non-adjacent later turn;
449+
* - DROP duplicate fr copies for the same id.
450+
* Then `useGeminiStream.handleCompletedTools` dedupes the
451+
* scheduler's late real result against `chat.history` so the
452+
* synthetic and the real result never collide on the wire.
453+
*
454+
* --- Partial-push marker lifecycle ---------------------------------------
455+
*
456+
* Set together on (streamError + hasToolCall + hasContent) inside
457+
* `processStreamResponse`. Cleared together by `popPartialIfPushed` on a
458+
* retryable error rollback, or flushed together to JSONL by the outer
459+
* `finally` after the retry loop exits. Defense-in-depth: every
460+
* history-mutation method (clearHistory / addHistory / setHistory /
461+
* truncateHistory / stripThoughtsFromHistory /
462+
* stripOrphanedUserEntriesFromHistory) resets both markers in lockstep so
463+
* a stale index can't shift onto an unrelated model turn and cause
464+
* `popPartialIfPushed` to splice the wrong entry. Any single-field reset
465+
* is a bug.
466+
* ============================================================================
467+
*/
468+
400469
/**
401470
* Walk `history` left-to-right and close every dangling tool_use ↔ tool_result
402471
* pair so the wire format the next API call sees is always
@@ -410,15 +479,9 @@ const ORPHAN_TOOL_USE_REPAIR_REASON =
410479
* `yieldMissingToolResultBlocks` (`query.ts:123-149`).
411480
* - HOIST: for any `functionCall.id` whose real `functionResponse` lives in
412481
* a non-adjacent following user turn (typical shape:
413-
* `model[fc], user[text], user[fr_real]` — produced when a user aborts a
414-
* long-running tool, types a follow-up, and the React scheduler's late
415-
* `submitQuery` appends the real `fr` as a SEPARATE user entry), MOVE the
416-
* real `fr` part out of its original turn into the adjacent one. Without
417-
* hoisting, the synthesis pass correctly skips the call (a real `fr`
418-
* exists somewhere later) but the wire layout still serializes
419-
* `model[tool_use] → user[text] → user[tool_result]`, which
420-
* Anthropic-compatible backends reject with "tool_use_id ... must have a
421-
* corresponding tool_use block in the previous message".
482+
* `model[fc], user[text], user[fr_real]`), MOVE the real `fr` part out
483+
* of its original turn into the adjacent one. Required by the wire
484+
* layout — see the canonical design note above for the wedge.
422485
*
423486
* Mutates `history` in place and returns the set of injected `(callId, name)`
424487
* tuples so callers (the React tool scheduler) can dedupe a real `tool_result`
@@ -492,16 +555,10 @@ interface RepairPlan {
492555
* follows. Pure read; no mutation.
493556
*
494557
* Storing ALL locations (not just the first) is load-bearing for the
495-
* duplicate case: if the same callId is echoed back more than once
496-
* across the consecutive user turns (e.g.
497-
* `model[fc id=cid], user[text], user[fr cid], user[fr cid]` — possible
498-
* when the React scheduler retries the late `submitQuery` and a
499-
* duplicate fr lands), hoisting only the first would leave the
500-
* duplicate behind. The wire payload then serializes
501-
* `model[tool_use] -> user[tool_result] -> user[tool_result]`
502-
* and the backend rejects the trailing block as an orphan
503-
* ("tool_use_id ... must have a corresponding tool_use block in the
504-
* previous message").
558+
* duplicate case: e.g. `model[fc id=cid], user[text], user[fr cid],
559+
* user[fr cid]` (the React scheduler retries the late `submitQuery`).
560+
* The downstream decision phase needs every copy so it can drop
561+
* duplicates — otherwise the wire payload re-triggers the wedge.
505562
*/
506563
function scanModelTurn(history: Content[], modelIdx: number): ScanResult {
507564
const expected = new Map<string, string>();
@@ -604,16 +661,10 @@ function planRepair(scan: ScanResult): RepairPlan {
604661
* user turn (before any non-fr parts) OR insert a new user turn
605662
* between `modelIdx` and whatever follows.
606663
*
607-
* Anthropic-compatible backends require the tool_result blocks at the
608-
* head of the immediately following user message; appending instead
609-
* (`[text, fr]`) re-triggers the 400 the synthesis pass exists to
610-
* escape. Mirrors upstream Claude Code's `hoistToolResults`.
611-
*
612-
* CONSEQUENCE OF REMOVAL of the head-insert: dropping this hoist (e.g.
613-
* naively `next.parts = [...existing, ...partsToInject]`) re-introduces
614-
* the "tool_use_id ... must have a corresponding tool_use block in the
615-
* previous message" 400 the synthesis pass exists to prevent. Do not
616-
* "simplify" this branch.
664+
* Step 3 inserts at the HEAD (before non-fr parts), not the tail —
665+
* mirrors upstream Claude Code's `hoistToolResults`. See the canonical
666+
* design note above `ORPHAN_TOOL_USE_REPAIR_REASON` for why a tail
667+
* append re-triggers the 400. Do not "simplify" the head-insert away.
617668
*/
618669
function applyRepair(
619670
history: Content[],
@@ -1172,15 +1223,13 @@ export class GeminiChat {
11721223
lastError = error;
11731224

11741225
// If `processStreamResponse` persisted a partial assistant turn
1175-
// (mid-stream error after a `functionCall` was already yielded),
1176-
// every retry-and-continue path below must drop that turn first.
1177-
// Otherwise a successful retry's response lands AFTER the stale
1178-
// failed-attempt model turn — two consecutive `model` entries
1179-
// with an orphan tool_use in the first, re-triggering the
1180-
// "tool_use_id ... corresponding tool_use" 400 this fix is
1181-
// supposed to escape. Paths that `break` (unretryable) keep
1182-
// the partial — the caller will see it as part of the error
1183-
// surface.
1226+
// (mid-stream error after a `functionCall` was already
1227+
// yielded), every retry-and-continue path below must drop
1228+
// that turn first; otherwise the retry's response lands as
1229+
// a second consecutive model turn with an orphan tool_use
1230+
// (the wedge — see the canonical note above
1231+
// `ORPHAN_TOOL_USE_REPAIR_REASON`). Paths that `break`
1232+
// (unretryable) keep the partial.
11841233
const popPartialIfPushed = () => {
11851234
const idx = self.pendingPartialAssistantTurnIndex;
11861235
if (idx === null) return;
@@ -1302,21 +1351,10 @@ export class GeminiChat {
13021351
reactiveInfo.compressionStatus ===
13031352
CompressionStatus.COMPRESSED
13041353
) {
1305-
// Defense-in-depth no-op: tryCompress() succeeded
1306-
// means it has already replaced this.history via
1307-
// setHistory(), which calls clearPendingPartialState()
1308-
// — so by the time we reach this line, the marker is
1309-
// null and popPartialIfPushed splices nothing. We
1310-
// keep the call as a uniformity assertion against
1311-
// future refactors that might switch tryCompress to
1312-
// an in-place mutation: in that world, the marker
1313-
// would NOT be reset by setHistory and this call
1314-
// becomes the only thing that drops the stale
1315-
// partial before requestContents is rebuilt below.
1316-
// Removing it would couple correctness to the
1317-
// implementation detail "setHistory always clears
1318-
// the marker", which the other retry branches don't
1319-
// share.
1354+
// No-op today: tryCompress's setHistory has already
1355+
// cleared the marker. Kept for uniformity with the
1356+
// other retry branches in case a future in-place
1357+
// tryCompress stops resetting it.
13201358
popPartialIfPushed();
13211359
requestContents = self.getRequestHistory();
13221360
debugLogger.info(
@@ -1552,47 +1590,13 @@ export class GeminiChat {
15521590
// coalesced back into the preceding model entry after the loop.
15531591
successfulRecoveries++;
15541592
} catch (recoveryError) {
1555-
// If a recovery attempt fails (e.g., empty response, network
1556-
// error), stop recovering and let the partial output stand.
1557-
// Pop the dangling recovery message to keep history valid.
1558-
//
1559-
// Order matters: when the recovery stream errors AFTER
1560-
// yielding a `functionCall` chunk, `processStreamResponse`
1561-
// pushes a partial `model` turn into history before
1562-
// re-throwing. The naive "if last is user, pop" check
1563-
// would then no-op (last is now the partial `model`),
1564-
// leaving `user(OUTPUT_RECOVERY_MESSAGE)` stranded as a
1565-
// real user turn the user never sent. Two consequences:
1566-
// - the control-prompt text (which carries instructions
1567-
// meant only for the model's own continuation context)
1568-
// pollutes durable history and biases later turns,
1569-
// - the inline repair on the next sendMessageStream
1570-
// synthesizes an `error` `functionResponse` for the
1571-
// dangling `functionCall`, which the
1572-
// `handleCompletedTools` history-based dedup then drops
1573-
// when the React scheduler's REAL tool result arrives,
1574-
// so the model sees an "execution result was not
1575-
// recorded" error for a tool that actually succeeded.
1576-
// Pop the partial model turn FIRST, then the recovery
1577-
// user turn. The partial-push markers are also cleared
1578-
// in lockstep so the outer `finally` JSONL flush can't
1579-
// resurrect a partial we just deleted from live history.
1580-
//
1581-
// Index-checked pop instead of a positional `pop()` so
1582-
// we match the diagnostic standard set by
1583-
// `popPartialIfPushed` above (splice at `idx` + warn on
1584-
// bounds/role mismatch). The two rollback strategies
1585-
// share an undocumented positional assumption: nothing
1586-
// mutates `this.history` between
1587-
// `processStreamResponse`'s push and the for-await
1588-
// catch here. If a future change inserts a mutation in
1589-
// that window (compression side-effect, abort-signal
1590-
// handler, telemetry hook), a naked
1591-
// `history.pop()` would silently remove the wrong
1592-
// entry while `clearPendingPartialState()` clears
1593-
// markers for the actual partial — leaving it
1594-
// permanently stranded with no log trail. The warn
1595-
// makes any future violation visible immediately.
1593+
// Pop the partial `model[fc]` FIRST (if processStreamResponse
1594+
// pushed one before re-throwing), THEN the recovery user
1595+
// turn. Reversed order would strand `OUTPUT_RECOVERY_MESSAGE`
1596+
// as a real user turn. Index-checked pop mirrors
1597+
// `popPartialIfPushed` above — see the design note above
1598+
// `ORPHAN_TOOL_USE_REPAIR_REASON` for the wedge mechanism
1599+
// and the partial-push marker lifecycle.
15961600
const expectedIdx = self.pendingPartialAssistantTurnIndex;
15971601
const lastIdx = self.history.length - 1;
15981602
if (
@@ -1936,34 +1940,18 @@ export class GeminiChat {
19361940
*/
19371941
addHistory(content: Content): void {
19381942
this.history.push(content);
1939-
// The marker is per-send-attempt. Today's callers (cancelled-tool
1940-
// synthesis in useGeminiStream, ACP session injects,
1941-
// shellCommandProcessor) only run between sends, so the originating
1942-
// sendMessageStream has either already popped the partial via the
1943-
// retry loop or hit an unrecoverable break — in both cases the
1944-
// marker is no longer load-bearing.
1945-
//
1946-
// If a future code path ever calls addHistory BETWEEN the partial
1947-
// push and the retry attempt, silently clearing the marker would
1948-
// strand the partial: popPartialIfPushed would no-op, the failed
1949-
// attempt's `model[functionCall]` would survive into the retry,
1950-
// and a successful retry's response would land as a SECOND
1951-
// consecutive model turn (the wedge this whole subsystem exists
1952-
// to prevent). The log below makes that coupling observable —
1953-
// anyone investigating a stale-partial bug will see this log line
1954-
// pointing straight at the offending caller. Error-level (not
1955-
// warn) because this is a true invariant violation: the existing
1956-
// call graph cannot legitimately hit this branch, so any
1957-
// occurrence is a real bug in a future caller, not noise.
1943+
// addHistory only runs between sends, so the partial-push marker
1944+
// should already be cleared. If it is not, a new caller is
1945+
// violating that invariant — surface it at error level so the
1946+
// offending stack is visible. See the design note above
1947+
// `ORPHAN_TOOL_USE_REPAIR_REASON` for the marker lifecycle.
19581948
if (
19591949
this.pendingPartialAssistantTurnIndex !== null ||
19601950
this.pendingPartialAssistantRecord !== null
19611951
) {
19621952
debugLogger.error(
19631953
'[INVARIANT_VIOLATION] addHistory called while a partial-push ' +
1964-
'marker is active — clearing it. This is unexpected during an active sendMessageStream ' +
1965-
'and likely indicates a new caller violating the between-sends ' +
1966-
'invariant. See comment at GeminiChat.addHistory for context.',
1954+
'marker is active — clearing it.',
19671955
);
19681956
}
19691957
this.clearPendingPartialState();
@@ -2251,30 +2239,20 @@ export class GeminiChat {
22512239
}
22522240
}
22532241

2254-
// Mid-stream failure recovery: if the upstream stream threw (typical on
2255-
// weak networks — SSE cut between a tool_use `content_block_stop` and
2256-
// the terminal `message_stop`) AND any `functionCall` chunk was already
2257-
// yielded to consumers, we must persist the partial assistant turn here.
2258-
//
2259-
// The content generator (Anthropic / OpenAI) emits a `functionCall` part
2260-
// only at the end of a tool_use block. Once yielded, `Turn.run` registers
2261-
// a `ToolCallRequest` event, the React tool scheduler queues the call,
2262-
// and `handleCompletedTools` will fire `submitQuery(..., ToolResult)` —
2263-
// pushing a user message with `functionResponse` into history — even
2264-
// though the parent stream errored. Without preserving the matching
2265-
// tool_use on the model side, the next request body would have
2266-
// `user → user[tool_result]` with no tool_use in between, and the
2267-
// Anthropic-compatible API (DeepSeek, Anthropic, etc.) rejects with
2268-
// "tool_use_id ... must have a corresponding tool_use block in the
2269-
// previous message"
2270-
// — an unrecoverable state because Ctrl+Y's `stripOrphanedUserEntries`
2271-
// only strips trailing user entries; the lost tool_use can't be
2272-
// resurrected.
2242+
// Mid-stream failure recovery (Race C in the canonical note above
2243+
// `ORPHAN_TOOL_USE_REPAIR_REASON`): if the upstream stream threw
2244+
// AFTER a `functionCall` chunk was already yielded — typical on
2245+
// weak networks: SSE cut between a tool_use `content_block_stop`
2246+
// and the terminal `message_stop` — we persist the partial
2247+
// assistant turn so the React scheduler's incoming
2248+
// `user[functionResponse]` has a matching `model[tool_use]` to
2249+
// pair with.
22732250
//
2274-
// Plain-text partial turns (no functionCall yielded) are deliberately
2275-
// NOT persisted — the Retry path pops the trailing user prompt and
2276-
// re-issues it; a stale partial-text model turn between them would
2277-
// either bias the retry or surface as a duplicate.
2251+
// Plain-text partial turns (no functionCall yielded) are
2252+
// deliberately NOT persisted — the Retry path pops the trailing
2253+
// user prompt and re-issues it; a stale partial-text model turn
2254+
// between them would either bias the retry or surface as a
2255+
// duplicate.
22782256
if (streamError !== null) {
22792257
// Reuse the `willPersistToHistory` gate from the recordAssistantTurn
22802258
// block above instead of re-deriving it. When `streamError !== null`,

0 commit comments

Comments
 (0)