Describe the bug
The CLI's undici HTTP/2 connection pool has a race condition when handling server-sent GOAWAY frames. When a GOAWAY arrives while requests are in-flight, the pool's internal state invariant (pendingCount === 0) is violated, causing an AssertionError. The CLI's retry logic then makes 5 retry attempts against the same corrupted pool, all of which fail identically. Each retry consumes a premium request, and the entire sequence takes ~88 seconds before surfacing a terminal error.
The v1.0.6 changelog states "Resolve session crashes caused by HTTP/2 connection pool race conditions when sub-agents are active" — but the fix is incomplete. Issues #2101 (v1.0.6) and #2189 (v1.0.9) report the identical failure pattern post-fix.
This is the single most impactful reliability issue for Enterprise teams using the CLI with Claude models.
Error output
```
✗ AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:
aB(t[TNe]===0)
● Request failed due to a transient API error. Retrying...
● Request failed due to a transient API error. Retrying...
● Request failed due to a transient API error. Retrying...
● Request failed due to a transient API error. Retrying...
● Request failed due to a transient API error. Retrying...
✗ Execution failed: Error: Failed to get response from the AI model; retried 5 times
(total retry wait time: 88.20631920058639 seconds)
Last error: CAPIError: 503 {"error":{"message":"HTTP/2 GOAWAY connection terminated","type":"connection_error"}}
```
The minified assertion trace reveals undici Pool/PoolBase internals: TNe → kPending, SNe → kRunning, plus kClients, kNeedDrain, kAddClient, kGetDispatcher, kRemoveClient symbols and a 2048-element ring buffer request queue (visible in #2050's full stack trace).
Root cause analysis
The failure chain is:
- Server sends an HTTP/2 GOAWAY frame (normal lifecycle — connection TTL, load rebalancing, deploy)
- undici pool receives the GOAWAY on an active multiplexed connection — the handler in client-h2.js sets client[kHTTP2Session] = null, but in-flight streams haven't drained yet
- Pool state invariant violated — the pendingCount === 0 assertion fails because requests were queued between GOAWAY receipt and session close
- AssertionError thrown from minified pool code
- getCompletionWithTools catches the error, classifies it as "transient", and retries on the SAME corrupted pool
- 5 retries × the same corrupted pool state = 5 wasted premium requests + guaranteed failure
This traces to a known, still-open undici bug: nodejs/undici#4059 — "Uncaught AssertionError thrown due to a possible race condition." The undici source at api-request.js:141-150 even contains a TODO: Does this need queueMicrotask? comment acknowledging the timing issue. A related but distinct bug (nodejs/undici#3140) was fixed in undici v6.14.0, but #4059 remains open and is the likely root cause of post-v1.0.6 occurrences.
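The corrupted-pool retry loop in the chain above can be illustrated with a toy model — this is NOT undici's real internals, just a minimal sketch of why every retry fails identically once the invariant is broken:

```javascript
// Toy model of the failure chain above (not actual undici code).
// Once the shared pool's pendingCount invariant is violated, every retry
// against the same instance throws the identical assertion.
class ToyPool {
  constructor() { this.pendingCount = 0; }
  goawayWhileInFlight() { this.pendingCount = 1; } // the race window
  dispatch() {
    // mirrors the minified aB(t[TNe] === 0) assertion
    if (this.pendingCount !== 0) {
      throw new Error('AssertionError: pendingCount === 0');
    }
    return 'ok';
  }
}

// The CLI's retry loop, reduced: 5 attempts, same pool, same failure.
function retryOnSamePool(pool, attempts = 5) {
  const errors = [];
  for (let i = 0; i < attempts; i++) {
    try { return pool.dispatch(); } catch (e) { errors.push(e.message); }
  }
  return errors;
}
```

Retrying never helps because nothing between attempts resets the pool's state — which is exactly why Layer 1 and Layer 2 below both center on recreating the pool.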
Why the v1.0.6 fix is incomplete
The v1.0.6 fix targeted the most common trigger vector: sub-agent pool contention. When multiple sub-agents share a single undici Pool with allowH2: true, concurrent requests multiply the GOAWAY collision window. The fix appears to have addressed the sub-agent coordination layer but did not fix:
- Long-lived single-agent sessions (stale connections accumulate; #1754 was a 19h session)
- The transition between sub-agent exploration and parent-agent output generation (#2189)
- The retry loop itself — retries still target the same corrupted pool instead of creating a fresh connection
Contributing factors (from community reports)
Factor | Evidence | Issue
-- | -- | --
Claude models hit this far more than OpenAI/Google | "occurs much more frequently with opus-4.6 models. I've almost never seen it when using the sonnet models" | #1743
Long sessions (4+ hours) | 19h27m session with 2,812 lines changed | #1754
Sub-agent → parent transitions | "When it explores the codebase using a subagent, everything works fine, but when it tries to write the plan..." | #2189
Large output generation | Failures cluster when model transitions from reading to writing | #2050, #2189
Post-v1.0.6 persistence | Identical failures on v1.0.6 and v1.0.9 | #2101, #2189
Claude models likely trigger this more because Anthropic's API gateway uses shorter HTTP/2 connection TTLs or more aggressive GOAWAY behavior, and Opus generates longer responses that keep streams open longer.
Proposed fix (3 layers)
Layer 1: Proactive connection recycling
The Pool constructor (visible in #2050's trace) accepts clientTtl. Setting clientTtl: 60000 (60s) would proactively cycle connections before servers send GOAWAY frames, eliminating the stale-connection trigger entirely. For Anthropic endpoints specifically, a shorter TTL (30-45s) may be appropriate.
Additionally, listen for the disconnect event on the pool and recreate the dispatcher when GOAWAY is detected, rather than retrying on the corrupted pool.
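Both ideas together can be sketched as a small pool manager. The pool construction is injected here so the recycle-on-disconnect logic stands alone; makePoolManager, the GOAWAY string match, and the createPool callback are illustrative, not the CLI's actual code:

```javascript
// Sketch of Layer 1: drop the pool when a GOAWAY-flavored disconnect
// fires, so the next request builds a fresh one. In the CLI, createPool
// would be something like:
//   () => new Pool(origin, { allowH2: true, clientTtl: 60_000 })
function makePoolManager(createPool) {
  let pool = null;
  return {
    get() {
      if (pool === null) {
        pool = createPool();
        pool.on('disconnect', (url, targets, err) => {
          // GOAWAY terminations surface as disconnects; discard the
          // instance instead of letting retries reuse it.
          if (err && /GOAWAY/i.test(String(err.message))) {
            pool = null;
          }
        });
      }
      return pool;
    },
  };
}
```

The design choice worth noting: the manager never hands a retry the same instance that just saw a GOAWAY, which addresses the "retries target the same corrupted pool" half of the bug independently of the clientTtl mitigation.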
Layer 2: Error-class-aware retry logic
Currently all 5xx errors and connection errors are retried identically. A GOAWAY (connection-level) failure requires a fundamentally different retry strategy than a capacity 503 (server-level):
- GOAWAY / connection_error: Reset the connection pool, then retry. 2-3 retries max, short backoff (1-2s). The pool is broken, not the server.
- 503 without GOAWAY: Server is overloaded. Exponential backoff, 3-5 retries. Pool is fine, server needs time.
- 429 / rate limit: Respect Retry-After. 1 retry max. Don't burn quota.
- 400 / client error: Don't retry at all. Surface immediately.
The key change: on GOAWAY detection, destroy and recreate the connection pool before the first retry attempt.
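The classification above could look roughly like this. The err shape (status, type, message) is modeled on the CAPIError in the log, not the CLI's real error type, and the returned fields are hypothetical names:

```javascript
// Sketch of the Layer 2 error-class-aware retry policy (assumed err shape).
function retryPolicy(err) {
  const isGoaway =
    err.type === 'connection_error' || /GOAWAY/i.test(String(err.message || ''));
  if (isGoaway) {
    // Pool is broken, not the server: reset it, few retries, short backoff.
    return { resetPool: true, maxRetries: 3, backoffMs: 1500 };
  }
  if (err.status === 429) {
    // Respect Retry-After; don't burn quota.
    return { resetPool: false, maxRetries: 1, respectRetryAfter: true };
  }
  if (err.status >= 500) {
    // Server overloaded: exponential backoff, the pool is fine.
    return { resetPool: false, maxRetries: 5, exponential: true };
  }
  // 4xx client errors: surface immediately.
  return { resetPool: false, maxRetries: 0 };
}
```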
Layer 3: Premium request protection
Each failed retry currently consumes a premium request. For connection-level failures (where the request never reached the model):
- At minimum: show users a running count of premium requests consumed by retries, with an opt-out after 2-3 wasted requests
- Ideally: send an X-Retry-Of: <original-request-id> header so the billing system can deduplicate connection-level retry charges
- Consider: per-sub-agent connection pools instead of shared pools, to prevent one sub-agent's GOAWAY from poisoning the entire session
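The "opt-out after 2-3 wasted requests" idea can be sketched as a small budget tracker. All names here (makeRetryBudget, warnAfter, onWarn) are hypothetical, not an existing CLI API:

```javascript
// Sketch of Layer 3's premium-request accounting.
function makeRetryBudget({ warnAfter = 2, onWarn }) {
  let wasted = 0;
  return {
    recordWastedRetry() {
      wasted += 1;
      if (wasted > warnAfter && onWarn) {
        onWarn(wasted); // e.g. prompt the user to abort further retries
      }
      return wasted;
    },
    get wasted() { return wasted; },
  };
}
```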
Affected version
v1.0.6
Steps to reproduce the behavior
Most reliable (works ~80% of the time):
1. copilot
2. /model → select Claude Opus 4.6 (High)
3. /plan on a medium-to-large repo
4. Let sub-agent exploration complete
5. When the agent transitions to writing output → GOAWAY hits

Also reliable:
- Run a session for 4+ hours with periodic prompts, then make a complex request
- Use /fleet with Claude models to create parallel sub-agent connections
- /resume a dormant session (4+ hours old) and immediately prompt a complex task
Expected behavior
- GOAWAY frames are handled as normal HTTP/2 lifecycle events, not assertion-triggering errors
- Connection pool is recycled before retrying after a GOAWAY
- Retries on connection-level errors don't silently burn premium requests
- Users see distinct, informative messages for connection errors vs. rate limits vs. server errors
- Long-running sessions proactively recycle connections before GOAWAY is received
Additional context
Environment
- CLI version: v1.0.12 (also reproducible on v1.0.6, v1.0.9, v1.0.10)
- Plan: Enterprise (also affects Business, Pro, Pro+)
- OS: Linux, macOS, Windows/WSL (cross-platform)
- Models primarily affected: Claude Opus 4.6, Claude Sonnet 4.6
- Network: Both direct connections and behind corporate proxies (Zscaler, Netskope)
Related issues
- #1743 — Autopilot mode AssertionError (v0.0.420, established model correlation with Opus)
- #1754 — AssertionError during retrospective after 19h session (v0.0.420, best root cause analysis)
- #2050 — Claude Sonnet 4.6 GOAWAY failure with full stack trace (v1.0.5, reveals undici internals)
- #2101 — Transient API error → rate limit cascade (v1.0.6, post-fix)
- #2189 — Claude Opus 4.6 GOAWAY during plan generation (v1.0.9, post-fix, 4 consecutive reproductions)
- #1627 — Retry loop burns premium requests, switching models doesn't help (v0.0.420)
- #2073 — Frequent transient errors leading to rate limits (v1.0.5)
- nodejs/undici#4059 — Root cause: AssertionError race condition (OPEN)
- nodejs/undici#3140 — Related: GOAWAY request hang (fixed in v6.14.0)
- nodejs/undici#3011 — Related: null session ref after GOAWAY