serve: bound /usage per provider so one slow provider can't stall the response#1748
serve: bound /usage per provider so one slow provider can't stall the response#1748enieuwy wants to merge 1 commit into
Conversation
|
Codex review: found issues before merge. Reviewed June 25, 2026, 2:43 AM ET / 06:43 UTC. Summary Reproducibility: yes. from source: current main loops selected Review metrics: 2 noteworthy metrics.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge
Security Review findings
Review detailsBest possible solution: Keep the per-provider timeout for finite request deadlines, but either preserve the explicit disabled-deadline behavior or make the compatibility change deliberate and documented before merge. Do we have a high-confidence way to reproduce the issue? Yes from source: current main loops selected Is this the best way to solve the issue? No as submitted: concurrent per-provider collection is the right narrow fix for finite deadlines, but changing the documented disabled-timeout contract needs either preservation or explicit maintainer-approved documentation. Full review comments:
Overall correctness: patch is incorrect AGENTS.md: found and applied where relevant. Codex review notes: model internal, reasoning high; reviewed against ada3660e9d61. Label changesLabel changes:
Label justifications:
Evidence reviewedWhat I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 52c2585b53
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| output.payload.append(Self.makeProviderErrorPayload( | ||
| provider: provider, | ||
| account: nil, | ||
| source: "auto", |
There was a problem hiding this comment.
Preserve cache keys for provider timeout rows
When an identified provider (for example Codex with visible accounts or any token-account provider) has a previous good /usage response and this new provider-level timeout fires, this synthetic error row is emitted with account: nil and no cacheAccountKey. mergeLastGoodUsageItems only restores cached usage when it can build the (provider, accountID) key from usageCacheKeys, so these timeout rows never match the saved good rows and the endpoint returns the raw timeout error instead of the intended last-known-good fallback.
Useful? React with 👍 / 👎.
… response `serveUsage` collected providers in a sequential loop with no per-provider timeout, so a single slow or hung provider (e.g. a CLI/web fetch that never returns) blocked the whole `/usage` handler. The only backstop was the outer request deadline, which returns a 504 with an empty body and discards every provider already collected — and because that 504 is not `.ok`, the last-known-good merge (`mergeLastGoodUsageItems`, which requires `.ok`) never ran. Net effect: one stuck provider made the entire endpoint return nothing, which pushed shell/Zellij consumers onto degraded CLI fallback. Collect providers concurrently, bounding each with `BoundedTaskJoin` at a budget strictly below the outer request deadline (`serveProviderTimeout`). A provider over budget now contributes a provider error row instead of blocking the others, so the response stays `.ok`, the cache can restore that row from last-known-good, and every healthy provider still renders. Each provider's timeout clock starts when its task is spawned, so a hung provider cannot serialize the others' deadlines. Results merge in caller-provided provider order regardless of completion order. The serve usage context's `webTimeout` is aligned to the per-provider budget (was a fixed 60s that exceeded the 30s request deadline). Adds CLIServeRouterTests coverage for the timeout budget and for a hung provider degrading to an error row without blocking siblings.
52c2585 to
94a53fb
Compare
|
Thanks both — addressed in On the cache-key finding (Codex P2 / ClawSweeper P2): correct, and the PR body over-claimed. I took the "explicitly narrow the behavior and tests" option rather than keying timeout rows. Reconstructing a timeout row from last-good would require re-resolving accounts (including Codex's On real behavior proof (P1): added to the PR body. Ran the fixed @clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. |
Problem
codexbar serve's/usagehandler (serveUsage) collected providers in a sequential loop with no per-provider timeout. A single slow or hung provider (e.g. a CLI/web fetch that never returns) blocked the entire handler.The only backstop was the outer request deadline (
serveResponseWithDeadline, default 30s), which returns a 504 with an empty body and discards every provider already collected. Worse, the last-known-good merge (CLIServeResponseCache.mergeLastGoodUsageItems) only runs whenresponse.status == .ok, so the 504 path produces zero data — no fresh providers, no cached fallback.Net effect: one stuck provider makes the whole endpoint return nothing, which pushes shell/Zellij consumers (e.g. showy-quota) onto degraded CLI fallback. Observed live: a wedged serve held port 8080 with
/healthreturningokwhile/usagetimed out for the full 30s.Fix
BoundedTaskJoinat a budget strictly below the outer request deadline (serveProviderTimeout(requestTimeout:)=requestTimeout * 0.8, or 25s when the deadline is disabled/non-finite).webTimeoutis aligned to the per-provider budget (was a fixed 60s that exceeded the 30s request deadline).Last-known-good behavior (scope clarification)
Per-account error rows that carry a cache key are still merged with last-known-good by
CLIServeResponseCacheexactly as before. A timeout row is account-agnostic (nocacheAccountKey) and is intentionally not reconstructed from last-good — this matches the existing cache rule that "a timeout cannot prove which account is currently active" (staleResponsealready refusesusage:keys, and tests assert unkeyed timeout rows are not reconstructed). This PR's win is that a hung provider no longer takes the healthy providers down with it; it does not claim to restore the hung provider's own last value.Real behavior proof
Built the debug CLI and ran the fixed
codexbar serveon a spare port (8092, leaving the app's 8080 untouched). On this machineclaudeandcursorare genuinely wedged — the same providers that triggered the original incident.Normal
--request-timeout 30(per-provider budget 24s):/usagereturns in 25.3s with the healthy providers intact:Pre-fix, the claude/cursor hang would have produced an empty 504 and lost
codex+antigravitytoo. Post-fix they render fresh.Small
--request-timeout 2(per-provider budget 1.6s):/usagereturns in 1.70s — proving the per-provider budget, not the outer deadline, is the cutoff:(
elapsedmeasured withcurl; provider values/account labels redacted.)Tests
CLIServeRouterTests:serveProviderTimeoutbudget mapping, including the strict-below-deadline invariant for sub-second timeouts and the disabled-deadline case.cacheAccountKey == nil), documenting the deliberate non-reconstruction.Verified locally:
swift buildclean (StrictConcurrency),swift test --filter CLIServeRouterTestsgreen,make check(SwiftFormat + SwiftLint) 0 violations.