Skip to content

serve: bound /usage per provider so one slow provider can't stall the response#1748

Open
enieuwy wants to merge 1 commit into
steipete:mainfrom
enieuwy:fix/serve-usage-per-provider-timeout
Open

serve: bound /usage per provider so one slow provider can't stall the response#1748
enieuwy wants to merge 1 commit into
steipete:mainfrom
enieuwy:fix/serve-usage-per-provider-timeout

Conversation

@enieuwy

@enieuwy enieuwy commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Problem

codexbar serve's /usage handler (serveUsage) collected providers in a sequential loop with no per-provider timeout. A single slow or hung provider (e.g. a CLI/web fetch that never returns) blocked the entire handler.

The only backstop was the outer request deadline (serveResponseWithDeadline, default 30s), which returns a 504 with an empty body and discards every provider already collected. Worse, the last-known-good merge (CLIServeResponseCache.mergeLastGoodUsageItems) only runs when response.status == .ok, so the 504 path produces zero data — no fresh providers, no cached fallback.

Net effect: one stuck provider makes the whole endpoint return nothing, which pushes shell/Zellij consumers (e.g. showy-quota) onto degraded CLI fallback. Observed live: a wedged serve held port 8080 with /health returning ok while /usage timed out for the full 30s.

Fix

  • Collect providers concurrently, bounding each with BoundedTaskJoin at a budget strictly below the outer request deadline (serveProviderTimeout(requestTimeout:) = requestTimeout * 0.8, or 25s when the deadline is disabled/non-finite).
  • A provider that exceeds its budget contributes its own provider error row instead of blocking the others, so the response still renders every healthy provider with fresh data.
  • Each provider's timeout clock starts when its task is spawned, so a hung provider can't serialize the others' deadlines.
  • Results merge in caller-provided provider order regardless of completion order.
  • The serve usage context's webTimeout is aligned to the per-provider budget (was a fixed 60s that exceeded the 30s request deadline).
  • The outer request deadline is retained as a last-resort safety net.

Last-known-good behavior (scope clarification)

Per-account error rows that carry a cache key are still merged with last-known-good by CLIServeResponseCache exactly as before. A timeout row is account-agnostic (no cacheAccountKey) and is intentionally not reconstructed from last-good — this matches the existing cache rule that "a timeout cannot prove which account is currently active" (staleResponse already refuses usage: keys, and tests assert unkeyed timeout rows are not reconstructed). This PR's win is that a hung provider no longer takes the healthy providers down with it; it does not claim to restore the hung provider's own last value.

Real behavior proof

Built the debug CLI and ran the fixed codexbar serve on a spare port (8092, leaving the app's 8080 untouched). On this machine claude and cursor are genuinely wedged — the same providers that triggered the original incident.

Normal --request-timeout 30 (per-provider budget 24s): /usage returns in 25.3s with the healthy providers intact:

codex        DATA   source=oauth
claude       ERROR  msg="claude usage timed out" kind=provider
cursor       ERROR  msg="cursor usage timed out" kind=provider
antigravity  DATA   source=cli

Pre-fix, the claude/cursor hang would have produced an empty 504 and lost codex + antigravity too. Post-fix they render fresh.

Small --request-timeout 2 (per-provider budget 1.6s): /usage returns in 1.70s — proving the per-provider budget, not the outer deadline, is the cutoff:

codex        DATA   source=oauth
claude       ERROR  msg="claude usage timed out" kind=provider
cursor       ERROR  msg="cursor usage timed out" kind=provider
antigravity  ERROR  msg="antigravity usage timed out" kind=provider

(elapsed measured with curl; provider values/account labels redacted.)

Tests

CLIServeRouterTests:

  • serveProviderTimeout budget mapping, including the strict-below-deadline invariant for sub-second timeouts and the disabled-deadline case.
  • A hung provider (30s simulated) with a 0.1s budget: collection returns in ~0.1s, healthy providers render in caller order, and the hung one yields a single provider error row that is account-agnostic (cacheAccountKey == nil), documenting the deliberate non-reconstruction.

Verified locally: swift build clean (StrictConcurrency), swift test --filter CLIServeRouterTests green, make check (SwiftFormat + SwiftLint) 0 violations.

@clawsweeper

clawsweeper Bot commented Jun 25, 2026

Copy link
Copy Markdown

Codex review: found issues before merge. Reviewed June 25, 2026, 2:43 AM ET / 06:43 UTC.

Summary
The PR changes codexbar serve /usage to collect providers concurrently under a per-provider timeout, aligns provider web timeout to that budget, adds focused CLI serve tests, and updates the changelog.

Reproducibility: yes. from source: current main loops selected /usage providers sequentially and only the outer request deadline can stop a hung provider. The PR body also includes live after-fix output from a wedged-provider setup, but I did not run live probes in this read-only review.

Review metrics: 2 noteworthy metrics.

  • Changed surface: 3 files affected; 142 additions, 7 deletions. The implementation is focused on CLI serve behavior, one test file, and one changelog entry.
  • Documented option affected: 1 CLI option behavior changed. --request-timeout 0 is currently documented as waiting indefinitely, but the PR gives /usage a 25 second provider budget.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🦞 diamond lobster
Patch quality: 🧂 unranked krab
Result: blocked by patch quality or review findings.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Decide whether --request-timeout 0 remains an indefinite wait or becomes a bounded fail-open mode for /usage.
  • [P2] Add or update focused tests and user-facing docs for the chosen timeout semantics.

Risk before merge

  • [P2] The PR intentionally changes documented --request-timeout 0 behavior for /usage; existing polling clients that opted into indefinite waits may now receive provider timeout rows after 25 seconds.
  • [P1] This read-only review did not run local Swift tests, and GitHub macOS test shards were still in progress when inspected.

Maintainer options:

  1. Preserve the disabled-deadline contract (recommended)
    Keep --request-timeout 0 as an explicit opt-out from provider bounds for /usage, and cover that compatibility path with a focused test.
  2. Make the contract change explicit
    If maintainers want 0 to disable only the outer 504 while /usage still fails open after 25 seconds, update user-facing CLI docs/help and release context before landing.
  3. Pause until timeout semantics are decided
    Hold or close this PR if the permanent meaning of --request-timeout 0 is not settled for long-running serve clients.

Next step before merge

  • [P2] A maintainer needs to choose the compatibility direction for --request-timeout 0 before an automated repair would be safe.

Security
Cleared: No concrete security or supply-chain concern was found; the diff changes Swift CLI serve logic, focused tests, and a changelog entry only.

Review findings

  • [P1] Honor the disabled request-timeout contract — Sources/CodexBarCLI/CLIServeCommand.swift:775
Review details

Best possible solution:

Keep the per-provider timeout for finite request deadlines, but either preserve the explicit disabled-deadline behavior or make the compatibility change deliberate and documented before merge.

Do we have a high-confidence way to reproduce the issue?

Yes from source: current main loops selected /usage providers sequentially and only the outer request deadline can stop a hung provider. The PR body also includes live after-fix output from a wedged-provider setup, but I did not run live probes in this read-only review.

Is this the best way to solve the issue?

No as submitted: concurrent per-provider collection is the right narrow fix for finite deadlines, but changing the documented disabled-timeout contract needs either preservation or explicit maintainer-approved documentation.

Full review comments:

  • [P1] Honor the disabled request-timeout contract — Sources/CodexBarCLI/CLIServeCommand.swift:775
    This maps disabled or non-finite request deadlines to a 25 second provider budget for /usage, but docs/cli.md still says --request-timeout 0 keeps serve waiting indefinitely. Existing clients that deliberately disabled deadlines can now get timeout error rows after 25 seconds, so preserve that 0 contract or make the compatibility change explicit in user-facing docs before merge.
    Confidence: 0.9

Overall correctness: patch is incorrect
Overall confidence: 0.88

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against ada3660e9d61.

Label changes

Label changes:

  • add merge-risk: 🚨 compatibility: Merging as-is could break existing clients that use documented --request-timeout 0 semantics to let serve wait indefinitely.
  • add rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🦞 diamond lobster and patch quality is 🧂 unranked krab.
  • add status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action. Sufficient (live_output): The PR body includes copied live /usage output from a real wedged-provider setup showing healthy providers returning and slow providers becoming timeout rows after the fix.
  • remove status: 👀 ready for maintainer look: Current PR status label is status: ⏳ waiting on author.
  • remove rating: 🐚 platinum hermit: Current PR rating is rating: 🧂 unranked krab, so this older rating label is no longer current.

Label justifications:

  • P2: This is a normal-priority CLI server availability fix with limited blast radius, despite one compatibility blocker before merge.
  • merge-risk: 🚨 compatibility: Merging as-is could break existing clients that use documented --request-timeout 0 semantics to let serve wait indefinitely.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🦞 diamond lobster and patch quality is 🧂 unranked krab.
  • status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action. Sufficient (live_output): The PR body includes copied live /usage output from a real wedged-provider setup showing healthy providers returning and slow providers becoming timeout rows after the fix.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes copied live /usage output from a real wedged-provider setup showing healthy providers returning and slow providers becoming timeout rows after the fix.
Evidence reviewed

What I checked:

  • Repository policy read: AGENTS.md was found and read fully; its CLI-focused validation guidance, no-Keychain-prompt constraint, and Swift concurrency caution were relevant to this review. (AGENTS.md:1, ada3660e9d61)
  • Current main sequential behavior: Current main builds one UsageCommandOutput and loops through selection.asList sequentially for /usage, with only the outer request deadline stopping a hung provider. (Sources/CodexBarCLI/CLIServeCommand.swift:734, ada3660e9d61)
  • PR bounded concurrent collection: The PR head computes a provider budget, calls serveCollectUsageOutputs, starts provider fetches concurrently, joins each with BoundedTaskJoin, and emits a provider error row on timeout. (Sources/CodexBarCLI/CLIServeCommand.swift:750, 94a53fb3cc5f)
  • Compatibility blocker: The PR maps disabled or non-finite request deadlines to a 25 second provider budget, while current CLI docs say --request-timeout 0 keeps serve waiting indefinitely. (Sources/CodexBarCLI/CLIServeCommand.swift:775, 94a53fb3cc5f)
  • Documented timeout contract: docs/cli.md documents --request-timeout 0 as disabling serve request deadlines and waiting indefinitely, which conflicts with the PR's new /usage provider bound. (docs/cli.md:53, ada3660e9d61)
  • Regression coverage added: The PR adds tests for the timeout budget mapping and for a simulated hung provider returning quickly while healthy providers stay in caller order and the timeout row remains account-agnostic. (Tests/CodexBarTests/CLIServeRouterTests.swift:225, 94a53fb3cc5f)

Likely related people:

  • enieuwy: Prior merged history includes serve request deadlines/cache coalescing, serve config reload, Antigravity CLI serve fallback, and this PR's current branch. (role: feature-history owner; confidence: high; commits: 06b7de126f1a, 894bbcb8c20b, cd201b9e5035; files: Sources/CodexBarCLI/CLIServeCommand.swift, Tests/CodexBarTests/CLIServeRouterTests.swift, docs/cli.md)
  • steipete: Recent history and blame connect Peter Steinberger to the release baseline, the bounded join helper, and adjacent CLI/server hardening paths. (role: recent adjacent owner; confidence: high; commits: f380287041b8, ac8159f6b10c, a3134da7a536; files: Sources/CodexBarCore/BoundedTaskJoin.swift, Sources/CodexBarCLI/CLIServeCommand.swift, Tests/CodexBarTests/CLIServeRouterTests.swift)
  • ThiagoCAltoe: The original localhost codexbar serve command and /usage surface were introduced in the merged serve command history. (role: introduced behavior; confidence: medium; commits: 74a019c4aa65; files: Sources/CodexBarCLI/CLIServeCommand.swift, Sources/CodexBarCLI/CLILocalHTTPServer.swift, Tests/CodexBarTests/CLIServeRouterTests.swift)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 52c2585b53

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +826 to +829
output.payload.append(Self.makeProviderErrorPayload(
provider: provider,
account: nil,
source: "auto",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve cache keys for provider timeout rows

When an identified provider (for example Codex with visible accounts or any token-account provider) has a previous good /usage response and this new provider-level timeout fires, this synthetic error row is emitted with account: nil and no cacheAccountKey. mergeLastGoodUsageItems only restores cached usage when it can build the (provider, accountID) key from usageCacheKeys, so these timeout rows never match the saved good rows and the endpoint returns the raw timeout error instead of the intended last-known-good fallback.

Useful? React with 👍 / 👎.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal priority bug or improvement with limited blast radius. labels Jun 25, 2026
… response

`serveUsage` collected providers in a sequential loop with no per-provider
timeout, so a single slow or hung provider (e.g. a CLI/web fetch that never
returns) blocked the whole `/usage` handler. The only backstop was the outer
request deadline, which returns a 504 with an empty body and discards every
provider already collected — and because that 504 is not `.ok`, the
last-known-good merge (`mergeLastGoodUsageItems`, which requires `.ok`) never
ran. Net effect: one stuck provider made the entire endpoint return nothing,
which pushed shell/Zellij consumers onto degraded CLI fallback.

Collect providers concurrently, bounding each with `BoundedTaskJoin` at a
budget strictly below the outer request deadline (`serveProviderTimeout`).
A provider over budget now contributes a provider error row instead of
blocking the others, so the response stays `.ok`, the cache can restore that
row from last-known-good, and every healthy provider still renders. Each
provider's timeout clock starts when its task is spawned, so a hung provider
cannot serialize the others' deadlines. Results merge in caller-provided
provider order regardless of completion order. The serve usage context's
`webTimeout` is aligned to the per-provider budget (was a fixed 60s that
exceeded the 30s request deadline).

Adds CLIServeRouterTests coverage for the timeout budget and for a hung
provider degrading to an error row without blocking siblings.
@enieuwy enieuwy force-pushed the fix/serve-usage-per-provider-timeout branch from 52c2585 to 94a53fb Compare June 25, 2026 06:32
@enieuwy

enieuwy commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Thanks both — addressed in 94a53fb3.

On the cache-key finding (Codex P2 / ClawSweeper P2): correct, and the PR body over-claimed. I took the "explicitly narrow the behavior and tests" option rather than keying timeout rows. Reconstructing a timeout row from last-good would require re-resolving accounts (including Codex's visibleCodexAccounts()) outside the per-provider budget — reintroducing the blocking call this PR removes — and it contradicts the existing rule that "a timeout cannot prove which account is currently active" (staleResponse already refuses usage: keys, and the current tests assert unkeyed timeout rows are not reconstructed). So a timeout row is intentionally account-agnostic and not restored; per-account error rows that carry a cache key still merge with last-good exactly as before. The code comments, CHANGELOG, and PR body are corrected, and the test now asserts the timeout row has cacheAccountKey == nil / account == nil.

On real behavior proof (P1): added to the PR body. Ran the fixed codexbar serve on a spare port; on this machine claude/cursor are genuinely wedged. With --request-timeout 30 (24s budget) /usage returns in 25.3s with codex + antigravity rendering real data and claude/cursor as timeout error rows — pre-fix that hang produced an empty 504 that lost the healthy providers too. With --request-timeout 2 (1.6s budget) /usage returns in 1.70s, proving the per-provider budget is the cutoff.

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 25, 2026

Copy link
Copy Markdown

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

@clawsweeper clawsweeper Bot added proof: sufficient Contributor real behavior proof is sufficient. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels Jun 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. P2 Normal priority bug or improvement with limited blast radius. proof: sufficient Contributor real behavior proof is sufficient. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant