serve: bound /usage per provider so one slow provider can't stall the response by enieuwy · Pull Request #1748 · steipete/CodexBar

enieuwy · 2026-06-25T06:18:44Z

Problem

codexbar serve's /usage handler (serveUsage) collected providers in a sequential loop with no per-provider timeout. A single slow or hung provider (e.g. a CLI/web fetch that never returns) blocked the entire handler.

The only backstop was the outer request deadline (serveResponseWithDeadline, default 30s), which returns a 504 with an empty body and discards every provider already collected. Worse, the last-known-good merge (CLIServeResponseCache.mergeLastGoodUsageItems) only runs when response.status == .ok, so the 504 path produces zero data — no fresh providers, no cached fallback.

Net effect: one stuck provider makes the whole endpoint return nothing, which pushes shell/Zellij consumers (e.g. showy-quota) onto degraded CLI fallback. Observed live: a wedged serve held port 8080 with /health returning ok while /usage timed out for the full 30s.

Fix

Collect providers concurrently, bounding each with BoundedTaskJoin at a budget strictly below the outer request deadline (serveProviderTimeout(requestTimeout:) = requestTimeout * 0.8, or 25s when the deadline is disabled/non-finite).
A provider that exceeds its budget contributes its own provider error row instead of blocking the others, so the response still renders every healthy provider with fresh data.
Each provider's timeout clock starts when its task is spawned, so a hung provider can't serialize the others' deadlines.
Results merge in caller-provided provider order regardless of completion order.
The serve usage context's webTimeout is aligned to the per-provider budget (was a fixed 60s that exceeded the 30s request deadline).
The outer request deadline is retained as a last-resort safety net.

Last-known-good behavior (scope clarification)

Per-account error rows that carry a cache key are still merged with last-known-good by CLIServeResponseCache exactly as before. A timeout row is account-agnostic (no cacheAccountKey) and is intentionally not reconstructed from last-good — this matches the existing cache rule that "a timeout cannot prove which account is currently active" (staleResponse already refuses usage: keys, and tests assert unkeyed timeout rows are not reconstructed). This PR's win is that a hung provider no longer takes the healthy providers down with it; it does not claim to restore the hung provider's own last value.

Real behavior proof

Built the debug CLI and ran the fixed codexbar serve on a spare port (8092, leaving the app's 8080 untouched). On this machine claude and cursor are genuinely wedged — the same providers that triggered the original incident.

Normal --request-timeout 30 (per-provider budget 24s): /usage returns in 25.3s with the healthy providers intact:

codex        DATA   source=oauth
claude       ERROR  msg="claude usage timed out" kind=provider
cursor       ERROR  msg="cursor usage timed out" kind=provider
antigravity  DATA   source=cli

Pre-fix, the claude/cursor hang would have produced an empty 504 and lost codex + antigravity too. Post-fix they render fresh.

Small --request-timeout 2 (per-provider budget 1.6s): /usage returns in 1.70s — proving the per-provider budget, not the outer deadline, is the cutoff:

codex        DATA   source=oauth
claude       ERROR  msg="claude usage timed out" kind=provider
cursor       ERROR  msg="cursor usage timed out" kind=provider
antigravity  ERROR  msg="antigravity usage timed out" kind=provider

(elapsed measured with curl; provider values/account labels redacted.)

Tests

CLIServeRouterTests:

serveProviderTimeout budget mapping, including the strict-below-deadline invariant for sub-second timeouts and the disabled-deadline case.
A hung provider (30s simulated) with a 0.1s budget: collection returns in ~0.1s, healthy providers render in caller order, and the hung one yields a single provider error row that is account-agnostic (cacheAccountKey == nil), documenting the deliberate non-reconstruction.

Verified locally: swift build clean (StrictConcurrency), swift test --filter CLIServeRouterTests green, make check (SwiftFormat + SwiftLint) 0 violations.

clawsweeper · 2026-06-25T06:19:30Z

Codex review: found issues before merge. Reviewed June 25, 2026, 2:43 AM ET / 06:43 UTC.

Summary
The PR changes codexbar serve /usage to collect providers concurrently under a per-provider timeout, aligns provider web timeout to that budget, adds focused CLI serve tests, and updates the changelog.

Reproducibility: yes. from source: current main loops selected /usage providers sequentially and only the outer request deadline can stop a hung provider. The PR body also includes live after-fix output from a wedged-provider setup, but I did not run live probes in this read-only review.

Review metrics: 2 noteworthy metrics.

Changed surface: 3 files affected; 142 additions, 7 deletions. The implementation is focused on CLI serve behavior, one test file, and one changelog entry.
Documented option affected: 1 CLI option behavior changed. --request-timeout 0 is currently documented as waiting indefinitely, but the PR gives /usage a 25 second provider budget.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🦞 diamond lobster
Patch quality: 🧂 unranked krab
Result: blocked by patch quality or review findings.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

Decide whether --request-timeout 0 remains an indefinite wait or becomes a bounded fail-open mode for /usage.
[P2] Add or update focused tests and user-facing docs for the chosen timeout semantics.

Risk before merge

[P2] The PR intentionally changes documented --request-timeout 0 behavior for /usage; existing polling clients that opted into indefinite waits may now receive provider timeout rows after 25 seconds.
[P1] This read-only review did not run local Swift tests, and GitHub macOS test shards were still in progress when inspected.

Maintainer options:

Preserve the disabled-deadline contract (recommended)
Keep --request-timeout 0 as an explicit opt-out from provider bounds for /usage, and cover that compatibility path with a focused test.
Make the contract change explicit
If maintainers want 0 to disable only the outer 504 while /usage still fails open after 25 seconds, update user-facing CLI docs/help and release context before landing.
Pause until timeout semantics are decided
Hold or close this PR if the permanent meaning of --request-timeout 0 is not settled for long-running serve clients.

Next step before merge

[P2] A maintainer needs to choose the compatibility direction for --request-timeout 0 before an automated repair would be safe.

Security
Cleared: No concrete security or supply-chain concern was found; the diff changes Swift CLI serve logic, focused tests, and a changelog entry only.

Review findings

[P1] Honor the disabled request-timeout contract — Sources/CodexBarCLI/CLIServeCommand.swift:775

Review details

Best possible solution:

Keep the per-provider timeout for finite request deadlines, but either preserve the explicit disabled-deadline behavior or make the compatibility change deliberate and documented before merge.

Do we have a high-confidence way to reproduce the issue?

Yes from source: current main loops selected /usage providers sequentially and only the outer request deadline can stop a hung provider. The PR body also includes live after-fix output from a wedged-provider setup, but I did not run live probes in this read-only review.

Is this the best way to solve the issue?

No as submitted: concurrent per-provider collection is the right narrow fix for finite deadlines, but changing the documented disabled-timeout contract needs either preservation or explicit maintainer-approved documentation.

Full review comments:

[P1] Honor the disabled request-timeout contract — Sources/CodexBarCLI/CLIServeCommand.swift:775
This maps disabled or non-finite request deadlines to a 25 second provider budget for /usage, but docs/cli.md still says --request-timeout 0 keeps serve waiting indefinitely. Existing clients that deliberately disabled deadlines can now get timeout error rows after 25 seconds, so preserve that 0 contract or make the compatibility change explicit in user-facing docs before merge.
Confidence: 0.9

Overall correctness: patch is incorrect
Overall confidence: 0.88

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against ada3660e9d61.

Label changes

Label changes:

add merge-risk: 🚨 compatibility: Merging as-is could break existing clients that use documented --request-timeout 0 semantics to let serve wait indefinitely.
add rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🦞 diamond lobster and patch quality is 🧂 unranked krab.
add status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action. Sufficient (live_output): The PR body includes copied live /usage output from a real wedged-provider setup showing healthy providers returning and slow providers becoming timeout rows after the fix.
remove status: 👀 ready for maintainer look: Current PR status label is status: ⏳ waiting on author.
remove rating: 🐚 platinum hermit: Current PR rating is rating: 🧂 unranked krab, so this older rating label is no longer current.

Label justifications:

P2: This is a normal-priority CLI server availability fix with limited blast radius, despite one compatibility blocker before merge.
merge-risk: 🚨 compatibility: Merging as-is could break existing clients that use documented --request-timeout 0 semantics to let serve wait indefinitely.
rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🦞 diamond lobster and patch quality is 🧂 unranked krab.
status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action. Sufficient (live_output): The PR body includes copied live /usage output from a real wedged-provider setup showing healthy providers returning and slow providers becoming timeout rows after the fix.
proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes copied live /usage output from a real wedged-provider setup showing healthy providers returning and slow providers becoming timeout rows after the fix.

Evidence reviewed

What I checked:

Repository policy read: AGENTS.md was found and read fully; its CLI-focused validation guidance, no-Keychain-prompt constraint, and Swift concurrency caution were relevant to this review. (AGENTS.md:1, ada3660e9d61)
Current main sequential behavior: Current main builds one UsageCommandOutput and loops through selection.asList sequentially for /usage, with only the outer request deadline stopping a hung provider. (Sources/CodexBarCLI/CLIServeCommand.swift:734, ada3660e9d61)
PR bounded concurrent collection: The PR head computes a provider budget, calls serveCollectUsageOutputs, starts provider fetches concurrently, joins each with BoundedTaskJoin, and emits a provider error row on timeout. (Sources/CodexBarCLI/CLIServeCommand.swift:750, 94a53fb3cc5f)
Compatibility blocker: The PR maps disabled or non-finite request deadlines to a 25 second provider budget, while current CLI docs say --request-timeout 0 keeps serve waiting indefinitely. (Sources/CodexBarCLI/CLIServeCommand.swift:775, 94a53fb3cc5f)
Documented timeout contract: docs/cli.md documents --request-timeout 0 as disabling serve request deadlines and waiting indefinitely, which conflicts with the PR's new /usage provider bound. (docs/cli.md:53, ada3660e9d61)
Regression coverage added: The PR adds tests for the timeout budget mapping and for a simulated hung provider returning quickly while healthy providers stay in caller order and the timeout row remains account-agnostic. (Tests/CodexBarTests/CLIServeRouterTests.swift:225, 94a53fb3cc5f)

Likely related people:

enieuwy: Prior merged history includes serve request deadlines/cache coalescing, serve config reload, Antigravity CLI serve fallback, and this PR's current branch. (role: feature-history owner; confidence: high; commits: 06b7de126f1a, 894bbcb8c20b, cd201b9e5035; files: Sources/CodexBarCLI/CLIServeCommand.swift, Tests/CodexBarTests/CLIServeRouterTests.swift, docs/cli.md)
steipete: Recent history and blame connect Peter Steinberger to the release baseline, the bounded join helper, and adjacent CLI/server hardening paths. (role: recent adjacent owner; confidence: high; commits: f380287041b8, ac8159f6b10c, a3134da7a536; files: Sources/CodexBarCore/BoundedTaskJoin.swift, Sources/CodexBarCLI/CLIServeCommand.swift, Tests/CodexBarTests/CLIServeRouterTests.swift)
ThiagoCAltoe: The original localhost codexbar serve command and /usage surface were introduced in the merged serve command history. (role: introduced behavior; confidence: medium; commits: 74a019c4aa65; files: Sources/CodexBarCLI/CLIServeCommand.swift, Sources/CodexBarCLI/CLILocalHTTPServer.swift, Tests/CodexBarTests/CLIServeRouterTests.swift)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 52c2585b53

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-25T06:22:49Z

+        output.payload.append(Self.makeProviderErrorPayload(
+            provider: provider,
+            account: nil,
+            source: "auto",


Preserve cache keys for provider timeout rows

When an identified provider (for example Codex with visible accounts or any token-account provider) has a previous good /usage response and this new provider-level timeout fires, this synthetic error row is emitted with account: nil and no cacheAccountKey. mergeLastGoodUsageItems only restores cached usage when it can build the (provider, accountID) key from usageCacheKeys, so these timeout rows never match the saved good rows and the endpoint returns the raw timeout error instead of the intended last-known-good fallback.

Useful? React with 👍 / 👎.

… response `serveUsage` collected providers in a sequential loop with no per-provider timeout, so a single slow or hung provider (e.g. a CLI/web fetch that never returns) blocked the whole `/usage` handler. The only backstop was the outer request deadline, which returns a 504 with an empty body and discards every provider already collected — and because that 504 is not `.ok`, the last-known-good merge (`mergeLastGoodUsageItems`, which requires `.ok`) never ran. Net effect: one stuck provider made the entire endpoint return nothing, which pushed shell/Zellij consumers onto degraded CLI fallback. Collect providers concurrently, bounding each with `BoundedTaskJoin` at a budget strictly below the outer request deadline (`serveProviderTimeout`). A provider over budget now contributes a provider error row instead of blocking the others, so the response stays `.ok`, the cache can restore that row from last-known-good, and every healthy provider still renders. Each provider's timeout clock starts when its task is spawned, so a hung provider cannot serialize the others' deadlines. Results merge in caller-provided provider order regardless of completion order. The serve usage context's `webTimeout` is aligned to the per-provider budget (was a fixed 60s that exceeded the 30s request deadline). Adds CLIServeRouterTests coverage for the timeout budget and for a hung provider degrading to an error row without blocking siblings.

enieuwy · 2026-06-25T06:32:57Z

Thanks both — addressed in 94a53fb3.

On the cache-key finding (Codex P2 / ClawSweeper P2): correct, and the PR body over-claimed. I took the "explicitly narrow the behavior and tests" option rather than keying timeout rows. Reconstructing a timeout row from last-good would require re-resolving accounts (including Codex's visibleCodexAccounts()) outside the per-provider budget — reintroducing the blocking call this PR removes — and it contradicts the existing rule that "a timeout cannot prove which account is currently active" (staleResponse already refuses usage: keys, and the current tests assert unkeyed timeout rows are not reconstructed). So a timeout row is intentionally account-agnostic and not restored; per-account error rows that carry a cache key still merge with last-good exactly as before. The code comments, CHANGELOG, and PR body are corrected, and the test now asserts the timeout row has cacheAccountKey == nil / account == nil.

On real behavior proof (P1): added to the PR body. Ran the fixed codexbar serve on a spare port; on this machine claude/cursor are genuinely wedged. With --request-timeout 30 (24s budget) /usage returns in 25.3s with codex + antigravity rendering real data and claude/cursor as timeout error rows — pre-fix that hang produced an empty 504 that lost the healthy providers too. With --request-timeout 2 (1.6s budget) /usage returns in 1.70s, proving the per-provider budget is the cutoff.

@clawsweeper re-review

clawsweeper · 2026-06-25T06:32:59Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

chatgpt-codex-connector Bot reviewed Jun 25, 2026

View reviewed changes

enieuwy force-pushed the fix/serve-usage-per-provider-timeout branch from 52c2585 to 94a53fb Compare June 25, 2026 06:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

serve: bound /usage per provider so one slow provider can't stall the response#1748

serve: bound /usage per provider so one slow provider can't stall the response#1748
enieuwy wants to merge 1 commit into
steipete:mainfrom
enieuwy:fix/serve-usage-per-provider-timeout

enieuwy commented Jun 25, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented Jun 25, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 25, 2026

Uh oh!

enieuwy commented Jun 25, 2026

Uh oh!

clawsweeper Bot commented Jun 25, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

enieuwy commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Last-known-good behavior (scope clarification)

Real behavior proof

Tests

Uh oh!

clawsweeper Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

enieuwy commented Jun 25, 2026

Uh oh!

clawsweeper Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

enieuwy commented Jun 25, 2026 •

edited

Loading

clawsweeper Bot commented Jun 25, 2026 •

edited

Loading

clawsweeper Bot commented Jun 25, 2026 •

edited

Loading