Fix: statusline usage counters break in long-running / multi-session use #1349

Open: lmbagley wants to merge 1 commit into danielmiessler:main from lmbagley:fix/statusline-usage-counter-staleness

Fix: statusline usage counters break in long-running / multi-session use

File: Releases/v5.0.0/.claude/PAI/statusline-command.sh (single file, no new dependencies — bash + jq + coreutils, already required by the script).

TL;DR

The USE: 5HR: n% ↻… │ WEEK: n% ↻… line dims, freezes on stale numbers, or vanishes the longer a session runs and the more concurrent Claude Code sessions are open. It looks like several unrelated glitches; it is one generative defect:

The usage subsystem treats a single shared, unlocked, non-atomically-written /tmp/pai-usage-$USER.json file as the authority for three different things: (a) whether to show the line, (b) the data values, and (c) how fresh they are. It does so across an unbounded number of concurrent statusline processes, behind an endpoint the code's own comment calls "~5 req before 429." The newer native rate_limits path (Claude Code ≥ v2.1.80) needs none of that file, yet still inherits its staleness.

Remove any single piece and a real failure still remains — that's what makes it the root cause rather than a symptom.

Where it bites (observed: Claude Code 2.1.177, Linux, up to 40 concurrent claude processes)

| # | Symptom | Mechanism (line refs are pre-fix) |
|---|---------|-----------------------------------|
| 1 | Live native data shown stale: dim gray labels + a growing, bogus (2h) badge | Staleness is computed from the OAuth cache file's mtime (:1338-1350). That file is never written on the native path, so any orphaned/aging /tmp/pai-usage-$USER.json makes fresh native data look old. This is the "breaks in a long-running session" report. |
| 2 | Genuine 0% / 0% usage makes the whole line vanish | The render gate (:1300) uses cache-file existence as a proxy for "do we have data": `… |
| 3 | Counters vanish for all sessions at once under load | N statuslines share one cache (:22), no lock, non-atomic write (:709 jq … > "$USAGE_CACHE"). At TTL expiry all N fire curl at the 429-prone endpoint in the same ~2s tick → throttled → no refresh → at 1800s the code rms the cache and sets usage_no_data=true (:732-733) → line gone. A reader catching the non-atomic write mid-flight gets truncated JSON → jq fails → garbled tick. |
| 4 | (Analyzed, highest production risk) mixed-version rollout | During a CC upgrade, native (≥2.1.80) and legacy (<2.1.80) readers/writers share the same /tmp file under different freshness rules. Closed structurally by the fix (native consults the file for nothing). A single-version test bed never exercises this. |

Root cause (5-Whys)

  1. Counters dim / freeze / disappear.
  2. Because their visibility and freshness are driven by one /tmp cache file's existence and mtime.
  3. That file is the OAuth path's artifact: the native path produces fresh data without it; concurrent sessions read/write it with no coordination; the 429-prone endpoint can't refresh it under load.
  4. It worsens over time/sessions because mtime only ages between successful fetches, and at 1800s the cache is deleted.
  5. Root: freshness, data-presence, and the data itself are conflated into one shared mutable file, with no per-source freshness, no atomicity, and no cross-process coordination.

The fix (6 changes, all in the one file)

| Phase | Change | Kills |
|-------|--------|-------|
| P1 | Producers now emit usage_source=native\|oauth and a tri-state, per-source usage_state (absent/fresh). A jq presence-probe distinguishes "field absent" from "value 0" ((.five_hour.used_percentage // .five_hour.utilization) != null). | foundation for the rest |
| P2 | The OAuth-cache mtime staleness block is wrapped in if usage_source != native. Native rate_limits arrive fresh on stdin every tick, so they never inherit the /tmp cache's age. | #1 |
| P3 | Render gate is now usage_state != absent, branching by source before any file lookup; no more -f "$USAGE_CACHE" as a data proxy. | #2, second route into #1 |
| P4 | Single-fetcher coordination: an atomic, portable mkdir mutex (not flock, since macOS has no flock(1)), acquired non-blocking so a loser instantly serves last-known-good rather than waiting on the slow endpoint. Plus PID-ownership token, liveness-based reap (kill -0, not a bare timeout, so a slow-but-alive winner is never reaped), owner-checked release (never rmdirs another holder's lock), atomic write (temp in same dir → chmod 600 → symlink guard → mv -f, gated on non-empty temp), and a fail-closed fetch (jq -e '.five_hour' on the raw body, so a 429/HTML body can't overwrite good data). | #3 (429 herd + torn reads) |
| P5 | Removed the 1800s rm-and-vanish cliff. Show last-known-good (dimmed + honest stale badge) until a 6h USAGE_HARD_EXPIRY, then hide. The cache is never deleted; a later successful fetch resumes from it. | #3 (graceful degrade, not disappearance) |
| P6 | Stamp fetched_at (epoch) into the cache JSON; both the fetch decision and the render staleness compute data age from fetched_at (mtime fallback), so rsync/backup/touch can't fake freshness or staleness. | mtime-churn robustness; closes #4 |
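
As a rough illustration of the P1 presence probe, the sketch below classifies a JSON payload on stdin as fresh or absent. The function name usage_state and the top-level five_hour input shape are assumptions made for this example; only the jq expression is taken from the description above. It relies on jq's // operator treating only null and false as missing, so a literal 0 still counts as present.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the script's real function): reads JSON on stdin and
# prints "fresh" when a 5-hour usage value is present (including 0), else "absent".
usage_state() {
  local out
  out=$(jq -r 'if (.five_hour.used_percentage // .five_hour.utilization) != null
               then "fresh" else "absent" end' 2>/dev/null) || out=""
  # Malformed or empty input produces no jq output; treat that as absent.
  printf '%s\n' "${out:-absent}"
}
```

Under these assumptions, a payload such as {"five_hour":{"used_percentage":0}} is reported as fresh, while {} or a non-JSON body is reported as absent, which is the distinction failure #2 lacked.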

Sequencing / invariant: P1 → (P2 ∥ P3) → P4 → P5 → P6. P1-P3 resolve the two native-path phenotypes most users see; P4-P5 harden the OAuth path for the multi-session reality. Documented invariant: curl --max-time 3 (total, incl. connect) ≪ the 15s lock-reap floor, so a live fetcher can never be misclassified as a crashed one.
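
The P4 locking scheme can be sketched roughly as follows. This is a minimal illustration, not the patched script's code: LOCK_DIR, try_lock, and release_lock are hypothetical names, and the 15s lock-reap floor mentioned above is deliberately left out to keep the example short. It also assumes every contender runs as the same user, since kill -0 reports failure for a live process owned by someone else.

```shell
#!/usr/bin/env bash
# Hypothetical names throughout; a sketch of a non-blocking mkdir mutex.
LOCK_DIR="${LOCK_DIR:-${TMPDIR:-/tmp}/usage-demo.lock}"

try_lock() {
  local owner
  # mkdir is atomic: exactly one caller succeeds in creating the directory.
  if mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "$$" > "$LOCK_DIR/pid"          # PID-ownership token
    return 0
  fi
  # Lost the race: never wait. Reap only if the recorded owner is provably dead.
  owner=$(cat "$LOCK_DIR/pid" 2>/dev/null) || owner=""
  case "$owner" in
    ''|*[!0-9]*) return 1 ;;             # no usable token yet: leave the lock alone
  esac
  if ! kill -0 "$owner" 2>/dev/null; then
    rm -rf "$LOCK_DIR"
    if mkdir "$LOCK_DIR" 2>/dev/null; then
      echo "$$" > "$LOCK_DIR/pid"
      return 0
    fi
  fi
  return 1                               # caller serves last-known-good instead
}

release_lock() {
  # Owner-checked release: only the process named in the token may remove the lock.
  [ "$(cat "$LOCK_DIR/pid" 2>/dev/null)" = "$$" ] || return 1
  rm -f "$LOCK_DIR/pid"
  rmdir "$LOCK_DIR" 2>/dev/null
}
```

A caller would run the fetch only when try_lock succeeds and call release_lock afterwards; every other process skips the fetch for that tick and renders from the existing cache.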

Verification

  • 20 offline assertions, two suites (see Reproduce below). The P4-P6 suite is network-free: token-less HOME + a pre-held lock force the non-blocking-loser path, and several cases assert the cache's fetched_at is unchanged — positive proof that no fetch occurred.
  • Real-API end-to-end: forcing a refresh, the fix acquired the mkdir lock, fetched (5h 29 → 35 → 40%), atomically wrote the cache with fetched_at stamped, released the lock with no leak and zero stray temp files.
  • A pristine (unpatched) copy fails exactly on failures #1 and #2 under the same suite, confirming that the tests discriminate.
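
For readers who want a concrete picture of the write path exercised in the end-to-end run, here is a hedged sketch of a fail-closed, atomic cache write with a fetched_at stamp. The function name write_cache_atomic, the exact placement of the symlink guard, and the temp-file naming are assumptions for illustration; the real script may order or name these steps differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write_cache_atomic BODY CACHE_PATH
write_cache_atomic() {
  local body=$1 cache=$2 tmp="$2.tmp.$$" stamped

  # Fail closed: a 429 page or error JSON has no truthy .five_hour, so jq -e fails.
  printf '%s' "$body" | jq -e '.five_hour' >/dev/null 2>&1 || return 1

  # Stamp the fetch time into the data so age no longer depends on file mtime.
  stamped=$(printf '%s' "$body" |
    jq -c --argjson now "$(date +%s)" '. + {fetched_at: $now}') || return 1

  # Symlink guard (assumed placement): refuse if a link sits at the cache path.
  if [ -L "$cache" ]; then return 1; fi

  # Temp file lives beside the cache so the final rename stays on one filesystem.
  ( umask 077; printf '%s\n' "$stamped" > "$tmp" ) || { rm -f "$tmp"; return 1; }
  chmod 600 "$tmp"

  # Gate on a non-empty temp, then rename over the cache in a single step.
  if [ -s "$tmp" ]; then
    mv -f "$tmp" "$cache"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Because the rename replaces the file in one step, a concurrent reader sees either the previous cache or the new one, never a truncated body, and a rejected response leaves the previous cache untouched.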

Accepted residuals (called out so a reviewer doesn't have to rediscover them)

  1. Native has no freshness ceiling. P2 stops native from inheriting the OAuth cache's age, and native data carries no "as-of" timestamp from Claude Code — so a wedged native producer that keeps emitting the last values would show them as fresh, unmarked. Not fixable in the statusline without CC providing a native timestamp (P6's fetched_at is OAuth-only). If CC exposed a native data timestamp, P2/P6 would extend to cover this.
  2. Reap micro-race. Two processes simultaneously reaping a genuinely dead-owner lock can cause ≤2 extra fetches that tick — never a stampede (PID-ownership + liveness reap + owner-checked release prevent the cascading "release nukes another's lock" case). Self-heals next tick.
  3. Temp-leak window. …tmp.$$ orphans only if SIGKILL lands in the microsecond between write and the immediately-following rm, on the successful-fetch path only. A leaked lock self-heals via the liveness reap.
  4. Silent-blank past 6h. Beyond USAGE_HARD_EXPIRY the line hides rather than showing a "stale >6h" tombstone — a deliberate product choice; the dim+badge covers the in-between. Easy to change if a tombstone is preferred.
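
To make the P5/P6 display rule and residual 4 concrete, the following sketch maps data age to one of three outcomes. Only the 6h USAGE_HARD_EXPIRY value comes from this description; the function name usage_display_state, the fresh_for threshold parameter, the minutes form of the badge, and the choice to still show data at exactly 6h are assumptions for illustration.

```shell
#!/usr/bin/env bash
USAGE_HARD_EXPIRY=21600   # 6h, as stated in the PR description

# Hypothetical helper: usage_display_state NOW FETCHED_AT FRESH_FOR (all in seconds)
usage_display_state() {
  local now=$1 fetched_at=$2 fresh_for=$3 age
  age=$(( now - fetched_at ))
  if [ "$age" -lt 0 ]; then age=0; fi        # tolerate clock skew
  if [ "$age" -le "$fresh_for" ]; then
    echo "fresh"
  elif [ "$age" -le "$USAGE_HARD_EXPIRY" ]; then
    # Last-known-good: still shown, dimmed, with an honest age badge.
    if [ "$age" -ge 3600 ]; then
      echo "stale ($(( age / 3600 ))h)"
    else
      echo "stale ($(( age / 60 ))m)"
    fi
  else
    echo "hidden"                            # past the hard expiry: line is not drawn
  fi
}
```

Swapping the final branch for a printed "stale >6h" marker would give the tombstone behavior mentioned in residual 4.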

Test matrix the reviewer should care about

A green run does not exonerate this unless it exercises all three production triggers, none of which a default CI hits:

  1. macOS (no flock(1); different stat/date flavors — the script already branches on these).
  2. High concurrency (≥20 simultaneous statuslines on one shared cache).
  3. Mixed CC version (native + legacy writers on the same /tmp file).

Reproduce / verify

Before-fix reproduction (run against an unpatched script)

Demonstrates failures #1 and #2 with synthetic Claude Code statusline JSON on stdin (no network, no credentials):

bash repro.sh /path/to/statusline-command.sh

Failure 1 prints the live line in stale-gray with a bogus (2h); Failure 2 prints <no usage line printed> for genuine 0%/0%.

After-fix verification (20 assertions, network-free)
bash verify-fix.sh   /path/to/statusline-command.sh   # 8  — P1-P3 (incl. malformed-input robustness)
bash verify-p4p6.sh  /path/to/statusline-command.sh   # 12 — P4-P6 (mutex/liveness/atomic-write/fetched_at)

The P4-P6 suite uses a token-less HOME and pre-held locks so the OAuth fetch is structurally impossible; cases also assert fetched_at is unchanged to prove no fetch fired.

The reproduction + verification scripts are available on request (kept alongside the change; not committed here to avoid adding test fixtures to a release snapshot — happy to include them under whatever path you prefer).


Authored from a real incident: usage counters degrading across ~40 concurrent sessions on Linux. Diagnosed reproduce-first, verified per-fix, and run through two independent adversarial design reviews (the mkdir-vs-flock cross-platform call, the tri-state presence model, and the lock ownership/liveness/fail-closed hardening all came out of those). Glad to split this into smaller commits or adjust the USAGE_HARD_EXPIRY / tombstone behavior to match maintainer preference.

The usage line (5HR/WEEK) treats one shared, unlocked, non-atomically-written
/tmp/pai-usage-$USER.json as the authority for data, presence, AND freshness
across N concurrent statuslines -- and the native rate_limits path inherits its
mtime-based staleness. Result: live data shown stale with a bogus "(Nh)" badge,
genuine 0% usage vanishing, and counters disappearing for all sessions at once
under multi-session load.

Six fixes in statusline-command.sh:
- P1 tri-state, per-source presence flags (distinguishes "absent" from "0%")
- P2 native path no longer inherits the OAuth cache's mtime staleness
- P3 render gate branches by source before any file lookup (drops the -f cache proxy)
- P4 single-fetcher: atomic mkdir mutex (non-blocking, PID-owned, kill -0 liveness
     reap, owner-checked release), atomic write (same-dir temp, 0600, symlink guard,
     mv -f, non-empty gate), fail-closed fetch (jq -e on the raw body)
- P5 last-known-good with stale badge until a 6h hard-expiry (was: rm + vanish at 30m)
- P6 fetched_at stamped in-cache; staleness derived from data age, not file mtime

Verified: 20 offline assertions (network-free) + a real-API end-to-end run.
Full root-cause analysis and reproduction steps in the PR description.