Fix: statusline usage counters break in long-running / multi-session use #1349
Open
lmbagley wants to merge 1 commit into
The usage line (5HR/WEEK) treats one shared, unlocked, non-atomically-written
/tmp/pai-usage-$USER.json as the authority for data, presence, AND freshness
across N concurrent statuslines -- and the native rate_limits path inherits its
mtime-based staleness. Result: live data shown stale with a bogus "(Nh)" badge,
genuine 0% usage vanishing, and counters disappearing for all sessions at once
under multi-session load.
Six fixes in statusline-command.sh:
- P1 tri-state, per-source presence flags (distinguishes "absent" from "0%")
- P2 native path no longer inherits the OAuth cache's mtime staleness
- P3 render gate branches by source before any file lookup (drops the -f cache proxy)
- P4 single-fetcher: atomic mkdir mutex (non-blocking, PID-owned, kill -0 liveness
reap, owner-checked release), atomic write (same-dir temp, 0600, symlink guard,
mv -f, non-empty gate), fail-closed fetch (jq -e on the raw body)
- P5 last-known-good with stale badge until a 6h hard-expiry (was: rm + vanish at 30m)
- P6 fetched_at stamped in-cache; staleness derived from data age, not file mtime
Verified: 20 offline assertions (network-free) + a real-API end-to-end run.
Full root-cause analysis and reproduction steps in the PR description.
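The P4 item packs several mechanisms into one line, so a minimal sketch may help. Everything named here (WORK_DIR, LOCK_DIR, CACHE, acquire_lock, release_lock, atomic_write) is hypothetical and not taken from statusline-command.sh. The sketch also leaves out the fail-closed jq check and the 15s reap floor, so treat it as an illustration of the idea rather than the script's actual code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the P4 discipline; all names and paths are illustrative.
WORK_DIR="${WORK_DIR:-$(mktemp -d)}"
LOCK_DIR="$WORK_DIR/usage.lock"
CACHE="$WORK_DIR/usage.json"

# Non-blocking acquire. mkdir either creates the directory or fails, atomically,
# so at most one caller wins and every loser returns immediately.
acquire_lock() {
  local owner
  if mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "$$" > "$LOCK_DIR/pid"              # PID-ownership token
    return 0
  fi
  owner=$(cat "$LOCK_DIR/pid" 2>/dev/null || true)
  # Liveness-based reap: reclaim only when the recorded owner no longer exists.
  # Assumptions: a single user (kill -0 also fails for another user's live
  # process), and a tolerated small window between this check and the removal.
  if [ -n "$owner" ] && ! kill -0 "$owner" 2>/dev/null; then
    rm -rf "$LOCK_DIR"
    if mkdir "$LOCK_DIR" 2>/dev/null; then
      echo "$$" > "$LOCK_DIR/pid"
      return 0
    fi
  fi
  return 1                                    # loser: serve last-known-good
}

# Owner-checked release: never remove a lock held by another process.
release_lock() {
  [ "$(cat "$LOCK_DIR/pid" 2>/dev/null || true)" = "$$" ] && rm -rf "$LOCK_DIR"
}

# Atomic write: temp file in the same directory, mode 0600, symlink guard,
# non-empty gate, then rename over the cache so readers never see a partial file.
atomic_write() {
  local body="$1" tmp="$CACHE.tmp.$$"
  [ -L "$CACHE" ] && return 1                 # refuse to write through a symlink
  printf '%s' "$body" > "$tmp" && chmod 600 "$tmp" || { rm -f "$tmp"; return 1; }
  if [ -s "$tmp" ]; then
    mv -f "$tmp" "$CACHE"
  else
    rm -f "$tmp"                              # empty body: keep the old cache
    return 1
  fi
}
```

A caller that fails acquire_lock would skip the fetch entirely and render whatever the cache already holds; only the winner fetches, calls atomic_write, and then release_lock.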
Fix: statusline usage counters break in long-running / multi-session use
File: Releases/v5.0.0/.claude/PAI/statusline-command.sh (single file, no new dependencies: bash + jq + coreutils, already required by the script).
TL;DR
The USE: 5HR: n% ↻… │ WEEK: n% ↻… line dims, freezes on stale numbers, or vanishes the longer a session runs and the more concurrent Claude Code sessions are open. It looks like several unrelated glitches; it is one generative defect: the usage line treats one shared, unlocked, non-atomically-written /tmp/pai-usage-$USER.json as the authority for data, presence, and freshness across N concurrent statuslines, and the native rate_limits path inherits its mtime-based staleness.
Remove any single piece and a real failure still remains; that's what makes it the root cause rather than a symptom.
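The three visible states described above (normal, dimmed with a stale badge, hidden) can be pictured as a function of data age, which is what P5 and P6 in the commit message move to. The sketch below is an assumption-laden illustration: usage_freshness and USAGE_STALE_AFTER are hypothetical names, the 1800s stale threshold simply reuses the old 30m figure, and only the 6h USAGE_HARD_EXPIRY comes from the PR itself.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: classify cached usage data by age, not by file mtime.
USAGE_HARD_EXPIRY=$((6 * 3600))   # 6h hard expiry, as stated in the PR
USAGE_STALE_AFTER=1800            # assumption: reuses the old 30m threshold

# usage_freshness FETCHED_AT NOW  ->  prints fresh | stale | expired
# FETCHED_AT is the epoch stamped into the cache when the data was fetched.
usage_freshness() {
  local fetched_at="$1" now="$2" age
  age=$(( now - fetched_at ))
  if [ "$age" -ge "$USAGE_HARD_EXPIRY" ]; then
    echo expired                  # hide the line
  elif [ "$age" -ge "$USAGE_STALE_AFTER" ]; then
    echo stale                    # show last-known-good, dimmed, with a badge
  else
    echo fresh                    # show normally
  fi
}
```

Because the age comes from a timestamp stored with the data, touching or copying the cache file would not change the result, which is the property P6 is after.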
Where it bites (observed: Claude Code 2.1.177, Linux, up to 40 concurrent claude processes)
1. Live data shown stale with a bogus (2h) badge: … (:1338-1350). That file is never written on the native path, so any orphaned/aging /tmp/pai-usage-$USER.json makes fresh native data look old. This is the "breaks in a long-running session" report.
2. Genuine 0% / 0% usage makes the whole line vanish: … (:1300) uses cache-file existence as a proxy for "do we have data": …
3. Counters disappear for all sessions at once under multi-session load: N statuslines share one cache (:22), no lock, non-atomic write (:709 jq … > "$USAGE_CACHE"). At TTL expiry all N fire curl at the 429-prone endpoint in the same ~2s tick → throttled → no refresh → at 1800s the code rms the cache and sets usage_no_data=true (:732-733) → line gone. A reader catching the non-atomic write mid-flight gets truncated JSON → jq fails → garbled tick.
4. … /tmp file under different freshness rules. Closed structurally by the fix (native consults the file for nothing). A single-version test bed never exercises this.
Root cause (5-Whys)
… /tmp cache file's existence and mtime.
The fix (6 changes, all in the one file)
- P1: … usage_source=native|oauth and a tri-state, per-source usage_state (absent/fresh). A jq presence-probe distinguishes "field absent" from "value 0" ((.five_hour.used_percentage // .five_hour.utilization) != null).
- P2: … if usage_source != native. Native rate_limits arrive fresh on stdin every tick, so they never inherit the /tmp cache's age.
- P3: … usage_state != absent, branching by source before any file lookup; no more -f "$USAGE_CACHE" as a data proxy.
- P4: … mkdir mutex (not flock; macOS has no flock(1)), acquired non-blocking so a loser instantly serves last-known-good rather than waiting on the slow endpoint. Plus PID-ownership token, liveness-based reap (kill -0, not a bare timeout, so a slow-but-alive winner is never reaped), owner-checked release (never rmdirs another holder's lock), atomic write (temp in same dir → chmod 600 → symlink guard → mv -f, gated on non-empty temp), and a fail-closed fetch (jq -e '.five_hour' on the raw body, so a 429/HTML body can't overwrite good data).
- P5: … rm-and-vanish cliff. Show last-known-good (dimmed + honest stale badge) until a 6h USAGE_HARD_EXPIRY, then hide. The cache is never deleted; a later successful fetch resumes from it.
- P6: … fetched_at (epoch) into the cache JSON; both the fetch decision and the render staleness compute data age from fetched_at (mtime fallback), so rsync/backup/touch can't fake freshness or staleness.
Sequencing / invariant: P1 → (P2 ∥ P3) → P4 → P5 → P6. P1-P3 resolve the two native-path phenotypes most users see; P4-P5 harden the OAuth path for the multi-session reality. Documented invariant: curl --max-time 3 (total, incl. connect) ≪ the 15s lock-reap floor, so a live fetcher can never be misclassified as a crashed one.
Verification
- Offline (20 assertions, network-free): … HOME + a pre-held lock force the non-blocking-loser path, and several cases assert the cache's fetched_at is unchanged, which is positive proof that no fetch occurred.
- Real-API end-to-end run: … 5h 29 → 35 → 40%), atomically wrote the cache with fetched_at stamped, released the lock with no leak and zero stray temp files.
Accepted residuals (called out so a reviewer doesn't have to rediscover them)
- … fetched_at is OAuth-only). If CC exposed a native data timestamp, P2/P6 would extend to cover this.
- …tmp.$$ orphans only if SIGKILL lands in the microsecond between write and the immediately-following rm, on the successful-fetch path only. A leaked lock self-heals via the liveness reap.
- … USAGE_HARD_EXPIRY the line hides rather than showing a "stale >6h" tombstone, a deliberate product choice; the dim+badge covers the in-between. Easy to change if a tombstone is preferred.
Test matrix the reviewer should care about
A green run does not exonerate this unless it exercises all three production triggers, none of which a default CI hits:
- … flock(1); different stat/date flavors -- the script already branches on these).
- … /tmp file).
Reproduce / verify
Before-fix reproduction (run against an unpatched script)
Demonstrates failures #1 and #2 with synthetic Claude Code statusline JSON on stdin (no network, no credentials):
Failure 1 prints the live line in stale-gray with a bogus (2h); Failure 2 prints <no usage line printed> for genuine 0%/0%.
After-fix verification (20 assertions, network-free)
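To illustrate the kind of network-free check involved, the sketch below runs the P1 presence probe against synthetic payloads, including the genuine-0% case from Failure 2. The jq expression is the one quoted in the fix description; the helper name usage_presence, the "present" label, and the payload shapes are assumptions made for this example, not the actual test suite.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a usage payload carries a 5-hour value.
# Prints "present" when the field exists (even if it is 0), "absent" otherwise.
usage_presence() {
  # jq -e exits 0 only when the last output is neither false nor null.
  # In jq, 0 is truthy, so "0 // x" yields 0 and the probe stays true for 0%.
  if printf '%s' "$1" \
      | jq -e '(.five_hour.used_percentage // .five_hour.utilization) != null' \
        >/dev/null 2>&1; then
    echo present
  else
    echo absent                   # missing field, empty object, or invalid JSON
  fi
}

usage_presence '{"five_hour":{"used_percentage":0}}'   # genuine 0% usage
usage_presence '{"five_hour":{}}'                      # field missing
```

The point of the probe is that the first call counts as data and the second does not, so a render gate built on it no longer confuses "0%" with "nothing to show".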
The P4-P6 suite uses a token-less HOME and pre-held locks so the OAuth fetch is structurally impossible; cases also assert fetched_at is unchanged to prove no fetch fired.
The reproduction + verification scripts are available on request (kept alongside the change; not committed here to avoid adding test fixtures to a release snapshot; happy to include them under whatever path you prefer).
Authored from a real incident: usage counters degrading across ~40 concurrent sessions on Linux. Diagnosed reproduce-first, verified per-fix, and run through two independent adversarial design reviews (the mkdir-vs-flock cross-platform call, the tri-state presence model, and the lock ownership/liveness/fail-closed hardening all came out of those). Glad to split this into smaller commits or adjust the USAGE_HARD_EXPIRY / tombstone behavior to match maintainer preference.