Symptom
Agents fail with "Authentication failed: Invalid or expired API key" even though OAuth credentials are valid. Subsequent claude -p invocations show:
Claude configuration file at /home/automaker/.claude.json is corrupted: JSON Parse error: Unterminated string
This persists across container restarts and resists every recovery attempt — restoring backups, writing fresh config, copying from the host. Every claude invocation reports the file as corrupted within seconds.
Root cause
/home/automaker is a tmpfs capped at 64 MB in the staging compose. The .npm cache fills it (~64M of npm content-addressable cache), leaving zero free bytes. Every subsequent write to /home/automaker/.claude.json then fails with ENOSPC and leaves a truncated file behind:
$ docker exec automaker-server df -h /home/automaker
Filesystem Size Used Avail Use%
tmpfs 64M 64M 0 100%
$ docker exec -i automaker-server sh -c 'cat > /home/automaker/.claude.json' < good-file.json
cat: write error: No space left on device
The Claude CLI reads the truncated file, sees invalid JSON, and reports "corruption" — masking the real disk-full failure.
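The same mechanism explains why manual recovery keeps failing: a shell redirect such as cat > file truncates the target before the first byte is written, so a write that then hits ENOSPC destroys whatever was there. A minimal sketch of a safer write follows, assuming a POSIX shell with mktemp available; the name write_atomically is hypothetical, and this is not a description of how the Claude CLI itself writes its config.

```shell
# Hypothetical helper: write stdin to a temp file beside the target, and
# only replace the target if the whole write succeeded.
write_atomically() {
    target=$1
    # Create the temp file in the same directory so that mv is a rename
    # on the same filesystem rather than a cross-device copy.
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if cat > "$tmp"; then
        mv -f "$tmp" "$target"
    else
        # The write failed (for example with ENOSPC): discard the partial
        # copy and leave the existing target untouched.
        rm -f "$tmp"
        return 1
    fi
}
```

On a full tmpfs the temp-file write fails and the function returns non-zero, but an existing .claude.json is left as it was instead of being replaced by a partial file.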
Impact
- Agents fail dispatch with confusing auth errors (the credentials are fine; the file just got truncated mid-write)
- Auto-mode keeps respawning failed agents, each one further corrupting the shared .claude.json
- "Recovery" by writing a clean file appears to work (cat shows valid JSON briefly) but the next claude invocation fails again
- Diagnosis is difficult: df is the only signal, and it's not in any standard health check
Reproduction
- Run automaker-staging with the published staging compose
- Let auto-mode dispatch ~10–20 agents
- Observe that the /home/automaker tmpfs fills with .npm/_cacache content
- All subsequent claude invocations fail with "corrupted" errors
Suggested fixes
Pick one or layer them:
- Bump the tmpfs size: 64M is far too small with the npm cache living there. 512M or 1G would prevent this entirely.
- Move the .npm cache off tmpfs: set NPM_CONFIG_CACHE=/home/automaker/.cache/npm (which is on a real volume), or to /tmp/npm-cache.
- Move .claude.json to the persistent .claude/ volume: the file lives at /home/automaker/.claude.json (next to .claude/, not inside it). If it lived inside .claude/ it would be on the persistent volume and survive disk pressure.
- Add a startup health check that warns when /home/automaker is >80% full.
- Make agent-failure error messages distinguish between "credentials invalid" and "credentials file unreadable / truncated": currently both manifest as "Invalid or expired API key", which sends operators down the wrong rabbit hole.
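The health-check suggestion could be sketched roughly as follows. This assumes a POSIX shell with df and awk available in the container; the function names parse_use_percent and check_home_usage are hypothetical, and the 80% default simply mirrors the threshold proposed in the list.

```shell
# Hypothetical health check: read `df -P` output on stdin and print the
# usage percentage of the first listed filesystem as a bare integer.
parse_use_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn on stderr and return 1 when the filesystem holding "$1" is more
# than "$2" percent full (threshold defaults to 80). Returns 2 if the
# usage could not be determined.
check_home_usage() {
    path=$1
    threshold=${2:-80}
    used=$(df -P "$path" | parse_use_percent) || return 2
    [ -n "$used" ] || return 2
    if [ "$used" -gt "$threshold" ]; then
        echo "WARNING: $path is ${used}% full (threshold ${threshold}%)" >&2
        return 1
    fi
    return 0
}
```

Inside the container this could run at startup as check_home_usage /home/automaker 80. From the host, piping docker exec automaker-server df -P /home/automaker into parse_use_percent yields the same number.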
Workaround (immediate, applied to ava staging)
docker exec automaker-server find /home/automaker/.npm/_cacache -mindepth 1 -delete
docker exec -i automaker-server sh -c 'cat > /home/automaker/.claude/.credentials.json' < ~/.claude/.credentials.json
This frees ~36M and re-installs OAuth credentials. Agents resume working until the cache fills again.
Discovery
Found while debugging mythxengine MYTHX-4 / MYTHX-5 / MYTHX-6 dispatch failures on 2026-05-06. Three agents in a row failed with auth errors despite a fresh OAuth token.