fix(server): claude.json corruption when /home/automaker tmpfs fills (64M cap) #3564

@mabry1985

Symptom

Agents fail with "Authentication failed: Invalid or expired API key" even though OAuth credentials are valid. Subsequent claude -p invocations show:

Claude configuration file at /home/automaker/.claude.json is corrupted: JSON Parse error: Unterminated string

This persists across container restarts and resists every recovery attempt — restoring backups, writing fresh config, copying from the host. Every claude invocation reports the file as corrupted within seconds.

Root cause

/home/automaker is a tmpfs capped at 64 MB in the staging compose. The .npm cache fills it (~64M of npm content-addressable cache), leaving zero free bytes. Every subsequent write to /home/automaker/.claude.json then fails partway through with ENOSPC ("No space left on device"), leaving a truncated file behind:

$ docker exec automaker-server df -h /home/automaker
Filesystem      Size  Used Avail Use%
tmpfs            64M   64M     0 100%

$ docker exec -i automaker-server sh -c 'cat > /home/automaker/.claude.json' < good-file.json
cat: write error: No space left on device

The Claude CLI reads the truncated file, sees invalid JSON, and reports "corruption" — masking the real disk-full failure.
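The distinction between a truncated file on a full filesystem and a genuinely malformed file can be checked mechanically. The following is a minimal sketch, not part of automaker or the Claude CLI: classify_config, json_is_valid and avail_kb are hypothetical helper names, and it assumes python3 or node is available wherever it runs.

```shell
#!/bin/sh
# Hypothetical helpers (not part of automaker or the Claude CLI).
# classify_config prints one of: ok, missing, disk-full, invalid-json

json_is_valid() {
    # Assumption: python3 or node exists in the environment.
    if command -v python3 >/dev/null 2>&1; then
        python3 -c 'import json, sys; json.load(open(sys.argv[1]))' "$1" >/dev/null 2>&1
    else
        node -e 'JSON.parse(require("fs").readFileSync(process.argv[1], "utf8"))' "$1" >/dev/null 2>&1
    fi
}

avail_kb() {
    # Fourth column of POSIX `df -Pk` output is the available space in KiB.
    df -Pk "$1" 2>/dev/null | awk 'NR==2 {print $4}'
}

classify_config() {
    file=$1
    if [ ! -e "$file" ]; then
        echo missing
        return
    fi
    if json_is_valid "$file"; then
        echo ok
        return
    fi
    # The file does not parse: blame the disk if its filesystem has no free space.
    if [ "$(avail_kb "$(dirname "$file")")" = "0" ]; then
        echo disk-full
    else
        echo invalid-json
    fi
}
```

Run inside the container as classify_config /home/automaker/.claude.json; a result of disk-full points at the tmpfs rather than at the credentials.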

Impact

  • Agents fail dispatch with confusing auth errors (the credentials are fine; the file just got truncated mid-write)
  • Auto-mode keeps respawning failed agents, each one further corrupting the shared .claude.json
  • "Recovery" by writing a clean file appears to work (cat shows valid JSON briefly) but the next claude invocation fails again
  • Diagnosis is difficult — df is the only signal, and it's not in any standard health check

Reproduction

  1. Run automaker-staging with the published staging compose
  2. Let auto-mode dispatch ~10–20 agents
  3. Observe /home/automaker tmpfs fills with .npm/_cacache content
  4. All subsequent claude invocations fail with "corrupted" errors

Suggested fixes

Pick one or layer them:

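  1. Bump the tmpfs size: 64M is far too small with the npm cache living there. 512M or 1G would leave far more headroom, although the npm cache is not size-limited by default, so it could still fill a larger mount over time.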
  2. Move .npm cache off tmpfs — set NPM_CONFIG_CACHE=/home/automaker/.cache/npm (which is on a real volume), or to /tmp/npm-cache.
  3. Move .claude.json to the persistent .claude/ volume — the file lives at /home/automaker/.claude.json (next to .claude/, not inside it). If it lived inside .claude/ it would be on the persistent volume and survive disk pressure.
  4. Add a startup health check that warns when /home/automaker is >80% full.
  5. Make agent-failure error messages distinguish between "credentials invalid" and "credentials file unreadable / truncated" — currently both manifest as "Invalid or expired API key" which sends operators down the wrong rabbit hole.
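The startup health check from fix 4 could look roughly like the sketch below. It is an illustration under stated assumptions rather than existing automaker code: usage_pct and check_usage are hypothetical names, and the 80% default mirrors the threshold suggested above.

```shell
#!/bin/sh
# Hypothetical startup check (not existing automaker code).

usage_pct() {
    # Fifth column of POSIX `df -P` output is the capacity, e.g. "100%".
    df -P "$1" 2>/dev/null | awk 'NR==2 {gsub(/%/, "", $5); print $5}'
}

# Returns 0 when usage is at or below the threshold, 1 when above it,
# and 2 when usage could not be determined.
check_usage() {
    dir=$1
    threshold=${2:-80}
    pct=$(usage_pct "$dir")
    case $pct in
        ''|*[!0-9]*) echo "WARN: cannot read usage for $dir" >&2; return 2 ;;
    esac
    if [ "$pct" -gt "$threshold" ]; then
        echo "WARN: $dir is ${pct}% full (threshold ${threshold}%)" >&2
        return 1
    fi
    return 0
}
```

Calling check_usage /home/automaker at container start would surface the disk-full condition before any agent hits the misleading auth error.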

Workaround (immediate, applied to ava staging)

docker exec automaker-server find /home/automaker/.npm/_cacache -mindepth 1 -delete
docker exec -i automaker-server sh -c 'cat > /home/automaker/.claude/.credentials.json' < ~/.claude/.credentials.json

This frees ~36M and re-installs OAuth credentials. Agents resume working until the cache fills again.
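Because cat > file truncates the target before writing, a failed write on a full tmpfs destroys whatever was there. A variation on the workaround, which is a different technique from the commands above, is to write to a temporary file in the same directory and rename it only if the write fully succeeded. A minimal sketch, with safe_replace as a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: replace a file only if the new content was fully written.
# Reads the new content from stdin.
safe_replace() {
    target=$1
    tmp="$target.tmp.$$"
    # cat exits non-zero on a write error such as ENOSPC; -s rejects an empty result.
    if cat > "$tmp" && [ -s "$tmp" ]; then
        mv -f "$tmp" "$target"
    else
        rm -f "$tmp"
        echo "write to $target failed; existing file left untouched" >&2
        return 1
    fi
}
```

The function would have to run inside the container (for example from a script copied into the image), with the credentials piped in through docker exec -i as in the workaround.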

Discovery

Found while debugging mythxengine MYTHX-4 / MYTHX-5 / MYTHX-6 dispatch failures on 2026-05-06. Three agents in a row failed with auth errors despite a fresh OAuth token.

Labels: bug, status: needs-triage
