Skip to content

[dogfooding] Aegis must surface/auto-recover when CC /goal loop halts on transient 529 (upstream anthropics/claude-code#69975) #4802

Description

@OneStepAt4time

Summary

Upstream CC bug anthropics/claude-code#69975: when an autonomous /goal run hits a transient API Error: 529 Overloaded, the loop ends immediately and does not self-resume. The user's run sits idle indefinitely (4h+ observed) until a human types continue.

This is a critical dogfooding risk for Aegis: the target user is a solo developer running 1–100 CC agents on their own machine/self-hosted box. If a /goal agent parks on a 529 for 4h overnight, the dashboard reads "alive" while the underlying goal is dead. Aegis is the layer that should detect and surface (or auto-recover) this state.

Source

Impact on Aegis users

Scenario Without this fix With this fix
Overnight /goal run + transient 529 Run appears alive, agent is parked for 4h+ Dashboard shows "stalled on 529 since 04:35" or auto-resumes
Solo dev checks dashboard over coffee Sees "1 active", assumes progress Sees red badge, knows to intervene
Aegis reliability claim "Stupid but powerful" — depends on underlying CC being reliable Aegis adds its own reliability layer over CC quirks

Acceptance criteria (binding)

  1. Detection: Aegis identifies when a CC session has not produced output for ≥X minutes AND the last error event in the transcript was a 5xx from upstream (529/503/etc.). Threshold to be defined (suggest 10 min for /goal-flagged sessions, 30 min for normal).
  2. Visibility: Dashboard shows a stalled-session badge with the originating error class. Status reads "stalled" not "active". SSE event fires for the dashboard.
  3. Recovery path: Either (a) auto-send continue once per stall event with backoff (Aegis-owned retry), OR (b) surface a "Send continue" action that the user can one-click from the dashboard. (a) is the Aegis-added-reliability story; (b) is the minimum viable surface. Pick (a) if /goal sessions are reliably idempotent at the continue prompt.
  4. Telemetry: The stall event is captured in the per-session metrics (fix(api): record session completion from hooks, make metrics idempotent (#3427) #3457 surface) so dashboards and future regression tests can see the pattern.

Out of scope

Lane

Backend / platform. Suggested lead: <@1494469941074591924> (Hermes) — same lane as #4755 (per-model timeout + cron chain fallback) because both are about Aegis providing reliability over CC failure modes. Hep can pair on the per-session detection logic.

Filing note

Filed per Athena auto-triage rule (Boss directive 2026-05-26): upstream bug finding → Aegis exposure triage, no batching for "tomorrow". Acceptance criteria + lane assigned so it meets Definition of Ready for the queue.

PM (Athena) — 2026-06-22 07:46 Rome. Will check back next heartbeat.

Metadata

Metadata

Labels

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions