[dogfooding] Aegis must surface/auto-recover when CC /goal loop halts on transient 529 (upstream anthropics/claude-code#69975)

## Summary

Upstream CC bug `anthropics/claude-code#69975`: when an autonomous `/goal` run hits a transient `API Error: 529 Overloaded`, the loop ends immediately and **does not self-resume**. The user's run sits idle indefinitely (4h+ observed) until a human types `continue`.

This is a **critical dogfooding risk for Aegis**: the target user is a solo developer running 1–100 CC agents on their own machine/self-hosted box. If a `/goal` agent parks on a 529 for 4h overnight, the dashboard reads "alive" while the underlying goal is dead. Aegis is the layer that should detect and surface (or auto-recover) this state.

## Source

- **Upstream:** https://github.com/anthropics/claude-code/issues/69975 (filed 2026-06-22)
- **Aegis reference:** #4683 (endurance test — explicitly tests unattended CC runs through Aegis)

## Impact on Aegis users

| Scenario | Without this fix | With this fix |
|---|---|---|
| Overnight `/goal` run + transient 529 | Run appears alive, agent is parked for 4h+ | Dashboard shows "stalled on 529 since 04:35" or auto-resumes |
| Solo dev checks dashboard over coffee | Sees "1 active", assumes progress | Sees red badge, knows to intervene |
| Aegis reliability claim | "Stupid but powerful" — depends on underlying CC being reliable | Aegis adds its own reliability layer over CC quirks |

## Acceptance criteria (binding)

1. **Detection:** Aegis identifies when a CC session has not produced output for ≥X minutes AND the last error event in the transcript was a 5xx from upstream (529/503/etc.). Threshold to be defined (suggest 10 min for /goal-flagged sessions, 30 min for normal).
2. **Visibility:** Dashboard shows a stalled-session badge with the originating error class. Status reads "stalled" not "active". SSE event fires for the dashboard.
3. **Recovery path:** Either (a) auto-send `continue` once per stall event with backoff (Aegis-owned retry), OR (b) surface a "Send continue" action that the user can one-click from the dashboard. (a) is the Aegis-added-reliability story; (b) is the minimum viable surface. Pick (a) if `/goal` sessions are reliably idempotent at the continue prompt.
4. **Telemetry:** The stall event is captured in the per-session metrics (#3457 surface) so dashboards and future regression tests can see the pattern.

## Out of scope

- Fixing the upstream CC behavior (track separately at anthropics/claude-code#69975)
- Auto-retry on non-`/goal` interactive sessions (lower priority, lower dogfooding impact)
- Cross-session orchestration rollback (separate scope)

## Lane

**Backend / platform.** Suggested lead: <@1494469941074591924> (Hermes) — same lane as #4755 (per-model timeout + cron chain fallback) because both are about Aegis providing reliability over CC failure modes. Hep can pair on the per-session detection logic.

## Filing note

Filed per Athena auto-triage rule (Boss directive 2026-05-26): upstream bug finding → Aegis exposure triage, no batching for "tomorrow". Acceptance criteria + lane assigned so it meets Definition of Ready for the queue.

PM (Athena) — 2026-06-22 07:46 Rome. Will check back next heartbeat.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[dogfooding] Aegis must surface/auto-recover when CC /goal loop halts on transient 529 (upstream anthropics/claude-code#69975) #4802

Summary

Source

Impact on Aegis users

Acceptance criteria (binding)

Out of scope

Lane

Filing note

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Scenario	Without this fix	With this fix
Overnight `/goal` run + transient 529	Run appears alive, agent is parked for 4h+	Dashboard shows "stalled on 529 since 04:35" or auto-resumes
Solo dev checks dashboard over coffee	Sees "1 active", assumes progress	Sees red badge, knows to intervene
Aegis reliability claim	"Stupid but powerful" — depends on underlying CC being reliable	Aegis adds its own reliability layer over CC quirks

Uh oh!

[dogfooding] Aegis must surface/auto-recover when CC /goal loop halts on transient 529 (upstream anthropics/claude-code#69975) #4802

Description

Summary

Source

Impact on Aegis users

Acceptance criteria (binding)

Out of scope

Lane

Filing note

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions