
Decouple agent-session lifetime from daemon lifetime (survive daemon restart/upgrade) #2335

Description

@harshitsinghbhandari

Note (2026-07-02): original framing corrected after tracing the runtime path. The daemon is already disposable and sessions are already durable in current code; the "build a cross-platform fd-inheriting PTY host" plan is unnecessary. This issue is narrowed to the small remaining gaps. Original problem/solution text preserved at the bottom for history.

Current reality

The premise the original issue was built on (on Unix the daemon holds the PTY master *os.File in-process, so daemon death kills every agent) does not match the code. The runtime is out-of-process on both platforms, agents survive the daemon's death, and boot adopts surviving agents.

| Concern | Where | Verdict |
| --- | --- | --- |
| Unix runtime is tmux, not a daemon-held PTY | runtimeselect.go:34 (tmux.New) | Agent lives in the tmux server, a separate long-lived process. |
| Only daemon-side fd on Unix is an ephemeral attach client | tmux.go:239 (ptyexec.Spawn of tmux attach-session) | Daemon death kills the attach client, not the agent. |
| Windows host spawned detached, survives daemon exit | conpty/spawn_windows.go:54 (DETACHED_PROCESS \| CREATE_NEW_PROCESS_GROUP) | Comment literally: "so the host survives daemon exit." |
| Windows host addr+pid persisted for recovery | conpty/ptyregistry/registry.go + conpty/runtime.go:196 | New daemon re-resolves the session from the registry. |
| Graceful shutdown does NOT tear down sessions | daemon/daemon.go:174 | "We deliberately do NOT tear down sessions here… next boot's Reconcile adopts them"; teardown-on-shutdown is a compile error. |
| Boot adopts surviving agents | daemon/daemon.go:151 -> session_manager/manager.go:658 reconcileLive, adopt at :669 | Alive -> no-op adopt. Dead -> capture work + relaunch. |
| Dead-runtime work preserved | manager.go:675 StashUncommitted | Uncommitted files captured into a preserve ref before relaunch. |

In all three original failure modes (daemon crash/restart, graceful shutdown, upgrade to a new binary), agents already stay alive and are adopted on the next boot.
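The "Boot adopts surviving agents" and "Dead-runtime work preserved" rows describe a two-branch decision per session. A minimal sketch of that decision follows, assuming hypothetical `Runtime` and `Preserver` interfaces and a hypothetical `reconcileOne` helper; only the name `StashUncommitted` comes from the table, and its signature here is a guess, not the real one in manager.go.

```go
package main

import "fmt"

// Runtime is a hypothetical stand-in for the out-of-process runtime
// (tmux on Unix, the detached conpty host on Windows).
type Runtime interface {
	Alive(sessionID string) bool
	Relaunch(sessionID string) error
}

// Preserver is a hypothetical stand-in for the capture step. The method
// name matches the table; the signature is assumed.
type Preserver interface {
	StashUncommitted(sessionID string) error
}

// Action records which branch reconciliation took for a session.
type Action string

const (
	Adopted    Action = "adopted"
	Relaunched Action = "relaunched"
)

// reconcileOne applies the rule from the table:
// alive -> no-op adopt; dead -> capture work, then relaunch.
func reconcileOne(rt Runtime, p Preserver, id string) (Action, error) {
	if rt.Alive(id) {
		return Adopted, nil // nothing to do, the agent never stopped
	}
	// Capture uncommitted work before anything overwrites it.
	if err := p.StashUncommitted(id); err != nil {
		return "", fmt.Errorf("preserve %s: %w", id, err)
	}
	if err := rt.Relaunch(id); err != nil {
		return "", fmt.Errorf("relaunch %s: %w", id, err)
	}
	return Relaunched, nil
}

// fakeEnv is an in-memory test double implementing both interfaces.
type fakeEnv struct {
	alive map[string]bool
	log   []string
}

func (f *fakeEnv) Alive(id string) bool { return f.alive[id] }

func (f *fakeEnv) Relaunch(id string) error {
	f.log = append(f.log, "relaunch:"+id)
	f.alive[id] = true
	return nil
}

func (f *fakeEnv) StashUncommitted(id string) error {
	f.log = append(f.log, "stash:"+id)
	return nil
}

func main() {
	env := &fakeEnv{alive: map[string]bool{"a": true, "b": false}}
	for _, id := range []string{"a", "b"} {
		act, err := reconcileOne(env, env, id)
		fmt.Println(id, act, err)
	}
	fmt.Println(env.log)
}
```

The ordering in the dead branch is the point of the sketch: the stash runs before the relaunch, so a failed capture aborts the relaunch instead of silently discarding work.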

Remaining work

  1. No per-session crash circuit breaker. Confirmed absent (only attach backoff exists in terminal/attachment.go). A conpty host that crashes on startup could respawn-storm. Add a rolling-window breaker (e.g. 3 crashes / 60s -> stop respawning, surface a typed error, require explicit retry). Practical exposure is low today, so this is a guardrail, not a fire.

  2. No automated proof that durability holds (do first). No test kills the daemon and asserts that the agent survives and is adopted, and there is no upgrade test. Add: kill-daemon-agent-survives-adopt (Unix, plus Windows ptyregistry recovery) and graceful ao stop && ao start. If both are green, the bulk of this issue is closeable as already-done.

  3. Terminal scrollback continuity on re-attach (minor UX). On restart the tmux attach client dies and re-attach redraws only the current screen; there is no daemon-side replay ring (terminal/attachment.go:36; tmux owns scrollback). Decide whether a bounded ring is worth adding or the current behavior is acceptable as-is.
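For item 1, the rolling-window breaker could look roughly like the sketch below. All names (`CrashBreaker`, `ErrCrashLoop`, `RecordCrash`, `Reset`, `simulateCrashes`) are hypothetical and do not exist in the codebase; the 3 crashes / 60s numbers are the example values from the item. The clock is passed in explicitly so the behavior can be tested without sleeping.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrCrashLoop is a hypothetical typed error surfaced once the breaker trips.
var ErrCrashLoop = errors.New("session crash loop: respawn suspended until explicit retry")

// CrashBreaker is a hypothetical per-session rolling-window circuit breaker.
type CrashBreaker struct {
	limit   int           // crashes within the window that trip the breaker
	window  time.Duration // length of the rolling window
	crashes []time.Time   // crash timestamps still inside the window
	tripped bool
}

func NewCrashBreaker(limit int, window time.Duration) *CrashBreaker {
	return &CrashBreaker{limit: limit, window: window}
}

// RecordCrash notes a crash at time now. It returns nil if a respawn is
// still allowed and ErrCrashLoop once the limit is reached in the window.
func (b *CrashBreaker) RecordCrash(now time.Time) error {
	if b.tripped {
		return ErrCrashLoop // stays open until Reset is called
	}
	// Drop crashes that have aged out of the window (in-place filter).
	cutoff := now.Add(-b.window)
	kept := b.crashes[:0]
	for _, ts := range b.crashes {
		if ts.After(cutoff) {
			kept = append(kept, ts)
		}
	}
	b.crashes = append(kept, now)
	if len(b.crashes) >= b.limit {
		b.tripped = true
		return ErrCrashLoop
	}
	return nil
}

// Reset is the explicit retry: it clears history and closes the breaker.
func (b *CrashBreaker) Reset() {
	b.crashes = b.crashes[:0]
	b.tripped = false
}

// simulateCrashes feeds crashes at the given second offsets into a fresh
// breaker and reports, per crash, whether a respawn was still allowed.
func simulateCrashes(limit, windowSec int, offsetsSec ...int) []bool {
	b := NewCrashBreaker(limit, time.Duration(windowSec)*time.Second)
	base := time.Unix(0, 0)
	allowed := make([]bool, 0, len(offsetsSec))
	for _, s := range offsetsSec {
		err := b.RecordCrash(base.Add(time.Duration(s) * time.Second))
		allowed = append(allowed, err == nil)
	}
	return allowed
}

func main() {
	// Three crashes inside 60s: the third one trips the breaker.
	fmt.Println(simulateCrashes(3, 60, 0, 10, 20))
	// Crashes spread 40s apart never put three inside one window.
	fmt.Println(simulateCrashes(3, 60, 0, 40, 80))
}
```

A tripped breaker stays open even after the window would have expired, which matches the "require explicit retry" wording; a variant that auto-closes after a cooldown would be a different policy.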

Explicitly out of scope

  • Extracting a cross-platform PTY host / manifest. Unix already has one (tmux); Windows already has one (detached conpty host + ptyregistry).
  • fd-inheritance handoff (SCM_RIGHTS / ExtraFiles / ack-before-release). The runtime never lives in the daemon, so upgrade = restart + adopt; there are no live fds to transplant.

Reintroduce either only if a concrete, tested failure shows adoption is insufficient.


Original issue text (superseded, kept for history)

Problem

An agent's life is currently chained to the daemon's life. On Unix (adapters/runtime/ptyexec/spawn_unix.go) the daemon holds the PTY master *os.File in-process, so when the daemon stops for any reason, every running agent dies with it, even though nothing was wrong with the agents.

The daemon is the component that restarts most often, and it's a single point of death for the entire fleet it exists to supervise. Concretely this bites in three ways:

  • Shipping a release is destructive. A user with N agents mid-task updates AO -> the daemon restarts to load the new binary -> all N agents die.
  • No bulkhead. A single panic/OOM anywhere in the daemon takes down all agents, including the healthy ones.
  • Long jobs evaporate. A machine reboot, an Electron-shell bounce, or ao stop wipes a 40-minute in-flight task at minute 38.

We already solve half of this on Windows: adapters/runtime/conpty/host_main.go runs the PTY in a separate host process. The original proposal was to generalize that pattern with an adopt-don't-respawn supervisor (Phase 2) and fd-inheritance handoff (Phase 3). See the reassessment above: adoption already exists and the fd-host generalization is unnecessary.
