Decouple agent-session lifetime from daemon lifetime (survive daemon restart/upgrade) #2335

> **Note (2026-07-02):** original framing corrected after tracing the runtime path. The daemon is already disposable and sessions are already durable in current code; the "build a cross-platform fd-inheriting PTY host" plan is unnecessary. This issue is narrowed to the small remaining gaps. Original problem/solution text preserved at the bottom for history.

## Current reality

The premise the original issue was built on (on Unix the daemon holds the PTY master `*os.File` in-process, so daemon death kills every agent) does not match the code. The runtime is out-of-process on both platforms, agents survive the daemon's death, and the next boot adopts the survivors.

| Concern | Where | Verdict |
|---|---|---|
| Unix runtime is tmux, not a daemon-held PTY | `runtimeselect.go:34` (`tmux.New`) | Agent lives in the tmux server, a separate long-lived process. |
| Only daemon-side fd on Unix is an ephemeral attach client | `tmux.go:239` (`ptyexec.Spawn` of `tmux attach-session`) | Daemon death kills the attach client, not the agent. |
| Windows host spawned detached, survives daemon exit | `conpty/spawn_windows.go:54` (`DETACHED_PROCESS \| CREATE_NEW_PROCESS_GROUP`) | Comment literally: "so the host survives daemon exit." |
| Windows host addr+pid persisted for recovery | `conpty/ptyregistry/registry.go` + `conpty/runtime.go:196` | New daemon re-resolves the session from the registry. |
| Graceful shutdown does NOT tear down sessions | `daemon/daemon.go:174` | "We deliberately do NOT tear down sessions here… next boot's Reconcile adopts them"; teardown-on-shutdown is a compile error. |
| Boot adopts surviving agents | `daemon/daemon.go:151` -> `session_manager/manager.go:658` `reconcileLive`, adopt at `:669` | Alive -> no-op adopt. Dead -> capture work + relaunch. |
| Dead-runtime work preserved | `manager.go:675` `StashUncommitted` | Uncommitted files captured into a preserve ref before relaunch. |

All three original failure modes (daemon crash/restart, graceful shutdown, upgrade to a new binary) already keep agents alive and adopt them on the next boot.
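
The adopt-or-relaunch decision in the table can be pictured with a minimal sketch. This is not the code in `session_manager/manager.go`; the `hooks` struct, its fields, and the `reconcile` function are hypothetical names invented for illustration. It only shows the ordering the table describes: a live runtime is adopted without side effects, and a dead one has its uncommitted work stashed before it is relaunched.

```go
package main

import "fmt"

// hooks is a hypothetical seam standing in for the real runtime and
// workspace calls; none of these names come from the actual codebase.
type hooks struct {
	alive    func(id string) bool  // is the out-of-process runtime still running?
	stash    func(id string) error // capture uncommitted work into a preserve ref
	relaunch func(id string) error // start a fresh runtime for the session
}

// reconcile walks the persisted session IDs at boot. Live sessions are
// adopted as a no-op; dead sessions are stashed first, then relaunched.
func reconcile(ids []string, h hooks) ([]string, []string, error) {
	var adopted, relaunched []string
	for _, id := range ids {
		if h.alive(id) {
			adopted = append(adopted, id) // nothing to do: the agent kept running
			continue
		}
		// Stash before relaunch so a relaunch cannot clobber uncommitted work.
		if err := h.stash(id); err != nil {
			return adopted, relaunched, fmt.Errorf("stash %s: %w", id, err)
		}
		if err := h.relaunch(id); err != nil {
			return adopted, relaunched, fmt.Errorf("relaunch %s: %w", id, err)
		}
		relaunched = append(relaunched, id)
	}
	return adopted, relaunched, nil
}

func main() {
	live := map[string]bool{"s1": true, "s2": false}
	h := hooks{
		alive:    func(id string) bool { return live[id] },
		stash:    func(id string) error { fmt.Println("stash", id); return nil },
		relaunch: func(id string) error { fmt.Println("relaunch", id); return nil },
	}
	adopted, relaunched, err := reconcile([]string{"s1", "s2"}, h)
	fmt.Println(adopted, relaunched, err)
}
```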

## Remaining work

1. **No per-session crash circuit breaker.** Confirmed absent (only attach backoff exists in `terminal/attachment.go`). A conpty host that crashes on startup could respawn-storm. Add a rolling-window breaker (e.g. 3 crashes / 60s -> stop respawning, surface a typed error, require explicit retry). Practical exposure is low today, so this is a guardrail, not a fire.

2. **No automated proof that durability holds (do first).** No test kills the daemon and asserts that the agent survives and is adopted, and there is no upgrade test. Add: kill-daemon-agent-survives-adopt (Unix, plus Windows `ptyregistry` recovery) and graceful `ao stop && ao start`. If green, the bulk of this issue is closeable as already-done.

3. **Terminal scrollback continuity on re-attach (minor UX).** On restart the `tmux attach` client dies and re-attach redraws the current screen; there is no daemon-side replay ring (`terminal/attachment.go:36`; tmux owns scrollback). Decide whether a bounded ring is worth it or acceptable as-is.

## Explicitly out of scope

- Extracting a cross-platform PTY host / manifest. Unix already has one (tmux); Windows already has one (detached conpty host + `ptyregistry`).
- fd-inheritance handoff (`SCM_RIGHTS` / `ExtraFiles` / ack-before-release). The runtime never lives in the daemon, so upgrade = restart + adopt; there are no live fds to transplant.

Reintroduce either only if a concrete, tested failure shows adoption is insufficient.

---

<details>
<summary>Original issue text (superseded, kept for history)</summary>

## Problem

An agent's life is currently chained to the daemon's life. On Unix (`adapters/runtime/ptyexec/spawn_unix.go`) the daemon holds the PTY master `*os.File` **in-process**, so when the daemon stops for any reason, every running agent dies with it, even though nothing was wrong with the agents.

The daemon is the component that restarts most often, and it's a single point of death for the entire fleet it exists to supervise. Concretely this bites in three ways:

- **Shipping a release is destructive.** A user with N agents mid-task updates AO -> the daemon restarts to load the new binary -> all N agents die.
- **No bulkhead.** A single panic/OOM anywhere in the daemon takes down all agents, including the healthy ones.
- **Long jobs evaporate.** A machine reboot, an Electron-shell bounce, or `ao stop` wipes a 40-minute in-flight task at minute 38.

We already solve half of this on Windows: `adapters/runtime/conpty/host_main.go` runs the PTY in a separate host process. The original proposal was to generalize that pattern with an adopt-don't-respawn supervisor (Phase 2) and fd-inheritance handoff (Phase 3). See the reassessment above: adoption already exists and the fd-host generalization is unnecessary.

</details>

