fix(supervisor): exit container when last tmux session ends#388
Merged
Conversation
Stop containers automatically when all agent sessions exit, eliminating the need for `jackin eject` after a clean session. - supervisor.sh now monitors /tmp/tmux-$(id -u)/default with a 1-second poll loop; exits 0 when the socket disappears (last session ended), exits 1 if no socket appears within a 60-second startup grace period - launch.rs: when docker exec returns with the container still Running, call inspect_agent_sessions; if no sessions remain treat it as Stopped/0 and run the full DinD/network/certs teardown path immediately - Roadmap: mark jackin-container-binary as Partially implemented; update Phase 1 to describe the bash approach and defer the Rust binary to Phase 2 Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
Scope the roadmap item to its long-term goal: jackin-container evolves from bash supervisor → Rust PID 1 → in-container multiplexer server that replaces tmux entirely. Phase 3 adds PTY session management, agent-state inference (working / blocked / done / idle via PTY output), and a full session control API (session.create, session.kill, session.title, session.attach, events). Phase 4 connects the multiplexer to the daemon and desktop app. References Herdr (herdr.dev) as design prior art for agent-state inference and Unix socket semantics. No Herdr source reused (AGPL-3.0 conflict). Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
… item Consolidate herdr-research.mdx into jackin-container-binary.mdx — the research informs the multiplexer design directly, so a separate page adds navigation overhead without clarity benefit. The merged item now covers: - Current state: tmux-based minimal approach, why it's temporary - Research motivation: Herdr validated agent-aware multiplexer concept - What to implement independently (two-stage done/idle, PTY inference, workspace roll-up, notification suppression, socket wait semantics) - What not to borrow (AGPL-3.0 license conflict, architecture mismatch) - Full 4-phase implementation plan through desktop app bridge Move sidebar entry from Infrastructure to Reactive daemon program; update agent-orchestrator-research to point to the combined item. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
Three match arms (Running+no-sessions, Stopped/0, NotFound) shared the same write_instance_status(CleanExited)+cleanup.run() pattern with an inline Preserved guard. Extract into a named helper to remove duplication and make the shared semantics explicit. Also name the sessions-empty condition (`no_sessions`) for clarity, and drop the now-unnecessary `#[allow(clippy::match_same_arms)]` annotation (clippy does not flag the Stopped/0 and NotFound arms as same since their patterns are semantically distinct). Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
…improve comments Three review findings addressed: - Running+no-sessions+Preserved: guard with !is_preserved so container and DinD stay alive for `jackin hardline` reconnect. Previously run_clean_exit_teardown fired unconditionally, removing the container before the operator could address preserved isolation worktrees. - AgentSessionInventory::Unavailable: emit debug_log with the reason string instead of silently treating it as sessions-present. - supervisor.sh exit-1: write a diagnostic to stderr naming the missing socket path before exiting so diagnose_premature_exit log capture includes supervisor context. Also fix comment accuracy: "once Phase 2 justifies overhead" → "will be removed in Phase 2"; remove ≤1 s polling-specific bound from match classifier comment; add NotFound/InspectUnavailable arms to the comment; tighten run_clean_exit_teardown doc comment. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
The helper was taking FinalizeDecision and re-deriving the Preserved check internally, duplicating the matches! expression already computed as `is_preserved` at the call site. Passing the bool removes the redundancy and removes run_clean_exit_teardown's dependency on FinalizeDecision. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
The NotFound+Preserved guard arm was silently empty — a container removed externally while finalization was still running left a stale preserved-status record on disk with zero diagnostic trace. Add a debug_log so --debug output captures the anomaly. Also fix the match classifier comment: "NotFound → Treat as Stopped/0" was wrong for the Preserved case. Split into two bullets making the distinct behaviors explicit. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
`[ -S socket ]` only checks if the socket file exists. When tmux crashes or is killed without cleanup, it leaves a stale socket on disk — the file-existence check returns true forever and the supervisor never exits. Replace the monitoring loop condition with `tmux list-sessions &>/dev/null` which probes the server itself: if the socket is stale or the server has exited, the command fails and the supervisor exits 0, triggering host-side cleanup. The grace-period check (waiting for the socket file to appear) keeps the plain file test — at that stage the server hasn't started yet so list-sessions would always fail. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
When the docker exec returns but the container is still Running, the old
code immediately returned Preserved, treating it the same as a detach.
This was incorrect for the common case where the agent exited cleanly:
the container stays Running for up to 1 s while the supervisor's poll
loop detects the tmux server exit and stops the container.
The symptom: DinD sidecars and networks left running after the agent
exits cleanly because cleanup was disarmed (is_preserved=true).
Fix: when exit_code=None and not OOM-killed, probe the container for live
tmux sessions before deciding:
- sessions present → real detach (Ctrl-B D), preserve as before
- no sessions → supervisor lag after clean agent exit, fall through
to finalize_clean_exit so isolation worktrees are
swept and the container/DinD/network are torn down
Add has_tmux_sessions() helper that runs the same docker exec command as
inspect_agent_sessions but returns a plain bool. Returns false on exec
failure (container stopping) so the clean-exit path fires rather than
silently preserving a terminating container.
Update tests:
- still_running_preserves_records → split into two tests: one with a
session name in the queue (real detach → Preserved) and one with an
empty response (supervisor lag → Cleaned)
- load_agent_writes_instance_manifest: expected status changes from
restore_available to clean_exited, reflecting the correct post-exit
teardown when no sessions remain
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
Previously DinD was kept alive in two cases to support jackin hardline:
- Stopped with non-zero exit (crash)
- Running + no sessions + Preserved (dirty worktrees)
Neither case is necessary: isolation worktrees live on the host filesystem
and are accessible without DinD. Keeping DinD alive only created orphaned
sidecars when the operator exited the agent.
New rule: tear down DinD/network whenever the container is no longer
running with active sessions. The only case that keeps DinD alive is
Running + sessions present (real detach via Ctrl-B D — operator can
reconnect via jackin hardline).
Changes:
- Running + no sessions: drop the `&& !is_preserved` guard so teardown
fires regardless of preserved isolation state
- Stopped (non-zero / OOM): replace cleanup.disarm() with cleanup.run()
- NotFound + Preserved: add cleanup.run() after the anomaly log
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
When finalize_foreground_session returns Cleaned (!is_preserved), it already confirmed no tmux sessions exist via has_tmux_sessions. If inspect_container_state then shows Running (supervisor lag), calling inspect_agent_sessions immediately repeats the identical docker exec against a container where sessions are already known to be absent. Short-circuit: when !is_preserved in the Running branch, run teardown directly. Only re-query when is_preserved (finalize saw sessions at check-time and needs to detect sessions that may have ended in the interval before this inspect fired). Also reword the Unavailable debug message from "treating as sessions-present" to "treating conservatively as sessions-present" to clarify that the outcome is a fallback, not a positive detection. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
47eccc9 to
b6b92e5
Compare
The Phase 1 "What shipped" section described socket-file-disappear
monitoring, but the supervisor now polls tmux list-sessions instead (more
robust to stale sockets). Also missing were the two companion Rust fixes
that make teardown actually fire:
- finalize.rs: has_tmux_sessions check to distinguish supervisor lag
from a real detach
- launch.rs: always tear down DinD on exit; skip redundant session
re-query on the clean-exit path
Update the Problem section's first bullet to note it is resolved in
Phase 1. Update Phase 1 status and heading to reflect the full scope.
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
- has_tmux_sessions: map_or(false, ...) → is_ok_and (clippy::unnecessary-map-or)
- Running branch: if !is_preserved { A } else { B } → if is_preserved { B } else { A }
(clippy::if-not-else)
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Containers now stop automatically when the last agent session exits, eliminating the manual
jackin ejectstep. Three fixes work together: the bash supervisor (supervisor.sh) pollstmux list-sessionsrather than watching the socket file disappear, which makes it robust to stale sockets left when tmux crashes;finalize_foreground_sessioninfinalize.rsnow probes for live sessions when the container is stillRunningafterdocker execreturns so it treats an empty session list as supervisor lag (not a detach) and falls through tofinalize_clean_exit; andlaunch.rsnow always tears down the DinD sidecar and Docker network when the container has exited — crash exits included — rather than preserving them for ajackin hardlinereconnect that cannot work without a live DinD. The only case that keeps DinD alive is a real detach: container stillRunningAND sessions confirmed present. Thejackin-container-binaryroadmap item is also expanded to its full multiplexer server vision (Phases 2–4) and the Herdr research item is merged into it.Hard rule: never mutate the host machine silently
All changes are inside the container supervisor script and the host-side Rust cleanup logic. No new writes to host files, git config, remotes, or any other host-side state.
What's deferred (follow-up PRs)
jackin hardlinereconnect path andspawn_agent_sessionpath do not auto-cleanup DinD/network after clean container exit — those call sites have noLoadCleanupequivalentjackin-containerbinary withinotify-based socket watching (roadmap Phase 2)Verify locally
Checkout
Paste this first to bypass the
tirithpaste scanner for the rest of the session:export TIRITH=0Then paste the checkout block:
Static checks
Tests
User smoke
Launch a role, enter an agent session, then exit the agent cleanly (e.g.
/exitin Claude Code). Within 1–2 seconds,docker psshould show no containers for the role — the role container, DinD sidecar, certs volume, and network are all removed automatically.To confirm detach still works: use
Ctrl-B Dto detach from tmux instead of exiting — the container should remain running and be reconnectable viajackin hardline.Documentation
cd docs bun install --frozen-lockfile bun run devAstro serves at
http://localhost:4321/. Pages to walk:http://localhost:4321/reference/roadmap/jackin-container-binary/
Updated Phase 1 section now describes the
tmux list-sessionspolling approach, the stale-socket fix, and the companionfinalize.rs/launch.rsfixes that make teardown actually fire.http://localhost:4321/reference/roadmap/
Confirm
jackin-containerappears in the Partially implemented section.