fix(supervisor): exit container when last tmux session ends by donbeave · Pull Request #388 · jackin-project/jackin

donbeave · 2026-05-18T17:42:29Z

Summary

Containers now stop automatically when the last agent session exits, eliminating the manual jackin eject step. Three fixes work together: the bash supervisor (supervisor.sh) polls tmux list-sessions rather than watching the socket file disappear, which makes it robust to stale sockets left when tmux crashes; finalize_foreground_session in finalize.rs now probes for live sessions when the container is still Running after docker exec returns so it treats an empty session list as supervisor lag (not a detach) and falls through to finalize_clean_exit; and launch.rs now always tears down the DinD sidecar and Docker network when the container has exited — crash exits included — rather than preserving them for a jackin hardline reconnect that cannot work without a live DinD. The only case that keeps DinD alive is a real detach: container still Running AND sessions confirmed present. The jackin-container-binary roadmap item is also expanded to its full multiplexer server vision (Phases 2–4) and the Herdr research item is merged into it.

Hard rule: never mutate the host machine silently

All changes are inside the container supervisor script and the host-side Rust cleanup logic. No new writes to host files, git config, remotes, or any other host-side state.

What's deferred (follow-up PRs)

jackin hardline reconnect path and spawn_agent_session path do not auto-cleanup DinD/network after clean container exit — those call sites have no LoadCleanup equivalent
Rust jackin-container binary with inotify-based socket watching (roadmap Phase 2)
In-container PTY multiplexer replacing tmux (roadmap Phase 3)

Verify locally

Checkout

Paste this first to bypass the tirith paste scanner for the rest of the session:

export TIRITH=0

Then paste the checkout block:

mkdir -p "$HOME/Projects/jackin-project/test"
cd "$HOME/Projects/jackin-project/test"

if [ ! -d jackin/.git ]; then
  git clone https://github.com/jackin-project/jackin.git
fi

cd jackin
mise trust
git fetch -f origin jackin/scratch/jk-x0yrqmdh-jackin-thearchitect:refs/remotes/origin/jackin/scratch/jk-x0yrqmdh-jackin-thearchitect
git checkout -B jackin/scratch/jk-x0yrqmdh-jackin-thearchitect refs/remotes/origin/jackin/scratch/jk-x0yrqmdh-jackin-thearchitect

Static checks

cargo fmt --check
cargo clippy --all-targets --all-features -- -D warnings

Tests

cargo nextest run --all-features

User smoke

cargo run --bin jackin -- console --debug

Launch a role, enter an agent session, then exit the agent cleanly (e.g. /exit in Claude Code). Within 1–2 seconds, docker ps should show no containers for the role — the role container, DinD sidecar, certs volume, and network are all removed automatically.

To confirm detach still works: use Ctrl-B D to detach from tmux instead of exiting — the container should remain running and be reconnectable via jackin hardline.

Documentation

cd docs
bun install --frozen-lockfile
bun run dev

Astro serves at http://localhost:4321/. Pages to walk:

http://localhost:4321/reference/roadmap/jackin-container-binary/
Updated Phase 1 section now describes the tmux list-sessions polling approach, the stale-socket fix, and the companion finalize.rs/launch.rs fixes that make teardown actually fire.

http://localhost:4321/reference/roadmap/
Confirm jackin-container appears in the Partially implemented section.

Stop containers automatically when all agent sessions exit, eliminating the need for `jackin eject` after a clean session. - supervisor.sh now monitors /tmp/tmux-$(id -u)/default with a 1-second poll loop; exits 0 when the socket disappears (last session ended), exits 1 if no socket appears within a 60-second startup grace period - launch.rs: when docker exec returns with the container still Running, call inspect_agent_sessions; if no sessions remain treat it as Stopped/0 and run the full DinD/network/certs teardown path immediately - Roadmap: mark jackin-container-binary as Partially implemented; update Phase 1 to describe the bash approach and defer the Rust binary to Phase 2 Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

Scope the roadmap item to its long-term goal: jackin-container evolves from bash supervisor → Rust PID 1 → in-container multiplexer server that replaces tmux entirely. Phase 3 adds PTY session management, agent-state inference (working / blocked / done / idle via PTY output), and a full session control API (session.create, session.kill, session.title, session.attach, events). Phase 4 connects the multiplexer to the daemon and desktop app. References Herdr (herdr.dev) as design prior art for agent-state inference and Unix socket semantics. No Herdr source reused (AGPL-3.0 conflict). Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

… item Consolidate herdr-research.mdx into jackin-container-binary.mdx — the research informs the multiplexer design directly, so a separate page adds navigation overhead without clarity benefit. The merged item now covers: - Current state: tmux-based minimal approach, why it's temporary - Research motivation: Herdr validated agent-aware multiplexer concept - What to implement independently (two-stage done/idle, PTY inference, workspace roll-up, notification suppression, socket wait semantics) - What not to borrow (AGPL-3.0 license conflict, architecture mismatch) - Full 4-phase implementation plan through desktop app bridge Move sidebar entry from Infrastructure to Reactive daemon program; update agent-orchestrator-research to point to the combined item. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

Three match arms (Running+no-sessions, Stopped/0, NotFound) shared the same write_instance_status(CleanExited)+cleanup.run() pattern with an inline Preserved guard. Extract into a named helper to remove duplication and make the shared semantics explicit. Also name the sessions-empty condition (`no_sessions`) for clarity, and drop the now-unnecessary `#[allow(clippy::match_same_arms)]` annotation (clippy does not flag the Stopped/0 and NotFound arms as same since their patterns are semantically distinct). Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

…improve comments Three review findings addressed: - Running+no-sessions+Preserved: guard with !is_preserved so container and DinD stay alive for `jackin hardline` reconnect. Previously run_clean_exit_teardown fired unconditionally, removing the container before the operator could address preserved isolation worktrees. - AgentSessionInventory::Unavailable: emit debug_log with the reason string instead of silently treating it as sessions-present. - supervisor.sh exit-1: write a diagnostic to stderr naming the missing socket path before exiting so diagnose_premature_exit log capture includes supervisor context. Also fix comment accuracy: "once Phase 2 justifies overhead" → "will be removed in Phase 2"; remove ≤1 s polling-specific bound from match classifier comment; add NotFound/InspectUnavailable arms to the comment; tighten run_clean_exit_teardown doc comment. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

The helper was taking FinalizeDecision and re-deriving the Preserved check internally, duplicating the matches! expression already computed as `is_preserved` at the call site. Passing the bool removes the redundancy and removes run_clean_exit_teardown's dependency on FinalizeDecision. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

The NotFound+Preserved guard arm was silently empty — a container removed externally while finalization was still running left a stale preserved-status record on disk with zero diagnostic trace. Add a debug_log so --debug output captures the anomaly. Also fix the match classifier comment: "NotFound → Treat as Stopped/0" was wrong for the Preserved case. Split into two bullets making the distinct behaviors explicit. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

`[ -S socket ]` only checks if the socket file exists. When tmux crashes or is killed without cleanup, it leaves a stale socket on disk — the file-existence check returns true forever and the supervisor never exits. Replace the monitoring loop condition with `tmux list-sessions &>/dev/null` which probes the server itself: if the socket is stale or the server has exited, the command fails and the supervisor exits 0, triggering host-side cleanup. The grace-period check (waiting for the socket file to appear) keeps the plain file test — at that stage the server hasn't started yet so list-sessions would always fail. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

When the docker exec returns but the container is still Running, the old code immediately returned Preserved, treating it the same as a detach. This was incorrect for the common case where the agent exited cleanly: the container stays Running for up to 1 s while the supervisor's poll loop detects the tmux server exit and stops the container. The symptom: DinD sidecars and networks left running after the agent exits cleanly because cleanup was disarmed (is_preserved=true). Fix: when exit_code=None and not OOM-killed, probe the container for live tmux sessions before deciding: - sessions present → real detach (Ctrl-B D), preserve as before - no sessions → supervisor lag after clean agent exit, fall through to finalize_clean_exit so isolation worktrees are swept and the container/DinD/network are torn down Add has_tmux_sessions() helper that runs the same docker exec command as inspect_agent_sessions but returns a plain bool. Returns false on exec failure (container stopping) so the clean-exit path fires rather than silently preserving a terminating container. Update tests: - still_running_preserves_records → split into two tests: one with a session name in the queue (real detach → Preserved) and one with an empty response (supervisor lag → Cleaned) - load_agent_writes_instance_manifest: expected status changes from restore_available to clean_exited, reflecting the correct post-exit teardown when no sessions remain Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

Previously DinD was kept alive in two cases to support jackin hardline: - Stopped with non-zero exit (crash) - Running + no sessions + Preserved (dirty worktrees) Neither case is necessary: isolation worktrees live on the host filesystem and are accessible without DinD. Keeping DinD alive only created orphaned sidecars when the operator exited the agent. New rule: tear down DinD/network whenever the container is no longer running with active sessions. The only case that keeps DinD alive is Running + sessions present (real detach via Ctrl-B D — operator can reconnect via jackin hardline). Changes: - Running + no sessions: drop the `&& !is_preserved` guard so teardown fires regardless of preserved isolation state - Stopped (non-zero / OOM): replace cleanup.disarm() with cleanup.run() - NotFound + Preserved: add cleanup.run() after the anomaly log Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

When finalize_foreground_session returns Cleaned (!is_preserved), it already confirmed no tmux sessions exist via has_tmux_sessions. If inspect_container_state then shows Running (supervisor lag), calling inspect_agent_sessions immediately repeats the identical docker exec against a container where sessions are already known to be absent. Short-circuit: when !is_preserved in the Running branch, run teardown directly. Only re-query when is_preserved (finalize saw sessions at check-time and needs to detect sessions that may have ended in the interval before this inspect fired). Also reword the Unavailable debug message from "treating as sessions-present" to "treating conservatively as sessions-present" to clarify that the outcome is a fallback, not a positive detection. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

The Phase 1 "What shipped" section described socket-file-disappear monitoring, but the supervisor now polls tmux list-sessions instead (more robust to stale sockets). Also missing were the two companion Rust fixes that make teardown actually fire: - finalize.rs: has_tmux_sessions check to distinguish supervisor lag from a real detach - launch.rs: always tear down DinD on exit; skip redundant session re-query on the clean-exit path Update the Problem section's first bullet to note it is resolved in Phase 1. Update Phase 1 status and heading to reflect the full scope. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

- has_tmux_sessions: map_or(false, ...) → is_ok_and (clippy::unnecessary-map-or) - Running branch: if !is_preserved { A } else { B } → if is_preserved { B } else { A } (clippy::if-not-else) Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

donbeave and others added 13 commits May 20, 2026 01:50

style: apply rustfmt to launch.rs

4b484fe

Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

style: apply rustfmt to finalize.rs

b6b92e5

Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

donbeave force-pushed the jackin/scratch/jk-x0yrqmdh-jackin-thearchitect branch from 47eccc9 to b6b92e5 Compare May 19, 2026 18:50

donbeave and others added 2 commits May 20, 2026 01:53

donbeave merged commit 0d4498d into main May 19, 2026
18 checks passed

donbeave deleted the jackin/scratch/jk-x0yrqmdh-jackin-thearchitect branch May 19, 2026 19:01

donbeave mentioned this pull request May 19, 2026

fix(caffeinate): log PID before kill; docs: update container-session-lifecycle #392

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(supervisor): exit container when last tmux session ends#388

fix(supervisor): exit container when last tmux session ends#388
donbeave merged 15 commits into
mainfrom
jackin/scratch/jk-x0yrqmdh-jackin-thearchitect

donbeave commented May 18, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

donbeave commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Hard rule: never mutate the host machine silently

What's deferred (follow-up PRs)

Verify locally

Checkout

Static checks

Tests

User smoke

Documentation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

donbeave commented May 18, 2026 •

edited

Loading