Skip to content

fix(supervisor): exit container when last tmux session ends#388

Merged
donbeave merged 15 commits into
mainfrom
jackin/scratch/jk-x0yrqmdh-jackin-thearchitect
May 19, 2026
Merged

fix(supervisor): exit container when last tmux session ends#388
donbeave merged 15 commits into
mainfrom
jackin/scratch/jk-x0yrqmdh-jackin-thearchitect

Conversation

@donbeave
Copy link
Copy Markdown
Member

@donbeave donbeave commented May 18, 2026

Summary

Containers now stop automatically when the last agent session exits, eliminating the manual jackin eject step. Three fixes work together: the bash supervisor (supervisor.sh) polls tmux list-sessions rather than watching the socket file disappear, which makes it robust to stale sockets left when tmux crashes; finalize_foreground_session in finalize.rs now probes for live sessions when the container is still Running after docker exec returns so it treats an empty session list as supervisor lag (not a detach) and falls through to finalize_clean_exit; and launch.rs now always tears down the DinD sidecar and Docker network when the container has exited — crash exits included — rather than preserving them for a jackin hardline reconnect that cannot work without a live DinD. The only case that keeps DinD alive is a real detach: container still Running AND sessions confirmed present. The jackin-container-binary roadmap item is also expanded to its full multiplexer server vision (Phases 2–4) and the Herdr research item is merged into it.

Hard rule: never mutate the host machine silently

All changes are inside the container supervisor script and the host-side Rust cleanup logic. No new writes to host files, git config, remotes, or any other host-side state.

What's deferred (follow-up PRs)

  • jackin hardline reconnect path and spawn_agent_session path do not auto-cleanup DinD/network after clean container exit — those call sites have no LoadCleanup equivalent
  • Rust jackin-container binary with inotify-based socket watching (roadmap Phase 2)
  • In-container PTY multiplexer replacing tmux (roadmap Phase 3)

Verify locally

Checkout

Paste this first to bypass the tirith paste scanner for the rest of the session:

export TIRITH=0

Then paste the checkout block:

mkdir -p "$HOME/Projects/jackin-project/test"
cd "$HOME/Projects/jackin-project/test"

if [ ! -d jackin/.git ]; then
  git clone https://github.com/jackin-project/jackin.git
fi

cd jackin
mise trust
git fetch -f origin jackin/scratch/jk-x0yrqmdh-jackin-thearchitect:refs/remotes/origin/jackin/scratch/jk-x0yrqmdh-jackin-thearchitect
git checkout -B jackin/scratch/jk-x0yrqmdh-jackin-thearchitect refs/remotes/origin/jackin/scratch/jk-x0yrqmdh-jackin-thearchitect

Static checks

cargo fmt --check
cargo clippy --all-targets --all-features -- -D warnings

Tests

cargo nextest run --all-features

User smoke

cargo run --bin jackin -- console --debug

Launch a role, enter an agent session, then exit the agent cleanly (e.g. /exit in Claude Code). Within 1–2 seconds, docker ps should show no containers for the role — the role container, DinD sidecar, certs volume, and network are all removed automatically.

To confirm detach still works: use Ctrl-B D to detach from tmux instead of exiting — the container should remain running and be reconnectable via jackin hardline.

Documentation

cd docs
bun install --frozen-lockfile
bun run dev

Astro serves at http://localhost:4321/. Pages to walk:

http://localhost:4321/reference/roadmap/jackin-container-binary/
Updated Phase 1 section now describes the tmux list-sessions polling approach, the stale-socket fix, and the companion finalize.rs/launch.rs fixes that make teardown actually fire.

http://localhost:4321/reference/roadmap/
Confirm jackin-container appears in the Partially implemented section.

donbeave and others added 13 commits May 20, 2026 01:50
Stop containers automatically when all agent sessions exit, eliminating
the need for `jackin eject` after a clean session.

- supervisor.sh now monitors /tmp/tmux-$(id -u)/default with a 1-second
  poll loop; exits 0 when the socket disappears (last session ended),
  exits 1 if no socket appears within a 60-second startup grace period
- launch.rs: when docker exec returns with the container still Running,
  call inspect_agent_sessions; if no sessions remain treat it as Stopped/0
  and run the full DinD/network/certs teardown path immediately
- Roadmap: mark jackin-container-binary as Partially implemented; update
  Phase 1 to describe the bash approach and defer the Rust binary to Phase 2

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
Scope the roadmap item to its long-term goal: jackin-container evolves
from bash supervisor → Rust PID 1 → in-container multiplexer server that
replaces tmux entirely.

Phase 3 adds PTY session management, agent-state inference (working /
blocked / done / idle via PTY output), and a full session control API
(session.create, session.kill, session.title, session.attach, events).
Phase 4 connects the multiplexer to the daemon and desktop app.

References Herdr (herdr.dev) as design prior art for agent-state inference
and Unix socket semantics. No Herdr source reused (AGPL-3.0 conflict).

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
… item

Consolidate herdr-research.mdx into jackin-container-binary.mdx — the
research informs the multiplexer design directly, so a separate page
adds navigation overhead without clarity benefit.

The merged item now covers:
- Current state: tmux-based minimal approach, why it's temporary
- Research motivation: Herdr validated agent-aware multiplexer concept
- What to implement independently (two-stage done/idle, PTY inference,
  workspace roll-up, notification suppression, socket wait semantics)
- What not to borrow (AGPL-3.0 license conflict, architecture mismatch)
- Full 4-phase implementation plan through desktop app bridge

Move sidebar entry from Infrastructure to Reactive daemon program;
update agent-orchestrator-research to point to the combined item.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
Three match arms (Running+no-sessions, Stopped/0, NotFound) shared the
same write_instance_status(CleanExited)+cleanup.run() pattern with an
inline Preserved guard. Extract into a named helper to remove duplication
and make the shared semantics explicit.

Also name the sessions-empty condition (`no_sessions`) for clarity, and
drop the now-unnecessary `#[allow(clippy::match_same_arms)]` annotation
(clippy does not flag the Stopped/0 and NotFound arms as same since their
patterns are semantically distinct).

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
…improve comments

Three review findings addressed:

- Running+no-sessions+Preserved: guard with !is_preserved so container
  and DinD stay alive for `jackin hardline` reconnect. Previously
  run_clean_exit_teardown fired unconditionally, removing the container
  before the operator could address preserved isolation worktrees.

- AgentSessionInventory::Unavailable: emit debug_log with the reason
  string instead of silently treating it as sessions-present.

- supervisor.sh exit-1: write a diagnostic to stderr naming the missing
  socket path before exiting so diagnose_premature_exit log capture
  includes supervisor context.

Also fix comment accuracy: "once Phase 2 justifies overhead" →
"will be removed in Phase 2"; remove ≤1 s polling-specific bound
from match classifier comment; add NotFound/InspectUnavailable arms
to the comment; tighten run_clean_exit_teardown doc comment.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
The helper was taking FinalizeDecision and re-deriving the Preserved
check internally, duplicating the matches! expression already computed
as `is_preserved` at the call site. Passing the bool removes the
redundancy and removes run_clean_exit_teardown's dependency on
FinalizeDecision.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
The NotFound+Preserved guard arm was silently empty — a container
removed externally while finalization was still running left a stale
preserved-status record on disk with zero diagnostic trace. Add a
debug_log so --debug output captures the anomaly.

Also fix the match classifier comment: "NotFound → Treat as Stopped/0"
was wrong for the Preserved case. Split into two bullets making the
distinct behaviors explicit.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
`[ -S socket ]` only checks if the socket file exists. When tmux crashes
or is killed without cleanup, it leaves a stale socket on disk — the
file-existence check returns true forever and the supervisor never exits.

Replace the monitoring loop condition with `tmux list-sessions &>/dev/null`
which probes the server itself: if the socket is stale or the server has
exited, the command fails and the supervisor exits 0, triggering host-side
cleanup.

The grace-period check (waiting for the socket file to appear) keeps the
plain file test — at that stage the server hasn't started yet so
list-sessions would always fail.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
When the docker exec returns but the container is still Running, the old
code immediately returned Preserved, treating it the same as a detach.
This was incorrect for the common case where the agent exited cleanly:
the container stays Running for up to 1 s while the supervisor's poll
loop detects the tmux server exit and stops the container.

The symptom: DinD sidecars and networks left running after the agent
exits cleanly because cleanup was disarmed (is_preserved=true).

Fix: when exit_code=None and not OOM-killed, probe the container for live
tmux sessions before deciding:
  - sessions present → real detach (Ctrl-B D), preserve as before
  - no sessions     → supervisor lag after clean agent exit, fall through
                      to finalize_clean_exit so isolation worktrees are
                      swept and the container/DinD/network are torn down

Add has_tmux_sessions() helper that runs the same docker exec command as
inspect_agent_sessions but returns a plain bool. Returns false on exec
failure (container stopping) so the clean-exit path fires rather than
silently preserving a terminating container.

Update tests:
  - still_running_preserves_records → split into two tests: one with a
    session name in the queue (real detach → Preserved) and one with an
    empty response (supervisor lag → Cleaned)
  - load_agent_writes_instance_manifest: expected status changes from
    restore_available to clean_exited, reflecting the correct post-exit
    teardown when no sessions remain

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
Previously DinD was kept alive in two cases to support jackin hardline:
  - Stopped with non-zero exit (crash)
  - Running + no sessions + Preserved (dirty worktrees)

Neither case is necessary: isolation worktrees live on the host filesystem
and are accessible without DinD. Keeping DinD alive only created orphaned
sidecars when the operator exited the agent.

New rule: tear down DinD/network whenever the container is no longer
running with active sessions. The only case that keeps DinD alive is
Running + sessions present (real detach via Ctrl-B D — operator can
reconnect via jackin hardline).

Changes:
  - Running + no sessions: drop the `&& !is_preserved` guard so teardown
    fires regardless of preserved isolation state
  - Stopped (non-zero / OOM): replace cleanup.disarm() with cleanup.run()
  - NotFound + Preserved: add cleanup.run() after the anomaly log

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
When finalize_foreground_session returns Cleaned (!is_preserved), it
already confirmed no tmux sessions exist via has_tmux_sessions. If
inspect_container_state then shows Running (supervisor lag), calling
inspect_agent_sessions immediately repeats the identical docker exec
against a container where sessions are already known to be absent.

Short-circuit: when !is_preserved in the Running branch, run teardown
directly. Only re-query when is_preserved (finalize saw sessions at
check-time and needs to detect sessions that may have ended in the
interval before this inspect fired).

Also reword the Unavailable debug message from "treating as
sessions-present" to "treating conservatively as sessions-present"
to clarify that the outcome is a fallback, not a positive detection.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
@donbeave donbeave force-pushed the jackin/scratch/jk-x0yrqmdh-jackin-thearchitect branch from 47eccc9 to b6b92e5 Compare May 19, 2026 18:50
donbeave and others added 2 commits May 20, 2026 01:53
The Phase 1 "What shipped" section described socket-file-disappear
monitoring, but the supervisor now polls tmux list-sessions instead (more
robust to stale sockets). Also missing were the two companion Rust fixes
that make teardown actually fire:
  - finalize.rs: has_tmux_sessions check to distinguish supervisor lag
    from a real detach
  - launch.rs: always tear down DinD on exit; skip redundant session
    re-query on the clean-exit path

Update the Problem section's first bullet to note it is resolved in
Phase 1. Update Phase 1 status and heading to reflect the full scope.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
- has_tmux_sessions: map_or(false, ...) → is_ok_and (clippy::unnecessary-map-or)
- Running branch: if !is_preserved { A } else { B } → if is_preserved { B } else { A }
  (clippy::if-not-else)

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
@donbeave donbeave merged commit 0d4498d into main May 19, 2026
18 checks passed
@donbeave donbeave deleted the jackin/scratch/jk-x0yrqmdh-jackin-thearchitect branch May 19, 2026 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant