Skip to content

Commit 55c7209

Browse files
authored
Simplify graceful worker shutdown (#71)
Close worker stdin as the sole graceful shutdown request. Remove interpreter shutdown snippets from worker-ready advertising and drop the server-to-worker sideband shutdown command. Add regression coverage and ADR documentation for the simplified lifecycle contract.
1 parent 2dbc6ac commit 55c7209

12 files changed

Lines changed: 206 additions & 116 deletions
Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
# ADR 0001: Use stdin close for graceful worker shutdown
2+
3+
## Status
4+
5+
Accepted
6+
7+
## Date
8+
9+
2026-05-19
10+
11+
## Context
12+
13+
The server previously allowed workers to advertise interpreter-specific
14+
graceful shutdown text through `worker_ready.graceful_shutdown.stdin`, such as
15+
R `quit("no")` or Python `exit()`. During reset or shutdown, the server could
16+
write that text to worker stdin as the graceful request.
17+
18+
That design had three problems:
19+
20+
- It created a race between shutdown text sent on stdin and `session_end`
21+
notifications sent on sideband.
22+
- It allowed user-level input operations, such as R `readline()` or Python
23+
`input()`, to consume the graceful shutdown request as ordinary user input.
24+
- It made the lifecycle contract harder to explain, reason about, and
25+
communicate to a model or human because graceful shutdown used a mixture of
26+
stdin code, sideband notification, and OS escalation.
27+
28+
## Decision
29+
30+
Graceful worker shutdown is requested only by closing the worker stdin
31+
transport. The server must not send interpreter shutdown code over stdin and
32+
must not add or use a sideband shutdown command to deliver graceful shutdown.
33+
34+
Workers must no longer advertise `worker_ready.graceful_shutdown.stdin`.
35+
Built-in workers do not advertise `quit("no")`, `exit()`, or equivalent
36+
shutdown text.
37+
38+
On reset, respawn, or server shutdown, the server closes worker stdin and waits
39+
for the worker to exit naturally within the existing shutdown timeout. If the
40+
worker is busy, ignores EOF, or otherwise does not exit, the existing
41+
SIGTERM/SIGKILL or platform-equivalent escalation remains the fallback.
42+
43+
Worker-to-server `session_end` remains a notification that the worker session is
44+
ending. It is not a shutdown request from the server to the worker.
45+
46+
## Consequences
47+
48+
- There is no race between stdin shutdown code and sideband shutdown delivery.
49+
- A blocked user prompt cannot receive interpreter shutdown text as an answer.
50+
- Idle interpreters can still exit normally on EOF and run normal finalization
51+
where supported.
52+
- Graceful finalization is best effort. User code can handle EOF and continue,
53+
or remain busy, in which case OS escalation terminates the worker.
54+
- The server/worker lifecycle is simpler: stdin carries user input and EOF,
55+
sideband carries facts observed by the worker, and process control remains
56+
server-owned.

docs/index.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,8 @@ checked-in execution plans without relying on stale notes.
1212
- `docs/debugging.md`: debug logs, `--debug-repl`, and wire tracing.
1313
- `docs/sandbox.md`: sandbox modes, writable roots, and Codex per-tool-call sandbox metadata.
1414
- `docs/worker_sideband_protocol.md`: server/worker IPC contract.
15+
- `docs/adr/0001-stdin-close-graceful-shutdown.md`: accepted decision that
16+
graceful worker shutdown is requested only by closing worker stdin.
1517
- `docs/plans/AGENTS.md`: when to write a checked-in execution plan and where it lives.
1618
- `docs/plans/completed/codex-sandbox-state-meta-migration.md`: completed plan for migrating Codex `--sandbox inherit` from async updates to per-tool-call sandbox metadata.
1719

@@ -23,6 +25,7 @@ checked-in execution plans without relying on stale notes.
2325
- `docs/tool-descriptions/repl_tool_python.md`: Python `repl` behavior for the files-mode oversized-output path.
2426
- `docs/tool-descriptions/repl_tool_python_pager.md`: Python `repl` behavior for pager mode.
2527
- `docs/tool-descriptions/repl_reset_tool.md`: `repl_reset` behavior.
28+
- `docs/adr/`: accepted architecture decision records.
2629
- `README.md`: user-facing overview and installation guide. Treat it as product documentation, not the engineering source of truth.
2730

2831
## Exploratory Docs

docs/plans/active/worker-server-protocol-zod.md

Lines changed: 14 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -243,9 +243,6 @@ The first worker-to-server message must be:
243243
"worker": { "name": "zod", "version": "0.1.0" },
244244
"capabilities": {
245245
"images": false
246-
},
247-
"graceful_shutdown": {
248-
"stdin": "quit()\n"
249246
}
250247
}
251248
```
@@ -259,20 +256,9 @@ interpreter-specific steady-state code paths.
259256
The `worker.name` field is descriptive. It is useful for logs and
260257
diagnostics, but server request handling must not switch on it.
261258

262-
`graceful_shutdown` is optional. If present, it contains the exact stdin text
263-
the server may write as a graceful shutdown/reset request when server-owned
264-
shutdown rules say a stdin request is appropriate. For example, R might ask for
265-
`quit("no")\n`.
266-
267259
Handshake field notes:
268260

269261
- `capabilities.images`: whether the worker emits `output_image` events.
270-
- `graceful_shutdown.stdin`: exact text the worker wants sent on stdin
271-
when the server asks it to exit gracefully. This is not base64-encoded
272-
because it is a small human-readable interpreter command. The worker
273-
controls only this text; it does not control whether the command is
274-
sent, when interrupts are sent, how long the server waits, or whether
275-
OS termination follows.
276262

277263
Raw stdout/stderr capture is not a negotiated capability. The server
278264
will always capture raw worker stdout/stderr as unowned fallback output,
@@ -520,18 +506,19 @@ delivering OS controls. If a worker exists, the server delivers the
520506
control. If no worker exists, the server must not spawn a worker solely
521507
to interrupt or shut down nothing.
522508

523-
Workers cannot opt out of OS controls. The handshake may describe a
524-
graceful stdin request for shutdown/reset, but OS escalation is common
525-
across all workers and remains server-owned.
509+
Workers cannot opt out of OS controls. Graceful shutdown is a server-owned
510+
stdin close followed by a bounded wait; OS escalation is common across all
511+
workers and remains server-owned.
526512

527513
For reset and shutdown, if the server considers the worker busy, it
528514
sends an interrupt first. This is not configurable by the worker. After
529515
the interrupt, the server waits for the normal bounded server-owned
530516
grace period for the worker to return to an unsatisfied `readline_start`
531-
or end the session. If the worker becomes idle and provided a
532-
`graceful_shutdown.stdin` command, the server may write that command to
533-
stdin. If the worker does not exit within the server-owned graceful
534-
shutdown timeout, the server escalates to OS termination.
517+
or end the session. The server then closes worker stdin and waits for
518+
natural exit. If the worker does not exit within the server-owned graceful
519+
shutdown timeout, the server escalates to OS termination. The server must not
520+
write interpreter shutdown code to stdin and must not use sideband shutdown
521+
commands to deliver graceful shutdown.
535522

536523
### Interrupt
537524

@@ -586,11 +573,9 @@ Reset means end the current worker session and start a fresh one for the
586573
next tool call.
587574

588575
If the worker is busy, the server interrupts first. If the worker is
589-
then known to be waiting at an unsatisfied `readline_start`, and the
590-
handshake provided a `graceful_shutdown.stdin` command, the server may
591-
write that command to stdin. If the worker does not exit within the
592-
server-owned timeout, the server escalates to OS termination. The worker
593-
controls only the graceful command text.
576+
then known to be waiting at an unsatisfied `readline_start`, the server closes
577+
worker stdin and waits for natural exit. If the worker does not exit within
578+
the server-owned timeout, the server escalates to OS termination.
594579

595580
If a leading reset prefix has a non-empty tail, the server writes the
596581
tail only after reset has completed and the replacement worker has
@@ -806,11 +791,10 @@ surface with Zod as the worker:
806791
- Interrupt completion is driven by worker events, not prompt parsing.
807792
- Worker-owned stderr preserves ordering relative to stdout.
808793
- Raw stdout/stderr never affects prompt or completion state.
809-
- Reset interrupts first when the worker is busy, then uses graceful
810-
stdin shutdown only after an unsatisfied `readline_start`, then
794+
- Reset interrupts first when the worker is busy, then closes worker stdin and
811795
escalates to OS termination if needed.
812-
- Shutdown follows the same interrupt-then-graceful-stdin-then-OS rule
813-
and cannot be opted out of.
796+
- Shutdown follows the same interrupt-then-stdin-close-then-OS rule and cannot
797+
be opted out of.
814798
- Protocol errors fail fast and do not leave the server in a shadow
815799
interpreter state.
816800

docs/worker_sideband_protocol.md

Lines changed: 6 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,11 @@ Runtime stdin transport is a launch-time worker setting, not a sideband
2626
negotiation. A worker may use ordinary pipes or a PTY for its C stdio, but the
2727
server still writes accepted request bytes to worker stdin and relies on
2828
sideband events for prompt, input, discard, output, and session facts.
29+
For graceful reset and shutdown, the server closes the worker stdin transport
30+
and then waits for normal worker exit before escalating to OS termination.
31+
Workers must not advertise interpreter-specific shutdown text, and the server
32+
does not send shutdown code or a sideband shutdown command. See
33+
`docs/adr/0001-stdin-close-graceful-shutdown.md`.
2934

3035
Built-in Unix Python uses PTY-backed C stdin/stdout/stderr so CPython calls
3136
`PyOS_ReadlineFunctionPointer`. The Python callback emits readline accounting
@@ -42,28 +47,18 @@ facts from that CPython path. Sideband IPC stays separate from the PTY.
4247
- The worker may emit `readline_discard` for exact active-turn stdin bytes it
4348
discarded before delivering them to the runtime.
4449

45-
`session_end`
46-
- `{ "type": "session_end" }`
47-
- Sent when the server is ending the current session, for example during reset
48-
or shutdown.
49-
- Worker treats this as shutdown intent and stops consuming further stdin
50-
payloads.
51-
5250
## Direction: worker -> server
5351

5452
Worker-to-server messages are strict: unknown fields, invalid enum values,
5553
invalid base64, and unknown message types are protocol errors.
5654

5755
`worker_ready`
58-
- `{ "type": "worker_ready", "protocol": { "name": "mcp-repl-worker", "version": 1 }, "worker": { "name": <string>, "version": <string> }, "capabilities": { "images": <bool> }, "graceful_shutdown": { "stdin": <string> } }`
56+
- `{ "type": "worker_ready", "protocol": { "name": "mcp-repl-worker", "version": 1 }, "worker": { "name": <string>, "version": <string> }, "capabilities": { "images": <bool> } }`
5957
- Must be the first worker-to-server message for protocol workers.
6058
- The server rejects unsupported protocol names or versions before sending user
6159
input.
6260
- `worker.name` is diagnostic metadata. Server request handling must not branch
6361
on it.
64-
- `graceful_shutdown` is optional. If present, the server may write that exact
65-
stdin text only when server-owned shutdown/reset rules decide a graceful
66-
stdin request is appropriate.
6762

6863
`readline_start`
6964
- `{ "type": "readline_start", "prompt": <string> }`

src/ipc.rs

Lines changed: 1 addition & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -102,7 +102,6 @@ pub enum ServerToWorkerIpcMessage {
102102
request_generation: u64,
103103
},
104104
Interrupt,
105-
SessionEnd,
106105
}
107106

108107
#[derive(Debug, Clone, Serialize, Deserialize)]
@@ -112,8 +111,6 @@ pub enum WorkerToServerIpcMessage {
112111
protocol: WorkerProtocol,
113112
worker: WorkerIdentity,
114113
capabilities: WorkerCapabilities,
115-
#[serde(default)]
116-
graceful_shutdown: Option<WorkerGracefulShutdown>,
117114
},
118115
BackendInfo {
119116
#[serde(default)]
@@ -182,12 +179,6 @@ pub struct WorkerCapabilities {
182179
pub images: bool,
183180
}
184181

185-
#[derive(Debug, Clone, Serialize, Deserialize)]
186-
#[serde(deny_unknown_fields)]
187-
pub struct WorkerGracefulShutdown {
188-
pub stdin: String,
189-
}
190-
191182
#[derive(Default)]
192183
struct ServerIpcInbox {
193184
queue: VecDeque<WorkerToServerIpcMessage>,
@@ -1750,11 +1741,7 @@ pub fn emit_plot_image(mime_type: &str, data: &str, is_update: bool, source: Opt
17501741
}
17511742
}
17521743

1753-
pub fn emit_worker_ready(
1754-
worker_name: &str,
1755-
supports_images: bool,
1756-
graceful_shutdown_stdin: Option<&str>,
1757-
) {
1744+
pub fn emit_worker_ready(worker_name: &str, supports_images: bool) {
17581745
if let Some(ipc) = global_ipc() {
17591746
let _ = ipc.send(WorkerToServerIpcMessage::WorkerReady {
17601747
protocol: WorkerProtocol {
@@ -1768,9 +1755,6 @@ pub fn emit_worker_ready(
17681755
capabilities: WorkerCapabilities {
17691756
images: supports_images,
17701757
},
1771-
graceful_shutdown: graceful_shutdown_stdin.map(|stdin| WorkerGracefulShutdown {
1772-
stdin: stdin.to_string(),
1773-
}),
17741758
});
17751759
}
17761760
}

src/python_session.rs

Lines changed: 1 addition & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -171,16 +171,6 @@ struct PythonRuntime {
171171
stdin: *mut libc::FILE,
172172
}
173173

174-
pub fn request_shutdown() -> bool {
175-
let Some(state) = SESSION_STATE.get() else {
176-
return false;
177-
};
178-
let mut guard = state.inner.lock().unwrap();
179-
guard.shutdown = true;
180-
state.cvar.notify_all();
181-
true
182-
}
183-
184174
fn request_exit() {
185175
let Some(state) = SESSION_STATE.get() else {
186176
return;
@@ -631,7 +621,7 @@ fn run_session_on_current_thread(init: Arc<SessionInit>) -> Result<(), String> {
631621
}
632622

633623
init.mark_ready();
634-
ipc::emit_worker_ready("python", plot_capable(), Some("exit()\n"));
624+
ipc::emit_worker_ready("python", plot_capable());
635625

636626
let result = run_repl(&runtime);
637627
let finalize_result = finalize_python(api, thread_state);

src/python_worker.rs

Lines changed: 0 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -11,14 +11,10 @@ use crate::python_session::{self, PythonSession};
1111

1212
struct WorkerState {
1313
busy: AtomicBool,
14-
shutting_down: AtomicBool,
1514
}
1615

1716
impl WorkerState {
1817
fn try_mark_busy(&self) -> bool {
19-
if self.is_shutting_down() {
20-
return false;
21-
}
2218
self.busy
2319
.compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
2420
.is_ok()
@@ -27,21 +23,12 @@ impl WorkerState {
2723
fn mark_idle(&self) {
2824
self.busy.store(false, Ordering::SeqCst);
2925
}
30-
31-
fn begin_shutdown(&self) {
32-
self.shutting_down.store(true, Ordering::SeqCst);
33-
}
34-
35-
fn is_shutting_down(&self) -> bool {
36-
self.shutting_down.load(Ordering::SeqCst)
37-
}
3826
}
3927

4028
impl Default for WorkerState {
4129
fn default() -> Self {
4230
Self {
4331
busy: AtomicBool::new(false),
44-
shutting_down: AtomicBool::new(false),
4532
}
4633
}
4734
}
@@ -128,10 +115,6 @@ fn init_ipc(
128115
python_session::interrupt_request_generation(request_generation);
129116
emit_python_interrupt_ack();
130117
}
131-
Some(ServerToWorkerIpcMessage::SessionEnd) => {
132-
state.begin_shutdown();
133-
let _ = python_session::request_shutdown();
134-
}
135118
None => {
136119
std::process::exit(0);
137120
}
@@ -163,10 +146,6 @@ fn handle_write_stdin(
163146
state: Arc<WorkerState>,
164147
request_tx: &mpsc::SyncSender<QueuedRequest>,
165148
) {
166-
if state.is_shutting_down() {
167-
return;
168-
}
169-
170149
if !state.try_mark_busy() {
171150
emit_stderr_message("worker is busy; request already running");
172151
return;

src/worker.rs

Lines changed: 3 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -47,11 +47,11 @@ pub fn run() -> Result<(), Box<dyn std::error::Error>> {
4747
fn run_r_worker() -> Result<(), Box<dyn std::error::Error>> {
4848
crate::diagnostics::startup_log("worker: run begin");
4949
let state = Arc::new(WorkerState::default());
50-
init_ipc(state.clone()).map_err(|err| {
50+
init_ipc().map_err(|err| {
5151
eprintln!("worker ipc init error: {err}");
5252
err
5353
})?;
54-
emit_worker_ready("r", true, Some("quit(\"no\")\n"));
54+
emit_worker_ready("r", true);
5555

5656
let stdin_state = state.clone();
5757
let _stdin_thread = thread::Builder::new()
@@ -82,7 +82,7 @@ fn wait_for_r_session() -> Result<&'static RSession, String> {
8282
}
8383
}
8484

85-
fn init_ipc(state: Arc<WorkerState>) -> Result<(), Box<dyn std::error::Error>> {
85+
fn init_ipc() -> Result<(), Box<dyn std::error::Error>> {
8686
let conn = connect_from_env(Duration::from_secs(2))?;
8787
set_global_ipc(conn.clone());
8888
if let Err(err) = thread::Builder::new()
@@ -100,11 +100,6 @@ fn init_ipc(state: Arc<WorkerState>) -> Result<(), Box<dyn std::error::Error>> {
100100
Some(ServerToWorkerIpcMessage::PythonInterrupt { .. }) => {
101101
crate::r_session::clear_pending_input();
102102
}
103-
Some(ServerToWorkerIpcMessage::SessionEnd) => {
104-
state.begin_shutdown();
105-
crate::r_session::clear_pending_input();
106-
let _ = crate::r_session::request_shutdown();
107-
}
108103
None => {
109104
// Without IPC, the worker cannot participate in turn accounting (prompt,
110105
// request boundaries, etc). Exit immediately so the server can respawn.

0 commit comments

Comments
 (0)