Commit 7c85571

Add job cancellation support to the debugger command relay (#1262)
## Summary

- Add a `cancel` subcommand to the client that terminates the currently running command on the server
- Server now runs commands in the background with PID tracking, enabling cancellation mid-execution
- Client-side timeouts automatically cancel the running command on the server (previously the server process was left running)
- Hardened against race conditions through 4 rounds of adversarial review (15 fixes total)

### Key changes

**server.sh:**

- Commands run in background with PID tracked in `$RELAY_DIR/running` (atomic tmp+mv write)
- Cancel detection loop checks for `$RELAY_DIR/cancel` file with cmd_id verification
- SIGTERM with 5s grace period, then SIGKILL escalation for stuck processes
- `.exit` file written before `running` marker removed (ordering guarantee)
- `set -e`-safe: `wait` uses `|| exit_code=$?` pattern; cleanup trap fully guarded
- Stale cancel files cleared at command start; mismatched/empty signals rejected
- Command file read into memory and removed before execution (eliminates TOCTOU with client timeout)

**client.sh:**

- New `cancel` subcommand: writes target cmd_id to cancel file, waits for server acknowledgment (30s timeout)
- `run` timeout now sends targeted cancel signal (verifies cmd_id match to avoid killing wrong command)
- `run` timeout cleans up orphaned result files
- `status` shows currently running command
- `flush` rejects if a command is currently running (prevents state corruption)
- Exit code validated as numeric before use

### Protocol additions

```
.relay/
├── running    # server writes cmd_id:pid while executing (atomic)
├── cancel     # client writes target cmd_id to request cancellation
```

## Test plan

- [ ] Start server in Docker, handshake from host
- [ ] Run a command (`client.sh run "sleep 30"`), cancel it (`client.sh cancel`), verify exit code 130
- [ ] Run a command with short timeout (`--timeout 5 run "sleep 30"`), verify auto-cancel
- [ ] Run a command that exits non-zero, verify server stays alive
- [ ] Run `status` during execution, verify it shows the running command
- [ ] Attempt `flush` during execution, verify it is rejected
- [ ] Cancel when nothing is running, verify clean message

🤖 Generated with [Claude Code](https://claude.com/claude-code)

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A (bash scripts for dev tooling, tested manually)
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A (internal tooling)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Documentation**
  * Added a comprehensive debug skill guide and protocol reference with quick CLI examples and a new “Cancelling Commands” section.
* **New Features**
  * Client-side `cancel` command to terminate the currently running remote command.
  * Status now reports active command (or `(idle)`).
* **Improvements**
  * Stronger startup/validation guidance, safer shutdown/cleanup, deterministic cancel exit semantics (130), and auto-cancel on client-side timeout.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
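
The SIGTERM-then-SIGKILL escalation and the `set -e`-safe `wait` listed under **server.sh** can be sketched as below. `server.sh` itself is not shown in this view, so this is only an illustration under assumptions: the helper name `terminate_with_grace`, the one-second polling step, and the way the final code is recorded are all hypothetical, not the committed implementation.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: send SIGTERM, poll for up to <grace> seconds,
# then escalate to SIGKILL if the process is still alive.
terminate_with_grace() {
    local pid="$1" grace="${2:-5}" waited=0
    kill -TERM "$pid" 2>/dev/null || return 0   # process already gone
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$grace" ]; then
            kill -KILL "$pid" 2>/dev/null || true
            break
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Stand-in for a relayed command running in the background.
sleep 30 &
child_pid=$!

terminate_with_grace "$child_pid" 5

# "|| raw_code=$?" keeps a non-zero wait status from tripping "set -e".
raw_code=0
wait "$child_pid" || raw_code=$?

# The raw status reflects the signal (128 + signal number); the summary
# above says cancelled commands are reported as 130, so a server would
# record that fixed value rather than the raw one.
recorded_code=130
echo "raw=$raw_code recorded=$recorded_code"
```

The fixed `recorded_code` mirrors the "deterministic cancel exit semantics (130)" mentioned in the release notes: without it, the reported code would depend on whether SIGTERM or SIGKILL ended the process.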
1 parent 952a62b commit 7c85571

5 files changed

Lines changed: 256 additions & 22 deletions

.claude/skills/debug/SKILL.md

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+---
+name: debug
+description: Run commands inside a remote Docker container via the file-based command relay (tools/debugger). Use when the user says "run in Docker", "run on GPU", "debug remotely", "run test in container", "check nvidia-smi", "run pytest in Docker", or needs to execute any command inside a Docker container that shares the repo filesystem. Requires the user to have started server.sh inside the container first.
+---
+
+# Remote Docker Debugger
+
+Execute commands inside a Docker container from the host using the file-based command relay.
+
+**Read `tools/debugger/CLAUDE.md` for full usage details** — it has the protocol and examples.
+
+## Quick Reference
+
+```bash
+# Check connection
+bash tools/debugger/client.sh status
+
+# Connect to server (user must start server.sh in Docker first)
+bash tools/debugger/client.sh handshake
+
+# Run a command
+bash tools/debugger/client.sh run "<command>"
+
+# Long-running command (default timeout is 600s)
+bash tools/debugger/client.sh --timeout 1800 run "<command>"
+
+# Cancel the currently running command
+bash tools/debugger/client.sh cancel
+
+# Reconnect after server restart
+bash tools/debugger/client.sh flush
+bash tools/debugger/client.sh handshake
+```

tools/debugger/CLAUDE.md

Lines changed: 10 additions & 0 deletions
@@ -53,10 +53,20 @@ bash tools/debugger/client.sh run "nvidia-smi"
 bash tools/debugger/client.sh run "python script.py --model /hf-local/Qwen/Qwen3-8B"
 ```
 
+### Cancelling Commands
+
+```bash
+# Cancel the currently running command
+bash tools/debugger/client.sh cancel
+
+# Client-side timeout also auto-cancels the running command
+```
+
 ### Important Notes
 
 - The server must be started by the user manually inside Docker before the handshake.
 - Default command timeout is 600 seconds (10 minutes). Use `--timeout` for longer tasks.
 - Commands execute sequentially — one at a time.
+- A running command can be cancelled; cancelled commands exit with code 130.
 - All commands run with the auto-detected repo root as the working directory.
 - The `.relay/` directory is ephemeral and git-ignored.
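
Since cancelled commands exit with code 130, a calling script can tell a cancellation apart from an ordinary failure. The following is a minimal sketch, assuming `client.sh run` passes the remote exit code through as its own exit status (the part of the script that would do so is outside the hunks shown here). `classify_exit` is a hypothetical helper, not part of the tool.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: map a relay exit code to a short label.
classify_exit() {
    case "$1" in
        0)   echo "success" ;;
        130) echo "cancelled" ;;
        *)   echo "failed" ;;
    esac
}

# In real use the code would come from something like:
#   code=0; bash tools/debugger/client.sh run "<command>" || code=$?
# A fixed value is used here so the sketch runs on its own.
code=130
echo "exit $code: $(classify_exit "$code")"
```

Two caveats apply. By shell convention 130 is also the status of a process killed by SIGINT (128 + 2), so a command interrupted for its own reasons looks the same. And when the client's own timeout triggers the cancel, the `run` subcommand in this diff exits with 1, not 130.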

tools/debugger/README.md

Lines changed: 19 additions & 4 deletions
@@ -52,6 +52,9 @@ bash tools/debugger/client.sh run "bash llm_ptq/scripts/huggingface_example.sh"
 # Run with a long timeout (default is 600s)
 bash tools/debugger/client.sh --timeout 1800 run "python my_long_test.py"
 
+# Cancel a running command
+bash tools/debugger/client.sh cancel
+
 # Check status
 bash tools/debugger/client.sh status
 ```
@@ -65,6 +68,8 @@ The relay uses a directory at `tools/debugger/.relay/` with this structure:
 ├── server.ready       # Written by server on startup
 ├── client.ready       # Written by client during handshake
 ├── handshake.done     # Written by server to confirm handshake
+├── running            # Written by server while a command is executing (cmd_id:pid)
+├── cancel             # Written by client to request cancellation of the running command
 ├── cmd/               # Client writes command .sh files here
 │   └── <id>.sh        # Command to execute
 └── result/            # Server writes results here
@@ -81,10 +86,18 @@ The relay uses a directory at `tools/debugger/.relay/` with this structure:
 
 ### Command Execution
 
-1. Client writes a command to `.relay/cmd/<timestamp>.sh`
-2. Server detects the file, runs `bash <file>` in the workdir, captures output
-3. Server writes `.relay/result/<timestamp>.log` and `.relay/result/<timestamp>.exit`
-4. Server removes the `.sh` file; client reads results and cleans up
+1. Client writes a command to `.relay/cmd/<id>.sh`
+2. Server detects the file, reads the command content, and removes the `.sh` file
+3. Server runs `bash -c <content>` in a new process group, writes `.relay/running`
+4. Server writes `.relay/result/<id>.exit` and `.relay/result/<id>.log`, then removes `.relay/running`
+5. Client reads results and cleans up
+
+### Cancellation
+
+1. Client writes the target `cmd_id` to `.relay/cancel`
+2. Server verifies the `cmd_id` matches, then kills the command's process group
+3. Server writes exit code 130 and removes `.relay/running` and `.relay/cancel`
+4. Client-side timeout also triggers cancellation automatically
 
 ## Options
 
@@ -107,3 +120,5 @@ The relay uses a directory at `tools/debugger/.relay/` with this structure:
 - The `.relay/` directory is in `.gitignore` — it is not checked in.
 - Only one server should run at a time (startup clears the relay directory).
 - Commands run sequentially in the order the server discovers them.
+- A running command can be cancelled via `client.sh cancel`. Cancelled commands exit with code 130.
+- Client-side timeouts automatically cancel the running command on the server.
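
The two marker files in the protocol rely on the same pattern: write to a temporary name, rename into place, and let the reader split `cmd_id:pid` on the colon. A small sketch of that pattern follows, run against a throwaway directory instead of the real `.relay/`. The helper `write_atomic` and the sample ID and PID values are made up for illustration.

```shell
#!/usr/bin/env bash
set -eu

relay_dir="$(mktemp -d)"   # throwaway stand-in for tools/debugger/.relay

# Hypothetical helper: write to a temp name, then rename into place, so a
# polling reader never sees a half-written file (rename within one
# filesystem is atomic).
write_atomic() {
    local path="$1" content="$2"
    printf '%s\n' "$content" > "$path.tmp"
    mv "$path.tmp" "$path"
}

# Server side: publish the running marker as cmd_id:pid.
write_atomic "$relay_dir/running" "1700000000000000000_4242:31337"

# Client side: read the marker and split it on the colon.
running_info="$(cat "$relay_dir/running")"
running_id="${running_info%%:*}"    # everything before the first colon
running_pid="${running_info##*:}"   # everything after the last colon
echo "id=$running_id pid=$running_pid"

# Client side: request cancellation of exactly that command.
write_atomic "$relay_dir/cancel" "$running_id"
cancel_target="$(cat "$relay_dir/cancel")"

rm -rf "$relay_dir"
```

Putting the command ID inside the cancel file, instead of using an empty flag file, is what lets the server ignore a request that was meant for a command that has already finished.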

tools/debugger/client.sh

Lines changed: 88 additions & 1 deletion
@@ -21,6 +21,7 @@
 # Usage:
 #   bash client.sh handshake         - Connect to server
 #   bash client.sh run <command...>  - Run a command and print output
+#   bash client.sh cancel            - Cancel the running command
 #   bash client.sh status            - Check server status
 #
 # Options:
@@ -54,6 +55,22 @@ RESULT_DIR="$RELAY_DIR/result"
 SUBCOMMAND="${1:-}"
 shift || true
 
+# Helper: wait for a specific command to finish (running marker gone or cmd_id changed)
+wait_for_cancel_completion() {
+    local target_id="$1" wait_timeout="$2" elapsed=0 current_info current_id
+    while [[ $elapsed -lt $wait_timeout ]]; do
+        if [[ ! -f "$RELAY_DIR/running" ]]; then
+            return 0
+        fi
+        current_info=$(cat "$RELAY_DIR/running" 2>/dev/null) || true
+        current_id="${current_info%%:*}"
+        [[ "$current_id" != "$target_id" ]] && return 0
+        sleep "$POLL_INTERVAL"
+        elapsed=$((elapsed + POLL_INTERVAL))
+    done
+    return 1
+}
+
 case "$SUBCOMMAND" in
 handshake)
     # Check server is ready
@@ -91,6 +108,8 @@ case "$SUBCOMMAND" in
     # Generate a unique command ID (timestamp + PID to avoid collisions)
     cmd_id="$(date +%s%N)_$$"
 
+    echo "[client] Running: $*"
+
     # Write the command file atomically (tmp + mv)
     echo "$*" > "$CMD_DIR/$cmd_id.sh.tmp"
     mv "$CMD_DIR/$cmd_id.sh.tmp" "$CMD_DIR/$cmd_id.sh"
@@ -107,15 +126,35 @@ case "$SUBCOMMAND" in
         sleep "$POLL_INTERVAL"
         elapsed=$((elapsed + POLL_INTERVAL))
         if [[ $elapsed -ge $TIMEOUT ]]; then
+            # Result might have arrived during the last sleep
+            [[ -f "$RESULT_DIR/$cmd_id.exit" ]] && break
             echo "ERROR: Command timed out after ${TIMEOUT}s."
-            # Clean up the pending command
+            # Cancel the running command only if it is OUR command
+            if [[ -f "$RELAY_DIR/running" ]]; then
+                running_info=$(cat "$RELAY_DIR/running" 2>/dev/null) || true
+                if [[ -n "$running_info" && "$running_info" == *:* ]]; then
+                    running_id="${running_info%%:*}"
+                    if [[ "$running_id" == "$cmd_id" ]]; then
+                        echo "Sending cancel signal..."
+                        echo "$cmd_id" > "$RELAY_DIR/cancel.tmp"
+                        mv "$RELAY_DIR/cancel.tmp" "$RELAY_DIR/cancel"
+                        wait_for_cancel_completion "$cmd_id" 10 || true
+                    fi
+                fi
+            fi
+            # Clean up command and any orphaned result files
             rm -f "$CMD_DIR/$cmd_id.sh"
+            rm -f "$RESULT_DIR/$cmd_id.exit" "$RESULT_DIR/$cmd_id.log"
             exit 1
         fi
     done
 
     # Read and display results
     exit_code=$(cat "$RESULT_DIR/$cmd_id.exit")
+    if ! [[ "$exit_code" =~ ^[0-9]+$ ]]; then
+        echo "WARNING: Invalid exit code '$exit_code', defaulting to 1."
+        exit_code=1
+    fi
     if [[ -f "$RESULT_DIR/$cmd_id.log" ]]; then
         cat "$RESULT_DIR/$cmd_id.log"
     fi
@@ -139,6 +178,12 @@ case "$SUBCOMMAND" in
     else
         echo "Handshake: not started"
     fi
+    if [[ -f "$RELAY_DIR/running" ]]; then
+        running_info=$(cat "$RELAY_DIR/running" 2>/dev/null) || running_info="(disappeared)"
+        echo "Running: $running_info"
+    else
+        echo "Running: (idle)"
+    fi
     if [[ -d "$CMD_DIR" ]]; then
         pending=$(find "$CMD_DIR" -maxdepth 1 -type f -name '*.sh' 2>/dev/null | wc -l)
     else
@@ -148,6 +193,12 @@ case "$SUBCOMMAND" in
     ;;
 
 flush)
+    # Block flush if a command is actually running (server alive + running marker)
+    # Allow flush if server is dead (stale running marker from crash)
+    if [[ -f "$RELAY_DIR/running" ]] && [[ -f "$RELAY_DIR/server.ready" ]]; then
+        echo "ERROR: A command is currently running. Cancel it first or wait for it to finish."
+        exit 1
+    fi
     if [[ -d "$RELAY_DIR" ]]; then
         # Clear handshake and command/result files, but keep server.ready
         rm -f "$RELAY_DIR/client.ready" "$RELAY_DIR/handshake.done"
@@ -159,12 +210,48 @@ case "$SUBCOMMAND" in
     fi
     ;;
 
+cancel)
+    # Check if there's a running command
+    if [[ -f "$RELAY_DIR/running" ]]; then
+        running_info=$(cat "$RELAY_DIR/running" 2>/dev/null) || true
+        if [[ -z "$running_info" || "$running_info" != *:* ]]; then
+            echo "WARNING: Running marker is corrupt or empty. Cannot identify command to cancel."
+            exit 1
+        fi
+        running_id="${running_info%%:*}"
+        echo "Cancelling running command: $running_id"
+
+        # Write cancel signal atomically with cmd_id so server can verify the target
+        echo "$running_id" > "$RELAY_DIR/cancel.tmp"
+        mv "$RELAY_DIR/cancel.tmp" "$RELAY_DIR/cancel"
+
+        # Wait for the server to process the cancellation
+        if wait_for_cancel_completion "$running_id" 30; then
+            echo "Command cancelled."
+        else
+            echo "WARNING: Cancel signal sent but command still running after 30s."
+            exit 1
+        fi
+    else
+        echo "No command is currently running."
+    fi
+
+    # Report pending commands
+    if [[ -d "$CMD_DIR" ]]; then
+        pending=$(find "$CMD_DIR" -maxdepth 1 -type f -name '*.sh' 2>/dev/null | wc -l)
+        if [[ "$pending" -gt 0 ]]; then
+            echo "$pending pending command(s) in queue. Use 'flush' to clear them."
+        fi
+    fi
+    ;;
+
 *)
     echo "Usage: $0 [--relay-dir <path>] [--timeout <secs>] <subcommand>"
     echo ""
     echo "Subcommands:"
     echo "  handshake    Connect to the server"
     echo "  run <cmd>    Execute a command on the server"
+    echo "  cancel       Cancel the currently running command"
     echo "  status       Check connection status"
     echo "  flush        Clear the relay directory"
     exit 1
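
The client writes the target `cmd_id` into the cancel file so that the server can check it before killing anything. The server half of that check is not part of the hunks shown here, so the following is only a sketch of how it could look. The helper name `cancel_requested_for`, and the choice to delete a mismatched or empty request, are assumptions and not taken from `server.sh`.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: succeed only when the cancel file names the command
# that is currently running. Empty or mismatched requests are discarded
# (discarding them is an assumption of this sketch).
cancel_requested_for() {
    local relay_dir="$1" current_id="$2" target
    [ -f "$relay_dir/cancel" ] || return 1
    target="$(cat "$relay_dir/cancel" 2>/dev/null)" || target=""
    if [ -n "$target" ] && [ "$target" = "$current_id" ]; then
        return 0
    fi
    rm -f "$relay_dir/cancel"
    return 1
}

relay_dir="$(mktemp -d)"   # throwaway stand-in for the relay directory

# Case 1: the request names the running command, so it is honoured.
echo "cmd_A" > "$relay_dir/cancel"
matched=no
cancel_requested_for "$relay_dir" "cmd_A" && matched=yes

# Case 2: a leftover request for an older command is rejected and removed.
echo "cmd_old" > "$relay_dir/cancel"
mismatched=no
cancel_requested_for "$relay_dir" "cmd_A" && mismatched=yes
stale_left=no
[ -f "$relay_dir/cancel" ] && stale_left=yes

echo "matched=$matched mismatched=$mismatched stale_left=$stale_left"
rm -rf "$relay_dir"
```

This check guards against a client that timed out on one command cancelling whichever command the server has moved on to since.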
