Add job cancellation support to the debugger command relay (#1262)

cjluo-nv · web-flow · commit 7c8557158dd3 · 2026-04-15T16:49:42.000Z
## Summary

- Add a `cancel` subcommand to the client that terminates the currently running command on the server
- Server now runs commands in the background with PID tracking, enabling cancellation mid-execution
- Client-side timeouts automatically cancel the running command on the server (previously the server process was left running)
- Hardened against race conditions through 4 rounds of adversarial review (15 fixes total)

### Key changes

**server.sh:**

- Commands run in background with PID tracked in `$RELAY_DIR/running` (atomic tmp+mv write)
- Cancel detection loop checks for `$RELAY_DIR/cancel` file with cmd_id verification
- SIGTERM with 5s grace period, then SIGKILL escalation for stuck processes
- `.exit` file written before `running` marker removed (ordering guarantee)
- `set -e`-safe: `wait` uses `|| exit_code=$?` pattern; cleanup trap fully guarded
- Stale cancel files cleared at command start; mismatched/empty signals rejected
- Command file read into memory and removed before execution (eliminates TOCTOU with client timeout)

**client.sh:**

- New `cancel` subcommand: writes target cmd_id to cancel file, waits for server acknowledgment (30s timeout)
- `run` timeout now sends targeted cancel signal (verifies cmd_id match to avoid killing wrong command)
- `run` timeout cleans up orphaned result files
- `status` shows currently running command
- `flush` rejects if a command is currently running (prevents state corruption)
- Exit code validated as numeric before use

### Protocol additions

```
.relay/
├── running   # server writes cmd_id:pid while executing (atomic)
├── cancel    # client writes target cmd_id to request cancellation
```

## Test plan

- [ ] Start server in Docker, handshake from host
- [ ] Run a command (`client.sh run "sleep 30"`), cancel it (`client.sh cancel`), verify exit code 130
- [ ] Run a command with short timeout (`--timeout 5 run "sleep 30"`), verify auto-cancel
- [ ] Run a command that exits non-zero, verify server stays alive
- [ ] Run `status` during execution, verify it shows the running command
- [ ] Attempt `flush` during execution, verify it is rejected
- [ ] Cancel when nothing is running, verify clean message

🤖 Generated with [Claude Code](https://claude.com/claude-code)

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A (bash scripts for dev tooling, tested manually)
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A (internal tooling)

## Summary by CodeRabbit

* **Documentation**
  * Added a comprehensive debug skill guide and protocol reference with quick CLI examples and a new “Cancelling Commands” section.
* **New Features**
  * Client-side `cancel` command to terminate the currently running remote command.
  * Status now reports active command (or `(idle)`).
* **Improvements**
  * Stronger startup/validation guidance, safer shutdown/cleanup, deterministic cancel exit semantics (130), and auto-cancel on client-side timeout.

---------

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
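
The `server.sh` behaviour listed under "Key changes" (background execution, the atomic `running` marker, cmd_id-verified cancel, SIGTERM followed by SIGKILL, and exit code 130) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the actual `server.sh`: the function name `run_with_cancel` and the `GRACE_SECS` and `POLL_SECS` variables are hypothetical, and the sketch signals only the tracked PID, whereas the README describes killing the command's whole process group.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the names below are hypothetical, not taken from server.sh.
set -euo pipefail

RELAY_DIR="${RELAY_DIR:-$(mktemp -d)}"  # assumed layout: running, cancel, result/
GRACE_SECS="${GRACE_SECS:-5}"           # SIGTERM grace period before SIGKILL
POLL_SECS="${POLL_SECS:-0.1}"           # assumed polling interval
mkdir -p "$RELAY_DIR/result"

# run_with_cancel <cmd_id> <command string>
run_with_cancel() {
    local cmd_id="$1" content="$2"
    local pid target="" exit_code=0 cancelled=0 deadline

    # Clear any stale cancel signal left over from an earlier command.
    rm -f "$RELAY_DIR/cancel"

    # Run in the background so the loop below can watch for a cancel request.
    bash -c "$content" > "$RELAY_DIR/result/$cmd_id.log" 2>&1 &
    pid=$!

    # Publish cmd_id:pid atomically (tmp + mv) so readers never see a partial file.
    echo "$cmd_id:$pid" > "$RELAY_DIR/running.tmp"
    mv "$RELAY_DIR/running.tmp" "$RELAY_DIR/running"

    while kill -0 "$pid" 2>/dev/null; do
        if [[ -f "$RELAY_DIR/cancel" ]]; then
            target=$(cat "$RELAY_DIR/cancel" 2>/dev/null) || true
            rm -f "$RELAY_DIR/cancel"
            # Empty or mismatched signals are ignored.
            if [[ -n "$target" && "$target" == "$cmd_id" ]]; then
                cancelled=1
                kill -TERM "$pid" 2>/dev/null || true
                deadline=$((SECONDS + GRACE_SECS))
                while kill -0 "$pid" 2>/dev/null && (( SECONDS < deadline )); do
                    sleep "$POLL_SECS"
                done
                # Escalate if the process ignored SIGTERM; a no-op if it already exited.
                kill -KILL "$pid" 2>/dev/null || true
                break
            fi
        fi
        sleep "$POLL_SECS"
    done

    # set -e-safe: a non-zero status from wait must not abort the server.
    wait "$pid" || exit_code=$?
    if (( cancelled )); then exit_code=130; fi

    # Write .exit before removing the running marker (ordering guarantee).
    echo "$exit_code" > "$RELAY_DIR/result/$cmd_id.exit"
    rm -f "$RELAY_DIR/running"
}

# Demo: request cancellation of a 30-second sleep after about one second.
( sleep 1; echo "demo" > "$RELAY_DIR/cancel" ) &
run_with_cancel demo "sleep 30"
echo "exit code: $(cat "$RELAY_DIR/result/demo.exit")"
```

Run as-is, the demo should report exit code 130 after about a second instead of waiting out the full sleep. Writing `.exit` before removing `running` means a client that sees the marker disappear can assume the result file is already in place.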
diff --git a/.claude/skills/debug/SKILL.md b/.claude/skills/debug/SKILL.md
@@ -0,0 +1,33 @@
+---
+name: debug
+description: Run commands inside a remote Docker container via the file-based command relay (tools/debugger). Use when the user says "run in Docker", "run on GPU", "debug remotely", "run test in container", "check nvidia-smi", "run pytest in Docker", or needs to execute any command inside a Docker container that shares the repo filesystem. Requires the user to have started server.sh inside the container first.
+---
+
+# Remote Docker Debugger
+
+Execute commands inside a Docker container from the host using the file-based command relay.
+
+**Read `tools/debugger/CLAUDE.md` for full usage details** — it has the protocol and examples.
+
+## Quick Reference
+
+```bash
+# Check connection
+bash tools/debugger/client.sh status
+
+# Connect to server (user must start server.sh in Docker first)
+bash tools/debugger/client.sh handshake
+
+# Run a command
+bash tools/debugger/client.sh run "<command>"
+
+# Long-running command (default timeout is 600s)
+bash tools/debugger/client.sh --timeout 1800 run "<command>"
+
+# Cancel the currently running command
+bash tools/debugger/client.sh cancel
+
+# Reconnect after server restart
+bash tools/debugger/client.sh flush
+bash tools/debugger/client.sh handshake
+```
diff --git a/tools/debugger/CLAUDE.md b/tools/debugger/CLAUDE.md
@@ -53,10 +53,20 @@ bash tools/debugger/client.sh run "nvidia-smi"
 bash tools/debugger/client.sh run "python script.py --model /hf-local/Qwen/Qwen3-8B"
 ```
 
+### Cancelling Commands
+
+```bash
+# Cancel the currently running command
+bash tools/debugger/client.sh cancel
+
+# Client-side timeout also auto-cancels the running command
+```
+
 ### Important Notes
 
 - The server must be started by the user manually inside Docker before the handshake.
 - Default command timeout is 600 seconds (10 minutes). Use `--timeout` for longer tasks.
 - Commands execute sequentially — one at a time.
+- A running command can be cancelled; cancelled commands exit with code 130.
 - All commands run with the auto-detected repo root as the working directory.
 - The `.relay/` directory is ephemeral and git-ignored.
diff --git a/tools/debugger/README.md b/tools/debugger/README.md
@@ -52,6 +52,9 @@ bash tools/debugger/client.sh run "bash llm_ptq/scripts/huggingface_example.sh"
 # Run with a long timeout (default is 600s)
 bash tools/debugger/client.sh --timeout 1800 run "python my_long_test.py"
 
+# Cancel a running command
+bash tools/debugger/client.sh cancel
+
 # Check status
 bash tools/debugger/client.sh status
 ```
@@ -65,6 +68,8 @@ The relay uses a directory at `tools/debugger/.relay/` with this structure:
 ├── server.ready      # Written by server on startup
 ├── client.ready      # Written by client during handshake
 ├── handshake.done    # Written by server to confirm handshake
+├── running           # Written by server while a command is executing (cmd_id:pid)
+├── cancel            # Written by client to request cancellation of the running command
 ├── cmd/              # Client writes command .sh files here
 │   └── <id>.sh       # Command to execute
 └── result/           # Server writes results here
@@ -81,10 +86,18 @@ The relay uses a directory at `tools/debugger/.relay/` with this structure:
 
 ### Command Execution
 
-1. Client writes a command to `.relay/cmd/<timestamp>.sh`
-2. Server detects the file, runs `bash <file>` in the workdir, captures output
-3. Server writes `.relay/result/<timestamp>.log` and `.relay/result/<timestamp>.exit`
-4. Server removes the `.sh` file; client reads results and cleans up
+1. Client writes a command to `.relay/cmd/<id>.sh`
+2. Server detects the file, reads the command content, and removes the `.sh` file
+3. Server runs `bash -c <content>` in a new process group and writes `cmd_id:pid` to `.relay/running`
+4. Server writes `.relay/result/<id>.exit` and `.relay/result/<id>.log`, then removes `.relay/running`
+5. Client reads results and cleans up
+
+### Cancellation
+
+1. Client writes the target `cmd_id` to `.relay/cancel`
+2. Server verifies the `cmd_id` matches, then kills the command's process group
+3. Server writes exit code 130 and removes `.relay/running` and `.relay/cancel`
+4. Client-side timeout also triggers cancellation automatically
 
 ## Options
 
@@ -107,3 +120,5 @@ The relay uses a directory at `tools/debugger/.relay/` with this structure:
 - The `.relay/` directory is in `.gitignore` — it is not checked in.
 - Only one server should run at a time (startup clears the relay directory).
 - Commands run sequentially in the order the server discovers them.
+- A running command can be cancelled via `client.sh cancel`. Cancelled commands exit with code 130.
+- Client-side timeouts automatically cancel the running command on the server.
diff --git a/tools/debugger/client.sh b/tools/debugger/client.sh
@@ -21,6 +21,7 @@
 # Usage:
 #   bash client.sh handshake              - Connect to server
 #   bash client.sh run <command...>        - Run a command and print output
+#   bash client.sh cancel                 - Cancel the running command
 #   bash client.sh status                  - Check server status
 #
 # Options:
@@ -54,6 +55,22 @@ RESULT_DIR="$RELAY_DIR/result"
 SUBCOMMAND="${1:-}"
 shift || true
 
+# Helper: wait for a specific command to finish (running marker gone or cmd_id changed)
+wait_for_cancel_completion() {
+    local target_id="$1" wait_timeout="$2" elapsed=0 current_info current_id
+    while [[ $elapsed -lt $wait_timeout ]]; do
+        if [[ ! -f "$RELAY_DIR/running" ]]; then
+            return 0
+        fi
+        current_info=$(cat "$RELAY_DIR/running" 2>/dev/null) || true
+        current_id="${current_info%%:*}"
+        [[ "$current_id" != "$target_id" ]] && return 0
+        sleep "$POLL_INTERVAL"
+        elapsed=$((elapsed + POLL_INTERVAL))
+    done
+    return 1
+}
+
 case "$SUBCOMMAND" in
     handshake)
         # Check server is ready
@@ -91,6 +108,8 @@ case "$SUBCOMMAND" in
         # Generate a unique command ID (timestamp + PID to avoid collisions)
         cmd_id="$(date +%s%N)_$$"
 
+        echo "[client] Running: $*"
+
         # Write the command file atomically (tmp + mv)
         echo "$*" > "$CMD_DIR/$cmd_id.sh.tmp"
         mv "$CMD_DIR/$cmd_id.sh.tmp" "$CMD_DIR/$cmd_id.sh"
@@ -107,15 +126,35 @@ case "$SUBCOMMAND" in
             sleep "$POLL_INTERVAL"
             elapsed=$((elapsed + POLL_INTERVAL))
             if [[ $elapsed -ge $TIMEOUT ]]; then
+                # Result might have arrived during the last sleep
+                [[ -f "$RESULT_DIR/$cmd_id.exit" ]] && break
                 echo "ERROR: Command timed out after ${TIMEOUT}s."
-                # Clean up the pending command
+                # Cancel the running command only if it is OUR command
+                if [[ -f "$RELAY_DIR/running" ]]; then
+                    running_info=$(cat "$RELAY_DIR/running" 2>/dev/null) || true
+                    if [[ -n "$running_info" && "$running_info" == *:* ]]; then
+                        running_id="${running_info%%:*}"
+                        if [[ "$running_id" == "$cmd_id" ]]; then
+                            echo "Sending cancel signal..."
+                            echo "$cmd_id" > "$RELAY_DIR/cancel.tmp"
+                            mv "$RELAY_DIR/cancel.tmp" "$RELAY_DIR/cancel"
+                            wait_for_cancel_completion "$cmd_id" 10 || true
+                        fi
+                    fi
+                fi
+                # Clean up command and any orphaned result files
                 rm -f "$CMD_DIR/$cmd_id.sh"
+                rm -f "$RESULT_DIR/$cmd_id.exit" "$RESULT_DIR/$cmd_id.log"
                 exit 1
             fi
         done
 
         # Read and display results
         exit_code=$(cat "$RESULT_DIR/$cmd_id.exit")
+        if ! [[ "$exit_code" =~ ^[0-9]+$ ]]; then
+            echo "WARNING: Invalid exit code '$exit_code', defaulting to 1."
+            exit_code=1
+        fi
         if [[ -f "$RESULT_DIR/$cmd_id.log" ]]; then
             cat "$RESULT_DIR/$cmd_id.log"
         fi
@@ -139,6 +178,12 @@ case "$SUBCOMMAND" in
         else
             echo "Handshake: not started"
         fi
+        if [[ -f "$RELAY_DIR/running" ]]; then
+            running_info=$(cat "$RELAY_DIR/running" 2>/dev/null) || running_info="(disappeared)"
+            echo "Running: $running_info"
+        else
+            echo "Running: (idle)"
+        fi
         if [[ -d "$CMD_DIR" ]]; then
             pending=$(find "$CMD_DIR" -maxdepth 1 -type f -name '*.sh' 2>/dev/null | wc -l)
         else
@@ -148,6 +193,12 @@ case "$SUBCOMMAND" in
         ;;
 
     flush)
+        # Block flush if a command is actually running (server alive + running marker)
+        # Allow flush if server is dead (stale running marker from crash)
+        if [[ -f "$RELAY_DIR/running" ]] && [[ -f "$RELAY_DIR/server.ready" ]]; then
+            echo "ERROR: A command is currently running. Cancel it first or wait for it to finish."
+            exit 1
+        fi
         if [[ -d "$RELAY_DIR" ]]; then
             # Clear handshake and command/result files, but keep server.ready
             rm -f "$RELAY_DIR/client.ready" "$RELAY_DIR/handshake.done"
@@ -159,12 +210,48 @@ case "$SUBCOMMAND" in
         fi
         ;;
 
+    cancel)
+        # Check if there's a running command
+        if [[ -f "$RELAY_DIR/running" ]]; then
+            running_info=$(cat "$RELAY_DIR/running" 2>/dev/null) || true
+            if [[ -z "$running_info" || "$running_info" != *:* ]]; then
+                echo "WARNING: Running marker is corrupt or empty. Cannot identify command to cancel."
+                exit 1
+            fi
+            running_id="${running_info%%:*}"
+            echo "Cancelling running command: $running_id"
+
+            # Write cancel signal atomically with cmd_id so server can verify the target
+            echo "$running_id" > "$RELAY_DIR/cancel.tmp"
+            mv "$RELAY_DIR/cancel.tmp" "$RELAY_DIR/cancel"
+
+            # Wait for the server to process the cancellation
+            if wait_for_cancel_completion "$running_id" 30; then
+                echo "Command cancelled."
+            else
+                echo "WARNING: Cancel signal sent but command still running after 30s."
+                exit 1
+            fi
+        else
+            echo "No command is currently running."
+        fi
+
+        # Report pending commands
+        if [[ -d "$CMD_DIR" ]]; then
+            pending=$(find "$CMD_DIR" -maxdepth 1 -type f -name '*.sh' 2>/dev/null | wc -l)
+            if [[ "$pending" -gt 0 ]]; then
+                echo "$pending pending command(s) in queue. Use 'flush' to clear them."
+            fi
+        fi
+        ;;
+
     *)
         echo "Usage: $0 [--relay-dir <path>] [--timeout <secs>] <subcommand>"
         echo ""
         echo "Subcommands:"
         echo "  handshake   Connect to the server"
         echo "  run <cmd>   Execute a command on the server"
+        echo "  cancel      Cancel the currently running command"
         echo "  status      Check connection status"
         echo "  flush       Clear the relay directory"
         exit 1
diff --git a/tools/debugger/server.sh b/tools/debugger/server.sh