Commit c4e320e

Update
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
1 parent 08a838c commit c4e320e

5 files changed, +180 -11 lines changed

.claude/skills/debug/SKILL.md

Lines changed: 3 additions & 0 deletions
@@ -24,6 +24,9 @@ bash tools/debugger/client.sh run "<command>"
 # Long-running command (default timeout is 600s)
 bash tools/debugger/client.sh --timeout 1800 run "<command>"
 
+# Cancel the currently running command
+bash tools/debugger/client.sh cancel
+
 # Reconnect after server restart
 bash tools/debugger/client.sh flush
 bash tools/debugger/client.sh handshake

tools/debugger/CLAUDE.md

Lines changed: 10 additions & 0 deletions
@@ -53,10 +53,20 @@ bash tools/debugger/client.sh run "nvidia-smi"
 bash tools/debugger/client.sh run "python script.py --model /hf-local/Qwen/Qwen3-8B"
 ```
 
+### Cancelling Commands
+
+```bash
+# Cancel the currently running command
+bash tools/debugger/client.sh cancel
+
+# Client-side timeout also auto-cancels the running command
+```
+
 ### Important Notes
 
 - The server must be started by the user manually inside Docker before the handshake.
 - Default command timeout is 600 seconds (10 minutes). Use `--timeout` for longer tasks.
 - Commands execute sequentially — one at a time.
+- A running command can be cancelled; cancelled commands exit with code 130.
 - All commands run with the auto-detected repo root as the working directory.
 - The `.relay/` directory is ephemeral and git-ignored.

tools/debugger/README.md

Lines changed: 16 additions & 2 deletions
@@ -52,6 +52,9 @@ bash tools/debugger/client.sh run "bash llm_ptq/scripts/huggingface_example.sh"
 # Run with a long timeout (default is 600s)
 bash tools/debugger/client.sh --timeout 1800 run "python my_long_test.py"
 
+# Cancel a running command
+bash tools/debugger/client.sh cancel
+
 # Check status
 bash tools/debugger/client.sh status
 ```
@@ -65,6 +68,8 @@ The relay uses a directory at `tools/debugger/.relay/` with this structure:
 ├── server.ready # Written by server on startup
 ├── client.ready # Written by client during handshake
 ├── handshake.done # Written by server to confirm handshake
+├── running # Written by server while a command is executing (cmd_id:pid)
+├── cancel # Written by client to request cancellation of the running command
 ├── cmd/ # Client writes command .sh files here
 │   └── <id>.sh # Command to execute
 └── result/ # Server writes results here
@@ -82,9 +87,16 @@ The relay uses a directory at `tools/debugger/.relay/` with this structure:
 ### Command Execution
 
 1. Client writes a command to `.relay/cmd/<timestamp>.sh`
-2. Server detects the file, runs `bash <file>` in the workdir, captures output
+2. Server detects the file, runs `bash <file>` in the workdir in background, writes `.relay/running`
 3. Server writes `.relay/result/<timestamp>.log` and `.relay/result/<timestamp>.exit`
-4. Server removes the `.sh` file; client reads results and cleans up
+4. Server removes the `.sh` file and `.relay/running`; client reads results and cleans up
+
+### Cancellation
+
+1. Client writes `.relay/cancel`
+2. Server detects the cancel signal, kills the running command process tree
+3. Server writes exit code 130 and removes `.relay/running` and `.relay/cancel`
+4. Client-side timeout also triggers cancellation automatically
 
 ## Options
 
@@ -107,3 +119,5 @@ The relay uses a directory at `tools/debugger/.relay/` with this structure:
 - The `.relay/` directory is in `.gitignore` — it is not checked in.
 - Only one server should run at a time (startup clears the relay directory).
 - Commands run sequentially in the order the server discovers them.
+- A running command can be cancelled via `client.sh cancel`. Cancelled commands exit with code 130.
+- Client-side timeouts automatically cancel the running command on the server.
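
The cancellation steps documented in the README above can be exercised in isolation. What follows is a minimal sketch under stated assumptions, not the code from this commit: both the "server" and "client" roles run in one process, `RELAY_DIR` is a throwaway directory from `mktemp -d` rather than the real `tools/debugger/.relay/`, and `demo-001` is a hypothetical command ID.

```shell
# Sketch of the relay-file cancellation handshake, simulated in a temp dir.
RELAY_DIR="$(mktemp -d)"     # throwaway stand-in for the real relay directory

# "Server" side: start a long-running job and publish cmd_id:pid.
# Writing to a temp name and renaming avoids readers seeing a partial file.
cmd_id="demo-001"            # hypothetical command ID
sleep 30 &
cmd_pid=$!
echo "$cmd_id:$cmd_pid" > "$RELAY_DIR/running.tmp"
mv "$RELAY_DIR/running.tmp" "$RELAY_DIR/running"

# "Client" side: read the marker and request cancellation of that ID.
running_info="$(cat "$RELAY_DIR/running")"
running_id="${running_info%%:*}"
echo "$running_id" > "$RELAY_DIR/cancel"

# "Server" side: honour the request only if it names the running command.
cancel_target="$(cat "$RELAY_DIR/cancel")"
exit_code=0
if [[ "$cancel_target" == "$cmd_id" ]]; then
    kill "$cmd_pid" 2>/dev/null || true
    wait "$cmd_pid" 2>/dev/null || true
    exit_code=130            # the exit code the README documents for cancelled commands
fi

# Publish the exit code first, then remove the markers.
echo "$exit_code" > "$RELAY_DIR/demo.exit"
rm -f "$RELAY_DIR/running" "$RELAY_DIR/cancel"
echo "exit_code=$exit_code"
```

Writing the command ID into the cancel file, rather than leaving it empty, is what lets the receiving side ignore a stale request aimed at an earlier command.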

tools/debugger/client.sh

Lines changed: 64 additions & 1 deletion
@@ -21,6 +21,7 @@
 # Usage:
 # bash client.sh handshake - Connect to server
 # bash client.sh run <command...> - Run a command and print output
+# bash client.sh cancel - Cancel the running command
 # bash client.sh status - Check server status
 #
 # Options:
@@ -110,14 +111,32 @@ case "$SUBCOMMAND" in
             elapsed=$((elapsed + POLL_INTERVAL))
             if [[ $elapsed -ge $TIMEOUT ]]; then
                 echo "ERROR: Command timed out after ${TIMEOUT}s."
-                # Clean up the pending command
+                # Cancel the running command only if it is OUR command
+                if [[ -f "$RELAY_DIR/running" ]]; then
+                    running_info=$(cat "$RELAY_DIR/running" 2>/dev/null) || true
+                    running_id="${running_info%%:*}"
+                    if [[ "$running_id" == "$cmd_id" ]]; then
+                        echo "Sending cancel signal..."
+                        echo "$cmd_id" > "$RELAY_DIR/cancel"
+                        for _ in $(seq 1 10); do
+                            [[ -f "$RELAY_DIR/running" ]] || break
+                            sleep 1
+                        done
+                    fi
+                fi
+                # Clean up command and any orphaned result files
                 rm -f "$CMD_DIR/$cmd_id.sh"
+                rm -f "$RESULT_DIR/$cmd_id.exit" "$RESULT_DIR/$cmd_id.log"
                 exit 1
             fi
         done
 
         # Read and display results
         exit_code=$(cat "$RESULT_DIR/$cmd_id.exit")
+        if ! [[ "$exit_code" =~ ^[0-9]+$ ]]; then
+            echo "WARNING: Invalid exit code '$exit_code', defaulting to 1."
+            exit_code=1
+        fi
         if [[ -f "$RESULT_DIR/$cmd_id.log" ]]; then
             cat "$RESULT_DIR/$cmd_id.log"
         fi
@@ -141,6 +160,11 @@ case "$SUBCOMMAND" in
         else
             echo "Handshake: not started"
         fi
+        if [[ -f "$RELAY_DIR/running" ]]; then
+            echo "Running: $(cat "$RELAY_DIR/running")"
+        else
+            echo "Running: (idle)"
+        fi
         if [[ -d "$CMD_DIR" ]]; then
             pending=$(find "$CMD_DIR" -maxdepth 1 -type f -name '*.sh' 2>/dev/null | wc -l)
         else
@@ -150,6 +174,10 @@ case "$SUBCOMMAND" in
         ;;
 
     flush)
+        if [[ -f "$RELAY_DIR/running" ]]; then
+            echo "ERROR: A command is currently running. Cancel it first or wait for it to finish."
+            exit 1
+        fi
         if [[ -d "$RELAY_DIR" ]]; then
             # Clear handshake and command/result files, but keep server.ready
             rm -f "$RELAY_DIR/client.ready" "$RELAY_DIR/handshake.done"
@@ -161,12 +189,47 @@ case "$SUBCOMMAND" in
         fi
         ;;
 
+    cancel)
+        # Check if there's a running command
+        if [[ -f "$RELAY_DIR/running" ]]; then
+            running_info=$(cat "$RELAY_DIR/running" 2>/dev/null) || true
+            running_id="${running_info%%:*}"
+            echo "Cancelling running command: $running_id"
+
+            # Write cancel signal with cmd_id so server can verify the target
+            echo "$running_id" > "$RELAY_DIR/cancel"
+
+            # Wait for the server to process the cancellation
+            elapsed=0
+            while [[ -f "$RELAY_DIR/running" ]]; do
+                sleep "$POLL_INTERVAL"
+                elapsed=$((elapsed + POLL_INTERVAL))
+                if [[ $elapsed -ge 30 ]]; then
+                    echo "WARNING: Cancel signal sent but command still running after 30s."
+                    exit 1
+                fi
+            done
+            echo "Command cancelled."
+        else
+            echo "No command is currently running."
+        fi
+
+        # Report pending commands
+        if [[ -d "$CMD_DIR" ]]; then
+            pending=$(find "$CMD_DIR" -maxdepth 1 -type f -name '*.sh' 2>/dev/null | wc -l)
+            if [[ "$pending" -gt 0 ]]; then
+                echo "$pending pending command(s) in queue. Use 'flush' to clear them."
+            fi
+        fi
+        ;;
+
     *)
         echo "Usage: $0 [--relay-dir <path>] [--timeout <secs>] <subcommand>"
         echo ""
         echo "Subcommands:"
         echo " handshake Connect to the server"
         echo " run <cmd> Execute a command on the server"
+        echo " cancel Cancel the currently running command"
         echo " status Check connection status"
         echo " flush Clear the relay directory"
         exit 1
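
Two small pieces of the client logic in this file can be checked on their own: splitting the `cmd_id:pid` marker with parameter expansion, and falling back to `1` when the exit-code file does not hold a plain number. The sketch below is illustrative only. The helper name `sanitize_exit_code` is hypothetical (the commit performs this check inline), and the marker values are made up.

```shell
# Hypothetical helper: echo a numeric exit code unchanged, anything else as 1.
sanitize_exit_code() {
    local raw="$1"
    if [[ "$raw" =~ ^[0-9]+$ ]]; then
        printf '%s\n' "$raw"
    else
        printf '1\n'
    fi
}

# Example marker contents in the cmd_id:pid format (values are made up).
running_info="1712000000:4242"
running_id="${running_info%%:*}"   # everything before the first colon
running_pid="${running_info#*:}"   # everything after the first colon

ok_code="$(sanitize_exit_code "130")"   # numeric, passes through
bad_code="$(sanitize_exit_code "")"     # empty file contents, falls back to 1
echo "id=$running_id pid=$running_pid ok=$ok_code bad=$bad_code"
# prints: id=1712000000 pid=4242 ok=130 bad=1
```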

tools/debugger/server.sh

Lines changed: 87 additions & 8 deletions
@@ -63,10 +63,19 @@ RESULT_DIR="$RELAY_DIR/result"
 
 cleanup() {
     echo "[server] Shutting down..."
+    # Kill any running command (guard all reads with || true to prevent set -e
+    # from aborting the trap and leaving stale marker files)
+    running_pid=$(cut -d: -f2 "$RELAY_DIR/running" 2>/dev/null) || true
+    if [[ -n "$running_pid" ]]; then
+        pkill -P "$running_pid" 2>/dev/null || true
+        kill "$running_pid" 2>/dev/null || true
+    fi
     # Kill any child processes in our process group
     pkill -P $$ 2>/dev/null || true
     rm -f "$RELAY_DIR/server.ready"
     rm -f "$RELAY_DIR/handshake.done"
+    rm -f "$RELAY_DIR/running"
+    rm -f "$RELAY_DIR/cancel"
     exit 0
 }
 trap cleanup SIGINT SIGTERM
@@ -140,20 +149,90 @@ while true; do
     fi
 
     for cmd_file in "$CMD_DIR"/*.sh; do
+        # Guard against command files deleted by the client between glob expansion
+        # and processing (e.g., client timeout on a queued command)
+        [[ -f "$cmd_file" ]] || continue
+
         cmd_id="$(basename "$cmd_file" .sh)"
-        cmd_content=$(cat "$cmd_file")
+        # Tolerate file disappearing between guard and read (TOCTOU with client timeout)
+        cmd_content=$(cat "$cmd_file" 2>/dev/null) || continue
+        # Remove command file immediately after reading to prevent re-execution
+        # and to avoid TOCTOU with client timeout deleting it during execution
+        rm -f "$cmd_file"
         echo "[server] Executing command $cmd_id: $cmd_content"
 
-        # Execute the command, tee stdout+stderr to console and result file
-        (cd "$WORKDIR" && bash "$cmd_file" 2>&1) | tee "$RESULT_DIR/$cmd_id.log" || true
-        exit_code=${PIPESTATUS[0]}
-
-        # Atomic write of exit code (signal to client that result is ready)
+        # Clear any stale cancel file from a previous timed-out client
+        rm -f "$RELAY_DIR/cancel"
+
+        # Create log file and stream output to server console via tail
+        : > "$RESULT_DIR/$cmd_id.log"
+        tail -f "$RESULT_DIR/$cmd_id.log" &
+        tail_pid=$!
+
+        # Run from cmd_content (not the file) since we already removed it
+        (cd "$WORKDIR" && bash -c "$cmd_content") >> "$RESULT_DIR/$cmd_id.log" 2>&1 &
+        cmd_pid=$!
+
+        # Track the running command (ID and PID) — atomic write to prevent partial reads
+        echo "$cmd_id:$cmd_pid" > "$RELAY_DIR/running.tmp"
+        mv "$RELAY_DIR/running.tmp" "$RELAY_DIR/running"
+
+        # Wait for completion or cancellation
+        cancelled=""
+        while kill -0 "$cmd_pid" 2>/dev/null; do
+            if [[ -f "$RELAY_DIR/cancel" ]]; then
+                # Verify cancel targets this command (reject empty or mismatched signals)
+                cancel_target=$(cat "$RELAY_DIR/cancel" 2>/dev/null) || true
+                if [[ "$cancel_target" != "$cmd_id" ]]; then
+                    rm -f "$RELAY_DIR/cancel"
+                    sleep "$POLL_INTERVAL"
+                    continue
+                fi
+                echo "[server] Cancelling command $cmd_id (PID $cmd_pid)..."
+                # Send SIGTERM to children first, then parent
+                pkill -P "$cmd_pid" 2>/dev/null || true
+                kill "$cmd_pid" 2>/dev/null || true
+                # Wait up to 5s for graceful exit, then escalate to SIGKILL
+                for _ in $(seq 1 5); do
+                    kill -0 "$cmd_pid" 2>/dev/null || break
+                    sleep 1
+                done
+                if kill -0 "$cmd_pid" 2>/dev/null; then
+                    echo "[server] Process $cmd_pid did not exit, sending SIGKILL..."
+                    pkill -9 -P "$cmd_pid" 2>/dev/null || true
+                    kill -9 "$cmd_pid" 2>/dev/null || true
+                fi
+                wait "$cmd_pid" 2>/dev/null || true
+                cancelled="true"
+                rm -f "$RELAY_DIR/cancel"
+                echo "[cancelled]" >> "$RESULT_DIR/$cmd_id.log"
+                echo "[server] Command $cmd_id cancelled."
+                break
+            fi
+            sleep "$POLL_INTERVAL"
+        done
+
+        # Determine exit code (|| exit_code=$? prevents set -e from killing the
+        # server when the command exits non-zero)
+        if [[ -n "$cancelled" ]]; then
+            exit_code=130
+        else
+            exit_code=0
+            wait "$cmd_pid" 2>/dev/null || exit_code=$?
+        fi
+
+        # Stop console streaming
+        kill "$tail_pid" 2>/dev/null || true
+        wait "$tail_pid" 2>/dev/null || true
+
+        # Write exit code BEFORE removing the running marker, so any observer
+        # that sees running disappear can immediately find the result
         echo "$exit_code" > "$RESULT_DIR/$cmd_id.exit.tmp"
         mv "$RESULT_DIR/$cmd_id.exit.tmp" "$RESULT_DIR/$cmd_id.exit"
 
-        # Remove the command file to mark it as processed
-        rm -f "$cmd_file"
+        # Now safe to remove markers
+        rm -f "$RELAY_DIR/running"
+        rm -f "$RELAY_DIR/cancel"
 
         echo "[server] Command $cmd_id finished (exit=$exit_code)"
     done
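
The SIGTERM-then-SIGKILL escalation in the server loop above can be tried against a deliberately stubborn process. This is a sketch under stated assumptions rather than the commit's code: the function name `terminate_with_grace` and its grace-period argument are hypothetical, and the demo job ignores SIGTERM on purpose so that the SIGKILL path is taken. It assumes `seq` is available and tolerates a missing `pkill`.

```shell
# Hypothetical helper: terminate_with_grace PID [GRACE_SECONDS]
# Sends SIGTERM to the PID's children and the PID itself, waits up to
# GRACE_SECONDS for it to exit, then escalates to SIGKILL.
terminate_with_grace() {
    local pid="$1" grace="${2:-5}"
    pkill -P "$pid" 2>/dev/null || true
    kill "$pid" 2>/dev/null || true
    for _ in $(seq 1 "$grace"); do
        kill -0 "$pid" 2>/dev/null || break   # already gone, stop waiting
        sleep 1
    done
    if kill -0 "$pid" 2>/dev/null; then
        pkill -9 -P "$pid" 2>/dev/null || true
        kill -9 "$pid" 2>/dev/null || true
    fi
    wait "$pid" 2>/dev/null || true           # reap the child
}

# Demo job that ignores SIGTERM, which forces the escalation path.
bash -c 'trap "" TERM; while :; do sleep 1; done' &
stubborn_pid=$!
sleep 1                                       # give the trap time to install

terminate_with_grace "$stubborn_pid" 2

if kill -0 "$stubborn_pid" 2>/dev/null; then
    still_running=yes
else
    still_running=no
fi
echo "still_running=$still_running"
```

Signalling the children before the parent mirrors the order used in the server loop. Note that `pkill -P` only reaches direct children, so deeper descendants of a command would need a process-group or recursive approach.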
