33 changes: 33 additions & 0 deletions .claude/skills/debug/SKILL.md
@@ -0,0 +1,33 @@
---
name: debug
description: Run commands inside a remote Docker container via the file-based command relay (tools/debugger). Use when the user says "run in Docker", "run on GPU", "debug remotely", "run test in container", "check nvidia-smi", "run pytest in Docker", or needs to execute any command inside a Docker container that shares the repo filesystem. Requires the user to have started server.sh inside the container first.
---

# Remote Docker Debugger

Execute commands inside a Docker container from the host using the file-based command relay.

**Read `tools/debugger/CLAUDE.md` for full usage details** — it has the protocol, examples, and troubleshooting.

## Quick Reference

```bash
# Check connection
bash tools/debugger/client.sh status

# Connect to server (user must start server.sh in Docker first)
bash tools/debugger/client.sh handshake

# Run a command
bash tools/debugger/client.sh run "<command>"

# Long-running command (default timeout is 600s)
bash tools/debugger/client.sh --timeout 1800 run "<command>"

# Cancel the currently running command
bash tools/debugger/client.sh cancel

# Reconnect after server restart
bash tools/debugger/client.sh flush
bash tools/debugger/client.sh handshake
```
10 changes: 10 additions & 0 deletions tools/debugger/CLAUDE.md
@@ -53,10 +53,20 @@ bash tools/debugger/client.sh run "nvidia-smi"
bash tools/debugger/client.sh run "python script.py --model /hf-local/Qwen/Qwen3-8B"
```

### Cancelling Commands

```bash
# Cancel the currently running command
bash tools/debugger/client.sh cancel

# Client-side timeout also auto-cancels the running command
```

### Important Notes

- The server must be started by the user manually inside Docker before the handshake.
- Default command timeout is 600 seconds (10 minutes). Use `--timeout` for longer tasks.
- Commands execute sequentially — one at a time.
- A running command can be cancelled; cancelled commands exit with code 130.
- All commands run with the auto-detected repo root as the working directory.
- The `.relay/` directory is ephemeral and git-ignored.
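
The notes above imply three outcomes a caller may want to tell apart: success, cancellation (exit code 130), and any other failure. A minimal sketch of that distinction follows. The function name `describe_relay_exit` is hypothetical and not part of the tool, and the usage line assumes `client.sh run` exits with the relayed exit code, which the visible part of this diff does not show.

```shell
# Hypothetical helper; not part of tools/debugger.
# Maps an exit code to the outcome the notes above describe.
describe_relay_exit() {
    case "$1" in
        0)   echo "succeeded" ;;
        130) echo "cancelled" ;;          # code the server writes on cancellation
        *)   echo "failed (exit=$1)" ;;   # includes a client-side timeout
    esac
}

describe_relay_exit 0
describe_relay_exit 130
describe_relay_exit 2

# Usage sketch (assumes client.sh exits with the relayed code):
#   bash tools/debugger/client.sh run "pytest -x"; describe_relay_exit "$?"
```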
18 changes: 16 additions & 2 deletions tools/debugger/README.md
@@ -52,6 +52,9 @@ bash tools/debugger/client.sh run "bash llm_ptq/scripts/huggingface_example.sh"
# Run with a long timeout (default is 600s)
bash tools/debugger/client.sh --timeout 1800 run "python my_long_test.py"

# Cancel a running command
bash tools/debugger/client.sh cancel

# Check status
bash tools/debugger/client.sh status
```
@@ -65,6 +68,8 @@ The relay uses a directory at `tools/debugger/.relay/` with this structure:
├── server.ready # Written by server on startup
├── client.ready # Written by client during handshake
├── handshake.done # Written by server to confirm handshake
├── running # Written by server while a command is executing (cmd_id:pid)
├── cancel # Written by client to request cancellation of the running command
├── cmd/ # Client writes command .sh files here
│ └── <id>.sh # Command to execute
└── result/ # Server writes results here
@@ -82,9 +87,16 @@ The relay uses a directory at `tools/debugger/.relay/` with this structure:
### Command Execution

1. Client writes a command to `.relay/cmd/<timestamp>.sh`
- 2. Server detects the file, runs `bash <file>` in the workdir, captures output
+ 2. Server detects the file, runs `bash <file>` in the workdir in background, writes `.relay/running`
3. Server writes `.relay/result/<timestamp>.log` and `.relay/result/<timestamp>.exit`
- 4. Server removes the `.sh` file; client reads results and cleans up
+ 4. Server removes the `.sh` file and `.relay/running`; client reads results and cleans up

### Cancellation

1. Client writes `.relay/cancel`
2. Server detects the cancel signal, kills the running command process tree
3. Server writes exit code 130 and removes `.relay/running` and `.relay/cancel`
4. Client-side timeout also triggers cancellation automatically
Comment on lines 89 to +99

⚠️ Potential issue | 🟡 Minor

Protocol section still describes the pre-buffered execution flow.

tools/debugger/server.sh no longer runs bash <file> and deletes the .sh file at the end; it reads the file, removes it immediately, and executes bash -c "$cmd_content". tools/debugger/client.sh also writes the target cmd_id into .relay/cancel. This section should be updated so the documented on-disk state transitions match what users will actually see.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/README.md` around lines 89 - 99, The Protocol docs describe an
outdated pre-buffered flow; update the README protocol to match the current
behavior in tools/debugger/server.sh and tools/debugger/client.sh by documenting
that the server reads the command file contents, immediately removes the
`.relay/cmd/<timestamp>.sh` file, then executes the command via `bash -c
"$cmd_content"` (rather than running `bash <file>` and deleting at end), and
that the client writes the target `cmd_id` into `.relay/cancel` for
cancellation; also adjust the described on-disk transitions for `.relay/cmd`,
`.relay/running`, `.relay/result/*`, and `.relay/cancel` to reflect these exact
steps and exit code 130 behavior.
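
The on-disk transitions this comment asks the README to document can be sketched in a throwaway directory. This is a simplified single-process simulation, not the server code: it omits backgrounding, the `.relay/running` marker, and cancellation, and the `demo_1` id and temporary directory are placeholders.

```shell
# Simulates one relay round trip; nothing here touches a real .relay/ directory.
relay="$(mktemp -d)"
trap 'rm -rf "$relay"' EXIT
mkdir -p "$relay/cmd" "$relay/result"
cmd_id="demo_1"   # placeholder id

# Client: publish the command atomically (tmp + mv).
echo 'echo hello; exit 3' > "$relay/cmd/$cmd_id.sh.tmp"
mv "$relay/cmd/$cmd_id.sh.tmp" "$relay/cmd/$cmd_id.sh"

# Server: read the content, remove the file at once, then run from memory.
cmd_content="$(cat "$relay/cmd/$cmd_id.sh")"
rm -f "$relay/cmd/$cmd_id.sh"
exit_code=0
bash -c "$cmd_content" > "$relay/result/$cmd_id.log" 2>&1 || exit_code=$?

# Server: write the exit code atomically; its appearance signals completion.
echo "$exit_code" > "$relay/result/$cmd_id.exit.tmp"
mv "$relay/result/$cmd_id.exit.tmp" "$relay/result/$cmd_id.exit"

# Client: read the results.
cat "$relay/result/$cmd_id.log"
echo "exit=$(cat "$relay/result/$cmd_id.exit")"
```

Because the `.sh` file is gone before execution starts, a client that times out and deletes its command file can no longer race with the server reading it.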


## Options

@@ -107,3 +119,5 @@ The relay uses a directory at `tools/debugger/.relay/` with this structure:
- The `.relay/` directory is in `.gitignore` — it is not checked in.
- Only one server should run at a time (startup clears the relay directory).
- Commands run sequentially in the order the server discovers them.
- A running command can be cancelled via `client.sh cancel`. Cancelled commands exit with code 130.
- Client-side timeouts automatically cancel the running command on the server.
67 changes: 66 additions & 1 deletion tools/debugger/client.sh
@@ -21,6 +21,7 @@
# Usage:
# bash client.sh handshake - Connect to server
# bash client.sh run <command...> - Run a command and print output
# bash client.sh cancel - Cancel the running command
# bash client.sh status - Check server status
#
# Options:
@@ -91,6 +92,8 @@ case "$SUBCOMMAND" in
# Generate a unique command ID (timestamp + PID to avoid collisions)
cmd_id="$(date +%s%N)_$$"

echo "[client] Running: $*"

⚠️ Potential issue | 🟠 Major

Avoid logging raw command text by default.

Line 94 prints the full command, which can expose inline credentials/tokens in shell/CI logs. Log only cmd_id by default and gate full command logging behind an explicit debug flag.

🔧 Proposed change
-        echo "[client] Running: $*"
+        if [[ "${DEBUGGER_LOG_COMMANDS:-0}" == "1" ]]; then
+            echo "[client] Running command id=$cmd_id: $*"
+        else
+            echo "[client] Running command id=$cmd_id"
+        fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/client.sh` at line 94, The current echo "[client] Running: $*"
logs the entire command (which may contain secrets); change it to log only the
command identifier (cmd_id) by default and gate full command text behind an
explicit debug flag (e.g., DEBUG or DEBUG_CLIENT). Update the client.sh location
where echo "[client] Running: $*" appears to: log "[client] Running
cmd_id=<cmd_id>" by default, and add a guarded conditional that, when DEBUG (or
DEBUG_CLIENT) is set (true/1), emits the full command string (the previous "$*")
to the logs; ensure any added flag checks are robust for both "1"/"true" values
and reference the existing cmd_id variable name so reviewers can find the change
easily.
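
One way to shape the gated logging described above is a small helper that accepts both `1` and `true`. This is a sketch only: `log_running` is a hypothetical name, and `DEBUGGER_LOG_COMMANDS` is the flag name taken from the proposed change in this comment, not an existing option of `client.sh`.

```shell
# Hypothetical helper: always log the id, log the command text only on opt-in.
log_running() {
    local id="$1"
    shift
    case "${DEBUGGER_LOG_COMMANDS:-0}" in
        1|true) echo "[client] Running command id=$id: $*" ;;
        *)      echo "[client] Running command id=$id" ;;
    esac
}

# Default: the command text (and any inline token) stays out of the log.
log_running 1700000000_42 curl -H "Authorization: Bearer SECRET" https://example.invalid

# Opt-in, scoped to a subshell so the flag does not leak into later calls.
(DEBUGGER_LOG_COMMANDS=true; log_running 1700000000_42 nvidia-smi)
```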


# Write the command file atomically (tmp + mv)
echo "$*" > "$CMD_DIR/$cmd_id.sh.tmp"
mv "$CMD_DIR/$cmd_id.sh.tmp" "$CMD_DIR/$cmd_id.sh"
@@ -108,14 +111,32 @@ case "$SUBCOMMAND" in
elapsed=$((elapsed + POLL_INTERVAL))
if [[ $elapsed -ge $TIMEOUT ]]; then
echo "ERROR: Command timed out after ${TIMEOUT}s."
# Clean up the pending command
# Cancel the running command only if it is OUR command
if [[ -f "$RELAY_DIR/running" ]]; then
running_info=$(cat "$RELAY_DIR/running" 2>/dev/null) || true
running_id="${running_info%%:*}"
if [[ "$running_id" == "$cmd_id" ]]; then
echo "Sending cancel signal..."
echo "$cmd_id" > "$RELAY_DIR/cancel"
Comment on lines +119 to +120

⚠️ Potential issue | 🟠 Major

Write .relay/cancel atomically.

tools/debugger/server.sh removes empty or mismatched cancel markers, but both of these sites create .relay/cancel with plain >. A poll that lands between truncate and write can read an empty file and drop a real cancel request. Use a per-call temp file plus mv, like the server already does for .relay/running and .exit.

🔧 Proposed change
-                        echo "$cmd_id" > "$RELAY_DIR/cancel"
+                        tmp_cancel="$RELAY_DIR/cancel.$$"
+                        echo "$cmd_id" > "$tmp_cancel"
+                        mv "$tmp_cancel" "$RELAY_DIR/cancel"
...
-            echo "$running_id" > "$RELAY_DIR/cancel"
+            tmp_cancel="$RELAY_DIR/cancel.$$"
+            echo "$running_id" > "$tmp_cancel"
+            mv "$tmp_cancel" "$RELAY_DIR/cancel"

Also applies to: 199-200

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/client.sh` around lines 119 - 120, Replace the non-atomic
writes to the cancel marker in tools/debugger/client.sh (the block that echoes
"Sending cancel signal..." and writes "$cmd_id" > "$RELAY_DIR/cancel") with an
atomic write: write the payload to a per-call temp file in $RELAY_DIR (e.g.,
"$RELAY_DIR/.cancel.$$" or similar), fsync if available, then atomically mv the
temp file to "$RELAY_DIR/cancel"; apply the same change to the other occurrence
around lines 199-200 to mirror how .relay/running and .exit are handled by the
server.
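
The temp-file-plus-rename pattern requested here can be factored into one helper so both call sites share it. The sketch below is illustrative: `write_atomic` is a hypothetical name, the id is a placeholder, and the atomicity guarantee assumes the temp file and the target live on the same filesystem (behaviour on network or bind mounts may differ).

```shell
# Hypothetical helper: write the payload to a per-process temp file next to the
# target, then rename it into place so a reader never sees a truncated file.
write_atomic() {
    local target="$1" content="$2" tmp
    tmp="$target.tmp.$$"
    printf '%s\n' "$content" > "$tmp"
    mv "$tmp" "$target"
}

relay="$(mktemp -d)"                 # stand-in for tools/debugger/.relay
trap 'rm -rf "$relay"' EXIT
write_atomic "$relay/cancel" "1700000000_4242"
cat "$relay/cancel"
```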

for _ in $(seq 1 10); do
[[ -f "$RELAY_DIR/running" ]] || break
sleep 1
done
Comment on lines +121 to +124

⚠️ Potential issue | 🟠 Major

Do not wait for the relay to go idle after cancelling one command.

Both loops only wait for .relay/running to disappear. If the cancelled command finishes and the server immediately starts the next queued command, the marker can be recreated before the next poll and these waits will falsely time out even though the target cmd_id is already gone. Break once .relay/running is missing or its cmd_id no longer matches the cancelled one.

🔧 Suggested shape
wait_for_running_id_to_clear() {
    local target_id="$1" timeout="$2" elapsed=0 current_info current_id
    while [[ $elapsed -lt $timeout ]]; do
        if [[ ! -f "$RELAY_DIR/running" ]]; then
            return 0
        fi
        current_info=$(cat "$RELAY_DIR/running" 2>/dev/null || true)
        current_id="${current_info%%:*}"
        [[ "$current_id" != "$target_id" ]] && return 0
        sleep "$POLL_INTERVAL"
        elapsed=$((elapsed + POLL_INTERVAL))
    done
    return 1
}

Apply the same check in both the timeout-cancel path and the cancel subcommand.

Also applies to: 204-211

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/client.sh` around lines 121 - 124, The loop that waits for
"$RELAY_DIR/running" to disappear can falsely time out if the cancelled command
finishes and the server immediately starts the next command; update both wait
loops (used in the timeout-cancel path and the cancel subcommand) to break not
only when the running file is missing but also when its recorded cmd_id differs
from the cancelled target_id. Implement a helper like
wait_for_running_id_to_clear(target_id, timeout) that reads
"$RELAY_DIR/running", extracts the current_id (prefix before ":"), returns
success if the file is gone or current_id != target_id, otherwise sleeps
POLL_INTERVAL and retries until timeout; replace the existing seq/sleep loops
with calls to this helper in both places.
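
To make the three cases concrete, the suggested helper can be exercised against a temporary directory. The helper body restates the suggestion above so the example runs on its own; `report`, the ids, and the one-second timeout are placeholders chosen for the demonstration.

```shell
RELAY_DIR="$(mktemp -d)"             # stand-in for tools/debugger/.relay
trap 'rm -rf "$RELAY_DIR"' EXIT
POLL_INTERVAL=1

# Returns 0 once the marker is gone or names a different command, 1 on timeout.
wait_for_running_id_to_clear() {
    local target_id="$1" timeout="$2" elapsed=0 current_info current_id
    while [ "$elapsed" -lt "$timeout" ]; do
        [ -f "$RELAY_DIR/running" ] || return 0
        current_info="$(cat "$RELAY_DIR/running" 2>/dev/null || true)"
        current_id="${current_info%%:*}"
        [ "$current_id" = "$target_id" ] || return 0
        sleep "$POLL_INTERVAL"
        elapsed=$((elapsed + POLL_INTERVAL))
    done
    return 1
}

# Placeholder wrapper that prints the outcome for a target id.
report() {
    if wait_for_running_id_to_clear "$1" 1; then
        echo "cleared"
    else
        echo "still running"
    fi
}

report cmd_A                                  # no marker at all
echo "cmd_B:4321" > "$RELAY_DIR/running"
report cmd_A                                  # server already moved on to cmd_B
report cmd_B                                  # cmd_B is the one still running
```

The second call is the case the original loops get wrong: the marker exists, but it belongs to the next queued command, so the cancelled command has in fact cleared.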

fi
fi
# Clean up command and any orphaned result files
rm -f "$CMD_DIR/$cmd_id.sh"
rm -f "$RESULT_DIR/$cmd_id.exit" "$RESULT_DIR/$cmd_id.log"
exit 1
fi
done

# Read and display results
exit_code=$(cat "$RESULT_DIR/$cmd_id.exit")
if ! [[ "$exit_code" =~ ^[0-9]+$ ]]; then
echo "WARNING: Invalid exit code '$exit_code', defaulting to 1."
exit_code=1
fi
if [[ -f "$RESULT_DIR/$cmd_id.log" ]]; then
cat "$RESULT_DIR/$cmd_id.log"
fi
@@ -139,6 +160,11 @@ case "$SUBCOMMAND" in
else
echo "Handshake: not started"
fi
if [[ -f "$RELAY_DIR/running" ]]; then
echo "Running: $(cat "$RELAY_DIR/running")"
else
echo "Running: (idle)"
fi
if [[ -d "$CMD_DIR" ]]; then
pending=$(find "$CMD_DIR" -maxdepth 1 -type f -name '*.sh' 2>/dev/null | wc -l)
else
@@ -148,6 +174,10 @@ case "$SUBCOMMAND" in
;;

flush)
if [[ -f "$RELAY_DIR/running" ]]; then
echo "ERROR: A command is currently running. Cancel it first or wait for it to finish."
exit 1
fi
if [[ -d "$RELAY_DIR" ]]; then
# Clear handshake and command/result files, but keep server.ready
rm -f "$RELAY_DIR/client.ready" "$RELAY_DIR/handshake.done"
@@ -159,12 +189,47 @@ case "$SUBCOMMAND" in
fi
;;

cancel)
# Check if there's a running command
if [[ -f "$RELAY_DIR/running" ]]; then
running_info=$(cat "$RELAY_DIR/running" 2>/dev/null) || true
running_id="${running_info%%:*}"
echo "Cancelling running command: $running_id"

# Write cancel signal with cmd_id so server can verify the target
echo "$running_id" > "$RELAY_DIR/cancel"

# Wait for the server to process the cancellation
elapsed=0
while [[ -f "$RELAY_DIR/running" ]]; do
sleep "$POLL_INTERVAL"
elapsed=$((elapsed + POLL_INTERVAL))
if [[ $elapsed -ge 30 ]]; then
echo "WARNING: Cancel signal sent but command still running after 30s."
exit 1
fi
done
echo "Command cancelled."
else
echo "No command is currently running."
fi

# Report pending commands
if [[ -d "$CMD_DIR" ]]; then
pending=$(find "$CMD_DIR" -maxdepth 1 -type f -name '*.sh' 2>/dev/null | wc -l)
if [[ "$pending" -gt 0 ]]; then
echo "$pending pending command(s) in queue. Use 'flush' to clear them."
fi
fi
;;

*)
echo "Usage: $0 [--relay-dir <path>] [--timeout <secs>] <subcommand>"
echo ""
echo "Subcommands:"
echo " handshake Connect to the server"
echo " run <cmd> Execute a command on the server"
echo " cancel Cancel the currently running command"
echo " status Check connection status"
echo " flush Clear the relay directory"
exit 1
125 changes: 108 additions & 17 deletions tools/debugger/server.sh
@@ -63,10 +63,19 @@ RESULT_DIR="$RELAY_DIR/result"

cleanup() {
echo "[server] Shutting down..."
# Kill any running command (guard all reads with || true to prevent set -e
# from aborting the trap and leaving stale marker files)
running_pid=$(cut -d: -f2 "$RELAY_DIR/running" 2>/dev/null) || true
if [[ -n "$running_pid" ]]; then
pkill -P "$running_pid" 2>/dev/null || true
kill "$running_pid" 2>/dev/null || true
fi
# Kill any child processes in our process group
pkill -P $$ 2>/dev/null || true
rm -f "$RELAY_DIR/server.ready"
rm -f "$RELAY_DIR/handshake.done"
rm -f "$RELAY_DIR/running"
rm -f "$RELAY_DIR/cancel"
exit 0
}
trap cleanup SIGINT SIGTERM
@@ -87,17 +96,28 @@ fi
rm -rf "$RELAY_DIR"
mkdir -p "$CMD_DIR" "$RESULT_DIR"

- # Install modelopt in editable mode (skip if already editable-installed from WORKDIR)
- if python -c "
- import modelopt, os
- assert os.path.realpath(modelopt.__path__[0]).startswith(os.path.realpath('$WORKDIR'))
- " 2>/dev/null; then
# Ensure modelopt is editable-installed from WORKDIR
check_modelopt_local() {
python -c "
import modelopt, os, sys
actual = os.path.realpath(modelopt.__path__[0])
expected = os.path.realpath('$WORKDIR')
if not actual.startswith(expected):
print(f'modelopt loaded from {actual}, expected under {expected}', file=sys.stderr)
sys.exit(1)
" 2>&1
Comment on lines +105 to +108

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the prefix false-positive behavior vs boundary-safe check.
python - <<'PY'
import os
actual = "/workspace/repo2/modelopt"
expected = "/workspace/repo"
print("startswith:", actual.startswith(expected))  # incorrect acceptance
print("commonpath:", os.path.commonpath([actual, expected]) == expected)  # correct rejection
PY

Repository: NVIDIA/Model-Optimizer

Length of output: 100


🏁 Script executed:

# Examine the actual file to verify the code at lines 90-100 and 144-145
head -150 tools/debugger/server.sh | tail -70

Repository: NVIDIA/Model-Optimizer

Length of output: 2591


Use boundary-safe path validation and avoid logging sensitive command content.

Line 96 uses actual.startswith(expected), which can incorrectly accept sibling paths (e.g., /workspace/repo2 vs /workspace/repo). Use os.path.commonpath for a true directory-boundary check.

Additionally, line 144 logs the full command content without filtering:

cmd_content=$(cat "$cmd_file")
echo "[server] Executing command $cmd_id: $cmd_content"

This exposes any secrets, API keys, or credentials present in command files to server logs, violating the guideline that sensitive information must not be hardcoded or logged.

🔧 Proposed changes

For path validation (lines 96-99):

 check_modelopt_local() {
-    python -c "
+    WORKDIR_ENV="$WORKDIR" python -c "
 import modelopt, os, sys
 actual = os.path.realpath(modelopt.__path__[0])
-expected = os.path.realpath('$WORKDIR')
-if not actual.startswith(expected):
+expected = os.path.realpath(os.environ['WORKDIR_ENV'])
+if os.path.commonpath([actual, expected]) != expected:
     print(f'modelopt loaded from {actual}, expected under {expected}', file=sys.stderr)
     sys.exit(1)
 " 2>&1
 }

For command logging (lines 144-145), avoid logging the full command content:

-echo "[server] Executing command $cmd_id: $cmd_content"
+echo "[server] Executing command $cmd_id"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/server.sh` around lines 96 - 99, The path check using
actual.startswith(expected) is boundary-unsafe and should be replaced with a
directory-boundary check using a canonical commonpath approach (e.g., call
Python's os.path.commonpath on real paths) to ensure actual is inside expected;
locate the check referencing actual.startswith(expected) and change it to
compute canonical paths (realpath) and compare os.path.commonpath([expected,
actual]) == expected. Also stop echoing full command contents from cmd_file into
logs: find the variables cmd_file and cmd_content and replace the logging line
that prints "[server] Executing command $cmd_id: $cmd_content" with a safer log
that only includes the cmd_id and a non-sensitive summary (e.g., truncated
length or a redacted placeholder), or validate/sanitize cmd_content before
logging so secrets/API keys are never emitted.
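
The sibling-path problem is easy to reproduce outside Python. The sketch below is a shell analogue of a boundary-safe check, not the `os.path.commonpath` fix proposed above: it compares against the root followed by a slash instead of using a bare prefix match. `is_under_dir` is a hypothetical name, and both arguments are assumed to be absolute, already-resolved paths.

```shell
# Hypothetical helper: succeeds when $1 equals $2 or lies beneath it.
# Quoted patterns are matched literally, so glob characters in paths are safe.
is_under_dir() {
    local path="${1%/}" root="${2%/}"
    case "$path" in
        "$root"|"$root"/*) return 0 ;;
        *) return 1 ;;
    esac
}

for p in /workspace/repo/modelopt /workspace/repo2/modelopt /workspace/repo; do
    if is_under_dir "$p" /workspace/repo; then
        echo "$p: inside"
    else
        echo "$p: outside"
    fi
done
```

A plain prefix test would report `/workspace/repo2/modelopt` as inside `/workspace/repo`; requiring the separator after the root rejects it.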

}

if check_modelopt_local >/dev/null 2>&1; then
echo "[server] modelopt already editable-installed from $WORKDIR, skipping pip install."
else
echo "[server] Installing modelopt (pip install -e .[dev]) ..."
- (cd "$WORKDIR" && pip install -e ".[dev]") || {
- echo "[server] WARNING: pip install failed (exit=$?), continuing anyway."
- }
+ (cd "$WORKDIR" && pip install -e ".[dev]")
if ! check_modelopt_local; then
Comment on lines +99 to +116

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Relevant server.sh lines:"
rg -n -C2 'export PYTHONPATH|check_modelopt_local|pip install -e' tools/debugger/server.sh

echo
repo_root="$(pwd)"
echo "Import resolution with PYTHONPATH set to the repo root:"
PYTHONPATH="$repo_root" python - <<'PY'
import importlib.util, os
spec = importlib.util.find_spec("modelopt")
print("PYTHONPATH =", os.environ["PYTHONPATH"])
print("spec is None:", spec is None)
if spec and spec.submodule_search_locations:
    print("search locations =", list(spec.submodule_search_locations))
PY

Repository: NVIDIA/Model-Optimizer

Length of output: 1054


check_modelopt_local() no longer validates editable install due to PYTHONPATH pollution.

Line 84 exports PYTHONPATH="$WORKDIR" before check_modelopt_local() runs. PYTHONPATH entries are placed on sys.path ahead of site-packages, so import modelopt resolves directly from the source tree without any editable-install metadata. The "already editable-installed" branch can therefore run even when pip install -e ".[dev]" has never been executed, and the server skips installing the dev dependencies it expects.

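The subprocess in check_modelopt_local() must isolate itself from the parent's PYTHONPATH. Run the check with the -I flag (isolated mode, which ignores PYTHON* environment variables and keeps the current directory off sys.path) so that it validates the actual installed state; clearing PYTHONPATH as well is redundant but harmless: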

Suggested fix
 check_modelopt_local() {
-    python -c "
+    WORKDIR_ENV="$WORKDIR" PYTHONPATH="" python -I -c "
 import modelopt, os, sys
 actual = os.path.realpath(modelopt.__path__[0])
-expected = os.path.realpath('$WORKDIR')
+expected = os.path.realpath(os.environ['WORKDIR_ENV'])
 if not actual.startswith(expected):
     print(f'modelopt loaded from {actual}, expected under {expected}', file=sys.stderr)
     sys.exit(1)
 " 2>&1
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Ensure modelopt is editable-installed from WORKDIR
check_modelopt_local() {
python -c "
import modelopt, os, sys
actual = os.path.realpath(modelopt.__path__[0])
expected = os.path.realpath('$WORKDIR')
if not actual.startswith(expected):
print(f'modelopt loaded from {actual}, expected under {expected}', file=sys.stderr)
sys.exit(1)
" 2>&1
}
if check_modelopt_local >/dev/null 2>&1; then
echo "[server] modelopt already editable-installed from $WORKDIR, skipping pip install."
else
echo "[server] Installing modelopt (pip install -e .[dev]) ..."
(cd "$WORKDIR" && pip install -e ".[dev]") || {
echo "[server] WARNING: pip install failed (exit=$?), continuing anyway."
}
(cd "$WORKDIR" && pip install -e ".[dev]")
if ! check_modelopt_local; then
# Ensure modelopt is editable-installed from WORKDIR
check_modelopt_local() {
WORKDIR_ENV="$WORKDIR" PYTHONPATH="" python -I -c "
import modelopt, os, sys
actual = os.path.realpath(modelopt.__path__[0])
expected = os.path.realpath(os.environ['WORKDIR_ENV'])
if not actual.startswith(expected):
print(f'modelopt loaded from {actual}, expected under {expected}', file=sys.stderr)
sys.exit(1)
" 2>&1
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/server.sh` around lines 99 - 116, The check_modelopt_local()
helper is currently fooled by the parent's PYTHONPATH; modify the subprocess
invocation used inside check_modelopt_local to clear PYTHONPATH and run Python
in isolated mode so import resolution ignores the working tree. Concretely, when
launching the inline python check (the python -c "... import modelopt ..." block
referenced in check_modelopt_local), set the environment variable PYTHONPATH=""
for that subprocess and invoke python with the -I flag so the check validates
the actually installed package state rather than the source tree at $WORKDIR.

echo "[server] ERROR: modelopt is not running from the local folder ($WORKDIR)."
echo "[server] Try: pip install -e '.[dev]' inside the container, then restart the server."
exit 1
fi
echo "[server] Install done."
fi

@@ -129,19 +149,90 @@ while true; do
fi

for cmd_file in "$CMD_DIR"/*.sh; do
- cmd_id="$(basename "$cmd_file" .sh)"
- echo "[server] Executing command $cmd_id..."
-
- # Execute the command, tee stdout+stderr to console and result file
- (cd "$WORKDIR" && bash "$cmd_file" 2>&1) | tee "$RESULT_DIR/$cmd_id.log" || true
- exit_code=${PIPESTATUS[0]}
+ # Guard against command files deleted by the client between glob expansion
+ # and processing (e.g., client timeout on a queued command)
+ [[ -f "$cmd_file" ]] || continue

- # Atomic write of exit code (signal to client that result is ready)
cmd_id="$(basename "$cmd_file" .sh)"
# Tolerate file disappearing between guard and read (TOCTOU with client timeout)
cmd_content=$(cat "$cmd_file" 2>/dev/null) || continue
# Remove command file immediately after reading to prevent re-execution
# and to avoid TOCTOU with client timeout deleting it during execution
rm -f "$cmd_file"
echo "[server] Executing command $cmd_id: $cmd_content"

# Clear any stale cancel file from a previous timed-out client
rm -f "$RELAY_DIR/cancel"

# Create log file and stream output to server console via tail
: > "$RESULT_DIR/$cmd_id.log"
tail -f "$RESULT_DIR/$cmd_id.log" &
tail_pid=$!

# Run from cmd_content (not the file) since we already removed it
(cd "$WORKDIR" && bash -c "$cmd_content") >> "$RESULT_DIR/$cmd_id.log" 2>&1 &
cmd_pid=$!

# Track the running command (ID and PID) — atomic write to prevent partial reads
echo "$cmd_id:$cmd_pid" > "$RELAY_DIR/running.tmp"
mv "$RELAY_DIR/running.tmp" "$RELAY_DIR/running"

# Wait for completion or cancellation
cancelled=""
while kill -0 "$cmd_pid" 2>/dev/null; do
if [[ -f "$RELAY_DIR/cancel" ]]; then
# Verify cancel targets this command (reject empty or mismatched signals)
cancel_target=$(cat "$RELAY_DIR/cancel" 2>/dev/null) || true
if [[ "$cancel_target" != "$cmd_id" ]]; then
rm -f "$RELAY_DIR/cancel"
sleep "$POLL_INTERVAL"
continue
fi
echo "[server] Cancelling command $cmd_id (PID $cmd_pid)..."
# Send SIGTERM to children first, then parent
pkill -P "$cmd_pid" 2>/dev/null || true
kill "$cmd_pid" 2>/dev/null || true
# Wait up to 5s for graceful exit, then escalate to SIGKILL
for _ in $(seq 1 5); do
kill -0 "$cmd_pid" 2>/dev/null || break
sleep 1
done
if kill -0 "$cmd_pid" 2>/dev/null; then
echo "[server] Process $cmd_pid did not exit, sending SIGKILL..."
pkill -9 -P "$cmd_pid" 2>/dev/null || true
kill -9 "$cmd_pid" 2>/dev/null || true
fi
wait "$cmd_pid" 2>/dev/null || true
cancelled="true"
rm -f "$RELAY_DIR/cancel"
echo "[cancelled]" >> "$RESULT_DIR/$cmd_id.log"
echo "[server] Command $cmd_id cancelled."
break
fi
sleep "$POLL_INTERVAL"
done

# Determine exit code (|| exit_code=$? prevents set -e from killing the
# server when the command exits non-zero)
if [[ -n "$cancelled" ]]; then
exit_code=130
else
exit_code=0
wait "$cmd_pid" 2>/dev/null || exit_code=$?
fi

# Stop console streaming
kill "$tail_pid" 2>/dev/null || true
wait "$tail_pid" 2>/dev/null || true

# Write exit code BEFORE removing the running marker, so any observer
# that sees running disappear can immediately find the result
echo "$exit_code" > "$RESULT_DIR/$cmd_id.exit.tmp"
mv "$RESULT_DIR/$cmd_id.exit.tmp" "$RESULT_DIR/$cmd_id.exit"

- # Remove the command file to mark it as processed
- rm -f "$cmd_file"
# Now safe to remove markers
rm -f "$RELAY_DIR/running"
rm -f "$RELAY_DIR/cancel"

echo "[server] Command $cmd_id finished (exit=$exit_code)"
done
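
The cancellation branch in the loop above follows a common escalation pattern: send SIGTERM, allow a grace period, then send SIGKILL. A minimal standalone sketch of that pattern is shown below. `terminate_with_grace` is a hypothetical name, and unlike `server.sh` this version signals only the given PID and does not use `pkill -P` to reach child processes.

```shell
# Hypothetical helper: SIGTERM first, SIGKILL if the process is still alive
# after the grace period (in seconds, default 5).
terminate_with_grace() {
    local pid="$1" grace="${2:-5}" waited=0
    kill "$pid" 2>/dev/null || return 0          # nothing to signal
    while [ "$waited" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
        sleep 1
        waited=$((waited + 1))
    done
    kill -9 "$pid" 2>/dev/null || true           # grace period elapsed
}

sleep 60 &
victim=$!
terminate_with_grace "$victim" 2

# Collect the exit status without letting a non-zero value abort the script.
status=0
wait "$victim" 2>/dev/null || status=$?
echo "victim exit status: $status"
```

The status reported here comes from the signal that ended the process (128 plus the signal number), whereas the server deliberately overrides it and records 130 for any cancelled command.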