Skip to content

Add job cancellation support to the debugger command relay#1262

Open
cjluo-nv wants to merge 2 commits intomainfrom
feat/debug-skill
Open

Add job cancellation support to the debugger command relay#1262
cjluo-nv wants to merge 2 commits intomainfrom
feat/debug-skill

Conversation

@cjluo-nv
Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv commented Apr 14, 2026

Summary

  • Add a cancel subcommand to the client that terminates the currently running command on the server
  • Server now runs commands in the background with PID tracking, enabling cancellation mid-execution
  • Client-side timeouts automatically cancel the running command on the server (previously the server process was left running)
  • Hardened against race conditions through 4 rounds of adversarial review (15 fixes total)

Key changes

server.sh:

  • Commands run in background with PID tracked in $RELAY_DIR/running (atomic tmp+mv write)
  • Cancel detection loop checks for $RELAY_DIR/cancel file with cmd_id verification
  • SIGTERM with 5s grace period, then SIGKILL escalation for stuck processes
  • .exit file written before running marker removed (ordering guarantee)
  • set -e-safe: wait uses || exit_code=$? pattern; cleanup trap fully guarded
  • Stale cancel files cleared at command start; mismatched/empty signals rejected
  • Command file read into memory and removed before execution (eliminates TOCTOU with client timeout)

client.sh:

  • New cancel subcommand: writes target cmd_id to cancel file, waits for server acknowledgment (30s timeout)
  • run timeout now sends targeted cancel signal (verifies cmd_id match to avoid killing wrong command)
  • run timeout cleans up orphaned result files
  • status shows currently running command
  • flush rejects if a command is currently running (prevents state corruption)
  • Exit code validated as numeric before use

Protocol additions

.relay/
├── running    # server writes cmd_id:pid while executing (atomic)
├── cancel     # client writes target cmd_id to request cancellation

Test plan

  • Start server in Docker, handshake from host
  • Run a command (client.sh run "sleep 30"), cancel it (client.sh cancel), verify exit code 130
  • Run a command with short timeout (--timeout 5 run "sleep 30"), verify auto-cancel
  • Run a command that exits non-zero, verify server stays alive
  • Run status during execution, verify it shows the running command
  • Attempt flush during execution, verify it is rejected
  • Cancel when nothing is running, verify clean message

🤖 Generated with Claude Code

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A (bash scripts for dev tooling, tested manually)
  • Did you update Changelog?: N/A (internal tooling)

Summary by CodeRabbit

  • Documentation

    • Added a comprehensive debug skill guide and protocol reference, including quick CLI examples and a new section on cancelling commands.
  • New Features

    • New client-side cancel command to request termination of the currently running remote command.
    • Status now reports active command or idle state.
  • Improvements

    • Clearer validation and startup guidance.
    • Reliable cancellation handling with distinct exit semantics and safer cleanup during shutdown and relay flush.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 14, 2026

📝 Walkthrough

Walkthrough

Adds a debug skill doc and extends the file-based debugger with client-side command logging and a cancel subcommand, plus server-side command execution rework, cancellation handling, atomic markers, and stricter modelopt validation and install failure behavior.

Changes

Cohort / File(s) Summary
Debug Skill Documentation
.claude/skills/debug/SKILL.md
New skill doc describing host↔container file-relay debug workflow, prerequisites, and CLI quick-reference commands.
Client: control & UX
tools/debugger/client.sh
Added cancel subcommand; run now logs the command, removes pending artifacts on timeout, conditionally issues cancel requests when matching cmd_id, and validates numeric exit codes. status and flush behaviors tightened; help updated.
Server: execution & cancellation
tools/debugger/server.sh
Introduced check_modelopt_local() and enforced pip install + validation; TOCTOU-safe command file read/remove, atomic running marker (cmd_id:cmd_pid), background execution with log tailing, cancellation polling/matching, graceful termination (TERM→KILL), [cancelled] log marker, deterministic exit codes (130 on cancel), and expanded shutdown cleanup of relay markers and running process tree.
Protocol docs
tools/debugger/CLAUDE.md, tools/debugger/README.md
Documented cancelling commands, cancel exit code 130, client-side timeout auto-cancels, and updated relay protocol to include .relay/running and .relay/cancel semantics.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Host Client
    participant FS as Relay Dir (shared FS)
    participant Server as Container Server
    participant Cmd as Command Process

    Client->>FS: write .relay/command (cmd_id, cmd_content)
    Client->>FS: optionally write handshake/status files
    FS->>Server: server polls for command file
    Server->>FS: atomically read+remove command file -> write .relay/running (cmd_id:pid)
    Server->>Cmd: start bash -c "cmd_content" (background), log -> logfile
    Server->>FS: tail logfile -> stream output to Client (via files)
    Client->>FS: (optional) write .relay/cancel with target cmd_id
    FS->>Server: server detects .relay/cancel, reads target
    alt target matches running cmd_id
        Server->>Cmd: send SIGTERM -> wait -> send SIGKILL if needed
        Server->>logfile: append "[cancelled]"
        Server->>FS: write cmd_id.exit (130), remove .relay/running and .relay/cancel
    else target mismatch
        Server->>FS: clear .relay/cancel and continue
    end
    Server->>FS: on normal completion write cmd_id.exit (exit_code), remove .relay/running
    Client->>FS: read cmd_id.exit and logs
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding job cancellation support to the debugger command relay system, which is substantiated across multiple modified files in the changeset.
Security Anti-Patterns ✅ Passed PR contains only shell scripts and documentation files; no Python code or dependency changes present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/debug-skill

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 14, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1262/

Built to branch gh-pages at 2026-04-14 21:07 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/debugger/client.sh`:
- Line 94: The current echo "[client] Running: $*" logs the entire command
(which may contain secrets); change it to log only the command identifier
(cmd_id) by default and gate full command text behind an explicit debug flag
(e.g., DEBUG or DEBUG_CLIENT). Update the client.sh location where echo
"[client] Running: $*" appears to: log "[client] Running cmd_id=<cmd_id>" by
default, and add a guarded conditional that, when DEBUG (or DEBUG_CLIENT) is set
(true/1), emits the full command string (the previous "$*") to the logs; ensure
any added flag checks are robust for both "1"/"true" values and reference the
existing cmd_id variable name so reviewers can find the change easily.

In `@tools/debugger/server.sh`:
- Around line 144-145: The current code reads the full command into cmd_content
and prints it, which may leak secrets; change the logging so it does not echo
the full payload—stop using cmd_content in the server log and instead log only
the command identifier (cmd_id) or a redacted/hashed form; update the block that
uses cmd_file, cmd_content and the echo "[server] Executing command $cmd_id:
$cmd_content" to either omit cmd_content entirely or replace it with a safe
redaction/hashing of the content before logging while still preserving cmd_id
for traceability.
- Around line 96-99: The path check using actual.startswith(expected) is
boundary-unsafe and should be replaced with a directory-boundary check using a
canonical commonpath approach (e.g., call Python's os.path.commonpath on real
paths) to ensure actual is inside expected; locate the check referencing
actual.startswith(expected) and change it to compute canonical paths (realpath)
and compare os.path.commonpath([expected, actual]) == expected. Also stop
echoing full command contents from cmd_file into logs: find the variables
cmd_file and cmd_content and replace the logging line that prints "[server]
Executing command $cmd_id: $cmd_content" with a safer log that only includes the
cmd_id and a non-sensitive summary (e.g., truncated length or a redacted
placeholder), or validate/sanitize cmd_content before logging so secrets/API
keys are never emitted.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9b5c9d28-466f-46b9-aa50-b44b975a4a2d

📥 Commits

Reviewing files that changed from the base of the PR and between b6c6ec3 and 08a838c.

📒 Files selected for processing (3)
  • .claude/skills/debug/SKILL.md
  • tools/debugger/client.sh
  • tools/debugger/server.sh

# Generate a unique command ID (timestamp + PID to avoid collisions)
cmd_id="$(date +%s%N)_$$"

echo "[client] Running: $*"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid logging raw command text by default.

Line 94 prints the full command, which can expose inline credentials/tokens in shell/CI logs. Log only cmd_id by default and gate full command logging behind an explicit debug flag.

🔧 Proposed change
-        echo "[client] Running: $*"
+        if [[ "${DEBUGGER_LOG_COMMANDS:-0}" == "1" ]]; then
+            echo "[client] Running command id=$cmd_id: $*"
+        else
+            echo "[client] Running command id=$cmd_id"
+        fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/client.sh` at line 94, The current echo "[client] Running: $*"
logs the entire command (which may contain secrets); change it to log only the
command identifier (cmd_id) by default and gate full command text behind an
explicit debug flag (e.g., DEBUG or DEBUG_CLIENT). Update the client.sh location
where echo "[client] Running: $*" appears to: log "[client] Running
cmd_id=<cmd_id>" by default, and add a guarded conditional that, when DEBUG (or
DEBUG_CLIENT) is set (true/1), emits the full command string (the previous "$*")
to the logs; ensure any added flag checks are robust for both "1"/"true" values
and reference the existing cmd_id variable name so reviewers can find the change
easily.

Comment on lines +96 to +99
if not actual.startswith(expected):
print(f'modelopt loaded from {actual}, expected under {expected}', file=sys.stderr)
sys.exit(1)
" 2>&1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the prefix false-positive behavior vs boundary-safe check.
python - <<'PY'
import os
actual = "/workspace/repo2/modelopt"
expected = "/workspace/repo"
print("startswith:", actual.startswith(expected))  # incorrect acceptance
print("commonpath:", os.path.commonpath([actual, expected]) == expected)  # correct rejection
PY

Repository: NVIDIA/Model-Optimizer

Length of output: 100


🏁 Script executed:

# Examine the actual file to verify the code at lines 90-100 and 144-145
head -150 tools/debugger/server.sh | tail -70

Repository: NVIDIA/Model-Optimizer

Length of output: 2591


Use boundary-safe path validation and avoid logging sensitive command content.

Line 96 uses actual.startswith(expected), which can incorrectly accept sibling paths (e.g., /workspace/repo2 vs /workspace/repo). Use os.path.commonpath for a true directory-boundary check.

Additionally, line 144 logs the full command content without filtering:

cmd_content=$(cat "$cmd_file")
echo "[server] Executing command $cmd_id: $cmd_content"

This exposes any secrets, API keys, or credentials present in command files to server logs, violating the guideline that sensitive information must not be hardcoded or logged.

🔧 Proposed changes

For path validation (lines 96-99):

 check_modelopt_local() {
-    python -c "
+    WORKDIR_ENV=\"$WORKDIR\" python -c "
 import modelopt, os, sys
 actual = os.path.realpath(modelopt.__path__[0])
-expected = os.path.realpath('$WORKDIR')
-if not actual.startswith(expected):
+expected = os.path.realpath(os.environ['WORKDIR_ENV'])
+if os.path.commonpath([actual, expected]) != expected:
     print(f'modelopt loaded from {actual}, expected under {expected}', file=sys.stderr)
     sys.exit(1)
 " 2>&1
 }

For command logging (lines 144-145), avoid logging the full command content:

-echo "[server] Executing command $cmd_id: $cmd_content"
+echo "[server] Executing command $cmd_id"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/server.sh` around lines 96 - 99, The path check using
actual.startswith(expected) is boundary-unsafe and should be replaced with a
directory-boundary check using a canonical commonpath approach (e.g., call
Python's os.path.commonpath on real paths) to ensure actual is inside expected;
locate the check referencing actual.startswith(expected) and change it to
compute canonical paths (realpath) and compare os.path.commonpath([expected,
actual]) == expected. Also stop echoing full command contents from cmd_file into
logs: find the variables cmd_file and cmd_content and replace the logging line
that prints "[server] Executing command $cmd_id: $cmd_content" with a safer log
that only includes the cmd_id and a non-sensitive summary (e.g., truncated
length or a redacted placeholder), or validate/sanitize cmd_content before
logging so secrets/API keys are never emitted.

Comment on lines +144 to +145
cmd_content=$(cat "$cmd_file")
echo "[server] Executing command $cmd_id: $cmd_content"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Do not log full command payloads in server output.

Lines 144-145 read and print full command content, which can leak credentials/tokens into persisted logs. Prefer logging only command ID (or a redacted form) by default.

🔧 Proposed change
-        cmd_content=$(cat "$cmd_file")
-        echo "[server] Executing command $cmd_id: $cmd_content"
+        echo "[server] Executing command $cmd_id"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
cmd_content=$(cat "$cmd_file")
echo "[server] Executing command $cmd_id: $cmd_content"
echo "[server] Executing command $cmd_id"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/server.sh` around lines 144 - 145, The current code reads the
full command into cmd_content and prints it, which may leak secrets; change the
logging so it does not echo the full payload—stop using cmd_content in the
server log and instead log only the command identifier (cmd_id) or a
redacted/hashed form; update the block that uses cmd_file, cmd_content and the
echo "[server] Executing command $cmd_id: $cmd_content" to either omit
cmd_content entirely or replace it with a safe redaction/hashing of the content
before logging while still preserving cmd_id for traceability.

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.90%. Comparing base (b6c6ec3) to head (c4e320e).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1262   +/-   ##
=======================================
  Coverage   76.90%   76.90%           
=======================================
  Files         350      350           
  Lines       40524    40524           
=======================================
  Hits        31166    31166           
  Misses       9358     9358           
Flag Coverage Δ
unit 55.59% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv changed the title Improve debugger skills and make it available from the top skills dir Add job cancellation support to the debugger command relay Apr 14, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
tools/debugger/server.sh (1)

156-163: ⚠️ Potential issue | 🟠 Major

Raw command text is still being written to server logs.

Line 162 echoes the full payload, so any inline token/password passed through client.sh run ends up in persisted logs. Please log only cmd_id by default and keep raw payload logging behind an explicit debug flag.

🔧 Proposed change
-        echo "[server] Executing command $cmd_id: $cmd_content"
+        echo "[server] Executing command $cmd_id"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/server.sh` around lines 156 - 163, The server currently logs
the full raw command payload (cmd_content) which can leak secrets; change the
logging in the command execution path to only log the command identifier
(cmd_id) by default and remove or redact cmd_content from the echo. Add
conditional logging behind an explicit debug flag (e.g., CHECK_DEBUG or
DEBUG_VERBOSE) so that raw payloads are only printed when that flag is set (use
a test like if [ -n "$DEBUG_VERBOSE" ] to echo cmd_content), and ensure the
existing echo line referencing cmd_content is replaced accordingly in the block
that reads and removes the command file (cmd_file -> cmd_id/cmd_content).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/debugger/client.sh`:
- Around line 119-120: Replace the non-atomic writes to the cancel marker in
tools/debugger/client.sh (the block that echoes "Sending cancel signal..." and
writes "$cmd_id" > "$RELAY_DIR/cancel") with an atomic write: write the payload
to a per-call temp file in $RELAY_DIR (e.g., "$RELAY_DIR/.cancel.$$" or
similar), fsync if available, then atomically mv the temp file to
"$RELAY_DIR/cancel"; apply the same change to the other occurrence around lines
199-200 to mirror how .relay/running and .exit are handled by the server.
- Around line 121-124: The loop that waits for "$RELAY_DIR/running" to disappear
can falsely time out if the cancelled command finishes and the server
immediately starts the next command; update both wait loops (used in the
timeout-cancel path and the cancel subcommand) to break not only when the
running file is missing but also when its recorded cmd_id differs from the
cancelled target_id. Implement a helper like
wait_for_running_id_to_clear(target_id, timeout) that reads
"$RELAY_DIR/running", extracts the current_id (prefix before ":"), returns
success if the file is gone or current_id != target_id, otherwise sleeps
POLL_INTERVAL and retries until timeout; replace the existing seq/sleep loops
with calls to this helper in both places.

In `@tools/debugger/README.md`:
- Around line 89-99: The Protocol docs describe an outdated pre-buffered flow;
update the README protocol to match the current behavior in
tools/debugger/server.sh and tools/debugger/client.sh by documenting that the
server reads the command file contents, immediately removes the
`.relay/cmd/<timestamp>.sh` file, then executes the command via `bash -c
"$cmd_content"` (rather than running `bash <file>` and deleting at end), and
that the client writes the target `cmd_id` into `.relay/cancel` for
cancellation; also adjust the described on-disk transitions for `.relay/cmd`,
`.relay/running`, `.relay/result/*`, and `.relay/cancel` to reflect these exact
steps and exit code 130 behavior.

In `@tools/debugger/server.sh`:
- Around line 99-116: The check_modelopt_local() helper is currently fooled by
the parent's PYTHONPATH; modify the subprocess invocation used inside
check_modelopt_local to clear PYTHONPATH and run Python in isolated mode so
import resolution ignores the working tree. Concretely, when launching the
inline python check (the python -c "... import modelopt ..." block referenced in
check_modelopt_local), set the environment variable PYTHONPATH="" for that
subprocess and invoke python with the -I flag so the check validates the
actually installed package state rather than the source tree at $WORKDIR.

---

Duplicate comments:
In `@tools/debugger/server.sh`:
- Around line 156-163: The server currently logs the full raw command payload
(cmd_content) which can leak secrets; change the logging in the command
execution path to only log the command identifier (cmd_id) by default and remove
or redact cmd_content from the echo. Add conditional logging behind an explicit
debug flag (e.g., CHECK_DEBUG or DEBUG_VERBOSE) so that raw payloads are only
printed when that flag is set (use a test like if [ -n "$DEBUG_VERBOSE" ] to
echo cmd_content), and ensure the existing echo line referencing cmd_content is
replaced accordingly in the block that reads and removes the command file
(cmd_file -> cmd_id/cmd_content).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 15f5922b-82e3-4c8e-87c9-eefdcbf41ddb

📥 Commits

Reviewing files that changed from the base of the PR and between 08a838c and c4e320e.

📒 Files selected for processing (5)
  • .claude/skills/debug/SKILL.md
  • tools/debugger/CLAUDE.md
  • tools/debugger/README.md
  • tools/debugger/client.sh
  • tools/debugger/server.sh
✅ Files skipped from review due to trivial changes (2)
  • tools/debugger/CLAUDE.md
  • .claude/skills/debug/SKILL.md

Comment on lines +119 to +120
echo "Sending cancel signal..."
echo "$cmd_id" > "$RELAY_DIR/cancel"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Write .relay/cancel atomically.

tools/debugger/server.sh removes empty or mismatched cancel markers, but both of these sites create .relay/cancel with plain >. A poll that lands between truncate and write can read an empty file and drop a real cancel request. Use a per-call temp file plus mv, like the server already does for .relay/running and .exit.

🔧 Proposed change
-                        echo "$cmd_id" > "$RELAY_DIR/cancel"
+                        tmp_cancel="$RELAY_DIR/cancel.$$"
+                        echo "$cmd_id" > "$tmp_cancel"
+                        mv "$tmp_cancel" "$RELAY_DIR/cancel"
...
-            echo "$running_id" > "$RELAY_DIR/cancel"
+            tmp_cancel="$RELAY_DIR/cancel.$$"
+            echo "$running_id" > "$tmp_cancel"
+            mv "$tmp_cancel" "$RELAY_DIR/cancel"

Also applies to: 199-200

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/client.sh` around lines 119 - 120, Replace the non-atomic
writes to the cancel marker in tools/debugger/client.sh (the block that echoes
"Sending cancel signal..." and writes "$cmd_id" > "$RELAY_DIR/cancel") with an
atomic write: write the payload to a per-call temp file in $RELAY_DIR (e.g.,
"$RELAY_DIR/.cancel.$$" or similar), fsync if available, then atomically mv the
temp file to "$RELAY_DIR/cancel"; apply the same change to the other occurrence
around lines 199-200 to mirror how .relay/running and .exit are handled by the
server.

Comment on lines +121 to +124
for _ in $(seq 1 10); do
[[ -f "$RELAY_DIR/running" ]] || break
sleep 1
done
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Do not wait for the relay to go idle after cancelling one command.

Both loops only wait for .relay/running to disappear. If the cancelled command finishes and the server immediately starts the next queued command, the marker can be recreated before the next poll and these waits will falsely time out even though the target cmd_id is already gone. Break once .relay/running is missing or its cmd_id no longer matches the cancelled one.

🔧 Suggested shape
wait_for_running_id_to_clear() {
    local target_id="$1" timeout="$2" elapsed=0 current_info current_id
    while [[ $elapsed -lt $timeout ]]; do
        if [[ ! -f "$RELAY_DIR/running" ]]; then
            return 0
        fi
        current_info=$(cat "$RELAY_DIR/running" 2>/dev/null || true)
        current_id="${current_info%%:*}"
        [[ "$current_id" != "$target_id" ]] && return 0
        sleep "$POLL_INTERVAL"
        elapsed=$((elapsed + POLL_INTERVAL))
    done
    return 1
}

Apply the same check in both the timeout-cancel path and the cancel subcommand.

Also applies to: 204-211

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/client.sh` around lines 121 - 124, The loop that waits for
"$RELAY_DIR/running" to disappear can falsely time out if the cancelled command
finishes and the server immediately starts the next command; update both wait
loops (used in the timeout-cancel path and the cancel subcommand) to break not
only when the running file is missing but also when its recorded cmd_id differs
from the cancelled target_id. Implement a helper like
wait_for_running_id_to_clear(target_id, timeout) that reads
"$RELAY_DIR/running", extracts the current_id (prefix before ":"), returns
success if the file is gone or current_id != target_id, otherwise sleeps
POLL_INTERVAL and retries until timeout; replace the existing seq/sleep loops
with calls to this helper in both places.

Comment on lines 89 to +99
1. Client writes a command to `.relay/cmd/<timestamp>.sh`
2. Server detects the file, runs `bash <file>` in the workdir, captures output
2. Server detects the file, runs `bash <file>` in the workdir in background, writes `.relay/running`
3. Server writes `.relay/result/<timestamp>.log` and `.relay/result/<timestamp>.exit`
4. Server removes the `.sh` file; client reads results and cleans up
4. Server removes the `.sh` file and `.relay/running`; client reads results and cleans up

### Cancellation

1. Client writes `.relay/cancel`
2. Server detects the cancel signal, kills the running command process tree
3. Server writes exit code 130 and removes `.relay/running` and `.relay/cancel`
4. Client-side timeout also triggers cancellation automatically
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Protocol section still describes the pre-buffered execution flow.

tools/debugger/server.sh no longer runs bash <file> and deletes the .sh file at the end; it reads the file, removes it immediately, and executes bash -c "$cmd_content". tools/debugger/client.sh also writes the target cmd_id into .relay/cancel. This section should be updated so the documented on-disk state transitions match what users will actually see.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/README.md` around lines 89 - 99, The Protocol docs describe an
outdated pre-buffered flow; update the README protocol to match the current
behavior in tools/debugger/server.sh and tools/debugger/client.sh by documenting
that the server reads the command file contents, immediately removes the
`.relay/cmd/<timestamp>.sh` file, then executes the command via `bash -c
"$cmd_content"` (rather than running `bash <file>` and deleting at end), and
that the client writes the target `cmd_id` into `.relay/cancel` for
cancellation; also adjust the described on-disk transitions for `.relay/cmd`,
`.relay/running`, `.relay/result/*`, and `.relay/cancel` to reflect these exact
steps and exit code 130 behavior.

Comment on lines +99 to +116
# Ensure modelopt is editable-installed from WORKDIR
check_modelopt_local() {
python -c "
import modelopt, os, sys
actual = os.path.realpath(modelopt.__path__[0])
expected = os.path.realpath('$WORKDIR')
if not actual.startswith(expected):
print(f'modelopt loaded from {actual}, expected under {expected}', file=sys.stderr)
sys.exit(1)
" 2>&1
}

if check_modelopt_local >/dev/null 2>&1; then
echo "[server] modelopt already editable-installed from $WORKDIR, skipping pip install."
else
echo "[server] Installing modelopt (pip install -e .[dev]) ..."
(cd "$WORKDIR" && pip install -e ".[dev]") || {
echo "[server] WARNING: pip install failed (exit=$?), continuing anyway."
}
(cd "$WORKDIR" && pip install -e ".[dev]")
if ! check_modelopt_local; then
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Relevant server.sh lines:"
rg -n -C2 'export PYTHONPATH|check_modelopt_local|pip install -e' tools/debugger/server.sh

echo
repo_root="$(pwd)"
echo "Import resolution with PYTHONPATH set to the repo root:"
PYTHONPATH="$repo_root" python - <<'PY'
import importlib.util, os
spec = importlib.util.find_spec("modelopt")
print("PYTHONPATH =", os.environ["PYTHONPATH"])
print("spec is None:", spec is None)
if spec and spec.submodule_search_locations:
    print("search locations =", list(spec.submodule_search_locations))
PY

Repository: NVIDIA/Model-Optimizer

Length of output: 1054


check_modelopt_local() no longer validates editable install due to PYTHONPATH pollution.

Line 84 exports PYTHONPATH="$WORKDIR" before check_modelopt_local() runs. Python's module resolution honors PYTHONPATH first, so import modelopt succeeds directly from the source tree without requiring editable install metadata. This allows the "already editable-installed" branch to execute even when pip install -e ".[dev]" has never run, causing the server to skip installing dev dependencies it expects.

The subprocess in check_modelopt_local() must isolate itself from the parent's PYTHONPATH. Use PYTHONPATH="" and the -I flag to ensure the check validates actual installed state:

Suggested fix
 check_modelopt_local() {
-    python -c "
+    WORKDIR_ENV="$WORKDIR" PYTHONPATH="" python -I -c "
 import modelopt, os, sys
 actual = os.path.realpath(modelopt.__path__[0])
-expected = os.path.realpath('$WORKDIR')
+expected = os.path.realpath(os.environ['WORKDIR_ENV'])
 if not actual.startswith(expected):
     print(f'modelopt loaded from {actual}, expected under {expected}', file=sys.stderr)
     sys.exit(1)
 " 2>&1
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Ensure modelopt is editable-installed from WORKDIR
check_modelopt_local() {
python -c "
import modelopt, os, sys
actual = os.path.realpath(modelopt.__path__[0])
expected = os.path.realpath('$WORKDIR')
if not actual.startswith(expected):
print(f'modelopt loaded from {actual}, expected under {expected}', file=sys.stderr)
sys.exit(1)
" 2>&1
}
if check_modelopt_local >/dev/null 2>&1; then
echo "[server] modelopt already editable-installed from $WORKDIR, skipping pip install."
else
echo "[server] Installing modelopt (pip install -e .[dev]) ..."
(cd "$WORKDIR" && pip install -e ".[dev]") || {
echo "[server] WARNING: pip install failed (exit=$?), continuing anyway."
}
(cd "$WORKDIR" && pip install -e ".[dev]")
if ! check_modelopt_local; then
# Ensure modelopt is editable-installed from WORKDIR
check_modelopt_local() {
WORKDIR_ENV="$WORKDIR" PYTHONPATH="" python -I -c "
import modelopt, os, sys
actual = os.path.realpath(modelopt.__path__[0])
expected = os.path.realpath(os.environ['WORKDIR_ENV'])
if not actual.startswith(expected):
print(f'modelopt loaded from {actual}, expected under {expected}', file=sys.stderr)
sys.exit(1)
" 2>&1
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/server.sh` around lines 99 - 116, The check_modelopt_local()
helper is currently fooled by the parent's PYTHONPATH; modify the subprocess
invocation used inside check_modelopt_local to clear PYTHONPATH and run Python
in isolated mode so import resolution ignores the working tree. Concretely, when
launching the inline python check (the python -c "... import modelopt ..." block
referenced in check_modelopt_local), set the environment variable PYTHONPATH=""
for that subprocess and invoke python with the -I flag so the check validates
the actually installed package state rather than the source tree at $WORKDIR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant