Add file-based command relay for remote Docker testing by cjluo-nv · Pull Request #1174 · NVIDIA/Model-Optimizer

cjluo-nv · 2026-04-03T17:05:05Z

Summary

Adds a lightweight file-based client/server relay (tools/debugger/) that enables Claude Code (or any host-side automation) to execute commands inside a remote Docker container using only a shared filesystem — no networking setup required.
The server auto-detects the repo root, installs modelopt (pip install -e .[dev]), sets PYTHONPATH, and listens for commands.
The client supports handshake, run, status, and flush subcommands.
Includes README.md (full protocol docs) and CLAUDE.md (quick reference for Claude Code).

Tested

Ran Qwen3.5-35B-A3B MoE PTQ with nvfp4_experts_only quantization via the relay:
```
bash examples/llm_ptq/scripts/huggingface_example.sh --model /hf-local/Qwen/Qwen3.5-35B-A3B/ --quant nvfp4_experts_only
```
- 42,140 quantizers inserted, MTP layers correctly excluded
- Quantized checkpoint exported successfully (~208s, 147.61 GB peak GPU memory)

Test plan

Start server.sh inside a Docker container with the repo mounted
Run client.sh handshake from the host
Run client.sh run "echo hello" and verify output
Run client.sh flush and verify .relay/ is cleared
Run a real PTQ workload end-to-end

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added a file-based command relay system with host↔container client and server CLIs, supporting handshake, run, status and flush workflows to execute commands inside containers.
Documentation
- Added guides describing the relay protocol, usage examples, CLI options, lifecycle, and operational notes (workdir, timeouts, sequential execution).
Chores
- Updated ignore rules to exclude ephemeral relay artifacts.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

coderabbitai · 2026-04-03T17:05:20Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b550aa8e-3831-4b21-a83d-7398c6297a22

📥 Commits

Reviewing files that changed from the base of the PR and between 6362761 and 5806895.

📒 Files selected for processing (2)

tools/debugger/client.sh
tools/debugger/server.sh

✅ Files skipped from review due to trivial changes (1)

tools/debugger/server.sh

🚧 Files skipped from review as they are similar to previous changes (1)

tools/debugger/client.sh

📝 Walkthrough

Walkthrough

A file-based command relay: a Docker-hosted server monitors a shared .relay/ directory and executes timestamped command scripts; a host client performs a handshake, submits commands, polls for results, and cleans up. Documentation and a gitignore entry were added.

Changes

Cohort / File(s)	Summary
Configuration `tools/debugger/.gitignore`	Added `.relay/` to ignore relay artifacts.
Documentation `tools/debugger/README.md`, `tools/debugger/CLAUDE.md`	New docs describing the `.relay/` protocol, handshake and command lifecycle, CLI usage examples, workdir behavior, and operational notes.
Client implementation `tools/debugger/client.sh`	New Bash client CLI with subcommands `handshake`, `run`, `status`, `flush`; supports `--relay-dir` and `--timeout`; writes atomic `cmd/<id>.sh`, polls for `result/<id>.exit`, handles timeouts and cleanup.
Server implementation `tools/debugger/server.sh`	New Bash server daemon for Docker: initializes/clears relay dirs, writes `server.ready`, completes handshake, executes `cmd/*.sh` in configured workdir, writes `result/<id>.log` and `result/<id>.exit`, and cleans up on shutdown.

Sequence Diagram

sequenceDiagram
    participant Host as Host<br/>(client.sh)
    participant FS as Shared<br/>Filesystem
    participant Docker as Docker<br/>(server.sh)

    rect rgba(100,150,255,0.5)
    Note over Host,Docker: Handshake
    Docker->>FS: Write server.ready
    Host->>FS: Read server.ready
    Host->>FS: Write client.ready
    Docker->>FS: Read client.ready
    Docker->>FS: Write handshake.done
    end

    rect rgba(100,200,100,0.5)
    Note over Host,Docker: Command Execution Cycle
    Host->>FS: Write cmd/<id>.sh
    Docker->>FS: Detect cmd/<id>.sh
    Docker->>Docker: Execute cmd/<id>.sh in workdir
    Docker->>FS: Write result/<id>.log (stdout+stderr)
    Docker->>FS: Write result/<id>.exit (exit code)
    Docker->>FS: Delete cmd/<id>.sh
    Host->>FS: Read result/<id>.log and result/<id>.exit
    Host->>FS: Delete result files
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely summarizes the main addition: a file-based command relay system for executing commands in remote Docker containers.
Security Anti-Patterns	✅ Passed	PR `#1174` adds only shell scripts, markdown, and .gitignore to tools/debugger/ with no Python code or security anti-patterns.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch chenjiel/debugger

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (2)

tools/debugger/README.md (1)

20-25: Document the trust boundary for .relay/ explicitly.

Please add a note that any actor with write access to .relay/cmd/ can execute commands in-container, and recommend restricting mount permissions to trusted users/processes only.

Suggested wording

 ## Assumptions
 
 - The ModelOpt repo is accessible from both host and container (e.g., bind-mounted)
 - **HuggingFace models** are mounted at `/hf-local`
 - The server auto-detects the repo root from the location of `server.sh`
+- The shared `.relay/` path is writable only by trusted users/processes; write access to
+  `.relay/cmd/` implies command execution capability inside the container.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/README.md` around lines 20 - 25, Add an explicit
trust-boundary note to tools/debugger/README.md stating that any actor with
write access to .relay/cmd/ can execute commands inside the container (since
server.sh auto-detects the repo root and will run dispatched commands), and
advise restricting mount permissions so only trusted users/processes can write
to .relay/cmd/ (e.g., use read-only mounts where possible, limit host user
ownership/ACLs, and avoid bind-mounting directories from untrusted sources).

tools/debugger/client.sh (1)

39-45: Add validation for option values and timeout argument type.

Lines 41–42 can fail under set -u if the option argument is missing, and line 102 assumes TIMEOUT is numeric without validation. Use a validation helper to ensure options receive valid arguments and that TIMEOUT is an integer before arithmetic comparison.

Suggested hardening

+require_value() {
+    local opt="$1"
+    local val="${2-}"
+    if [[ -z "$val" || "$val" == --* ]]; then
+        echo "ERROR: ${opt} requires a value."
+        exit 1
+    fi
+}
+
 while [[ $# -gt 0 ]]; do
     case "$1" in
-        --relay-dir) RELAY_DIR="$2"; shift 2 ;;
-        --timeout) TIMEOUT="$2"; shift 2 ;;
+        --relay-dir)
+            require_value "$1" "${2-}"
+            RELAY_DIR="$2"
+            shift 2
+            ;;
+        --timeout)
+            require_value "$1" "${2-}"
+            [[ "$2" =~ ^[0-9]+$ ]] || { echo "ERROR: --timeout must be an integer."; exit 1; }
+            TIMEOUT="$2"
+            shift 2
+            ;;
         *) break ;;
     esac
 done

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/client.sh` around lines 39 - 45, The option parsing loop can
dereference missing arguments under set -u and TIMEOUT is used as a number later
without validation; add a small helper (e.g., require_arg) used by the parsing
case to ensure flags like --relay-dir and --timeout have a following value
before shifting, and after parsing validate that TIMEOUT is an integer (use a
regex/grep check) before any arithmetic comparison (the code that reads/uses
TIMEOUT later). Update the case branches that set RELAY_DIR and TIMEOUT to call
the helper and fail with a clear message if the argument is absent or TIMEOUT is
not numeric.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/debugger/client.sh`:
- Line 95: Replace the direct write to "$CMD_DIR/$cmd_id.sh" with an atomic
publish: create a temp file in the same directory (e.g.,
"$CMD_DIR/$cmd_id.sh.tmp.$$" or via mktemp in CMD_DIR), write the command
contents into that temp file, flush/sync and set executable permissions, then
rename (mv) the temp file to "$CMD_DIR/$cmd_id.sh" so the server never sees a
partially-written file; update the code around the echo "$*" >
"$CMD_DIR/$cmd_id.sh" usage to perform these steps using CMD_DIR and cmd_id.
- Around line 139-142: The flush case calls rm -rf on RELAY_DIR without
validation; before removing, canonicalize RELAY_DIR (e.g., via realpath) and
validate it is non-empty, not "/" and not a top-level system path (reject
prefixes like /, /etc, /bin, /usr, /boot, /sbin, /proc, /sys, /dev), ensure the
resolved path exists and is a subdirectory of an expected safe base (or require
explicit user confirmation if not), and only then perform rm -rf; update the
flush handling in client.sh (the flush case and any code that parses
--relay-dir) to perform these guards.
- Line 135: The command substitution setting pending uses ls and a fallback that
fails under set -euo pipefail; replace it so missing/invalid CMD_DIR yields 0
reliably by either first testing [ -d "$CMD_DIR" ] and setting pending=0 when
absent, or by using find on "$CMD_DIR" (with -maxdepth 1 and -name '*.sh' and
-type f) piped to wc -l while redirecting find errors to /dev/null; update the
assignment that sets pending to use one of these robust approaches (referencing
the pending variable and CMD_DIR).

In `@tools/debugger/server.sh`:
- Around line 103-121: The loop currently iterates over cmd files and deletes
them after running, which can replay on restart; change to atomically claim each
command before running by renaming/moving the file (using mv) to a processing
name (e.g., append .claimed or move into a PROCESS_DIR) and operate on that
claimed path instead of the original $CMD_DIR/$cmd_file; if the mv fails (no-op
or already claimed) skip that file, then run the claimed file from $WORKDIR, tee
output to $RESULT_DIR/$cmd_id.log, write the exit code to
$RESULT_DIR/$cmd_id.exit, and finally remove or archive the claimed file to mark
completion (all references: CMD_DIR, cmd_file, cmd_id, WORKDIR, RESULT_DIR).
- Line 76: The rm -rf "$RELAY_DIR" call is unsafe; add a guard in server.sh that
validates RELAY_DIR (use realpath/resolved path) and refuse to delete if it's
empty, "/", "/", ".", "/", or equals root or other critical system paths, and
ensure it is within an expected base directory (for example the project
workspace or /tmp) or contains an expected substring; if the check fails, print
an error and exit non‑zero. Locate the removal statement referencing RELAY_DIR
and replace it with the preflight validation (resolve path, check not
root/empty/"/" and check prefix or allowed list) before performing the rm -rf.

---

Nitpick comments:
In `@tools/debugger/client.sh`:
- Around line 39-45: The option parsing loop can dereference missing arguments
under set -u and TIMEOUT is used as a number later without validation; add a
small helper (e.g., require_arg) used by the parsing case to ensure flags like
--relay-dir and --timeout have a following value before shifting, and after
parsing validate that TIMEOUT is an integer (use a regex/grep check) before any
arithmetic comparison (the code that reads/uses TIMEOUT later). Update the case
branches that set RELAY_DIR and TIMEOUT to call the helper and fail with a clear
message if the argument is absent or TIMEOUT is not numeric.

In `@tools/debugger/README.md`:
- Around line 20-25: Add an explicit trust-boundary note to
tools/debugger/README.md stating that any actor with write access to .relay/cmd/
can execute commands inside the container (since server.sh auto-detects the repo
root and will run dispatched commands), and advise restricting mount permissions
so only trusted users/processes can write to .relay/cmd/ (e.g., use read-only
mounts where possible, limit host user ownership/ACLs, and avoid bind-mounting
directories from untrusted sources).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 18c84092-667e-4170-a1fa-fdc7a6efef5f

📥 Commits

Reviewing files that changed from the base of the PR and between 4afac7f and 33eb9c5.

📒 Files selected for processing (5)

tools/debugger/.gitignore
tools/debugger/CLAUDE.md
tools/debugger/README.md
tools/debugger/client.sh
tools/debugger/server.sh

tools/debugger/client.sh

coderabbitai · 2026-04-03T17:11:34Z

tools/debugger/client.sh

+    flush)
+        if [[ -d "$RELAY_DIR" ]]; then
+            rm -rf "$RELAY_DIR"
+            echo "Relay directory cleared: $RELAY_DIR"


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/usr/bin/env bash set -euo pipefail rg -n -C2 'flush\)|rm -rf "\$RELAY_DIR"' tools/debugger/client.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 255

🏁 Script executed:

cat tools/debugger/client.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 4957

Add validation guards before recursive delete in flush command.

RELAY_DIR is user-controllable via the --relay-dir option and passed directly to rm -rf without validation. A user could run ./client.sh --relay-dir / flush or similar to delete critical system directories. Add safeguards to reject unsafe paths:

Recommended fix

flush) + if [[ -z "$RELAY_DIR" || "$RELAY_DIR" == "/" ]]; then + echo "ERROR: Refusing to clear unsafe relay dir: '$RELAY_DIR'" + exit 1 + fi if [[ -d "$RELAY_DIR" ]]; then - rm -rf "$RELAY_DIR" + rm -rf -- "$RELAY_DIR" echo "Relay directory cleared: $RELAY_DIR" else echo "Relay directory does not exist: $RELAY_DIR"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

flush)

if [[ -d "$RELAY_DIR" ]]; then

rm -rf "$RELAY_DIR"

echo "Relay directory cleared: $RELAY_DIR"

flush)

if [[ -z "$RELAY_DIR" || "$RELAY_DIR" == "/" ]]; then

echo "ERROR: Refusing to clear unsafe relay dir: '$RELAY_DIR'"

exit 1

fi

if [[ -d "$RELAY_DIR" ]]; then

rm -rf -- "$RELAY_DIR"

echo "Relay directory cleared: $RELAY_DIR"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tools/debugger/client.sh` around lines 139 - 142, The flush case calls rm -rf on RELAY_DIR without validation; before removing, canonicalize RELAY_DIR (e.g., via realpath) and validate it is non-empty, not "/" and not a top-level system path (reject prefixes like /, /etc, /bin, /usr, /boot, /sbin, /proc, /sys, /dev), ensure the resolved path exists and is a subdirectory of an expected safe base (or require explicit user confirmation if not), and only then perform rm -rf; update the flush handling in client.sh (the flush case and any code that parses --relay-dir) to perform these guards.

coderabbitai · 2026-04-03T17:11:34Z

tools/debugger/server.sh

+export PYTHONPATH="$WORKDIR"
+
+# Initialize relay directories
+rm -rf "$RELAY_DIR"


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/usr/bin/env bash set -euo pipefail rg -n -C2 'rm -rf "\$RELAY_DIR"|--relay-dir' tools/debugger/server.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 815

🏁 Script executed:

cat -n tools/debugger/server.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 4816

Guard startup cleanup against unsafe relay paths.

Line 76 executes rm -rf on a configurable path without validating it first. An attacker or misconfigured invocation could pass --relay-dir / or another critical path, causing data loss. Add explicit safety checks before deletion.

Suggested guard

-# Initialize relay directories -rm -rf "$RELAY_DIR" +# Initialize relay directories +if [[ -z "$RELAY_DIR" || "$RELAY_DIR" == "/" ]]; then + echo "[server] ERROR: refusing unsafe relay dir: '$RELAY_DIR'" + exit 1 +fi +rm -rf -- "$RELAY_DIR" mkdir -p "$CMD_DIR" "$RESULT_DIR"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tools/debugger/server.sh` at line 76, The rm -rf "$RELAY_DIR" call is unsafe; add a guard in server.sh that validates RELAY_DIR (use realpath/resolved path) and refuse to delete if it's empty, "/", "/", ".", "/", or equals root or other critical system paths, and ensure it is within an expected base directory (for example the project workspace or /tmp) or contains an expected substring; if the check fails, print an error and exit non‑zero. Locate the removal statement referencing RELAY_DIR and replace it with the preflight validation (resolve path, check not root/empty/"/" and check prefix or allowed list) before performing the rm -rf.

Same comment from Claude

rm -rf "$RELAY_DIR" on server startup is risky (server.sh:91)
If RELAY_DIR is accidentally set to an empty string or /, this would be destructive. Add a guard:
[[ -n "$RELAY_DIR" && "$RELAY_DIR" != "/" ]] || { echo "Invalid relay dir"; exit 1; }

@chadvoegele, agreed — glad Claude flagged this too. The guard can be made slightly more robust by combining both suggestions, also handling edge cases like . and paths that don't start with / when the user likely intends an absolute path:

# Initialize relay directories if [[ -z "$RELAY_DIR" || "$RELAY_DIR" == "/" || "$RELAY_DIR" == "." ]]; then echo "[server] ERROR: refusing unsafe relay dir: '${RELAY_DIR}'" exit 1 fi rm -rf -- "$RELAY_DIR" mkdir -p "$CMD_DIR" "$RESULT_DIR"

The -- after rm -rf is also good practice to prevent argument injection if RELAY_DIR somehow starts with a -. The client.sh flush subcommand similarly removes cmd/ and result/ directories — worth verifying it has equivalent guards if --relay-dir is user-supplied there as well.

coderabbitai · 2026-04-03T17:11:34Z

tools/debugger/server.sh

+while true; do
+    for cmd_file in "$CMD_DIR"/*.sh; do
+        [[ -e "$cmd_file" ]] || continue
+
+        cmd_id="$(basename "$cmd_file" .sh)"
+        echo "[server] Executing command $cmd_id..."
+
+        # Execute the command, tee stdout+stderr to console and result file
+        set +e
+        (cd "$WORKDIR" && bash "$cmd_file" 2>&1) | tee "$RESULT_DIR/$cmd_id.log"
+        exit_code=${PIPESTATUS[0]}
+        set -e
+
+        # Write exit code
+        echo "$exit_code" > "$RESULT_DIR/$cmd_id.exit"
+
+        # Remove the command file to mark it as processed
+        rm -f "$cmd_file"
+


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/usr/bin/env bash set -euo pipefail rg -n -C3 'for cmd_file in "\$CMD_DIR"/\*\.sh|bash "\$cmd_file"|rm -f "\$cmd_file"' tools/debugger/server.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 776

Claim command files before execution to prevent replay on server restart.

The current flow executes commands, then deletes them afterward. If the server crashes between execution and deletion, the same command file will be re-executed on restart.

Use atomic claiming with mv before execution to prevent this race condition:

Atomic claim-before-run pattern

while true; do for cmd_file in "$CMD_DIR"/*.sh; do [[ -e "$cmd_file" ]] || continue cmd_id="$(basename "$cmd_file" .sh)" + claimed_file="$CMD_DIR/$cmd_id.running.$$" + if ! mv "$cmd_file" "$claimed_file" 2>/dev/null; then + continue + fi echo "[server] Executing command $cmd_id..." # Execute the command, tee stdout+stderr to console and result file set +e - (cd "$WORKDIR" && bash "$cmd_file" 2>&1) | tee "$RESULT_DIR/$cmd_id.log" + (cd "$WORKDIR" && bash "$claimed_file" 2>&1) | tee "$RESULT_DIR/$cmd_id.log" exit_code=${PIPESTATUS[0]} set -e @@ - rm -f "$cmd_file" + rm -f "$claimed_file"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

while true; do

for cmd_file in "$CMD_DIR"/*.sh; do

[[ -e "$cmd_file" ]] || continue

cmd_id="$(basename "$cmd_file" .sh)"

echo "[server] Executing command $cmd_id..."

# Execute the command, tee stdout+stderr to console and result file

set +e

(cd "$WORKDIR" && bash "$cmd_file" 2>&1) | tee "$RESULT_DIR/$cmd_id.log"

exit_code=${PIPESTATUS[0]}

set -e

# Write exit code

echo "$exit_code" > "$RESULT_DIR/$cmd_id.exit"

# Remove the command file to mark it as processed

rm -f "$cmd_file"

while true; do

for cmd_file in "$CMD_DIR"/*.sh; do

[[ -e "$cmd_file" ]] || continue

cmd_id="$(basename "$cmd_file" .sh)"

claimed_file="$CMD_DIR/$cmd_id.running.$$"

if ! mv "$cmd_file" "$claimed_file" 2>/dev/null; then

continue

fi

echo "[server] Executing command $cmd_id..."

# Execute the command, tee stdout+stderr to console and result file

set +e

(cd "$WORKDIR" && bash "$claimed_file" 2>&1) | tee "$RESULT_DIR/$cmd_id.log"

exit_code=${PIPESTATUS[0]}

set -e

# Write exit code

echo "$exit_code" > "$RESULT_DIR/$cmd_id.exit"

# Remove the command file to mark it as processed

rm -f "$claimed_file"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tools/debugger/server.sh` around lines 103 - 121, The loop currently iterates over cmd files and deletes them after running, which can replay on restart; change to atomically claim each command before running by renaming/moving the file (using mv) to a processing name (e.g., append .claimed or move into a PROCESS_DIR) and operate on that claimed path instead of the original $CMD_DIR/$cmd_file; if the mv fails (no-op or already claimed) skip that file, then run the claimed file from $WORKDIR, tee output to $RESULT_DIR/$cmd_id.log, write the exit code to $RESULT_DIR/$cmd_id.exit, and finally remove or archive the claimed file to mark completion (all references: CMD_DIR, cmd_file, cmd_id, WORKDIR, RESULT_DIR).

github-actions · 2026-04-03T17:12:09Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-06 05:45 UTC

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (2)

tools/debugger/client.sh (2)

142-143: ⚠️ Potential issue | 🟡 Minor

Use find for pending count robustness (same issue as prior review).

Line 143 still relies on ls for counting *.sh files. This is brittle with shell/glob edge cases; find is safer and aligns with SC2012.

#!/usr/bin/env bash
set -euo pipefail

tmp="$(mktemp -d)"
touch "$tmp/a.sh" "$tmp/b.sh"

echo "Current pattern:"
bash -lc "set -euo pipefail; CMD_DIR='$tmp'; pending=\$(ls \"\$CMD_DIR\"/*.sh 2>/dev/null | wc -l); echo pending=\$pending"

echo "Recommended pattern:"
bash -lc "set -euo pipefail; CMD_DIR='$tmp'; pending=\$(find \"\$CMD_DIR\" -maxdepth 1 -type f -name '*.sh' | wc -l); echo pending=\$pending"

rm -rf "$tmp"

Suggested change

         if [[ -d "$CMD_DIR" ]]; then
-            pending=$(ls "$CMD_DIR"/*.sh 2>/dev/null | wc -l)
+            pending=$(find "$CMD_DIR" -maxdepth 1 -type f -name '*.sh' | wc -l)
         else
             pending=0
         fi

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/client.sh` around lines 142 - 143, Replace the fragile
ls-based count of shell scripts with a find-based count: in the block that
checks [[ -d "$CMD_DIR" ]], update the computation of the pending variable (the
pending= assignment) to use find on $CMD_DIR with -maxdepth 1 -type f -name
'*.sh' and pipe to wc -l (suppressing errors) so it robustly counts files and
satisfies SC2012; keep the existing conditional around CMD_DIR and preserve
error redirection to /dev/null.

150-153: ⚠️ Potential issue | 🔴 Critical

Guard rm -rf against unsafe relay paths (still unresolved).

Line 152 deletes a user-provided path without safety checks. --relay-dir / (or similarly dangerous paths) can cause destructive deletion.

Suggested safety checks

     flush)
         if [[ -d "$RELAY_DIR" ]]; then
-            rm -rf "$RELAY_DIR"
+            resolved="$(realpath -m "$RELAY_DIR")"
+            case "$resolved" in
+                "/"|"/bin"|"/boot"|"/dev"|"/etc"|"/lib"|"/lib64"|"/proc"|"/root"|"/run"|"/sbin"|"/sys"|"/usr"|"/var")
+                    echo "ERROR: Refusing to clear unsafe relay dir: $resolved"
+                    exit 1
+                    ;;
+            esac
+            rm -rf -- "$resolved"
             echo "Relay directory cleared: $RELAY_DIR"
         else
             echo "Relay directory does not exist: $RELAY_DIR"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/client.sh` around lines 150 - 153, The flush case currently
runs rm -rf "$RELAY_DIR" unsafely; modify the flush) branch to validate
RELAY_DIR before deletion: ensure RELAY_DIR is non-empty, resolve it with
realpath (or readlink -f) into a variable (e.g., resolved_relay_dir), reject if
it equals "/" or "." or "/" + "" or contains ".." or is shorter than a minimal
safe prefix, and (preferably) enforce it is under an allowed base directory (or
match a whitelist/regex); only after passing these checks prompt the user for
confirmation and then run rm -rf "$resolved_relay_dir". Reference the flush)
case and the RELAY_DIR usage to locate where to add the validation and
confirmation logic.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/debugger/client.sh`:
- Around line 84-96: The run command handler currently allows empty invocations
and writes an empty script; add an early guard in the run case (before
generating cmd_id and writing files) that checks for no arguments (e.g., test
"$#" -eq 0 or test -z "$*") and prints a usage/error like "ERROR: No command
provided. Usage: client.sh run <command>" then exit 1; keep the rest of the
logic that creates cmd_id, writes to "$CMD_DIR/$cmd_id.sh.tmp" and moves it
intact.

---

Duplicate comments:
In `@tools/debugger/client.sh`:
- Around line 142-143: Replace the fragile ls-based count of shell scripts with
a find-based count: in the block that checks [[ -d "$CMD_DIR" ]], update the
computation of the pending variable (the pending= assignment) to use find on
$CMD_DIR with -maxdepth 1 -type f -name '*.sh' and pipe to wc -l (suppressing
errors) so it robustly counts files and satisfies SC2012; keep the existing
conditional around CMD_DIR and preserve error redirection to /dev/null.
- Around line 150-153: The flush case currently runs rm -rf "$RELAY_DIR"
unsafely; modify the flush) branch to validate RELAY_DIR before deletion: ensure
RELAY_DIR is non-empty, resolve it with realpath (or readlink -f) into a
variable (e.g., resolved_relay_dir), reject if it equals "/" or "." or "/" + ""
or contains ".." or is shorter than a minimal safe prefix, and (preferably)
enforce it is under an allowed base directory (or match a whitelist/regex); only
after passing these checks prompt the user for confirmation and then run rm -rf
"$resolved_relay_dir". Reference the flush) case and the RELAY_DIR usage to
locate where to add the validation and confirmation logic.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fee62419-0770-464f-b29c-6aacef9c8efa

📥 Commits

Reviewing files that changed from the base of the PR and between 33eb9c5 and 419e260.

📒 Files selected for processing (2)

tools/debugger/client.sh
tools/debugger/server.sh

✅ Files skipped from review due to trivial changes (1)

tools/debugger/server.sh

coderabbitai · 2026-04-03T17:21:16Z

tools/debugger/client.sh

+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --relay-dir) RELAY_DIR="$2"; shift 2 ;;
+        --timeout) TIMEOUT="$2"; shift 2 ;;
+        *) break ;;


⚠️ Potential issue | 🟠 Major

Validate option values before use.

Line 41 and Line 42 assume $2 exists, and Line 109 assumes TIMEOUT is numeric. client.sh --timeout foo or missing option values can fail with shell errors instead of a clear user-facing message.

Suggested hardening

while [[ $# -gt 0 ]]; do case "$1" in - --relay-dir) RELAY_DIR="$2"; shift 2 ;; - --timeout) TIMEOUT="$2"; shift 2 ;; + --relay-dir) + [[ $# -ge 2 ]] || { echo "ERROR: --relay-dir requires a value."; exit 1; } + RELAY_DIR="$2"; shift 2 ;; + --timeout) + [[ $# -ge 2 ]] || { echo "ERROR: --timeout requires a value."; exit 1; } + TIMEOUT="$2"; shift 2 ;; *) break ;; esac done + +[[ "$TIMEOUT" =~ ^[0-9]+$ && "$TIMEOUT" -gt 0 ]] || { + echo "ERROR: --timeout must be a positive integer." + exit 1 +}

Also applies to: 109-109

coderabbitai · 2026-04-03T17:21:16Z

tools/debugger/client.sh

+    run)
+        # Verify handshake was done
+        if [[ ! -f "$RELAY_DIR/handshake.done" ]]; then
+            echo "ERROR: Not connected. Run 'client.sh handshake' first."
+            exit 1
+        fi
+
+        # Generate a unique command ID (timestamp + PID to avoid collisions)
+        cmd_id="$(date +%s%N)_$$"
+
+        # Write the command file atomically (tmp + mv)
+        echo "$*" > "$CMD_DIR/$cmd_id.sh.tmp"
+        mv "$CMD_DIR/$cmd_id.sh.tmp" "$CMD_DIR/$cmd_id.sh"


⚠️ Potential issue | 🟡 Minor

Reject empty run commands early.

Line 84 onward does not validate command presence. run without args currently writes an empty script. Return a usage error when no command is provided.

Suggested guard

run) + if [[ $# -eq 0 ]]; then + echo "ERROR: Missing command. Usage: $0 run <command...>" + exit 1 + fi # Verify handshake was done if [[ ! -f "$RELAY_DIR/handshake.done" ]]; then

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tools/debugger/client.sh` around lines 84 - 96, The run command handler currently allows empty invocations and writes an empty script; add an early guard in the run case (before generating cmd_id and writing files) that checks for no arguments (e.g., test "$#" -eq 0 or test -z "$*") and prints a usage/error like "ERROR: No command provided. Usage: client.sh run <command>" then exit 1; keep the rest of the logic that creates cmd_id, writes to "$CMD_DIR/$cmd_id.sh.tmp" and moves it intact.

codecov · 2026-04-03T17:29:40Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.07%. Comparing base (00c002f) to head (5806895).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1174      +/-   ##
==========================================
- Coverage   74.27%   74.07%   -0.20%     
==========================================
  Files         349      349              
  Lines       39846    40277     +431     
==========================================
+ Hits        29594    29835     +241     
- Misses      10252    10442     +190

Flag	Coverage Δ
unit	`54.53% <ø> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ChenhanYu

PR Review: Add file-based command relay for remote Docker testing

Summary

Clean, well-documented developer tool. File-based relay protocol is simple and effective — no networking, atomic file writes for result signaling, good handshake/status/flush subcommands. Documentation (README + CLAUDE.md) is thorough. Tested with a real PTQ workload. A few items below.

Comments

1. `sync` with file argument is not portable (`server.sh:126`)

On Linux, sync doesn't accept file arguments (that's macOS/BSD). This silently syncs all filesystems, which is slow and not the intent. Since you already use atomic tmp+mv for .exit and the client only reads .log after .exit exists, the ordering is correct without this line. Safe to remove.

2. Client timeout doesn't cancel server-side execution (`client.sh:113-116`)

Removing the .sh file only prevents future execution. If the server already picked up the command, it continues running with no way to cancel. Consider writing a $cmd_id.cancel file that the server checks, or documenting this as a known limitation.

3. `pip install -e .[dev]` on every server start (`server.sh:98-101`)

Runs on every server start even when the environment is already set up. For a large project this can take 30+ seconds. Consider a --skip-install flag or checking if modelopt is already installed.

4. `kill -- -$$` may affect unrelated processes (`server.sh:79`)

Sends SIGTERM to the entire process group. If the server was launched from an interactive shell (common in Docker), $$ is the shell's process group, which could include other processes. Safer to track child PIDs explicitly or use pkill -P $$.

5. Copyright year is 2024

6. Minor: `ls` output parsed for counting (`client.sh:143`)

ls "$CMD_DIR"/*.sh 2>/dev/null | wc -l — fragile if filenames contain newlines. Safer: files=("$CMD_DIR"/*.sh); pending=${#files[@]}.

Overall: clean dev tool, no blockers. The sync portability issue and forced pip install are the most actionable.

This is an AI-assisted review — human sign-off required before merging.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

chadvoegele · 2026-04-03T19:51:13Z

tools/debugger/server.sh

+export PYTHONPATH="$WORKDIR"
+
+# Initialize relay directories
+rm -rf "$RELAY_DIR"


Same comment from Claude

rm -rf "$RELAY_DIR" on server startup is risky (server.sh:91)
If RELAY_DIR is accidentally set to an empty string or /, this would be destructive. Add a guard:
[[ -n "$RELAY_DIR" && "$RELAY_DIR" != "/" ]] || { echo "Invalid relay dir"; exit 1; }

chadvoegele · 2026-04-03T19:52:39Z

tools/debugger/CLAUDE.md

+| Path | Description |
+|------|-------------|
+| Repo root (auto-detected) | ModelOpt source, used as workdir |
+| `/hf-local` | HuggingFace model cache |


Just the file-based relay doesn't have anything to do with hf-local , right?

Further I think if you rooted the whole thing under '${model-opt}/tools/debugger', then you could remove the model opt references too and this could be purely a portable file-based relay.

Just the file-based relay doesn't have anything to do with hf-local , right? Yes

Sure. This is to provide the info so that claude knows where to modify code regards to modelopt during the debugging process.

ChenhanYu

All review feedback addressed cleanly. sync removed, copyright updated, ls → find, kill -- -$$ → pkill -P $$, pip install now skips when already editable-installed. Nice additions: re-handshake support and smarter flush that preserves server.ready. LGTM.

Add a filesystem based debugging so claude and run tests inside a docker

33eb9c5

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

cjluo-nv requested review from ChenhanYu, Edwardf0t1, kaix-nv, meenchen and mxinO April 3, 2026 17:09

coderabbitai bot reviewed Apr 3, 2026

View reviewed changes

Review

419e260

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

coderabbitai bot reviewed Apr 3, 2026

View reviewed changes

ChenhanYu reviewed Apr 3, 2026

View reviewed changes

cjluo-nv added 2 commits April 3, 2026 18:54

Fix

6362761

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

Fix

5806895

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

chadvoegele approved these changes Apr 3, 2026

View reviewed changes

ChenhanYu approved these changes Apr 3, 2026

View reviewed changes

cjluo-nv merged commit f755722 into main Apr 6, 2026
38 checks passed

cjluo-nv deleted the chenjiel/debugger branch April 6, 2026 05:44

Conversation

cjluo-nv commented Apr 3, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Tested

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chadvoegele Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ChenhanYu left a comment

Choose a reason for hiding this comment

PR Review: Add file-based command relay for remote Docker testing

Summary

Comments

1. sync with file argument is not portable (server.sh:126)

2. Client timeout doesn't cancel server-side execution (client.sh:113-116)

3. pip install -e .[dev] on every server start (server.sh:98-101)

4. kill -- -$$ may affect unrelated processes (server.sh:79)

5. Copyright year is 2024

6. Minor: ls output parsed for counting (client.sh:143)

Uh oh!

chadvoegele Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

chadvoegele Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

cjluo-nv Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

cjluo-nv Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

ChenhanYu left a comment

Choose a reason for hiding this comment

Uh oh!

cjluo-nv commented Apr 3, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 3, 2026 •

edited

Loading

coderabbitai bot Apr 3, 2026 •

edited

Loading

github-actions bot commented Apr 3, 2026 •

edited

Loading

codecov bot commented Apr 3, 2026 •

edited

Loading

1. `sync` with file argument is not portable (`server.sh:126`)

2. Client timeout doesn't cancel server-side execution (`client.sh:113-116`)

3. `pip install -e .[dev]` on every server start (`server.sh:98-101`)

4. `kill -- -$$` may affect unrelated processes (`server.sh:79`)

6. Minor: `ls` output parsed for counting (`client.sh:143`)