Skip to content

Add file-based command relay for remote Docker testing#1174

Merged
cjluo-nv merged 4 commits intomainfrom
chenjiel/debugger
Apr 6, 2026
Merged

Add file-based command relay for remote Docker testing#1174
cjluo-nv merged 4 commits intomainfrom
chenjiel/debugger

Conversation

@cjluo-nv
Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv commented Apr 3, 2026

Summary

  • Adds a lightweight file-based client/server relay (tools/debugger/) that enables Claude Code (or any host-side automation) to execute commands inside a remote Docker container using only a shared filesystem — no networking setup required.
  • The server auto-detects the repo root, installs modelopt (pip install -e .[dev]), sets PYTHONPATH, and listens for commands.
  • The client supports handshake, run, status, and flush subcommands.
  • Includes README.md (full protocol docs) and CLAUDE.md (quick reference for Claude Code).

Tested

  • Ran Qwen3.5-35B-A3B MoE PTQ with nvfp4_experts_only quantization via the relay:
    bash examples/llm_ptq/scripts/huggingface_example.sh --model /hf-local/Qwen/Qwen3.5-35B-A3B/ --quant nvfp4_experts_only
    
    • 42,140 quantizers inserted, MTP layers correctly excluded
    • Quantized checkpoint exported successfully (~208s, 147.61 GB peak GPU memory)

Test plan

  • Start server.sh inside a Docker container with the repo mounted
  • Run client.sh handshake from the host
  • Run client.sh run "echo hello" and verify output
  • Run client.sh flush and verify .relay/ is cleared
  • Run a real PTQ workload end-to-end

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added a file-based command relay system with host↔container client and server CLIs, supporting handshake, run, status and flush workflows to execute commands inside containers.
  • Documentation

    • Added guides describing the relay protocol, usage examples, CLI options, lifecycle, and operational notes (workdir, timeouts, sequential execution).
  • Chores

    • Updated ignore rules to exclude ephemeral relay artifacts.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b550aa8e-3831-4b21-a83d-7398c6297a22

📥 Commits

Reviewing files that changed from the base of the PR and between 6362761 and 5806895.

📒 Files selected for processing (2)
  • tools/debugger/client.sh
  • tools/debugger/server.sh
✅ Files skipped from review due to trivial changes (1)
  • tools/debugger/server.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • tools/debugger/client.sh

📝 Walkthrough

Walkthrough

A file-based command relay: a Docker-hosted server monitors a shared .relay/ directory and executes timestamped command scripts; a host client performs a handshake, submits commands, polls for results, and cleans up. Documentation and a gitignore entry were added.

Changes

Cohort / File(s) Summary
Configuration
tools/debugger/.gitignore
Added .relay/ to ignore relay artifacts.
Documentation
tools/debugger/README.md, tools/debugger/CLAUDE.md
New docs describing the .relay/ protocol, handshake and command lifecycle, CLI usage examples, workdir behavior, and operational notes.
Client implementation
tools/debugger/client.sh
New Bash client CLI with subcommands handshake, run, status, flush; supports --relay-dir and --timeout; writes atomic cmd/<id>.sh, polls for result/<id>.exit, handles timeouts and cleanup.
Server implementation
tools/debugger/server.sh
New Bash server daemon for Docker: initializes/clears relay dirs, writes server.ready, completes handshake, executes cmd/*.sh in configured workdir, writes result/<id>.log and result/<id>.exit, and cleans up on shutdown.

Sequence Diagram

sequenceDiagram
    participant Host as Host<br/>(client.sh)
    participant FS as Shared<br/>Filesystem
    participant Docker as Docker<br/>(server.sh)

    rect rgba(100,150,255,0.5)
    Note over Host,Docker: Handshake
    Docker->>FS: Write server.ready
    Host->>FS: Read server.ready
    Host->>FS: Write client.ready
    Docker->>FS: Read client.ready
    Docker->>FS: Write handshake.done
    end

    rect rgba(100,200,100,0.5)
    Note over Host,Docker: Command Execution Cycle
    Host->>FS: Write cmd/<id>.sh
    Docker->>FS: Detect cmd/<id>.sh
    Docker->>Docker: Execute cmd/<id>.sh in workdir
    Docker->>FS: Write result/<id>.log (stdout+stderr)
    Docker->>FS: Write result/<id>.exit (exit code)
    Docker->>FS: Delete cmd/<id>.sh
    Host->>FS: Read result/<id>.log and result/<id>.exit
    Host->>FS: Delete result files
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the main addition: a file-based command relay system for executing commands in remote Docker containers.
Security Anti-Patterns ✅ Passed PR #1174 adds only shell scripts, markdown, and .gitignore to tools/debugger/ with no Python code or security anti-patterns.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chenjiel/debugger

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (2)
tools/debugger/README.md (1)

20-25: Document the trust boundary for .relay/ explicitly.

Please add a note that any actor with write access to .relay/cmd/ can execute commands in-container, and recommend restricting mount permissions to trusted users/processes only.

Suggested wording
 ## Assumptions
 
 - The ModelOpt repo is accessible from both host and container (e.g., bind-mounted)
 - **HuggingFace models** are mounted at `/hf-local`
 - The server auto-detects the repo root from the location of `server.sh`
+- The shared `.relay/` path is writable only by trusted users/processes; write access to
+  `.relay/cmd/` implies command execution capability inside the container.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/README.md` around lines 20 - 25, Add an explicit
trust-boundary note to tools/debugger/README.md stating that any actor with
write access to .relay/cmd/ can execute commands inside the container (since
server.sh auto-detects the repo root and will run dispatched commands), and
advise restricting mount permissions so only trusted users/processes can write
to .relay/cmd/ (e.g., use read-only mounts where possible, limit host user
ownership/ACLs, and avoid bind-mounting directories from untrusted sources).
tools/debugger/client.sh (1)

39-45: Add validation for option values and timeout argument type.

Lines 41–42 can fail under set -u if the option argument is missing, and line 102 assumes TIMEOUT is numeric without validation. Use a validation helper to ensure options receive valid arguments and that TIMEOUT is an integer before arithmetic comparison.

Suggested hardening
+require_value() {
+    local opt="$1"
+    local val="${2-}"
+    if [[ -z "$val" || "$val" == --* ]]; then
+        echo "ERROR: ${opt} requires a value."
+        exit 1
+    fi
+}
+
 while [[ $# -gt 0 ]]; do
     case "$1" in
-        --relay-dir) RELAY_DIR="$2"; shift 2 ;;
-        --timeout) TIMEOUT="$2"; shift 2 ;;
+        --relay-dir)
+            require_value "$1" "${2-}"
+            RELAY_DIR="$2"
+            shift 2
+            ;;
+        --timeout)
+            require_value "$1" "${2-}"
+            [[ "$2" =~ ^[0-9]+$ ]] || { echo "ERROR: --timeout must be an integer."; exit 1; }
+            TIMEOUT="$2"
+            shift 2
+            ;;
         *) break ;;
     esac
 done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/client.sh` around lines 39 - 45, The option parsing loop can
dereference missing arguments under set -u and TIMEOUT is used as a number later
without validation; add a small helper (e.g., require_arg) used by the parsing
case to ensure flags like --relay-dir and --timeout have a following value
before shifting, and after parsing validate that TIMEOUT is an integer (use a
regex/grep check) before any arithmetic comparison (the code that reads/uses
TIMEOUT later). Update the case branches that set RELAY_DIR and TIMEOUT to call
the helper and fail with a clear message if the argument is absent or TIMEOUT is
not numeric.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/debugger/client.sh`:
- Line 95: Replace the direct write to "$CMD_DIR/$cmd_id.sh" with an atomic
publish: create a temp file in the same directory (e.g.,
"$CMD_DIR/$cmd_id.sh.tmp.$$" or via mktemp in CMD_DIR), write the command
contents into that temp file, flush/sync and set executable permissions, then
rename (mv) the temp file to "$CMD_DIR/$cmd_id.sh" so the server never sees a
partially-written file; update the code around the echo "$*" >
"$CMD_DIR/$cmd_id.sh" usage to perform these steps using CMD_DIR and cmd_id.
- Around line 139-142: The flush case calls rm -rf on RELAY_DIR without
validation; before removing, canonicalize RELAY_DIR (e.g., via realpath) and
validate it is non-empty, not "/" and not a top-level system path (reject
prefixes like /, /etc, /bin, /usr, /boot, /sbin, /proc, /sys, /dev), ensure the
resolved path exists and is a subdirectory of an expected safe base (or require
explicit user confirmation if not), and only then perform rm -rf; update the
flush handling in client.sh (the flush case and any code that parses
--relay-dir) to perform these guards.
- Line 135: The command substitution setting pending uses ls and a fallback that
fails under set -euo pipefail; replace it so missing/invalid CMD_DIR yields 0
reliably by either first testing [ -d "$CMD_DIR" ] and setting pending=0 when
absent, or by using find on "$CMD_DIR" (with -maxdepth 1 and -name '*.sh' and
-type f) piped to wc -l while redirecting find errors to /dev/null; update the
assignment that sets pending to use one of these robust approaches (referencing
the pending variable and CMD_DIR).

In `@tools/debugger/server.sh`:
- Around line 103-121: The loop currently iterates over cmd files and deletes
them after running, which can replay on restart; change to atomically claim each
command before running by renaming/moving the file (using mv) to a processing
name (e.g., append .claimed or move into a PROCESS_DIR) and operate on that
claimed path instead of the original $CMD_DIR/$cmd_file; if the mv fails (no-op
or already claimed) skip that file, then run the claimed file from $WORKDIR, tee
output to $RESULT_DIR/$cmd_id.log, write the exit code to
$RESULT_DIR/$cmd_id.exit, and finally remove or archive the claimed file to mark
completion (all references: CMD_DIR, cmd_file, cmd_id, WORKDIR, RESULT_DIR).
- Line 76: The rm -rf "$RELAY_DIR" call is unsafe; add a guard in server.sh that
validates RELAY_DIR (use realpath/resolved path) and refuse to delete if it's
empty, "/", "/", ".", "/", or equals root or other critical system paths, and
ensure it is within an expected base directory (for example the project
workspace or /tmp) or contains an expected substring; if the check fails, print
an error and exit non‑zero. Locate the removal statement referencing RELAY_DIR
and replace it with the preflight validation (resolve path, check not
root/empty/"/" and check prefix or allowed list) before performing the rm -rf.

---

Nitpick comments:
In `@tools/debugger/client.sh`:
- Around line 39-45: The option parsing loop can dereference missing arguments
under set -u and TIMEOUT is used as a number later without validation; add a
small helper (e.g., require_arg) used by the parsing case to ensure flags like
--relay-dir and --timeout have a following value before shifting, and after
parsing validate that TIMEOUT is an integer (use a regex/grep check) before any
arithmetic comparison (the code that reads/uses TIMEOUT later). Update the case
branches that set RELAY_DIR and TIMEOUT to call the helper and fail with a clear
message if the argument is absent or TIMEOUT is not numeric.

In `@tools/debugger/README.md`:
- Around line 20-25: Add an explicit trust-boundary note to
tools/debugger/README.md stating that any actor with write access to .relay/cmd/
can execute commands inside the container (since server.sh auto-detects the repo
root and will run dispatched commands), and advise restricting mount permissions
so only trusted users/processes can write to .relay/cmd/ (e.g., use read-only
mounts where possible, limit host user ownership/ACLs, and avoid bind-mounting
directories from untrusted sources).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 18c84092-667e-4170-a1fa-fdc7a6efef5f

📥 Commits

Reviewing files that changed from the base of the PR and between 4afac7f and 33eb9c5.

📒 Files selected for processing (5)
  • tools/debugger/.gitignore
  • tools/debugger/CLAUDE.md
  • tools/debugger/README.md
  • tools/debugger/client.sh
  • tools/debugger/server.sh

Comment on lines +139 to +142
flush)
if [[ -d "$RELAY_DIR" ]]; then
rm -rf "$RELAY_DIR"
echo "Relay directory cleared: $RELAY_DIR"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/usr/bin/env bash
set -euo pipefail
rg -n -C2 'flush\)|rm -rf "\$RELAY_DIR"' tools/debugger/client.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 255


🏁 Script executed:

cat tools/debugger/client.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 4957


Add validation guards before recursive delete in flush command.

RELAY_DIR is user-controllable via the --relay-dir option and passed directly to rm -rf without validation. A user could run ./client.sh --relay-dir / flush or similar to delete critical system directories. Add safeguards to reject unsafe paths:

Recommended fix
     flush)
+        if [[ -z "$RELAY_DIR" || "$RELAY_DIR" == "/" ]]; then
+            echo "ERROR: Refusing to clear unsafe relay dir: '$RELAY_DIR'"
+            exit 1
+        fi
         if [[ -d "$RELAY_DIR" ]]; then
-            rm -rf "$RELAY_DIR"
+            rm -rf -- "$RELAY_DIR"
             echo "Relay directory cleared: $RELAY_DIR"
         else
             echo "Relay directory does not exist: $RELAY_DIR"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
flush)
if [[ -d "$RELAY_DIR" ]]; then
rm -rf "$RELAY_DIR"
echo "Relay directory cleared: $RELAY_DIR"
flush)
if [[ -z "$RELAY_DIR" || "$RELAY_DIR" == "/" ]]; then
echo "ERROR: Refusing to clear unsafe relay dir: '$RELAY_DIR'"
exit 1
fi
if [[ -d "$RELAY_DIR" ]]; then
rm -rf -- "$RELAY_DIR"
echo "Relay directory cleared: $RELAY_DIR"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/client.sh` around lines 139 - 142, The flush case calls rm -rf
on RELAY_DIR without validation; before removing, canonicalize RELAY_DIR (e.g.,
via realpath) and validate it is non-empty, not "/" and not a top-level system
path (reject prefixes like /, /etc, /bin, /usr, /boot, /sbin, /proc, /sys,
/dev), ensure the resolved path exists and is a subdirectory of an expected safe
base (or require explicit user confirmation if not), and only then perform rm
-rf; update the flush handling in client.sh (the flush case and any code that
parses --relay-dir) to perform these guards.

export PYTHONPATH="$WORKDIR"

# Initialize relay directories
rm -rf "$RELAY_DIR"
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/usr/bin/env bash
set -euo pipefail
rg -n -C2 'rm -rf "\$RELAY_DIR"|--relay-dir' tools/debugger/server.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 815


🏁 Script executed:

cat -n tools/debugger/server.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 4816


Guard startup cleanup against unsafe relay paths.

Line 76 executes rm -rf on a configurable path without validating it first. An attacker or misconfigured invocation could pass --relay-dir / or another critical path, causing data loss. Add explicit safety checks before deletion.

Suggested guard
-# Initialize relay directories
-rm -rf "$RELAY_DIR"
+# Initialize relay directories
+if [[ -z "$RELAY_DIR" || "$RELAY_DIR" == "/" ]]; then
+    echo "[server] ERROR: refusing unsafe relay dir: '$RELAY_DIR'"
+    exit 1
+fi
+rm -rf -- "$RELAY_DIR"
 mkdir -p "$CMD_DIR" "$RESULT_DIR"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/server.sh` at line 76, The rm -rf "$RELAY_DIR" call is unsafe;
add a guard in server.sh that validates RELAY_DIR (use realpath/resolved path)
and refuse to delete if it's empty, "/", "/", ".", "/", or equals root or other
critical system paths, and ensure it is within an expected base directory (for
example the project workspace or /tmp) or contains an expected substring; if the
check fails, print an error and exit non‑zero. Locate the removal statement
referencing RELAY_DIR and replace it with the preflight validation (resolve
path, check not root/empty/"/" and check prefix or allowed list) before
performing the rm -rf.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment from Claude

  1. rm -rf "$RELAY_DIR" on server startup is risky (server.sh:91)
    If RELAY_DIR is accidentally set to an empty string or /, this would be destructive. Add a guard:
    [[ -n "$RELAY_DIR" && "$RELAY_DIR" != "/" ]] || { echo "Invalid relay dir"; exit 1; }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chadvoegele, agreed — glad Claude flagged this too. The guard can be made slightly more robust by combining both suggestions, also handling edge cases like . and paths that don't start with / when the user likely intends an absolute path:

# Initialize relay directories
if [[ -z "$RELAY_DIR" || "$RELAY_DIR" == "/" || "$RELAY_DIR" == "." ]]; then
    echo "[server] ERROR: refusing unsafe relay dir: '${RELAY_DIR}'"
    exit 1
fi
rm -rf -- "$RELAY_DIR"
mkdir -p "$CMD_DIR" "$RESULT_DIR"

The -- after rm -rf is also good practice to prevent argument injection if RELAY_DIR somehow starts with a -. The client.sh flush subcommand similarly removes cmd/ and result/ directories — worth verifying it has equivalent guards if --relay-dir is user-supplied there as well.

Comment on lines +103 to +121
while true; do
for cmd_file in "$CMD_DIR"/*.sh; do
[[ -e "$cmd_file" ]] || continue

cmd_id="$(basename "$cmd_file" .sh)"
echo "[server] Executing command $cmd_id..."

# Execute the command, tee stdout+stderr to console and result file
set +e
(cd "$WORKDIR" && bash "$cmd_file" 2>&1) | tee "$RESULT_DIR/$cmd_id.log"
exit_code=${PIPESTATUS[0]}
set -e

# Write exit code
echo "$exit_code" > "$RESULT_DIR/$cmd_id.exit"

# Remove the command file to mark it as processed
rm -f "$cmd_file"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/usr/bin/env bash
set -euo pipefail
rg -n -C3 'for cmd_file in "\$CMD_DIR"/\*\.sh|bash "\$cmd_file"|rm -f "\$cmd_file"' tools/debugger/server.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 776


Claim command files before execution to prevent replay on server restart.

The current flow executes commands, then deletes them afterward. If the server crashes between execution and deletion, the same command file will be re-executed on restart.

Use atomic claiming with mv before execution to prevent this race condition:

Atomic claim-before-run pattern
 while true; do
     for cmd_file in "$CMD_DIR"/*.sh; do
         [[ -e "$cmd_file" ]] || continue
 
         cmd_id="$(basename "$cmd_file" .sh)"
+        claimed_file="$CMD_DIR/$cmd_id.running.$$"
+        if ! mv "$cmd_file" "$claimed_file" 2>/dev/null; then
+            continue
+        fi
         echo "[server] Executing command $cmd_id..."
 
         # Execute the command, tee stdout+stderr to console and result file
         set +e
-        (cd "$WORKDIR" && bash "$cmd_file" 2>&1) | tee "$RESULT_DIR/$cmd_id.log"
+        (cd "$WORKDIR" && bash "$claimed_file" 2>&1) | tee "$RESULT_DIR/$cmd_id.log"
         exit_code=${PIPESTATUS[0]}
         set -e
@@
-        rm -f "$cmd_file"
+        rm -f "$claimed_file"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
while true; do
for cmd_file in "$CMD_DIR"/*.sh; do
[[ -e "$cmd_file" ]] || continue
cmd_id="$(basename "$cmd_file" .sh)"
echo "[server] Executing command $cmd_id..."
# Execute the command, tee stdout+stderr to console and result file
set +e
(cd "$WORKDIR" && bash "$cmd_file" 2>&1) | tee "$RESULT_DIR/$cmd_id.log"
exit_code=${PIPESTATUS[0]}
set -e
# Write exit code
echo "$exit_code" > "$RESULT_DIR/$cmd_id.exit"
# Remove the command file to mark it as processed
rm -f "$cmd_file"
while true; do
for cmd_file in "$CMD_DIR"/*.sh; do
[[ -e "$cmd_file" ]] || continue
cmd_id="$(basename "$cmd_file" .sh)"
claimed_file="$CMD_DIR/$cmd_id.running.$$"
if ! mv "$cmd_file" "$claimed_file" 2>/dev/null; then
continue
fi
echo "[server] Executing command $cmd_id..."
# Execute the command, tee stdout+stderr to console and result file
set +e
(cd "$WORKDIR" && bash "$claimed_file" 2>&1) | tee "$RESULT_DIR/$cmd_id.log"
exit_code=${PIPESTATUS[0]}
set -e
# Write exit code
echo "$exit_code" > "$RESULT_DIR/$cmd_id.exit"
# Remove the command file to mark it as processed
rm -f "$claimed_file"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/server.sh` around lines 103 - 121, The loop currently iterates
over cmd files and deletes them after running, which can replay on restart;
change to atomically claim each command before running by renaming/moving the
file (using mv) to a processing name (e.g., append .claimed or move into a
PROCESS_DIR) and operate on that claimed path instead of the original
$CMD_DIR/$cmd_file; if the mv fails (no-op or already claimed) skip that file,
then run the claimed file from $WORKDIR, tee output to $RESULT_DIR/$cmd_id.log,
write the exit code to $RESULT_DIR/$cmd_id.exit, and finally remove or archive
the claimed file to mark completion (all references: CMD_DIR, cmd_file, cmd_id,
WORKDIR, RESULT_DIR).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 3, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-06 05:45 UTC

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
tools/debugger/client.sh (2)

142-143: ⚠️ Potential issue | 🟡 Minor

Use find for pending count robustness (same issue as prior review).

Line 143 still relies on ls for counting *.sh files. This is brittle with shell/glob edge cases; find is safer and aligns with SC2012.

#!/usr/bin/env bash
set -euo pipefail

tmp="$(mktemp -d)"
touch "$tmp/a.sh" "$tmp/b.sh"

echo "Current pattern:"
bash -lc "set -euo pipefail; CMD_DIR='$tmp'; pending=\$(ls \"\$CMD_DIR\"/*.sh 2>/dev/null | wc -l); echo pending=\$pending"

echo "Recommended pattern:"
bash -lc "set -euo pipefail; CMD_DIR='$tmp'; pending=\$(find \"\$CMD_DIR\" -maxdepth 1 -type f -name '*.sh' | wc -l); echo pending=\$pending"

rm -rf "$tmp"
Suggested change
         if [[ -d "$CMD_DIR" ]]; then
-            pending=$(ls "$CMD_DIR"/*.sh 2>/dev/null | wc -l)
+            pending=$(find "$CMD_DIR" -maxdepth 1 -type f -name '*.sh' | wc -l)
         else
             pending=0
         fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/client.sh` around lines 142 - 143, Replace the fragile
ls-based count of shell scripts with a find-based count: in the block that
checks [[ -d "$CMD_DIR" ]], update the computation of the pending variable (the
pending= assignment) to use find on $CMD_DIR with -maxdepth 1 -type f -name
'*.sh' and pipe to wc -l (suppressing errors) so it robustly counts files and
satisfies SC2012; keep the existing conditional around CMD_DIR and preserve
error redirection to /dev/null.

150-153: ⚠️ Potential issue | 🔴 Critical

Guard rm -rf against unsafe relay paths (still unresolved).

Line 152 deletes a user-provided path without safety checks. --relay-dir / (or similarly dangerous paths) can cause destructive deletion.

Suggested safety checks
     flush)
         if [[ -d "$RELAY_DIR" ]]; then
-            rm -rf "$RELAY_DIR"
+            resolved="$(realpath -m "$RELAY_DIR")"
+            case "$resolved" in
+                "/"|"/bin"|"/boot"|"/dev"|"/etc"|"/lib"|"/lib64"|"/proc"|"/root"|"/run"|"/sbin"|"/sys"|"/usr"|"/var")
+                    echo "ERROR: Refusing to clear unsafe relay dir: $resolved"
+                    exit 1
+                    ;;
+            esac
+            rm -rf -- "$resolved"
             echo "Relay directory cleared: $RELAY_DIR"
         else
             echo "Relay directory does not exist: $RELAY_DIR"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/client.sh` around lines 150 - 153, The flush case currently
runs rm -rf "$RELAY_DIR" unsafely; modify the flush) branch to validate
RELAY_DIR before deletion: ensure RELAY_DIR is non-empty, resolve it with
realpath (or readlink -f) into a variable (e.g., resolved_relay_dir), reject if
it equals "/" or "." or "/" + "" or contains ".." or is shorter than a minimal
safe prefix, and (preferably) enforce it is under an allowed base directory (or
match a whitelist/regex); only after passing these checks prompt the user for
confirmation and then run rm -rf "$resolved_relay_dir". Reference the flush)
case and the RELAY_DIR usage to locate where to add the validation and
confirmation logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/debugger/client.sh`:
- Around line 84-96: The run command handler currently allows empty invocations
and writes an empty script; add an early guard in the run case (before
generating cmd_id and writing files) that checks for no arguments (e.g., test
"$#" -eq 0 or test -z "$*") and prints a usage/error like "ERROR: No command
provided. Usage: client.sh run <command>" then exit 1; keep the rest of the
logic that creates cmd_id, writes to "$CMD_DIR/$cmd_id.sh.tmp" and moves it
intact.

---

Duplicate comments:
In `@tools/debugger/client.sh`:
- Around line 142-143: Replace the fragile ls-based count of shell scripts with
a find-based count: in the block that checks [[ -d "$CMD_DIR" ]], update the
computation of the pending variable (the pending= assignment) to use find on
$CMD_DIR with -maxdepth 1 -type f -name '*.sh' and pipe to wc -l (suppressing
errors) so it robustly counts files and satisfies SC2012; keep the existing
conditional around CMD_DIR and preserve error redirection to /dev/null.
- Around line 150-153: The flush case currently runs rm -rf "$RELAY_DIR"
unsafely; modify the flush) branch to validate RELAY_DIR before deletion: ensure
RELAY_DIR is non-empty, resolve it with realpath (or readlink -f) into a
variable (e.g., resolved_relay_dir), reject if it equals "/" or "." or "/" + ""
or contains ".." or is shorter than a minimal safe prefix, and (preferably)
enforce it is under an allowed base directory (or match a whitelist/regex); only
after passing these checks prompt the user for confirmation and then run rm -rf
"$resolved_relay_dir". Reference the flush) case and the RELAY_DIR usage to
locate where to add the validation and confirmation logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fee62419-0770-464f-b29c-6aacef9c8efa

📥 Commits

Reviewing files that changed from the base of the PR and between 33eb9c5 and 419e260.

📒 Files selected for processing (2)
  • tools/debugger/client.sh
  • tools/debugger/server.sh
✅ Files skipped from review due to trivial changes (1)
  • tools/debugger/server.sh

Comment on lines +39 to +43
while [[ $# -gt 0 ]]; do
case "$1" in
--relay-dir) RELAY_DIR="$2"; shift 2 ;;
--timeout) TIMEOUT="$2"; shift 2 ;;
*) break ;;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate option values before use.

Line 41 and Line 42 assume $2 exists, and Line 109 assumes TIMEOUT is numeric. client.sh --timeout foo or missing option values can fail with shell errors instead of a clear user-facing message.

Suggested hardening
 while [[ $# -gt 0 ]]; do
     case "$1" in
-        --relay-dir) RELAY_DIR="$2"; shift 2 ;;
-        --timeout) TIMEOUT="$2"; shift 2 ;;
+        --relay-dir)
+            [[ $# -ge 2 ]] || { echo "ERROR: --relay-dir requires a value."; exit 1; }
+            RELAY_DIR="$2"; shift 2 ;;
+        --timeout)
+            [[ $# -ge 2 ]] || { echo "ERROR: --timeout requires a value."; exit 1; }
+            TIMEOUT="$2"; shift 2 ;;
         *) break ;;
     esac
 done
+
+[[ "$TIMEOUT" =~ ^[0-9]+$ && "$TIMEOUT" -gt 0 ]] || {
+    echo "ERROR: --timeout must be a positive integer."
+    exit 1
+}

Also applies to: 109-109

Comment on lines +84 to +96
run)
# Verify handshake was done
if [[ ! -f "$RELAY_DIR/handshake.done" ]]; then
echo "ERROR: Not connected. Run 'client.sh handshake' first."
exit 1
fi

# Generate a unique command ID (timestamp + PID to avoid collisions)
cmd_id="$(date +%s%N)_$$"

# Write the command file atomically (tmp + mv)
echo "$*" > "$CMD_DIR/$cmd_id.sh.tmp"
mv "$CMD_DIR/$cmd_id.sh.tmp" "$CMD_DIR/$cmd_id.sh"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Reject empty run commands early.

Line 84 onward does not validate command presence. run without args currently writes an empty script. Return a usage error when no command is provided.

Suggested guard
     run)
+        if [[ $# -eq 0 ]]; then
+            echo "ERROR: Missing command. Usage: $0 run <command...>"
+            exit 1
+        fi
         # Verify handshake was done
         if [[ ! -f "$RELAY_DIR/handshake.done" ]]; then
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/client.sh` around lines 84 - 96, The run command handler
currently allows empty invocations and writes an empty script; add an early
guard in the run case (before generating cmd_id and writing files) that checks
for no arguments (e.g., test "$#" -eq 0 or test -z "$*") and prints a
usage/error like "ERROR: No command provided. Usage: client.sh run <command>"
then exit 1; keep the rest of the logic that creates cmd_id, writes to
"$CMD_DIR/$cmd_id.sh.tmp" and moves it intact.

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.07%. Comparing base (00c002f) to head (5806895).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1174      +/-   ##
==========================================
- Coverage   74.27%   74.07%   -0.20%     
==========================================
  Files         349      349              
  Lines       39846    40277     +431     
==========================================
+ Hits        29594    29835     +241     
- Misses      10252    10442     +190     
Flag Coverage Δ
unit 54.53% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Collaborator

@ChenhanYu ChenhanYu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: Add file-based command relay for remote Docker testing

Summary

Clean, well-documented developer tool. File-based relay protocol is simple and effective — no networking, atomic file writes for result signaling, good handshake/status/flush subcommands. Documentation (README + CLAUDE.md) is thorough. Tested with a real PTQ workload. A few items below.

Comments

1. sync with file argument is not portable (server.sh:126)

On Linux, sync doesn't accept file arguments (that's macOS/BSD). This silently syncs all filesystems, which is slow and not the intent. Since you already use atomic tmp+mv for .exit and the client only reads .log after .exit exists, the ordering is correct without this line. Safe to remove.

2. Client timeout doesn't cancel server-side execution (client.sh:113-116)

Removing the .sh file only prevents future execution. If the server already picked up the command, it continues running with no way to cancel. Consider writing a $cmd_id.cancel file that the server checks, or documenting this as a known limitation.

3. pip install -e .[dev] on every server start (server.sh:98-101)

Runs on every server start even when the environment is already set up. For a large project this can take 30+ seconds. Consider a --skip-install flag or checking if modelopt is already installed.

4. kill -- -$$ may affect unrelated processes (server.sh:79)

Sends SIGTERM to the entire process group. If the server was launched from an interactive shell (common in Docker), $$ is the shell's process group, which could include other processes. Safer to track child PIDs explicitly or use pkill -P $$.

5. Copyright year is 2024

New files added in 2026 have Copyright (c) 2024. Should be 2026.

6. Minor: ls output parsed for counting (client.sh:143)

ls "$CMD_DIR"/*.sh 2>/dev/null | wc -l — fragile if filenames contain newlines. Safer: files=("$CMD_DIR"/*.sh); pending=${#files[@]}.


Overall: clean dev tool, no blockers. The sync portability issue and forced pip install are the most actionable.

This is an AI-assisted review — human sign-off required before merging.

cjluo-nv added 2 commits April 3, 2026 18:54
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
export PYTHONPATH="$WORKDIR"

# Initialize relay directories
rm -rf "$RELAY_DIR"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment from Claude

  1. rm -rf "$RELAY_DIR" on server startup is risky (server.sh:91)
    If RELAY_DIR is accidentally set to an empty string or /, this would be destructive. Add a guard:
    [[ -n "$RELAY_DIR" && "$RELAY_DIR" != "/" ]] || { echo "Invalid relay dir"; exit 1; }

| Path | Description |
|------|-------------|
| Repo root (auto-detected) | ModelOpt source, used as workdir |
| `/hf-local` | HuggingFace model cache |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the file-based relay doesn't have anything to do with hf-local , right?

Further I think if you rooted the whole thing under '${model-opt}/tools/debugger', then you could remove the model opt references too and this could be purely a portable file-based relay.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the file-based relay doesn't have anything to do with hf-local , right? Yes

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. This is to provide the info so that claude knows where to modify code regards to modelopt during the debugging process.

Copy link
Copy Markdown
Collaborator

@ChenhanYu ChenhanYu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All review feedback addressed cleanly. sync removed, copyright updated, lsfind, kill -- -$$pkill -P $$, pip install now skips when already editable-installed. Nice additions: re-handshake support and smarter flush that preserves server.ready. LGTM.

@cjluo-nv cjluo-nv merged commit f755722 into main Apr 6, 2026
38 checks passed
@cjluo-nv cjluo-nv deleted the chenjiel/debugger branch April 6, 2026 05:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants