Merged
1 change: 1 addition & 0 deletions tools/debugger/.gitignore
@@ -0,0 +1 @@
.relay/
62 changes: 62 additions & 0 deletions tools/debugger/CLAUDE.md
@@ -0,0 +1,62 @@
# Remote Command Relay

This directory contains a file-based command relay for executing commands inside a remote Docker
container from the host machine (where Claude Code runs).

## How to Use (for Claude Code)

### Setup (one-time per session)

The user must start the server inside Docker first:

```bash
# Inside Docker container (auto-detects repo root from script location):
bash /path/to/modelopt/tools/debugger/server.sh
```

Then Claude Code performs the handshake:

```bash
bash tools/debugger/client.sh handshake
```

### Running Commands

```bash
# Run any command in the Docker container (workdir = auto-detected repo root):
bash tools/debugger/client.sh run "<command>"

# For long-running tasks, increase timeout:
bash tools/debugger/client.sh --timeout 1800 run "<command>"
```

### Key Paths Inside Docker

| Path | Description |
|------|-------------|
| Repo root (auto-detected) | ModelOpt source, used as workdir |
| `/hf-local` | HuggingFace model cache |
Contributor

The file-based relay itself doesn't have anything to do with hf-local, right?

Further, I think if you rooted the whole thing under '${model-opt}/tools/debugger', you could remove the ModelOpt references too, and this could be a purely portable file-based relay.

Collaborator Author

> Just the file-based relay doesn't have anything to do with hf-local, right?

Yes.

Collaborator Author

Sure. This is there to tell Claude where to modify ModelOpt code during the debugging process.


### Examples

```bash
# Run PTQ test
bash tools/debugger/client.sh run "bash llm_ptq/scripts/huggingface_example.sh"

# Run pytest
bash tools/debugger/client.sh run "python -m pytest tests/gpu -k test_quantize"

# Check GPU
bash tools/debugger/client.sh run "nvidia-smi"

# Use HF models from local cache
bash tools/debugger/client.sh run "python script.py --model /hf-local/Qwen/Qwen3-8B"
```

### Important Notes

- The server must be started by the user manually inside Docker before the handshake.
- Default command timeout is 600 seconds (10 minutes). Use `--timeout` for longer tasks.
- Commands execute sequentially — one at a time.
- All commands run with the auto-detected repo root as the working directory.
- The `.relay/` directory is ephemeral and git-ignored.
109 changes: 109 additions & 0 deletions tools/debugger/README.md
@@ -0,0 +1,109 @@
# File-Based Command Relay (Debugger)

A lightweight client/server system for running commands inside a Docker container from the host,
using only a shared filesystem — no networking required.

## Overview

```text
Host (Claude Code)                   Docker Container
┌─────────────┐                      ┌─────────────────┐
│ client.sh   │  writes cmd file     │ server.sh       │
│ run "X"     │ ───────────────────► │ detects cmd     │
│             │                      │ executes X      │
│ reads       │  writes result file  │ writes result   │
│ result      │ ◄─────────────────── │                 │
└─────────────┘                      └─────────────────┘
       └──── shared filesystem (.relay/) ────┘
```

## Assumptions

- The ModelOpt repo is accessible from both host and container (e.g., bind-mounted)
- **HuggingFace models** are mounted at `/hf-local`
- The server auto-detects the repo root from the location of `server.sh`

## Quick Start

### 1. Start the server (inside Docker)

```bash
# The server auto-detects the repo root (two levels up from tools/debugger/)
bash /path/to/modelopt/tools/debugger/server.sh
```

The server automatically sets the working directory to the repo root. You can override with `--workdir`.

### 2. Connect from the host

```bash
bash tools/debugger/client.sh handshake
```

### 3. Run commands

```bash
# Run a simple command
bash tools/debugger/client.sh run "echo hello"

# Run a test script
bash tools/debugger/client.sh run "bash llm_ptq/scripts/huggingface_example.sh"

# Run with a long timeout (default is 600s)
bash tools/debugger/client.sh --timeout 1800 run "python my_long_test.py"

# Check status
bash tools/debugger/client.sh status
```

## Protocol

The relay uses a directory at `tools/debugger/.relay/` with this structure:

```text
.relay/
├── server.ready          # Written by server on startup
├── client.ready          # Written by client during handshake
├── handshake.done        # Written by server to confirm handshake
├── cmd/                  # Client writes command .sh files here
│   └── <id>.sh           # Command to execute
└── result/               # Server writes results here
    ├── <id>.log          # stdout + stderr
    └── <id>.exit         # Exit code
```

### Handshake

1. Server starts, creates `.relay/server.ready`
2. Client writes `.relay/client.ready`
3. Server detects it, writes `.relay/handshake.done`
4. Both sides are now connected
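
The four steps above can be sketched in a single script. This is a minimal simulation, not the relay's actual code: `server_side` is a hypothetical stand-in for `server.sh` (which is not reproduced in this README), a temporary directory stands in for `.relay/`, and the client waits for `server.ready` instead of failing immediately as `client.sh` does, because both sides start at the same moment here.

```shell
#!/usr/bin/env bash
# Handshake sketch: both sides simulated from one script, using a
# throwaway directory in place of tools/debugger/.relay/.
set -euo pipefail

relay_dir="$(mktemp -d)"
trap 'rm -rf "$relay_dir"' EXIT

# Server side (hypothetical stand-in for server.sh).
server_side() {
    echo "server:$$" > "$relay_dir/server.ready"        # step 1
    while [[ ! -f "$relay_dir/client.ready" ]]; do      # wait for step 2
        sleep 0.1   # fractional sleep: assumes GNU, BSD, or BusyBox sleep
    done
    echo "ok" > "$relay_dir/handshake.done"             # step 3
}

# Client side (same file-polling idea as client.sh, with a 10s cap).
client_side() {
    while [[ ! -f "$relay_dir/server.ready" ]]; do
        sleep 0.1
    done
    echo "client:$$" > "$relay_dir/client.ready"        # step 2
    local waited=0
    while [[ ! -f "$relay_dir/handshake.done" ]]; do
        sleep 0.1
        waited=$((waited + 1))
        if (( waited >= 100 )); then
            echo "handshake timed out" >&2
            return 1
        fi
    done
    echo "Handshake complete."                          # step 4
}

server_side &
server_pid=$!
handshake_output="$(client_side)"
wait "$server_pid"
echo "$handshake_output"
```

Each marker file is written by exactly one side, so neither process needs a lock: the presence of a file is the whole message.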

### Command Execution

1. Client writes a command to `.relay/cmd/<id>.sh`, where `<id>` is a nanosecond timestamp plus the client PID
2. Server detects the file, runs `bash <file>` in the workdir, and captures stdout and stderr
3. Server writes `.relay/result/<id>.log` and `.relay/result/<id>.exit`
4. Server removes the `.sh` file; client prints the log, deletes both result files, and exits with the command's exit code
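
One round trip of this protocol might look like the sketch below. It runs entirely in a temporary directory. `process_pending` is a hypothetical approximation of the server's loop body (the real `server.sh` may differ), and the ID uses `date +%s` rather than the `%s%N` used by `client.sh`, since `%N` is a GNU extension.

```shell
#!/usr/bin/env bash
# One command round trip, simulated in a temporary relay directory.
set -euo pipefail

relay_dir="$(mktemp -d)"
trap 'rm -rf "$relay_dir"' EXIT
cmd_dir="$relay_dir/cmd"
result_dir="$relay_dir/result"
mkdir -p "$cmd_dir" "$result_dir"

# Client, step 1: write under a .tmp name, then rename, so the server
# never picks up a half-written script (the tmp + mv pattern in client.sh).
cmd_id="$(date +%s)_$$"
echo 'echo hello from the relay; exit 3' > "$cmd_dir/$cmd_id.sh.tmp"
mv "$cmd_dir/$cmd_id.sh.tmp" "$cmd_dir/$cmd_id.sh"

# Server, steps 2-4 (hypothetical; an approximation of server.sh).
process_pending() {
    local script id status
    for script in "$cmd_dir"/*.sh; do
        [[ -e "$script" ]] || continue      # the glob matched nothing
        id="$(basename "$script" .sh)"
        status=0
        bash "$script" > "$result_dir/$id.log" 2>&1 || status=$?
        # Publish the exit code last: the client polls for this file,
        # so the log must already be complete when it appears.
        echo "$status" > "$result_dir/$id.exit.tmp"
        mv "$result_dir/$id.exit.tmp" "$result_dir/$id.exit"
        rm -f "$script"
    done
}
process_pending

# Client, step 4: read the results, then clean up.
exit_code="$(cat "$result_dir/$cmd_id.exit")"
output="$(cat "$result_dir/$cmd_id.log")"
rm -f "$result_dir/$cmd_id.exit" "$result_dir/$cmd_id.log"
echo "output: $output"
echo "exit code: $exit_code"
```

Writing `.exit` after `.log` is a design choice of this sketch: the client treats the exit file as the "done" signal, so anything it reads afterwards should already be on disk.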

## Options

### Server

| Flag | Default | Description |
|------|---------|-------------|
| `--relay-dir` | `<script_dir>/.relay` | Relay directory path |
| `--workdir` | Auto-detected repo root | Working directory for commands |

### Client

| Flag | Default | Description |
|------|---------|-------------|
| `--relay-dir` | `<script_dir>/.relay` | Relay directory path |
| `--timeout` | `600` | Seconds to wait for command result |
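
Client flags are global options and are parsed before the subcommand. The sketch below copies the option loop from `client.sh` into a hypothetical helper, `parse_client_args`, to show what happens when a flag is placed after `run`: it is no longer treated as an option and becomes part of the command text.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper mirroring the option loop in client.sh:
# parsing stops at the first word that is not a known flag.
parse_client_args() {
    local relay_dir="" timeout=600
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --relay-dir) relay_dir="$2"; shift 2 ;;
            --timeout)   timeout="$2"; shift 2 ;;
            *) break ;;
        esac
    done
    local subcommand="${1:-}"
    shift || true
    echo "relay_dir=${relay_dir:-default} timeout=$timeout subcommand=$subcommand args=$*"
}

# Flag before the subcommand: parsed as an option.
before="$(parse_client_args --timeout 1800 run "python my_long_test.py")"
# Flag after the subcommand: left in the command string.
after="$(parse_client_args run --timeout 1800 "python my_long_test.py")"
echo "$before"
echo "$after"
```

In the second call the timeout stays at 600 and the remote command would start with `--timeout`, which is almost certainly not what was intended.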

## Notes

- The `.relay/` directory is in `.gitignore` — it is not checked in.
- Only one server should run at a time (startup clears the relay directory).
- Commands run sequentially in the order the server discovers them.
172 changes: 172 additions & 0 deletions tools/debugger/client.sh
@@ -0,0 +1,172 @@
#!/usr/bin/env bash
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# File-based command relay client.
# Run this from the host / Claude Code side. It sends commands to the server
# running inside Docker by writing files to the shared relay directory.
#
# Usage:
#   bash client.sh handshake          - Connect to server
#   bash client.sh run <command...>   - Run a command and print output
#   bash client.sh status             - Check server status
#
# Options:
#   --relay-dir <path>   Path to relay directory (default: <script_dir>/.relay)
#   --timeout <secs>     Timeout waiting for result (default: 600)

set -euo pipefail

RELAY_DIR=""
TIMEOUT=600
POLL_INTERVAL=1

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Parse global options before subcommand
while [[ $# -gt 0 ]]; do
    case "$1" in
        --relay-dir) RELAY_DIR="$2"; shift 2 ;;
        --timeout) TIMEOUT="$2"; shift 2 ;;
        *) break ;;
Comment on lines +39 to +43
Contributor

⚠️ Potential issue | 🟠 Major

Validate option values before use.

Line 41 and Line 42 assume $2 exists, and Line 109 assumes TIMEOUT is numeric. client.sh --timeout foo or missing option values can fail with shell errors instead of a clear user-facing message.

Suggested hardening
 while [[ $# -gt 0 ]]; do
     case "$1" in
-        --relay-dir) RELAY_DIR="$2"; shift 2 ;;
-        --timeout) TIMEOUT="$2"; shift 2 ;;
+        --relay-dir)
+            [[ $# -ge 2 ]] || { echo "ERROR: --relay-dir requires a value."; exit 1; }
+            RELAY_DIR="$2"; shift 2 ;;
+        --timeout)
+            [[ $# -ge 2 ]] || { echo "ERROR: --timeout requires a value."; exit 1; }
+            TIMEOUT="$2"; shift 2 ;;
         *) break ;;
     esac
 done
+
+[[ "$TIMEOUT" =~ ^[0-9]+$ && "$TIMEOUT" -gt 0 ]] || {
+    echo "ERROR: --timeout must be a positive integer."
+    exit 1
+}

Also applies to: 109-109

    esac
done

if [[ -z "$RELAY_DIR" ]]; then
    RELAY_DIR="$SCRIPT_DIR/.relay"
fi

CMD_DIR="$RELAY_DIR/cmd"
RESULT_DIR="$RELAY_DIR/result"

SUBCOMMAND="${1:-}"
shift || true

case "$SUBCOMMAND" in
    handshake)
        # Check server is ready
        if [[ ! -f "$RELAY_DIR/server.ready" ]]; then
            echo "ERROR: Server not ready. Start server.sh in Docker first."
            exit 1
        fi
        SERVER_INFO=$(cat "$RELAY_DIR/server.ready")
        echo "Server found: $SERVER_INFO"

        # Send client handshake
        echo "$(hostname):$$:$(date -Iseconds)" > "$RELAY_DIR/client.ready"

        # Wait for server acknowledgment
        elapsed=0
        while [[ ! -f "$RELAY_DIR/handshake.done" ]]; do
            sleep "$POLL_INTERVAL"
            elapsed=$((elapsed + POLL_INTERVAL))
            if [[ $elapsed -ge 120 ]]; then
                echo "ERROR: Handshake timed out after 120s."
                exit 1
            fi
        done

        echo "Handshake complete."
        ;;

    run)
        # Verify handshake was done
        if [[ ! -f "$RELAY_DIR/handshake.done" ]]; then
            echo "ERROR: Not connected. Run 'client.sh handshake' first."
            exit 1
        fi

        # Generate a unique command ID (timestamp + PID to avoid collisions)
        cmd_id="$(date +%s%N)_$$"

        # Write the command file atomically (tmp + mv)
        echo "$*" > "$CMD_DIR/$cmd_id.sh.tmp"
        mv "$CMD_DIR/$cmd_id.sh.tmp" "$CMD_DIR/$cmd_id.sh"
Comment on lines +84 to +96
Contributor

⚠️ Potential issue | 🟡 Minor

Reject empty run commands early.

Line 84 onward does not validate command presence. run without args currently writes an empty script. Return a usage error when no command is provided.

Suggested guard
     run)
+        if [[ $# -eq 0 ]]; then
+            echo "ERROR: Missing command. Usage: $0 run <command...>"
+            exit 1
+        fi
         # Verify handshake was done
         if [[ ! -f "$RELAY_DIR/handshake.done" ]]; then
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/client.sh` around lines 84 - 96, The run command handler
currently allows empty invocations and writes an empty script; add an early
guard in the run case (before generating cmd_id and writing files) that checks
for no arguments (e.g., test "$#" -eq 0 or test -z "$*") and prints a
usage/error like "ERROR: No command provided. Usage: client.sh run <command>"
then exit 1; keep the rest of the logic that creates cmd_id, writes to
"$CMD_DIR/$cmd_id.sh.tmp" and moves it intact.


        # Wait for result
        elapsed=0
        while [[ ! -f "$RESULT_DIR/$cmd_id.exit" ]]; do
            # Check if server is still alive
            if [[ ! -f "$RELAY_DIR/server.ready" ]]; then
                echo "ERROR: Server appears to have stopped."
                rm -f "$CMD_DIR/$cmd_id.sh"
                exit 1
            fi
            sleep "$POLL_INTERVAL"
            elapsed=$((elapsed + POLL_INTERVAL))
            if [[ $elapsed -ge $TIMEOUT ]]; then
                echo "ERROR: Command timed out after ${TIMEOUT}s."
                # Clean up the pending command
                rm -f "$CMD_DIR/$cmd_id.sh"
                exit 1
            fi
        done

        # Read and display results
        exit_code=$(cat "$RESULT_DIR/$cmd_id.exit")
        if [[ -f "$RESULT_DIR/$cmd_id.log" ]]; then
            cat "$RESULT_DIR/$cmd_id.log"
        fi

        # Clean up result files
        rm -f "$RESULT_DIR/$cmd_id.exit" "$RESULT_DIR/$cmd_id.log"

        exit "$exit_code"
        ;;

    status)
        if [[ -f "$RELAY_DIR/server.ready" ]]; then
            echo "Server: $(cat "$RELAY_DIR/server.ready")"
        else
            echo "Server: not running"
        fi
        if [[ -f "$RELAY_DIR/handshake.done" ]]; then
            echo "Handshake: complete"
        elif [[ -f "$RELAY_DIR/client.ready" ]]; then
            echo "Handshake: pending"
        else
            echo "Handshake: not started"
        fi
        if [[ -d "$CMD_DIR" ]]; then
            pending=$(find "$CMD_DIR" -maxdepth 1 -type f -name '*.sh' 2>/dev/null | wc -l)
        else
            pending=0
        fi
        echo "Pending commands: $pending"
        ;;

    flush)
        if [[ -d "$RELAY_DIR" ]]; then
            # Clear handshake and command/result files, but keep server.ready
            rm -f "$RELAY_DIR/client.ready" "$RELAY_DIR/handshake.done"
            rm -rf "$CMD_DIR" "$RESULT_DIR"
            mkdir -p "$CMD_DIR" "$RESULT_DIR"
            echo "Relay state cleared (server.ready preserved): $RELAY_DIR"
        else
            echo "Relay directory does not exist: $RELAY_DIR"
        fi
        ;;

    *)
        echo "Usage: $0 [--relay-dir <path>] [--timeout <secs>] <subcommand>"
        echo ""
        echo "Subcommands:"
        echo "  handshake    Connect to the server"
        echo "  run <cmd>    Execute a command on the server"
        echo "  status       Check connection status"
        echo "  flush        Clear the relay directory"
        exit 1
        ;;
esac