Add file-based command relay for remote Docker testing #1174
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| .relay/ |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| # Remote Command Relay | ||
|
|
||
| This directory contains a file-based command relay for executing commands inside a remote Docker | ||
| container from the host machine (where Claude Code runs). | ||
|
|
||
| ## How to Use (for Claude Code) | ||
|
|
||
| ### Setup (one-time per session) | ||
|
|
||
| The user must start the server inside Docker first: | ||
|
|
||
| ```bash | ||
| # Inside Docker container (auto-detects repo root from script location): | ||
| bash /path/to/modelopt/tools/debugger/server.sh | ||
| ``` | ||
|
|
||
| Then Claude Code performs the handshake: | ||
|
|
||
| ```bash | ||
| bash tools/debugger/client.sh handshake | ||
| ``` | ||
|
|
||
| ### Running Commands | ||
|
|
||
| ```bash | ||
| # Run any command in the Docker container (workdir = auto-detected repo root): | ||
| bash tools/debugger/client.sh run "<command>" | ||
|
|
||
| # For long-running tasks, increase timeout: | ||
| bash tools/debugger/client.sh --timeout 1800 run "<command>" | ||
| ``` | ||
|
|
||
| ### Key Paths Inside Docker | ||
|
|
||
| | Path | Description | | ||
| |------|-------------| | ||
| | Repo root (auto-detected) | ModelOpt source, used as workdir | | ||
| | `/hf-local` | HuggingFace model cache | | ||
|
|
||
| ### Examples | ||
|
|
||
| ```bash | ||
| # Run PTQ test | ||
| bash tools/debugger/client.sh run "bash llm_ptq/scripts/huggingface_example.sh" | ||
|
|
||
| # Run pytest | ||
| bash tools/debugger/client.sh run "python -m pytest tests/gpu -k test_quantize" | ||
|
|
||
| # Check GPU | ||
| bash tools/debugger/client.sh run "nvidia-smi" | ||
|
|
||
| # Use HF models from local cache | ||
| bash tools/debugger/client.sh run "python script.py --model /hf-local/Qwen/Qwen3-8B" | ||
| ``` | ||
|
|
||
| ### Important Notes | ||
|
|
||
| - The server must be started by the user manually inside Docker before the handshake. | ||
| - Default command timeout is 600 seconds (10 minutes). Use `--timeout` for longer tasks. | ||
| - Commands execute sequentially — one at a time. | ||
| - All commands run with the auto-detected repo root as the working directory. | ||
| - The `.relay/` directory is ephemeral and git-ignored. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,109 @@ | ||
| # File-Based Command Relay (Debugger) | ||
|
|
||
| A lightweight client/server system for running commands inside a Docker container from the host, | ||
| using only a shared filesystem — no networking required. | ||
|
|
||
| ## Overview | ||
|
|
||
| ```text | ||
| Host (Claude Code) Docker Container | ||
| ┌─────────────┐ ┌─────────────────┐ | ||
| │ client.sh │ writes cmd file │ server.sh │ | ||
| │ run "X" │ ───────────────────► │ detects cmd │ | ||
| │ │ │ executes X │ | ||
| │ reads │ writes result file │ writes result │ | ||
| │ result │ ◄─────────────────── │ │ | ||
| └─────────────┘ └─────────────────┘ | ||
| └──── shared filesystem (.relay/) ────┘ | ||
| ``` | ||
|
|
||
| ## Assumptions | ||
|
|
||
| - The ModelOpt repo is accessible from both host and container (e.g., bind-mounted) | ||
| - **HuggingFace models** are mounted at `/hf-local` | ||
| - The server auto-detects the repo root from the location of `server.sh` | ||
|
|
||
| ## Quick Start | ||
|
|
||
| ### 1. Start the server (inside Docker) | ||
|
|
||
| ```bash | ||
| # The server auto-detects the repo root (two levels up from tools/debugger/) | ||
| bash /path/to/modelopt/tools/debugger/server.sh | ||
| ``` | ||
|
|
||
| The server automatically sets the working directory to the repo root. You can override with `--workdir`. | ||
|
|
||
| ### 2. Connect from the host | ||
|
|
||
| ```bash | ||
| bash tools/debugger/client.sh handshake | ||
| ``` | ||
|
|
||
| ### 3. Run commands | ||
|
|
||
| ```bash | ||
| # Run a simple command | ||
| bash tools/debugger/client.sh run "echo hello" | ||
|
|
||
| # Run a test script | ||
| bash tools/debugger/client.sh run "bash llm_ptq/scripts/huggingface_example.sh" | ||
|
|
||
| # Run with a long timeout (default is 600s) | ||
| bash tools/debugger/client.sh --timeout 1800 run "python my_long_test.py" | ||
|
|
||
| # Check status | ||
| bash tools/debugger/client.sh status | ||
| ``` | ||
|
|
||
| ## Protocol | ||
|
|
||
| The relay uses a directory at `tools/debugger/.relay/` with this structure: | ||
|
|
||
| ```text | ||
| .relay/ | ||
| ├── server.ready # Written by server on startup | ||
| ├── client.ready # Written by client during handshake | ||
| ├── handshake.done # Written by server to confirm handshake | ||
| ├── cmd/ # Client writes command .sh files here | ||
| │ └── <id>.sh # Command to execute | ||
| └── result/ # Server writes results here | ||
| ├── <id>.log # stdout + stderr | ||
| └── <id>.exit # Exit code | ||
| ``` | ||
|
|
||
| ### Handshake | ||
|
|
||
| 1. Server starts, creates `.relay/server.ready` | ||
| 2. Client writes `.relay/client.ready` | ||
| 3. Server detects it, writes `.relay/handshake.done` | ||
| 4. Both sides are now connected | ||
|
|
||
| ### Command Execution | ||
|
|
||
| 1. Client writes a command to `.relay/cmd/<id>.sh`, where `<id>` is a nanosecond timestamp plus the client PID | ||
| 2. Server detects the file, runs `bash <file>` in the workdir, captures output | ||
| 3. Server writes `.relay/result/<id>.log` and `.relay/result/<id>.exit` | ||
| 4. Server removes the `.sh` file; client reads results and cleans up | ||
|
|
||
| ## Options | ||
|
|
||
| ### Server | ||
|
|
||
| | Flag | Default | Description | | ||
| |------|---------|-------------| | ||
| | `--relay-dir` | `<script_dir>/.relay` | Relay directory path | | ||
| | `--workdir` | Auto-detected repo root | Working directory for commands | | ||
|
|
||
| ### Client | ||
|
|
||
| | Flag | Default | Description | | ||
| |------|---------|-------------| | ||
| | `--relay-dir` | `<script_dir>/.relay` | Relay directory path | | ||
| | `--timeout` | `600` | Seconds to wait for command result | | ||
|
|
||
| ## Notes | ||
|
|
||
| - The `.relay/` directory is in `.gitignore` — it is not checked in. | ||
| - Only one server should run at a time (startup clears the relay directory). | ||
| - Commands run sequentially in the order the server discovers them. |
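`server.sh` is not shown in this part of the diff. As a reading aid, the Command Execution steps in the README could be sketched as a single polling pass like the one below. This is a minimal sketch under assumptions, not the PR's implementation: `process_pending`, the temporary directories, and the `1_demo` command ID are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of one server polling pass; the real server.sh may differ.

relay_dir="$(mktemp -d)"   # stand-in for tools/debugger/.relay
workdir="$(mktemp -d)"     # stand-in for the auto-detected repo root
mkdir -p "$relay_dir/cmd" "$relay_dir/result"

# Run every pending command file once, in glob (lexical) order.
process_pending() {
    local cmd_file cmd_id rc
    for cmd_file in "$relay_dir"/cmd/*.sh; do
        [[ -f "$cmd_file" ]] || continue   # the glob matched nothing
        cmd_id="$(basename "$cmd_file" .sh)"
        rc=0
        # Run in the workdir, capturing stdout and stderr in one log.
        (cd "$workdir" && bash "$cmd_file") \
            > "$relay_dir/result/$cmd_id.log" 2>&1 || rc=$?
        # Publish the exit code last and atomically: the client polls for it.
        echo "$rc" > "$relay_dir/result/$cmd_id.exit.tmp"
        mv "$relay_dir/result/$cmd_id.exit.tmp" "$relay_dir/result/$cmd_id.exit"
        rm -f "$cmd_file"
    done
}

# Demo: queue one command the way the client does, then process it.
echo 'echo hello; exit 3' > "$relay_dir/cmd/1_demo.sh"
process_pending
demo_log="$(cat "$relay_dir/result/1_demo.log")"
demo_exit="$(cat "$relay_dir/result/1_demo.exit")"
echo "log=$demo_log exit=$demo_exit"
```

Writing the `.exit` file after the `.log` file matters in this sketch because the client treats the appearance of `<id>.exit` as the signal that the log is complete.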
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,172 @@ | ||
| #!/usr/bin/env bash | ||
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| # File-based command relay client. | ||
| # Run this from the host / Claude Code side. It sends commands to the server | ||
| # running inside Docker by writing files to the shared relay directory. | ||
| # | ||
| # Usage: | ||
| # bash client.sh handshake - Connect to server | ||
| # bash client.sh run <command...> - Run a command and print output | ||
| # bash client.sh status - Check server status | ||
| # | ||
| # Options: | ||
| # --relay-dir <path> Path to relay directory (default: <script_dir>/.relay) | ||
| # --timeout <secs> Timeout waiting for result (default: 600) | ||
|
|
||
| set -euo pipefail | ||
|
|
||
| RELAY_DIR="" | ||
| TIMEOUT=600 | ||
| POLL_INTERVAL=1 | ||
|
|
||
| SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" | ||
|
|
||
| # Parse global options before subcommand | ||
| while [[ $# -gt 0 ]]; do | ||
| case "$1" in | ||
| --relay-dir) RELAY_DIR="$2"; shift 2 ;; | ||
| --timeout) TIMEOUT="$2"; shift 2 ;; | ||
| *) break ;; | ||
|
Comment on lines +39 to +43 (Contributor):
Validate option values before use. Line 41 and Line 42 assume a value follows the flag. Suggested hardening:
 while [[ $# -gt 0 ]]; do
   case "$1" in
-    --relay-dir) RELAY_DIR="$2"; shift 2 ;;
-    --timeout) TIMEOUT="$2"; shift 2 ;;
+    --relay-dir)
+      [[ $# -ge 2 ]] || { echo "ERROR: --relay-dir requires a value."; exit 1; }
+      RELAY_DIR="$2"; shift 2 ;;
+    --timeout)
+      [[ $# -ge 2 ]] || { echo "ERROR: --timeout requires a value."; exit 1; }
+      TIMEOUT="$2"; shift 2 ;;
     *) break ;;
   esac
 done
+
+[[ "$TIMEOUT" =~ ^[0-9]+$ && "$TIMEOUT" -gt 0 ]] || {
+  echo "ERROR: --timeout must be a positive integer."
+  exit 1
+}
Also applies to: 109-109 |
||
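One edge case in timeout validation may be worth checking. Inside `[[ ]]`, bash evaluates `-gt` operands arithmetically, so a value with a leading zero such as `08` is read as invalid octal and raises an error instead of passing or failing cleanly. A minimal sketch of a base-10 check follows; `is_positive_int` is a hypothetical helper and is not part of the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the PR: accept only positive decimal integers.
is_positive_int() {
    local value="$1"
    # Digits only: rejects empty strings, signs, and non-numeric input.
    [[ "$value" =~ ^[0-9]+$ ]] || return 1
    # Force base 10 so leading zeros (for example 08) are not parsed as octal.
    (( 10#$value > 0 ))
}

# Demo: classify a few candidate --timeout values.
results=""
for candidate in 600 08 0 -5 abc ""; do
    if is_positive_int "$candidate"; then
        results+="ok "
    else
        results+="bad "
    fi
done
echo "$results"
```

In this sketch `600` and `08` are accepted, while `0`, `-5`, `abc`, and the empty string are rejected.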
| esac | ||
| done | ||
|
|
||
| if [[ -z "$RELAY_DIR" ]]; then | ||
| RELAY_DIR="$SCRIPT_DIR/.relay" | ||
| fi | ||
|
|
||
| CMD_DIR="$RELAY_DIR/cmd" | ||
| RESULT_DIR="$RELAY_DIR/result" | ||
|
|
||
| SUBCOMMAND="${1:-}" | ||
| shift || true | ||
|
|
||
| case "$SUBCOMMAND" in | ||
| handshake) | ||
| # Check server is ready | ||
| if [[ ! -f "$RELAY_DIR/server.ready" ]]; then | ||
| echo "ERROR: Server not ready. Start server.sh in Docker first." | ||
| exit 1 | ||
| fi | ||
| SERVER_INFO=$(cat "$RELAY_DIR/server.ready") | ||
| echo "Server found: $SERVER_INFO" | ||
|
|
||
| # Send client handshake | ||
| echo "$(hostname):$$:$(date -Iseconds)" > "$RELAY_DIR/client.ready" | ||
|
|
||
| # Wait for server acknowledgment | ||
| elapsed=0 | ||
| while [[ ! -f "$RELAY_DIR/handshake.done" ]]; do | ||
| sleep "$POLL_INTERVAL" | ||
| elapsed=$((elapsed + POLL_INTERVAL)) | ||
| if [[ $elapsed -ge 120 ]]; then | ||
| echo "ERROR: Handshake timed out after 120s." | ||
| exit 1 | ||
| fi | ||
| done | ||
|
|
||
| echo "Handshake complete." | ||
| ;; | ||
|
|
||
| run) | ||
| # Verify handshake was done | ||
| if [[ ! -f "$RELAY_DIR/handshake.done" ]]; then | ||
| echo "ERROR: Not connected. Run 'client.sh handshake' first." | ||
| exit 1 | ||
| fi | ||
|
|
||
| # Generate a unique command ID (timestamp + PID to avoid collisions) | ||
| cmd_id="$(date +%s%N)_$$" | ||
|
|
||
| # Write the command file atomically (tmp + mv) | ||
| echo "$*" > "$CMD_DIR/$cmd_id.sh.tmp" | ||
| mv "$CMD_DIR/$cmd_id.sh.tmp" "$CMD_DIR/$cmd_id.sh" | ||
|
Comment on lines +84 to +96 (Contributor):
Reject empty `run` commands. Line 84 onward does not validate command presence. Suggested guard:
   run)
+    if [[ $# -eq 0 ]]; then
+      echo "ERROR: Missing command. Usage: $0 run <command...>"
+      exit 1
+    fi
     # Verify handshake was done
     if [[ ! -f "$RELAY_DIR/handshake.done" ]]; then |
||
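To illustrate why the guard matters, the sketch below mimics how the client serializes its arguments with `echo "$*"`. With no arguments the command file holds only a newline, which bash runs successfully, so a missing command would look like a passing run. The same joining also drops the caller's quoting. `write_command_file` and the temporary paths are hypothetical stand-ins, not code from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the file-writing step of the `run` subcommand.
tmp_dir="$(mktemp -d)"

write_command_file() {
    local target="$1"
    shift
    # Same serialization as the client: join all arguments with spaces.
    echo "$*" > "$target"
}

# Case 1: no command given. The file contains only a newline.
write_command_file "$tmp_dir/empty.sh"
empty_rc=0
bash "$tmp_dir/empty.sh" || empty_rc=$?

# Case 2: an argument with internal spaces. The quoting is lost when joined.
write_command_file "$tmp_dir/split.sh" echo "a   b"
split_out="$(bash "$tmp_dir/split.sh")"

echo "empty_rc=$empty_rc split_out=$split_out"
```

Passing the whole command as one quoted string, as the README examples do, sidesteps the second case.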
|
|
||
| # Wait for result | ||
| elapsed=0 | ||
| while [[ ! -f "$RESULT_DIR/$cmd_id.exit" ]]; do | ||
| # Check if server is still alive | ||
| if [[ ! -f "$RELAY_DIR/server.ready" ]]; then | ||
| echo "ERROR: Server appears to have stopped." | ||
| rm -f "$CMD_DIR/$cmd_id.sh" | ||
| exit 1 | ||
| fi | ||
| sleep "$POLL_INTERVAL" | ||
| elapsed=$((elapsed + POLL_INTERVAL)) | ||
| if [[ $elapsed -ge $TIMEOUT ]]; then | ||
| echo "ERROR: Command timed out after ${TIMEOUT}s." | ||
| # Clean up the pending command | ||
| rm -f "$CMD_DIR/$cmd_id.sh" | ||
| exit 1 | ||
| fi | ||
| done | ||
|
|
||
| # Read and display results | ||
| exit_code=$(cat "$RESULT_DIR/$cmd_id.exit") | ||
| if [[ -f "$RESULT_DIR/$cmd_id.log" ]]; then | ||
| cat "$RESULT_DIR/$cmd_id.log" | ||
| fi | ||
|
|
||
| # Clean up result files | ||
| rm -f "$RESULT_DIR/$cmd_id.exit" "$RESULT_DIR/$cmd_id.log" | ||
|
|
||
| exit "$exit_code" | ||
| ;; | ||
|
|
||
| status) | ||
| if [[ -f "$RELAY_DIR/server.ready" ]]; then | ||
| echo "Server: $(cat "$RELAY_DIR/server.ready")" | ||
| else | ||
| echo "Server: not running" | ||
| fi | ||
| if [[ -f "$RELAY_DIR/handshake.done" ]]; then | ||
| echo "Handshake: complete" | ||
| elif [[ -f "$RELAY_DIR/client.ready" ]]; then | ||
| echo "Handshake: pending" | ||
| else | ||
| echo "Handshake: not started" | ||
| fi | ||
| if [[ -d "$CMD_DIR" ]]; then | ||
| pending=$(find "$CMD_DIR" -maxdepth 1 -type f -name '*.sh' 2>/dev/null | wc -l) | ||
| else | ||
| pending=0 | ||
| fi | ||
| echo "Pending commands: $pending" | ||
| ;; | ||
|
|
||
| flush) | ||
| if [[ -d "$RELAY_DIR" ]]; then | ||
| # Clear handshake and command/result files, but keep server.ready | ||
| rm -f "$RELAY_DIR/client.ready" "$RELAY_DIR/handshake.done" | ||
| rm -rf "$CMD_DIR" "$RESULT_DIR" | ||
| mkdir -p "$CMD_DIR" "$RESULT_DIR" | ||
| echo "Relay state cleared (server.ready preserved): $RELAY_DIR" | ||
| else | ||
| echo "Relay directory does not exist: $RELAY_DIR" | ||
| fi | ||
| ;; | ||
|
|
||
| *) | ||
| echo "Usage: $0 [--relay-dir <path>] [--timeout <secs>] <subcommand>" | ||
| echo "" | ||
| echo "Subcommands:" | ||
| echo " handshake Connect to the server" | ||
| echo " run <cmd> Execute a command on the server" | ||
| echo " status Check connection status" | ||
| echo " flush Clear the relay directory" | ||
| exit 1 | ||
| ;; | ||
| esac | ||
The file-based relay by itself doesn't have anything to do with hf-local, right? Further, I think if you rooted the whole thing under '${model-opt}/tools/debugger', then you could remove the ModelOpt references too, and this could be a purely portable file-based relay.

"Just the file-based relay doesn't have anything to do with hf-local, right?" Yes.

Sure. This is to provide the info so that Claude knows where to modify ModelOpt code during the debugging process.