
Commit 4afac7f

Add Agent PTQ skill for model quantization (#1107)
### What does this PR do?

**Type of change:** New feature

Add a PTQ skill.

**Non-launcher path:**

1. Supported or unsupported transformers models: if supported, runs the PTQ script; if unsupported (needs source code), patches modules and writes a custom script.
2. Remote execution support: configure your cluster in `clusters.yaml` (or give the info to the agent directly), then say "quantize Qwen3-8B in xxx". The agent uses SSH ControlMaster to keep a persistent SSH session, so the remote server isn't hammered with new connections and connection overhead is reduced significantly.
3. SLURM support: automatically detects whether the current machine (or a remote machine) is a SLURM login node, and writes a SLURM script for PTQ if so.
4. Can also dequantize FP8 to BF16 if an FP8 model (like DeepSeek) is given.

**Launcher path:** the launcher is currently used only for the (supported model + non-bare-metal) case.

**Not supported yet:**

1. Models not in transformers.

### Usage

1. Quantize Qwen3-0.6B to NVFP4
2. Quantize Qwen3-0.6B using cluster xxx

### Testing

| Test | Model | Path | Result |
| -- | -- | -- | -- |
| 1 | Qwen3-0.6B | 4B Launcher + SLURM | ✓ |
| 2 | SmolLM-135M | 4C → hf_ptq.py + SLURM | ✓ |
| 3 | InternVL3.5-20B | 4C → patched + SLURM | ✓ |
| 4 | InternVL3.5-30B | 4C → hf_ptq.py + SLURM | ✓ |
| 5 | FakeUnsupported-0.6B (local) | 4C → custom script, local GPU | ✓ |
| 6 | FakeUnsupported-0.6B (HF) | 4C → custom script, remote SLURM | ✓ |

The fake model is available at https://huggingface.co/supermmx/FakeUnsupported-0.6B; it is modified to include a module that needs a quantization patch.

**Not tested:**

1. Multi-node PTQ.
2. More complex unsupported models.

### Before your PR is "Ready for review"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`?: N/A
- Did you write any new necessary tests?: ❌ Tests were added, but they do not run automatically yet.
- Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅

### Additional Information

Summary by CodeRabbit:

* **Documentation**
  * New comprehensive guides for remote GPU execution, SLURM workflows, workspace management, the PTQ workflow, launcher usage, unsupported-model investigation, and PTQ SLURM/container guidance.
* **Chores**
  * Added an example cluster configuration and a sourced remote-execution helper script to manage remote runs, sync, and job lifecycle.
* **Tests**
  * Added a PTQ skill evaluation specification covering expected flows and artifact verification.

Signed-off-by: Meng Xin <mxin@nvidia.com>
1 parent 18ddcb7 commit 4afac7f

12 files changed

Lines changed: 1708 additions & 0 deletions

.claude/clusters.yaml.example

Lines changed: 18 additions & 0 deletions
```yaml
# ModelOpt Remote Cluster Configuration
# Copy to ~/.config/modelopt/clusters.yaml (user-level, recommended)
# or .claude/clusters.yaml (project-level, can be committed).

clusters:
  # GPU workstation or SLURM login node
  my-cluster:
    login_node: cluster-login.example.com
    user: myusername
    ssh_key: ~/.ssh/id_rsa
    # ssh_proxy: "socat - PROXY:localhost:%h:%p,proxyport=3128" # optional
    workspace: /path/to/remote/workdir
    gpu_type: H100  # used for quantization format recommendation
    # slurm:
    #   default_account: my_account
    #   default_partition: batch_short

default_cluster: my-cluster
```
Lines changed: 80 additions & 0 deletions
# Environment Setup

Common detection for all ModelOpt skills. After this step, you know what's available.
## Env-1. Get ModelOpt source

```bash
ls examples/llm_ptq/hf_ptq.py 2>/dev/null && echo "Source found"
```

If not found: `git clone https://github.com/NVIDIA/Model-Optimizer.git && cd Model-Optimizer`

If found, ensure the source is up to date:

```bash
git pull origin main
```

If previous runs left patches in `modelopt/` (from 4C unlisted-model work), check whether they should be kept. Reset only if starting a completely new task: `git checkout main`.
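The detect-or-clone decision above can be collected into one guarded snippet. This is a sketch: `ensure_modelopt_source` is an illustrative name, not part of the skill, and the git commands are shown commented out so the snippet is side-effect-free.

```bash
# Sketch of Env-1: probe for the source tree, then update or clone.
ensure_modelopt_source() {
  local probe="examples/llm_ptq/hf_ptq.py"
  if [ -f "$probe" ]; then
    echo "found: updating"
    # git pull origin main
  else
    echo "missing: cloning"
    # git clone https://github.com/NVIDIA/Model-Optimizer.git && cd Model-Optimizer
  fi
}
ensure_modelopt_source
```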
## Env-2. Local or remote?

1. **User explicitly requests local or remote** → follow the user's choice
2. **User doesn't specify** → check for cluster config:

```bash
cat ~/.config/modelopt/clusters.yaml 2>/dev/null || cat .claude/clusters.yaml 2>/dev/null
```

If a cluster config exists with content → **use the remote cluster** (do not fall back to local even if local GPUs are available — the cluster config indicates the user's preferred execution environment). Otherwise → **local execution**.

For remote, connect:

```bash
source .claude/skills/common/remote_exec.sh
remote_load_cluster <cluster_name>
remote_check_ssh
remote_detect_env   # sets REMOTE_ENV_TYPE = slurm / docker / bare
```

If remote but no config, ask the user for: hostname, SSH username, SSH key path, remote workdir. Create `~/.config/modelopt/clusters.yaml` (see `skills/common/remote-execution.md` for the format).
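The decision rule above, condensed into a sketch (`cfg` and `EXEC_MODE` are illustrative names, not variables the skill sets; the two paths are the documented config locations):

```bash
# Pick remote execution iff a non-empty cluster config exists.
cfg=""
for c in ~/.config/modelopt/clusters.yaml .claude/clusters.yaml; do
  if [ -s "$c" ]; then cfg="$c"; break; fi   # -s: exists and is non-empty
done
if [ -n "$cfg" ]; then EXEC_MODE=remote; else EXEC_MODE=local; fi
echo "EXEC_MODE=$EXEC_MODE (config: ${cfg:-none})"
```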
## Env-3. What compute is available?

Run on the **target machine** (local, or via `remote_run` if remote):

```bash
which srun sbatch 2>/dev/null && echo "SLURM"
docker info 2>/dev/null | grep -qi nvidia && echo "Docker+GPU"
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null
```

Also check:

```bash
ls tools/launcher/launch.py 2>/dev/null && echo "Launcher available"
```

**No GPU detected?**

- If local with no GPU and no cluster config → ask the user:
  *"No local GPU detected. Do you have a remote machine or cluster with GPUs? If so, I'll need connection details (hostname, SSH username, key path, remote workdir) to run there."*
- If the user provides remote info → create `clusters.yaml`, go back to Env-2
- If the user has no GPU anywhere → **stop**: this task requires a CUDA GPU
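The probes above can be condensed into shell variables for a one-line summary (the variable names are illustrative, not anything the skill sets):

```bash
# Summarize compute availability on the target machine.
HAS_SLURM=$(command -v sbatch >/dev/null 2>&1 && echo yes || echo no)
HAS_DOCKER_GPU=$(docker info 2>/dev/null | grep -qi nvidia && echo yes || echo no)
GPU_COUNT=$(nvidia-smi -L 2>/dev/null | wc -l)   # 0 when nvidia-smi is absent
echo "HAS_SLURM=$HAS_SLURM HAS_DOCKER_GPU=$HAS_DOCKER_GPU GPU_COUNT=$GPU_COUNT"
```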
## Summary

After this, you should know:

- ModelOpt source location
- Local or remote (+ cluster config if remote)
- SLURM / Docker+GPU / bare GPU
- Launcher availability
- GPU model and count

Return to the skill's SKILL.md for the execution path based on these results.

## Multi-user / Slack bot

If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md` before proceeding.
Lines changed: 147 additions & 0 deletions
# Remote Execution

Read this when Claude Code runs on a different machine than the target GPU cluster/workstation. This covers SSH connectivity, cluster config, persistent sessions, and remote command execution.

---
## 1. Cluster Config

Config locations (checked in order, first found wins):

1. `~/.config/modelopt/clusters.yaml` — user-level (not committed, recommended)
2. `.claude/clusters.yaml` — project-level (can be committed for shared defaults)
3. Interactive input — if neither file exists, ask the user (see SKILL.md Step 0) and write `~/.config/modelopt/clusters.yaml` before proceeding

```yaml
clusters:
  my-cluster:
    login_node: cluster-login.example.com   # SSH hostname or SSH config alias
    user: username                          # SSH user
    ssh_key: ~/.ssh/id_rsa                  # (optional) SSH key path
    ssh_proxy: "socat - PROXY:localhost:%h:%p,proxyport=3128" # (optional) proxy
    workspace: /absolute/path/to/workdir    # Remote working directory
    gpu_type: H100                          # For quant format recommendation
    slurm:                                  # (optional) pre-fill SLURM defaults
      default_account: my_account
      default_partition: batch_short

default_cluster: my-cluster
```

See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.
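For the interactive path (option 3), a minimal user-level config can be written with a quoted heredoc once the user supplies the values. A sketch with placeholder values:

```bash
# Write a minimal user-level cluster config (all values are placeholders
# the user supplies interactively).
mkdir -p ~/.config/modelopt
cat > ~/.config/modelopt/clusters.yaml <<'EOF'
clusters:
  my-cluster:
    login_node: cluster-login.example.com
    user: myusername
    workspace: /path/to/remote/workdir
default_cluster: my-cluster
EOF
echo "wrote ~/.config/modelopt/clusters.yaml"
```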
---
## 2. Connect and Establish a Persistent Session

```bash
source .claude/skills/common/remote_exec.sh
remote_load_cluster <cluster_name>   # or omit the name to use default_cluster
remote_check_ssh                     # validates connectivity + starts persistent session
```

`remote_check_ssh` starts an SSH **ControlMaster** connection. All subsequent `remote_run` / `remote_sync_*` / SCP calls reuse this single connection:

- ~180 ms per command (vs. 5-15 s per new connection)
- Eliminates flaky proxy timeouts
- Cleaned up automatically when the shell exits
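Under the hood, ControlMaster multiplexing boils down to a few standard OpenSSH options. A sketch of what `remote_check_ssh` plausibly passes (the exact flags and socket path live in `remote_exec.sh`; the values here are illustrative):

```bash
# Standard OpenSSH multiplexing options.
CTRL_DIR="${TMPDIR:-/tmp}/modelopt-ssh"
mkdir -p "$CTRL_DIR"
SSH_OPTS="-o ControlMaster=auto -o ControlPath=$CTRL_DIR/%r@%h:%p -o ControlPersist=600"
# Every later call (e.g. ssh $SSH_OPTS user@host nvidia-smi, or scp with the
# same options) reuses the master connection instead of re-authenticating.
echo "$SSH_OPTS"
```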
---
## 3. Detect Remote Environment

```bash
remote_detect_env
```

Auto-discovers whether the remote has SLURM, Docker, or bare-metal GPUs. Sets `REMOTE_ENV_TYPE` to `slurm`, `docker`, `bare`, or `unknown`.

After detection, proceed with the environment-specific setup:

- **SLURM** → prefix all commands with `remote_run`. For SLURM job scripts, see the skill's own references.
- **Docker** → use `remote_docker_run <container> "<command>"`
- **Bare metal** → use `remote_run` directly
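A plausible sketch of the detection order (the real logic is in `remote_exec.sh`; here the probe runs locally via `bash -c` purely for illustration):

```bash
# Classify an environment as slurm / docker / bare / unknown.
detect_env_type() {
  local run="$1"   # command runner: remote_run on a cluster, bash -c locally
  if $run 'command -v sbatch' >/dev/null 2>&1; then echo slurm
  elif $run 'docker info' 2>/dev/null | grep -qi nvidia; then echo docker
  elif $run 'nvidia-smi -L' >/dev/null 2>&1; then echo bare
  else echo unknown
  fi
}
REMOTE_ENV_TYPE=$(detect_env_type "bash -c")
echo "REMOTE_ENV_TYPE=$REMOTE_ENV_TYPE"
```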
---
## 4. Running Commands Remotely

### Single commands

```bash
remote_run "nvidia-smi"
remote_run "python --version"
remote_run "sbatch /path/to/job.sh"
```

`remote_run` uses base64 encoding internally, so special characters (`%`, `$`, quotes) work without escaping. It retries up to 3 times on SSH failures.

### Syncing files

```bash
# Local → remote
remote_sync_to /local/path remote_subdir

# Remote → local
remote_sync_from remote_subdir /local/path
```

Both use rsync over the persistent SSH session with default excludes (`.git`, `__pycache__`, `.claude`, `*.pyc`, `node_modules`, `*.egg-info`). The `.claude` directory is intentionally excluded — skills and config should not be synced to the remote machine.

### SCP (alternative to rsync)

SCP also reuses the persistent session automatically via ControlMaster:

```bash
scp /local/script.sh ${REMOTE_USER}@${REMOTE_HOST}:/remote/path/
```
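The base64 transport can be demonstrated locally. This is a sketch of the mechanism, not the helper's exact code — over SSH the decode-and-execute step would run on the remote side:

```bash
# A command full of characters that normally need escaping:
cmd='echo "100% \$afe"'
# Encode locally; remote_run would ship this string over ssh ...
encoded=$(printf '%s' "$cmd" | base64 | tr -d '\n')
# ... and the remote side decodes and executes it intact:
printf '%s' "$encoded" | base64 -d | bash   # prints: 100% $afe
```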
---
## 5. The Two-Script Pattern

When submitting SLURM jobs remotely, write **two files** locally to avoid shell-escaping issues:

1. **SLURM wrapper** (e.g., `job_slurm.sh`) — `#SBATCH` directives + `srun` with the container
2. **Inner runner** (e.g., `run.sh`) — the actual work (runs inside the container)

Then upload both and submit:

```bash
remote_sync_to /local/scripts/ scripts/
JOBID=$(remote_run "sbatch /remote/path/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
```
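The two files can be generated with quoted heredocs so nothing is expanded locally. A sketch with hypothetical account, partition, container, and PTQ arguments — substitute real values for your cluster:

```bash
# Write the SLURM wrapper and the inner runner (all values are placeholders).
mkdir -p scripts
cat > scripts/job_slurm.sh <<'EOF'
#!/bin/bash
#SBATCH --account=my_account
#SBATCH --partition=batch_short
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
srun --container-image=/path/to/container.sqsh bash run.sh
EOF
cat > scripts/run.sh <<'EOF'
#!/bin/bash
set -e
python hf_ptq.py --pyt_ckpt_path "$MODEL" --qformat nvfp4
EOF
echo "wrote scripts/job_slurm.sh and scripts/run.sh"
```
The quoted `'EOF'` delimiters keep `$MODEL` and the `#SBATCH` lines literal, so the scripts arrive on the remote machine exactly as written.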
---
## 6. Verifying Results Remotely

```bash
remote_run "ls -lh <output_path>/"
remote_run "cat <output_path>/hf_quant_config.json"
```

Or fetch the results to local:

```bash
remote_sync_from <remote_output_subdir> /local/output/
```
---
## 7. Troubleshooting

| Problem | Cause | Fix |
| ------- | ----- | --- |
| `Connection timed out during banner exchange` | Proxy/login node overloaded | `remote_run` retries 3x automatically; use the persistent session to avoid it |
| SSH proxy completely unreachable (`Network is unreachable`) | VPN/proxy host is down or not running on this machine | Check whether the VPN is connected; verify the `socat`/proxy service is running locally; try direct SSH by temporarily removing `ssh_proxy` from the config |
| `unix_listener: cannot bind to path ... Read-only file system` | SSH ControlMaster socket in non-writable `/tmp` | `remote_exec.sh` auto-finds a writable dir; ensure `TMPDIR` or `/tmp/claude-*` exists |
| `cd: /home/user/~/path: No such file or directory` | `~` not expanding on the remote | Use absolute paths in the `workspace` config, not `~/...` |
| Login nodes resolve home dirs differently | Symlinked home dirs vary by node | Use absolute Lustre/NFS paths (e.g., `/lustre/fs1/...`) in job scripts |
| `#!` becomes `#\!` in scripts | Shell environment mangles the shebang | Fix with `sed -i 's\|^#\\!\|#!\|' script.sh` after writing |
## Reference Files

- **`skills/common/remote_exec.sh`** — Full utility library (session, run, sync, SLURM, Docker helpers)
- **`.claude/clusters.yaml`** — Active cluster configuration
- **`.claude/clusters.yaml.example`** — Annotated example config
