# Remote Execution

Read this when Claude Code runs on a different machine than the target GPU cluster/workstation. This covers SSH connectivity, cluster config, persistent sessions, and remote command execution.

---

## 1. Cluster Config

Config locations (checked in order, first found wins):

1. `~/.config/modelopt/clusters.yaml` — user-level (not committed, recommended)
2. `.claude/clusters.yaml` — project-level (can be committed for shared defaults)
3. Interactive input — if neither file exists, ask the user (see SKILL.md Step 0) and write `~/.config/modelopt/clusters.yaml` before proceeding

```yaml
clusters:
  my-cluster:
    login_node: cluster-login.example.com  # SSH hostname or SSH config alias
    user: username                         # SSH user
    ssh_key: ~/.ssh/id_rsa                 # (optional) SSH key path
    ssh_proxy: "socat - PROXY:localhost:%h:%p,proxyport=3128"  # (optional) proxy
    workspace: /absolute/path/to/workdir   # Remote working directory
    gpu_type: H100                         # For quant format recommendation
    slurm:                                 # (optional) pre-fill SLURM defaults
      default_account: my_account
      default_partition: batch_short

default_cluster: my-cluster
```

See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.
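A minimal sketch of how a helper might read this file with plain awk (the real `remote_load_cluster` parsing may differ; the file path and values below are demo stand-ins):

```shell
# Write a small demo config (stand-in for ~/.config/modelopt/clusters.yaml).
cat > /tmp/clusters_demo.yaml <<'EOF'
clusters:
  my-cluster:
    login_node: cluster-login.example.com
    user: username
default_cluster: my-cluster
EOF

# Read the default cluster name, then pull its login_node.
default=$(awk '/^default_cluster:/ {print $2}' /tmp/clusters_demo.yaml)
login=$(awk -v c="$default" '
  $1 == c":" {in_cluster=1; next}
  in_cluster && /login_node:/ {print $2; exit}' /tmp/clusters_demo.yaml)
echo "$login"   # cluster-login.example.com
```

A real parser should use a YAML-aware tool (e.g., `yq` or Python) rather than awk; this only illustrates the lookup order.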
---

## 2. Connect and Establish Persistent Session

```bash
source .claude/skills/common/remote_exec.sh
remote_load_cluster <cluster_name>  # or omit name to use default_cluster
remote_check_ssh                    # validates connectivity + starts persistent session
```

`remote_check_ssh` starts an SSH **ControlMaster** connection. All subsequent `remote_run` / `remote_sync_*` / SCP calls reuse this single connection:

- ~180 ms per command (vs. 5-15 s per new connection)
- Eliminates flaky proxy timeouts
- Auto-cleaned up when the shell exits
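Under the hood this is standard OpenSSH connection multiplexing. A quick way to see how the options resolve, without connecting anywhere (`gpu-login.example.com` is a placeholder; `ssh -G` only prints the effective config):

```shell
# -G resolves and prints the configuration for a host without connecting.
ssh -G \
  -o ControlMaster=auto \
  -o ControlPath=/tmp/cm-%r@%h:%p \
  -o ControlPersist=10m \
  gpu-login.example.com | grep -iE 'controlmaster|controlpersist'
```

`ControlPersist` keeps the master alive in the background for the given interval after the last client disconnects, which is what makes follow-up commands near-instant.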
---

## 3. Detect Remote Environment

```bash
remote_detect_env
```

Auto-discovers whether the remote has SLURM, Docker, or bare-metal GPUs. Sets `REMOTE_ENV_TYPE` to `slurm`, `docker`, `bare`, or `unknown`.

After detection, proceed with the environment-specific setup:

- **SLURM** → prefix all commands with `remote_run`. For SLURM job scripts, see the skill's own references.
- **Docker** → use `remote_docker_run <container> "<command>"`
- **Bare metal** → use `remote_run` directly

---

## 4. Running Commands Remotely

### Single commands

```bash
remote_run "nvidia-smi"
remote_run "python --version"
remote_run "sbatch /path/to/job.sh"
```

`remote_run` base64-encodes the command internally, so special characters (`%`, `$`, quotes) work without escaping. It retries up to 3 times on SSH failures.
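The base64 round-trip can be demonstrated entirely locally. This sketch mirrors what `remote_run` is described as doing; the helper's exact implementation may differ:

```shell
# A command full of characters that would need careful quoting over SSH.
cmd='printf "disk at 93%% on $HOME\n"'

# Encode it into one shell-safe token (tr guards against base64 line wrapping).
enc=$(printf '%s' "$cmd" | base64 | tr -d '\n')

# On the remote side this would be: ssh host "echo $enc | base64 -d | bash"
echo "$enc" | base64 -d | bash
```

Because the command travels as a single alphanumeric token, `%`, `$`, and quotes survive the SSH hop untouched; `$HOME` expands only when the decoded command finally runs.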
### Syncing files

```bash
# Local → remote
remote_sync_to /local/path remote_subdir

# Remote → local
remote_sync_from remote_subdir /local/path
```

Both use rsync over the persistent SSH session with default excludes (`.git`, `__pycache__`, `.claude`, `*.pyc`, `node_modules`, `*.egg-info`). The `.claude` directory is intentionally excluded — skills and config should not be synced to the remote machine.
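The exclude behavior can be sketched with a purely local rsync (the real helpers presumably also pass `-e ssh` with the ControlPath so transfers ride the persistent session — an assumption about the implementation):

```shell
# Build a source tree containing something that should be excluded.
mkdir -p /tmp/sync_demo/src/__pycache__
echo code > /tmp/sync_demo/src/main.py
echo junk > /tmp/sync_demo/src/__pycache__/main.cpython-311.pyc

# Sync with the same style of excludes the helpers apply.
rsync -a --exclude '.git' --exclude '__pycache__' --exclude '*.pyc' \
  /tmp/sync_demo/src/ /tmp/sync_demo/dst/

ls /tmp/sync_demo/dst/   # main.py only
```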
### SCP (alternative to rsync)

SCP also reuses the persistent session automatically via ControlMaster:

```bash
scp /local/script.sh ${REMOTE_USER}@${REMOTE_HOST}:/remote/path/
```

---

## 5. The Two-Script Pattern

When submitting SLURM jobs remotely, write **two files** locally to avoid shell-escaping issues:

1. **SLURM wrapper** (e.g., `job_slurm.sh`) — `#SBATCH` directives + `srun` with container
2. **Inner runner** (e.g., `run.sh`) — the actual work (runs inside the container)
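A hypothetical minimal pair, assuming a pyxis/enroot-style `srun --container-image` setup; the account, partition, image path, mount, and script contents are all placeholders:

```shell
# SLURM wrapper: scheduler directives + container launch.
cat > /tmp/job_slurm.sh <<'EOF'
#!/bin/bash
#SBATCH --account=my_account
#SBATCH --partition=batch_short
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
srun --container-image=/lustre/images/modelopt.sqsh \
     --container-mounts=/lustre/fs1/workdir:/workspace \
     bash /workspace/scripts/run.sh
EOF

# Inner runner: the actual work, executed inside the container.
cat > /tmp/run.sh <<'EOF'
#!/bin/bash
set -euo pipefail
cd /workspace
python quantize.py --model "$MODEL_PATH" --output "$OUTPUT_PATH"
EOF
```

Because both files are written locally and synced as-is, no command ever has to survive an extra layer of remote-shell quoting.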
Then upload both and submit:

```bash
remote_sync_to /local/scripts/ scripts/
JOBID=$(remote_run "sbatch /remote/path/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
```
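The job-ID extraction relies on `sbatch` printing `Submitted batch job <id>`; the pipeline can be checked locally:

```shell
# grep -o emits each run of digits on its own line; tail -1 keeps the last,
# so the job ID survives even if a banner line contains other numbers.
echo "Submitted batch job 4821734" | grep -o '[0-9]\+' | tail -1   # 4821734
```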
---

## 6. Verifying Results Remotely

```bash
remote_run "ls -lh <output_path>/"
remote_run "cat <output_path>/hf_quant_config.json"
```

Or fetch results to local:

```bash
remote_sync_from <remote_output_subdir> /local/output/
```

---

## 7. Troubleshooting

| Problem | Cause | Fix |
| ------- | ----- | --- |
| `Connection timed out during banner exchange` | Proxy/login node overloaded | `remote_run` retries 3x automatically; use the persistent session to avoid it |
| SSH proxy completely unreachable (`Network is unreachable`) | VPN/proxy host is down or not running on this machine | Check that the VPN is connected; verify the `socat`/proxy service is running locally; try direct SSH by temporarily removing `ssh_proxy` from the config |
| `unix_listener: cannot bind to path ... Read-only file system` | SSH ControlMaster socket in non-writable `/tmp` | `remote_exec.sh` auto-finds a writable dir; ensure `TMPDIR` or `/tmp/claude-*` exists |
| `cd: /home/user/~/path: No such file or directory` | `~` not expanding on the remote | Use absolute paths in the `workspace` config, not `~/...` |
| Login nodes resolve home dirs differently | Symlinked home dirs vary by node | Use absolute Lustre/NFS paths (e.g., `/lustre/fs1/...`) in job scripts |
| `#!` becomes `#\!` in scripts | Shell environment mangles the shebang | Fix with `sed -i 's\|^#\\!\|#!\|' script.sh` after writing |
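The shebang repair from the last row can be reproduced and verified locally:

```shell
# Reproduce the mangled shebang (printf turns \\ into a literal backslash).
printf '#\\!/bin/bash\necho ok\n' > /tmp/mangled.sh
head -1 /tmp/mangled.sh          # #\!/bin/bash

# In the sed pattern, \\ matches the stray backslash; | is the delimiter.
sed -i 's|^#\\!|#!|' /tmp/mangled.sh
head -1 /tmp/mangled.sh          # #!/bin/bash
```

Note that `sed -i` with no suffix argument is GNU sed syntax; BSD/macOS sed needs `sed -i ''`.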
## Reference Files

- **`skills/common/remote_exec.sh`** — Full utility library (session, run, sync, SLURM, Docker helpers)
- **`.claude/clusters.yaml`** — Active cluster configuration
- **`.claude/clusters.yaml.example`** — Annotated example config