Commit 6bbb9d6

Merge branch 'feature/puzzletron' into jrausch/fix-lm-eval-version
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
2 parents 303ee64 + 3f41819 commit 6bbb9d6

File tree: 504 files changed, +24127 −10860 lines


.claude/clusters.yaml.example

Lines changed: 18 additions & 0 deletions

```yaml
# ModelOpt Remote Cluster Configuration
# Copy to ~/.config/modelopt/clusters.yaml (user-level, recommended)
# or .claude/clusters.yaml (project-level, can be committed).

clusters:
  # GPU workstation or SLURM login node
  my-cluster:
    login_node: cluster-login.example.com
    user: myusername
    ssh_key: ~/.ssh/id_rsa
    # ssh_proxy: "socat - PROXY:localhost:%h:%p,proxyport=3128"  # optional
    workspace: /path/to/remote/workdir
    gpu_type: H100  # used for quantization format recommendation
    # slurm:
    #   default_account: my_account
    #   default_partition: batch_short

default_cluster: my-cluster
```
Lines changed: 80 additions & 0 deletions
# Environment Setup

Common detection for all ModelOpt skills. After this, you know what's available.

## Env-1. Get ModelOpt source

```bash
ls examples/llm_ptq/hf_ptq.py 2>/dev/null && echo "Source found"
```

If not found: `git clone https://github.com/NVIDIA/Model-Optimizer.git && cd Model-Optimizer`

If found, ensure the source is up to date:

```bash
git pull origin main
```

If previous runs left patches in `modelopt/` (from 4C unlisted model work), check whether they should be kept. Reset only if starting a completely new task: `git checkout main`.
## Env-2. Local or remote?

1. **User explicitly requests local or remote** → follow the user's choice
2. **User doesn't specify** → check for a cluster config:

```bash
cat ~/.config/modelopt/clusters.yaml 2>/dev/null || cat .claude/clusters.yaml 2>/dev/null
```

If a non-empty cluster config exists → **use the remote cluster** (do not fall back to local even if local GPUs are available — the cluster config indicates the user's preferred execution environment). Otherwise → **local execution**.

For remote, connect:

```bash
source .claude/skills/common/remote_exec.sh
remote_load_cluster <cluster_name>
remote_check_ssh
remote_detect_env   # sets REMOTE_ENV_TYPE = slurm / docker / bare
```

If remote but no config, ask user for: hostname, SSH username, SSH key path, remote workdir. Create `~/.config/modelopt/clusters.yaml` (see `skills/common/remote-execution.md` for the format).
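A minimal sketch of writing that config from the user's answers. The field names follow `clusters.yaml.example`; the values below are placeholders, not real hosts:

```bash
# Hypothetical sketch: persist user-provided connection details.
# Placeholder values only; replace with the user's actual answers.
mkdir -p "$HOME/.config/modelopt"
cat > "$HOME/.config/modelopt/clusters.yaml" <<'EOF'
clusters:
  my-cluster:
    login_node: cluster-login.example.com
    user: myusername
    ssh_key: ~/.ssh/id_rsa
    workspace: /path/to/remote/workdir
default_cluster: my-cluster
EOF
```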
## Env-3. What compute is available?

Run on the **target machine** (local, or via `remote_run` if remote):

```bash
which srun sbatch 2>/dev/null && echo "SLURM"
docker info 2>/dev/null | grep -qi nvidia && echo "Docker+GPU"
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null
```
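The three probes above can be folded into a single value for later branching. This is an illustrative sketch; the variable name `ENV_TYPE` is not part of `remote_exec.sh`:

```bash
# Illustrative only: collapse the probes above into one ENV_TYPE value.
if command -v sbatch >/dev/null 2>&1; then
  ENV_TYPE=slurm                                   # SLURM scheduler present
elif docker info 2>/dev/null | grep -qi nvidia; then
  ENV_TYPE=docker                                  # Docker with NVIDIA runtime
elif nvidia-smi >/dev/null 2>&1; then
  ENV_TYPE=bare                                    # bare-metal GPU
else
  ENV_TYPE=none                                    # no compute detected
fi
echo "Detected: $ENV_TYPE"
```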
Also check:

```bash
ls tools/launcher/launch.py 2>/dev/null && echo "Launcher available"
```

**No GPU detected?**

- If local with no GPU and no cluster config → ask the user:
  *"No local GPU detected. Do you have a remote machine or cluster with GPUs? If so, I'll need connection details (hostname, SSH username, key path, remote workdir) to run there."*
- If user provides remote info → create `clusters.yaml`, go back to Env-2
- If user has no GPU anywhere → **stop**: this task requires a CUDA GPU

## Summary

After this, you should know:

- ModelOpt source location
- Local or remote (+ cluster config if remote)
- SLURM / Docker+GPU / bare GPU
- Launcher availability
- GPU model and count

Return to the skill's SKILL.md for the execution path based on these results.

## Multi-user / Slack bot

If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md` before proceeding.
Lines changed: 147 additions & 0 deletions
# Remote Execution

Read this when Claude Code runs on a different machine than the target GPU cluster/workstation. This covers SSH connectivity, cluster config, persistent sessions, and remote command execution.

---

## 1. Cluster Config

Config locations (checked in order, first found wins):

1. `~/.config/modelopt/clusters.yaml` — user-level (not committed, recommended)
2. `.claude/clusters.yaml` — project-level (can be committed for shared defaults)
3. Interactive input — if neither file exists, ask the user (see SKILL.md Step 0) and write `~/.config/modelopt/clusters.yaml` before proceeding

```yaml
clusters:
  my-cluster:
    login_node: cluster-login.example.com  # SSH hostname or SSH config alias
    user: username                         # SSH user
    ssh_key: ~/.ssh/id_rsa                 # (optional) SSH key path
    ssh_proxy: "socat - PROXY:localhost:%h:%p,proxyport=3128"  # (optional) proxy
    workspace: /absolute/path/to/workdir   # Remote working directory
    gpu_type: H100                         # For quant format recommendation
    slurm:                                 # (optional) pre-fill SLURM defaults
      default_account: my_account
      default_partition: batch_short

default_cluster: my-cluster
```

See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.
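For scripting, `default_cluster` can be pulled out of the YAML with plain `awk` when no YAML parser is available. This is a sketch of the idea, not how `remote_exec.sh` necessarily parses the file; the example writes a throwaway config to a temp file so it is self-contained:

```bash
# Create a throwaway config for demonstration purposes.
CONFIG=$(mktemp)
cat > "$CONFIG" <<'EOF'
clusters:
  my-cluster:
    login_node: cluster-login.example.com
default_cluster: my-cluster
EOF

# Extract the value of the top-level default_cluster key.
DEFAULT=$(awk '/^default_cluster:/ {print $2}' "$CONFIG")
echo "$DEFAULT"   # → my-cluster
```

This only works for flat top-level keys; nested fields would need a real YAML parser.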
---

## 2. Connect and Establish Persistent Session

```bash
source .claude/skills/common/remote_exec.sh
remote_load_cluster <cluster_name>   # or omit name to use default_cluster
remote_check_ssh                     # validates connectivity + starts persistent session
```

`remote_check_ssh` starts an SSH **ControlMaster** connection. All subsequent `remote_run` / `remote_sync_*` / SCP calls reuse this single connection:

- ~180ms per command (vs 5-15s per new connection)
- Eliminates flaky proxy timeouts
- Auto-cleaned up when the shell exits
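For reference, this is roughly what a ControlMaster setup looks like in raw OpenSSH options; the exact flags `remote_exec.sh` uses may differ. `ssh -G` only prints the resolved client config, so the sketch runs without a reachable host:

```bash
# Sketch of ControlMaster options (assumed, not copied from remote_exec.sh).
# %r/%h/%p expand to remote user/host/port in the socket path.
SOCKET="${TMPDIR:-/tmp}/ssh-ctrl-%r@%h-%p"
ssh -G \
    -o ControlMaster=auto \
    -o ControlPath="$SOCKET" \
    -o ControlPersist=10m \
    cluster-login.example.com | grep -i '^control'
```

With these options, the first real `ssh` call opens the master; later calls multiplex over the socket instead of re-handshaking.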
---

## 3. Detect Remote Environment

```bash
remote_detect_env
```

Auto-discovers whether the remote has SLURM, Docker, or bare-metal GPUs. Sets `REMOTE_ENV_TYPE` to `slurm`, `docker`, `bare`, or `unknown`.

After detection, proceed with the environment-specific setup:

- **SLURM** → prefix all commands with `remote_run`. For SLURM job scripts, see the skill's own references.
- **Docker** → use `remote_docker_run <container> "<command>"`
- **Bare metal** → use `remote_run` directly
---

## 4. Running Commands Remotely

### Single commands

```bash
remote_run "nvidia-smi"
remote_run "python --version"
remote_run "sbatch /path/to/job.sh"
```

`remote_run` uses base64 encoding internally, so special characters (`%`, `$`, quotes) work without escaping. It retries up to 3 times on SSH failures.
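The base64 round-trip can be demonstrated locally. This sketch mirrors the described behavior; the actual `ssh` invocation inside `remote_run` is shown only as a comment since no remote is assumed here:

```bash
# A command full of characters that normally need careful quoting:
cmd='echo "100% of $HOME" | grep -c %'

# Encode it so it survives the SSH command line untouched.
encoded=$(printf '%s' "$cmd" | base64 | tr -d '\n')

# On the remote side this would run as something like (illustrative):
#   ssh host "echo $encoded | base64 -d | bash"

# Locally, verify the round-trip is lossless:
decoded=$(printf '%s' "$encoded" | base64 -d)
[ "$decoded" = "$cmd" ] && echo "round-trip OK"
```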
### Syncing files

```bash
# Local → remote
remote_sync_to /local/path remote_subdir

# Remote → local
remote_sync_from remote_subdir /local/path
```

Both use rsync over the persistent SSH session with default excludes (`.git`, `__pycache__`, `.claude`, `*.pyc`, `node_modules`, `*.egg-info`). The `.claude` directory is intentionally excluded — skills and config should not be synced to the remote machine.
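The exclude behavior can be checked locally with plain rsync between two temp directories. The flags here are illustrative; `remote_sync_to`'s actual rsync invocation may differ:

```bash
# Build a source tree containing both kept and excluded entries.
src=$(mktemp -d); dst=$(mktemp -d)
mkdir -p "$src/.git" "$src/__pycache__"
echo data > "$src/keep.txt"
echo junk > "$src/skip.pyc"

# Apply the documented default excludes.
rsync -a \
  --exclude='.git' --exclude='__pycache__' --exclude='.claude' \
  --exclude='*.pyc' --exclude='node_modules' --exclude='*.egg-info' \
  "$src/" "$dst/"

ls -A "$dst"   # only keep.txt survives
```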
### SCP (alternative to rsync)

SCP also reuses the persistent session automatically via ControlMaster:

```bash
scp /local/script.sh ${REMOTE_USER}@${REMOTE_HOST}:/remote/path/
```
---

## 5. The Two-Script Pattern

When submitting SLURM jobs remotely, write **two files** locally to avoid shell escaping issues:

1. **SLURM wrapper** (e.g., `job_slurm.sh`) — `#SBATCH` directives + `srun` with container
2. **Inner runner** (e.g., `run.sh`) — the actual work (runs inside the container)

Then upload both and submit:

```bash
remote_sync_to /local/scripts/ scripts/
JOBID=$(remote_run "sbatch /remote/path/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
```
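The `grep`/`tail` pipeline relies on `sbatch` printing its standard "Submitted batch job <id>" line; the extraction itself can be checked without a cluster by substituting a canned string:

```bash
# sbatch normally prints a line like this on success:
sbatch_output="Submitted batch job 123456"

# Pull out the last run of digits, i.e. the job ID.
JOBID=$(echo "$sbatch_output" | grep -o '[0-9]\+' | tail -1)
echo "$JOBID"   # → 123456
```

Taking the last match via `tail -1` guards against stray digits earlier in the output (e.g. cluster banner text).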
---

## 6. Verifying Results Remotely

```bash
remote_run "ls -lh <output_path>/"
remote_run "cat <output_path>/hf_quant_config.json"
```

Or fetch results to local:

```bash
remote_sync_from <remote_output_subdir> /local/output/
```
---

## 7. Troubleshooting

| Problem | Cause | Fix |
| ------- | ----- | --- |
| `Connection timed out during banner exchange` | Proxy/login node overloaded | `remote_run` retries 3x automatically; use the persistent session to avoid it |
| SSH proxy completely unreachable (`Network is unreachable`) | VPN/proxy host is down or not running on this machine | Check if the VPN is connected; verify the `socat`/proxy service is running locally; try direct SSH by temporarily removing `ssh_proxy` from the config |
| `unix_listener: cannot bind to path ... Read-only file system` | SSH ControlMaster socket in non-writable `/tmp` | `remote_exec.sh` auto-finds a writable dir; ensure `TMPDIR` or `/tmp/claude-*` exists |
| `cd: /home/user/~/path: No such file or directory` | `~` not expanding on the remote | Use absolute paths in the `workspace` config, not `~/...` |
| Login nodes resolve home dirs differently | Symlinked home dirs vary by node | Use absolute Lustre/NFS paths (e.g., `/lustre/fs1/...`) in job scripts |
| `#!` becomes `#\!` in scripts | Shell environment mangles the shebang | Fix with `sed -i 's\|^#\\!\|#!\|' script.sh` after writing |

## Reference Files

- **`skills/common/remote_exec.sh`** — Full utility library (session, run, sync, SLURM, Docker helpers)
- **`.claude/clusters.yaml`** — Active cluster configuration
- **`.claude/clusters.yaml.example`** — Annotated example config
