
Commit 4afac7f

Add Agent PTQ skill for model quantization (#1107)
### What does this PR do?

**Type of change:** New feature

Add a PTQ skill.

**Non-launcher path:**

1. Supported or unsupported transformers models: if supported, runs the PTQ script; if unsupported (needs source code), patches modules and writes a custom script.
2. Remote execution support: configure your cluster in `clusters.yaml` (or give the info to the agent directly), then say "quantize Qwen3-8B in xxx". The agent uses SSH ControlMaster to keep a persistent SSH session, so the remote server isn't hammered with new connections and connection overhead is reduced significantly.
3. SLURM support: automatically detects whether the current machine (or a remote machine) is a SLURM login node, and writes a SLURM script for PTQ if so.
4. Can also dequantize FP8 to BF16 if an FP8 model (like DeepSeek) is given.

**Launcher path:** the launcher is currently used only for the (supported model + non-bare-metal) case.

**Not supported yet:**

1. Models not in transformers.

### Usage

1. Quantize Qwen3-0.6B to NVFP4
2. Quantize Qwen3-0.6B using cluster xxx

### Testing

| Test | Model | Path | Result |
| -- | -- | -- | -- |
| 1 | Qwen3-0.6B | 4B Launcher + SLURM | ✓ |
| 2 | SmolLM-135M | 4C → hf_ptq.py + SLURM | ✓ |
| 3 | InternVL3.5-20B | 4C → patched + SLURM | ✓ |
| 4 | InternVL3.5-30B | 4C → hf_ptq.py + SLURM | ✓ |
| 5 | FakeUnsupported-0.6B (local) | 4C → custom script, local GPU | ✓ |
| 6 | FakeUnsupported-0.6B (HF) | 4C → custom script, remote SLURM | ✓ |

The fake model is available at https://huggingface.co/supermmx/FakeUnsupported-0.6B; it is modified to include a module that needs a quantization patch.

**Not tested:**

1. Multi-node PTQ.
2. More complex unsupported models.

### Before your PR is "Ready for review"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`?: N/A
- Did you write any new necessary tests?: ❌ Tests were added, but they do not run automatically yet.
- Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅

### Additional Information

Summary by CodeRabbit:

* **Documentation**
  * New comprehensive guides for remote GPU execution, SLURM workflows, workspace management, the PTQ workflow, launcher usage, unsupported-model investigation, and PTQ SLURM/container guidance.
* **Chores**
  * Added an example cluster configuration and a sourced remote-execution helper script to manage remote runs, sync, and job lifecycle.
* **Tests**
  * Added a PTQ skill evaluation specification covering expected flows and artifact verification.

Signed-off-by: Meng Xin <mxin@nvidia.com>
1 parent 18ddcb7 commit 4afac7f

12 files changed

Lines changed: 1708 additions & 0 deletions

.claude/clusters.yaml.example

Lines changed: 18 additions & 0 deletions
```yaml
# ModelOpt Remote Cluster Configuration
# Copy to ~/.config/modelopt/clusters.yaml (user-level, recommended)
# or .claude/clusters.yaml (project-level, can be committed).

clusters:
  # GPU workstation or SLURM login node
  my-cluster:
    login_node: cluster-login.example.com
    user: myusername
    ssh_key: ~/.ssh/id_rsa
    # ssh_proxy: "socat - PROXY:localhost:%h:%p,proxyport=3128" # optional
    workspace: /path/to/remote/workdir
    gpu_type: H100  # used for quantization format recommendation
    # slurm:
    #   default_account: my_account
    #   default_partition: batch_short

default_cluster: my-cluster
```
Lines changed: 80 additions & 0 deletions
# Environment Setup

Common detection for all ModelOpt skills. After this step, you know what's available.
## Env-1. Get ModelOpt source

```bash
ls examples/llm_ptq/hf_ptq.py 2>/dev/null && echo "Source found"
```

If not found: `git clone https://github.com/NVIDIA/Model-Optimizer.git && cd Model-Optimizer`

If found, ensure the source is up to date:

```bash
git pull origin main
```

If previous runs left patches in `modelopt/` (from 4C unlisted-model work), check whether they should be kept. Reset only if starting a completely new task: `git checkout main`.
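The detect-or-clone decision above can be collected into one guarded snippet. This is a sketch: `ensure_modelopt_source` is an illustrative name, not part of the skill, and the git commands are shown commented out so the snippet is side-effect-free.

```bash
# Sketch of Env-1: probe for the source tree, then update or clone.
ensure_modelopt_source() {
  local probe="examples/llm_ptq/hf_ptq.py"
  if [ -f "$probe" ]; then
    echo "found: updating"
    # git pull origin main
  else
    echo "missing: cloning"
    # git clone https://github.com/NVIDIA/Model-Optimizer.git && cd Model-Optimizer
  fi
}
ensure_modelopt_source
```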
## Env-2. Local or remote?

1. **User explicitly requests local or remote** → follow the user's choice
2. **User doesn't specify** → check for cluster config:

```bash
cat ~/.config/modelopt/clusters.yaml 2>/dev/null || cat .claude/clusters.yaml 2>/dev/null
```

If a cluster config exists with content → **use the remote cluster** (do not fall back to local even if local GPUs are available — the cluster config indicates the user's preferred execution environment). Otherwise → **local execution**.

For remote, connect:

```bash
source .claude/skills/common/remote_exec.sh
remote_load_cluster <cluster_name>
remote_check_ssh
remote_detect_env   # sets REMOTE_ENV_TYPE = slurm / docker / bare
```

If remote but no config, ask the user for: hostname, SSH username, SSH key path, remote workdir. Create `~/.config/modelopt/clusters.yaml` (see `skills/common/remote-execution.md` for the format).
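The decision rule above, condensed into a sketch (`cfg` and `EXEC_MODE` are illustrative names, not variables the skill sets; the two paths are the documented config locations):

```bash
# Pick remote execution iff a non-empty cluster config exists.
cfg=""
for c in ~/.config/modelopt/clusters.yaml .claude/clusters.yaml; do
  if [ -s "$c" ]; then cfg="$c"; break; fi   # -s: exists and is non-empty
done
if [ -n "$cfg" ]; then EXEC_MODE=remote; else EXEC_MODE=local; fi
echo "EXEC_MODE=$EXEC_MODE (config: ${cfg:-none})"
```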
## Env-3. What compute is available?

Run on the **target machine** (local, or via `remote_run` if remote):

```bash
which srun sbatch 2>/dev/null && echo "SLURM"
docker info 2>/dev/null | grep -qi nvidia && echo "Docker+GPU"
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null
```

Also check:

```bash
ls tools/launcher/launch.py 2>/dev/null && echo "Launcher available"
```

**No GPU detected?**

- If local with no GPU and no cluster config → ask the user:
  *"No local GPU detected. Do you have a remote machine or cluster with GPUs? If so, I'll need connection details (hostname, SSH username, key path, remote workdir) to run there."*
- If the user provides remote info → create `clusters.yaml`, go back to Env-2
- If the user has no GPU anywhere → **stop**: this task requires a CUDA GPU
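The probes above can be condensed into shell variables for a one-line summary (the variable names are illustrative, not anything the skill sets):

```bash
# Summarize compute availability on the target machine.
HAS_SLURM=$(command -v sbatch >/dev/null 2>&1 && echo yes || echo no)
HAS_DOCKER_GPU=$(docker info 2>/dev/null | grep -qi nvidia && echo yes || echo no)
GPU_COUNT=$(nvidia-smi -L 2>/dev/null | wc -l)   # 0 when nvidia-smi is absent
echo "HAS_SLURM=$HAS_SLURM HAS_DOCKER_GPU=$HAS_DOCKER_GPU GPU_COUNT=$GPU_COUNT"
```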
## Summary

After this, you should know:

- ModelOpt source location
- Local or remote (+ cluster config if remote)
- SLURM / Docker+GPU / bare GPU
- Launcher availability
- GPU model and count

Return to the skill's SKILL.md for the execution path based on these results.

## Multi-user / Slack bot

If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md` before proceeding.
Lines changed: 147 additions & 0 deletions
# Remote Execution

Read this when Claude Code runs on a different machine than the target GPU cluster/workstation. This covers SSH connectivity, cluster config, persistent sessions, and remote command execution.

---
## 1. Cluster Config

Config locations (checked in order, first found wins):

1. `~/.config/modelopt/clusters.yaml` — user-level (not committed, recommended)
2. `.claude/clusters.yaml` — project-level (can be committed for shared defaults)
3. Interactive input — if neither file exists, ask the user (see SKILL.md Step 0) and write `~/.config/modelopt/clusters.yaml` before proceeding

```yaml
clusters:
  my-cluster:
    login_node: cluster-login.example.com   # SSH hostname or SSH config alias
    user: username                          # SSH user
    ssh_key: ~/.ssh/id_rsa                  # (optional) SSH key path
    ssh_proxy: "socat - PROXY:localhost:%h:%p,proxyport=3128" # (optional) proxy
    workspace: /absolute/path/to/workdir    # Remote working directory
    gpu_type: H100                          # For quant format recommendation
    slurm:                                  # (optional) pre-fill SLURM defaults
      default_account: my_account
      default_partition: batch_short

default_cluster: my-cluster
```

See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.
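For the interactive path (option 3), a minimal user-level config can be written with a quoted heredoc once the user supplies the values. A sketch with placeholder values:

```bash
# Write a minimal user-level cluster config (all values are placeholders
# the user supplies interactively).
mkdir -p ~/.config/modelopt
cat > ~/.config/modelopt/clusters.yaml <<'EOF'
clusters:
  my-cluster:
    login_node: cluster-login.example.com
    user: myusername
    workspace: /path/to/remote/workdir
default_cluster: my-cluster
EOF
echo "wrote ~/.config/modelopt/clusters.yaml"
```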
---
## 2. Connect and Establish a Persistent Session

```bash
source .claude/skills/common/remote_exec.sh
remote_load_cluster <cluster_name>   # or omit the name to use default_cluster
remote_check_ssh                     # validates connectivity + starts persistent session
```

`remote_check_ssh` starts an SSH **ControlMaster** connection. All subsequent `remote_run` / `remote_sync_*` / SCP calls reuse this single connection:

- ~180 ms per command (vs. 5-15 s per new connection)
- Eliminates flaky proxy timeouts
- Cleaned up automatically when the shell exits
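Under the hood, ControlMaster multiplexing boils down to a few standard OpenSSH options. A sketch of what `remote_check_ssh` plausibly passes (the exact flags and socket path live in `remote_exec.sh`; the values here are illustrative):

```bash
# Standard OpenSSH multiplexing options.
CTRL_DIR="${TMPDIR:-/tmp}/modelopt-ssh"
mkdir -p "$CTRL_DIR"
SSH_OPTS="-o ControlMaster=auto -o ControlPath=$CTRL_DIR/%r@%h:%p -o ControlPersist=600"
# Every later call (e.g. ssh $SSH_OPTS user@host nvidia-smi, or scp with the
# same options) reuses the master connection instead of re-authenticating.
echo "$SSH_OPTS"
```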
---
## 3. Detect Remote Environment

```bash
remote_detect_env
```

Auto-discovers whether the remote has SLURM, Docker, or bare-metal GPUs. Sets `REMOTE_ENV_TYPE` to `slurm`, `docker`, `bare`, or `unknown`.

After detection, proceed with the environment-specific setup:

- **SLURM** → prefix all commands with `remote_run`. For SLURM job scripts, see the skill's own references.
- **Docker** → use `remote_docker_run <container> "<command>"`
- **Bare metal** → use `remote_run` directly
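A plausible sketch of the detection order (the real logic is in `remote_exec.sh`; here the probe runs locally via `bash -c` purely for illustration):

```bash
# Classify an environment as slurm / docker / bare / unknown.
detect_env_type() {
  local run="$1"   # command runner: remote_run on a cluster, bash -c locally
  if $run 'command -v sbatch' >/dev/null 2>&1; then echo slurm
  elif $run 'docker info' 2>/dev/null | grep -qi nvidia; then echo docker
  elif $run 'nvidia-smi -L' >/dev/null 2>&1; then echo bare
  else echo unknown
  fi
}
REMOTE_ENV_TYPE=$(detect_env_type "bash -c")
echo "REMOTE_ENV_TYPE=$REMOTE_ENV_TYPE"
```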
---
## 4. Running Commands Remotely

### Single commands

```bash
remote_run "nvidia-smi"
remote_run "python --version"
remote_run "sbatch /path/to/job.sh"
```

`remote_run` uses base64 encoding internally, so special characters (`%`, `$`, quotes) work without escaping. It retries up to 3 times on SSH failures.

### Syncing files

```bash
# Local → remote
remote_sync_to /local/path remote_subdir

# Remote → local
remote_sync_from remote_subdir /local/path
```

Both use rsync over the persistent SSH session with default excludes (`.git`, `__pycache__`, `.claude`, `*.pyc`, `node_modules`, `*.egg-info`). The `.claude` directory is intentionally excluded — skills and config should not be synced to the remote machine.

### SCP (alternative to rsync)

SCP also reuses the persistent session automatically via ControlMaster:

```bash
scp /local/script.sh ${REMOTE_USER}@${REMOTE_HOST}:/remote/path/
```
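The base64 transport can be demonstrated locally. This is a sketch of the mechanism, not the helper's exact code — over SSH the decode-and-execute step would run on the remote side:

```bash
# A command full of characters that normally need escaping:
cmd='echo "100% \$afe"'
# Encode locally; remote_run would ship this string over ssh ...
encoded=$(printf '%s' "$cmd" | base64 | tr -d '\n')
# ... and the remote side decodes and executes it intact:
printf '%s' "$encoded" | base64 -d | bash   # prints: 100% $afe
```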
---
## 5. The Two-Script Pattern

When submitting SLURM jobs remotely, write **two files** locally to avoid shell-escaping issues:

1. **SLURM wrapper** (e.g., `job_slurm.sh`) — `#SBATCH` directives + `srun` with the container
2. **Inner runner** (e.g., `run.sh`) — the actual work (runs inside the container)

Then upload both and submit:

```bash
remote_sync_to /local/scripts/ scripts/
JOBID=$(remote_run "sbatch /remote/path/scripts/job_slurm.sh" | grep -o '[0-9]\+' | tail -1)
```
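The two files can be generated with quoted heredocs so nothing is expanded locally. A sketch with hypothetical account, partition, container, and PTQ arguments — substitute real values for your cluster:

```bash
# Write the SLURM wrapper and the inner runner (all values are placeholders).
mkdir -p scripts
cat > scripts/job_slurm.sh <<'EOF'
#!/bin/bash
#SBATCH --account=my_account
#SBATCH --partition=batch_short
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
srun --container-image=/path/to/container.sqsh bash run.sh
EOF
cat > scripts/run.sh <<'EOF'
#!/bin/bash
set -e
python hf_ptq.py --pyt_ckpt_path "$MODEL" --qformat nvfp4
EOF
echo "wrote scripts/job_slurm.sh and scripts/run.sh"
```
The quoted `'EOF'` delimiters keep `$MODEL` and the `#SBATCH` lines literal, so the scripts arrive on the remote machine exactly as written.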
---
## 6. Verifying Results Remotely

```bash
remote_run "ls -lh <output_path>/"
remote_run "cat <output_path>/hf_quant_config.json"
```

Or fetch the results to local:

```bash
remote_sync_from <remote_output_subdir> /local/output/
```
---
## 7. Troubleshooting

| Problem | Cause | Fix |
| ------- | ----- | --- |
| `Connection timed out during banner exchange` | Proxy/login node overloaded | `remote_run` retries 3x automatically; use the persistent session to avoid it |
| SSH proxy completely unreachable (`Network is unreachable`) | VPN/proxy host is down or not running on this machine | Check whether the VPN is connected; verify the `socat`/proxy service is running locally; try direct SSH by temporarily removing `ssh_proxy` from the config |
| `unix_listener: cannot bind to path ... Read-only file system` | SSH ControlMaster socket in non-writable `/tmp` | `remote_exec.sh` auto-finds a writable dir; ensure `TMPDIR` or `/tmp/claude-*` exists |
| `cd: /home/user/~/path: No such file or directory` | `~` not expanding on the remote | Use absolute paths in the `workspace` config, not `~/...` |
| Login nodes resolve home dirs differently | Symlinked home dirs vary by node | Use absolute Lustre/NFS paths (e.g., `/lustre/fs1/...`) in job scripts |
| `#!` becomes `#\!` in scripts | Shell environment mangles the shebang | Fix with `sed -i 's\|^#\\!\|#!\|' script.sh` after writing |
## Reference Files

- **`skills/common/remote_exec.sh`** — Full utility library (session, run, sync, SLURM, Docker helpers)
- **`.claude/clusters.yaml`** — Active cluster configuration
- **`.claude/clusters.yaml.example`** — Annotated example config
