Commit f4e4222

Add remote support for the evaluation skill
Signed-off-by: Kai Xu <kaix@nvidia.com>
1 parent 5691456 commit f4e4222

2 files changed

Lines changed: 57 additions & 4 deletions

.claude/skills/deployment/SKILL.md

Lines changed: 52 additions & 0 deletions
@@ -1,6 +1,7 @@
 ---
 name: deployment
 description: Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint.
+license: Apache-2.0
 ---

 # Deployment Skill
@@ -191,6 +192,57 @@ python -m vllm.benchmark_serving \

 Report: throughput (tok/s), latency p50/p99, time to first token (TTFT).

+### 7. Remote deployment (SSH/SLURM)
+
+If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:
+
+1. **Source remote utilities:**
+
+```bash
+source .claude/skills/common/remote_exec.sh
+remote_load_cluster
+remote_check_ssh
+remote_detect_env
+```
+
+2. **Sync the checkpoint** (if it was produced locally):
+
+```bash
+remote_sync_to <local_checkpoint_path> checkpoints/
+```
+
+3. **Deploy based on remote environment:**
+
+- **SLURM**: write a job script that starts the server inside a container, then submit it:
+
+```bash
+srun --container-image="<container.sqsh>" \
+  --container-mounts="<data_root>:<data_root>" \
+  python -m vllm.entrypoints.openai.api_server \
+  --model <remote_checkpoint_path> \
+  --quantization modelopt \
+  --host 0.0.0.0 --port 8000
+```
+
+Use `remote_submit_job` and `remote_poll_job` to manage the job. The server runs on the allocated node; get its hostname with `squeue -j "$JOBID" -h -o %N` (`-h` suppresses the header line).
+
+- **Bare metal / Docker**: use `remote_run` to start the server directly:
+
+```bash
+remote_run "nohup python -m vllm.entrypoints.openai.api_server --model <path> --port 8000 > deploy.log 2>&1 &"
+```
+
+4. **Verify remotely:**
+
+```bash
+remote_run "curl -s http://localhost:8000/health"
+remote_run "curl -s http://localhost:8000/v1/models"
+```
+
+5. **Report the endpoint**: include the remote hostname and port so the user can connect (e.g., `http://<node_hostname>:8000`). For SLURM, note that the port is only reachable from within the cluster network.
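
Steps 3 to 5 of the SLURM path can be strung together roughly as follows. This is a minimal sketch and not part of the commit: `endpoint_from_nodelist` and `wait_until` are hypothetical helper names that do not come from `remote_exec.sh`. It assumes a single-node job, so that `squeue -h -o %N` yields one plain hostname, and it treats an empty value as "job not running yet".

```bash
# Hypothetical helpers; neither is defined in remote_exec.sh.

# Print http://<host>:<port> for a single-node SLURM job.
# Returns 1 if the node list is empty (job not running yet) and
# 2 if it looks like a multi-node list such as node[01-02] or a,b.
endpoint_from_nodelist() {
    local nodelist="$1"
    local port="${2:-8000}"
    [ -n "$nodelist" ] || return 1
    case "$nodelist" in
        *"["*|*,*) return 2 ;;
    esac
    printf 'http://%s:%s\n' "$nodelist" "$port"
}

# Retry a command until it succeeds.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
wait_until() {
    local attempts="$1"
    local delay="$2"
    local i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Usage sketch (needs a live cluster, so it is left commented out).
# It assumes remote_run forwards the remote command's stdout and exit status.
#   node=$(remote_run "squeue -j $JOBID -h -o %N")
#   url=$(endpoint_from_nodelist "$node" 8000) || echo "job is not running on a single node yet"
#   wait_until 30 10 remote_run "curl -sf $url/health" && echo "Endpoint ready: $url"
```

Polling against the node's hostname rather than `localhost` matters when `remote_run` executes on a login node while the server runs on a compute node.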
+
+For NEL-managed deployment (evaluation with self-deployment), use the evaluation skill instead: NEL handles SLURM container deployment, health checks, and teardown automatically.
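
Because a SLURM-hosted port is only reachable from inside the cluster network (step 5), a user outside the cluster would typically need an SSH port forward through a login host. The helper below is a hedged sketch: `tunnel_command` is a hypothetical name, `login.example.org` and `gpu-node-07` are placeholders, and it assumes the site permits SSH port forwarding and that the login host can reach the compute node.

```bash
# Hypothetical helper: print an ssh port-forward command for a SLURM-hosted server.
# Usage: tunnel_command <login_host> <compute_node> [remote_port] [local_port]
tunnel_command() {
    local login_host="$1"
    local node="$2"
    local remote_port="${3:-8000}"
    local local_port="${4:-$remote_port}"
    # -N: no remote command; -L: forward local_port to node:remote_port via login_host.
    printf 'ssh -N -L %s:%s:%s %s\n' "$local_port" "$node" "$remote_port" "$login_host"
}

tunnel_command login.example.org gpu-node-07
# prints: ssh -N -L 8000:gpu-node-07:8000 login.example.org
```

With the tunnel running, the endpoint would be reachable locally at `http://localhost:8000`.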
+
 ## Error Handling

 | Error | Cause | Fix |

.claude/skills/modelopt/SKILL.md

Lines changed: 5 additions & 4 deletions
@@ -34,11 +34,12 @@ If unclear, ask: **"After quantization, do you want to (a) deploy the model as a

 Collect from the user (skip what's already provided):

-1. **Model path**: local path or HuggingFace model ID
+1. **Model path**: local path or HuggingFace model ID (save this for baseline comparison in Step 4)
 2. **Quantization format**: e.g., fp8, nvfp4, int4_awq (or "recommend one")
-3. **GPU IDs**: which GPUs to use (default: `0`)
-4. For Deploy pipeline: **Deployment framework**: vLLM, SGLang, or TRT-LLM (default: vLLM)
-5. For Evaluate pipeline: **Evaluation tasks** (default: `mmlu`)
+3. **Execution target**: local GPU or remote cluster. Check for `~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`. If found, ask which cluster to use. Both sub-skills support remote execution via `remote_exec.sh`.
+4. **GPU IDs**: which GPUs to use (default: `0`; skip if remote, since sub-skills handle GPU allocation via SLURM)
+5. For Deploy pipeline: **Deployment framework**: vLLM, SGLang, or TRT-LLM (default: vLLM)
+6. For Evaluate pipeline: **Evaluation tasks** (default: `mmlu`)
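
The execution-target check in item 3 could be sketched as below. `find_cluster_config` is a hypothetical helper, not something the commit defines, and checking the user-level file before the project-level one is an assumption: the skill text lists both paths without stating a precedence.

```bash
# Hypothetical helper: print the first cluster config that exists, or return 1.
# The lookup order (user-level file first) is an assumption, not stated in the skill.
find_cluster_config() {
    local candidate
    for candidate in "$HOME/.config/modelopt/clusters.yaml" ".claude/clusters.yaml"; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

if config=$(find_cluster_config); then
    echo "Cluster config found at $config: ask which cluster to use."
else
    echo "No cluster config found: assume local GPU execution unless the user asks for a remote machine."
fi
```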

 ## Step 2: Quantize
