|
1 | 1 | --- |
2 | 2 | name: deployment |
3 | 3 | description: Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint. |
| 4 | +license: Apache-2.0 |
4 | 5 | --- |
5 | 6 |
|
6 | 7 | # Deployment Skill |
@@ -191,6 +192,57 @@ python -m vllm.benchmark_serving \ |
191 | 192 |
|
192 | 193 | Report: throughput (tok/s), latency p50/p99, time to first token (TTFT). |
193 | 194 |
|
| 195 | +### 7. Remote deployment (SSH/SLURM) |
| 196 | + |
| 197 | +If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine: |
| 198 | + |
| 199 | +1. **Source remote utilities:** |
| 200 | + |
| 201 | + ```bash |
| 202 | + source .claude/skills/common/remote_exec.sh |
| 203 | + remote_load_cluster |
| 204 | + remote_check_ssh |
| 205 | + remote_detect_env |
| 206 | + ``` |
| 207 | + |
| 208 | +2. **Sync the checkpoint** (if it was produced locally): |
| 209 | + |
| 210 | + ```bash |
| 211 | + remote_sync_to <local_checkpoint_path> checkpoints/ |
| 212 | + ``` |
| 213 | + |
| 214 | +3. **Deploy based on remote environment:** |
| 215 | + |
| 216 | + - **SLURM** — write a job script that starts the server inside a container, then submit: |
| 217 | + |
| 218 | + ```bash |
| 219 | + srun --container-image="<container.sqsh>" \ |
| 220 | + --container-mounts="<data_root>:<data_root>" \ |
| 221 | + python -m vllm.entrypoints.openai.api_server \ |
| 222 | + --model <remote_checkpoint_path> \ |
| 223 | + --quantization modelopt \ |
| 224 | + --host 0.0.0.0 --port 8000 |
| 225 | + ``` |
| 226 | + |
| 227 | + Use `remote_submit_job` and `remote_poll_job` to manage the job. The server runs on the allocated node; get its hostname with `squeue -j "$JOBID" -h -o %N` (`-h` suppresses the header line so only the node name is printed). |
| 228 | + |
| 229 | + - **Bare metal / Docker** — use `remote_run` to start the server directly: |
| 230 | + |
| 231 | + ```bash |
| 232 | + remote_run "nohup python -m vllm.entrypoints.openai.api_server --model <path> --port 8000 > deploy.log 2>&1 &" |
| 233 | + ``` |
| 234 | + |
| 235 | +4. **Verify remotely:** |
| 236 | + |
| 237 | + ```bash |
| 238 | + remote_run "curl -s http://localhost:8000/health" |
| 239 | + remote_run "curl -s http://localhost:8000/v1/models" |
| 240 | + ``` |
| 241 | + |
| 242 | +5. **Report the endpoint:** include the remote hostname and port so the user can connect (e.g., `http://<node_hostname>:8000`). For SLURM, note that the port is typically reachable only from within the cluster network, so a client outside the cluster needs a route in, such as an SSH tunnel through a login node. |
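
The verification in step 4 can fail simply because the server is still loading weights when `curl` runs. A minimal sketch of a retry wrapper is shown below. `wait_until_ready` is a hypothetical helper, not part of `remote_exec.sh`, and it assumes that the probe command (for example `remote_run`) returns the exit status of the command it runs.

```bash
# Hypothetical helper (not part of remote_exec.sh): retry a probe command
# until it exits 0 or the attempt budget is used up.
# Usage: wait_until_ready <max_attempts> <sleep_seconds> <command> [args...]
# Note: uses plain (global) shell variables to stay POSIX-compatible.
wait_until_ready() {
    max_attempts="$1"
    delay="$2"
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Run the probe quietly; only its exit status matters here.
        if "$@" > /dev/null 2>&1; then
            echo "ready after ${attempt} attempt(s)"
            return 0
        fi
        sleep "$delay"
        attempt=$((attempt + 1))
    done
    echo "not ready after ${max_attempts} attempt(s)" >&2
    return 1
}
```

Possible usage, under the same assumption about `remote_run`: `wait_until_ready 60 10 remote_run "curl -sf http://localhost:8000/health"`. The `-f` flag makes `curl` exit non-zero on an HTTP error status, so an error response counts as "not ready" rather than as success.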
| 243 | + |
| 244 | +For NEL-managed deployment (evaluation with self-deployment), use the evaluation skill instead — NEL handles SLURM container deployment, health checks, and teardown automatically. |
| 245 | + |
194 | 246 | ## Error Handling |
195 | 247 |
|
196 | 248 | | Error | Cause | Fix | |
|