Commit f539c03

Add Agent Deployment skill for model serving (#1133)
### What does this PR do?

**Type of change:** New feature

Add a Claude Code skill for serving quantized or unquantized LLM checkpoints as OpenAI-compatible API endpoints. Supports vLLM, SGLang, and TRT-LLM (including AutoDeploy).

#### Skill structure

```
deployment/
├── SKILL.md                          Decision workflow (identify → framework → env → deploy → verify)
├── references/
│   ├── vllm.md                       vLLM-specific deployment details
│   ├── sglang.md                     SGLang-specific details
│   ├── trtllm.md                     TRT-LLM / AutoDeploy details
│   ├── support-matrix.md             Model/format/framework compatibility
│   └── setup.md                      Framework installation guides
├── scripts/
│   └── deploy.sh                     Local server lifecycle (start/stop/restart/status/test/detect)
└── evals/
    ├── vllm-fp8-local.json           Happy path: local vLLM deployment
    └── remote-slurm-deployment.json  Remote SLURM deployment
```

#### Features

- **Auto-detect quantization format** from `hf_quant_config.json` with `config.json` fallback
- **PTQ checkpoint discovery** - find checkpoints from prior PTQ runs in common output directories
- **Framework recommendation** based on model/format/use case (vLLM for general, SGLang for DeepSeek/Llama 4, TRT-LLM for max throughput)
- **GPU memory estimation** and tensor parallelism guidance
- **Server lifecycle management** via `deploy.sh` (start/stop/restart/status/detect)
- **Health check verification** and API testing
- **Remote deployment** via SSH/SLURM (using common `remote_exec.sh`, with smart checkpoint sync)
- **HF-format checkpoint assumption** documented — TRT-LLM format has separate path
- **Optional benchmarking** for throughput/latency metrics

**Depends on**: #1107 (common files: `remote_exec.sh`, `workspace-management.md`, `environment-setup.md`)

### Testing

Invoke in Claude Code:

```
claude -p "deploy outputs/Qwen3-0.6B-FP8 with vLLM"
```

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: N/A (new feature)
- If you copied code from any other sources, did you follow guidance in `CONTRIBUTING.md`: ✅
- Did you write any new necessary tests?: N/A (skill evals provided separately)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **New Features**
  * Added a CLI to deploy, manage, and test OpenAI-compatible inference servers (start/stop/status/test/restart/detect) across vLLM, SGLang, and TensorRT-LLM, with automatic quantization detection, GPU checks, readiness polling, and basic orchestration.
* **Documentation**
  * Comprehensive deployment guides: framework installation, quantization support matrix, AutoDeploy/AutoQuant notes, SLURM and Docker examples, benchmarking, verification, and troubleshooting.
* **Tests**
  * New evaluation scenarios for local (FP8) and remote (SLURM) deployments.
* **Chores**
  * Updated markdown linting configuration.

---

Signed-off-by: Kai Xu <kaix@nvidia.com>
1 parent 9aab38c commit f539c03

9 files changed

Lines changed: 1336 additions & 0 deletions

File tree

.claude/skills/deployment/SKILL.md

Lines changed: 231 additions & 0 deletions
---
name: deployment
description: Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint. Do NOT use for quantizing models (use ptq) or evaluating accuracy (use evaluation).
license: Apache-2.0
---

# Deployment Skill

Serve a model checkpoint as an OpenAI-compatible inference endpoint. Supports vLLM, SGLang, and TRT-LLM (including AutoDeploy).

## Quick Start

Prefer `scripts/deploy.sh` for standard local deployments — it handles quant detection, health checks, and server lifecycle. Use the raw framework commands in Step 4 when you need flags the script doesn't support, or for remote deployment.

```bash
# Start vLLM server with a ModelOpt checkpoint
scripts/deploy.sh start --model ./qwen3-0.6b-fp8

# Start with SGLang and tensor parallelism
scripts/deploy.sh start --model ./llama-70b-nvfp4 --framework sglang --tp 4

# Start from HuggingFace hub
scripts/deploy.sh start --model nvidia/Llama-3.1-8B-Instruct-FP8

# Test the API
scripts/deploy.sh test

# Check status
scripts/deploy.sh status

# Stop
scripts/deploy.sh stop
```

The script handles: GPU detection, quantization flag auto-detection (FP8 vs FP4), server lifecycle (start/stop/restart/status), health check polling, and API testing.

## Decision Flow

### 0. Check workspace (multi-user / Slack bot)

If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check for existing ones — especially if deploying a checkpoint from a prior PTQ run:

```bash
ls "$MODELOPT_WORKSPACE_ROOT/" 2>/dev/null
```

If the user says "deploy the model I just quantized" or references a previous PTQ, find the matching workspace and `cd` into it. The checkpoint should be in that workspace's output directory.

### 1. Identify the checkpoint

Determine what the user wants to deploy:

- **Local quantized checkpoint** (from ptq skill or manual export): look for `hf_quant_config.json` in the directory. If coming from a prior PTQ run in the same workspace, check common output locations: `output/`, `outputs/`, `exported_model/`, or the `--export_path` used in the PTQ command (see the discovery sketch after this list).
- **HuggingFace model hub** (e.g., `nvidia/Llama-3.1-8B-Instruct-FP8`): use directly
- **Unquantized model**: deploy as-is (BF16) or suggest quantizing first with the ptq skill
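
A minimal discovery sketch for the local-checkpoint case, assuming the common output locations above (adjust the search roots if the PTQ run used a different `--export_path`):

```bash
# Sketch: list ModelOpt-exported checkpoints from prior PTQ runs, newest first.
# The searched directories are assumptions based on the common locations above.
for dir in output outputs exported_model .; do
    find "$dir" -maxdepth 2 -name hf_quant_config.json 2>/dev/null
done | xargs -r ls -t | xargs -r -n1 dirname
```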

> **Note:** This skill expects HF-format checkpoints (from PTQ with `--export_fmt hf`). TRT-LLM format checkpoints should be deployed directly with TRT-LLM — see `references/trtllm.md`.

Check the quantization format if applicable:

```bash
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null || echo "No hf_quant_config.json"
```

If not found, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither exists, the checkpoint is unquantized.
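
A sketch of this detection order (key names are assumptions based on typical ModelOpt HF exports; `scripts/deploy.sh detect` performs a similar check):

```bash
# Sketch: report the quantization format, falling back to config.json.
# Key names are assumptions from typical ModelOpt HF exports; verify on your checkpoint.
CKPT=<checkpoint_path>
if [ -f "$CKPT/hf_quant_config.json" ]; then
    grep -o '"quant_algo"[^,}]*' "$CKPT/hf_quant_config.json" || cat "$CKPT/hf_quant_config.json"
elif grep -q '"quant_method": *"modelopt"' "$CKPT/config.json" 2>/dev/null; then
    echo "quantized (quantization_config in config.json)"
else
    echo "unquantized checkpoint"
fi
```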

### 2. Choose the framework

If the user hasn't specified a framework, recommend based on this priority:

| Situation | Recommended | Why |
|-----------|-------------|-----|
| General use | **vLLM** | Widest ecosystem, easy setup, OpenAI-compatible |
| DeepSeek / Llama 4 models | **SGLang** | Strong DeepSeek/Llama 4 support |
| Maximum optimization | **TRT-LLM** | Best throughput via engine compilation |
| Mixed-precision / AutoQuant | **TRT-LLM AutoDeploy** | Only option for AutoQuant checkpoints |

Check the support matrix in `references/support-matrix.md` to confirm the model + format + framework combination is supported.

### 3. Check the environment

Read `skills/common/environment-setup.md` for GPU detection, local vs remote, and SLURM/Docker/bare metal detection. After completing it you should know: GPU model/count, local or remote, and execution environment.

Then check that the **deployment framework** is installed:

```bash
python -c "import vllm; print(f'vLLM {vllm.__version__}')" 2>/dev/null || echo "vLLM not installed"
python -c "import sglang; print(f'SGLang {sglang.__version__}')" 2>/dev/null || echo "SGLang not installed"
python -c "import tensorrt_llm; print(f'TRT-LLM {tensorrt_llm.__version__}')" 2>/dev/null || echo "TRT-LLM not installed"
```

If not installed, consult `references/setup.md`.

**GPU memory estimate** (to determine tensor parallelism):

- BF16: `params × 2 bytes` (8B ≈ 16 GB)
- FP8: `params × 1 byte` (8B ≈ 8 GB)
- FP4: `params × 0.5 bytes` (8B ≈ 4 GB)
- Add ~2-4 GB for KV cache and framework overhead

If the model exceeds single-GPU memory, use tensor parallelism (`--tp <num_gpus>`).
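
A rough sizing sketch based on the rules of thumb above (the parameter count and per-GPU memory are inputs you supply; treat the result as a starting point, not a hard limit):

```bash
# Sketch: estimate required memory and a tensor-parallel degree from the estimates above.
PARAMS_B=70        # model size in billions of parameters
BYTES_PER_PARAM=1  # 2 = BF16, 1 = FP8, 0.5 = FP4
GPU_MEM_GB=80      # per-GPU memory (e.g., H100 80GB)
awk -v p="$PARAMS_B" -v b="$BYTES_PER_PARAM" -v g="$GPU_MEM_GB" 'BEGIN {
    need = p * b + 4                          # weights + ~4 GB KV cache / overhead
    tp = 1; while (tp * g < need) tp *= 2     # round TP up to a power of two
    printf "~%.0f GB needed; try --tp %d\n", need, tp
}'
```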

### 4. Deploy

Read the framework-specific reference for detailed instructions:

| Framework | Reference file |
|-----------|---------------|
| vLLM | `references/vllm.md` |
| SGLang | `references/sglang.md` |
| TRT-LLM | `references/trtllm.md` |

**Quick-start commands** (for common cases):

#### vLLM

```bash
# Serve as OpenAI-compatible endpoint
python -m vllm.entrypoints.openai.api_server \
    --model <checkpoint_path> \
    --quantization modelopt \
    --tensor-parallel-size <num_gpus> \
    --host 0.0.0.0 --port 8000
```

For NVFP4 checkpoints, use `--quantization modelopt_fp4`.

#### SGLang

```bash
python -m sglang.launch_server \
    --model-path <checkpoint_path> \
    --quantization modelopt \
    --tp <num_gpus> \
    --host 0.0.0.0 --port 8000
```

#### TRT-LLM (direct)

```python
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model="<checkpoint_path>")
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, top_p=0.95))
```

#### TRT-LLM AutoDeploy

For AutoQuant or mixed-precision checkpoints, see `references/trtllm.md`.

### 5. Verify the deployment

After the server starts, verify it's healthy:

```bash
# Health check
curl -s http://localhost:8000/health

# List models
curl -s http://localhost:8000/v1/models | python -m json.tool

# Test generation
curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "<model_name>",
        "prompt": "The capital of France is",
        "max_tokens": 32
    }' | python -m json.tool
```

All checks must pass before reporting success to the user.
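
Large models can take a minute or more to load, so poll the health endpoint rather than failing on the first refused connection (this mirrors the health-check polling in `scripts/deploy.sh`; the 5-minute timeout is an arbitrary choice):

```bash
# Sketch: poll /health every 5s for up to 5 minutes before declaring failure.
for i in $(seq 1 60); do
    curl -sf http://localhost:8000/health > /dev/null && { echo "server ready"; break; }
    sleep 5
done
curl -sf http://localhost:8000/health > /dev/null || echo "server not ready after 300s; check the logs"
```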

### 6. Remote deployment (SSH/SLURM)

If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:

1. **Source remote utilities:**

   ```bash
   source .claude/skills/common/remote_exec.sh
   remote_load_cluster
   remote_check_ssh
   remote_detect_env
   ```

2. **Sync the checkpoint** (only if it was produced locally):

   If the checkpoint path is a remote/absolute path (e.g., from a prior PTQ run on the cluster), skip sync — it's already there. Verify with `remote_run "ls <checkpoint_path>/config.json"`. Only sync if the checkpoint is local:

   ```bash
   remote_sync_to <local_checkpoint_path> checkpoints/
   ```

3. **Deploy based on remote environment:**

   - **SLURM** — see `skills/common/slurm-setup.md` for job script templates (container setup, account/partition discovery). The server command inside the container is the same as Step 4 (e.g., `python -m vllm.entrypoints.openai.api_server --model <path> --quantization modelopt`). Use `remote_submit_job` and `remote_poll_job` to manage the job. Get the node hostname from `squeue -j $JOBID -o %N`.

   - **Bare metal / Docker** — use `remote_run` to start the server directly:

     ```bash
     remote_run "nohup python -m vllm.entrypoints.openai.api_server --model <path> --port 8000 > deploy.log 2>&1 &"
     ```

4. **Verify remotely:**

   ```bash
   remote_run "curl -s http://localhost:8000/health"
   remote_run "curl -s http://localhost:8000/v1/models"
   ```

5. **Report the endpoint** — include the remote hostname and port so the user can connect (e.g., `http://<node_hostname>:8000`). For SLURM, note that the port is only reachable from within the cluster network.

For NEL-managed deployment (evaluation with self-deployment), use the evaluation skill instead — NEL handles SLURM container deployment, health checks, and teardown automatically.

## Error Handling

| Error | Cause | Fix |
|-------|-------|-----|
| `CUDA out of memory` | Model too large for GPU(s) | Increase `--tensor-parallel-size` or use a smaller model |
| `quantization="modelopt" not recognized` | vLLM/SGLang version too old | Upgrade: vLLM >= 0.10.1, SGLang >= 0.4.10 |
| `hf_quant_config.json not found` | Not a ModelOpt-exported checkpoint | Re-export with `export_hf_checkpoint()`, or remove `--quantization` flag |
| `Connection refused` on health check | Server still starting | Wait 30-60s for large models; check logs for errors |
| `modelopt_fp4 not supported` | Framework doesn't support FP4 for this model | Check support matrix in `references/support-matrix.md` |

## Success Criteria

1. Server process is running and healthy (`/health` returns 200)
2. Model is listed at `/v1/models`
3. Test generation produces coherent output
4. Server URL and port are reported to the user
5. If benchmarking was requested, throughput/latency numbers are reported (a crude probe is sketched below)
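
For criterion 5, prefer the framework's own benchmarking tools for reportable numbers (see the framework references); as a crude sanity probe against the running endpoint, something like the following works, assuming the server from Step 5 is still up:

```bash
# Sketch: rough throughput probe with 32 concurrent completion requests.
# Not a substitute for framework-native benchmarking tools.
START=$(date +%s)
seq 32 | xargs -P 32 -I{} curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "<model_name>", "prompt": "The capital of France is", "max_tokens": 64}' \
    -o /dev/null
END=$(date +%s)
echo "32 requests x 64 max tokens in $((END - START))s"
```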
Lines changed: 106 additions & 0 deletions
# Deployment Environment Setup

## Framework Installation

### vLLM

```bash
pip install vllm
```

Minimum version: 0.10.1

### SGLang

```bash
pip install "sglang[all]"
```

Minimum version: 0.4.10

### TRT-LLM

TRT-LLM is best installed via NVIDIA container:

```bash
docker pull nvcr.io/nvidia/tensorrt-llm/release:<version>
```

Or via pip (requires CUDA toolkit):

```bash
pip install tensorrt-llm
```

Minimum version: 0.17.0
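
To confirm an existing install meets these minimums, a quick check (a sketch; it only prints the versions each package reports so you can compare them against the minimums above):

```bash
# Sketch: print installed versions for comparison with the minimums listed above
# (vLLM >= 0.10.1, SGLang >= 0.4.10, TRT-LLM >= 0.17.0).
for pkg in vllm sglang tensorrt-llm; do
    ver=$(pip show "$pkg" 2>/dev/null | awk '/^Version:/ {print $2}')
    echo "$pkg: ${ver:-not installed}"
done
```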

## SLURM Deployment

For SLURM clusters, deploy inside a container. Container flags MUST be on the `srun` line:

```bash
#!/bin/bash
#SBATCH --job-name=deploy
#SBATCH --account=<account>
#SBATCH --partition=<partition>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<num_gpus>
#SBATCH --time=04:00:00
#SBATCH --output=deploy_%j.log

srun \
    --container-image="<path/to/container.sqsh>" \
    --container-mounts="<data_root>:<data_root>" \
    --container-workdir="<workdir>" \
    --no-container-mount-home \
    bash -c "python -m vllm.entrypoints.openai.api_server \
        --model <checkpoint_path> \
        --quantization modelopt \
        --tensor-parallel-size <num_gpus> \
        --host 0.0.0.0 --port 8000"
```

To access the server from outside the SLURM node, note the allocated hostname:

```bash
squeue -u $USER -o "%j %N %S"  # Get the node name
# Then SSH tunnel or use the node's hostname directly
```
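
One way to set up the tunnel (hostnames are placeholders; adapt to your cluster's SSH configuration, e.g., whether compute nodes are reachable only through a login host):

```bash
# Sketch: forward the server port from the compute node through the login host,
# then query the endpoint from your local machine.
ssh -N -L 8000:<node_hostname>:8000 <user>@<login_host> &
curl -s http://localhost:8000/v1/models
```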

## Docker Deployment

### Official Images (recommended)

| Framework | Image | Source |
|-----------|-------|--------|
| vLLM | `vllm/vllm-openai:latest` | <https://hub.docker.com/r/vllm/vllm-openai> |
| SGLang | `lmsysorg/sglang:latest` | <https://hub.docker.com/r/lmsysorg/sglang> |
| TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:latest` | <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/> |

Example with the official vLLM image:

```bash
docker run --gpus all -p 8000:8000 \
    -v /path/to/checkpoint:/model \
    vllm/vllm-openai:latest \
    --model /model \
    --quantization modelopt \
    --host 0.0.0.0 --port 8000
```

### Custom Image (optional)

A Dockerfile is also available at `examples/vllm_serve/Dockerfile` if you need a custom build:

```bash
docker build -f examples/vllm_serve/Dockerfile -t vllm-modelopt .

docker run --gpus all -p 8000:8000 \
    -v /path/to/checkpoint:/model \
    vllm-modelopt \
    python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --quantization modelopt \
    --host 0.0.0.0 --port 8000
```
