Commit 2732ff4

add a debug loop in deployment skills for unsupported models

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

1 parent 82cf851

File tree

2 files changed, +67 -0 lines

.claude/skills/deployment/SKILL.md

Lines changed: 4 additions & 0 deletions

@@ -222,6 +222,10 @@ For NEL-managed deployment (evaluation with self-deployment), use the evaluation

| `Connection refused` on health check | Server still starting | Wait 30-60s for large models; check logs for errors |
| `modelopt_fp4 not supported` | Framework doesn't support FP4 for this model | Check support matrix in `references/support-matrix.md` |
## Unsupported Models

If the model is not in the validated support matrix (`references/support-matrix.md`), deployment may fail due to weight key mismatches, missing architecture mappings, or quantized/unquantized layer confusion. Read `references/unsupported-models.md` for the iterative debug loop: **run → read error → diagnose → patch framework source → re-run**. For kernel-level issues, escalate to the framework team rather than attempting fixes.

## Success Criteria

1. Server process is running and healthy (`/health` returns 200)
Lines changed: 63 additions & 0 deletions

@@ -0,0 +1,63 @@

# Deploying Unsupported Models

When deploying a model not in the validated support matrix (`references/support-matrix.md`), expect failures. This guide covers the iterative debug loop for getting unsupported models running on vLLM, SGLang, or TRT-LLM.

## Step 1 — Run and collect the error

Submit the deployment job. When it fails, read the full log — focus on the **first** error traceback (not "See root cause above" wrappers). Identify the file and line number in the framework source.
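A small `awk` filter can isolate the first traceback from a noisy log. The log contents below are a fabricated stand-in for a real failure log; the path is hypothetical:

```shell
# Stand-in failure log; in practice this is your deployment job's log
cat > /tmp/deploy.log <<'EOF'
INFO starting server
Traceback (most recent call last):
  File "model_loader.py", line 88, in load_weights
KeyError: 'model.language_model.layers.0.self_attn.q_proj.weight'

Traceback (most recent call last):
  File "wrapper.py", line 12, in main
RuntimeError: See root cause above
EOF

# Print only the FIRST traceback: start printing at the first
# "Traceback" line and stop at the next blank line
FIRST_ERR=$(awk '/Traceback/{p=1} p && /^$/{exit} p' /tmp/deploy.log)
echo "$FIRST_ERR"
```

This surfaces the `KeyError` with its file and line number while skipping the later "See root cause above" wrapper.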
## Step 2 — Diagnose the root cause

Fetch the framework source at the failing line (use `gh api` for the tagged version, or `find` inside the container). Common error categories:
| Category | Symptoms | Examples |
|----------|----------|----------|
| **Weight key mismatch** | `KeyError`, `Unexpected key`, `Missing key` during weight loading | Checkpoint uses `model.language_model.layers.*` but framework expects `model.layers.*`. See [vllm#39406](https://github.com/vllm-project/vllm/pull/39406) |
| **Quantized/unquantized layer confusion** | Wrong layer type loaded, dtype errors, shape mismatches | Framework tries to load unquantized layers with FP4 kernel due to overly broad `quantization_config.ignore` patterns or missing ignore entries. See [sglang#18937](https://github.com/sgl-project/sglang/pull/18937) |
| **Missing architecture support** | `NoneType is not iterable`, `KeyError` on model type, unknown architecture | Framework's model handler doesn't recognize the text backbone type (e.g., `ministral3` not handled in vLLM's `mistral3.py` init). Fix: extend the model type mapping |
| **Transformers version mismatch** | `ImportError`, `KeyError` on config fields | Framework ships with an older transformers that doesn't know the model type. Fix: upgrade transformers after installing the framework |
| **Kernel-level issues** | CUDA errors, `triton` import failures, unsupported ops | Framework lacks kernel support for this model + quantization combo |
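Locating the source file named in the traceback and dumping context around the failing line can look like the sketch below. The file tree here is a stand-in created for illustration; in a real container you would point `find` at `/usr/local/lib`:

```shell
# Recreate a tiny stand-in for the container's installed framework tree
mkdir -p /tmp/site-packages/vllm/model_executor/models
cat > /tmp/site-packages/vllm/model_executor/models/mistral3.py <<'EOF'
class Mistral3ForConditionalGeneration:
    def __init__(self, config):
        # failing line from the traceback: unknown backbone type
        backbones = {"mistral": "MistralModel"}
        self.text_arch = backbones[config["text_model_type"]]
EOF

# Locate the file named in the traceback, then print context around
# the failing line (here, lines 1-10)
FILE=$(find /tmp/site-packages -path "*/vllm/model_executor/models/mistral3.py" 2>/dev/null | head -1)
sed -n '1,10p' "$FILE"
```

Reading a few lines of context is usually enough to tell which category in the table above you are dealing with.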
## Step 3 — Apply a targeted fix

Focus on **small, targeted patches** to the framework source. Do not modify `config.json` or the checkpoint — fix the framework's handling instead.

### Weight key mismatches and architecture mapping gaps

Patch the framework source in the run script using `sed` or a Python one-liner. Keep patches minimal — change only what's needed to unblock the current error.
```bash
# Example: extend model type mapping in vLLM mistral3.py
FRAMEWORK_FILE=$(find /usr/local/lib -path "*/vllm/model_executor/models/mistral3.py" 2>/dev/null | head -1)
# Guard against an empty match so sed doesn't fail with a confusing error
[ -n "${FRAMEWORK_FILE}" ] || { echo "mistral3.py not found" >&2; exit 1; }
sed -i 's/old_pattern/new_pattern/' "${FRAMEWORK_FILE}"
```
> **Tip**: when locating framework source files inside containers, use `find` instead of Python import — some frameworks print log messages to stdout during import that can corrupt captured paths.
### Quantized/unquantized layer confusion

Check `hf_quant_config.json` ignore patterns against the framework's weight-loading logic. The framework may try to load layers listed in `ignore` with quantized kernels, or vice versa. Fix by adjusting the framework's layer-filtering logic.
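A quick way to check a layer name against the ignore patterns is sketched below. The key names and the glob-style `fnmatch` matching are assumptions for illustration; check your checkpoint's actual schema and the framework's real matching logic, which may differ:

```shell
# Sample quantization config; key names are illustrative only
cat > /tmp/hf_quant_config.json <<'EOF'
{"quantization": {"quant_algo": "NVFP4", "ignore": ["lm_head", "model.layers.*.mlp.gate"]}}
EOF

# Does a given layer match an ignore pattern (i.e. should it be
# loaded UNquantized)? Uses glob matching as an approximation.
VERDICT=$(python3 - <<'PY'
import fnmatch, json
cfg = json.load(open("/tmp/hf_quant_config.json"))
patterns = cfg["quantization"]["ignore"]
layer = "model.layers.0.mlp.gate"
hit = any(fnmatch.fnmatch(layer, p) for p in patterns)
print("unquantized" if hit else "quantized")
PY
)
echo "model.layers.0.mlp.gate -> load as $VERDICT"
```

If the framework's verdict for a failing layer disagrees with a check like this, the layer-filtering logic is the place to patch.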
### Kernel-level issues

These require framework kernel team involvement. Do NOT attempt to patch kernels. Instead:

1. Document the exact error (model, format, framework version, GPU type)
2. Inform the user: *"This model + quantization combination requires kernel support that isn't available in {framework} v{version}. I'd suggest reaching out to the {framework} kernel team or trying a different framework."*
3. Suggest trying an alternative framework (vLLM → SGLang → TRT-LLM)
## Step 4 — Re-run and iterate

After applying a fix, resubmit the job. Each iteration may reveal a new error (e.g., fixing the init error exposes a weight-loading error). Continue the loop: **run → read error → diagnose → patch → re-run**.

Typical iteration count: 1-3 for straightforward fixes, 3-5 for models requiring multiple patches.
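The loop, including the escalation cap from Step 5 below, can be sketched as shell pseudocode; `deploy_once` is a hypothetical stand-in for the real submit-and-health-check:

```shell
attempt=0
deploy_once() {
  # Stand-in for: submit job, wait, curl the /health endpoint.
  # Here it "succeeds" on the 3rd call to simulate two rounds of patching.
  attempt=$((attempt + 1))
  [ "$attempt" -ge 3 ]
}

until deploy_once; do
  echo "iteration $attempt failed: read log, diagnose, patch, re-run"
  if [ "$attempt" -ge 5 ]; then
    echo "5+ iterations without a healthy server: stop and escalate"
    break
  fi
done
echo "server healthy after $attempt iterations"
```

The point of the cap is to bound the debug loop: past five iterations, switching frameworks or escalating is usually cheaper than another patch.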
## Step 5 — Know when to stop

**Stop patching and escalate** when:

- The error is in compiled CUDA kernels or triton ops (not Python-level)
- The fix requires changes to core framework abstractions (not just model handlers)
- You've done 5+ iterations without the server starting

In these cases, inform the user and suggest: trying a different framework, checking for a newer framework version, or filing an issue with the framework team.
