Commit 11b7efb

Address review comments on deployment unsupported-models guide
- Fix relative path: references/support-matrix.md → support-matrix.md (file is a sibling, not nested) [Copilot]
- Add --load-format dummy tip for skipping weight loading during debug iterations [mxinO]
- Add VLLM_USE_PRECOMPILED=1 pip install --editable . tip for fast Python-only rebuilds when patching vLLM source [mxinO]

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent 2732ff4 commit 11b7efb

File tree: 1 file changed (+8 −1 lines)


.claude/skills/deployment/references/unsupported-models.md

Lines changed: 8 additions & 1 deletion
```diff
@@ -1,6 +1,6 @@
 # Deploying Unsupported Models
 
-When deploying a model not in the validated support matrix (`references/support-matrix.md`), expect failures. This guide covers the iterative debug loop for getting unsupported models running on vLLM, SGLang, or TRT-LLM.
+When deploying a model not in the validated support matrix (`support-matrix.md`), expect failures. This guide covers the iterative debug loop for getting unsupported models running on vLLM, SGLang, or TRT-LLM.
 
 ## Step 1 — Run and collect the error
 
@@ -34,6 +34,13 @@ sed -i 's/old_pattern/new_pattern/' "${FRAMEWORK_FILE}"
 
 > **Tip**: when locating framework source files inside containers, use `find` instead of Python import — some frameworks print log messages to stdout during import that can corrupt captured paths.
 
+### Speeding up debug iterations (vLLM)
+
+When iterating on fixes, use these flags to shorten the feedback loop:
+
+- **`--load-format dummy`** — skip loading actual model weights. Useful for testing whether the model initializes, config is parsed correctly, and weight keys match without waiting for the full checkpoint load.
+- **`VLLM_USE_PRECOMPILED=1 pip install --editable .`** — when patching vLLM source directly (instead of `sed`), this rebuilds only Python code without recompiling C++/CUDA extensions.
+
 ### Quantized/unquantized layer confusion
 
 Check `hf_quant_config.json` ignore patterns against the framework's weight loading logic. The framework may try to load layers listed in `ignore` with quantized kernels, or vice versa. Fix by adjusting the framework's layer filtering logic.
```
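For context, the tips this commit adds could be combined into a debug loop along these lines. This is a sketch, not part of the commit: the model name, container path, and sed patterns are placeholders, though `--load-format dummy` and `VLLM_USE_PRECOMPILED` are documented vLLM options.

```shell
# Hypothetical vLLM debug loop; model name and paths are placeholders.

# Locate the framework source file with `find` rather than Python import,
# since import-time log output can corrupt a captured path:
FRAMEWORK_FILE=$(find /usr/local/lib -path "*vllm*" -name "mymodel.py" | head -n 1)

# Patch it in place (same sed approach the guide uses):
sed -i 's/old_pattern/new_pattern/' "${FRAMEWORK_FILE}"

# Smoke-test model init, config parsing, and weight-key matching
# without waiting for a real checkpoint load:
vllm serve my-org/my-model --load-format dummy

# If patching a vLLM source checkout instead, reinstall Python code only,
# skipping the C++/CUDA extension build:
VLLM_USE_PRECOMPILED=1 pip install --editable .
```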
