
Commit edc3a9b

Merge origin/main into zhiyu/polish-eval-skills
Resolves conflict in .claude/skills/ptq/SKILL.md by keeping both additions: main's new "Post-quantization validation" subsection (with pointer to references/checkpoint-validation.md), followed by our existing "Next steps" cross-skill pointer.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
2 parents b0748dd + e9a4989 commit edc3a9b

File tree: 444 files changed (+41580, −2302 lines)


.claude/skills/common/slurm-setup.md

Lines changed: 125 additions & 0 deletions
@@ -206,3 +206,128 @@ chmod -R g+rwX /path/to/.hf_cache/
```

Scope `chmod` to only the directories the job needs — avoid world-writable paths on shared clusters.

---

## 6. Container Registry Authentication

**Before submitting any SLURM job that pulls a container image**, check that the cluster has credentials for the image's registry. Missing auth causes jobs to fail after waiting in the queue — a costly mistake.

### Step 1: Detect the container runtime

Different clusters use different container runtimes. Detect which is available:

```bash
# On the cluster (or via ssh):
which enroot 2>/dev/null && echo "RUNTIME=enroot"
which docker 2>/dev/null && echo "RUNTIME=docker"
```

| Runtime | Typical clusters | SLURM integration |
| --- | --- | --- |
| **enroot/pyxis** | NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT) | `srun --container-image` |
| **Docker** | Bare-metal / on-prem with GPU | `docker run` inside job script |

### Step 2: Check credentials for the image's registry

Determine the registry from the image URI:

| Image pattern | Registry |
| --- | --- |
| `nvcr.io/nvidia/...` | NGC |
| `vllm/vllm-openai:...`, `lmsysorg/sglang:...`, or no registry prefix | DockerHub |
| `ghcr.io/...` | GitHub Container Registry |
| `docker.io/...` | DockerHub (explicit) |
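The lookup in the table above can be sketched as a small shell helper. This is illustrative only — `detect_registry` is not part of the skill, and it only knows the registries listed in the table:

```shell
# Hypothetical helper mirroring the table above: map an image URI to its registry.
detect_registry() {
  case "$1" in
    */*) ;;                           # has a host/namespace component; keep going
    *) echo "DockerHub"; return ;;    # bare name like ubuntu:latest -> DockerHub
  esac
  first="${1%%/*}"
  case "$first" in
    nvcr.io)   echo "NGC" ;;
    ghcr.io)   echo "GitHub Container Registry" ;;
    docker.io) echo "DockerHub" ;;
    *.*)       echo "$first" ;;       # some other registry host (contains a dot)
    *)         echo "DockerHub" ;;    # namespace/name with no host -> DockerHub
  esac
}

detect_registry "vllm/vllm-openai:latest"        # DockerHub
detect_registry "nvcr.io/nvidia/vllm:26.03-py3"  # NGC
```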
Then check credentials based on the runtime:

#### enroot/pyxis

```bash
grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null
```

Look for `machine <registry>` lines:

- NGC → `machine nvcr.io`
- DockerHub → `machine auth.docker.io`
- GHCR → `machine ghcr.io`
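The grep can be wrapped in a small pass/fail check. This is a sketch — the function name and the optional second argument (a credentials-file override, so the check can be exercised against any file) are ours:

```shell
# Sketch: does the enroot credentials file have an entry for a given registry?
# Optional second argument overrides the default path (for testing).
check_enroot_cred() {
  registry="$1"
  cred_file="${2:-$HOME/.config/enroot/.credentials}"
  if grep -qE "^[[:space:]]*machine[[:space:]]+${registry}([[:space:]]|\$)" "$cred_file" 2>/dev/null; then
    echo "OK: credentials found for ${registry}"
  else
    echo "MISSING: no credentials for ${registry}" >&2
    return 1
  fi
}
```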
#### Docker

```bash
cat ~/.docker/config.json 2>/dev/null | python3 -c "import json,sys; print('\n'.join(json.load(sys.stdin).get('auths', {}).keys()))"
```

Look for registry keys (`https://index.docker.io/v1/`, `nvcr.io`, `ghcr.io`).
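For scripting, the same check can return an exit code instead of a listing. A sketch — `docker_has_auth` and its path argument are ours; substring matching is used so the `https://index.docker.io/v1/` key form is covered:

```shell
# Sketch: exit 0 if the Docker config has an auths entry matching the registry.
# Optional second argument overrides the config path (for testing).
docker_has_auth() {
  registry="$1"
  config="${2:-$HOME/.docker/config.json}"
  python3 -c '
import json, sys
try:
    auths = json.load(open(sys.argv[1])).get("auths", {})
except (OSError, ValueError):
    sys.exit(1)
# substring match so "index.docker.io" matches the https://.../v1/ key form
sys.exit(0 if any(sys.argv[2] in key for key in auths) else 1)
' "$config" "$registry"
}
```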
### Step 3: If credentials are missing

**Do not submit the job.** Instead:

1. Tell the user which registry and runtime need authentication
2. Show the fix for their runtime:

**enroot/pyxis:**

```bash
mkdir -p ~/.config/enroot

# DockerHub (get token from https://hub.docker.com/settings/security)
cat >> ~/.config/enroot/.credentials << 'EOF'
machine auth.docker.io
login <dockerhub_username>
password <access_token>
EOF

# NGC (get API key from https://org.ngc.nvidia.com/setup/api-keys)
cat >> ~/.config/enroot/.credentials << 'EOF'
machine nvcr.io
login $oauthtoken
password <ngc_api_key>
EOF
```

**Docker:**

```bash
# DockerHub (interactive prompt)
docker login

# NGC (use --password-stdin to avoid exposing secrets in process list)
echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
```

3. **Suggest an alternative image** on an authenticated registry. NVIDIA clusters typically have NGC auth pre-configured, so prefer NGC-hosted images:

| DockerHub image | NGC alternative |
| --- | --- |
| `vllm/vllm-openai:latest` | `nvcr.io/nvidia/vllm:<YY.MM>-py3` (check [NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm) for latest tag) |
| `nvcr.io/nvidia/tensorrt-llm/release:<tag>` | Already NGC |

> **Note:** NGC image tags follow `YY.MM-py3` format (e.g., `26.03-py3`). Not all DockerHub images have NGC equivalents. If no NGC alternative exists and DockerHub auth is missing, the user must add DockerHub credentials or pre-cache the image as a `.sqsh` file.

4. After the user fixes auth or switches images, verify the image is **actually pullable** before submitting (credentials alone don't guarantee the image exists):

```bash
# enroot — test pull (aborts after manifest fetch)
enroot import --output /dev/null docker://<registry>#<image> 2>&1 | head -10
# Success: shows "Fetching image manifest" + layer info
# Failure: shows "401 Unauthorized" or "404 Not Found"

# docker
docker manifest inspect <image> 2>&1 | head -5

# singularity
singularity pull --dry-run docker://<image> 2>&1 | head -5
```

> **Important**: Credentials existing for a registry does NOT mean a specific image is accessible. The image may not exist, or the credentials may lack permissions for that repository. Always verify the specific image before submitting.

### Common failure modes

| Symptom | Runtime | Cause | Fix |
| --- | --- | --- | --- |
| `curl: (22) ... error: 401` | enroot | No credentials for registry | Add to `~/.config/enroot/.credentials` |
| `pyxis: failed to import docker image` | enroot | Auth failed or rate limit | Check credentials; DockerHub free: 100 pulls/6h per IP |
| `unauthorized: authentication required` | docker | No `docker login` | Run `docker login [registry]` |
| Image pulls on some nodes but not others | any | Cached on one node only | Pre-cache image or ensure auth on all nodes |

.claude/skills/debug/SKILL.md

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
---
name: debug
description: Run commands inside a remote Docker container via the file-based command relay (tools/debugger). Use when the user says "run in Docker", "run on GPU", "debug remotely", "run test in container", "check nvidia-smi", "run pytest in Docker", or needs to execute any command inside a Docker container that shares the repo filesystem. Requires the user to have started server.sh inside the container first.
---

# Remote Docker Debugger

Execute commands inside a Docker container from the host using the file-based command relay.

**Read `tools/debugger/CLAUDE.md` for full usage details** — it has the protocol and examples.

## Quick Reference

```bash
# Check connection
bash tools/debugger/client.sh status

# Connect to server (user must start server.sh in Docker first)
bash tools/debugger/client.sh handshake

# Run a command
bash tools/debugger/client.sh run "<command>"

# Long-running command (default timeout is 600s)
bash tools/debugger/client.sh --timeout 1800 run "<command>"

# Cancel the currently running command
bash tools/debugger/client.sh cancel

# Reconnect after server restart
bash tools/debugger/client.sh flush
bash tools/debugger/client.sh handshake
```

.claude/skills/deployment/SKILL.md

Lines changed: 6 additions & 0 deletions
@@ -174,6 +174,8 @@ All checks must pass before reporting success to the user.
If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:

0. **Check container registry auth** — before submitting any SLURM job with a container image, verify credentials exist on the cluster per `skills/common/slurm-setup.md` section 6. If credentials are missing for the image's registry, ask the user to fix auth or switch to an image on an authenticated registry (e.g., NGC). **Do not submit until auth is confirmed.**

1. **Source remote utilities:**

```bash
@@ -222,6 +224,10 @@ For NEL-managed deployment (evaluation with self-deployment), use the evaluation
| `Connection refused` on health check | Server still starting | Wait 30-60s for large models; check logs for errors |
| `modelopt_fp4 not supported` | Framework doesn't support FP4 for this model | Check support matrix in `references/support-matrix.md` |

## Unsupported Models

If the model is not in the validated support matrix (`references/support-matrix.md`), deployment may fail due to weight key mismatches, missing architecture mappings, or quantized/unquantized layer confusion. Read `references/unsupported-models.md` for the iterative debug loop: **run → read error → diagnose → patch framework source → re-run**. For kernel-level issues, escalate to the framework team rather than attempting fixes.

## Success Criteria

1. Server process is running and healthy (`/health` returns 200)
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
# Deploying Unsupported Models

When deploying a model not in the validated support matrix (`support-matrix.md`), expect failures. This guide covers the iterative debug loop for getting unsupported models running on vLLM, SGLang, or TRT-LLM.

## Step 1 — Run and collect the error

Submit the deployment job. When it fails, read the full log — focus on the **first** error traceback (not "See root cause above" wrappers). Identify the file and line number in the framework source.
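Pulling the first traceback out of a long log can be scripted. A sketch, assuming Python-style tracebacks in the log — the helper name is ours:

```shell
# Sketch: print the first Python traceback in a log and stop at its final
# ExceptionName: line, skipping any later wrapper tracebacks.
first_traceback() {
  awk '
    /^Traceback \(most recent call last\):/ { found = 1 }
    found {
      print
      # a non-indented Name...Error/Exception line ends the first traceback
      if (/^[A-Za-z_][A-Za-z0-9_.]*(Error|Exception)/) exit
    }
  ' "$1"
}
```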
## Step 2 — Diagnose the root cause

Fetch the framework source at the failing line (use `gh api` for the tagged version, or `find` inside the container). Common error categories:

| Category | Symptoms | Examples |
|----------|----------|----------|
| **Weight key mismatch** | `KeyError`, `Unexpected key`, `Missing key` during weight loading | Checkpoint uses `model.language_model.layers.*` but framework expects `model.layers.*`. See [vllm#39406](https://github.com/vllm-project/vllm/pull/39406) |
| **Quantized/unquantized layer confusion** | Wrong layer type loaded, dtype errors, shape mismatches | Framework tries to load unquantized layers with an FP4 kernel due to overly broad `quantization_config.ignore` patterns or missing ignore entries. See [sglang#18937](https://github.com/sgl-project/sglang/pull/18937) |
| **Missing architecture support** | `NoneType is not iterable`, `KeyError` on model type, unknown architecture | Framework's model handler doesn't recognize the text backbone type (e.g., `ministral3` not handled in vLLM's `mistral3.py` init). Fix: extend the model type mapping |
| **Transformers version mismatch** | `ImportError`, `KeyError` on config fields | Framework ships with an older transformers that doesn't know the model type. Fix: upgrade transformers after installing the framework |
| **Kernel-level issues** | CUDA errors, `triton` import failures, unsupported ops | Framework lacks kernel support for this model + quantization combo |

## Step 3 — Apply a targeted fix

Focus on **small, targeted patches** to the framework source. Do not modify `config.json` or the checkpoint — fix the framework's handling instead.

### Weight key mismatches and architecture mapping gaps

Patch the framework source in the run script using `sed` or a Python one-liner. Keep patches minimal — change only what's needed to unblock the current error.

```bash
# Example: extend model type mapping in vLLM mistral3.py
FRAMEWORK_FILE=$(find /usr/local/lib -path "*/vllm/model_executor/models/mistral3.py" 2>/dev/null | head -1)
sed -i 's/old_pattern/new_pattern/' "${FRAMEWORK_FILE}"
```

> **Tip**: when locating framework source files inside containers, use `find` instead of a Python import — some frameworks print log messages to stdout during import that can corrupt captured paths.

### Speeding up debug iterations (vLLM)

When iterating on fixes, use these flags to shorten the feedback loop:

- **`--load-format dummy`** — skip loading actual model weights. Useful for testing whether the model initializes, the config is parsed correctly, and weight keys match without waiting for the full checkpoint load.
- **`VLLM_USE_PRECOMPILED=1 pip install --editable .`** — when patching vLLM source directly (instead of using `sed`), this rebuilds only Python code without recompiling C++/CUDA extensions.

### Quantized/unquantized layer confusion

Check `hf_quant_config.json` ignore patterns against the framework's weight loading logic. The framework may try to load layers listed in `ignore` with quantized kernels, or vice versa. Fix by adjusting the framework's layer filtering logic.

### Kernel-level issues

These require framework kernel team involvement. Do NOT attempt to patch kernels. Instead:

1. Document the exact error (model, format, framework version, GPU type)
2. Inform the user: *"This model + quantization combination requires kernel support that isn't available in {framework} v{version}. I'd suggest reaching out to the {framework} kernel team or trying a different framework."*
3. Suggest trying an alternative framework (vLLM → SGLang → TRT-LLM)

## Step 4 — Re-run and iterate

After applying a fix, resubmit the job. Each iteration may reveal a new error (e.g., fixing the init error exposes a weight loading error). Continue the loop: **run → read error → diagnose → patch → re-run**.

Typical iteration count: 1-3 for straightforward fixes, 3-5 for models requiring multiple patches.

## Step 5 — Know when to stop

**Stop patching and escalate** when:

- The error is in compiled CUDA kernels or triton ops (not Python-level)
- The fix requires changes to core framework abstractions (not just model handlers)
- You've done 5+ iterations without the server starting

In these cases, inform the user and suggest: trying a different framework, checking for a newer framework version, or filing an issue with the framework team.

.claude/skills/evaluation/SKILL.md

Lines changed: 34 additions & 2 deletions
@@ -30,6 +30,7 @@ Config Generation Progress:
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 7.5: Check container registry auth (SLURM only)
- [ ] Step 8: Run the evaluation
```

@@ -76,9 +77,9 @@ Prompt the user with "I'll ask you 5 questions to build the base config we'll ad
4. Safety & Security (like Garak and Safety Harness)
5. Multilingual (like MMATH, Global MMLU, MMLU-Prox)

- DON'T ALLOW FOR ANY OTHER OPTIONS, only the ones listed above under each category (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.
+ Only accept options from the categories listed above (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.

- > **Note:** These categories come from NEL's `build-config` CLI. If `nel skills build-config --help` shows different options than listed above, use the CLI's current options instead.
+ > **Note:** These categories come from NEL's `build-config` CLI. **Always run `nel skills build-config --help` first** to get the current options — they may differ from this list (e.g., `chat_reasoning` instead of separate `chat`/`reasoning`, `general_knowledge` instead of `standard`). When the CLI's current options differ from this list, prefer the CLI's options.

When you have all the answers, run the script to build the base config:

@@ -183,6 +184,36 @@ If the user needs multi-node evaluation (model >120B, or more throughput), read
- The docs may show incorrect parameter names for logging. Use `max_logged_requests` and `max_logged_responses` (NOT `max_saved_*` or `max_*`).

**Step 7.5: Check container registry authentication (SLURM only)**

NEL's default deployment images by framework:

| Framework | Default image | Registry |
| --- | --- | --- |
| vLLM | `vllm/vllm-openai:latest` | DockerHub |
| SGLang | `lmsysorg/sglang:latest` | DockerHub |
| TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:...` | NGC |
| Evaluation tasks | `nvcr.io/nvidia/eval-factory/*:26.03` | NGC |

Before submitting, verify the cluster has credentials for the deployment image. See `skills/common/slurm-setup.md` section 6 for the full procedure.

```bash
ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
```

**Decision flow (check before submitting):**

1. Check if the cluster has credentials for the default DockerHub image (see command above)
2. If DockerHub credentials exist → use the default image and submit
3. If DockerHub credentials are missing but can be added → add them (see `slurm-setup.md` section 6), then submit
4. If DockerHub credentials cannot be added → override `deployment.image` to the NGC alternative and submit:

   ```yaml
   deployment:
     image: nvcr.io/nvidia/vllm:<YY.MM>-py3  # check https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm for latest tag
   ```

5. **Do not retry more than once** without fixing the auth issue
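The image choice in steps 2 and 4 can be sketched as a helper. Illustrative only — `select_image` and its explicit credentials-file argument are ours; on a real cluster the file would be read over `ssh` as shown above, and step 3 (adding credentials) still happens outside this function:

```shell
# Sketch: pick the deployment image based on whether DockerHub credentials
# exist in an enroot credentials file.
select_image() {
  cred_file="$1"; default_img="$2"; ngc_img="$3"
  if grep -qE '^[[:space:]]*machine[[:space:]]+auth\.docker\.io' "$cred_file" 2>/dev/null; then
    echo "$default_img"    # step 2: DockerHub auth present -> default image
  else
    echo "$ngc_img"        # step 4: no DockerHub auth -> NGC alternative
  fi
}
```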
**Step 8: Run the evaluation**
187218

188219
Print the following commands to the user. Propose to execute them in order to confirm the config works as expected before the full run.
@@ -318,5 +349,6 @@ Config Generation Progress:
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 7.5: Check container registry auth (SLURM only)
- [ ] Step 8: Run the evaluation
```

.claude/skills/ptq/SKILL.md

Lines changed: 24 additions & 0 deletions
@@ -24,6 +24,24 @@ Check the support table in `examples/llm_ptq/README.md` for verified HF models.
- **Listed** → supported, use `hf_ptq.py` (step 4A/4B)
- **Not listed** → read `references/unsupported-models.md` to determine if `hf_ptq.py` can still work or if a custom script is needed (step 4C)

## Step 2.5 — Check for model-specific dependencies

If the model uses `trust_remote_code` (check `config.json` for `auto_map`), inspect its custom Python files for imports not present in the container:

```bash
grep -h "^from \|^import " <model_path>/modeling_*.py | sort -u
```

**Known dependency patterns:**

| Import found | Packages to install |
| --- | --- |
| `from mamba_ssm` / `from causal_conv1d` | `mamba-ssm causal-conv1d` (Mamba/hybrid models: NemotronH, Jamba) |

If extra deps are needed:

- **Launcher (4B)**: set `EXTRA_PIP_DEPS` in the task's `environment` section — `ptq.sh` installs them automatically
- **Manual (4A)**: `unset PIP_CONSTRAINT && pip install <deps>` before running `hf_ptq.py`
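The grep and the pattern table can be combined into a quick scan. A sketch — `scan_extra_deps` is ours, and the table above is the only mapping it knows:

```shell
# Sketch: report extra pip packages a trust_remote_code model likely needs,
# based on the known dependency patterns table above.
scan_extra_deps() {
  model_path="$1"
  if grep -hqE '^(from|import) (mamba_ssm|causal_conv1d)' "$model_path"/modeling_*.py 2>/dev/null; then
    echo "mamba-ssm causal-conv1d"
  fi
}
```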
## Step 3 — Choose quantization format

**First**, check for a model-specific recipe:
@@ -113,6 +131,10 @@ ls -lh <output_path>/
Report the path and size to the user.

### Post-quantization validation

Validate that the exported checkpoint's quantization pattern matches the recipe. Quantization config patterns can silently miss layers if the model uses non-standard naming (e.g., Gemma4 `experts.*` missed by `*mlp*` patterns) — this only surfaces later as deployment failures. Read `references/checkpoint-validation.md` for the validation script, expected patterns per recipe, and common pattern gaps.

**Next steps**: If the user wants to deploy or evaluate the quantized checkpoint, use the **deployment** or **evaluation** skill. The checkpoint workspace carries over — see `skills/common/end-to-end-workflow.md` for the full PTQ → Deploy → Eval pipeline. If the model required patches during PTQ (e.g., transformers upgrade), the same fixes will likely be needed at deployment and evaluation time.

## Key API Rules
@@ -126,6 +148,7 @@ Report the path and size to the user.
## Common Pitfalls

- **Model-specific dependencies**: Models with `trust_remote_code` may import packages not in the container (e.g., `mamba-ssm` for hybrid Mamba models). See Step 2.5. Use the `EXTRA_PIP_DEPS` env var with the launcher, or install manually before running `hf_ptq.py`
- **Transformers version**: New models may need a newer version of transformers than what's installed. Check `config.json` for `transformers_version`. In containers, beware of `PIP_CONSTRAINT` blocking upgrades — see `references/slurm-setup-ptq.md` for workarounds
- **Gated datasets**: Some calibration datasets require HF authentication. Ensure `HF_TOKEN` is set in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative
- **NFS root_squash + Docker**: See `skills/common/slurm-setup.md` section 5
@@ -139,6 +162,7 @@ Report the path and size to the user.
| `references/launcher-guide.md` | Step 4B only (launcher path) |
| `tools/launcher/CLAUDE.md` | Step 4B only, if you need more launcher detail |
| `references/unsupported-models.md` | Step 4C only (unlisted model) |
| `references/checkpoint-validation.md` | Step 5: validate quantization pattern matches recipe |
| `skills/common/remote-execution.md` | Step 4A/4C only, if target is remote |
| `skills/common/slurm-setup.md` | Step 4A/4C only, if using SLURM manually (not launcher) |
| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
