Commit 6ded36b

Add dep check for ptq and runtime check for evaluation/deployment (#1240)
### What does this PR do?

**Type of change:** ?

PTQ: model-specific dependency support

- Add `EXTRA_PIP_DEPS` support to the launcher's `ptq.sh` so models requiring extra pip packages (e.g., `mamba-ssm` for hybrid Mamba architectures like Nemotron) can install them automatically before running PTQ. Also updates the PTQ skill with a new Step 2.5 for detecting model-specific dependencies.

Container registry auth checks

- Add new section 6 covering auth detection for enroot/pyxis, Docker, and Singularity/Apptainer. Includes credential locations, how to add them, and common failure modes.
- Add Step 7.5 with NEL default image table, DockerHub-first strategy with NGC fallback, and build-config CLI note.
- Add auth check before remote SLURM deployment.

### Usage

Set `EXTRA_PIP_DEPS` in the launcher YAML's environment section:

```yaml
task_0:
  script: common/hf/ptq.sh
  args:
    - --repo nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    - --local-dir /hf-local/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    - --
    - --quant nvfp4
    - --tasks quant
  environment:
    - EXTRA_PIP_DEPS: "mamba-ssm causal-conv1d"
```

### Testing

Tested end-to-end: NVFP4 quantization of `NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` on a B200 cluster via the launcher. The job succeeded: mamba-ssm installed automatically, calibration completed (512 samples, 84 s), and the checkpoint was exported (18 GB, 2 safetensor shards).

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

## Summary by CodeRabbit

## Release Notes

* **Documentation**
  * Added container registry authentication verification workflow for SLURM deployments, including credential checks, verification commands, common failure symptoms, and remediation guidance.
  * Required credential validation before SLURM job submission and added SLURM-only verification steps with image fallback recommendations.
  * New dependency-checking step for models that use `trust_remote_code`, plus guidance for resolving extra package requirements and tightened build-config guidance.
  * Updated PTQ launcher documentation to reference the new wrapper script.
* **New Features**
  * Support for specifying extra pip dependencies during model processing via an environment variable.

---------

Signed-off-by: Kai Xu <kaix@nvidia.com>
1 parent 0a4908d commit 6ded36b

File tree

6 files changed: +191 −5 lines changed

.claude/skills/common/slurm-setup.md

Lines changed: 125 additions & 0 deletions
@@ -192,3 +192,128 @@ chmod -R g+rwX /path/to/.hf_cache/

Scope `chmod` to only the directories the job needs — avoid world-writable paths on shared clusters.

---

## 6. Container Registry Authentication

**Before submitting any SLURM job that pulls a container image**, check that the cluster has credentials for the image's registry. Missing auth causes jobs to fail after waiting in the queue — a costly mistake.

### Step 1: Detect the container runtime

Different clusters use different container runtimes. Detect which is available:

```bash
# On the cluster (or via ssh):
which enroot 2>/dev/null && echo "RUNTIME=enroot"
which docker 2>/dev/null && echo "RUNTIME=docker"
which singularity 2>/dev/null && echo "RUNTIME=singularity"
```

| Runtime | Typical clusters | SLURM integration |
| --- | --- | --- |
| **enroot/pyxis** | NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT) | `srun --container-image` |
| **Docker** | Bare-metal / on-prem with GPU | `docker run` inside job script |

### Step 2: Check credentials for the image's registry

Determine the registry from the image URI:

| Image pattern | Registry |
| --- | --- |
| `nvcr.io/nvidia/...` | NGC |
| `vllm/vllm-openai:...`, `lmsysorg/sglang:...`, or no registry prefix | DockerHub |
| `ghcr.io/...` | GitHub Container Registry |
| `docker.io/...` | DockerHub (explicit) |
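The mapping above can be sketched as a small shell helper (hypothetical, not part of any shipped tooling). It applies Docker's usual rule: the first path component names a registry only if it contains a dot or a port, otherwise the image defaults to DockerHub.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the table above: infer the registry host
# from an image URI.
registry_for_image() {
  case "$1" in
    */*)
      local host="${1%%/*}"
      case "$host" in
        *.*|*:*|localhost) echo "$host" ;;  # explicit registry host
        *) echo "docker.io" ;;              # e.g. vllm/vllm-openai -> DockerHub
      esac
      ;;
    *) echo "docker.io" ;;                  # bare image name -> DockerHub
  esac
}

registry_for_image "nvcr.io/nvidia/tensorrt-llm/release:1.2"  # nvcr.io (NGC)
registry_for_image "vllm/vllm-openai:latest"                  # docker.io
registry_for_image "ghcr.io/org/tool:v1"                      # ghcr.io
```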
Then check credentials based on the runtime:

#### enroot/pyxis

```bash
grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null
```

Look for `machine <registry>` lines:

- NGC → `machine nvcr.io`
- DockerHub → `machine auth.docker.io`
- GHCR → `machine ghcr.io`
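As a convenience, the grep above can be wrapped in a small helper (hypothetical, not shipped with these skills) that prints just the registry hosts with credentials:

```bash
#!/usr/bin/env bash
# Hypothetical helper: list registry hosts that have enroot credentials.
# Reads the netrc-style credentials file and prints each `machine` value.
enroot_registries() {
  local creds="${1:-$HOME/.config/enroot/.credentials}"
  awk '$1 == "machine" {print $2}' "$creds" 2>/dev/null | sort -u
}

# Example: warn early if DockerHub auth is missing.
if ! enroot_registries | grep -qx 'auth.docker.io'; then
  echo "No DockerHub credentials found for enroot" >&2
fi
```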
#### Docker

```bash
cat ~/.docker/config.json 2>/dev/null | python3 -c "import json,sys; print('\n'.join(json.load(sys.stdin).get('auths', {}).keys()))"
```

Look for registry keys (`https://index.docker.io/v1/`, `nvcr.io`, `ghcr.io`).

### Step 3: If credentials are missing

**Do not submit the job.** Instead:

1. Tell the user which registry and runtime need authentication
2. Show the fix for their runtime:

**enroot/pyxis:**

```bash
mkdir -p ~/.config/enroot

# DockerHub (get token from https://hub.docker.com/settings/security)
cat >> ~/.config/enroot/.credentials << 'EOF'
machine auth.docker.io
login <dockerhub_username>
password <access_token>
EOF

# NGC (get API key from https://org.ngc.nvidia.com/setup/api-keys)
cat >> ~/.config/enroot/.credentials << 'EOF'
machine nvcr.io
login $oauthtoken
password <ngc_api_key>
EOF
```

**Docker:**

```bash
# DockerHub (interactive prompt)
docker login

# NGC (use --password-stdin to avoid exposing secrets in process list)
echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
```

3. **Suggest an alternative image** on an authenticated registry. NVIDIA clusters typically have NGC auth pre-configured, so prefer NGC-hosted images:

| DockerHub image | NGC alternative |
| --- | --- |
| `vllm/vllm-openai:latest` | `nvcr.io/nvidia/vllm:<YY.MM>-py3` (check [NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm) for the latest tag) |
| `nvcr.io/nvidia/tensorrt-llm/release:<tag>` | Already NGC |

> **Note:** NGC image tags follow the `YY.MM-py3` format (e.g., `26.03-py3`). Not all DockerHub images have NGC equivalents. If no NGC alternative exists and DockerHub auth is missing, the user must add DockerHub credentials or pre-cache the image as a `.sqsh` file.

4. After the user fixes auth or switches images, verify the image is **actually pullable** before submitting (credentials alone don't guarantee the image exists):

```bash
# enroot — test pull (aborts after manifest fetch)
enroot import --output /dev/null docker://<registry>#<image> 2>&1 | head -10
# Success: shows "Fetching image manifest" + layer info
# Failure: shows "401 Unauthorized" or "404 Not Found"

# docker
docker manifest inspect <image> 2>&1 | head -5

# singularity
singularity pull --dry-run docker://<image> 2>&1 | head -5
```

> **Important**: Credentials existing for a registry does NOT mean a specific image is accessible. The image may not exist, or the credentials may lack permissions for that repository. Always verify the specific image before submitting.

### Common failure modes

| Symptom | Runtime | Cause | Fix |
| --- | --- | --- | --- |
| `curl: (22) ... error: 401` | enroot | No credentials for registry | Add to `~/.config/enroot/.credentials` |
| `pyxis: failed to import docker image` | enroot | Auth failed or rate limit | Check credentials; DockerHub free tier: 100 pulls/6h per IP |
| `unauthorized: authentication required` | docker | No `docker login` | Run `docker login [registry]` |
| Image pulls on some nodes but not others | any | Cached on one node only | Pre-cache image or ensure auth on all nodes |
.claude/skills/deployment/SKILL.md

Lines changed: 2 additions & 0 deletions
@@ -174,6 +174,8 @@ All checks must pass before reporting success to the user.

If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:

0. **Check container registry auth** — before submitting any SLURM job with a container image, verify credentials exist on the cluster per `skills/common/slurm-setup.md` section 6. If credentials are missing for the image's registry, ask the user to fix auth or switch to an image on an authenticated registry (e.g., NGC). **Do not submit until auth is confirmed.**

1. **Source remote utilities:**
.claude/skills/evaluation/SKILL.md

Lines changed: 34 additions & 2 deletions
@@ -28,6 +28,7 @@ Config Generation Progress:
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 7.5: Check container registry auth (SLURM only)
- [ ] Step 8: Run the evaluation

@@ -74,9 +75,9 @@ Prompt the user with "I'll ask you 5 questions to build the base config we'll ad
4. Safety & Security (like Garak and Safety Harness)
5. Multilingual (like MMATH, Global MMLU, MMLU-Prox)

-DON'T ALLOW FOR ANY OTHER OPTIONS, only the ones listed above under each category (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.
+Only accept options from the categories listed above (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.

-> **Note:** These categories come from NEL's `build-config` CLI. If `nel skills build-config --help` shows different options than listed above, use the CLI's current options instead.
+> **Note:** These categories come from NEL's `build-config` CLI. **Always run `nel skills build-config --help` first** to get the current options — they may differ from this list (e.g., `chat_reasoning` instead of separate `chat`/`reasoning`, `general_knowledge` instead of `standard`). When the CLI's current options differ from this list, prefer the CLI's options.

When you have all the answers, run the script to build the base config:

@@ -181,6 +182,36 @@ If the user needs multi-node evaluation (model >120B, or more throughput), read

- The docs may show incorrect parameter names for logging. Use `max_logged_requests` and `max_logged_responses` (NOT `max_saved_*` or `max_*`).

**Step 7.5: Check container registry authentication (SLURM only)**

NEL's default deployment images by framework:

| Framework | Default image | Registry |
| --- | --- | --- |
| vLLM | `vllm/vllm-openai:latest` | DockerHub |
| SGLang | `lmsysorg/sglang:latest` | DockerHub |
| TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:...` | NGC |
| Evaluation tasks | `nvcr.io/nvidia/eval-factory/*:26.03` | NGC |

Before submitting, verify the cluster has credentials for the deployment image. See `skills/common/slurm-setup.md` section 6 for the full procedure.

```bash
ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
```

**Decision flow (check before submitting):**
1. Check whether the cluster has credentials for the default DockerHub image (see command above)
2. If DockerHub credentials exist → use the default image and submit
3. If DockerHub credentials are missing but can be added → add them (see `slurm-setup.md` section 6), then submit
4. If DockerHub credentials cannot be added → override `deployment.image` to the NGC alternative and submit:

```yaml
deployment:
  image: nvcr.io/nvidia/vllm:<YY.MM>-py3  # check https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm for latest tag
```

5. **Do not retry more than once** without fixing the auth issue
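Steps 1–4 of that flow can be sketched as a small helper (hypothetical; the image names follow the table in Step 7.5, and the NGC tag shown is illustrative, so check the catalog for the current one):

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the decision flow: pick the vLLM deployment image
# based on the enroot credentials found on the cluster.
pick_vllm_image() {
  local creds="$1"  # contents of ~/.config/enroot/.credentials
  if grep -q 'machine auth\.docker\.io' <<< "$creds"; then
    echo "vllm/vllm-openai:latest"        # DockerHub auth present: default image
  else
    echo "nvcr.io/nvidia/vllm:26.03-py3"  # no DockerHub auth: NGC fallback
  fi
}

# Usage (fetch the credentials once, then decide locally):
# creds=$(ssh <host> "cat ~/.config/enroot/.credentials 2>/dev/null")
# pick_vllm_image "$creds"
```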
**Step 8: Run the evaluation**

Print the following commands to the user. Propose to execute them in order to confirm the config works as expected before the full run.

@@ -303,5 +334,6 @@ Config Generation Progress:
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 7.5: Check container registry auth (SLURM only)
- [ ] Step 8: Run the evaluation

.claude/skills/ptq/SKILL.md

Lines changed: 19 additions & 0 deletions
@@ -24,6 +24,24 @@ Check the support table in `examples/llm_ptq/README.md` for verified HF models.
- **Listed** → supported, use `hf_ptq.py` (step 4A/4B)
- **Not listed** → read `references/unsupported-models.md` to determine if `hf_ptq.py` can still work or if a custom script is needed (step 4C)

## Step 2.5 — Check for model-specific dependencies

If the model uses `trust_remote_code` (check `config.json` for `auto_map`), inspect its custom Python files for imports not present in the container:

```bash
grep -h "^from \|^import " <model_path>/modeling_*.py | sort -u
```

**Known dependency patterns:**

| Import found | Packages to install |
| --- | --- |
| `from mamba_ssm` / `from causal_conv1d` | `mamba-ssm causal-conv1d` (Mamba/hybrid models: NemotronH, Jamba) |

If extra deps are needed:
- **Launcher (4B)**: set `EXTRA_PIP_DEPS` in the task's `environment` section — `ptq.sh` installs them automatically
- **Manual (4A)**: `unset PIP_CONSTRAINT && pip install <deps>` before running `hf_ptq.py`
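The grep can be extended into a quick probe that reports only the imports that are actually missing (a sketch; assumes `python3` is available in the container, and the `MODEL_PATH` default is a placeholder for the downloaded checkpoint directory):

```bash
#!/usr/bin/env bash
# Sketch: report which top-level imports in the model's custom code are
# not importable in the current environment.
MODEL_PATH="${MODEL_PATH:-/hf-local/your-model}"  # placeholder path

for mod in $(grep -hoE '^(from|import) [A-Za-z0-9_]+' "$MODEL_PATH"/modeling_*.py 2>/dev/null \
             | awk '{print $2}' | sort -u); do
  python3 -c "import $mod" 2>/dev/null || echo "missing: $mod"
done
```

Any `missing:` line maps back to the dependency table above (e.g., `missing: mamba_ssm` means add `mamba-ssm causal-conv1d` to `EXTRA_PIP_DEPS`).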
## Step 3 — Choose quantization format

**First**, check for a model-specific recipe:

@@ -128,6 +146,7 @@ Validate the exported checkpoint's quantization pattern matches the recipe. Quan

## Common Pitfalls

- **Model-specific dependencies**: Models with `trust_remote_code` may import packages not in the container (e.g., `mamba-ssm` for hybrid Mamba models). See Step 2.5. Use the `EXTRA_PIP_DEPS` env var with the launcher, or install manually before running `hf_ptq.py`
- **Transformers version**: New models may need a newer version of transformers than what's installed. Check `config.json` for `transformers_version`. In containers, beware of `PIP_CONSTRAINT` blocking upgrades — see `references/slurm-setup-ptq.md` for workarounds
- **Gated datasets**: Some calibration datasets require HF authentication. Ensure `HF_TOKEN` is set in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative
- **NFS root_squash + Docker**: See `skills/common/slurm-setup.md` section 5

.claude/skills/ptq/references/launcher-guide.md

Lines changed: 3 additions & 3 deletions
@@ -12,13 +12,13 @@ uv run launch.py --yaml <config.yaml> hf_local=<cache> --yes # Local Docker

## HF Transformers PTQ Config

-The launcher provides `common/hf_ptq/hf_ptq.sh` which wraps `hf_ptq.py`. Configure via environment variables:
+The launcher provides `common/hf/ptq.sh` which wraps `hf_ptq.py`. Configure via environment variables:

```yaml
job_name: <Model>_<Format>
pipeline:
  task_0:
    script: common/hf/ptq.sh
    environment:
      - HF_MODEL: <HuggingFace model ID, e.g. Qwen/Qwen3-0.6B>
      - QFORMAT: <format, e.g. nvfp4, fp8, int4_awq>
```

@@ -75,7 +75,7 @@ The launcher SSHes to `SLURM_HOST` via `nemo_run.SSHTunnel`. If `identity` is om

## Known Issues

- **UID mapping in Docker**: May cause `getpwuid` failures. Add `USER=user` and `LOGNAME=user` to environment.
-- **Megatron-LM submodule**: Only needed for `MegatronLMQuantizeTask` (Megatron models). HF PTQ via `common/hf_ptq/hf_ptq.sh` does not require it.
+- **Megatron-LM submodule**: Only needed for `MegatronLMQuantizeTask` (Megatron models). HF PTQ via `common/hf/ptq.sh` does not require it.

## Dry Run

tools/launcher/common/hf/ptq.sh

Lines changed: 8 additions & 0 deletions
@@ -25,6 +25,14 @@

```bash
set -e

# Install extra pip dependencies if specified (e.g., mamba-ssm for hybrid Mamba models).
if [ -n "$EXTRA_PIP_DEPS" ]; then
  echo "Installing extra dependencies: $EXTRA_PIP_DEPS"
  unset PIP_CONSTRAINT
  read -r -a _deps <<< "$EXTRA_PIP_DEPS"
  pip install "${_deps[@]}"
fi

REPO=""
LOCAL_DIR=""
PTQ_ARGS=()
```
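The `read -r -a` line relies on plain word-splitting, so `EXTRA_PIP_DEPS` can carry several space-separated package specs (version pins included). A quick sketch of the behavior:

```bash
#!/usr/bin/env bash
# Demonstrates how EXTRA_PIP_DEPS is split into separate pip arguments.
EXTRA_PIP_DEPS="mamba-ssm causal-conv1d>=1.2"
read -r -a _deps <<< "$EXTRA_PIP_DEPS"
printf 'would pass to pip: %s\n' "${_deps[@]}"
# would pass to pip: mamba-ssm
# would pass to pip: causal-conv1d>=1.2
```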
