Scope `chmod` to only the directories the job needs, and avoid world-writable paths on shared clusters.

---

## 6. Container Registry Authentication

**Before submitting any SLURM job that pulls a container image**, check that the cluster has credentials for the image's registry. A job with missing auth fails only after it has waited in the queue, so the mistake costs the full queue time.

### Step 1: Detect the container runtime

Different clusters use different container runtimes. Detect which one is available:

```bash
# On the cluster (or via ssh):
which enroot 2>/dev/null && echo "RUNTIME=enroot"
which singularity 2>/dev/null && echo "RUNTIME=singularity"
which apptainer 2>/dev/null && echo "RUNTIME=apptainer"
```
3. **Suggest an alternative image** on an authenticated registry. NVIDIA clusters typically have NGC auth pre-configured, so prefer NGC-hosted images:

| DockerHub image | NGC alternative |
| --- | --- |
| `vllm/vllm-openai:latest` | `nvcr.io/nvidia/vllm:<YY.MM>-py3` (check [NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm) for latest tag) |
| `nvcr.io/nvidia/tensorrt-llm/release:<tag>` | Already NGC |

> **Note:** NGC image tags follow `YY.MM-py3` format (e.g., `26.03-py3`). Not all DockerHub images have NGC equivalents. If no NGC alternative exists and DockerHub auth is missing, the user must add DockerHub credentials or pre-cache the image as a `.sqsh` file.

4. After the user fixes auth or switches images, verify the image is **actually pullable** before submitting (credentials alone don't guarantee the image exists):

```bash
# enroot: test pull (aborts after manifest fetch)
enroot import --output /dev/null docker://<registry>#<image> 2>&1 | head -10
# Success: shows "Fetching image manifest" + layer info
# Failure: shows "401 Unauthorized" or "404 Not Found"

# docker
docker manifest inspect <image> 2>&1 | head -5

# singularity
singularity pull --dry-run docker://<image> 2>&1 | head -5
```
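Because each command above is piped through `head`, its exit status reflects `head` rather than the pull itself, so one option is to inspect the captured text instead. The following is a minimal sketch under that assumption. `pull_output_ok` is a hypothetical helper name (not part of enroot, docker, or singularity), and the strings it matches are only the ones quoted in this document; a real runtime may word its errors differently.

```bash
# Hypothetical helper: decide from captured test-pull output whether the
# image looks pullable. Matches full phrases rather than bare "401"/"404",
# since layer digests can contain those digits by chance.
pull_output_ok() {
    case "$1" in
        "")
            echo "no output captured: treat the image as unverified" >&2
            return 1 ;;
        *"401 Unauthorized"*|*"error: 401"*|*"unauthorized"*)
            echo "auth failure: fix credentials before submitting" >&2
            return 1 ;;
        *"404 Not Found"*|*"error: 404"*)
            echo "image not found: check the image name and tag" >&2
            return 1 ;;
        *)
            return 0 ;;
    esac
}
```

A caller could capture the output of whichever test-pull command applies into a variable, pass it to `pull_output_ok`, and submit the job only when the function returns success.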
> **Important**: Credentials existing for a registry does NOT mean a specific image is accessible. The image may not exist, or the credentials may lack permissions for that repository. Always verify the specific image before submitting.

### Common failure modes

| Symptom | Runtime | Cause | Fix |
| --- | --- | --- | --- |
| `curl: (22) ... error: 401` | enroot | No credentials for registry | Add to `~/.config/enroot/.credentials` |
| `pyxis: failed to import docker image` | enroot | Auth failed or rate limit | Check credentials; DockerHub free: 100 pulls/6h per IP |
| `unauthorized: authentication required` | docker | No `docker login` | Run `docker login [registry]` |
| `FATAL: While making image from oci registry` | singularity | No remote login | Run `singularity remote login` |
| Image pulls on some nodes but not others | any | Cached on one node only | Pre-cache image or ensure auth on all nodes |
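The table above can also be applied mechanically to a failed job's log. This sketch assumes the symptom strings appear verbatim in the log line; `suggest_fix` is a hypothetical helper name, and the fallback branch is an addition for lines the table does not cover.

```bash
# Hypothetical helper: map one log line to the fix listed in the table above.
suggest_fix() {
    case "$1" in
        *"pyxis: failed to import docker image"*)
            echo "enroot: check credentials and DockerHub rate limits" ;;
        *"error: 401"*)
            echo "enroot: add registry credentials to ~/.config/enroot/.credentials" ;;
        *"unauthorized: authentication required"*)
            echo "docker: run docker login for the registry" ;;
        *"FATAL: While making image from oci registry"*)
            echo "singularity: run singularity remote login" ;;
        *)
            # Not in the table, e.g. the "pulls on some nodes only" case,
            # which has no single log signature.
            echo "unknown: inspect the full job log" ;;
    esac
}
```

For example, `suggest_fix "$(grep -m1 -i 'pyxis' slurm-1234.out)"` would print the enroot hint if the log contains the pyxis import failure; the log file name here is a placeholder.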
File: `.claude/skills/deployment/SKILL.md` (2 lines changed: 2 additions, 0 deletions)

@@ -174,6 +174,8 @@ All checks must pass before reporting success to the user.

If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:

0. **Check container registry auth.** Before submitting any SLURM job with a container image, verify that credentials exist on the cluster per `skills/common/slurm-setup.md` section 6. If credentials are missing for the image's registry, ask the user to fix auth or switch to an image on an authenticated registry (e.g., NGC). **Do not submit until auth is confirmed.**
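As an illustration of this check on an enroot cluster, the sketch below derives the registry from an image reference and looks for a matching entry in the credentials file. It rests on several assumptions: the file is `~/.config/enroot/.credentials` (the path named in `slurm-setup.md`) in netrc style with `machine <host>` entries; an image whose first path component has no `.` or `:` lives on Docker Hub; and both helper names are hypothetical. The `machine` name enroot expects can differ from the registry host (Docker Hub entries are commonly written as `auth.docker.io`), so pass whichever name the cluster actually uses.

```bash
# Hypothetical helper: print the registry host of an image reference.
# "vllm/vllm-openai:latest" -> docker.io, "nvcr.io/nvidia/vllm:tag" -> nvcr.io
registry_of() {
    local image="$1"
    local first="${image%%/*}"
    case "$image" in
        */*) ;;                           # has a path; inspect first component
        *)   echo "docker.io"; return 0 ;; # bare name such as "ubuntu:22.04"
    esac
    case "$first" in
        *.*|*:*|localhost) echo "$first" ;;
        *)                 echo "docker.io" ;;
    esac
}

# Hypothetical helper: succeed if the credentials file has a
# "machine <name>" entry. $2 overrides the default file path (assumption).
has_enroot_credentials() {
    local name="$1"
    local file="${2:-$HOME/.config/enroot/.credentials}"
    [ -r "$file" ] || return 1
    awk -v r="$name" '$1 == "machine" && $2 == r { found = 1 } END { exit !found }' "$file"
}
```

A pre-submit gate could then read `has_enroot_credentials "$(registry_of "$IMAGE")" || echo "no credentials for this registry"`, where `IMAGE` is whatever variable holds the job's image reference.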
@@ -76,7 +77,7 @@ Prompt the user with "I'll ask you 5 questions to build the base config we'll ad

DON'T ALLOW FOR ANY OTHER OPTIONS, only the ones listed above under each category (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.

- > **Note:** These categories come from NEL's `build-config` CLI. If `nel skills build-config --help` shows different options than listed above, use the CLI's current options instead.
+ > **Note:** These categories come from NEL's `build-config` CLI. **Always run `nel skills build-config --help` first** to get the current options, which may differ from this list (e.g., `chat_reasoning` instead of separate `chat`/`reasoning`, `general_knowledge` instead of `standard`). Use the CLI's current options, not this list, when they conflict.

When you have all the answers, run the script to build the base config:

@@ -181,6 +182,37 @@ If the user needs multi-node evaluation (model >120B, or more throughput), read

- The docs may show incorrect parameter names for logging. Use `max_logged_requests` and `max_logged_responses` (NOT `max_saved_*` or `max_*`).

**Default behavior: use DockerHub image first.** If the job fails with a `401` or image pull error, fall back to the NGC alternative by adding `deployment.image` to the config:

```yaml
deployment:
  image: nvcr.io/nvidia/vllm:<YY.MM>-py3  # check https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm for latest tag
```

**Decision flow:**

1. Submit with the default DockerHub image (`vllm/vllm-openai:latest`)
2. If the job fails with image pull auth error (401) → check credentials
3. If DockerHub credentials can be added → add them and resubmit
4. If not → override `deployment.image` to the NGC vLLM image and resubmit
5. **Do not retry more than once** without fixing the auth issue

+
184
216
**Step 8: Run the evaluation**
185
217
186
218
Print the following commands to the user. Propose to execute them in order to confirm the config works as expected before the full run.