Commit 0834e1b
Add remote authentication check
Signed-off-by: Kai Xu <kaix@nvidia.com>
1 parent bea3516 commit 0834e1b

3 files changed: +182 -1 lines changed
.claude/skills/common/slurm-setup.md

Lines changed: 147 additions & 0 deletions
@@ -192,3 +192,150 @@ chmod -R g+rwX /path/to/.hf_cache/
```

Scope `chmod` to only the directories the job needs — avoid world-writable paths on shared clusters.

---

## 6. Container Registry Authentication

**Before submitting any SLURM job that pulls a container image**, check that the cluster has credentials for the image's registry. Missing auth causes jobs to fail after waiting in the queue — a costly mistake.
### Step 1: Detect the container runtime

Different clusters use different container runtimes. Detect which is available:

```bash
# On the cluster (or via ssh):
which enroot 2>/dev/null && echo "RUNTIME=enroot"
which singularity 2>/dev/null && echo "RUNTIME=singularity"
which apptainer 2>/dev/null && echo "RUNTIME=apptainer"
which docker 2>/dev/null && echo "RUNTIME=docker"
```
| Runtime | Typical clusters | SLURM integration |
| --- | --- | --- |
| **enroot/pyxis** | NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT) | `srun --container-image` |
| **Singularity/Apptainer** | HPC / academic clusters | `singularity exec` inside job script |
| **Docker** | Bare-metal / on-prem with GPU | `docker run` inside job script |
### Step 2: Check credentials for the image's registry

Determine the registry from the image URI:

| Image pattern | Registry |
| --- | --- |
| `nvcr.io/nvidia/...` | NGC |
| `vllm/vllm-openai:...` or no registry prefix | DockerHub |
| `ghcr.io/...` | GitHub Container Registry |
| `docker.io/...` | DockerHub (explicit) |
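The table above follows the standard image-reference heuristic, which can be automated (a sketch; `registry_for` is a hypothetical helper name):

```python
def registry_for(image: str) -> str:
    """Map an image URI to its registry host, per the table above.

    Heuristic: the first path component is a registry only if it looks like
    a hostname (contains a dot or a port); otherwise DockerHub is implied.
    """
    if "/" in image:
        first = image.split("/", 1)[0]
        if "." in first or ":" in first:
            return first
    return "docker.io"  # bare names like vllm/vllm-openai default to DockerHub
```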
Then check credentials based on the runtime:

#### enroot/pyxis

```bash
cat ~/.config/enroot/.credentials 2>/dev/null
```

Look for `machine <registry>` lines:

- NGC → `machine nvcr.io`
- DockerHub → `machine auth.docker.io`
- GHCR → `machine ghcr.io`
#### Docker

```bash
cat ~/.docker/config.json 2>/dev/null | python3 -c "import json,sys; print(json.dumps(json.load(sys.stdin).get('auths',{}), indent=2))"
```

Look for registry keys (`https://index.docker.io/v1/`, `nvcr.io`, `ghcr.io`).

#### Singularity/Apptainer

```bash
cat ~/.singularity/docker-config.json 2>/dev/null || cat ~/.apptainer/docker-config.json 2>/dev/null
```

Same format as Docker's `config.json` — look for registry keys in `auths`.
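The three per-runtime checks above can be folded into one helper (a sketch with hypothetical names; the default paths are the files documented above, injectable for testing):

```python
import json
from pathlib import Path

def has_credentials(
    registry: str,
    enroot_file: str = "~/.config/enroot/.credentials",
    docker_configs: tuple = (
        "~/.docker/config.json",
        "~/.singularity/docker-config.json",
        "~/.apptainer/docker-config.json",
    ),
) -> bool:
    """Return True if any of the credential files checked above mentions `registry`."""
    # enroot: netrc-style lines, e.g. "machine nvcr.io"
    p = Path(enroot_file).expanduser()
    if p.exists():
        for line in p.read_text().splitlines():
            if line.split()[:2] == ["machine", registry]:
                return True
    # docker/singularity/apptainer: JSON files with an "auths" mapping
    for cfg in docker_configs:
        q = Path(cfg).expanduser()
        if q.exists():
            try:
                auths = json.loads(q.read_text()).get("auths", {})
            except json.JSONDecodeError:
                continue
            if any(registry in key for key in auths):
                return True
    return False
```

Note the substring match on `auths` keys: DockerHub appears there as `https://index.docker.io/v1/`, not as a bare hostname.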
### Step 3: If credentials are missing

**Do not submit the job.** Instead:

1. Tell the user which registry and runtime need authentication.
2. Show the fix for their runtime:

**enroot/pyxis:**

```bash
mkdir -p ~/.config/enroot

# DockerHub (get token from https://hub.docker.com/settings/security)
cat >> ~/.config/enroot/.credentials << 'EOF'
machine auth.docker.io
login <dockerhub_username>
password <access_token>
EOF

# NGC (get API key from https://org.ngc.nvidia.com/setup/api-keys)
cat >> ~/.config/enroot/.credentials << 'EOF'
machine nvcr.io
login $oauthtoken
password <ngc_api_key>
EOF
```
**Docker:**

```bash
# DockerHub
docker login

# NGC (or pipe the key via --password-stdin to keep it out of shell history)
docker login nvcr.io -u '$oauthtoken' -p <ngc_api_key>
```
**Singularity/Apptainer:**

```bash
# DockerHub
singularity remote login --username <user> docker://docker.io

# NGC
singularity remote login --username '$oauthtoken' --password <ngc_api_key> docker://nvcr.io
```
3. **Suggest an alternative image** on an authenticated registry. NVIDIA clusters typically have NGC auth pre-configured, so prefer NGC-hosted images:

| DockerHub image | NGC alternative |
| --- | --- |
| `vllm/vllm-openai:latest` | `nvcr.io/nvidia/vllm:<YY.MM>-py3` (check the [NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm) for the latest tag) |
| `nvcr.io/nvidia/tensorrt-llm/release:<tag>` | Already NGC |

> **Note:** NGC image tags follow the `YY.MM-py3` format (e.g., `26.03-py3`). Not all DockerHub images have NGC equivalents. If no NGC alternative exists and DockerHub auth is missing, the user must add DockerHub credentials or pre-cache the image as a `.sqsh` file.
4. After the user fixes auth or switches images, verify the image is **actually pullable** before submitting (credentials alone don't guarantee the image exists):

```bash
# enroot — test pull (aborts after manifest fetch)
enroot import --output /dev/null docker://<registry>#<image> 2>&1 | head -10
# Success: shows "Fetching image manifest" + layer info
# Failure: shows "401 Unauthorized" or "404 Not Found"

# docker
docker manifest inspect <image> 2>&1 | head -5

# singularity
singularity pull --dry-run docker://<image> 2>&1 | head -5
```
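The Success/Failure patterns noted in the comments can be bucketed mechanically when automating this check (a sketch; `classify_pull_output` is a hypothetical helper, and the matched strings are the ones the comments above describe):

```python
def classify_pull_output(output: str) -> str:
    """Bucket a test-pull's captured output per the Success/Failure notes above."""
    if "401" in output or "Unauthorized" in output:
        return "auth-missing"      # add registry credentials first
    if "404" in output or "Not Found" in output:
        return "image-not-found"   # wrong tag/repo, or credentials lack permission
    return "ok"                    # manifest fetched; safe to submit
```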
> **Important**: Having credentials for a registry does NOT mean a specific image is accessible. The image may not exist, or the credentials may lack permissions for that repository. Always verify the specific image before submitting.

### Common failure modes

| Symptom | Runtime | Cause | Fix |
| --- | --- | --- | --- |
| `curl: (22) ... error: 401` | enroot | No credentials for registry | Add to `~/.config/enroot/.credentials` |
| `pyxis: failed to import docker image` | enroot | Auth failed or rate limit | Check credentials; DockerHub free tier: 100 pulls/6h per IP |
| `unauthorized: authentication required` | docker | No `docker login` | Run `docker login [registry]` |
| `FATAL: While making image from oci registry` | singularity | No remote login | Run `singularity remote login` |
| Image pulls on some nodes but not others | any | Cached on one node only | Pre-cache image or ensure auth on all nodes |

.claude/skills/deployment/SKILL.md

Lines changed: 2 additions & 0 deletions
@@ -174,6 +174,8 @@ All checks must pass before reporting success to the user.

If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:

0. **Check container registry auth** — before submitting any SLURM job with a container image, verify credentials exist on the cluster per `skills/common/slurm-setup.md` section 6. If credentials are missing for the image's registry, ask the user to fix auth or switch to an image on an authenticated registry (e.g., NGC). **Do not submit until auth is confirmed.**

1. **Source remote utilities:**

```bash
.claude/skills/evaluation/SKILL.md

Lines changed: 33 additions & 1 deletion
@@ -28,6 +28,7 @@ Config Generation Progress:
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 7.5: Check container registry auth (SLURM only)
- [ ] Step 8: Run the evaluation
```
@@ -76,7 +77,7 @@ Prompt the user with "I'll ask you 5 questions to build the base config we'll ad

DON'T ALLOW FOR ANY OTHER OPTIONS, only the ones listed above under each category (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.

-> **Note:** These categories come from NEL's `build-config` CLI. If `nel skills build-config --help` shows different options than listed above, use the CLI's current options instead.
+> **Note:** These categories come from NEL's `build-config` CLI. **Always run `nel skills build-config --help` first** to get the current options — they may differ from this list (e.g., `chat_reasoning` instead of separate `chat`/`reasoning`, `general_knowledge` instead of `standard`). Use the CLI's current options, not this list, when they conflict.

When you have all the answers, run the script to build the base config:
@@ -181,6 +182,37 @@ If the user needs multi-node evaluation (model >120B, or more throughput), read

- The docs may show incorrect parameter names for logging. Use `max_logged_requests` and `max_logged_responses` (NOT `max_saved_*` or `max_*`).

**Step 7.5: Check container registry authentication (SLURM only)**

NEL's default deployment images by framework:

| Framework | Default image | Registry |
| --- | --- | --- |
| vLLM | `vllm/vllm-openai:latest` | DockerHub |
| SGLang | `lmsysorg/sglang:latest` | DockerHub |
| TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:...` | NGC |
| Evaluation tasks | `nvcr.io/nvidia/eval-factory/*:26.03` | NGC |

Before submitting, verify the cluster has credentials for the deployment image. See `skills/common/slurm-setup.md` section 6 for the full procedure.

```bash
ssh <host> "cat ~/.config/enroot/.credentials 2>/dev/null"
```
**Default behavior: use the DockerHub image first.** If the job fails with a `401` or image pull error, fall back to the NGC alternative by adding `deployment.image` to the config:

```yaml
deployment:
  image: nvcr.io/nvidia/vllm:<YY.MM>-py3  # check https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm for the latest tag
```

**Decision flow:**

1. Submit with the default DockerHub image (`vllm/vllm-openai:latest`).
2. If the job fails with an image pull auth error (401) → check credentials.
3. If DockerHub credentials can be added → add them and resubmit.
4. If not → override `deployment.image` to the NGC vLLM image and resubmit.
5. **Do not retry more than once** without fixing the auth issue.
**Step 8: Run the evaluation**

Print the following commands to the user. Propose to execute them in order to confirm the config works as expected before the full run.
