Commit 4bf8253

Polish eval skills
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent 0357cb9 commit 4bf8253

3 files changed, with 38 additions and 0 deletions

.claude/skills/common/remote-execution.md

Lines changed: 11 additions & 0 deletions
@@ -28,6 +28,17 @@ clusters:
default_cluster: my-cluster
```

### Checkpoint and storage availability

Cluster compute nodes may not share the same filesystem as login nodes or other clusters. Before running any workload that references a checkpoint path, verify the path is accessible from compute nodes:

| Cluster type | Compute-node storage | NOT accessible from compute nodes |
|-------------|---------------------|----------------------------------|
| JET clusters (oci-hsg, cw, oci-nrt) | `/lustre/fsw/...` | Workstation NFS (`/home/scratch.*`), other cluster mounts |
| dlcluster | `/home/omniml_data_*`, `/home/scratch.*` | `/lustre/` paths |

If a checkpoint was produced on a different cluster or workstation, copy it to the target cluster's accessible storage before submitting jobs. NEL and SLURM do NOT sync checkpoints automatically.


See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.
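As a pre-flight aid for the checkpoint storage table above, the prefix rules could be encoded in a small helper. This is a minimal sketch, not part of the skill files: the function name `checkpoint_path_ok` and the `jet`/`dlcluster` labels are hypothetical, and it only compares path prefixes against the table. It does not prove the path exists, so listing the path from a compute node remains the real confirmation.

```bash
#!/usr/bin/env bash
# Sketch only: encodes the storage table above as prefix checks.
# The function name and the "jet"/"dlcluster" labels are hypothetical.
checkpoint_path_ok() {
  local cluster_type="$1" path="$2"
  case "$cluster_type" in
    jet)
      # JET compute nodes (oci-hsg, cw, oci-nrt) see Lustre storage.
      [[ "$path" == /lustre/fsw/* ]]
      ;;
    dlcluster)
      # dlcluster compute nodes see the /home mounts, not /lustre/ paths.
      [[ "$path" == /home/omniml_data_* || "$path" == /home/scratch.* ]]
      ;;
    *)
      echo "unknown cluster type: $cluster_type" >&2
      return 2
      ;;
  esac
}

# Example: checkpoint_path_ok jet /home/scratch.user/ckpt || echo "copy the checkpoint first"
```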

---

.claude/skills/common/slurm-setup.md

Lines changed: 14 additions & 0 deletions
@@ -51,6 +51,20 @@ srun \
"
```

### Container registry credentials (pyxis)

If `srun --container-image` uses an image from a private registry (e.g., `nvcr.io/nvidia/...`), pyxis/enroot needs credentials on the cluster. Check for existing credentials and add them if missing:

```bash
# Show existing credentials, if any (note: this prints stored API keys to the terminal)
cat ~/.config/enroot/.credentials 2>/dev/null || echo "No credentials"
# To add NGC credentials:
mkdir -p ~/.config/enroot
# Single quotes keep $oauthtoken literal; it is the login name NGC expects.
# `>` overwrites the file; use `>>` instead to keep entries for other registries.
echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
chmod 600 ~/.config/enroot/.credentials
```

Without this, `srun` will fail with `401 Unauthorized` when pulling from `nvcr.io`.
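The credential check can also be done without echoing the stored key. The following is a minimal sketch under stated assumptions: `has_enroot_credentials` is a hypothetical helper name, and it assumes the `machine <registry> login ... password ...` line format used in the example above.

```bash
#!/usr/bin/env bash
# Sketch only: has_enroot_credentials is a hypothetical helper name.
# Succeeds if the credentials file has a "machine <registry> ..." entry,
# without printing the stored secret.
has_enroot_credentials() {
  local registry="$1"
  local file="${2:-$HOME/.config/enroot/.credentials}"
  # A missing or unreadable file means no credentials.
  [[ -r "$file" ]] || return 1
  # Exact match on the first two fields avoids treating dots as wildcards.
  awk -v r="$registry" '$1 == "machine" && $2 == r { found = 1 } END { exit !found }' "$file"
}

# Example: has_enroot_credentials nvcr.io || echo "add NGC credentials before running srun"
```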

Submit and capture the job ID:

```bash

.claude/skills/evaluation/SKILL.md

Lines changed: 13 additions & 0 deletions
@@ -286,6 +286,19 @@ After job submission, you can monitor progress using:

---

### NEL CI and Cluster-Specific Notes

To run evaluations on NVIDIA JET clusters (oci-hsg, cw, oci-nrt) or SLURM clusters such as dlcluster, read `references/nel-ci-guide.md`. It covers:
- NEL CI GitLab trigger pattern vs NEL SLURM executor
- Cluster-specific GPU counts and storage paths
- Checkpoint availability (compute nodes may not share login node filesystems)
- Environment variable prefixes (`host:`, `lit:`) for SLURM executor
- The requirement that SGLang bind `--host 0.0.0.0` for health checks
- Directory setup and `chmod 777` for JET service account access
- Common issues (NGC auth, gated datasets, walltime, `NEL_OTHER_OVERRIDES` space-splitting)
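To illustrate the space-splitting issue named in the last item, here is a minimal sketch. It assumes that `NEL_OTHER_OVERRIDES` is split on whitespace into individual overrides, which is an inference from the item's wording and not confirmed here; `references/nel-ci-guide.md` is the authority on the actual behavior. The helper name `count_overrides` and the override keys are made up for the example.

```bash
#!/usr/bin/env bash
# Sketch only: count_overrides is a hypothetical helper, and the override
# keys in the example below are made up.
count_overrides() {
  local -a words
  # read -a splits on whitespace (default IFS), mimicking a naive space split.
  read -r -a words <<< "$1"
  echo "${#words[@]}"
}

# Example: count_overrides 'a.b=1 c.d=hello world' prints 3, not 2,
# because the space inside "hello world" is treated as a separator.
```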

---

Direct users with issues to:

- **GitHub Issues:** <https://github.com/NVIDIA-NeMo/Evaluator/issues>
