Commit 2e84f3b

Polish eval skills
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent 4bf8253 commit 2e84f3b

1 file changed: 189 additions, 0 deletions
@@ -0,0 +1,189 @@
# NEL CI Evaluation Guide

NEL CI is the recommended entry point for running evaluations on NVIDIA JET infrastructure. This guide covers patterns for evaluating quantized checkpoints using both the NEL SLURM executor (direct) and the NEL CI GitLab pipeline.

Reference repo: `gitlab-master.nvidia.com/dl/JoC/competitive_evaluation/nemo-evaluator-launcher-ci`

---

## 1. Two Execution Paths

| Path | When to use | How it works |
|------|-------------|--------------|
| **NEL SLURM executor** | You have SSH access to the cluster and the checkpoint is on cluster storage | Run `nel run --config config.yaml` from your workstation; NEL connects to the cluster over SSH and submits `sbatch` jobs |
| **NEL CI GitLab pipeline** | You want managed infrastructure, MLflow export, and reproducible configs | Trigger via the GitLab API or UI; JET orchestrates everything |

### NEL SLURM executor

Best for iterative development and debugging. Run from any machine with SSH access to the cluster:

```bash
export DUMMY_API_KEY=dummy
export HF_TOKEN=<your_token>

# Test first on 10 samples before launching a full run.
nel run --config eval_config.yaml \
    -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10
```

### NEL CI trigger

Best for production evaluations with MLflow tracking. See the trigger script pattern in section 4.

---

## 2. Cluster Reference

| Cluster | GPUs/Node | Architecture | Max Walltime | Storage | Notes |
|---------|-----------|-------------|--------------|---------|-------|
| oci-hsg | 4 | GB200 | 4 hours | `/lustre/` | Set `tensor_parallel_size=4` |
| cw | 8 | H100 | | `/lustre/` | |
| oci-nrt | 8 | H100 | | `/lustre/` | Numerics configs |
| dlcluster | 4 (B100 partition) | B100 | 8 hours | `/home/omniml_data_*` | No `/lustre/`; use local NFS paths |

**Important**: `deployment.tensor_parallel_size` determines how many GPUs are requested. If it exceeds the cluster's GPUs per node, the job fails with a memory allocation error (`Memory required by task is not available`; see section 7).
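
The GPUs-per-node limit can be checked locally before a job is submitted. The following is a minimal sketch that hard-codes the values from the table above; `gpus_per_node` and `check_tp_size` are hypothetical helper names, not part of NEL.

```bash
# Hypothetical helper: GPUs per node, copied from the cluster table.
# The dlcluster value applies to its B100 partition.
gpus_per_node() {
    case "$1" in
        oci-hsg|dlcluster) echo 4 ;;
        cw|oci-nrt)        echo 8 ;;
        *) echo "unknown cluster: $1" >&2; return 1 ;;
    esac
}

# Succeeds only when the requested tensor-parallel size fits on one node.
check_tp_size() {
    cluster=$1
    tp=$2
    max=$(gpus_per_node "$cluster") || return 2
    if [ "$tp" -gt "$max" ]; then
        echo "tensor_parallel_size=$tp exceeds $max GPUs/node on $cluster" >&2
        return 1
    fi
    echo "ok: tensor_parallel_size=$tp fits on $cluster ($max GPUs/node)"
}
```

For example, `check_tp_size oci-hsg 4` succeeds, while `check_tp_size oci-hsg 8` returns a non-zero status and prints the reason to stderr.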

---

## 3. Checkpoint Availability

The checkpoint must be on a filesystem accessible from the cluster's **compute nodes** (not just login nodes).

| Cluster type | Accessible storage | NOT accessible |
|-------------|-------------------|----------------|
| JET clusters (oci-hsg, cw, oci-nrt) | `/lustre/fsw/...` | Workstation paths (`/home/scratch.*`), NFS mounts from other clusters |
| dlcluster | `/home/omniml_data_*`, `/home/scratch.*` | `/lustre/` (not available) |

If the checkpoint is on a workstation, **copy it to cluster storage first**:

```bash
# $USER expands on the local machine; replace it if your cluster username differs.
rsync -av /path/to/local/checkpoint \
    <cluster-login>:/lustre/fsw/portfolios/coreai/users/$USER/checkpoints/
```

For dlcluster, checkpoint paths are directly accessible because the NFS mounts are shared between login and compute nodes.
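
The storage rules above can be turned into a quick pre-flight check. This sketch only compares path prefixes against the table; it does not verify that the checkpoint exists, and `checkpoint_path_ok` is a hypothetical name used for illustration.

```bash
# Hypothetical pre-flight check: succeed only if the checkpoint path starts
# with a storage prefix listed as accessible for the given cluster.
checkpoint_path_ok() {
    cluster=$1
    path=$2
    case "$cluster" in
        oci-hsg|cw|oci-nrt)
            case "$path" in
                /lustre/*) return 0 ;;
            esac ;;
        dlcluster)
            case "$path" in
                /home/omniml_data_*|/home/scratch.*) return 0 ;;
            esac ;;
    esac
    echo "path $path is not on storage reachable from $cluster compute nodes" >&2
    return 1
}
```

A call such as `checkpoint_path_ok cw /home/scratch.me/ckpt` fails, which signals that the checkpoint needs to be copied to `/lustre/` first.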

---

## 4. NEL CI Trigger Pattern

For JET clusters, trigger evaluations via the GitLab API. Use `NEL_DEPLOYMENT_COMMAND` (not `NEL_OTHER_OVERRIDES` with `deployment.extra_args`) because `NEL_OTHER_OVERRIDES` splits values on spaces, which breaks multi-flag commands.

```bash
export GITLAB_TOKEN=<your_gitlab_token>

# Note: -k skips TLS certificate verification.
curl -k --request POST \
    --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
    --header "Content-Type: application/json" \
    --data '{
      "ref": "main",
      "variables": [
        {"key": "NEL_CONFIG_PATH", "value": "configs/AA/minimax_m2_5_lbd_lax.yaml"},
        {"key": "NEL_ACCOUNT", "value": "coreai_dlalgo_modelopt"},
        {"key": "NEL_CLUSTER", "value": "oci-hsg"},
        {"key": "NEL_CHECKPOINT_OR_ARTIFACT", "value": "/lustre/.../checkpoint"},
        {"key": "NEL_DEPLOYMENT_IMAGE", "value": "vllm/vllm-openai:v0.19.0"},
        {"key": "NEL_TASKS", "value": "simple_evals.gpqa_diamond_aa_v3"},
        {"key": "NEL_DEPLOYMENT_COMMAND", "value": "vllm serve /checkpoint --host 0.0.0.0 --port 8000 --tensor-parallel-size 4 --quantization modelopt_fp4 --trust-remote-code --served-model-name my-model"},
        {"key": "NEL_OTHER_OVERRIDES", "value": "deployment.tensor_parallel_size=4 execution.walltime=04:00:00"},
        {"key": "NEL_HF_HOME", "value": "/lustre/.../cache/huggingface"},
        {"key": "NEL_VLLM_CACHE", "value": "/lustre/.../cache/vllm"},
        {"key": "NEL_CLUSTER_OUTPUT_DIR", "value": "/lustre/.../nv-eval-rundirs"}
      ]
    }' \
    "https://gitlab-master.nvidia.com/api/v4/projects/221804/pipeline"
```
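
To see why space-splitting is harmless for plain `key=value` overrides but breaks a multi-flag value, the sketch below counts the tokens a string yields under shell word-splitting. It assumes the pipeline splits in a comparable way; it mimics the behavior described above and is not the pipeline's actual parser. `count_override_tokens` is a hypothetical name.

```bash
# Count the pieces a value breaks into when it is split on whitespace.
count_override_tokens() {
    # Unquoted expansion on purpose: the shell word-splits on spaces here.
    set -- $1
    echo "$#"
}

# Two independent key=value overrides survive splitting as two tokens.
count_override_tokens "deployment.tensor_parallel_size=4 execution.walltime=04:00:00"

# One override whose value holds several flags is torn into three tokens,
# so only the first flag stays attached to deployment.extra_args.
count_override_tokens "deployment.extra_args=--quantization modelopt_fp4 --trust-remote-code"
```

The first call prints `2` and the second prints `3`, which is why the full serve command belongs in `NEL_DEPLOYMENT_COMMAND` as a single value.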

---

## 5. Environment Variables

### SLURM executor format

Environment variables in NEL SLURM configs require explicit prefixes:

| Prefix | Meaning | Example |
|--------|---------|---------|
| `host:VAR_NAME` | Read from the host environment where `nel run` is executed | `host:HF_TOKEN` |
| `lit:value` | Literal string value | `lit:dummy` |

```yaml
evaluation:
  env_vars:
    DUMMY_API_KEY: host:DUMMY_API_KEY
    HF_TOKEN: host:HF_TOKEN
```
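
The two prefixes can be read as a small lookup rule. The following sketch illustrates the documented semantics only; it is not NEL's implementation, and `resolve_env_spec` is a hypothetical name.

```bash
# Illustrative resolver for the host: and lit: prefixes.
resolve_env_spec() {
    spec=$1
    case "$spec" in
        lit:*)
            # Literal: everything after the prefix is the value.
            printf '%s\n' "${spec#lit:}" ;;
        host:*)
            name=${spec#host:}
            # Accept only valid shell variable names before using eval.
            case "$name" in
                ''|[!A-Za-z_]*|*[!A-Za-z0-9_]*)
                    echo "invalid variable name: $name" >&2; return 1 ;;
            esac
            # Read the named variable from the current (host) environment.
            eval "value=\${$name-}"
            printf '%s\n' "$value" ;;
        *)
            echo "missing host: or lit: prefix: $spec" >&2; return 1 ;;
    esac
}
```

With `HF_TOKEN` exported on the host, `resolve_env_spec host:HF_TOKEN` prints its value, `resolve_env_spec lit:dummy` prints `dummy`, and a bare `HF_TOKEN` without a prefix is rejected.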

### JET executor format

JET configs reference JET secrets with `$SECRET_NAME`:

```yaml
execution:
  env_vars:
    evaluation:
      HF_TOKEN: $COMPEVAL_HF_TOKEN
```

### Gated datasets

Tasks that download gated Hugging Face datasets (e.g., GPQA, HLE) need `HF_TOKEN` passed to the evaluation container. Set it at the evaluation level or per task:

```yaml
evaluation:
  env_vars:
    HF_TOKEN: host:HF_TOKEN  # SLURM executor
  tasks:
    - name: simple_evals.gpqa_diamond
      env_vars:
        HF_TOKEN: host:HF_TOKEN
```

---

## 6. Serving Framework Notes

### vLLM

- Binds to `0.0.0.0` by default, so health checks work out of the box
- For NVFP4: `--quantization modelopt_fp4`
- For unsupported models (e.g., ministral3): may need a custom `deployment.command` to patch the framework before serving (see `deployment/references/unsupported-models.md`)

### SGLang

- **Must include `--host 0.0.0.0`**: SGLang defaults to `127.0.0.1`, which blocks health checks from the eval client
- Must include `--port 8000` to match NEL's expected port
- For NVFP4: `--quantization modelopt_fp4`
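
The two mandatory SGLang flags are easy to forget, so a deployment command can be linted before it is submitted. This is a minimal sketch under the assumption that the flags are written in the `--flag value` form; `check_serve_flags` is a hypothetical helper, not part of NEL.

```bash
# Hypothetical lint for a deployment command string: checks the host and
# port flags that the SGLang notes call mandatory.
check_serve_flags() {
    # Pad with spaces so each flag/value pair matches as whole words.
    cmd=" $1 "
    status=0
    case "$cmd" in
        *" --host 0.0.0.0 "*) ;;
        *) echo "missing --host 0.0.0.0" >&2; status=1 ;;
    esac
    case "$cmd" in
        *" --port 8000 "*) ;;
        *) echo "missing --port 8000" >&2; status=1 ;;
    esac
    return "$status"
}
```

For instance, `check_serve_flags "python -m sglang.launch_server --model-path /checkpoint --host 0.0.0.0 --port 8000"` succeeds, while the same command without `--host` reports the missing flag and returns a non-zero status.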

---

## 7. Common Issues

| Issue | Cause | Fix |
|-------|-------|-----|
| `401 Unauthorized` pulling eval container | NGC credentials not set on cluster | Set up `~/.config/enroot/.credentials` with your NGC API key |
| `PermissionError: /hf-cache/...` | HF cache dir not writable by `svc-jet` | Set `NEL_HF_HOME` to a directory you own with mode `777` (see section 8) |
| Health check stuck at `000` | Server binding to localhost | Add `--host 0.0.0.0` to the deployment command (SGLang) |
| `Memory required by task is not available` | TP size exceeds GPUs/node | Set `tensor_parallel_size` to match the cluster (4 for oci-hsg and the dlcluster B100 partition) |
| TIMEOUT after eval completes | Walltime too short for eval + MLflow export | Set `execution.walltime=04:00:00` |
| Gated dataset auth failure | `HF_TOKEN` not passed to eval container | Add `env_vars.HF_TOKEN` at evaluation or task level |
| `NEL_OTHER_OVERRIDES` splits `extra_args` | Space-separated parsing breaks multi-flag values | Use `NEL_DEPLOYMENT_COMMAND` instead |
| Checkpoint not found in container | Path not on cluster compute-node filesystem | Copy checkpoint to `/lustre/` (or another cluster-accessible path) first |
| `trusted_eval` type mismatch in MLflow export | NEL writes boolean `true` instead of string `"true"` | Fix with `sed -i "s/trusted_eval: true/trusted_eval: 'true'/"` in export config |
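
The `trusted_eval` substitution from the last row can be tried on a throwaway file before touching a real export config. In this sketch the file contents are a made-up stand-in, since the actual export config layout is not shown in this guide.

```bash
# Create a temporary file with a made-up stand-in for an export config.
cfg=$(mktemp)
printf 'export:\n  trusted_eval: true\n' > "$cfg"

# Same substitution as in the table, written to a new file instead of
# using -i so that it behaves the same with GNU and BSD sed.
sed "s/trusted_eval: true/trusted_eval: 'true'/" "$cfg" > "$cfg.fixed"

# Show the result, then clean up.
fixed=$(cat "$cfg.fixed")
echo "$fixed"
rm -f "$cfg" "$cfg.fixed"
```

After the substitution the value is quoted, so a YAML parser reads it as a string rather than a boolean.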

---

## 8. Directory Setup for JET Clusters

Before running evaluations on a JET cluster, create writable directories:

```bash
# Log in to the cluster, then run the remaining commands in that session.
ssh <cluster-login>
mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface
mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm
mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface
chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm
chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
```

`chmod 777` is required because `svc-jet` (the JET service account) runs the containers and needs write access.
