Commit 3f96a6a

chore(skills): redact NVIDIA-internal references in vendored skills
Surfaced by an internal-keyword scan over .agents/skills/. All four findings replaced with vendor-neutral wording:

- launching-evals/SKILL.md: replace concrete Slurm account names (coreai_dlalgo_compeval / coreai_dlalgo_llm) used as the "PPP -> X" rename example with placeholders <old_account> / <new_account>.
- launching-evals/SKILL.md: generalise the HF cache path from /lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface to HF_HOME=<your_hf_cache_path>, with a parenthetical note that lustre-style HPC clusters typically organise this under /lustre/.../<group>/users/<username>/...
- launching-evals/references/debug-failed-runs.md: rephrase the "Drop ':5005' from GitLab container registry URLs" advice (port 5005 is the standard port for an on-prem GitLab container registry; the raw advice only made sense in that context) to a vendor-neutral "If the image is on an on-prem GitLab registry, drop the registry port suffix (e.g. ':5005') from the URL." Applied at both occurrences.
- common/slurm-setup.md: change the enroot/pyxis "Typical clusters" cell from "NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT)" to "HPC clusters with container runtime (e.g. DGX Cloud and similar Slurm + container setups)" -- removes internal cluster codenames (EOS, Selene, GCP-NRT) and the "NVIDIA internal" label.

Caveat: the three launching-evals/* files are vendored verbatim from NVIDIA-NeMo/Evaluator (per the provenance header injected by .agents/scripts/sync-upstream-skills.sh). The next sync will overwrite them.

Follow-ups: (1) upstream MR against NVIDIA-NeMo/Evaluator, and/or (2) add a redaction post-process to sync-upstream-skills.sh so the scrub survives re-syncs.

Signed-off-by: Seonghee Lee <seongheel@nvidia.com>
Made-with: Cursor
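Follow-up (2) could take roughly the following shape. This is only a sketch under stated assumptions: the `REDACTIONS` table and the `redact` / `redact_tree` helpers are hypothetical names that do not exist in the repository, the script is assumed to be called on `.agents/skills/` after each sync, and only the literal substitutions from this commit are covered (the reworded `:5005` advice and the added lustre parenthetical would still need a manual edit or an upstream fix).

```python
from pathlib import Path

# Hypothetical rule table: (literal to find, vendor-neutral replacement).
# The pairs mirror the literal findings listed in this commit message.
REDACTIONS = [
    ("coreai_dlalgo_compeval", "<old_account>"),
    ("coreai_dlalgo_llm", "<new_account>"),
    (
        "/lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface",
        "<your_hf_cache_path>",
    ),
    (
        "NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT)",
        "HPC clusters with container runtime "
        "(e.g. DGX Cloud and similar Slurm + container setups)",
    ),
]


def redact(text: str) -> str:
    """Apply every literal replacement; a second pass changes nothing further."""
    for needle, replacement in REDACTIONS:
        text = text.replace(needle, replacement)
    return text


def redact_tree(root: Path) -> list[Path]:
    """Rewrite Markdown files under root in place; return the files that changed."""
    changed = []
    for path in sorted(root.rglob("*.md")):
        original = path.read_text(encoding="utf-8")
        cleaned = redact(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed.append(path)
    return changed
```

Because none of the replacements contains one of the search strings, re-running the pass after every sync should be safe; a non-empty return value from `redact_tree` would indicate that upstream still carries the internal wording.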
1 parent 347906f commit 3f96a6a

3 files changed

Lines changed: 5 additions & 5 deletions

.agents/skills/common/slurm-setup.md

Lines changed: 1 addition & 1 deletion
@@ -215,7 +215,7 @@ which docker 2>/dev/null && echo "RUNTIME=docker"
 
 | Runtime | Typical clusters | SLURM integration |
 | --- | --- | --- |
-| **enroot/pyxis** | NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT) | `srun --container-image` |
+| **enroot/pyxis** | HPC clusters with container runtime (e.g. DGX Cloud and similar Slurm + container setups) | `srun --container-image` |
 | **Docker** | Bare-metal / on-prem with GPU | `docker run` inside job script |
 
 ### Step 2: Check credentials for the image's registry

.agents/skills/launching-evals/SKILL.md

Lines changed: 2 additions & 2 deletions
@@ -62,9 +62,9 @@ The complete evaluation workflow is divided into the following steps you should
 # Key Facts
 
 - Benchmark-specific info learned during launching/analyzing evals should be added to `references/benchmarks/`
-- **PPP** = Slurm account (the `account` field in cluster_config.yaml). When the user says "change PPP to X", update the account value (e.g., `coreai_dlalgo_compeval` → `coreai_dlalgo_llm`).
+- **PPP** = Slurm account / project portfolio code (the `account` field in cluster_config.yaml). When the user says "change PPP to X", update the account value (e.g., `<old_account>` → `<new_account>`).
 - **Slurm job pairs**: NEL (nemo-evaluator-launcher) submits paired Slurm jobs — a RUNNING job + a PENDING restart job (for when the 4h walltime expires). Never cancel the pending restart jobs — they are expected and necessary.
-- **HF cache requirement**: For configs with `HF_HUB_OFFLINE=1`, models must be pre-downloaded to the HF cache on each cluster before launching. **Before running a model on a new cluster, always ask the user if the model is already cached there.** If not, on the cluster login node: `python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub` then `HF_HOME=/lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface hf download <model>`. Without this, vLLM will fail with `LocalEntryNotFoundError`.
+- **HF cache requirement**: For configs with `HF_HUB_OFFLINE=1`, models must be pre-downloaded to the HF cache on each cluster before launching. **Before running a model on a new cluster, always ask the user if the model is already cached there.** If not, on the cluster login node: `python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub` then `HF_HOME=<your_hf_cache_path> hf download <model>` (on lustre-style HPC clusters this is typically under `/lustre/.../<group>/users/<username>/cache/huggingface`). Without this, vLLM will fail with `LocalEntryNotFoundError`.
 - **`data_parallel_size` is per node**: `dp_size=1` with `num_nodes=8` means 8 model instances total (one per node), load-balanced by haproxy. Do NOT interpret `dp_size` as the global replica count.
 - **`payload_modifier` interceptor**: The `params_to_remove` list (e.g. `[max_tokens, max_completion_tokens]`) strips those fields from the outgoing payload, intentionally lifting output length limits so reasoning models can think as long as they need.
 - **Auto-export git workaround**: The export container (`python:3.12-slim`) lacks `git`. When installing the launcher from a git URL, set `auto_export.launcher_install_cmd` to install git first (e.g., `apt-get update -qq && apt-get install -qq -y git && pip install "nemo-evaluator-launcher[all] @ git+...#subdirectory=packages/nemo-evaluator-launcher"`).
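The per-node reading of `data_parallel_size` in the Key Facts above comes down to one multiplication. The snippet below is a small illustration, not launcher code: `total_model_instances` is a hypothetical helper, and it assumes every node runs the same `data_parallel_size`.

```python
def total_model_instances(data_parallel_size: int, num_nodes: int) -> int:
    """Global replica count when data_parallel_size is applied on each node."""
    if data_parallel_size < 1 or num_nodes < 1:
        raise ValueError("data_parallel_size and num_nodes must be positive")
    return data_parallel_size * num_nodes


# The example from the skill file: dp_size=1 on 8 nodes gives 8 instances.
print(total_model_instances(data_parallel_size=1, num_nodes=8))  # 8
```

Reading `dp_size` as a global count would predict a single instance for the same config, so the distinction affects how much load the haproxy front end can actually spread.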

.agents/skills/launching-evals/references/debug-failed-runs.md

Lines changed: 2 additions & 2 deletions
@@ -70,7 +70,7 @@ tail -200 $LOGS/client-*.log
 - **CUDA OOM**: Increase `deployment.tensor_parallel_size` to shard across more GPUs. For multi-node: increase `execution.num_nodes` and set `deployment.pipeline_parallel_size`. As last resort: add `--max-model-len <lower_value>` to `deployment.extra_args`. Do NOT quantize as a first fix — scale compute instead.
 - **Missing model/checkpoint**: `FileNotFoundError` or `RepositoryNotFoundError` or `GatedRepoError: 403` — verify `deployment.checkpoint_path` or `deployment.hf_model_handle`. For gated models, set `HF_TOKEN` via `deployment.env_vars`.
 - **Bad `extra_args`**: `unrecognized arguments` or `unexpected keyword argument` — check flags against deployment engine version. Some flags change between versions (e.g., `--rope-scaling` removed in vLLM > 0.11.0).
-- **Image pull failure**: `manifest not found` or `pyxis: child 1 failed` — verify image tag exists. Drop `:5005` from GitLab container registry URLs.
+- **Image pull failure**: `manifest not found` or `pyxis: child 1 failed` — verify image tag exists. If the image is on an on-prem GitLab registry, drop the registry port suffix (e.g. `:5005`) from the URL.
 - **GPU driver mismatch**: `CUDA driver version is insufficient` — use an older container image matching the host CUDA driver.
 - **Health check timeout / connection refused**: Server didn't start — check server logs first. Increase `execution.endpoint_readiness_timeout` (seconds). SLURM default: `null` (falls back to walltime).
 - **Server crashed mid-eval**: `Connection reset by peer` — check server logs for OOM. Reduce `parallelism` (concurrent requests). Check SLURM logs for preemption or walltime exceeded.
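The reworded image-pull advice (drop the registry port suffix, e.g. `:5005`) can be illustrated with a short helper. This is a sketch under assumptions: `drop_registry_port` is a hypothetical function, `gitlab.example.com` is a placeholder host, and the image reference is assumed to use the plain `host[:port]/path[:tag]` form rather than a scheme prefix or the `host#path` form.

```python
import re


def drop_registry_port(image: str) -> str:
    """Remove a ':<port>' suffix from the registry host, leaving the tag untouched."""
    host, sep, rest = image.partition("/")
    if not sep:
        # No registry host component (e.g. "ubuntu:22.04"); ':22.04' is a tag.
        return image
    host = re.sub(r":\d+$", "", host)
    return f"{host}/{rest}"


print(drop_registry_port("gitlab.example.com:5005/group/project/image:1.0"))
# gitlab.example.com/group/project/image:1.0
```

Only the first path component is treated as the host, so a tag such as `:1.0` at the end of the reference is never mistaken for a port.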
@@ -80,7 +80,7 @@ tail -200 $LOGS/client-*.log
 - **Config validation**: `MissingMandatoryValue` (unfilled `???`), `ValidationError` (type mismatch), `ScannerError` (invalid YAML) — run `--dry-run` to catch these upfront.
 - **Walltime exceeded**: `CANCELLED DUE TO TIME LIMIT` — NEL submits paired restart jobs that automatically resume when walltime expires, so this is often expected behavior, not a failure. Only increase `execution.walltime` if the evaluation isn't making progress across restarts.
 - **Preemption**: `CANCELLED DUE TO PREEMPTION` — the paired restart job should automatically resume. If it doesn't, use non-preemptible partition, or re-run.
-- **Container not found**: Applies to both `deployment.image` and task-level eval container. Drop `:5005` from GitLab registry URLs.
+- **Container not found**: Applies to both `deployment.image` and task-level eval container. For on-prem GitLab registries, drop the registry port suffix (e.g. `:5005`) from the URL.
 - Troubleshooting docs: list files with WebFetch `https://api.github.com/repos/NVIDIA-NeMo/Evaluator/contents/docs/troubleshooting`, then fetch relevant ones from `https://raw.githubusercontent.com/NVIDIA-NeMo/Evaluator/main/docs/troubleshooting/<file>`
 
 **Fix Slurm invalid account/partition:**
