chore(skills): redact NVIDIA-internal references in vendored skills

shljessie · shljessie · commit 3f96a6ae9ca3 · 2026-04-28T11:35:58.000-07:00
Surfaced by an internal-keyword scan over .agents/skills/. All four
findings replaced with vendor-neutral wording:

- launching-evals/SKILL.md: replace concrete Slurm account names
  (coreai_dlalgo_compeval / coreai_dlalgo_llm) used as the "PPP -> X"
  rename example with placeholders <old_account> / <new_account>.
- launching-evals/SKILL.md: generalise the HF cache path from
  /lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface to
  HF_HOME=<your_hf_cache_path>, with a parenthetical note that lustre-
  style HPC clusters typically organise this under
  /lustre/.../<group>/users/<username>/...
- launching-evals/references/debug-failed-runs.md: rephrase the
  "Drop ':5005' from GitLab container registry URLs" advice (port 5005
  is the standard port for an on-prem GitLab container registry; the
  raw advice only made sense in that context) to a vendor-neutral
  "If the image is on an on-prem GitLab registry, drop the registry
  port suffix (e.g. ':5005') from the URL." Applied at both occurrences.
- common/slurm-setup.md: change the enroot/pyxis "Typical clusters"
  cell from "NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT)" to
  "HPC clusters with container runtime (e.g. DGX Cloud and similar
  Slurm + container setups)" -- removes internal cluster codenames
  (EOS, Selene, GCP-NRT) and the "NVIDIA internal" label.

Caveat: the two launching-evals/* files are vendored verbatim from
NVIDIA-NeMo/Evaluator (per the provenance header injected by
.agents/scripts/sync-upstream-skills.sh). The next sync will overwrite
them. Follow-ups: (1) upstream MR against NVIDIA-NeMo/Evaluator, and/or
(2) add a redaction post-process to sync-upstream-skills.sh so the
scrub survives re-syncs.
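
A minimal sketch of what follow-up (2) could look like, assuming the
post-process is a small Python helper invoked at the end of
sync-upstream-skills.sh. The helper, its function names, and the rule table
are hypothetical; the rules simply mirror the literal strings replaced in
this commit, and the HF cache rule omits the parenthetical lustre note that
was added by hand.

```python
from pathlib import Path

# Hypothetical rule table: (upstream text, vendor-neutral replacement).
# Each pair mirrors a launching-evals/* replacement made in this commit.
RULES = [
    ("coreai_dlalgo_compeval", "<old_account>"),
    ("coreai_dlalgo_llm", "<new_account>"),
    (
        "/lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface",
        "<your_hf_cache_path>",
    ),
    (
        "Drop `:5005` from GitLab container registry URLs.",
        "If the image is on an on-prem GitLab registry, drop the registry "
        "port suffix (e.g. `:5005`) from the URL.",
    ),
    (
        "Drop `:5005` from GitLab registry URLs.",
        "For on-prem GitLab registries, drop the registry port suffix "
        "(e.g. `:5005`) from the URL.",
    ),
]


def redact(text):
    """Apply every rule to text; return (new_text, number_of_replacements)."""
    hits = 0
    for old, new in RULES:
        hits += text.count(old)
        text = text.replace(old, new)
    return text, hits


def redact_tree(root):
    """Rewrite every *.md file under root in place; return total replacements."""
    total = 0
    for path in sorted(Path(root).rglob("*.md")):
        cleaned, hits = redact(path.read_text(encoding="utf-8"))
        if hits:
            path.write_text(cleaned, encoding="utf-8")
            total += hits
    return total
```

Calling redact_tree(".agents/skills/launching-evals") after each sync would
re-apply the scrub. No replacement contains its own search string, so
running it twice changes nothing the second time.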

Signed-off-by: Seonghee Lee <seongheel@nvidia.com>
Made-with: Cursor
diff --git a/.agents/skills/common/slurm-setup.md b/.agents/skills/common/slurm-setup.md
@@ -215,7 +215,7 @@ which docker 2>/dev/null && echo "RUNTIME=docker"
 
 | Runtime | Typical clusters | SLURM integration |
 | --- | --- | --- |
-| **enroot/pyxis** | NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT) | `srun --container-image` |
+| **enroot/pyxis** | HPC clusters with container runtime (e.g. DGX Cloud and similar Slurm + container setups) | `srun --container-image` |
 | **Docker** | Bare-metal / on-prem with GPU | `docker run` inside job script |
 
 ### Step 2: Check credentials for the image's registry
diff --git a/.agents/skills/launching-evals/SKILL.md b/.agents/skills/launching-evals/SKILL.md
@@ -62,9 +62,9 @@ The complete evaluation workflow is divided into the following steps you should
 # Key Facts
 
 - Benchmark-specific info learned during launching/analyzing evals should be added to `references/benchmarks/`
-- **PPP** = Slurm account (the `account` field in cluster_config.yaml). When the user says "change PPP to X", update the account value (e.g., `coreai_dlalgo_compeval` → `coreai_dlalgo_llm`).
+- **PPP** = Slurm account / project portfolio code (the `account` field in cluster_config.yaml). When the user says "change PPP to X", update the account value (e.g., `<old_account>` → `<new_account>`).
 - **Slurm job pairs**: NEL (nemo-evaluator-launcher) submits paired Slurm jobs — a RUNNING job + a PENDING restart job (for when the 4h walltime expires). Never cancel the pending restart jobs — they are expected and necessary.
-- **HF cache requirement**: For configs with `HF_HUB_OFFLINE=1`, models must be pre-downloaded to the HF cache on each cluster before launching. **Before running a model on a new cluster, always ask the user if the model is already cached there.** If not, on the cluster login node: `python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub` then `HF_HOME=/lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface hf download <model>`. Without this, vLLM will fail with `LocalEntryNotFoundError`.
+- **HF cache requirement**: For configs with `HF_HUB_OFFLINE=1`, models must be pre-downloaded to the HF cache on each cluster before launching. **Before running a model on a new cluster, always ask the user if the model is already cached there.** If not, on the cluster login node: `python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub` then `HF_HOME=<your_hf_cache_path> hf download <model>` (on lustre-style HPC clusters this is typically under `/lustre/.../<group>/users/<username>/cache/huggingface`). Without this, vLLM will fail with `LocalEntryNotFoundError`.
 - **`data_parallel_size` is per node**: `dp_size=1` with `num_nodes=8` means 8 model instances total (one per node), load-balanced by haproxy. Do NOT interpret `dp_size` as the global replica count.
 - **`payload_modifier` interceptor**: The `params_to_remove` list (e.g. `[max_tokens, max_completion_tokens]`) strips those fields from the outgoing payload, intentionally lifting output length limits so reasoning models can think as long as they need.
 - **Auto-export git workaround**: The export container (`python:3.12-slim`) lacks `git`. When installing the launcher from a git URL, set `auto_export.launcher_install_cmd` to install git first (e.g., `apt-get update -qq && apt-get install -qq -y git && pip install "nemo-evaluator-launcher[all] @ git+...#subdirectory=packages/nemo-evaluator-launcher"`).
diff --git a/.agents/skills/launching-evals/references/debug-failed-runs.md b/.agents/skills/launching-evals/references/debug-failed-runs.md
@@ -70,7 +70,7 @@ tail -200 $LOGS/client-*.log
 - **CUDA OOM**: Increase `deployment.tensor_parallel_size` to shard across more GPUs. For multi-node: increase `execution.num_nodes` and set `deployment.pipeline_parallel_size`. As last resort: add `--max-model-len <lower_value>` to `deployment.extra_args`. Do NOT quantize as a first fix — scale compute instead.
 - **Missing model/checkpoint**: `FileNotFoundError` or `RepositoryNotFoundError` or `GatedRepoError: 403` — verify `deployment.checkpoint_path` or `deployment.hf_model_handle`. For gated models, set `HF_TOKEN` via `deployment.env_vars`.
 - **Bad `extra_args`**: `unrecognized arguments` or `unexpected keyword argument` — check flags against deployment engine version. Some flags change between versions (e.g., `--rope-scaling` removed in vLLM > 0.11.0).
-- **Image pull failure**: `manifest not found` or `pyxis: child 1 failed` — verify image tag exists. Drop `:5005` from GitLab container registry URLs.
+- **Image pull failure**: `manifest not found` or `pyxis: child 1 failed` — verify image tag exists. If the image is on an on-prem GitLab registry, drop the registry port suffix (e.g. `:5005`) from the URL.
 - **GPU driver mismatch**: `CUDA driver version is insufficient` — use an older container image matching the host CUDA driver.
 - **Health check timeout / connection refused**: Server didn't start — check server logs first. Increase `execution.endpoint_readiness_timeout` (seconds). SLURM default: `null` (falls back to walltime).
 - **Server crashed mid-eval**: `Connection reset by peer` — check server logs for OOM. Reduce `parallelism` (concurrent requests). Check SLURM logs for preemption or walltime exceeded.
@@ -80,7 +80,7 @@ tail -200 $LOGS/client-*.log
 - **Config validation**: `MissingMandatoryValue` (unfilled `???`), `ValidationError` (type mismatch), `ScannerError` (invalid YAML) — run `--dry-run` to catch these upfront.
 - **Walltime exceeded**: `CANCELLED DUE TO TIME LIMIT` — NEL submits paired restart jobs that automatically resume when walltime expires, so this is often expected behavior, not a failure. Only increase `execution.walltime` if the evaluation isn't making progress across restarts.
 - **Preemption**: `CANCELLED DUE TO PREEMPTION` — the paired restart job should automatically resume. If it doesn't, use non-preemptible partition, or re-run.
-- **Container not found**: Applies to both `deployment.image` and task-level eval container. Drop `:5005` from GitLab registry URLs.
+- **Container not found**: Applies to both `deployment.image` and task-level eval container. For on-prem GitLab registries, drop the registry port suffix (e.g. `:5005`) from the URL.
 - Troubleshooting docs: list files with WebFetch `https://api.github.com/repos/NVIDIA-NeMo/Evaluator/contents/docs/troubleshooting`, then fetch relevant ones from `https://raw.githubusercontent.com/NVIDIA-NeMo/Evaluator/main/docs/troubleshooting/<file>`
 
 **Fix Slurm invalid account/partition:**