From 54a7f4d0755ae145cbcca6e456601eb4f56eec5d Mon Sep 17 00:00:00 2001
From: Zhiyu Cheng
Date: Thu, 21 May 2026 11:13:04 -0700
Subject: [PATCH] docs: genericize NVIDIA-internal references in launching-evals SKILL

The "Key Facts" section of `launching-evals/SKILL.md` had two
NVIDIA-internal references that bias the skill toward NVIDIA infra and
confuse external users:

1. "PPP" terminology with `coreai_dlalgo_*` example account names. "PPP"
   is internal NVIDIA jargon; the example values are NVIDIA-specific.
   Renamed the bullet to "SLURM account" (the universally-correct term)
   and kept "PPP" as a parenthetical alias so internal users still
   recognize it. Genericized the example values to `` / ``.

2. The HF cache download example hardcoded an NVIDIA-internal lustre
   path (`/lustre/fsw/portfolios/coreai/users//cache/...`). Replaced
   with `` placeholder, with a hint that it should be a shared
   filesystem accessible from compute nodes (`/lustre/...` for
   multi-node, `~/.cache/huggingface` for single-node).

Closes #938

Signed-off-by: Zhiyu Cheng
---
 .../.claude/skills/launching-evals/SKILL.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/packages/nemo-evaluator-launcher/.claude/skills/launching-evals/SKILL.md b/packages/nemo-evaluator-launcher/.claude/skills/launching-evals/SKILL.md
index 88b367400..48dd96001 100644
--- a/packages/nemo-evaluator-launcher/.claude/skills/launching-evals/SKILL.md
+++ b/packages/nemo-evaluator-launcher/.claude/skills/launching-evals/SKILL.md
@@ -58,9 +58,9 @@ The complete evaluation workflow is divided into the following steps you should
 # Key Facts
 - Benchmark-specific info learned during launching/analyzing evals should be added to `references/benchmarks/`
-- **PPP** = Slurm account (the `account` field in cluster_config.yaml). When the user says "change PPP to X", update the account value (e.g., `coreai_dlalgo_compeval` → `coreai_dlalgo_llm`).
+- **SLURM account**: the `account` field in `cluster_config.yaml`. When the user asks to change it (some teams call this a "PPP"), update the value (e.g., `` → ``).
 - **Slurm job pairs**: NEL (nemo-evaluator-launcher) submits paired Slurm jobs — a RUNNING job + a PENDING restart job (for when the 4h walltime expires). Never cancel the pending restart jobs — they are expected and necessary.
-- **HF cache requirement**: For configs with `HF_HUB_OFFLINE=1`, models must be pre-downloaded to the HF cache on each cluster before launching. **Before running a model on a new cluster, always ask the user if the model is already cached there.** If not, on the cluster login node: `python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub` then `HF_HOME=/lustre/fsw/portfolios/coreai/users//cache/huggingface hf download `. Without this, vLLM will fail with `LocalEntryNotFoundError`.
+- **HF cache requirement**: For configs with `HF_HUB_OFFLINE=1`, models must be pre-downloaded to the HF cache on each cluster before launching. **Before running a model on a new cluster, always ask the user if the model is already cached there.** If not, on the cluster login node: `python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub` then `HF_HOME= hf download ` (typically a shared filesystem accessible from compute nodes — e.g., a `/lustre/...` mount on multi-node clusters or `~/.cache/huggingface` for single-node setups). Without this, vLLM will fail with `LocalEntryNotFoundError`.
 - **`data_parallel_size` is per node**: `dp_size=1` with `num_nodes=8` means 8 model instances total (one per node), load-balanced by haproxy. Do NOT interpret `dp_size` as the global replica count.
 - **`payload_modifier` interceptor**: The `params_to_remove` list (e.g. `[max_tokens, max_completion_tokens]`) strips those fields from the outgoing payload, intentionally lifting output length limits so reasoning models can think as long as they need.
 - **Auto-export git workaround**: The export container (`python:3.12-slim`) lacks `git`. When installing the launcher from a git URL, set `auto_export.launcher_install_cmd` to install git first (e.g., `apt-get update -qq && apt-get install -qq -y git && pip install "nemo-evaluator-launcher[all] @ git+...#subdirectory=packages/nemo-evaluator-launcher"`).
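
Reviewer note (not part of the patch): two of the context bullets in this hunk, the per-node meaning of `data_parallel_size` and the effect of `params_to_remove`, can be illustrated with a small Python sketch. The function names and the sample payload below are hypothetical and are not taken from nemo-evaluator-launcher; the sketch only mirrors the arithmetic and the field-stripping behavior as the bullets describe them.

```python
def total_model_instances(data_parallel_size: int, num_nodes: int) -> int:
    """Global replica count when data_parallel_size is a per-node value.

    Hypothetical helper: it mirrors the "dp_size=1 with num_nodes=8 means
    8 model instances" rule from the Key Facts bullet.
    """
    return data_parallel_size * num_nodes


def strip_params(payload: dict, params_to_remove: list[str]) -> dict:
    """Return a copy of the payload without the listed top-level fields.

    Hypothetical stand-in for what the bullet says the payload_modifier
    interceptor does; the real interceptor may work differently.
    """
    return {key: value for key, value in payload.items()
            if key not in params_to_remove}


if __name__ == "__main__":
    # One instance per node across eight nodes.
    print(total_model_instances(data_parallel_size=1, num_nodes=8))

    # Sample request body (assumed shape, for illustration only).
    request = {"model": "example-model", "max_tokens": 512, "messages": []}
    print(strip_params(request, ["max_tokens", "max_completion_tokens"]))
```

Fields named in `params_to_remove` that are absent from the payload are simply ignored here, so the same list can be applied to requests that use either `max_tokens` or `max_completion_tokens`.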