From 54a7f4d0755ae145cbcca6e456601eb4f56eec5d Mon Sep 17 00:00:00 2001
From: Zhiyu Cheng <zhiyuc@nvidia.com>
Date: Thu, 21 May 2026 11:13:04 -0700
Subject: [PATCH] docs: genericize NVIDIA-internal references in
 launching-evals SKILL

The "Key Facts" section of `launching-evals/SKILL.md` had two
NVIDIA-internal references that bias the skill toward NVIDIA infra and
confuse external users:

1. "PPP" terminology with `coreai_dlalgo_*` example account names. "PPP"
   is internal NVIDIA jargon; the example values are NVIDIA-specific.
   Renamed the bullet to "SLURM account" (the universally correct term)
   and kept "PPP" as a parenthetical alias so internal users still
   recognize it. Genericized the example values to `<account_name>` /
   `<new_account_name>`.

2. The HF cache download example hardcoded an NVIDIA-internal lustre
   path (`/lustre/fsw/portfolios/coreai/users/<username>/cache/...`).
   Replaced it with a `<your_hf_cache_dir>` placeholder, plus a hint that
   the directory should be on a shared filesystem accessible from compute
   nodes (`/lustre/...` for multi-node, `~/.cache/huggingface` for single-node).

Closes #938

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
---
 .../.claude/skills/launching-evals/SKILL.md                   | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/packages/nemo-evaluator-launcher/.claude/skills/launching-evals/SKILL.md b/packages/nemo-evaluator-launcher/.claude/skills/launching-evals/SKILL.md
index 88b367400..48dd96001 100644
--- a/packages/nemo-evaluator-launcher/.claude/skills/launching-evals/SKILL.md
+++ b/packages/nemo-evaluator-launcher/.claude/skills/launching-evals/SKILL.md
@@ -58,9 +58,9 @@ The complete evaluation workflow is divided into the following steps you should
 # Key Facts
 
 - Benchmark-specific info learned during launching/analyzing evals should be added to `references/benchmarks/`
-- **PPP** = Slurm account (the `account` field in cluster_config.yaml). When the user says "change PPP to X", update the account value (e.g., `coreai_dlalgo_compeval` → `coreai_dlalgo_llm`).
+- **SLURM account**: the `account` field in `cluster_config.yaml`. When the user asks to change it (some teams call this a "PPP"), update the value (e.g., `<account_name>` → `<new_account_name>`).
 - **Slurm job pairs**: NEL (nemo-evaluator-launcher) submits paired Slurm jobs — a RUNNING job + a PENDING restart job (for when the 4h walltime expires). Never cancel the pending restart jobs — they are expected and necessary.
-- **HF cache requirement**: For configs with `HF_HUB_OFFLINE=1`, models must be pre-downloaded to the HF cache on each cluster before launching. **Before running a model on a new cluster, always ask the user if the model is already cached there.** If not, on the cluster login node: `python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub` then `HF_HOME=/lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface hf download <model>`. Without this, vLLM will fail with `LocalEntryNotFoundError`.
+- **HF cache requirement**: For configs with `HF_HUB_OFFLINE=1`, models must be pre-downloaded to the HF cache on each cluster before launching. **Before running a model on a new cluster, always ask the user if the model is already cached there.** If not, on the cluster login node: `python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub` then `HF_HOME=<your_hf_cache_dir> hf download <model>`, where `<your_hf_cache_dir>` is typically on a shared filesystem accessible from compute nodes (e.g., a `/lustre/...` mount on multi-node clusters, or `~/.cache/huggingface` for single-node setups). Without a cached copy, vLLM will fail with `LocalEntryNotFoundError`.
 - **`data_parallel_size` is per node**: `dp_size=1` with `num_nodes=8` means 8 model instances total (one per node), load-balanced by haproxy. Do NOT interpret `dp_size` as the global replica count.
 - **`payload_modifier` interceptor**: The `params_to_remove` list (e.g. `[max_tokens, max_completion_tokens]`) strips those fields from the outgoing payload, intentionally lifting output length limits so reasoning models can think as long as they need.
 - **Auto-export git workaround**: The export container (`python:3.12-slim`) lacks `git`. When installing the launcher from a git URL, set `auto_export.launcher_install_cmd` to install git first (e.g., `apt-get update -qq && apt-get install -qq -y git && pip install "nemo-evaluator-launcher[all] @ git+...#subdirectory=packages/nemo-evaluator-launcher"`).