eval skill: parameterize external judge/user-sim endpoints via .env (#1591)

cjluo-nv · claude · web-flow · commit db9ea8f17e53 · 2026-06-01T22:35:03.000-07:00
### What does this PR do?

Type of change: documentation

Several AA tasks call an external **judge / user-simulator / scoring endpoint** whose `model_id` + `url` vary per user/site (HLE, AA-LCR, and Tau2 today; the guidance is written as a general pattern for any future such benchmark). Previously each recipe hardcoded `<...>` placeholders that a user had to hand-edit in every config. This makes them reusable **without committing any internal infrastructure**:

- **`recipes/env.example`**: add placeholders (`NS_JUDGE_URL`, `HLE_JUDGE_MODEL_ID`, `LCR_JUDGE_MODEL_ID`, `TAU2_USER_MODEL_ID`, `TAU2_JUDGER_MODEL_ID`, `TAU2_ENDPOINT_URL`) with the **recommended model named** (GPT-4o / Qwen3 235B / gpt-oss-120B) but only **generic hosts** (`https://<your-inference-host>/v1`). No internal hostnames or gateway model-routing strings are committed; real values live in the user's gitignored `.env`.
- **`recipes/tasks/aa/{hle,lcr,tau2_bench_telecom}.md`**: carry `<VAR>` literal placeholders (named after the `.env` keys) that the skill **substitutes as literal values** from the user's `.env`. These are *config, not secrets*, so they are **not** exported, which avoids the `${oc.env:...}` footgun (it silently fails unless the var was exported with `set -a`). Only `api_key` (`INFERENCE_API_KEY`) stays an exported env var read by the harness.
- **`SKILL.md`** (Step 5): instructs literal substitution from `.env`, framed as a general pattern for any external-endpoint task. The `/v1`-base (nemo-skills) vs full `/v1/chat/completions` (tau2-bench) URL distinction is documented.

### Usage

N/A (documentation / skill-template only).

### Testing

`pre-commit run` passes (markdownlint). Verified no internal hostnames / gateway model IDs are present in any committed file.
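For illustration only, the literal-substitution pattern described under "What does this PR do?" might look roughly like the Python sketch below. This is not the skill's actual implementation: `parse_env`, `substitute`, the host `judge.example.com`, and the model ID `example-judge-model` are hypothetical names invented for this example. The sketch reads `KEY=VALUE` pairs from `.env`-style text and replaces `<VAR>` placeholders in a recipe fragment, leaving `api_key: INFERENCE_API_KEY` untouched because it is an env-var name rather than a placeholder.

```python
import re

# Matches <VAR> placeholders named like .env keys (upper-case, digits, underscores).
# Lower-case placeholders such as <your-inference-host> are deliberately not matched.
PLACEHOLDER = re.compile(r"<([A-Z][A-Z0-9_]*)>")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines (a simplification of real dotenv syntax)."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, full-line comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing " # comment", then surrounding whitespace and quotes.
        value = value.split(" #", 1)[0].strip().strip('"').strip("'")
        values[key.strip()] = value
    return values


def substitute(fragment: str, env: dict[str, str]) -> str:
    """Replace every <VAR> placeholder with its literal value from env."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            # Fail loudly instead of leaving an unresolved placeholder behind.
            raise KeyError(f"{name} is not set in .env")
        return env[name]

    return PLACEHOLDER.sub(replace, fragment)


# Hypothetical .env contents; real values live in the user's gitignored .env.
ENV_TEXT = """
# judge endpoints
HLE_JUDGE_MODEL_ID=example-judge-model
NS_JUDGE_URL=https://judge.example.com/v1   # shared by both judges
"""

# Judge block shaped like the one in recipes/tasks/aa/hle.md.
FRAGMENT = """judge:
  model_id: <HLE_JUDGE_MODEL_ID>
  url: <NS_JUDGE_URL>
  api_key: INFERENCE_API_KEY
"""

env = parse_env(ENV_TEXT)
rendered = substitute(FRAGMENT, env)
print(rendered)
```

Raising on a missing key is a design choice for this sketch: it surfaces a forgotten `.env` entry at config-generation time, rather than letting an unresolved value reach the harness.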
### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A (documentation)
- Did you update Changelog?: N/A (skill docs)
- Did you get Claude approval on this PR?: ❌ (pending)

### Additional Information

Branched off latest `main` (includes #1583). Touches `lcr.md`, which #1583 also edited (parallelism field); the changes are on different lines and rebased cleanly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Documentation**
  * Added comprehensive guidance for configuring external judge and user-simulator endpoints in evaluation tasks
  * Clarified best practices for substituting configuration values from environment configuration files while keeping API keys as environment variables
  * Updated task documentation for HLE, LCR, and Tau2-Bench with improved instructions for credential and endpoint handling

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
@@ -227,6 +227,8 @@ Implications for the agent:
 
 4. Apply, show updated list, ask "Final, or more changes?" Loop until confirmed.
 
+**Tasks that call an external judge / user-simulator / scoring endpoint.** Treat this as a general pattern, not a fixed list — HLE, AA-LCR, and Tau2 need one today, but other benchmarks may too (check each task's recipe). Their `model_id` / `url` are **config, not secrets**: substitute the **literal** values the user keeps in `.env` (keys per the task's recipe + `recipes/env.example`) into the task's `<VAR>` placeholders. Do **not** emit `${oc.env:...}` for these (it silently fails unless the var was exported with `set -a`). Only `api_key` stays an env-var *name* (e.g. `INFERENCE_API_KEY`), exported and read by the harness.
+
 **Known issue — nemo-skills self-deployment:** If using `nemo_skills.*` tasks with self-deployment (vLLM/SGLang/NIM), add at top level:
 
 ```yaml
diff --git a/.claude/skills/evaluation/recipes/env.example b/.claude/skills/evaluation/recipes/env.example
@@ -32,6 +32,26 @@ NEMO_EVALUATOR_TRUST_PRE_CMD=1
 # JUDGE_API_KEY=
 # INFERENCE_API_KEY=
 
+# --- Optional: judge / user-simulator endpoints (model_id + URL) ---
+#
+# External judge / user-simulator / scoring endpoints, for any task that needs one
+# (HLE, AA-LCR, Tau2 below — add more for other such benchmarks; auth via
+# INFERENCE_API_KEY above). These are config, not secrets: the values you set here are
+# substituted as literal model_id/url into the config (matching <VAR> placeholders in
+# the recipes) — they do NOT need to be exported; only INFERENCE_API_KEY is.
+# URL note: nemo-skills uses the /v1 base; tau2-bench needs the full /v1/chat/completions.
+
+# HLE judge (ns_hle_aa) — recommended GPT-4o
+# HLE_JUDGE_MODEL_ID=<judge-model-id>
+# AA-LCR judge (ns_aa_lcr) — recommended Qwen3 235B
+# LCR_JUDGE_MODEL_ID=<judge-model-id>
+# NS_JUDGE_URL=https://<your-inference-host>/v1        # shared by both judges above
+
+# Tau2 (tau2_bench_telecom) — user-sim Qwen3 235B, judger gpt-oss-120B
+# TAU2_USER_MODEL_ID=<user-simulator-model-id>
+# TAU2_JUDGER_MODEL_ID=<judger-model-id>
+# TAU2_ENDPOINT_URL=https://<your-inference-host>/v1/chat/completions   # user + judger
+
 # terminal-bench-hard (AWS sandbox)
 # AWS_ACCESS_KEY_ID=
 # AWS_SECRET_ACCESS_KEY=
diff --git a/.claude/skills/evaluation/recipes/tasks/aa/hle.md b/.claude/skills/evaluation/recipes/tasks/aa/hle.md
@@ -6,8 +6,12 @@
 
 ## Params
 
-This is the text-only HLE task with params aligned to Artificial Analysis Index
-v2. HLE is judge-scored and requires judge credentials.
+Text-only HLE, params aligned to Artificial Analysis Index v2; judge-scored.
+Substitute the judge `model_id`/`url` with the literal values you keep in `.env`
+(`HLE_JUDGE_MODEL_ID` rec. **GPT-4o**, `NS_JUDGE_URL`; see `recipes/env.example`) —
+they're config, not secrets, so they don't need exporting. Only `api_key`
+(`INFERENCE_API_KEY`) is exported and read by the harness. Keep the judge fixed
+across comparable runs.
 
 ## YAML Fragment
 
@@ -23,9 +27,9 @@ Use this inside the top-level `evaluation.tasks` list:
       params:
         extra:
           judge:
-            model_id: <hle_aa_judge_model_id>
-            url: <openai_compatible_judge_chat_completions_url>
-            api_key: INFERENCE_API_KEY
+            model_id: <HLE_JUDGE_MODEL_ID>   # from .env; recommended GPT-4o
+            url: <NS_JUDGE_URL>              # from .env (/v1 base)
+            api_key: INFERENCE_API_KEY       # env-var name; exported, read by harness
 ```
 
 ## Score Extraction from mlflow
diff --git a/.claude/skills/evaluation/recipes/tasks/aa/lcr.md b/.claude/skills/evaluation/recipes/tasks/aa/lcr.md
@@ -6,8 +6,10 @@
 
 ## Params
 
-Recommended judge: use Qwen3 235B as an OpenAI-compatible equality-checker
-judge, and keep the same judge across comparable runs.
+Judge-scored (equality checker). Substitute the judge `model_id`/`url` with the
+literal values you keep in `.env` (`LCR_JUDGE_MODEL_ID` rec. **Qwen3 235B**,
+`NS_JUDGE_URL`; see `recipes/env.example`) — config, not secrets, so no export
+needed; only `api_key` (`INFERENCE_API_KEY`) is exported. Keep the judge fixed.
 
 AA-LCR needs long context: plan for roughly 120K input tokens plus 16K
 generation tokens. Set deployment `--max-model-len` to at least `131072`, and
@@ -52,9 +54,9 @@ block. Per SKILL.md Step 3, the deployment flag must live inside
         extra:
           num_repeats: 16
           judge:
-            model_id: <qwen3_235b_judge_model_id>
-            url: <openai_compatible_judge_chat_completions_url>
-            api_key: INFERENCE_API_KEY
+            model_id: <LCR_JUDGE_MODEL_ID>   # from .env; recommended Qwen3 235B
+            url: <NS_JUDGE_URL>              # from .env (/v1 base)
+            api_key: INFERENCE_API_KEY       # env-var name; exported, read by harness
 ```
 
 ## Score Extraction from mlflow
diff --git a/.claude/skills/evaluation/recipes/tasks/aa/tau2_bench_telecom.md b/.claude/skills/evaluation/recipes/tasks/aa/tau2_bench_telecom.md
@@ -6,9 +6,13 @@
 
 ## Params
 
-Tau2 Bench uses the evaluated model as the agent and a separate LLM endpoint as
-the user simulator. Configure the user simulator explicitly and keep it fixed
-across comparable runs.
+Tau2 uses the evaluated model as the agent plus a separate user-simulator endpoint;
+keep both fixed across runs. Substitute the user-sim & judger `model_id`/`url` with the
+literal values you keep in `.env` (`TAU2_USER_MODEL_ID` rec. **Qwen3 235B**,
+`TAU2_JUDGER_MODEL_ID` rec. **gpt-oss-120B**, `TAU2_ENDPOINT_URL`; see
+`recipes/env.example`) — config, not secrets, so no export needed; only `api_key`
+(`INFERENCE_API_KEY`) is exported. tau2-bench needs the full `/v1/chat/completions`
+URL (nemo-skills judges use the `/v1` base).
 
 For parallelism, we have to throttle to a smaller cap due to the test may be throttled by
 user and judger API rate limit. If frequent 429 errors are hit, the reported scores could be much lower.
@@ -40,13 +44,13 @@ Use this inside the top-level `evaluation.tasks` list:
           skip_failed_samples: true
           n_samples: 8
           user:
-            model_id: <user_simulator_qwen_235b_model_id>
-            url: <openai_compatible_user_simulator_chat_completions_url>
-            api_key: INFERENCE_API_KEY
+            model_id: <TAU2_USER_MODEL_ID>     # from .env; recommended Qwen3 235B
+            url: <TAU2_ENDPOINT_URL>           # from .env (full /v1/chat/completions)
+            api_key: INFERENCE_API_KEY         # env-var name; exported, read by harness
           judger:
-            model_id: <judger_gpt_oss_120b_model_id>
-            url: <openai_compatible_judger_chat_completions_url>
-            api_key: INFERENCE_API_KEY
+            model_id: <TAU2_JUDGER_MODEL_ID>   # from .env; recommended gpt-oss-120B
+            url: <TAU2_ENDPOINT_URL>           # from .env (full /v1/chat/completions)
+            api_key: INFERENCE_API_KEY         # env-var name; exported, read by harness
 ```
 
 ## Score Extraction