Commit db9ea8f

cjluo-nv and claude authored
eval skill: parameterize external judge/user-sim endpoints via .env (#1591)
### What does this PR do?

Type of change: documentation

Several AA tasks call an external **judge / user-simulator / scoring endpoint** whose `model_id` + `url` vary per user/site (HLE, AA-LCR, and Tau2 today; the guidance is written as a general pattern for any future such benchmark). Previously each recipe hardcoded `<...>` placeholders that a user had to hand-edit in every config. This makes them reusable **without committing any internal infrastructure**:

- **`recipes/env.example`**: add placeholders (`NS_JUDGE_URL`, `HLE_JUDGE_MODEL_ID`, `LCR_JUDGE_MODEL_ID`, `TAU2_USER_MODEL_ID`, `TAU2_JUDGER_MODEL_ID`, `TAU2_ENDPOINT_URL`) with the **recommended model named** (GPT-4o / Qwen3 235B / gpt-oss-120B) but only **generic hosts** (`https://<your-inference-host>/v1`). No internal hostnames or gateway model-routing strings are committed; real values live in the user's gitignored `.env`.
- **`recipes/tasks/aa/{hle,lcr,tau2_bench_telecom}.md`**: carry `<VAR>` literal placeholders (named after the `.env` keys) that the skill **substitutes as literal values** from the user's `.env`. These are *config, not secrets*, so they are **not** exported, which avoids the `${oc.env:...}` footgun (it silently fails unless the var was exported with `set -a`). Only `api_key` (`INFERENCE_API_KEY`) stays an exported env var read by the harness.
- **`SKILL.md`** (Step 5): instructs literal substitution from `.env`, framed as a general pattern for any external-endpoint task. The `/v1`-base (nemo-skills) vs full `/v1/chat/completions` (tau2-bench) URL distinction is documented.

### Usage

N/A (documentation / skill-template only).

### Testing

`pre-commit run` passes (markdownlint). Verified no internal hostnames / gateway model IDs are present in any committed file.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A (documentation)
- Did you update Changelog?: N/A (skill docs)
- Did you get Claude approval on this PR?: ❌ (pending)

### Additional Information

Branched off latest `main` (includes #1583). Touches `lcr.md`, which #1583 also edited (parallelism field); different lines, rebased cleanly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Documentation**
  * Added comprehensive guidance for configuring external judge and user-simulator endpoints in evaluation tasks
  * Clarified best practices for substituting configuration values from environment configuration files while keeping API keys as environment variables
  * Updated task documentation for HLE, LCR, and Tau2-Bench with improved instructions for credential and endpoint handling

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
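
The literal-substitution step described in the commit message could look roughly like the sketch below. It is not the skill's actual implementation: `parse_env`, `substitute`, the model ID, and the host are hypothetical names and values chosen for illustration, and the `.env` parser is deliberately minimal (plain `KEY=VALUE` lines with optional trailing comments).

```python
import re

# Matches <UPPER_SNAKE> placeholders only, so generic hints such as
# <your-inference-host> are left untouched.
PLACEHOLDER = re.compile(r"<([A-Z][A-Z0-9_]*)>")


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comment lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.split(" #", 1)[0].strip()  # drop a trailing inline comment
        values[key.strip()] = value.strip("'\"")
    return values


def substitute(fragment: str, env: dict) -> str:
    """Replace each <VAR> with its literal .env value; fail loudly if missing."""
    def repl(match):
        name = match.group(1)
        if not env.get(name):
            raise KeyError(f"{name} is not set in .env")
        return env[name]
    return PLACEHOLDER.sub(repl, fragment)


# Hypothetical values, for illustration only.
env = parse_env(
    "HLE_JUDGE_MODEL_ID=example-judge-model\n"
    "NS_JUDGE_URL=https://inference.example.com/v1  # shared by both judges\n"
)
fragment = (
    "model_id: <HLE_JUDGE_MODEL_ID>\n"
    "url: <NS_JUDGE_URL>\n"
    "api_key: INFERENCE_API_KEY\n"
)
resolved = substitute(fragment, env)
print(resolved)
```

Raising on a missing key, rather than leaving the placeholder in place, keeps an unfilled `<VAR>` from reaching a generated config unnoticed. The `api_key` line has no placeholder, so it passes through as the env-var name.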
1 parent 905259f commit db9ea8f

5 files changed

Lines changed: 51 additions & 19 deletions

.claude/skills/evaluation/SKILL.md

Lines changed: 2 additions & 0 deletions
@@ -227,6 +227,8 @@ Implications for the agent:

 4. Apply, show updated list, ask "Final, or more changes?" Loop until confirmed.

+**Tasks that call an external judge / user-simulator / scoring endpoint.** Treat this as a general pattern, not a fixed list — HLE, AA-LCR, and Tau2 need one today, but other benchmarks may too (check each task's recipe). Their `model_id` / `url` are **config, not secrets**: substitute the **literal** values the user keeps in `.env` (keys per the task's recipe + `recipes/env.example`) into the task's `<VAR>` placeholders. Do **not** emit `${oc.env:...}` for these (it silently fails unless the var was exported with `set -a`). Only `api_key` stays an env-var *name* (e.g. `INFERENCE_API_KEY`), exported and read by the harness.
+
 **Known issue — nemo-skills self-deployment:** If using `nemo_skills.*` tasks with self-deployment (vLLM/SGLang/NIM), add at top level:

 ```yaml

.claude/skills/evaluation/recipes/env.example

Lines changed: 20 additions & 0 deletions
@@ -32,6 +32,26 @@ NEMO_EVALUATOR_TRUST_PRE_CMD=1
 # JUDGE_API_KEY=
 # INFERENCE_API_KEY=

+# --- Optional: judge / user-simulator endpoints (model_id + URL) ---
+#
+# External judge / user-simulator / scoring endpoints, for any task that needs one
+# (HLE, AA-LCR, Tau2 below — add more for other such benchmarks; auth via
+# INFERENCE_API_KEY above). These are config, not secrets: the values you set here are
+# substituted as literal model_id/url into the config (matching <VAR> placeholders in
+# the recipes) — they do NOT need to be exported; only INFERENCE_API_KEY is.
+# URL note: nemo-skills uses the /v1 base; tau2-bench needs the full /v1/chat/completions.
+
+# HLE judge (ns_hle_aa) — recommended GPT-4o
+# HLE_JUDGE_MODEL_ID=<judge-model-id>
+# AA-LCR judge (ns_aa_lcr) — recommended Qwen3 235B
+# LCR_JUDGE_MODEL_ID=<judge-model-id>
+# NS_JUDGE_URL=https://<your-inference-host>/v1 # shared by both judges above
+
+# Tau2 (tau2_bench_telecom) — user-sim Qwen3 235B, judger gpt-oss-120B
+# TAU2_USER_MODEL_ID=<user-simulator-model-id>
+# TAU2_JUDGER_MODEL_ID=<judger-model-id>
+# TAU2_ENDPOINT_URL=https://<your-inference-host>/v1/chat/completions # user + judger
+
 # terminal-bench-hard (AWS sandbox)
 # AWS_ACCESS_KEY_ID=
 # AWS_SECRET_ACCESS_KEY=

.claude/skills/evaluation/recipes/tasks/aa/hle.md

Lines changed: 9 additions & 5 deletions
@@ -6,8 +6,12 @@

 ## Params

-This is the text-only HLE task with params aligned to Artificial Analysis Index
-v2. HLE is judge-scored and requires judge credentials.
+Text-only HLE, params aligned to Artificial Analysis Index v2; judge-scored.
+Substitute the judge `model_id`/`url` with the literal values you keep in `.env`
+(`HLE_JUDGE_MODEL_ID` rec. **GPT-4o**, `NS_JUDGE_URL`; see `recipes/env.example`) —
+they're config, not secrets, so they don't need exporting. Only `api_key`
+(`INFERENCE_API_KEY`) is exported and read by the harness. Keep the judge fixed
+across comparable runs.

 ## YAML Fragment

@@ -23,9 +27,9 @@ Use this inside the top-level `evaluation.tasks` list:
 params:
 extra:
 judge:
-model_id: <hle_aa_judge_model_id>
-url: <openai_compatible_judge_chat_completions_url>
-api_key: INFERENCE_API_KEY
+model_id: <HLE_JUDGE_MODEL_ID> # from .env; recommended GPT-4o
+url: <NS_JUDGE_URL> # from .env (/v1 base)
+api_key: INFERENCE_API_KEY # env-var name; exported, read by harness
 ```

 ## Score Extraction from mlflow
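
The export distinction this recipe relies on (only `api_key` is exported; `model_id`/`url` are substituted as literals) comes down to ordinary shell behavior: a plain assignment is not visible to child processes, while one made under `set -a` is. The sketch below demonstrates only that shell behavior, assuming a POSIX `sh` is on `PATH`. `DEMO_VAR` and `child_sees` are hypothetical names, and the claim that `${oc.env:...}` fails silently for unexported variables is taken from the commit text, not verified here.

```python
import os
import subprocess


def child_sees(setup: str) -> str:
    """Run `setup` in a POSIX shell, then report what a child process sees."""
    # The inner `sh -c` is a child process: it only sees exported variables.
    script = setup + "\nsh -c 'printf %s \"${DEMO_VAR:-<unset>}\"'"
    # Start from an environment that does not already define DEMO_VAR.
    env = {k: v for k, v in os.environ.items() if k != "DEMO_VAR"}
    result = subprocess.run(
        ["sh", "-c", script],
        capture_output=True, text=True, check=True, env=env,
    )
    return result.stdout


# Plain assignment (what sourcing a .env does without `set -a`).
plain = child_sees("DEMO_VAR=hello")
# Assignment made while allexport is on.
exported = child_sees("set -a\nDEMO_VAR=hello\nset +a")

print(f"plain assignment -> child sees {plain!r}")
print(f"under set -a     -> child sees {exported!r}")
```

Because the judge `model_id` and `url` are written into the config as literal values, they never depend on this export step; only `INFERENCE_API_KEY` has to be exported for the harness to read it.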

.claude/skills/evaluation/recipes/tasks/aa/lcr.md

Lines changed: 7 additions & 5 deletions
@@ -6,8 +6,10 @@

 ## Params

-Recommended judge: use Qwen3 235B as an OpenAI-compatible equality-checker
-judge, and keep the same judge across comparable runs.
+Judge-scored (equality checker). Substitute the judge `model_id`/`url` with the
+literal values you keep in `.env` (`LCR_JUDGE_MODEL_ID` rec. **Qwen3 235B**,
+`NS_JUDGE_URL`; see `recipes/env.example`) — config, not secrets, so no export
+needed; only `api_key` (`INFERENCE_API_KEY`) is exported. Keep the judge fixed.

 AA-LCR needs long context: plan for roughly 120K input tokens plus 16K
 generation tokens. Set deployment `--max-model-len` to at least `131072`, and
@@ -52,9 +54,9 @@ block. Per SKILL.md Step 3, the deployment flag must live inside
 extra:
 num_repeats: 16
 judge:
-model_id: <qwen3_235b_judge_model_id>
-url: <openai_compatible_judge_chat_completions_url>
-api_key: INFERENCE_API_KEY
+model_id: <LCR_JUDGE_MODEL_ID> # from .env; recommended Qwen3 235B
+url: <NS_JUDGE_URL> # from .env (/v1 base)
+api_key: INFERENCE_API_KEY # env-var name; exported, read by harness
 ```

 ## Score Extraction from mlflow

.claude/skills/evaluation/recipes/tasks/aa/tau2_bench_telecom.md

Lines changed: 13 additions & 9 deletions
@@ -6,9 +6,13 @@

 ## Params

-Tau2 Bench uses the evaluated model as the agent and a separate LLM endpoint as
-the user simulator. Configure the user simulator explicitly and keep it fixed
-across comparable runs.
+Tau2 uses the evaluated model as the agent plus a separate user-simulator endpoint;
+keep both fixed across runs. Substitute the user-sim & judger `model_id`/`url` with the
+literal values you keep in `.env` (`TAU2_USER_MODEL_ID` rec. **Qwen3 235B**,
+`TAU2_JUDGER_MODEL_ID` rec. **gpt-oss-120B**, `TAU2_ENDPOINT_URL`; see
+`recipes/env.example`) — config, not secrets, so no export needed; only `api_key`
+(`INFERENCE_API_KEY`) is exported. tau2-bench needs the full `/v1/chat/completions`
+URL (nemo-skills judges use the `/v1` base).

 For parallelism, we throttle to a smaller cap because the test may be throttled by
 the user and judger API rate limits. If 429 errors are frequent, the reported scores could be much lower.
@@ -40,13 +44,13 @@ Use this inside the top-level `evaluation.tasks` list:
 skip_failed_samples: true
 n_samples: 8
 user:
-model_id: <user_simulator_qwen_235b_model_id>
-url: <openai_compatible_user_simulator_chat_completions_url>
-api_key: INFERENCE_API_KEY
+model_id: <TAU2_USER_MODEL_ID> # from .env; recommended Qwen3 235B
+url: <TAU2_ENDPOINT_URL> # from .env (full /v1/chat/completions)
+api_key: INFERENCE_API_KEY # env-var name; exported, read by harness
 judger:
-model_id: <judger_gpt_oss_120b_model_id>
-url: <openai_compatible_judger_chat_completions_url>
-api_key: INFERENCE_API_KEY
+model_id: <TAU2_JUDGER_MODEL_ID> # from .env; recommended gpt-oss-120B
+url: <TAU2_ENDPOINT_URL> # from .env (full /v1/chat/completions)
+api_key: INFERENCE_API_KEY # env-var name; exported, read by harness
 ```

 ## Score Extraction
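
Since the two URL shapes are easy to mix up (`/v1` base for the nemo-skills judges, full `/v1/chat/completions` for tau2-bench), a small pre-flight check on the `.env` values may help. This is a sketch under assumptions: `url_problem` and `EXPECTED_SUFFIX` are hypothetical helpers, not part of the skill or either harness, and the example host is made up.

```python
from urllib.parse import urlparse

# Expected URL shape per .env key, as described in recipes/env.example.
EXPECTED_SUFFIX = {
    "NS_JUDGE_URL": "/v1",
    "TAU2_ENDPOINT_URL": "/v1/chat/completions",
}


def url_problem(key: str, url: str):
    """Return a human-readable problem, or None if the URL shape looks right."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return f"{key}: not an http(s) URL"
    path = parsed.path.rstrip("/")  # tolerate a trailing slash
    suffix = EXPECTED_SUFFIX[key]
    if not path.endswith(suffix):
        return f"{key}: path should end with {suffix}, got {path or '/'}"
    return None


# Hypothetical host, for illustration only.
checks = {
    "NS_JUDGE_URL": "https://inference.example.com/v1",
    "TAU2_ENDPOINT_URL": "https://inference.example.com/v1",  # wrong shape
}
for key, url in checks.items():
    print(url_problem(key, url) or f"{key}: ok")
```

The check is purely syntactic; it does not contact the endpoint or confirm that the model ID is served there.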
