Commit db9ea8f
eval skill: parameterize external judge/user-sim endpoints via .env (#1591)
### What does this PR do?
Type of change: documentation
Several AA tasks call an external **judge / user-simulator / scoring
endpoint** whose `model_id` and `url` vary per user or site. Today these
are HLE, AA-LCR, and Tau2; the guidance is written as a general pattern
for any future benchmark of this kind. Previously, each recipe hardcoded
`<...>` placeholders that a user had to hand-edit in every config. This
PR makes the recipes reusable **without committing any internal
infrastructure**:
- **`recipes/env.example`**: adds the placeholders `NS_JUDGE_URL`,
`HLE_JUDGE_MODEL_ID`, `LCR_JUDGE_MODEL_ID`, `TAU2_USER_MODEL_ID`,
`TAU2_JUDGER_MODEL_ID`, and `TAU2_ENDPOINT_URL`. Each names the
**recommended model** (GPT-4o / Qwen3 235B / gpt-oss-120B) but uses only
**generic hosts** (`https://<your-inference-host>/v1`). No internal
hostnames or gateway model-routing strings are committed; real values
live in the user's gitignored `.env`.
- **`recipes/tasks/aa/{hle,lcr,tau2_bench_telecom}.md`**: carry literal
`<VAR>` placeholders (named after the `.env` keys) that the skill
**substitutes as literal values** from the user's `.env`. These are
*config, not secrets*, so they are **not** exported. This avoids the
`${oc.env:...}` footgun, which silently fails unless the variable was
exported with `set -a`. Only `api_key` (`INFERENCE_API_KEY`) stays an
exported env var read by the harness.
- **`SKILL.md`** (Step 5): instructs literal substitution from `.env`,
framed as a general pattern for any external-endpoint task.
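The literal substitution described above can be sketched in a few lines. This is only an illustration of the idea, not the skill's implementation (the skill is an instruction to the agent, not a script). The helper names `parse_env` and `substitute`, the YAML field layout, the example host, and the model ID value are all hypothetical; only the `<VAR>` convention and the key names come from this PR.

```python
import re

# Matches <VAR> placeholders named like .env keys (upper-case with underscores).
PLACEHOLDER = re.compile(r"<([A-Z][A-Z0-9_]*)>")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments.

    Assumption: no `export ` prefixes, no multi-line or interpolated values.
    """
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def substitute(template: str, values: dict[str, str]) -> str:
    """Replace each <VAR> with its literal value; fail loudly if a key is missing."""
    def repl(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"{key} is not set in .env")
        return values[key]

    return PLACEHOLDER.sub(repl, template)


# Hypothetical .env contents and recipe fragment, for illustration only.
env_text = """
# judge endpoint (placeholder values)
NS_JUDGE_URL=https://judge.example.com/v1
HLE_JUDGE_MODEL_ID=example-judge-model
"""

template = """\
judge:
  model_id: <HLE_JUDGE_MODEL_ID>
  url: <NS_JUDGE_URL>
"""

rendered = substitute(template, parse_env(env_text))
print(rendered)
```

Nothing here reads or writes the process environment, so the result does not depend on whether the `.env` values were exported. `INFERENCE_API_KEY` is deliberately left out of the sketch, since it remains an exported variable read by the harness.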
The `/v1`-base (nemo-skills) vs full `/v1/chat/completions` (tau2-bench)
URL distinction is documented.
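As a rough illustration of that distinction, the two URL forms differ only by a trailing `/chat/completions` path. The helpers below (`to_base_url`, `to_chat_url`) and the example host are hypothetical and not part of either harness; which `.env` key takes which form should be checked against the task docs.

```python
CHAT_SUFFIX = "/chat/completions"


def to_base_url(url: str) -> str:
    """Return the /v1-style base URL, dropping a trailing /chat/completions."""
    url = url.rstrip("/")
    if url.endswith(CHAT_SUFFIX):
        url = url[: -len(CHAT_SUFFIX)]
    return url


def to_chat_url(url: str) -> str:
    """Return the full chat-completions URL for a base or already-full URL."""
    return to_base_url(url) + CHAT_SUFFIX


# Base form (the shape nemo-skills expects, per the note above).
print(to_base_url("https://host.example.com/v1/chat/completions"))
# Full form (the shape tau2-bench expects, per the note above).
print(to_chat_url("https://host.example.com/v1/"))
```

A check like this could catch the common mistake of pasting the same URL into both kinds of placeholder.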
### Usage
N/A — documentation / skill-template only.
### Testing
`pre-commit run` passes (markdownlint). Verified no internal hostnames /
gateway model IDs are present in any committed file.
### Before your PR is "*Ready for review*"
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A (documentation)
- Did you update Changelog?: N/A (skill docs)
- Did you get Claude approval on this PR?: ❌ (pending)
### Additional Information
Branched off latest `main` (includes #1583). This PR touches `lcr.md`,
which #1583 also edited (parallelism field); the changes are on
different lines and rebased cleanly.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Documentation**
* Added comprehensive guidance for configuring external judge and
user-simulator endpoints in evaluation tasks
* Clarified best practices for substituting configuration values from
environment configuration files while keeping API keys as environment
variables
* Updated task documentation for HLE, LCR, and Tau2-Bench with improved
instructions for credential and endpoint handling
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

1 parent 905259f commit db9ea8f
5 files changed
Lines changed: 51 additions & 19 deletions
File tree
- .claude/skills/evaluation
- recipes
- tasks/aa
Lines changed: 13 additions & 9 deletions