feat: Add agent skills for NeMo Gym#1062

Open
lbliii wants to merge 16 commits into main from lbliii/prague-v2
Conversation


lbliii commented Apr 13, 2026

Internal validation only. This PR exercises NeMo Gym end-to-end on a non-trivial workload (paired-rollout grading of agent skills) and ships the harness so other teams can reproduce, validate, and reuse the components. The substantive material — methodology, results, gotchas, upstream pitches — lives in notes/skill-eval/, not in docs/.

Why this exists

Two questions, two days:

  1. Can we measure whether a .claude/skills/*/SKILL.md actually helps an agent do its job? Answer: yes, with a paired-rollout 2×2 over (skill-in-prompt × references-on-disk) — and we found one skill that competes with its own references and three skills whose apparent value was mostly the references' value, not the skill's.
  2. Is NeMo Gym the right substrate for that kind of evaluation? Answer: yes for skill eval today (rollout collection + service orchestration earned their keep); the methodology ports outside NeMo Gym for generic doc eval. Concrete sharp edges and upstream candidates listed below.

For internal validators: clone, bring it up (5 commands), run a 480-rollout scoreboard (~30 min), and read the results doc. You'll have a working harness and a decision-shaped artifact at the end.

What's here

Three NeMo Gym servers

flowchart LR
  J[input JSONL] --> A[skill_eval_agent<br/>/run]
  A -->|/seed_session<br/>/run_bash<br/>/read_file<br/>/close| W[skill_workspace<br/>per-session sandbox]
  A -->|/v1/responses| M[policy_model]
  A -->|/verify| JG[skill_judge<br/>LLM-as-judge]
  JG -->|/v1/responses| M
  JG --> A
  A --> R[output JSONL<br/>reward + per-assertion grades]
  • resources_servers/skill_workspace/ — per-session tmpdir with run_bash + read_file tools. Seeds the skill's scripts/ and references/ based on per-request flags. Sandbox env strips host PATH; workspace-local python → python3 symlink covers macOS rollouts. SKILL.md is never seeded (see contamination note in results.md).
  • resources_servers/skill_judge/ — LLM-as-judge. One call grades (response, tool_calls, assertions[]) → grades[] with per-assertion binary verdicts plus evidence strings. Reward = fraction satisfied.
  • responses_api_agents/skill_eval_agent/ — orchestrator. Seeds workspace → model↔tool loop → forwards transcript to judge → closes workspace (in finally).
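A minimal sketch of that flow, assuming illustrative payload field names and a caller-supplied model/tool loop. The endpoint names (/seed_session, /verify, /close) match the diagram above; everything else is an assumption, not the agent's actual code:

```python
from typing import Callable
import httpx

def run_one_rollout(record: dict, workspace_url: str, judge_url: str,
                    model_tool_loop: Callable) -> dict:
    """Sketch of the skill_eval_agent /run shape: seed -> model/tool loop -> judge -> close."""
    meta = record["verifier_metadata"]
    client = httpx.Client(timeout=60)
    # Seed a per-session sandbox. SKILL.md is never copied in; references/ and scripts/
    # are gated on the per-request flags (absent flags fall back to with_skill).
    session = client.post(f"{workspace_url}/seed_session", json={
        "skill_name": meta["skill_name"],  # illustrative field names throughout
        "with_references": meta.get("with_references", meta["with_skill"]),
        "with_scripts": meta.get("with_scripts", meta["with_skill"]),
    }).json()
    try:
        # Model <-> tool loop (supplied by the caller): call /v1/responses, execute
        # run_bash / read_file tool calls against the workspace, feed results back, repeat.
        transcript = model_tool_loop(record, session)
        # Judge the transcript: one call returns per-assertion binary verdicts with evidence.
        grades = client.post(f"{judge_url}/verify", json={
            "response": transcript["final_text"],
            "tool_calls": transcript["tool_calls"],
            "assertions": record["assertions"],
        }).json()["grades"]
        reward = sum(g["satisfied"] for g in grades) / len(grades)  # fraction satisfied
        return {"reward": reward, "grades": grades}
    finally:
        # Close in finally so a failed rollout never leaks its tmpdir.
        client.post(f"{workspace_url}/close", json={"session_id": session["session_id"]})
```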

The 2×2 control structure

Two independent flags in verifier_metadata produce four cells per scenario:

|                  | with_references=False | with_references=True |
|------------------|-----------------------|----------------------|
| with_skill=False | blind — model priors only | docs-only — realistic reader without the skill pack |
| with_skill=True  | skill-only — SKILL.md in prompt, nothing on disk | skill+docs — realistic reader with the skill pack |
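For orientation, the four cells expressed as flag pairs (cell names and flag names are from the harness; writing them as a Python dict here is purely illustrative):

```python
# (with_skill, with_references) flag pairs carried in verifier_metadata, one per cell.
CELLS = {
    "blind":      {"with_skill": False, "with_references": False},
    "docs-only":  {"with_skill": False, "with_references": True},
    "skill-only": {"with_skill": True,  "with_references": False},
    "skill+docs": {"with_skill": True,  "with_references": True},
}
```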

The diff tool reports four named marginal effects on three axes (Δreward, Δtools, Δtokens):

  • skill | refs=T = skill+docs − docs-only — realistic-deployment value of the skill overlay (the number that matters for shipping).
  • skill | refs=F = skill-only − blind — skill as a standalone doc.
  • refs | skill=T = skill+docs − skill-only — do refs still matter when the skill is prompted?
  • refs | skill=F = docs-only − blind — marginal value of references alone.
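The arithmetic behind those contrasts, as a sketch over per-cell means for one skill and one axis. The diff script performs the same subtractions per axis; the function below is illustrative, not its actual code:

```python
def marginal_effects(cell_means: dict[str, float]) -> dict[str, float]:
    """Four named contrasts over the 2x2 cell means (reward, tool calls, or tokens)."""
    return {
        "skill | refs=T": cell_means["skill+docs"] - cell_means["docs-only"],
        "skill | refs=F": cell_means["skill-only"] - cell_means["blind"],
        "refs | skill=T": cell_means["skill+docs"] - cell_means["skill-only"],
        "refs | skill=F": cell_means["docs-only"] - cell_means["blind"],
    }
```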

Five-field provenance

Every record carries skill_md_sha, evals_sha, fixtures_sha, judge_prompt_sha, harness_version in verifier_metadata. The diff tool tags each delta-of-delta with what changed: md, evals, md+evals, harness, or same-all (no input changed → it's noise/judge drift, not a skill effect).
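A sketch of how such hashes could be stamped. The five field names and the 12-character length match what ships in the JSONL; the choice of sha256 and the on-disk layout of evals and fixtures are assumptions, not necessarily what build_skill_eval_jsonl.py does:

```python
import hashlib
from pathlib import Path

def content_sha(path: Path, length: int = 12) -> str:
    """Short, stable content hash for a single file (algorithm is an assumption)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:length]

def stamp_provenance(skill_dir: Path, judge_prompt: Path, harness_version: str) -> dict[str, str]:
    fixture_files = sorted((skill_dir / "evals" / "files").glob("*"))  # layout assumed
    fixtures_digest = hashlib.sha256(b"".join(p.read_bytes() for p in fixture_files)).hexdigest()[:12]
    return {
        "skill_md_sha": content_sha(skill_dir / "SKILL.md"),
        "evals_sha": content_sha(skill_dir / "evals.json"),  # path assumed
        "fixtures_sha": fixtures_digest,
        "judge_prompt_sha": content_sha(judge_prompt),
        "harness_version": harness_version,
    }
```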

Tooling

  • scripts/build_skill_eval_jsonl.py — emits 4-cell 2×2 with full provenance.
  • scripts/diff_skill_scoreboards.py — auto-detects 2×2 vs legacy; renders multi-axis deltas with provenance attribution.

Validate locally (5 commands, ~30 min)

# 1. Setup
uv venv && uv sync --extra dev --group docs

# 2. Configure model endpoint
cat > env.yaml <<'EOF'
policy_base_url: <your /v1 endpoint>
policy_api_key: <your key>
policy_model_name: <your model>
EOF

# 3. Bring up the four servers
ng_run "+config_paths=[
  resources_servers/skill_workspace/configs/skill_workspace.yaml,
  resources_servers/skill_judge/configs/skill_judge.yaml,
  responses_api_models/openai_model/configs/openai_model.yaml,
  responses_api_agents/skill_eval_agent/configs/skill_eval_agent.yaml
]" +skip_venv_if_present=true
ng_status   # expect "4 healthy"

# 4. Build input JSONL (96 records: 4 cells × 8 skills × 3 scenarios)
python scripts/build_skill_eval_jsonl.py \
  --skills-dir .claude/skills \
  --output responses_api_agents/skill_eval_agent/data/example.jsonl

# 5. Collect rollouts + render scoreboard
ng_collect_rollouts \
  +agent_name=skill_eval_agent \
  +input_jsonl_fpath=responses_api_agents/skill_eval_agent/data/example.jsonl \
  +output_jsonl_fpath=results/v8.jsonl \
  +num_repeats=5 +num_samples_in_parallel=6 \
  "+responses_create_params={max_output_tokens: 8192}"
python scripts/diff_skill_scoreboards.py results/v8.jsonl

Pace observed: ~30 min wall-clock for 480 rollouts at 6-way parallel on the NVIDIA inference API. Lighter validation: pass --cells=blind,skill+docs to the build script (2 cells × 8 skills × 3 scenarios × n=5 = 240 rollouts, ~15 min).

Headline results (v8, n=5, 480 rollouts)

Realistic-deployment column only. Bold = effect outside ~±0.10 noise floor. Full table + per-skill verdicts in notes/skill-eval/results.md.

| skill | skill \| refs=T Δreward | Δtools | verdict |
|-------|--------------------------|--------|---------|
| gym-run | **+0.380** | −1.67 | load-bearing keeper |
| add-benchmark | **+0.162** | −1.13 | keep |
| gym-debug | **+0.133** | −4.13 | keep; biggest efficiency teach |
| gym-data | +0.093 | −3.00 | now measurable after scenario rewrite |
| gym-scaffold-agent | +0.053 | −1.00 | content gap (non-RL agent patterns missing) |
| gym-review | +0.048 | −4.73 | redundant with refs on accuracy; teaches efficiency |
| gym-profile | +0.040 | −2.00 | flipped from −0.107 after patterns-table rewrite |
| gym-config | +0.027 | −1.60 | 2/3 scenarios ceiling-clipped |

Three predicted prescriptions, all landed with clean per-skill provenance attribution:

  • gym-profile patterns-table rewrite → skill | refs=T flipped −0.107 → +0.040 (Δ +0.147)
  • gym-data adversarial scenarios → docs-only 0.97 → 0.88, opening measurable headroom
  • gym-review SKILL.md shrink (110 → 53 lines) → realistic preserved, standalone collapsed (context-dependent win)

One universal pattern: every skill reduces tool calls in the realistic cell (−1.00 to −4.73 per rollout). Reward-only scoring is lossy; always read both axes.

NeMo Gym sharp edges (8 items, ~1 hour each the first time)

These are concrete, reproducible, and cheap to fix; they are probably this PR's most useful output for the framework team:

  1. ng_run runs python app.py not python -m. Relative imports break — must use absolute imports from project root.
  2. Trailing slashes in policy_base_url produce double-slash 404s on some providers.
  3. Some /v1/responses providers return object: "chat.completion"; NeMoGymResponse validation fails on the literal. Normalize in the model server.
  4. OpenAI's FunctionToolParam requires explicit strict: False or validation fails at the model server.
  5. Host NeMo Gym .venv/bin leaks into subprocess PATH via inherited env; rollouts can see ng_* binaries, Ray sockets, HF/MLflow credentials.
  6. Sandbox PATH strip breaks python on macOS — needed a workspace-local python → python3 symlink (this and item 5 are sketched after this list).
  7. Rollout JSONL only persists the final turn's token usage. Multi-turn loops lose intermediate tokens.
  8. Workspace cleanup must be inside a finally block — anything else leaks tmpdirs.
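Items 5 and 6 in practice: a minimal sketch of a sanitized subprocess environment for run_bash, assuming a workspace-local bin/ directory. The actual skill_workspace server may differ in detail:

```python
import shutil
import subprocess
from pathlib import Path

def sandbox_env(workspace: Path) -> dict[str, str]:
    """Build an env for run_bash that does not inherit the host .venv, ng_* binaries,
    Ray sockets, or HF/MLflow credentials (item 5)."""
    ws_bin = workspace / "bin"
    ws_bin.mkdir(exist_ok=True)
    # Item 6: macOS ships python3 but no `python`; give the workspace its own shim.
    python3 = shutil.which("python3")
    shim = ws_bin / "python"
    if python3 and not shim.exists():
        shim.symlink_to(python3)
    return {
        "PATH": f"{ws_bin}:/usr/bin:/bin",  # host .venv/bin is deliberately absent
        "HOME": str(workspace),             # keep host credential files out of reach (assumption)
    }

def run_bash(workspace: Path, cmd: str) -> subprocess.CompletedProcess:
    return subprocess.run(
        ["bash", "-c", cmd],
        cwd=workspace, env=sandbox_env(workspace),
        capture_output=True, text=True, timeout=120,
    )
```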

Methodology lessons that likely generalize beyond skill eval:

  1. Every artifact seeded into a workspace contaminates the control arm: before the first fix, 100% of control rollouts peeked at SKILL.md; before the second fix, 100% peeked at references/.
  2. Content-hash provenance is necessary but not sufficient. Same-sha runs at temperature=0 still drift up to 0.20 per cell from judge non-determinism.
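A sketch of the corresponding noise-floor check: hold provenance fixed (same-all), repeat the run, and treat the per-cell spread as the floor any real effect must clear. Record field names below are illustrative:

```python
import json
from collections import defaultdict

def per_cell_spread(jsonl_path: str) -> dict[tuple, float]:
    """Max-minus-min reward per (skill, scenario, cell) across the n repeats.
    With no input change, this spread is judge drift, not a skill effect."""
    rewards = defaultdict(list)
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            meta = rec["verifier_metadata"]
            key = (meta["skill_name"], meta["scenario_id"], meta["cell"])  # field names assumed
            rewards[key].append(rec["reward"])
    return {key: max(vals) - min(vals) for key, vals in rewards.items()}
```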

What's worth absorbing upstream

Two with existing demand. Two parked.

Worth a PR:

  • Assertion-grade LLM-as-judge as a shared base class. Three existing in-tree implementations (ours, code_gen, equivalence_llm_judge) share a common shape; a small base class would save the next team from re-inventing it.
  • Per-turn token aggregation in the agent base class. Today only the final response's usage reaches the output JSONL; every multi-turn agent eventually wants the sum.
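Roughly what the per-turn aggregation would look like in an agent loop (the usage field names follow an OpenAI-style usage block; the base-class hook itself is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class TokenTotals:
    """Running per-rollout totals; today only the final turn's usage reaches the output JSONL."""
    input_tokens: int = 0
    output_tokens: int = 0
    turns: int = 0

    def add(self, usage: dict) -> None:
        self.input_tokens += usage.get("input_tokens", 0)
        self.output_tokens += usage.get("output_tokens", 0)
        self.turns += 1

# In the agent loop (sketch): call totals.add(response["usage"]) after every model call,
# then emit the summed totals alongside the final-turn usage in the output record.
```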

Parked (no second customer yet):

  • Paired-arm pattern as a framework primitive (a pairing_key + post-hoc join CLI).
  • Framework-level provenance stamping (a narrow ng_stamp_provenance helper that hashes Hydra-config-referenced files).

Read more

  • notes/skill-eval/harness.md — build the harness, the 2×2, provenance, iteration loop, sandbox details, scenarios how-to.
  • notes/skill-eval/results.md — full v8 scoreboard, per-skill audit (verdicts + before/after edits + v8 outcomes), the contamination story, retraction log, open methodology questions.

Test plan

  • 221 core unit tests pass at 96.41% coverage (above the 96.0% threshold).
  • Per-server tests pass: skill_workspace (28), skill_judge (22), skill_eval_agent (9).
  • Builder + script tests pass: test_build_skill_eval_jsonl.py (15), test_eval_skills.py (33).
  • CI lint clean: ruff format --check, ruff check, README env-list row, secrets-detector hook.
  • CI docs build clean: sphinx-build --fail-on-warning succeeds with the skill-eval material moved out of the Sphinx tree.
  • End-to-end: ran 480-rollout v8 scoreboard against NVIDIA inference API; provenance tags match expected md / evals+fx / same-all per skill edited.

🤖 Generated with Claude Code

…, config, data, scaffolding, and execution

Add comprehensive agent skills covering the full NeMo Gym workflow:
- gym-review: Anti-pattern detection with deterministic Python scanner
- gym-debug: Runtime failure diagnosis and log analysis
- gym-profile: Rollout result analysis and reward profiling
- gym-config: Hydra YAML configuration composition and validation
- gym-data: Dataset preparation, validation, and HuggingFace registry
- gym-scaffold-agent: Custom agent server creation patterns
- gym-run: End-to-end benchmark execution workflow

Each skill includes SKILL.md with step-by-step instructions, self-contained
references, eval fixtures, and evals.json for quality measurement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
lbliii and others added 15 commits April 24, 2026 12:08
Ships a three-server NeMo Gym environment that grades agent-skill content
(.claude/skills/*/SKILL.md) via paired with-skill vs without-skill rollouts:

- resources_servers/skill_workspace: per-session tmpdir with run_bash/read_file
  tools. Sandbox env strips host PATH to prevent rollouts seeing ng_* binaries,
  provides a python -> python3 shim for prompts assuming `python`.
- resources_servers/skill_judge: LLM-as-judge returning per-assertion binary
  grades with evidence; reward = fraction satisfied.
- responses_api_agents/skill_eval_agent: orchestrates seed -> tool loop ->
  verify -> close. Reads verifier_metadata.with_skill to decide whether to
  prepend SKILL.md as a system message.

Tooling:
- scripts/build_skill_eval_jsonl.py emits paired records with 5 content
  hashes in verifier_metadata (skill_md_sha, evals_sha, fixtures_sha,
  judge_prompt_sha, harness_version) so downstream diffs are attribution-safe.
- scripts/diff_skill_scoreboards.py renders per-skill w/wo/delta with a
  same-all / partial(N/5) / legacy provenance tag.
- scripts/eval_skills.py is a standalone runner for ad-hoc evaluation.
- scripts/build_shape_probe.py drives one-off content A/B tests.

Documentation:
- Two-part tutorial under docs/environment-tutorials (harness + scoreboard)
  covering methodology: with-vs-without deltas, provenance, noise floor
  calibration, per-scenario breakdown, and an iteration loop.
- A dogfood checkpoint summarizing findings, gotchas, and upstream
  candidates for the NeMo Gym team.

Skill-content fixes found during iteration:
- gym-config: rewrote as a "read-before-answer" checklist (flipped from
  -0.112 to +0.033 on the first iteration).
- gym-run/evals: loosened over-literal assertions that punished valid
  paraphrases.
- gym-scaffold-agent/evals: tightened scenario-3 assertion 5 so the judge
  can't credit critical reviews of clean code.

One framework nit: responses_api_models/openai_model normalizes
object="chat.completion" to "response" because some providers return the
former on /v1/responses, which otherwise fails NeMoGymResponse validation.
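Roughly what that normalization amounts to (a sketch, not the actual openai_model code):

```python
def normalize_responses_object(payload: dict) -> dict:
    # Some providers label /v1/responses payloads as chat completions;
    # NeMoGymResponse validates the `object` literal, so coerce it before validation.
    if payload.get("object") == "chat.completion":
        payload = {**payload, "object": "response"}
    return payload
```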

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Previously the harness conflated "skill in system prompt" with "skill's
supporting artifacts on disk" under a single with_skill flag. Because
seed_session unconditionally copied references/ and scripts/ into the
workspace, the control arm wasn't actually a control — we measured 100%
of without_skill rollouts on gym-profile reading references/metrics-guide.md
(which contains the exact nouns the assertions test for).

This commit splits the factor:

- skill_workspace: SkillWorkspaceSeedSessionRequest now has independent
  with_references / with_scripts flags. seed_session gates the copytree
  for references/ and scripts/ on those. SKILL.md remains never-seeded.
- skill_eval_agent: reads with_references / with_scripts from
  verifier_metadata and forwards to seed_session. Back-compat: when those
  fields are absent, defaults to the with_skill value.
- build_skill_eval_jsonl: emits the full 2×2 by default over
  (with_skill, with_references) — four cells per scenario labeled
  blind / docs-only / skill-only / skill+docs. --cells flag restricts
  the subset emitted.

Tests added for each cell combination in the workspace, and for
cell-label emission and subsetting in the builder.

What the cells mean for interpretation:
  blind       model priors only; no prompt, no disk
  docs-only   realistic reader without the skill pack installed
  skill-only  prompt-only; diagnoses how load-bearing on-disk artifacts are
  skill+docs  realistic reader with the skill pack (what previous "with" was)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Diff tool now auto-detects whether a JSONL is the legacy two-arm shape
(pre-Phase-1) or the 4-cell 2×2 shape, and renders accordingly. Both modes
now report three axes:

- Δreward    — accuracy (per-assertion satisfaction rate)
- Δtools     — number of tool calls made during the rollout
- Δtokens    — final-response output tokens (per-turn aggregation is Phase 2)

For 2×2 JSONLs, each skill shows four cell-level rows plus four named
marginal effects:

  skill | refs=T    value of SKILL.md prompt given references on disk
                    (= realistic-deployment marginal effect)
  skill | refs=F    value of SKILL.md prompt without references
                    (= skill-as-standalone-doc)
  refs  | skill=T   value of references given the skill is already in prompt
  refs  | skill=F   value of references as the reader's only scaffold

Receipt for why this matters: v6's legacy two-arm read showed gym-run at
Δreward=+0.487 and gym-data at Δreward=-0.013, but both skills cut tool
calls by 2.4–3.3 per rollout. The reward-only view was hiding half the
signal; the new multi-axis view surfaces it automatically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Addresses PM review feedback: the original checkpoint conflated what we
built with what we learned about specific skills. The infra work is
defensible; the per-skill claims rest on measurements with known
contamination (SKILL.md on disk in the control arm, references/*.md on
disk in both arms). Separating the two makes the artifact sharable
without carrying forward claims we can't yet defend.

- skill-eval-infra-v1.md: infra artifact + sharp-edges list + two upstream
  candidates with existing receipts. No per-skill claims.
- skill-eval-findings.md: gated on v7 (post-contamination-fix) rollout.
  Pre-fix numbers archived but marked as not-defensible. Retracted claims
  listed explicitly.
- Original checkpoint retained with a "superseded by" banner pointing to
  the split.

Also retracts two prior claims that don't survive the noise-floor check:
"every skill reduces tool calls 0.8–4.8" (measurement against contaminated
control) and "gym-profile is actively misleading" (aggregate delta inside
per-cell noise; per-scenario sc3 claim preserved but scoped).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
First clean measurement since the references-contamination fix. 480
rollouts, 4 cells × 8 skills × 3 scenarios × n=5.

Key results on the realistic-deployment contrast (skill | refs=T):

- gym-run +0.436 (the only pre-fix claim that fully survives; has no
  references/ dir so its control was never contaminated)
- add-benchmark +0.141, gym-config +0.111, gym-debug +0.080 (survive
  but smaller than the contaminated numbers)
- gym-profile -0.107 (refined read: skill as standalone is +0.278,
  competes with its own references when both present)
- gym-review +0.029 (mostly redundant with references — refs|skill=F
  is +0.602)
- gym-scaffold-agent -0.040, gym-data +0.013 (both inside noise)

Efficiency story holds and is the strongest multi-skill pattern: every
skill reduces tool calls on the realistic contrast (-0.73 to -4.87 calls
per rollout). The earlier checkpoint's "reward-only is lossy" claim
survives on the correct contrast.

Retracts three prior claims:
- "every skill reduces tool calls 0.8-4.8" replaced by correct-contrast
  magnitudes above
- "gym-profile is actively misleading" replaced by "competes with its
  own references; useful standalone"
- shape-probe null result withdrawn pending clean rerun

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Sweeps the long-form docs to match the current state of the harness:

- skill-eval-harness.md (Part 1): /seed_session now documents with_references
  / with_scripts flags; adds a "Gotcha: workspace contamination" section
  explaining why SKILL.md is never seeded and why references gating matters;
  orchestrator /run example updated to forward both flags from
  verifier_metadata.
- skill-eval-scoreboard.md (Part 2): methodology section rewritten to
  introduce the 2×2 cells and the four named marginal effects as the
  primary frame; Step 2 JSONL example updated to show the new provenance
  fields and cell labels; Step 4 replaces the old legacy-contaminated
  scoreboard with the v7 multi-axis table and interpretive heuristics for
  each row pattern; Step 7 retires the v1→v2 and v3→v4 narratives in favor
  of a v6→v7 contamination-fix worked example plus an honest judge-drift
  noise-floor note; "What's next" replaced with concrete follow-ons and
  cross-links to findings.md and infra-v1.md.
- skill_workspace/README: documents with_references / with_scripts on
  /seed_session; explains the "SKILL.md never seeded" invariant.
- skill_eval_agent/README: schemas the 2×2 cell in the input JSONL
  example; documents the five provenance fields; updates the /run flow.
- index.md: updated card blurbs to reflect 2×2 methodology.

Scope: documentation only. No code or test changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Per-skill verdicts from v7 (4-cell 2×2, n=5), grounded in:

- the four headline effects: Δreward on skill|refs=T, skill|refs=F, and
  refs|skill=F, plus Δtools on the realistic contrast (refs=T)
- per-scenario cell means
- a root-cause hypothesis for why each skill performs as measured
- specific file-level prescriptions (SKILL.md edits, new scenarios,
  references to keep/delete)

Three skills with clear keep verdicts (gym-run, add-benchmark, gym-config).
Two with clear structural issues: gym-review (SKILL.md redundant with its
references, shrink), gym-profile (SKILL.md competes with its references,
rewrite to narrate). One with content gap (gym-scaffold-agent missing
non-RL agent patterns). Two blocked on scenario difficulty (gym-data
fully ceiling-clipped, gym-debug partially).

Cross-cutting patterns section lists: where to shrink SKILL.md, where to
rewrite to narrate-to-refs, where to rewrite evals.json for harder
scenarios, where there are real content gaps.

Recommendations are explicitly gated on: (a) n=20 calibration to confirm
effect sizes, (b) one-edit-at-a-time discipline, (c) provenance diff
confirming the right field moved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Predicted prescription from the skill-by-skill audit: the v7 Δreward on
the realistic contrast was -0.107, driven by sc2 and sc3 where assertions
test conceptual noun recall (name `pass_threshold`, name `extracted_model_code`)
and the how-to-shaped SKILL.md never reaches those nouns in prose.

Two targeted edits:

1. Promote `pass_threshold` from a command flag into a named concept
   subsection under Step 2. The concept is now explained (how it changes
   pass@k, when to raise/lower) rather than only appearing as `+pass_threshold=1.0`
   on a command line.
2. Rewrite the "Suspicious patterns" table so every row names both the
   trigger (what you observe) and the confirming field (what you read from
   the rollout JSONL to verify the diagnosis). Adds a "Rule" note that
   tells the model to cite the specific field by identifier in its
   diagnosis.

Expected v8 effect: `skill | refs=T` moves from -0.107 toward zero or
positive. sc2 and sc3 should stop losing their key assertions to the
"cited pattern but not the confirming field" failure mode.

Single-skill change; diff against v7 should flag `md` for gym-profile
and same-all for every other skill.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Predicted prescription from the skill-by-skill audit: v7 `skill | refs=T`
was +0.029, `refs | skill=F` was +0.602 — the references + scripts/review.py
carry the entire load and the 110-line SKILL.md was dead weight.

Kept in the new SKILL.md:
- How to invoke scripts/review.py (one paragraph).
- What BLOCK vs WARN severity means.
- An affirmative framing: if the script is quiet, the code is clean —
  approve the review explicitly rather than manufacturing concerns.
- The five judgment checks the pattern matcher can't do.
- Cross-references to references/anti-patterns.md and fix-patterns.md.

Removed:
- Full BLOCK/WARN tables (they duplicate references/anti-patterns.md and
  are echoed in the script's own per-finding output).
- Review report template (trivial, doesn't need a skill to teach).
- The "Apply judgment" numbered list's per-item detail — kept as bullets,
  context lives in the references.

Expected v8 effect: `skill | refs=T` holds near zero (it's already there);
Δtools holds near -4.87 (the skill's main contribution was teaching the
model to trust the script). If Δtools drops materially, the shrink was
too aggressive.

Single-skill change; diff against v7 should flag `md` for gym-review
and same-all for every other skill.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
All three gym-data scenarios in v7 were ceiling-clipped (docs=0.96-1.00,
skill-only=1.00, skill+docs=0.96-1.00). The skill's content was impossible
to evaluate because the scenarios were trivially solvable from first
principles by a frontier model. This replaces the scenarios with harder
versions that require real judgment:

- sc1 (schema audit): a 5-line tool-calling dataset with 4 intentionally
  planted schema bugs (required field not in properties, parallel_tool_calls
  / expected_tool_calls mismatch, missing `function` wrapping, and a
  parallel_tool_calls / single-tool inconsistency). Model must read each
  entry and identify the exact violation; 1 entry is clean.
- sc2 (semantic mislabeling): a 7-line math/trivia dataset where 3
  `expected_answer` values are factually wrong (capital of Australia =
  Canberra not Sydney; leap year = 366 not 365; gold symbol = Au not Go).
  Schema is fine; the model has to apply domain judgment to detect the
  mislabels.
- sc3 (schema extension): a complex multi-turn branching customer-support
  benchmark schema with expected_tool_sequence, forbidden_sequence, and
  partial_credit. Model must generate 3 new entries that follow the
  schema AND exercise specified branching scenarios (email-only refund,
  abuse attempt, ambiguous-amount refund).

Expected v8 effect: `docs-only` drops below 0.95 (likely much lower on
sc1 and sc3), opening measurable headroom for the skill. `skill | refs=T`
becomes interpretable instead of structurally-zero.

Old fixture files (sample_tool_calling.jsonl, sample_bad_data.jsonl,
sample_sql_benchmark.jsonl, sample_judge_benchmark.jsonl) remain in
evals/files/ unreferenced — kept for future scenario work.

Single-skill change; diff against v7 should flag `evals+fx` for gym-data
and same-all for every other skill.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
v8 measured the three predicted skill-content prescriptions from the v7
audit. Two outright hits, one caveated hit — all three landed with clean
provenance attribution and single-skill effects.

findings.md:
- Lead with a v8 summary table (edit / predicted direction / observed / verdict).
- Replace the v7 headline table with v8 numbers; keep v7 archived for comparison.
- Update the Δtools table to show v7 → v8 side-by-side (pattern held).
- Rewrite per-skill reads to include v8 outcome blocks where edits were made.
- Expand the retraction log and update "what we can claim" with the v8 evidence.

skill-eval-skill-review.md:
- Title + header updated to "v7 prescriptions, v8 outcomes".
- Each edited skill (gym-profile, gym-review, gym-data) gets a v8 outcome
  block citing the commit SHA, the realistic-contrast movement, and the
  provenance diff.
- Cross-cutting patterns section annotated with which prescriptions
  landed and which are outstanding.
- New "v7 → v8 results on the three prescriptions" summary table.
- Prerequisites updated with the gym-review shrink lesson (measure
  standalone contrast too).

Headline:
- gym-profile skill|refs=T: −0.107 → +0.040 (change +0.147, md diff, cleanly above noise)
- gym-data skill|refs=T:     +0.013 → +0.093 (evals+fx diff, now measurable)
- gym-review skill|refs=T:   +0.029 → +0.048 (realistic held); skill|refs=F: +0.298 → +0.012 (standalone collapsed)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
The v8 outcome blocks narrated what changed but didn't show it. Making
the edits concrete so readers can see the pattern — this is the whole
point of the prescription framework; abstracted "rewrite patterns table"
is not as useful as showing an actual row go from 2 columns to 3.

Per-skill additions:

- gym-profile: before/after of one patterns-table row (think-block stripping),
  showing the diagnostic chain pattern → cause → confirming field now
  completing in prose. Notes the parallel pass_threshold change from
  command flag to named concept subsection.
- gym-review: kept-vs-dropped lists. What stayed in SKILL.md (invocation,
  severity, 5 judgment bullets, cross-refs). What moved out (full BLOCK
  / WARN tables, report template, verbose judgment list). Explains why
  the shrink works for realistic-deployment (refs carry the tables) and
  why it regressed skill-only (no refs → no tables).
- gym-data: sc1 / sc2 / sc3 prompts shown before vs after. In every
  case, "before" is pattern-matchable from fixture format; "after"
  requires reading contents and applying judgment. Explicit framing:
  scenarios moved from validating format to testing understanding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
SHAs updated for the three skills edited in 3f3a330 (gym-profile md),
8fdcdb2 (gym-review md), and 5e84dd0 (gym-data evals+fx). This is the
input JSONL v8 was run against.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
…line

Four failing CI jobs, fixed:

Test (coverage 92.5% < 96.0% threshold) — core `pytest tests/unit_tests/`
was measuring `resources_servers/skill_workspace/app.py` and
`resources_servers/skill_judge/app.py` because new test files import
SkillWorkspaceResourcesServer as a fixture. Per-server modules are tested
via `ng_test` (dedicated venv per server), not by the core suite. Added
`resources_servers/*`, `responses_api_agents/*`, and
`responses_api_models/*` to the coverage `omit` list. Also added CLI tests
for `scripts/build_skill_eval_jsonl.py` to bring its coverage from 81% to
98%. Total coverage now 96.41%.

Lint check — ruff format pass on 7 files (pure formatting, no behavior
changes). 5 ruff --fix items applied (mostly removed-blank-line nits).

Lint check (README row) — main's `update_env_list.py` adds a "Skill Eval
Agent" row when it sees `responses_api_agents/skill_eval_agent/`. Added
the row manually so the merge-commit hook reports clean.

secrets-detector — `Hex High Entropy String` warnings on our 12-char
content-hash provenance fields (`skill_md_sha`, etc.) in
`responses_api_agents/skill_eval_agent/data/example.jsonl` and the
agent README. Extended `.secrets.baseline` `should_exclude_file` regex
to cover those paths plus `notes/skill-eval/*.md` (next commit moves
internal docs there).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
…x tree

The skill-eval material is internal stakeholder validation, not
user-facing docs. The repo is migrating user docs to Fern (fern/ is in
main); the Sphinx tree under docs/ is being deprecated. This material
shouldn't live in either system.

Move out of docs/environment-tutorials/ (Sphinx) into notes/skill-eval/.
Two consolidated files instead of six:

- notes/skill-eval/harness.md — build and operate the harness. Combines
  skill-eval-harness.md (build) and skill-eval-scoreboard.md (run +
  interpret), stripped of Sphinx tutorial framing. Plain markdown.
- notes/skill-eval/results.md — what we measured and what to take from
  it. Replaces checkpoint.md, infra-v1.md, findings.md, and
  skill-review.md. Sections: TL;DR, v8 scoreboard, per-skill audit with
  before/after edits and v8 outcomes, NeMo Gym sharp edges, upstream
  candidates, methodology learnings, retraction log.

Deleted:
- docs/environment-tutorials/skill-eval-{harness,scoreboard,checkpoint,
  infra-v1,findings,skill-review}.md
- The two skill-eval grid-item-cards in
  docs/environment-tutorials/index.md

The grid-card removal also fixes the docs-build CI failure (the six
files were warning about not being in any toctree). With them gone from
the Sphinx tree, the build is clean — verified locally with sphinx-build
--fail-on-warning.

Stripped Sphinx-isms in the new notes: dropped (label)= ref targets,
{ref}/{doc} cross-refs, :::{note}/:::{tip}/:::{button-ref} admonitions,
:orphan: headers, grid-card directives, math $$...$$ blocks. Plain
markdown that renders identically in any viewer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
lbliii added a commit that referenced this pull request Apr 30, 2026
Three additions sourced from PR #1062 (skill-eval harness dogfooding).

- Quickstart step 3: replace "Four Uvicorn lines print" with
  ng_status verification. Lawrence's flow uses ng_status; it's faster
  than scanning logs and produces an unambiguous "N healthy" signal.
- Quickstart step 3: add a Tip for returning users — append
  +skip_venv_if_present=true to ng_run to skip venv re-creation.
  Real flag in cli_setup_command.py:116-117.
- Configuration troubleshooting: new accordion for the trailing-slash
  gotcha (a "/" at the end of policy_base_url produces double-slash
  request paths some providers 404 on). From PR #1062 sharp edge #2.

Verified ng_run blocks (cli.py:410 calls rh.run_forever); two-terminal
pattern in Quickstart is correct, no change needed there.

Signed-off-by: Lawrence Lane <llane@nvidia.com>