…, config, data, scaffolding, and execution

Add comprehensive agent skills covering the full NeMo Gym workflow:

- gym-review: Anti-pattern detection with deterministic Python scanner
- gym-debug: Runtime failure diagnosis and log analysis
- gym-profile: Rollout result analysis and reward profiling
- gym-config: Hydra YAML configuration composition and validation
- gym-data: Dataset preparation, validation, and HuggingFace registry
- gym-scaffold-agent: Custom agent server creation patterns
- gym-run: End-to-end benchmark execution workflow

Each skill includes SKILL.md with step-by-step instructions, self-contained references, eval fixtures, and evals.json for quality measurement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Ships a three-server NeMo Gym environment that grades agent-skill content (.claude/skills/*/SKILL.md) via paired with-skill vs without-skill rollouts:

- resources_servers/skill_workspace: per-session tmpdir with run_bash/read_file tools. Sandbox env strips host PATH to prevent rollouts seeing ng_* binaries, provides a python -> python3 shim for prompts assuming `python`.
- resources_servers/skill_judge: LLM-as-judge returning per-assertion binary grades with evidence; reward = fraction satisfied.
- responses_api_agents/skill_eval_agent: orchestrates seed -> tool loop -> verify -> close. Reads verifier_metadata.with_skill to decide whether to prepend SKILL.md as a system message.

Tooling:

- scripts/build_skill_eval_jsonl.py emits paired records with 5 content hashes in verifier_metadata (skill_md_sha, evals_sha, fixtures_sha, judge_prompt_sha, harness_version) so downstream diffs are attribution-safe.
- scripts/diff_skill_scoreboards.py renders per-skill w/wo/delta with a same-all / partial(N/5) / legacy provenance tag.
- scripts/eval_skills.py is a standalone runner for ad-hoc evaluation.
- scripts/build_shape_probe.py drives one-off content A/B tests.

Documentation:

- Two-part tutorial under docs/environment-tutorials (harness + scoreboard) covering methodology: with-vs-without deltas, provenance, noise floor calibration, per-scenario breakdown, and an iteration loop.
- A dogfood checkpoint summarizing findings, gotchas, and upstream candidates for the NeMo Gym team.

Skill-content fixes found during iteration:

- gym-config: rewrote as a "read-before-answer" checklist (flipped from -0.112 to +0.033 on the first iteration).
- gym-run/evals: loosened over-literal assertions that punished valid paraphrases.
- gym-scaffold-agent/evals: tightened scenario-3 assertion 5 so the judge can't credit critical reviews of clean code.

One framework nit: responses_api_models/openai_model normalizes object="chat.completion" to "response" because some providers return the former on /v1/responses, which otherwise fails NeMoGymResponse validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
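For context on what "attribution-safe" hashing looks like, here is a minimal sketch of how such content hashes could be stamped into verifier_metadata. The field names and the 12-char truncation mirror this commit and the later CI notes; the file paths and the helper itself are illustrative assumptions, not the builder's actual code.

```python
# Illustrative sketch only; scripts/build_skill_eval_jsonl.py may compute these differently.
import hashlib
from pathlib import Path

def content_hash(path: Path) -> str:
    """Short, stable digest of a file's bytes (12 hex chars, matching the *_sha field width)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:12]

def provenance(skill_dir: Path, judge_prompt: Path, harness_version: str) -> dict:
    # Paths below are assumptions about skill layout, shown only to make the idea concrete.
    return {
        "skill_md_sha": content_hash(skill_dir / "SKILL.md"),
        "evals_sha": content_hash(skill_dir / "evals.json"),
        # A directory digest would combine per-file hashes; a single fixture file shown for brevity.
        "fixtures_sha": content_hash(skill_dir / "evals" / "files" / "fixture.jsonl"),
        "judge_prompt_sha": content_hash(judge_prompt),
        "harness_version": harness_version,
    }
```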
Previously the harness conflated "skill in system prompt" with "skill's supporting artifacts on disk" under a single with_skill flag. Because seed_session unconditionally copied references/ and scripts/ into the workspace, the control arm wasn't actually a control — we measured 100% of without_skill rollouts on gym-profile reading references/metrics-guide.md (which contains the exact nouns the assertions test for).

This commit splits the factor:

- skill_workspace: SkillWorkspaceSeedSessionRequest now has independent with_references / with_scripts flags. seed_session gates the copytree for references/ and scripts/ on those. SKILL.md remains never-seeded.
- skill_eval_agent: reads with_references / with_scripts from verifier_metadata and forwards to seed_session. Back-compat: when those fields are absent, defaults to the with_skill value.
- build_skill_eval_jsonl: emits the full 2×2 by default over (with_skill, with_references) — four cells per scenario labeled blind / docs-only / skill-only / skill+docs. --cells flag restricts the subset emitted.

Tests added for each cell combination in the workspace, and for cell-label emission and subsetting in the builder.

What the cells mean for interpretation:

- blind: model priors only; no prompt, no disk
- docs-only: realistic reader without the skill pack installed
- skill-only: prompt-only; diagnoses how load-bearing on-disk artifacts are
- skill+docs: realistic reader with the skill pack (what previous "with" was)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
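A sketch of the 2×2 expansion the builder performs per scenario. The cell labels and flag names come from this commit; the record shape around them (and the `cell` key name) are assumptions for illustration only.

```python
# Illustrative: expand one scenario into the four (with_skill, with_references) cells.
CELL_LABELS = {
    (False, False): "blind",
    (False, True): "docs-only",
    (True, False): "skill-only",
    (True, True): "skill+docs",
}

def expand_cells(scenario: dict, provenance: dict, cells=None) -> list[dict]:
    """`cells` mimics the --cells idea: restrict which of the four labels are emitted (None = all)."""
    records = []
    for (with_skill, with_references), label in CELL_LABELS.items():
        if cells is not None and label not in cells:
            continue
        records.append({
            **scenario,
            "verifier_metadata": {
                **provenance,
                "with_skill": with_skill,
                "with_references": with_references,
                "cell": label,  # key name is an assumption, not the builder's documented field
            },
        })
    return records
```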
Diff tool now auto-detects whether a JSONL is the legacy two-arm shape
(pre-Phase-1) or the 4-cell 2×2 shape, and renders accordingly. Both modes
now report three axes:
- Δreward — accuracy (per-assertion satisfaction rate)
- Δtools — number of tool calls made during the rollout
- Δtokens — final-response output tokens (per-turn aggregation is Phase 2)
For 2×2 JSONLs, each skill shows four cell-level rows plus four named
marginal effects:
- skill | refs=T: value of the SKILL.md prompt given references on disk (= realistic-deployment marginal effect)
- skill | refs=F: value of the SKILL.md prompt without references (= skill-as-standalone-doc)
- refs | skill=T: value of references given the skill is already in prompt
- refs | skill=F: value of references as the reader's only scaffold
Receipt for why this matters: v6's legacy two-arm read showed gym-run at
Δreward=+0.487 and gym-data at Δreward=-0.013, but both skills cut tool
calls by 2.4–3.3 per rollout. The reward-only view was hiding half the
signal; the new multi-axis view surfaces it automatically.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Addresses PM review feedback: the original checkpoint conflated what we built with what we learned about specific skills. The infra work is defensible; the per-skill claims rest on measurements with known contamination (SKILL.md on disk in the control arm, references/*.md on disk in both arms). Separating the two makes the artifact sharable without carrying forward claims we can't yet defend.

- skill-eval-infra-v1.md: infra artifact + sharp-edges list + two upstream candidates with existing receipts. No per-skill claims.
- skill-eval-findings.md: gated on v7 (post-contamination-fix) rollout. Pre-fix numbers archived but marked as not-defensible. Retracted claims listed explicitly.
- Original checkpoint retained with a "superseded by" banner pointing to the split.

Also retracts two prior claims that don't survive the noise-floor check: "every skill reduces tool calls 0.8–4.8" (measurement against contaminated control) and "gym-profile is actively misleading" (aggregate delta inside per-cell noise; per-scenario sc3 claim preserved but scoped).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
First clean measurement since the references-contamination fix. 480 rollouts, 4 cells × 8 skills × 3 scenarios × n=5.

Key results on the realistic-deployment contrast (skill | refs=T):

- gym-run +0.436 (the only pre-fix claim that fully survives; has no references/ dir so its control was never contaminated)
- add-benchmark +0.141, gym-config +0.111, gym-debug +0.080 (survive but smaller than the contaminated numbers)
- gym-profile -0.107 (refined read: skill as standalone is +0.278, competes with its own references when both present)
- gym-review +0.029 (mostly redundant with references — refs|skill=F is +0.602)
- gym-scaffold-agent -0.040, gym-data +0.013 (both inside noise)

Efficiency story holds and is the strongest multi-skill pattern: every skill reduces tool calls on the realistic contrast (-0.73 to -4.87 calls per rollout). The earlier checkpoint's "reward-only is lossy" claim survives on the correct contrast.

Retracts three prior claims:

- "every skill reduces tool calls 0.8-4.8" replaced by correct-contrast magnitudes above
- "gym-profile is actively misleading" replaced by "competes with its own references; useful standalone"
- shape-probe null result withdrawn pending clean rerun

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Sweeps the long-form docs to match the current state of the harness:

- skill-eval-harness.md (Part 1): /seed_session now documents with_references / with_scripts flags; adds a "Gotcha: workspace contamination" section explaining why SKILL.md is never seeded and why references gating matters; orchestrator /run example updated to forward both flags from verifier_metadata.
- skill-eval-scoreboard.md (Part 2): methodology section rewritten to introduce the 2×2 cells and the four named marginal effects as the primary frame; Step 2 JSONL example updated to show the new provenance fields and cell labels; Step 4 replaces the old legacy-contaminated scoreboard with the v7 multi-axis table and interpretive heuristics for each row pattern; Step 7 retires the v1→v2 and v3→v4 narratives in favor of a v6→v7 contamination-fix worked example plus an honest judge-drift noise-floor note; "What's next" replaced with concrete follow-ons and cross-links to findings.md and infra-v1.md.
- skill_workspace/README: documents with_references / with_scripts on /seed_session; explains the "SKILL.md never seeded" invariant.
- skill_eval_agent/README: schemas the 2×2 cell in the input JSONL example; documents the five provenance fields; updates the /run flow.
- index.md: updated card blurbs to reflect 2×2 methodology.

Scope: documentation only. No code or test changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Per-skill verdicts from v7 (4-cell 2×2, n=5), grounded in:

- the four Δreward effects (skill|refs=T, skill|refs=F, refs|skill=F, Δtools|refs=T)
- per-scenario cell means
- a root-cause hypothesis for why each skill performs as measured
- specific file-level prescriptions (SKILL.md edits, new scenarios, references to keep/delete)

Three skills with clear keep verdicts (gym-run, add-benchmark, gym-config). Two with clear structural issues: gym-review (SKILL.md redundant with its references, shrink), gym-profile (SKILL.md competes with its references, rewrite to narrate). One with content gap (gym-scaffold-agent missing non-RL agent patterns). Two blocked on scenario difficulty (gym-data fully ceiling-clipped, gym-debug partially).

Cross-cutting patterns section lists: where to shrink SKILL.md, where to rewrite to narrate-to-refs, where to rewrite evals.json for harder scenarios, where there are real content gaps.

Recommendations are explicitly gated on: (a) n=20 calibration to confirm effect sizes, (b) one-edit-at-a-time discipline, (c) provenance diff confirming the right field moved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Predicted prescription from the skill-by-skill audit: the v7 Δreward on the realistic contrast was -0.107, driven by sc2 and sc3 where assertions test conceptual noun recall (name `pass_threshold`, name `extracted_model_code`) and the how-to-shaped SKILL.md never reaches those nouns in prose.

Two targeted edits:

1. Promote `pass_threshold` from a command flag into a named concept subsection under Step 2. The concept is now explained (how it changes pass@k, when to raise/lower) rather than only appearing as `+pass_threshold=1.0` on a command line.
2. Rewrite the "Suspicious patterns" table so every row names both the trigger (what you observe) and the confirming field (what you read from the rollout JSONL to verify the diagnosis). Adds a "Rule" note that tells the model to cite the specific field by identifier in its diagnosis.

Expected v8 effect: `skill | refs=T` moves from -0.107 toward zero or positive. sc2 and sc3 should stop losing their key assertions to the "cited pattern but not the confirming field" failure mode.

Single-skill change; diff against v7 should flag `md` for gym-profile and same-all for every other skill.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Predicted prescription from the skill-by-skill audit: v7 `skill | refs=T` was +0.029, `refs | skill=F` was +0.602 — the references + scripts/review.py carry the entire load and the 110-line SKILL.md was dead weight.

Kept in the new SKILL.md:

- How to invoke scripts/review.py (one paragraph).
- What BLOCK vs WARN severity means.
- An affirmative framing: if the script is quiet, the code is clean — approve the review explicitly rather than manufacturing concerns.
- The five judgment checks the pattern matcher can't do.
- Cross-references to references/anti-patterns.md and fix-patterns.md.

Removed:

- Full BLOCK/WARN tables (they duplicate references/anti-patterns.md and are echoed in the script's own per-finding output).
- Review report template (trivial, doesn't need a skill to teach).
- The "Apply judgment" numbered list's per-item detail — kept as bullets, context lives in the references.

Expected v8 effect: `skill | refs=T` holds near zero (it's already there); Δtools holds near -4.87 (the skill's main contribution was teaching the model to trust the script). If Δtools drops materially, the shrink was too aggressive.

Single-skill change; diff against v7 should flag `md` for gym-review and same-all for every other skill.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
All three gym-data scenarios in v7 were ceiling-clipped (docs=0.96-1.00, skill-only=1.00, skill+docs=0.96-1.00). The skill's content was impossible to evaluate because the scenarios were trivially solvable from first principles by a frontier model.

This replaces the scenarios with harder versions that require real judgment:

- sc1 (schema audit): a 5-line tool-calling dataset with 4 intentionally planted schema bugs (required field not in properties, parallel_tool_calls / expected_tool_calls mismatch, missing `function` wrapping, and a parallel_tool_calls / single-tool inconsistency). Model must read each entry and identify the exact violation; 1 entry is clean.
- sc2 (semantic mislabeling): a 7-line math/trivia dataset where 3 `expected_answer` values are factually wrong (capital of Australia = Canberra not Sydney; leap year = 366 not 365; gold symbol = Au not Go). Schema is fine; the model has to apply domain judgment to detect the mislabels.
- sc3 (schema extension): a complex multi-turn branching customer-support benchmark schema with expected_tool_sequence, forbidden_sequence, and partial_credit. Model must generate 3 new entries that follow the schema AND exercise specified branching scenarios (email-only refund, abuse attempt, ambiguous-amount refund).

Expected v8 effect: `docs-only` drops below 0.95 (likely much lower on sc1 and sc3), opening measurable headroom for the skill. `skill | refs=T` becomes interpretable instead of structurally-zero.

Old fixture files (sample_tool_calling.jsonl, sample_bad_data.jsonl, sample_sql_benchmark.jsonl, sample_judge_benchmark.jsonl) remain in evals/files/ unreferenced — kept for future scenario work.

Single-skill change; diff against v7 should flag `evals+fx` for gym-data and same-all for every other skill.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
v8 measured the three predicted skill-content prescriptions from the v7 audit. Two outright hits, one caveated hit — all three landed with clean provenance attribution and single-skill effects.

findings.md:

- Lead with a v8 summary table (edit / predicted direction / observed / verdict).
- Replace the v7 headline table with v8 numbers; keep v7 archived for comparison.
- Update the Δtools table to show v7 → v8 side-by-side (pattern held).
- Rewrite per-skill reads to include v8 outcome blocks where edits were made.
- Expand the retraction log and update "what we can claim" with the v8 evidence.

skill-eval-skill-review.md:

- Title + header updated to "v7 prescriptions, v8 outcomes".
- Each edited skill (gym-profile, gym-review, gym-data) gets a v8 outcome block citing the commit SHA, the realistic-contrast movement, and the provenance diff.
- Cross-cutting patterns section annotated with which prescriptions landed and which are outstanding.
- New "v7 → v8 results on the three prescriptions" summary table.
- Prerequisites updated with the gym-review shrink lesson (measure standalone contrast too).

Headline:

- gym-profile skill|refs=T: −0.107 → +0.040 (change +0.147, md diff, cleanly above noise)
- gym-data skill|refs=T: +0.013 → +0.093 (evals+fx diff, now measurable)
- gym-review skill|refs=T: +0.029 → +0.048 (realistic held); skill|refs=F: +0.298 → +0.012 (standalone collapsed)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
The v8 outcome blocks narrated what changed but didn't show it. Making the edits concrete so readers can see the pattern — this is the whole point of the prescription framework; an abstracted "rewrite patterns table" is not as useful as showing an actual row go from 2 columns to 3.

Per-skill additions:

- gym-profile: before/after of one patterns-table row (think-block stripping), showing the diagnostic chain pattern → cause → confirming field now completing in prose. Notes the parallel pass_threshold change from command flag to named concept subsection.
- gym-review: kept-vs-dropped lists. What stayed in SKILL.md (invocation, severity, 5 judgment bullets, cross-refs). What moved out (full BLOCK / WARN tables, report template, verbose judgment list). Explains why the shrink works for realistic-deployment (refs carry the tables) and why it regressed skill-only (no refs → no tables).
- gym-data: sc1 / sc2 / sc3 prompts shown before vs after. In every case, "before" is pattern-matchable from fixture format; "after" requires reading contents and applying judgment. Explicit framing: scenarios moved from validating format to testing understanding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
…line

Four failing CI jobs, fixed:

- Test (coverage 92.5% < 96.0% threshold) — core `pytest tests/unit_tests/` was measuring `resources_servers/skill_workspace/app.py` and `resources_servers/skill_judge/app.py` because new test files import SkillWorkspaceResourcesServer as a fixture. Per-server modules are tested via `ng_test` (dedicated venv per server), not by the core suite. Added `resources_servers/*`, `responses_api_agents/*`, and `responses_api_models/*` to the coverage `omit` list. Also added CLI tests for `scripts/build_skill_eval_jsonl.py` to bring its coverage from 81% to 98%. Total coverage now 96.41%.
- Lint check — ruff format pass on 7 files (pure formatting, no behavior changes). 5 ruff --fix items applied (mostly removed-blank-line nits).
- Lint check (README row) — main's `update_env_list.py` adds a "Skill Eval Agent" row when it sees `responses_api_agents/skill_eval_agent/`. Added the row manually so the merge-commit hook reports clean.
- secrets-detector — `Hex High Entropy String` warnings on our 12-char content-hash provenance fields (`skill_md_sha`, etc.) in `responses_api_agents/skill_eval_agent/data/example.jsonl` and the agent README. Extended `.secrets.baseline` `should_exclude_file` regex to cover those paths plus `notes/skill-eval/*.md` (next commit moves internal docs there).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
…x tree
The skill-eval material is internal stakeholder validation, not
user-facing docs. The repo is migrating user docs to Fern (fern/ is in
main); the Sphinx tree under docs/ is being deprecated. This material
shouldn't live in either system.
Move out of docs/environment-tutorials/ (Sphinx) into notes/skill-eval/.
Two consolidated files instead of six:
- notes/skill-eval/harness.md — build and operate the harness. Combines
skill-eval-harness.md (build) and skill-eval-scoreboard.md (run +
interpret), stripped of Sphinx tutorial framing. Plain markdown.
- notes/skill-eval/results.md — what we measured and what to take from
it. Replaces checkpoint.md, infra-v1.md, findings.md, and
skill-review.md. Sections: TL;DR, v8 scoreboard, per-skill audit with
before/after edits and v8 outcomes, NeMo Gym sharp edges, upstream
candidates, methodology learnings, retraction log.
Deleted:
- docs/environment-tutorials/skill-eval-{harness,scoreboard,checkpoint,
infra-v1,findings,skill-review}.md
- The two skill-eval grid-item-cards in
docs/environment-tutorials/index.md
The grid-card removal also fixes the docs-build CI failure (the six
files were warning about not being in any toctree). With them gone from
the Sphinx tree, the build is clean — verified locally with sphinx-build
--fail-on-warning.
Stripped Sphinx-isms in the new notes: dropped (label)= ref targets,
{ref}/{doc} cross-refs, :::{note}/:::{tip}/:::{button-ref} admonitions,
:orphan: headers, grid-card directives, math $$...$$ blocks. Plain
markdown that renders identically in any viewer.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
lbliii added a commit that referenced this pull request on Apr 30, 2026:
Three additions sourced from PR #1062 (skill-eval harness dogfooding).

- Quickstart step 3: replace "Four Uvicorn lines print" with ng_status verification. Lawrence's flow uses ng_status; it's faster than scanning logs and produces an unambiguous "N healthy" signal.
- Quickstart step 3: add a Tip for returning users — append +skip_venv_if_present=true to ng_run to skip venv re-creation. Real flag in cli_setup_command.py:116-117.
- Configuration troubleshooting: new accordion for the trailing-slash gotcha (a "/" at the end of policy_base_url produces double-slash request paths some providers 404 on). From PR #1062 sharp edge #2.

Verified ng_run blocks (cli.py:410 calls rh.run_forever); two-terminal pattern in Quickstart is correct, no change needed there.

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Why this exists
Two questions, two days:
- Does `.claude/skills/*/SKILL.md` actually help an agent do its job? Answer: yes, with a paired-rollout 2×2 over (skill-in-prompt × references-on-disk) — and we found one skill that competes with its own references and three skills whose apparent value was mostly the references' value, not the skill's.
- For internal validators: clone, bring it up (5 commands), run a 480-rollout scoreboard (~30 min), and read the results doc. You'll have a working harness and a decision-shaped artifact at the end.
What's here
Three NeMo Gym servers
- `resources_servers/skill_workspace/` — per-session tmpdir with `run_bash` + `read_file` tools. Seeds the skill's `scripts/` and `references/` based on per-request flags. Sandbox env strips host PATH; workspace-local `python → python3` symlink covers macOS rollouts. `SKILL.md` is never seeded (see contamination note in `results.md`).
- `resources_servers/skill_judge/` — LLM-as-judge. One call grades `(response, tool_calls, assertions[]) → grades[]` with per-assertion binary verdicts plus evidence strings. Reward = fraction satisfied.
- `responses_api_agents/skill_eval_agent/` — orchestrator. Seeds workspace → model↔tool loop → forwards transcript to judge → closes workspace (in `finally`); the flow is sketched below.
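For orientation, the shape of that orchestration in a few lines. `seed` / `loop` / `judge` / `close` stand in for the real server calls, and the `assertions` key is an assumption; this is a sketch, not the agent's actual code.

```python
# Sketch of the skill_eval_agent /run flow; not the real implementation.
def run_rollout(record, seed, loop, judge, close):
    meta = record["verifier_metadata"]
    # Back-compat: with_references / with_scripts default to the with_skill value when absent.
    session = seed(
        with_references=meta.get("with_references", meta["with_skill"]),
        with_scripts=meta.get("with_scripts", meta["with_skill"]),
    )
    try:
        transcript = loop(record, session)                 # model <-> run_bash / read_file tool loop
        grades = judge(transcript, record["assertions"])   # one binary verdict per assertion
        return sum(grades) / len(grades)                   # reward = fraction satisfied
    finally:
        close(session)                                     # close in finally, or tmpdirs leak
```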
The 2×2 control structure

Two independent flags in `verifier_metadata` produce four cells per scenario:

|  | `with_references=False` | `with_references=True` |
| --- | --- | --- |
| `with_skill=False` | blind — model priors only | docs-only — realistic reader without the skill pack |
| `with_skill=True` | skill-only — SKILL.md in prompt, nothing on disk | skill+docs — realistic reader with the skill pack |

The diff tool reports four named marginal effects on three axes (Δreward, Δtools, Δtokens); a small sketch of the arithmetic follows the list:

- `skill | refs=T` = `skill+docs − docs-only` — realistic-deployment value of the skill overlay (the number that matters for shipping).
- `skill | refs=F` = `skill-only − blind` — skill as a standalone doc.
- `refs | skill=T` = `skill+docs − skill-only` — do refs still matter when the skill is prompted?
- `refs | skill=F` = `docs-only − blind` — marginal value of references alone.
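The function and the dict-of-cell-means input below are illustrative, not the diff tool's real interface; the subtractions themselves are exactly the definitions above.

```python
# Per-skill, per-axis cell means in; the four named marginal effects out. Illustrative only.
def marginal_effects(cells: dict[str, float]) -> dict[str, float]:
    """cells maps 'blind' / 'docs-only' / 'skill-only' / 'skill+docs' to a mean reward, tool count, or token count."""
    return {
        "skill | refs=T": cells["skill+docs"] - cells["docs-only"],   # realistic-deployment value of the skill
        "skill | refs=F": cells["skill-only"] - cells["blind"],       # skill as a standalone doc
        "refs | skill=T": cells["skill+docs"] - cells["skill-only"],  # do refs still matter once prompted?
        "refs | skill=F": cells["docs-only"] - cells["blind"],        # value of references alone
    }

# Usage with made-up numbers:
# marginal_effects({"blind": 0.40, "docs-only": 0.55, "skill-only": 0.50, "skill+docs": 0.62})
# -> skill|refs=T ≈ 0.07, skill|refs=F ≈ 0.10, refs|skill=T ≈ 0.12, refs|skill=F ≈ 0.15
```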
Five-field provenance

Every record carries `skill_md_sha`, `evals_sha`, `fixtures_sha`, `judge_prompt_sha`, `harness_version` in `verifier_metadata`. The diff tool tags each delta-of-delta with what changed: `md`, `evals`, `md+evals`, `harness`, or `same-all` (no input changed → it's noise/judge drift, not a skill effect).
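Roughly what that tagging amounts to, as a simplified sketch; the real `diff_skill_scoreboards.py` logic and its handling of fixture and judge-prompt changes may differ (elsewhere in this PR it also emits tags like `evals+fx`).

```python
# Simplified sketch of provenance attribution; tag spellings beyond the five listed above are assumptions.
PROVENANCE_FIELDS = ("skill_md_sha", "evals_sha", "fixtures_sha", "judge_prompt_sha", "harness_version")

def attribute_delta(old_meta: dict, new_meta: dict) -> str:
    changed = {f for f in PROVENANCE_FIELDS if old_meta.get(f) != new_meta.get(f)}
    if not changed:
        return "same-all"          # no input moved: the delta is noise / judge drift
    if "harness_version" in changed or "judge_prompt_sha" in changed:
        return "harness"           # harness or judge changed: the skill comparison isn't clean
    parts = []
    if "skill_md_sha" in changed:
        parts.append("md")
    if "evals_sha" in changed or "fixtures_sha" in changed:
        parts.append("evals")
    return "+".join(parts)         # "md", "evals", or "md+evals"
```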
Tooling

- `scripts/build_skill_eval_jsonl.py` — emits 4-cell 2×2 with full provenance.
- `scripts/diff_skill_scoreboards.py` — auto-detects 2×2 vs legacy; renders multi-axis deltas with provenance attribution.

Validate locally (5 commands, ~30 min)

Pace observed: ~30 min wall-clock for 480 rollouts at 6-way parallel on the NVIDIA inference API. Lighter validation: pass `--cells=blind,skill+docs` to the build script (only 2 cells × 8 skills × 3 scenarios × n=5 = 240 rollouts, ~15 min).

Headline results (v8, n=5, 480 rollouts)
Realistic-deployment column only. Bold = effect outside ~±0.10 noise floor. Full table + per-skill verdicts in `notes/skill-eval/results.md`.

(Per-skill `skill | refs=T` Δreward table: gym-run, add-benchmark, gym-debug, gym-data, gym-scaffold-agent, gym-review, gym-profile, gym-config.)

Three predicted prescriptions, all landed with clean per-skill provenance attribution:

- gym-profile patterns-table rewrite → `skill | refs=T` flipped −0.107 → +0.040 (Δ +0.147)
- gym-data adversarial scenarios → `docs-only` 0.97 → 0.88, opening measurable headroom
- gym-review SKILL.md shrink (110 → 53 lines) → realistic preserved, standalone collapsed (context-dependent win)

One universal pattern: every skill reduces tool calls in the realistic cell (−1.00 to −4.73 per rollout). Reward-only scoring is lossy; always read both axes.
NeMo Gym sharp edges (8 items, ~1 hour each the first time)
These are concrete, reproducible, and cheap to fix. Most useful PR output for the framework team:
- `ng_run` runs `python app.py`, not `python -m`. Relative imports break — must use absolute imports from project root.
- Trailing slashes on `policy_base_url` produce double-slash 404s on some providers.
- Some `/v1/responses` providers return `object: "chat.completion"`; `NeMoGymResponse` validation fails on the literal. Normalize in the model server.
- `FunctionToolParam` requires explicit `strict: False` or validation fails at the model server.
- `.venv/bin` leaks into subprocess PATH via inherited env; rollouts can see `ng_*` binaries, Ray sockets, HF/MLflow credentials.
- No `python` on macOS — needed a workspace-local `python → python3` symlink.
- Workspace close belongs in a `finally` block — anything else leaks tmpdirs.

Methodology lessons that likely generalize past skill eval:
What's worth absorbing upstream
Two with existing demand. Two parked.
Worth a PR:
- … (`code_gen`, `equivalence_llm_judge`). Common shape; small base class saves the next team re-inventing it.
- … `usage` reaches the output JSONL; every multi-turn agent eventually wants the sum.

Parked (no second customer yet):

- … (`pairing_key` + post-hoc join CLI).
- … (`ng_stamp_provenance` helper that hashes Hydra-config-referenced files).

Read more

- `notes/skill-eval/harness.md` — build the harness, the 2×2, provenance, iteration loop, sandbox details, scenarios how-to.
- `notes/skill-eval/results.md` — full v8 scoreboard, per-skill audit (verdicts + before/after edits + v8 outcomes), the contamination story, retraction log, open methodology questions.

Test plan

- `skill_workspace` (28), `skill_judge` (22), `skill_eval_agent` (9).
- `test_build_skill_eval_jsonl.py` (15), `test_eval_skills.py` (33).
- `ruff format --check`, `ruff check`, README env-list row, secrets-detector hook.
- `sphinx-build --fail-on-warning` succeeds with the skill-eval material moved out of the Sphinx tree.
- `md` / `evals+fx` / `same-all` per skill edited.

🤖 Generated with Claude Code