feat: Add agent skills for NeMo Gym#1062

Open
lbliii wants to merge 16 commits into main from lbliii/prague-v2
Conversation


lbliii commented Apr 13, 2026

Internal validation only. This PR exercises NeMo Gym end-to-end on a non-trivial workload (paired-rollout grading of agent skills) and ships the harness so other teams can reproduce, validate, and reuse the components. The substantive material — methodology, results, gotchas, upstream pitches — lives in notes/skill-eval/, not in docs/.

Why this exists

Two questions, two days:

  1. Can we measure whether a .claude/skills/*/SKILL.md actually helps an agent do its job? Answer: yes, with a paired-rollout 2×2 over (skill-in-prompt × references-on-disk) — and we found one skill that competes with its own references and three skills whose apparent value was mostly the references' value, not the skill's.
  2. Is NeMo Gym the right substrate for that kind of evaluation? Answer: yes for skill eval today (rollout collection + service orchestration earned their keep); the methodology ports outside NeMo Gym for generic doc eval. Concrete sharp edges and upstream candidates listed below.

For internal validators: clone, bring it up (5 commands), run a 480-rollout scoreboard (~30 min), and read the results doc. You'll have a working harness and a decision-shaped artifact at the end.

What's here

Three NeMo Gym servers

flowchart LR
  J[input JSONL] --> A[skill_eval_agent<br/>/run]
  A -->|/seed_session<br/>/run_bash<br/>/read_file<br/>/close| W[skill_workspace<br/>per-session sandbox]
  A -->|/v1/responses| M[policy_model]
  A -->|/verify| JG[skill_judge<br/>LLM-as-judge]
  JG -->|/v1/responses| M
  JG --> A
  A --> R[output JSONL<br/>reward + per-assertion grades]
  • resources_servers/skill_workspace/ — per-session tmpdir with run_bash + read_file tools. Seeds the skill's scripts/ and references/ based on per-request flags. Sandbox env strips host PATH; workspace-local python → python3 symlink covers macOS rollouts. SKILL.md is never seeded (see contamination note in results.md).
  • resources_servers/skill_judge/ — LLM-as-judge. One call grades (response, tool_calls, assertions[]) → grades[] with per-assertion binary verdicts plus evidence strings. Reward = fraction satisfied.
  • responses_api_agents/skill_eval_agent/ — orchestrator. Seeds workspace → model↔tool loop → forwards transcript to judge → closes workspace (in finally).
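A minimal sketch of that flow, assuming illustrative payload field names and a caller-supplied model/tool loop. The endpoint names (/seed_session, /verify, /close) match the diagram above; everything else is an assumption, not the agent's actual code:

```python
from typing import Callable
import httpx

def run_one_rollout(record: dict, workspace_url: str, judge_url: str,
                    model_tool_loop: Callable) -> dict:
    """Sketch of the skill_eval_agent /run shape: seed -> model/tool loop -> judge -> close."""
    meta = record["verifier_metadata"]
    client = httpx.Client(timeout=60)
    # Seed a per-session sandbox. SKILL.md is never copied in; references/ and scripts/
    # are gated on the per-request flags (absent flags fall back to with_skill).
    session = client.post(f"{workspace_url}/seed_session", json={
        "skill_name": meta["skill_name"],  # illustrative field names throughout
        "with_references": meta.get("with_references", meta["with_skill"]),
        "with_scripts": meta.get("with_scripts", meta["with_skill"]),
    }).json()
    try:
        # Model <-> tool loop (supplied by the caller): call /v1/responses, execute
        # run_bash / read_file tool calls against the workspace, feed results back, repeat.
        transcript = model_tool_loop(record, session)
        # Judge the transcript: one call returns per-assertion binary verdicts with evidence.
        grades = client.post(f"{judge_url}/verify", json={
            "response": transcript["final_text"],
            "tool_calls": transcript["tool_calls"],
            "assertions": record["assertions"],
        }).json()["grades"]
        reward = sum(g["satisfied"] for g in grades) / len(grades)  # fraction satisfied
        return {"reward": reward, "grades": grades}
    finally:
        # Close in finally so a failed rollout never leaks its tmpdir.
        client.post(f"{workspace_url}/close", json={"session_id": session["session_id"]})
```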

The 2×2 control structure

Two independent flags in verifier_metadata produce four cells per scenario:

|                  | with_references=False | with_references=True |
|------------------|-----------------------|----------------------|
| with_skill=False | blind — model priors only | docs-only — realistic reader without the skill pack |
| with_skill=True  | skill-only — SKILL.md in prompt, nothing on disk | skill+docs — realistic reader with the skill pack |
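For orientation, the four cells expressed as flag pairs (cell names and flag names are from the harness; writing them as a Python dict here is purely illustrative):

```python
# (with_skill, with_references) flag pairs carried in verifier_metadata, one per cell.
CELLS = {
    "blind":      {"with_skill": False, "with_references": False},
    "docs-only":  {"with_skill": False, "with_references": True},
    "skill-only": {"with_skill": True,  "with_references": False},
    "skill+docs": {"with_skill": True,  "with_references": True},
}
```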

The diff tool reports four named marginal effects on three axes (Δreward, Δtools, Δtokens):

  • skill | refs=T = skill+docs − docs-only — realistic-deployment value of the skill overlay (the number that matters for shipping).
  • skill | refs=F = skill-only − blind — skill as a standalone doc.
  • refs | skill=T = skill+docs − skill-only — do refs still matter when the skill is prompted?
  • refs | skill=F = docs-only − blind — marginal value of references alone.
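The arithmetic behind those contrasts, as a sketch over per-cell means for one skill and one axis. The diff script performs the same subtractions per axis; the function below is illustrative, not its actual code:

```python
def marginal_effects(cell_means: dict[str, float]) -> dict[str, float]:
    """Four named contrasts over the 2x2 cell means (reward, tool calls, or tokens)."""
    return {
        "skill | refs=T": cell_means["skill+docs"] - cell_means["docs-only"],
        "skill | refs=F": cell_means["skill-only"] - cell_means["blind"],
        "refs | skill=T": cell_means["skill+docs"] - cell_means["skill-only"],
        "refs | skill=F": cell_means["docs-only"] - cell_means["blind"],
    }
```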

Five-field provenance

Every record carries skill_md_sha, evals_sha, fixtures_sha, judge_prompt_sha, harness_version in verifier_metadata. The diff tool tags each delta-of-delta with what changed: md, evals, md+evals, harness, or same-all (no input changed → it's noise/judge drift, not a skill effect).
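A sketch of how such hashes could be stamped. The five field names and the 12-character length match what ships in the JSONL; the choice of sha256 and the on-disk layout of evals and fixtures are assumptions, not necessarily what build_skill_eval_jsonl.py does:

```python
import hashlib
from pathlib import Path

def content_sha(path: Path, length: int = 12) -> str:
    """Short, stable content hash for a single file (algorithm is an assumption)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:length]

def stamp_provenance(skill_dir: Path, judge_prompt: Path, harness_version: str) -> dict[str, str]:
    fixture_files = sorted((skill_dir / "evals" / "files").glob("*"))  # layout assumed
    fixtures_digest = hashlib.sha256(b"".join(p.read_bytes() for p in fixture_files)).hexdigest()[:12]
    return {
        "skill_md_sha": content_sha(skill_dir / "SKILL.md"),
        "evals_sha": content_sha(skill_dir / "evals.json"),  # path assumed
        "fixtures_sha": fixtures_digest,
        "judge_prompt_sha": content_sha(judge_prompt),
        "harness_version": harness_version,
    }
```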

Tooling

  • scripts/build_skill_eval_jsonl.py — emits 4-cell 2×2 with full provenance.
  • scripts/diff_skill_scoreboards.py — auto-detects 2×2 vs legacy; renders multi-axis deltas with provenance attribution.

Validate locally (5 commands, ~30 min)

# 1. Setup
uv venv && uv sync --extra dev --group docs

# 2. Configure model endpoint
cat > env.yaml <<'EOF'
policy_base_url: <your /v1 endpoint>
policy_api_key: <your key>
policy_model_name: <your model>
EOF

# 3. Bring up the four servers
ng_run "+config_paths=[
  resources_servers/skill_workspace/configs/skill_workspace.yaml,
  resources_servers/skill_judge/configs/skill_judge.yaml,
  responses_api_models/openai_model/configs/openai_model.yaml,
  responses_api_agents/skill_eval_agent/configs/skill_eval_agent.yaml
]" +skip_venv_if_present=true
ng_status   # expect "4 healthy"

# 4. Build input JSONL (96 records: 4 cells × 8 skills × 3 scenarios)
python scripts/build_skill_eval_jsonl.py \
  --skills-dir .claude/skills \
  --output responses_api_agents/skill_eval_agent/data/example.jsonl

# 5. Collect rollouts + render scoreboard
ng_collect_rollouts \
  +agent_name=skill_eval_agent \
  +input_jsonl_fpath=responses_api_agents/skill_eval_agent/data/example.jsonl \
  +output_jsonl_fpath=results/v8.jsonl \
  +num_repeats=5 +num_samples_in_parallel=6 \
  "+responses_create_params={max_output_tokens: 8192}"
python scripts/diff_skill_scoreboards.py results/v8.jsonl

Pace observed: ~30 min wall-clock for 480 rollouts at 6-way parallel on the NVIDIA inference API. Lighter validation: pass --cells=blind,skill+docs to the build script (2 cells × 8 skills × 3 scenarios × n=5 = 240 rollouts, ~15 min).

Headline results (v8, n=5, 480 rollouts)

Realistic-deployment column only. Bold = effect outside ~±0.10 noise floor. Full table + per-skill verdicts in notes/skill-eval/results.md.

| skill | skill \| refs=T Δreward | Δtools | verdict |
|-------|--------------------------|--------|---------|
| gym-run | **+0.380** | −1.67 | load-bearing keeper |
| add-benchmark | **+0.162** | −1.13 | keep |
| gym-debug | **+0.133** | −4.13 | keep; biggest efficiency teach |
| gym-data | +0.093 | −3.00 | now measurable after scenario rewrite |
| gym-scaffold-agent | +0.053 | −1.00 | content gap (non-RL agent patterns missing) |
| gym-review | +0.048 | −4.73 | redundant with refs on accuracy; teaches efficiency |
| gym-profile | +0.040 | −2.00 | flipped from −0.107 after patterns-table rewrite |
| gym-config | +0.027 | −1.60 | 2/3 scenarios ceiling-clipped |

Three predicted prescriptions, all landed with clean per-skill provenance attribution:

  • gym-profile patterns-table rewrite → skill | refs=T flipped −0.107 → +0.040 (Δ +0.147)
  • gym-data adversarial scenarios → docs-only 0.97 → 0.88, opening measurable headroom
  • gym-review SKILL.md shrink (110 → 53 lines) → realistic preserved, standalone collapsed (context-dependent win)

One universal pattern: every skill reduces tool calls in the realistic cell (−1.00 to −4.73 per rollout). Reward-only scoring is lossy; always read both axes.

NeMo Gym sharp edges (8 items, ~1 hour each the first time)

These are concrete, reproducible, and cheap to fix; they are probably this PR's most useful output for the framework team:

  1. ng_run runs python app.py not python -m. Relative imports break — must use absolute imports from project root.
  2. Trailing slashes in policy_base_url produce double-slash 404s on some providers.
  3. Some /v1/responses providers return object: "chat.completion"; NeMoGymResponse validation fails on the literal. Normalize in the model server.
  4. OpenAI's FunctionToolParam requires explicit strict: False or validation fails at the model server.
  5. Host NeMo Gym .venv/bin leaks into subprocess PATH via inherited env; rollouts can see ng_* binaries, Ray sockets, HF/MLflow credentials.
  6. Sandbox PATH strip breaks python on macOS — needed a workspace-local python → python3 symlink (this and item 5 are sketched after this list).
  7. Rollout JSONL only persists the final turn's token usage. Multi-turn loops lose intermediate tokens.
  8. Workspace cleanup must be inside a finally block — anything else leaks tmpdirs.
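Items 5 and 6 in practice: a minimal sketch of a sanitized subprocess environment for run_bash, assuming a workspace-local bin/ directory. The actual skill_workspace server may differ in detail:

```python
import shutil
import subprocess
from pathlib import Path

def sandbox_env(workspace: Path) -> dict[str, str]:
    """Build an env for run_bash that does not inherit the host .venv, ng_* binaries,
    Ray sockets, or HF/MLflow credentials (item 5)."""
    ws_bin = workspace / "bin"
    ws_bin.mkdir(exist_ok=True)
    # Item 6: macOS ships python3 but no `python`; give the workspace its own shim.
    python3 = shutil.which("python3")
    shim = ws_bin / "python"
    if python3 and not shim.exists():
        shim.symlink_to(python3)
    return {
        "PATH": f"{ws_bin}:/usr/bin:/bin",  # host .venv/bin is deliberately absent
        "HOME": str(workspace),             # keep host credential files out of reach (assumption)
    }

def run_bash(workspace: Path, cmd: str) -> subprocess.CompletedProcess:
    return subprocess.run(
        ["bash", "-c", cmd],
        cwd=workspace, env=sandbox_env(workspace),
        capture_output=True, text=True, timeout=120,
    )
```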

Methodology lessons that likely generalize beyond skill eval:

  1. Every artifact seeded into a workspace contaminates the control arm: before the first fix, 100% of control rollouts peeked at SKILL.md; before the second fix, 100% peeked at references/.
  2. Content-hash provenance is necessary but not sufficient. Same-sha runs at temperature=0 still drift up to 0.20 per cell from judge non-determinism.
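A sketch of the corresponding noise-floor check: hold provenance fixed (same-all), repeat the run, and treat the per-cell spread as the floor any real effect must clear. Record field names below are illustrative:

```python
import json
from collections import defaultdict

def per_cell_spread(jsonl_path: str) -> dict[tuple, float]:
    """Max-minus-min reward per (skill, scenario, cell) across the n repeats.
    With no input change, this spread is judge drift, not a skill effect."""
    rewards = defaultdict(list)
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            meta = rec["verifier_metadata"]
            key = (meta["skill_name"], meta["scenario_id"], meta["cell"])  # field names assumed
            rewards[key].append(rec["reward"])
    return {key: max(vals) - min(vals) for key, vals in rewards.items()}
```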

What's worth absorbing upstream

Two with existing demand. Two parked.

Worth a PR:

  • Assertion-grade LLM-as-judge as a shared base class. Three existing in-tree implementations (ours, code_gen, equivalence_llm_judge) share a common shape; a small base class would save the next team from re-inventing it.
  • Per-turn token aggregation in the agent base class. Today only the final response's usage reaches the output JSONL; every multi-turn agent eventually wants the sum.
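Roughly what the per-turn aggregation would look like in an agent loop (the usage field names follow an OpenAI-style usage block; the base-class hook itself is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class TokenTotals:
    """Running per-rollout totals; today only the final turn's usage reaches the output JSONL."""
    input_tokens: int = 0
    output_tokens: int = 0
    turns: int = 0

    def add(self, usage: dict) -> None:
        self.input_tokens += usage.get("input_tokens", 0)
        self.output_tokens += usage.get("output_tokens", 0)
        self.turns += 1

# In the agent loop (sketch): call totals.add(response["usage"]) after every model call,
# then emit the summed totals alongside the final-turn usage in the output record.
```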

Parked (no second customer yet):

  • Paired-arm pattern as a framework primitive (a pairing_key + post-hoc join CLI).
  • Framework-level provenance stamping (a narrow ng_stamp_provenance helper that hashes Hydra-config-referenced files).

Read more

  • notes/skill-eval/harness.md — build the harness, the 2×2, provenance, iteration loop, sandbox details, scenarios how-to.
  • notes/skill-eval/results.md — full v8 scoreboard, per-skill audit (verdicts + before/after edits + v8 outcomes), the contamination story, retraction log, open methodology questions.

Test plan

  • 221 core unit tests pass at 96.41% coverage (above the 96.0% threshold).
  • Per-server tests pass: skill_workspace (28), skill_judge (22), skill_eval_agent (9).
  • Builder + script tests pass: test_build_skill_eval_jsonl.py (15), test_eval_skills.py (33).
  • CI lint clean: ruff format --check, ruff check, README env-list row, secrets-detector hook.
  • CI docs build clean: sphinx-build --fail-on-warning succeeds with the skill-eval material moved out of the Sphinx tree.
  • End-to-end: ran 480-rollout v8 scoreboard against NVIDIA inference API; provenance tags match expected md / evals+fx / same-all per skill edited.

🤖 Generated with Claude Code

…, config, data, scaffolding, and execution

Add comprehensive agent skills covering the full NeMo Gym workflow:
- gym-review: Anti-pattern detection with deterministic Python scanner
- gym-debug: Runtime failure diagnosis and log analysis
- gym-profile: Rollout result analysis and reward profiling
- gym-config: Hydra YAML configuration composition and validation
- gym-data: Dataset preparation, validation, and HuggingFace registry
- gym-scaffold-agent: Custom agent server creation patterns
- gym-run: End-to-end benchmark execution workflow

Each skill includes SKILL.md with step-by-step instructions, self-contained
references, eval fixtures, and evals.json for quality measurement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
lbliii and others added 15 commits April 24, 2026 12:08
Ships a three-server NeMo Gym environment that grades agent-skill content
(.claude/skills/*/SKILL.md) via paired with-skill vs without-skill rollouts:

- resources_servers/skill_workspace: per-session tmpdir with run_bash/read_file
  tools. Sandbox env strips host PATH to prevent rollouts seeing ng_* binaries,
  provides a python -> python3 shim for prompts assuming `python`.
- resources_servers/skill_judge: LLM-as-judge returning per-assertion binary
  grades with evidence; reward = fraction satisfied.
- responses_api_agents/skill_eval_agent: orchestrates seed -> tool loop ->
  verify -> close. Reads verifier_metadata.with_skill to decide whether to
  prepend SKILL.md as a system message.

Tooling:
- scripts/build_skill_eval_jsonl.py emits paired records with 5 content
  hashes in verifier_metadata (skill_md_sha, evals_sha, fixtures_sha,
  judge_prompt_sha, harness_version) so downstream diffs are attribution-safe.
- scripts/diff_skill_scoreboards.py renders per-skill w/wo/delta with a
  same-all / partial(N/5) / legacy provenance tag.
- scripts/eval_skills.py is a standalone runner for ad-hoc evaluation.
- scripts/build_shape_probe.py drives one-off content A/B tests.

Documentation:
- Two-part tutorial under docs/environment-tutorials (harness + scoreboard)
  covering methodology: with-vs-without deltas, provenance, noise floor
  calibration, per-scenario breakdown, and an iteration loop.
- A dogfood checkpoint summarizing findings, gotchas, and upstream
  candidates for the NeMo Gym team.

Skill-content fixes found during iteration:
- gym-config: rewrote as a "read-before-answer" checklist (flipped from
  -0.112 to +0.033 on the first iteration).
- gym-run/evals: loosened over-literal assertions that punished valid
  paraphrases.
- gym-scaffold-agent/evals: tightened scenario-3 assertion 5 so the judge
  can't credit critical reviews of clean code.

One framework nit: responses_api_models/openai_model normalizes
object="chat.completion" to "response" because some providers return the
former on /v1/responses, which otherwise fails NeMoGymResponse validation.
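Roughly what that normalization amounts to (a sketch, not the actual openai_model code):

```python
def normalize_responses_object(payload: dict) -> dict:
    # Some providers label /v1/responses payloads as chat completions;
    # NeMoGymResponse validates the `object` literal, so coerce it before validation.
    if payload.get("object") == "chat.completion":
        payload = {**payload, "object": "response"}
    return payload
```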

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Previously the harness conflated "skill in system prompt" with "skill's
supporting artifacts on disk" under a single with_skill flag. Because
seed_session unconditionally copied references/ and scripts/ into the
workspace, the control arm wasn't actually a control — we measured 100%
of without_skill rollouts on gym-profile reading references/metrics-guide.md
(which contains the exact nouns the assertions test for).

This commit splits the factor:

- skill_workspace: SkillWorkspaceSeedSessionRequest now has independent
  with_references / with_scripts flags. seed_session gates the copytree
  for references/ and scripts/ on those. SKILL.md remains never-seeded.
- skill_eval_agent: reads with_references / with_scripts from
  verifier_metadata and forwards to seed_session. Back-compat: when those
  fields are absent, defaults to the with_skill value.
- build_skill_eval_jsonl: emits the full 2×2 by default over
  (with_skill, with_references) — four cells per scenario labeled
  blind / docs-only / skill-only / skill+docs. --cells flag restricts
  the subset emitted.

Tests added for each cell combination in the workspace, and for
cell-label emission and subsetting in the builder.

What the cells mean for interpretation:
  blind       model priors only; no prompt, no disk
  docs-only   realistic reader without the skill pack installed
  skill-only  prompt-only; diagnoses how load-bearing on-disk artifacts are
  skill+docs  realistic reader with the skill pack (what previous "with" was)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Diff tool now auto-detects whether a JSONL is the legacy two-arm shape
(pre-Phase-1) or the 4-cell 2×2 shape, and renders accordingly. Both modes
now report three axes:

- Δreward    — accuracy (per-assertion satisfaction rate)
- Δtools     — number of tool calls made during the rollout
- Δtokens    — final-response output tokens (per-turn aggregation is Phase 2)

For 2×2 JSONLs, each skill shows four cell-level rows plus four named
marginal effects:

  skill | refs=T    value of SKILL.md prompt given references on disk
                    (= realistic-deployment marginal effect)
  skill | refs=F    value of SKILL.md prompt without references
                    (= skill-as-standalone-doc)
  refs  | skill=T   value of references given the skill is already in prompt
  refs  | skill=F   value of references as the reader's only scaffold

Receipt for why this matters: v6's legacy two-arm read showed gym-run at
Δreward=+0.487 and gym-data at Δreward=-0.013, but both skills cut tool
calls by 2.4–3.3 per rollout. The reward-only view was hiding half the
signal; the new multi-axis view surfaces it automatically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Addresses PM review feedback: the original checkpoint conflated what we
built with what we learned about specific skills. The infra work is
defensible; the per-skill claims rest on measurements with known
contamination (SKILL.md on disk in the control arm, references/*.md on
disk in both arms). Separating the two makes the artifact sharable
without carrying forward claims we can't yet defend.

- skill-eval-infra-v1.md: infra artifact + sharp-edges list + two upstream
  candidates with existing receipts. No per-skill claims.
- skill-eval-findings.md: gated on v7 (post-contamination-fix) rollout.
  Pre-fix numbers archived but marked as not-defensible. Retracted claims
  listed explicitly.
- Original checkpoint retained with a "superseded by" banner pointing to
  the split.

Also retracts two prior claims that don't survive the noise-floor check:
"every skill reduces tool calls 0.8–4.8" (measurement against contaminated
control) and "gym-profile is actively misleading" (aggregate delta inside
per-cell noise; per-scenario sc3 claim preserved but scoped).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
First clean measurement since the references-contamination fix. 480
rollouts, 4 cells × 8 skills × 3 scenarios × n=5.

Key results on the realistic-deployment contrast (skill | refs=T):

- gym-run +0.436 (the only pre-fix claim that fully survives; has no
  references/ dir so its control was never contaminated)
- add-benchmark +0.141, gym-config +0.111, gym-debug +0.080 (survive
  but smaller than the contaminated numbers)
- gym-profile -0.107 (refined read: skill as standalone is +0.278,
  competes with its own references when both present)
- gym-review +0.029 (mostly redundant with references — refs|skill=F
  is +0.602)
- gym-scaffold-agent -0.040, gym-data +0.013 (both inside noise)

Efficiency story holds and is the strongest multi-skill pattern: every
skill reduces tool calls on the realistic contrast (-0.73 to -4.87 calls
per rollout). The earlier checkpoint's "reward-only is lossy" claim
survives on the correct contrast.

Retracts three prior claims:
- "every skill reduces tool calls 0.8-4.8" replaced by correct-contrast
  magnitudes above
- "gym-profile is actively misleading" replaced by "competes with its
  own references; useful standalone"
- shape-probe null result withdrawn pending clean rerun

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Sweeps the long-form docs to match the current state of the harness:

- skill-eval-harness.md (Part 1): /seed_session now documents with_references
  / with_scripts flags; adds a "Gotcha: workspace contamination" section
  explaining why SKILL.md is never seeded and why references gating matters;
  orchestrator /run example updated to forward both flags from
  verifier_metadata.
- skill-eval-scoreboard.md (Part 2): methodology section rewritten to
  introduce the 2×2 cells and the four named marginal effects as the
  primary frame; Step 2 JSONL example updated to show the new provenance
  fields and cell labels; Step 4 replaces the old legacy-contaminated
  scoreboard with the v7 multi-axis table and interpretive heuristics for
  each row pattern; Step 7 retires the v1→v2 and v3→v4 narratives in favor
  of a v6→v7 contamination-fix worked example plus an honest judge-drift
  noise-floor note; "What's next" replaced with concrete follow-ons and
  cross-links to findings.md and infra-v1.md.
- skill_workspace/README: documents with_references / with_scripts on
  /seed_session; explains the "SKILL.md never seeded" invariant.
- skill_eval_agent/README: schemas the 2×2 cell in the input JSONL
  example; documents the five provenance fields; updates the /run flow.
- index.md: updated card blurbs to reflect 2×2 methodology.

Scope: documentation only. No code or test changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Per-skill verdicts from v7 (4-cell 2×2, n=5), grounded in:

- the four headline effects: Δreward on skill|refs=T, skill|refs=F, and
  refs|skill=F, plus Δtools on the realistic contrast (refs=T)
- per-scenario cell means
- a root-cause hypothesis for why each skill performs as measured
- specific file-level prescriptions (SKILL.md edits, new scenarios,
  references to keep/delete)

Three skills with clear keep verdicts (gym-run, add-benchmark, gym-config).
Two with clear structural issues: gym-review (SKILL.md redundant with its
references, shrink), gym-profile (SKILL.md competes with its references,
rewrite to narrate). One with content gap (gym-scaffold-agent missing
non-RL agent patterns). Two blocked on scenario difficulty (gym-data
fully ceiling-clipped, gym-debug partially).

Cross-cutting patterns section lists: where to shrink SKILL.md, where to
rewrite to narrate-to-refs, where to rewrite evals.json for harder
scenarios, where there are real content gaps.

Recommendations are explicitly gated on: (a) n=20 calibration to confirm
effect sizes, (b) one-edit-at-a-time discipline, (c) provenance diff
confirming the right field moved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Predicted prescription from the skill-by-skill audit: the v7 Δreward on
the realistic contrast was -0.107, driven by sc2 and sc3 where assertions
test conceptual noun recall (name `pass_threshold`, name `extracted_model_code`)
and the how-to-shaped SKILL.md never reaches those nouns in prose.

Two targeted edits:

1. Promote `pass_threshold` from a command flag into a named concept
   subsection under Step 2. The concept is now explained (how it changes
   pass@k, when to raise/lower) rather than only appearing as `+pass_threshold=1.0`
   on a command line.
2. Rewrite the "Suspicious patterns" table so every row names both the
   trigger (what you observe) and the confirming field (what you read from
   the rollout JSONL to verify the diagnosis). Adds a "Rule" note that
   tells the model to cite the specific field by identifier in its
   diagnosis.

Expected v8 effect: `skill | refs=T` moves from -0.107 toward zero or
positive. sc2 and sc3 should stop losing their key assertions to the
"cited pattern but not the confirming field" failure mode.

Single-skill change; diff against v7 should flag `md` for gym-profile
and same-all for every other skill.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Predicted prescription from the skill-by-skill audit: v7 `skill | refs=T`
was +0.029, `refs | skill=F` was +0.602 — the references + scripts/review.py
carry the entire load and the 110-line SKILL.md was dead weight.

Kept in the new SKILL.md:
- How to invoke scripts/review.py (one paragraph).
- What BLOCK vs WARN severity means.
- An affirmative framing: if the script is quiet, the code is clean —
  approve the review explicitly rather than manufacturing concerns.
- The five judgment checks the pattern matcher can't do.
- Cross-references to references/anti-patterns.md and fix-patterns.md.

Removed:
- Full BLOCK/WARN tables (they duplicate references/anti-patterns.md and
  are echoed in the script's own per-finding output).
- Review report template (trivial, doesn't need a skill to teach).
- The "Apply judgment" numbered list's per-item detail — kept as bullets,
  context lives in the references.

Expected v8 effect: `skill | refs=T` holds near zero (it's already there);
Δtools holds near -4.87 (the skill's main contribution was teaching the
model to trust the script). If Δtools drops materially, the shrink was
too aggressive.

Single-skill change; diff against v7 should flag `md` for gym-review
and same-all for every other skill.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
All three gym-data scenarios in v7 were ceiling-clipped (docs=0.96-1.00,
skill-only=1.00, skill+docs=0.96-1.00). The skill's content was impossible
to evaluate because the scenarios were trivially solvable from first
principles by a frontier model. This replaces the scenarios with harder
versions that require real judgment:

- sc1 (schema audit): a 5-line tool-calling dataset with 4 intentionally
  planted schema bugs (required field not in properties, parallel_tool_calls
  / expected_tool_calls mismatch, missing `function` wrapping, and a
  parallel_tool_calls / single-tool inconsistency). Model must read each
  entry and identify the exact violation; 1 entry is clean.
- sc2 (semantic mislabeling): a 7-line math/trivia dataset where 3
  `expected_answer` values are factually wrong (capital of Australia =
  Canberra not Sydney; leap year = 366 not 365; gold symbol = Au not Go).
  Schema is fine; the model has to apply domain judgment to detect the
  mislabels.
- sc3 (schema extension): a complex multi-turn branching customer-support
  benchmark schema with expected_tool_sequence, forbidden_sequence, and
  partial_credit. Model must generate 3 new entries that follow the
  schema AND exercise specified branching scenarios (email-only refund,
  abuse attempt, ambiguous-amount refund).

Expected v8 effect: `docs-only` drops below 0.95 (likely much lower on
sc1 and sc3), opening measurable headroom for the skill. `skill | refs=T`
becomes interpretable instead of structurally-zero.

Old fixture files (sample_tool_calling.jsonl, sample_bad_data.jsonl,
sample_sql_benchmark.jsonl, sample_judge_benchmark.jsonl) remain in
evals/files/ unreferenced — kept for future scenario work.

Single-skill change; diff against v7 should flag `evals+fx` for gym-data
and same-all for every other skill.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
v8 measured the three predicted skill-content prescriptions from the v7
audit. Two outright hits, one caveated hit — all three landed with clean
provenance attribution and single-skill effects.

findings.md:
- Lead with a v8 summary table (edit / predicted direction / observed / verdict).
- Replace the v7 headline table with v8 numbers; keep v7 archived for comparison.
- Update the Δtools table to show v7 → v8 side-by-side (pattern held).
- Rewrite per-skill reads to include v8 outcome blocks where edits were made.
- Expand the retraction log and update "what we can claim" with the v8 evidence.

skill-eval-skill-review.md:
- Title + header updated to "v7 prescriptions, v8 outcomes".
- Each edited skill (gym-profile, gym-review, gym-data) gets a v8 outcome
  block citing the commit SHA, the realistic-contrast movement, and the
  provenance diff.
- Cross-cutting patterns section annotated with which prescriptions
  landed and which are outstanding.
- New "v7 → v8 results on the three prescriptions" summary table.
- Prerequisites updated with the gym-review shrink lesson (measure
  standalone contrast too).

Headline:
- gym-profile skill|refs=T: −0.107 → +0.040 (change +0.147, md diff, cleanly above noise)
- gym-data skill|refs=T:     +0.013 → +0.093 (evals+fx diff, now measurable)
- gym-review skill|refs=T:   +0.029 → +0.048 (realistic held); skill|refs=F: +0.298 → +0.012 (standalone collapsed)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
The v8 outcome blocks narrated what changed but didn't show it. Making
the edits concrete so readers can see the pattern — this is the whole
point of the prescription framework; abstracted "rewrite patterns table"
is not as useful as showing an actual row go from 2 columns to 3.

Per-skill additions:

- gym-profile: before/after of one patterns-table row (think-block stripping),
  showing the diagnostic chain pattern → cause → confirming field now
  completing in prose. Notes the parallel pass_threshold change from
  command flag to named concept subsection.
- gym-review: kept-vs-dropped lists. What stayed in SKILL.md (invocation,
  severity, 5 judgment bullets, cross-refs). What moved out (full BLOCK
  / WARN tables, report template, verbose judgment list). Explains why
  the shrink works for realistic-deployment (refs carry the tables) and
  why it regressed skill-only (no refs → no tables).
- gym-data: sc1 / sc2 / sc3 prompts shown before vs after. In every
  case, "before" is pattern-matchable from fixture format; "after"
  requires reading contents and applying judgment. Explicit framing:
  scenarios moved from validating format to testing understanding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
SHAs updated for the three skills edited in 3f3a330 (gym-profile md),
8fdcdb2 (gym-review md), and 5e84dd0 (gym-data evals+fx). This is the
input JSONL v8 was run against.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
…line

Four failing CI jobs, fixed:

Test (coverage 92.5% < 96.0% threshold) — core `pytest tests/unit_tests/`
was measuring `resources_servers/skill_workspace/app.py` and
`resources_servers/skill_judge/app.py` because new test files import
SkillWorkspaceResourcesServer as a fixture. Per-server modules are tested
via `ng_test` (dedicated venv per server), not by the core suite. Added
`resources_servers/*`, `responses_api_agents/*`, and
`responses_api_models/*` to the coverage `omit` list. Also added CLI tests
for `scripts/build_skill_eval_jsonl.py` to bring its coverage from 81% to
98%. Total coverage now 96.41%.

Lint check — ruff format pass on 7 files (pure formatting, no behavior
changes). 5 ruff --fix items applied (mostly removed-blank-line nits).

Lint check (README row) — main's `update_env_list.py` adds a "Skill Eval
Agent" row when it sees `responses_api_agents/skill_eval_agent/`. Added
the row manually so the merge-commit hook reports clean.

secrets-detector — `Hex High Entropy String` warnings on our 12-char
content-hash provenance fields (`skill_md_sha`, etc.) in
`responses_api_agents/skill_eval_agent/data/example.jsonl` and the
agent README. Extended `.secrets.baseline` `should_exclude_file` regex
to cover those paths plus `notes/skill-eval/*.md` (next commit moves
internal docs there).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
…x tree

The skill-eval material is internal stakeholder validation, not
user-facing docs. The repo is migrating user docs to Fern (fern/ is in
main); the Sphinx tree under docs/ is being deprecated. This material
shouldn't live in either system.

Move out of docs/environment-tutorials/ (Sphinx) into notes/skill-eval/.
Two consolidated files instead of six:

- notes/skill-eval/harness.md — build and operate the harness. Combines
  skill-eval-harness.md (build) and skill-eval-scoreboard.md (run +
  interpret), stripped of Sphinx tutorial framing. Plain markdown.
- notes/skill-eval/results.md — what we measured and what to take from
  it. Replaces checkpoint.md, infra-v1.md, findings.md, and
  skill-review.md. Sections: TL;DR, v8 scoreboard, per-skill audit with
  before/after edits and v8 outcomes, NeMo Gym sharp edges, upstream
  candidates, methodology learnings, retraction log.

Deleted:
- docs/environment-tutorials/skill-eval-{harness,scoreboard,checkpoint,
  infra-v1,findings,skill-review}.md
- The two skill-eval grid-item-cards in
  docs/environment-tutorials/index.md

The grid-card removal also fixes the docs-build CI failure (the six
files were warning about not being in any toctree). With them gone from
the Sphinx tree, the build is clean — verified locally with sphinx-build
--fail-on-warning.

Stripped Sphinx-isms in the new notes: dropped (label)= ref targets,
{ref}/{doc} cross-refs, :::{note}/:::{tip}/:::{button-ref} admonitions,
:orphan: headers, grid-card directives, math $$...$$ blocks. Plain
markdown that renders identically in any viewer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
lbliii added a commit that referenced this pull request Apr 30, 2026
Three additions sourced from PR #1062 (skill-eval harness dogfooding).

- Quickstart step 3: replace "Four Uvicorn lines print" with
  ng_status verification. Lawrence's flow uses ng_status; it's faster
  than scanning logs and produces an unambiguous "N healthy" signal.
- Quickstart step 3: add a Tip for returning users — append
  +skip_venv_if_present=true to ng_run to skip venv re-creation.
  Real flag in cli_setup_command.py:116-117.
- Configuration troubleshooting: new accordion for the trailing-slash
  gotcha (a "/" at the end of policy_base_url produces double-slash
  request paths some providers 404 on). From PR #1062 sharp edge #2.

Verified ng_run blocks (cli.py:410 calls rh.run_forever); two-terminal
pattern in Quickstart is correct, no change needed there.

Signed-off-by: Lawrence Lane <llane@nvidia.com>