95 changes: 95 additions & 0 deletions .claude/skills/README.md
@@ -0,0 +1,95 @@
# NeMo Gym Agent Skills

Agent skills for NeMo Gym development. Each skill follows the [agentskills.io](https://agentskills.io/specification) specification and can be used standalone or composed into multi-step chains.

## Skills

| Skill | What it does | When to use it |
|-------|-------------|----------------|
| [gym-review](gym-review/) | Deterministic anti-pattern checker + judgment-based review | Reviewing PRs, auditing servers before merge |
| [gym-debug](gym-debug/) | Diagnose server failures, rollout errors, unexpected rewards | Servers won't start, rollouts hang, rewards look wrong |
| [gym-run](gym-run/) | Run benchmarks — env.yaml setup, server launch, rollout collection | First run, smoke testing, full rollout collection |
| [gym-profile](gym-profile/) | Analyze rollout results, reward distributions, pass rates | Baselining benchmarks, comparing models, investigating variance |
| [gym-config](gym-config/) | Compose and validate Hydra YAML configurations | Setting up server configs, debugging composition errors |
| [gym-data](gym-data/) | Prepare, validate, and register JSONL datasets | Converting data, uploading to GitLab registry, validating schemas |
| [gym-scaffold-agent](gym-scaffold-agent/) | Create custom agent servers | Multi-turn interaction, external library wrapping, tool orchestration |
| [add-benchmark](add-benchmark/) | End-to-end benchmark creation guide | Adding a new resources server + agent + data + config |

## Chains

Chains compose skills into multi-step workflows. Defined in [`chains.yaml`](chains.yaml).

| Chain | Steps | Use case |
|-------|-------|----------|
| **run** | gym-config > gym-run > gym-profile | Executing a configured benchmark end-to-end |
| **new-benchmark** | add-benchmark > gym-data > gym-config > gym-run > gym-profile > gym-review | Building a benchmark from scratch |
| **validate** | gym-config > gym-data > gym-run > gym-profile | Checking an existing benchmark works correctly |
| **diagnose** | gym-debug > gym-review | Debugging a failing benchmark |
| **external-integration** | gym-scaffold-agent > gym-data > gym-config > gym-run > gym-profile > gym-review | Wrapping a 3rd-party benchmark library |
| **pre-merge** | gym-review > gym-config > gym-data | Final checks before merging a PR |
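
The `chains.yaml` schema isn't reproduced in this README. As a rough, hypothetical sketch only (the field names below are assumptions for illustration, not the file's actual keys), an entry for the **diagnose** chain might look like:

```yaml
# Hypothetical sketch -- consult chains.yaml for the real schema.
chains:
  diagnose:
    description: Debug a failing benchmark
    steps:
      - gym-debug
      - gym-review
```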

## Skill structure

Each skill follows a consistent layout:

```
skill-name/
  SKILL.md            # Skill definition (YAML frontmatter + instructions)
  evals/
    evals.json        # Assertion-based evaluations
    files/            # Self-contained test fixtures (if applicable)
  references/         # Portable reference docs (if applicable)
  scripts/            # Deterministic tooling (if applicable)
```

**gym-review** is the reference implementation: it includes a standalone Python checker (`scripts/review.py`), self-contained reference docs, and eval fixtures that work without the NeMo Gym repo.

## Evaluating skills

Each skill has 3 evals in `evals/evals.json`. Evals follow the [agentskills.io evaluation spec](https://agentskills.io/skill-creation/evaluating-skills).

### Running evals

Compare agent performance **with-skill** vs **without-skill** (baseline):

1. **With-skill**: Load the SKILL.md, give the agent the eval prompt, grade the response against assertions.
2. **Without-skill (baseline)**: Give the agent the same prompt with no skill loaded, grade against the same assertions.
3. **Compute delta**: The percentage-point improvement from loading the skill.

Each eval in `evals.json` has:

```json
{
  "id": 1,
  "prompt": "The task the agent must perform",
  "expected_output": "What a good response looks like",
  "files": ["evals/files/fixture.py"],
  "assertions": [
    "Specific claim that must be true in the response",
    "Another required element"
  ]
}
```

### Grading

For each assertion, score 1 (present in response) or 0 (missing). The skill's score is the average across all assertions and evals. A skill is useful when its with-skill score meaningfully exceeds the baseline.
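
For a concrete sense of the arithmetic, here is a minimal, illustrative sketch (the helper and the scores below are invented for this example, not part of the repo):

```python
def skill_score(assertion_scores: list[int]) -> float:
    """Average of per-assertion scores: 1 = present in response, 0 = missing."""
    return sum(assertion_scores) / len(assertion_scores)

# Illustrative numbers: 9 assertions graded across one skill's 3 evals
with_skill = skill_score([1, 1, 1, 1, 1, 0, 1, 1, 1])  # ~0.89
baseline = skill_score([1, 0, 0, 0, 0, 0, 1, 0, 0])    # ~0.22
delta_pp = (with_skill - baseline) * 100               # ~67 percentage points
```

Here the skill would count as useful: loading it lifts the average assertion score by roughly 67 percentage points over the baseline.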

### Example: gym-review

```bash
# The review script can also be tested directly
python .claude/skills/gym-review/scripts/review.py .claude/skills/gym-review/evals/files/

# Expected: 9 BLOCK, 4 WARN across the fixture files
# sample_clean_server.py should produce 0 findings
```

## Portability

Skills are designed to work when pulled standalone. Key design principles:

- **References are self-contained** -- no links to repo-internal paths that won't exist for external users
- **Scripts have zero dependencies** -- `review.py` uses only the Python standard library
- **Eval fixtures are bundled** -- test files live in `evals/files/`, not scattered across the repo
- **SKILL.md frontmatter** includes `license`, `compatibility`, and `allowed-tools` per the spec
9 changes: 7 additions & 2 deletions .claude/skills/add-benchmark/SKILL.md
@@ -6,8 +6,13 @@ description: >
   training environment, or resources server into NeMo-Gym. Also use when wrapping
   an existing 3rd-party benchmark library. Covers the full workflow: data preparation,
   resources server implementation, agent wiring, YAML config, testing, and reward
-  profiling (baselining). Triggered by: "add benchmark", "new resources server",
-  "integrate benchmark", "wrap benchmark", "add training environment", "add eval".
+  profiling (baselining).
+license: Apache-2.0
+compatibility: Requires Python 3.12+, uv, git. NeMo Gym must be installed.
+metadata:
+  author: nvidia-nemo-gym
+  version: "1.0"
+allowed-tools: Bash(python:*) Bash(ng_*) Bash(git:*) Bash(pre-commit:*) Read Write Edit Grep Glob
 ---
 
 # Add Benchmark to NeMo-Gym
46 changes: 46 additions & 0 deletions .claude/skills/add-benchmark/evals/evals.json
@@ -0,0 +1,46 @@
{
  "skill_name": "add-benchmark",
  "evals": [
    {
      "id": 1,
      "prompt": "I want to add a math algebra benchmark. Here's a sample server at evals/files/sample_math_server.py, data at evals/files/sample_math_example.jsonl, and config at evals/files/sample_math_config.yaml. Review them for completeness.",
      "expected_output": "Review confirming server, data, and config are correct, identifying any missing pieces.",
      "files": ["evals/files/sample_math_server.py", "evals/files/sample_math_example.jsonl", "evals/files/sample_math_config.yaml"],
      "assertions": [
        "Server extends SimpleResourcesServer with async verify()",
        "Think-block stripping is present and validated",
        "Rewards are binary (0.0 or 1.0)",
        "example.jsonl has valid entries with responses_create_params and verifier_metadata",
        "Config correctly wires resources server, agent, and datasets",
        "verified: false is confirmed for new server",
        "Response identifies any missing pieces (requirements.txt, README.md)"
      ]
    },
    {
      "id": 2,
      "prompt": "Review the test file at evals/files/sample_math_test.py for my math benchmark. Does it cover enough cases?",
      "expected_output": "Test coverage assessment identifying what's covered and what's missing.",
      "files": ["evals/files/sample_math_test.py", "evals/files/sample_math_server.py"],
      "assertions": [
        "Test coverage assessed: pass, fail (wrong answer), fail (no extraction), fail (think block)",
        "Missing test case identified: timeout handling",
        "Response notes test coverage is close to but not at the 95% requirement",
        "Tests correctly validate binary rewards",
        "Response recommends adding edge cases (empty output, very long output)"
      ]
    },
    {
      "id": 3,
      "prompt": "I want to wrap an external code execution benchmark. The library uses httpx for API calls and has its own scoring. How should I structure this?",
      "expected_output": "Integration guide recommending agent-level wrapping with httpx replacement.",
      "assertions": [
        "Agent-level integration recommended (wrap in /run, not /verify)",
        "httpx replacement with aiohttp adapter is mentioned",
        "Pre-processing from Gym schema to library format described",
        "Post-processing to BaseVerifyResponse with reward described",
        "Reproduce published numbers with original repo first, then reproduce after integration",
        "asyncio.Semaphore for concurrent library calls mentioned"
      ]
    }
  ]
}
33 changes: 33 additions & 0 deletions .claude/skills/add-benchmark/evals/files/sample_math_config.yaml
@@ -0,0 +1,33 @@
math_algebra:
  resources_servers:
    math_algebra:
      entrypoint: app.py
      domain: math
      verified: false
      datasets:
        - name: math_example
          type: example
          jsonl_fpath: resources_servers/math_algebra/data/example.jsonl
        - name: math_train
          type: train
          jsonl_fpath: resources_servers/math_algebra/data/train.jsonl
          gitlab_identifier:
            dataset_name: math_algebra
            version: 0.0.1
            artifact_fpath: train.jsonl
          license: Apache-2.0

math_agent:
  responses_api_agents:
    simple_agent:
      entrypoint: app.py
      resources_server:
        type: resources_servers
        name: math_algebra
      model_server:
        type: responses_api_models
        name: policy_model
      datasets:
        - name: math_example
          type: example
          jsonl_fpath: resources_servers/math_algebra/data/example.jsonl
5 changes: 5 additions & 0 deletions .claude/skills/add-benchmark/evals/files/sample_math_example.jsonl
@@ -0,0 +1,5 @@
{"responses_create_params": {"input": [{"role": "system", "content": "Solve the algebra problem. Show your work, then give your final numerical answer after 'Answer:'."}, {"role": "user", "content": "If x + 5 = 12, what is x?"}]}, "verifier_metadata": {"expected_answer": "7"}}
{"responses_create_params": {"input": [{"role": "system", "content": "Solve the algebra problem. Show your work, then give your final numerical answer after 'Answer:'."}, {"role": "user", "content": "A store sells apples for $2 each. If you buy 8 apples and pay with a $20 bill, how much change do you get?"}]}, "verifier_metadata": {"expected_answer": "4"}}
{"responses_create_params": {"input": [{"role": "system", "content": "Solve the algebra problem. Show your work, then give your final numerical answer after 'Answer:'."}, {"role": "user", "content": "Solve for y: 3y - 9 = 0"}]}, "verifier_metadata": {"expected_answer": "3"}}
{"responses_create_params": {"input": [{"role": "system", "content": "Solve the algebra problem. Show your work, then give your final numerical answer after 'Answer:'."}, {"role": "user", "content": "What is the area of a rectangle with length 7 and width 4?"}]}, "verifier_metadata": {"expected_answer": "28"}}
{"responses_create_params": {"input": [{"role": "system", "content": "Solve the algebra problem. Show your work, then give your final numerical answer after 'Answer:'."}, {"role": "user", "content": "If 2x + 3 = 15, what is x?"}]}, "verifier_metadata": {"expected_answer": "6"}}
48 changes: 48 additions & 0 deletions .claude/skills/add-benchmark/evals/files/sample_math_server.py
@@ -0,0 +1,48 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Math algebra resources server — extracts numerical answer and compares to expected."""

import re

from nemo_gym.servers.resources_server import SimpleResourcesServer


class MathAlgebraServer(SimpleResourcesServer):
    async def verify(self, body):
        output_text = body.get("output_text", "")
        verifier_metadata = body.get("verifier_metadata", {})
        expected_answer = str(verifier_metadata.get("expected_answer", ""))

        # Strip think blocks before extraction
        if "</think>" in output_text:
            output_text = output_text.split("</think>")[-1].strip()

        # Extract answer after "Answer:" marker
        match = re.search(r"(?:Answer|ANSWER)\s*[:\s]\s*(.+)", output_text)
        if not match:
            return {"reward": 0.0, "extracted_answer": None, "reason": "no_answer_marker"}

        extracted = match.group(1).strip().rstrip(".")

        # Compare normalized answers
        if self._normalize(extracted) == self._normalize(expected_answer):
            return {"reward": 1.0, "extracted_answer": extracted}
        else:
            return {"reward": 0.0, "extracted_answer": extracted, "reason": "wrong_answer"}

    @staticmethod
    def _normalize(s: str) -> str:
        """Normalize answer for comparison: strip whitespace, lowercase."""
        return re.sub(r"\s+", " ", s.strip().lower())
76 changes: 76 additions & 0 deletions .claude/skills/add-benchmark/evals/files/sample_math_test.py
@@ -0,0 +1,76 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for the math algebra resources server."""

import pytest


@pytest.fixture
def server():
    from sample_math_server import MathAlgebraServer

    return MathAlgebraServer()


@pytest.mark.asyncio
async def test_verify_pass(server):
    """Correct answer should get reward 1.0."""
    result = await server.verify(
        {
            "output_text": "Let me solve this.\nx + 5 = 12\nx = 7\nAnswer: 7",
            "verifier_metadata": {"expected_answer": "7"},
        }
    )
    assert result["reward"] == 1.0
    assert result["extracted_answer"] == "7"


@pytest.mark.asyncio
async def test_verify_fail_wrong_answer(server):
    """Wrong answer should get reward 0.0."""
    result = await server.verify(
        {
            "output_text": "I think the answer is:\nAnswer: 5",
            "verifier_metadata": {"expected_answer": "7"},
        }
    )
    assert result["reward"] == 0.0
    assert result["reason"] == "wrong_answer"


@pytest.mark.asyncio
async def test_verify_fail_no_answer(server):
    """Missing 'Answer:' marker should get reward 0.0."""
    result = await server.verify(
        {
            "output_text": "The solution is 7, which we can verify by substituting back.",
            "verifier_metadata": {"expected_answer": "7"},
        }
    )
    assert result["reward"] == 0.0
    assert result["reason"] == "no_answer_marker"


@pytest.mark.asyncio
async def test_verify_fail_think_block(server):
    """Answer only inside think block should get 0.0 after stripping."""
    result = await server.verify(
        {
            "output_text": "<think>\nThe answer is 7.\nAnswer: 7\n</think>",
            "verifier_metadata": {"expected_answer": "7"},
        }
    )
    assert result["reward"] == 0.0
    assert result["reason"] == "no_answer_marker"