95 changes: 95 additions & 0 deletions .claude/skills/README.md
@@ -0,0 +1,95 @@
# NeMo Gym Agent Skills

Agent skills for NeMo Gym development. Each skill follows the [agentskills.io](https://agentskills.io/specification) specification and can be used standalone or composed into multi-step chains.

## Skills

| Skill | What it does | When to use it |
|-------|-------------|----------------|
| [gym-review](gym-review/) | Deterministic anti-pattern checker + judgment-based review | Reviewing PRs, auditing servers before merge |
| [gym-debug](gym-debug/) | Diagnose server failures, rollout errors, unexpected rewards | Servers won't start, rollouts hang, rewards look wrong |
| [gym-run](gym-run/) | Run benchmarks — env.yaml setup, server launch, rollout collection | First run, smoke testing, full rollout collection |
| [gym-profile](gym-profile/) | Analyze rollout results, reward distributions, pass rates | Baselining benchmarks, comparing models, investigating variance |
| [gym-config](gym-config/) | Compose and validate Hydra YAML configurations | Setting up server configs, debugging composition errors |
| [gym-data](gym-data/) | Prepare, validate, and register JSONL datasets | Converting data, uploading to GitLab registry, validating schemas |
| [gym-scaffold-agent](gym-scaffold-agent/) | Create custom agent servers | Multi-turn interaction, external library wrapping, tool orchestration |
| [add-benchmark](add-benchmark/) | End-to-end benchmark creation guide | Adding a new resources server + agent + data + config |

## Chains

Chains compose skills into multi-step workflows. Defined in [`chains.yaml`](chains.yaml).

| Chain | Steps | Use case |
|-------|-------|----------|
| **run** | gym-config > gym-run > gym-profile | Executing a configured benchmark end-to-end |
| **new-benchmark** | add-benchmark > gym-data > gym-config > gym-run > gym-profile > gym-review | Building a benchmark from scratch |
| **validate** | gym-config > gym-data > gym-run > gym-profile | Checking an existing benchmark works correctly |
| **diagnose** | gym-debug > gym-review | Debugging a failing benchmark |
| **external-integration** | gym-scaffold-agent > gym-data > gym-config > gym-run > gym-profile > gym-review | Wrapping a 3rd-party benchmark library |
| **pre-merge** | gym-review > gym-config > gym-data | Final checks before merging a PR |
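
The `chains.yaml` schema isn't reproduced in this README. As a rough, hypothetical sketch only (the field names below are assumptions for illustration, not the file's actual keys), an entry for the **diagnose** chain might look like:

```yaml
# Hypothetical sketch -- consult chains.yaml for the real schema.
chains:
  diagnose:
    description: Debug a failing benchmark
    steps:
      - gym-debug
      - gym-review
```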

## Skill structure

Each skill follows a consistent layout:

```
skill-name/
  SKILL.md            # Skill definition (YAML frontmatter + instructions)
  evals/
    evals.json        # Assertion-based evaluations
    files/            # Self-contained test fixtures (if applicable)
  references/         # Portable reference docs (if applicable)
  scripts/            # Deterministic tooling (if applicable)
```

**gym-review** is the reference implementation: it includes a standalone Python checker (`scripts/review.py`), self-contained reference docs, and eval fixtures that work without the NeMo Gym repo.

## Evaluating skills

Each skill has 3 evals in `evals/evals.json`. Evals follow the [agentskills.io evaluation spec](https://agentskills.io/skill-creation/evaluating-skills).

### Running evals

Compare agent performance **with-skill** vs **without-skill** (baseline):

1. **With-skill**: Load the SKILL.md, give the agent the eval prompt, grade the response against assertions.
2. **Without-skill (baseline)**: Give the agent the same prompt with no skill loaded, grade against the same assertions.
3. **Compute delta**: The percentage-point improvement from loading the skill.

Each eval in `evals.json` has:

```json
{
  "id": 1,
  "prompt": "The task the agent must perform",
  "expected_output": "What a good response looks like",
  "files": ["evals/files/fixture.py"],
  "assertions": [
    "Specific claim that must be true in the response",
    "Another required element"
  ]
}
```

### Grading

For each assertion, score 1 (present in response) or 0 (missing). The skill's score is the average across all assertions and evals. A skill is useful when its with-skill score meaningfully exceeds the baseline.
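
For a concrete sense of the arithmetic, here is a minimal, illustrative sketch (the helper and the scores below are invented for this example, not part of the repo):

```python
def skill_score(assertion_scores: list[int]) -> float:
    """Average of per-assertion scores: 1 = present in response, 0 = missing."""
    return sum(assertion_scores) / len(assertion_scores)

# Illustrative numbers: 9 assertions graded across one skill's 3 evals
with_skill = skill_score([1, 1, 1, 1, 1, 0, 1, 1, 1])  # ~0.89
baseline = skill_score([1, 0, 0, 0, 0, 0, 1, 0, 0])    # ~0.22
delta_pp = (with_skill - baseline) * 100               # ~67 percentage points
```

Here the skill would count as useful: loading it lifts the average assertion score by roughly 67 percentage points over the baseline.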

### Example: gym-review

```bash
# The review script can also be tested directly
python .claude/skills/gym-review/scripts/review.py .claude/skills/gym-review/evals/files/

# Expected: 9 BLOCK, 4 WARN across the fixture files
# sample_clean_server.py should produce 0 findings
```

## Portability

Skills are designed to work when pulled standalone. Key design principles:

- **References are self-contained** -- no links to repo-internal paths that won't exist for external users
- **Scripts have zero dependencies** -- `review.py` uses only the Python standard library
- **Eval fixtures are bundled** -- test files live in `evals/files/`, not scattered across the repo
- **SKILL.md frontmatter** includes `license`, `compatibility`, and `allowed-tools` per the spec
9 changes: 7 additions & 2 deletions .claude/skills/add-benchmark/SKILL.md
@@ -6,8 +6,13 @@ description: >
   training environment, or resources server into NeMo-Gym. Also use when wrapping
   an existing 3rd-party benchmark library. Covers the full workflow: data preparation,
   resources server implementation, agent wiring, YAML config, testing, and reward
-  profiling (baselining). Triggered by: "add benchmark", "new resources server",
-  "integrate benchmark", "wrap benchmark", "add training environment", "add eval".
+  profiling (baselining).
+license: Apache-2.0
+compatibility: Requires Python 3.12+, uv, git. NeMo Gym must be installed.
+metadata:
+  author: nvidia-nemo-gym
+  version: "1.0"
+allowed-tools: Bash(python:*) Bash(ng_*) Bash(git:*) Bash(pre-commit:*) Read Write Edit Grep Glob
 ---
 
 # Add Benchmark to NeMo-Gym
46 changes: 46 additions & 0 deletions .claude/skills/add-benchmark/evals/evals.json
@@ -0,0 +1,46 @@
{
  "skill_name": "add-benchmark",
  "evals": [
    {
      "id": 1,
      "prompt": "I want to add a math algebra benchmark. Here's a sample server at evals/files/sample_math_server.py, data at evals/files/sample_math_example.jsonl, and config at evals/files/sample_math_config.yaml. Review them for completeness.",
      "expected_output": "Review confirming server, data, and config are correct, identifying any missing pieces.",
      "files": ["evals/files/sample_math_server.py", "evals/files/sample_math_example.jsonl", "evals/files/sample_math_config.yaml"],
      "assertions": [
        "Server extends SimpleResourcesServer with async verify()",
        "Think-block stripping is present and validated",
        "Rewards are binary (0.0 or 1.0)",
        "example.jsonl has valid entries with responses_create_params and verifier_metadata",
        "Config correctly wires resources server, agent, and datasets",
        "verified: false is confirmed for new server",
        "Response identifies any missing pieces (requirements.txt, README.md)"
      ]
    },
    {
      "id": 2,
      "prompt": "Review the test file at evals/files/sample_math_test.py for my math benchmark. Does it cover enough cases?",
      "expected_output": "Test coverage assessment identifying what's covered and what's missing.",
      "files": ["evals/files/sample_math_test.py", "evals/files/sample_math_server.py"],
      "assertions": [
        "Test coverage assessed: pass, fail (wrong answer), fail (no extraction), fail (think block)",
        "Missing test case identified: timeout handling",
        "Response notes test coverage is close to but not at the 95% requirement",
        "Tests correctly validate binary rewards",
        "Response recommends adding edge cases (empty output, very long output)"
      ]
    },
    {
      "id": 3,
      "prompt": "I want to wrap an external code execution benchmark. The library uses httpx for API calls and has its own scoring. How should I structure this?",
      "expected_output": "Integration guide recommending agent-level wrapping with httpx replacement.",
      "assertions": [
        "Agent-level integration recommended (wrap in /run, not /verify)",
        "httpx replacement with aiohttp adapter is mentioned",
        "Pre-processing from Gym schema to library format described",
        "Post-processing to BaseVerifyResponse with reward described",
        "Reproduce published numbers with original repo first, then reproduce after integration",
        "asyncio.Semaphore for concurrent library calls mentioned"
      ]
    }
  ]
}
33 changes: 33 additions & 0 deletions .claude/skills/add-benchmark/evals/files/sample_math_config.yaml
@@ -0,0 +1,33 @@
math_algebra:
  resources_servers:
    math_algebra:
      entrypoint: app.py
      domain: math
      verified: false
      datasets:
        - name: math_example
          type: example
          jsonl_fpath: resources_servers/math_algebra/data/example.jsonl
        - name: math_train
          type: train
          jsonl_fpath: resources_servers/math_algebra/data/train.jsonl
          gitlab_identifier:
            dataset_name: math_algebra
            version: 0.0.1
            artifact_fpath: train.jsonl
          license: Apache-2.0

math_agent:
  responses_api_agents:
    simple_agent:
      entrypoint: app.py
      resources_server:
        type: resources_servers
        name: math_algebra
      model_server:
        type: responses_api_models
        name: policy_model
      datasets:
        - name: math_example
          type: example
          jsonl_fpath: resources_servers/math_algebra/data/example.jsonl
5 changes: 5 additions & 0 deletions .claude/skills/add-benchmark/evals/files/sample_math_example.jsonl
@@ -0,0 +1,5 @@
{"responses_create_params": {"input": [{"role": "system", "content": "Solve the algebra problem. Show your work, then give your final numerical answer after 'Answer:'."}, {"role": "user", "content": "If x + 5 = 12, what is x?"}]}, "verifier_metadata": {"expected_answer": "7"}}
{"responses_create_params": {"input": [{"role": "system", "content": "Solve the algebra problem. Show your work, then give your final numerical answer after 'Answer:'."}, {"role": "user", "content": "A store sells apples for $2 each. If you buy 8 apples and pay with a $20 bill, how much change do you get?"}]}, "verifier_metadata": {"expected_answer": "4"}}
{"responses_create_params": {"input": [{"role": "system", "content": "Solve the algebra problem. Show your work, then give your final numerical answer after 'Answer:'."}, {"role": "user", "content": "Solve for y: 3y - 9 = 0"}]}, "verifier_metadata": {"expected_answer": "3"}}
{"responses_create_params": {"input": [{"role": "system", "content": "Solve the algebra problem. Show your work, then give your final numerical answer after 'Answer:'."}, {"role": "user", "content": "What is the area of a rectangle with length 7 and width 4?"}]}, "verifier_metadata": {"expected_answer": "28"}}
{"responses_create_params": {"input": [{"role": "system", "content": "Solve the algebra problem. Show your work, then give your final numerical answer after 'Answer:'."}, {"role": "user", "content": "If 2x + 3 = 15, what is x?"}]}, "verifier_metadata": {"expected_answer": "6"}}
48 changes: 48 additions & 0 deletions .claude/skills/add-benchmark/evals/files/sample_math_server.py
@@ -0,0 +1,48 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Math algebra resources server — extracts numerical answer and compares to expected."""

import re

from nemo_gym.servers.resources_server import SimpleResourcesServer


class MathAlgebraServer(SimpleResourcesServer):
    async def verify(self, body):
        output_text = body.get("output_text", "")
        verifier_metadata = body.get("verifier_metadata", {})
        expected_answer = str(verifier_metadata.get("expected_answer", ""))

        # Strip think blocks before extraction
        if "</think>" in output_text:
            output_text = output_text.split("</think>")[-1].strip()

        # Extract answer after "Answer:" marker
        match = re.search(r"(?:Answer|ANSWER)\s*[:\s]\s*(.+)", output_text)
        if not match:
            return {"reward": 0.0, "extracted_answer": None, "reason": "no_answer_marker"}

        extracted = match.group(1).strip().rstrip(".")

        # Compare normalized answers
        if self._normalize(extracted) == self._normalize(expected_answer):
            return {"reward": 1.0, "extracted_answer": extracted}
        else:
            return {"reward": 0.0, "extracted_answer": extracted, "reason": "wrong_answer"}

    @staticmethod
    def _normalize(s: str) -> str:
        """Normalize answer for comparison: strip whitespace, lowercase."""
        return re.sub(r"\s+", " ", s.strip().lower())
76 changes: 76 additions & 0 deletions .claude/skills/add-benchmark/evals/files/sample_math_test.py
@@ -0,0 +1,76 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for the math algebra resources server."""

import pytest


@pytest.fixture
def server():
    from sample_math_server import MathAlgebraServer

    return MathAlgebraServer()


@pytest.mark.asyncio
async def test_verify_pass(server):
    """Correct answer should get reward 1.0."""
    result = await server.verify(
        {
            "output_text": "Let me solve this.\nx + 5 = 12\nx = 7\nAnswer: 7",
            "verifier_metadata": {"expected_answer": "7"},
        }
    )
    assert result["reward"] == 1.0
    assert result["extracted_answer"] == "7"


@pytest.mark.asyncio
async def test_verify_fail_wrong_answer(server):
    """Wrong answer should get reward 0.0."""
    result = await server.verify(
        {
            "output_text": "I think the answer is:\nAnswer: 5",
            "verifier_metadata": {"expected_answer": "7"},
        }
    )
    assert result["reward"] == 0.0
    assert result["reason"] == "wrong_answer"


@pytest.mark.asyncio
async def test_verify_fail_no_answer(server):
    """Missing 'Answer:' marker should get reward 0.0."""
    result = await server.verify(
        {
            "output_text": "The solution is 7, which we can verify by substituting back.",
            "verifier_metadata": {"expected_answer": "7"},
        }
    )
    assert result["reward"] == 0.0
    assert result["reason"] == "no_answer_marker"


@pytest.mark.asyncio
async def test_verify_fail_think_block(server):
    """Answer only inside think block should get 0.0 after stripping."""
    result = await server.verify(
        {
            "output_text": "<think>\nThe answer is 7.\nAnswer: 7\n</think>",
            "verifier_metadata": {"expected_answer": "7"},
        }
    )
    assert result["reward"] == 0.0
    assert result["reason"] == "no_answer_marker"