chore: add data-designer skill evals by johnnygreco · Pull Request #718 · NVIDIA-NeMo/DataDesigner

johnnygreco · 2026-05-29T20:12:33Z

📋 Summary

Adds targeted eval coverage for the data-designer skill so Autopilot routing and skill-specific behaviors are easier to verify. The cases focus on Data Designer workflow use, person sampling, LLM judge score access, sampler params, and unrelated negative prompts.

🔗 Related Issue

N/A

🔄 Changes

Add skills/data-designer/evals/evals.json with focused positive evals for Autopilot dataset generation scenarios.
Add negative evals for unrelated PostgreSQL and React UI requests to verify the skill is not selected outside dataset-generation tasks.
Keep behavior checks short, targeted, and easy to grade as true or false.

🧪 Testing

make test passes — not run; eval JSON only
Unit tests added/updated — skill evals added
E2E tests added/updated (if applicable) — N/A
python3 -m json.tool skills/data-designer/evals/evals.json passes

✅ Checklist

Follows commit message conventions
Commits are signed off (DCO) — commit was not created with signoff
Architecture docs updated (if applicable) — N/A

github-actions · 2026-05-29T20:14:37Z

Review: PR #718 — `test: add data-designer skill evals`

Summary

Adds a single new file, skills/data-designer/evals/evals.json (86 lines), containing six eval cases for the data-designer skill: four positive cases exercising Autopilot routing plus specific behaviors (person sampling, LLM judge score access, sampler params), and two negative cases (PostgreSQL admin, React UI) that should not invoke the skill. No existing files are modified. JSON parses cleanly.

This is purely test/eval data — no runtime code paths are touched, blast radius is limited to anyone consuming the eval file.

Findings

Correctness & alignment with SKILL.md

The eval expectations line up well with the rules documented in skills/data-designer/SKILL.md:

The sampler-params case asserts params is used and sampler_params is not — directly mirrors the SKILL.md rule at line 37 (SamplerColumnConfig: Takes params, not sampler_params).
The llm-judge-scores case asserts the accepted-boolean derivation references .score — mirrors the LLM judge access guidance at SKILL.md line 38.
The person-reviews case references get_person_object_schema.py, which exists at skills/data-designer/scripts/get_person_object_schema.py. ✓
Both negative cases assert the skill is not selected for unrelated tasks (DB admin, React UI), which is appropriate given the SKILL.md description scopes the skill to dataset/synthetic-data/data-generation tasks.

Schema consistency

All six entries share the same shape: id, question, expected_skill, expected_script, ground_truth, expected_behavior. Good — uniform schema makes downstream grading simpler.
expected_script is null for five entries and "get_person_object_schema.py" for one. That's the only structured assertion about which script the agent should run; the rest of script-execution checks live in expected_behavior strings. Worth noting the schema isn't documented anywhere (no README in evals/, no JSON Schema), so the meaning of expected_script vs. the free-text expected_behavior items is implicit. Not blocking, but a short evals/README.md describing the schema and how cases are graded would help future contributors.

Behavior assertions — graders need to match

The expected_behavior strings are natural-language predicates that presumably get fed to an LLM judge. A few observations:

Sampler-params case, item "The site column is generated by a category sampler" — strong assertion. The user prompt mentions only site as a column; the agent could reasonably pick category or subcategory. Probably fine as written, but if the grader is strict about category vs. any categorical sampler, this could be flaky.
Person-reviews case, item "The agent ran python scripts/get_person_object_schema.py with a locale argument" — assumes the script accepts a locale arg. Worth a quick sanity check that the script's CLI actually supports this; if it's optional, a flaky-grader risk exists.
Negative cases — the assertion "The agent did not read the data-designer SKILL.md" is a precise behavior, but Autopilot/skill-invocation typically reads SKILL.md as part of skill selection, not just execution. Whether the harness counts skill-selection reads against this assertion depends on grader semantics. Worth verifying with one dry run of the negative cases that a passing agent actually trips zero reads.
"The agent did not ask the user a clarifying question before building the script" appears in only one positive case, even though all four positive prompts include "do not ask follow-up questions" / "be opinionated" language. Consider adding this assertion to the other three for consistency, or deliberately leaving it off if the grading focus is different.

Project conventions

File location (skills/<skill>/evals/evals.json) is the first instance of this pattern in the repo (no other evals*.json files exist). The PR establishes the convention rather than following one — fine, but worth being deliberate. If multiple skills will get evals, codifying the layout (one file per skill, top-level array, this schema) in skills/README.md or similar would prevent drift.
No SPDX/license header — appropriate, JSON doesn't carry one.
No tests or CI wiring is added to consume the eval file. The PR description notes this is intentional ("eval JSON only"), and presumably an external evaluator picks it up. If there's a make target or repo-internal runner, a follow-up that wires it in would close the loop.

Minor

id values use kebab-case prefixed with data-designer- — consistent and grep-friendly. ✓
ground_truth and expected_behavior are slightly redundant in places (e.g., the positive cases' ground_truth restates what the behavior list already covers). Not a problem, just verbose.
DCO sign-off is missing per the PR checklist. Will need to be addressed before merge if this repo enforces DCO.

Risks

Low. No code paths change. Worst case is a flaky eval grader; that surfaces as eval-suite noise, not user-facing breakage.
The eval assertions are tightly coupled to current SKILL.md wording (e.g., file names, the .score rule, the params vs. sampler_params rule). Future SKILL.md edits will need to keep this file in sync — a one-line note in SKILL.md or a comment in the eval file pointing at the dependency would help.

Verdict

Looks good. Small, well-scoped, internally consistent, and the assertions match documented skill behavior. Suggested non-blocking follow-ups:

Add a brief skills/data-designer/evals/README.md (or top-level skills/EVALS.md) describing the schema and grading model.
Verify get_person_object_schema.py actually accepts a locale argument before this eval runs in CI.
Consider tightening or relaxing the "did not read SKILL.md" assertion on negative cases based on how the grader treats skill-selection reads.
Address the missing DCO sign-off.

greptile-apps · 2026-05-29T20:15:25Z

Greptile Summary

This PR adds initial eval coverage for the data-designer skill, including a single positive eval case in evals/evals.json, a benchmark report (BENCHMARK.md), a skill card, a Sigstore signature bundle, and minor SKILL.md metadata additions. The DCO workflow is also updated to allowlist the svc-nvskills-signing service account.

evals/evals.json: Adds one positive eval case as a bare JSON object; the question explicitly names the skill, so expected_skill passes trivially and routing is not actually exercised. The single-object structure also prevents appending the negative cases (PostgreSQL, React UI) described in the PR without restructuring.
BENCHMARK.md + skill-card.md: New documentation files reporting NVSkills-Eval results (4 tasks run externally; source dataset not included in the payload) and the skill's public-facing card.
SKILL.md / .github/workflows/dco-assistant.yml: Small additions — frontmatter metadata fields and a DCO allowlist entry for the signing service account.

Confidence Score: 4/5

Safe to merge; the changes are documentation and eval metadata only, with no runtime code modified.

All changed files are documentation, JSON eval data, and a signature bundle — no executable code is touched. The single eval case in evals.json has a directive question that names the skill explicitly, which means the routing dimension it is supposed to cover is not actually exercised; and the bare-object structure blocks the negative cases described in the PR from being added later without a file restructure. These gaps limit the eval's usefulness but don't break anything currently running.

skills/data-designer/evals/evals.json — eval question design and file structure deserve a second look before more cases are added.

Important Files Changed

Filename	Overview
skills/data-designer/evals/evals.json	Adds a single positive eval case as a bare JSON object; the directive question bypasses routing testing, and the single-object structure prevents appending additional cases without restructuring the file.
skills/data-designer/BENCHMARK.md	New benchmark report documenting NVSkills-Eval results; explicitly notes 4 evaluation tasks were run but the source dataset was not available in the report payload.
skills/data-designer/SKILL.md	Adds license and metadata.owner fields to the frontmatter to address Tier 1 findings; no behavioral changes.
skills/data-designer/skill-card.md	New skill card documenting description, owner, license, use case, evaluation results, and ethical considerations; informational only.
skills/data-designer/skill.oms.sig	New Sigstore/in-toto signature bundle covering the skill artifact files including evals/evals.json; generated by svc-nvskills-signing.
.github/workflows/dco-assistant.yml	Adds svc-nvskills-signing service account to the DCO allowlist so automated skill-signing commits are not blocked by the DCO check.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[evals.json\nsingle eval case] -->|expected_skill| B{Skill routing\ncheck}
    A -->|expected_script| C{Output script\ncheck}
    A -->|expected_behavior| D{Behavior\ncheck}
    A -->|ground_truth| E{Correctness\ncheck}

    B -->|question names skill explicitly| F["Always passes\n(routing not tested)"]
    C --> G[customer_support_tickets.py]
    D --> H[Workflow + person-sampling\nbehavior steps]
    E --> I[load_config_builder returns\nDataDesignerConfigBuilder]

    J[PR description:\nnegative evals] -.->|not present| A
    K[Single JSON object\nnot array] -.->|blocks appending\nadditional cases| A

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
skills/data-designer/evals/evals.json:1-13
**Directive question bypasses the routing check it claims to cover**

The `question` opens with "Use the data-designer skill to create…", so the agent is explicitly told which skill to invoke. As a result, `expected_skill: "data-designer"` will always pass regardless of whether the routing logic is actually correct — a genuine routing failure cannot be caught by this case. The PR description lists "Autopilot routing" as one of the primary behaviors this eval covers, but a prompt that names the skill is an execution test only. A routing-focused case should present a natural task description (e.g. "Generate a synthetic customer support ticket dataset…") and let the harness verify the agent selects the skill autonomously.

_{Reviews (14): Last reviewed commit: "Attach NVSkills validation signatures" | Re-trigger Greptile}

johnnygreco · 2026-05-29T22:47:46Z

/nvskills-ci

calliente · 2026-05-30T03:30:43Z

@johnnygreco - I think this is failing because its missing the DCO sign-off. Run git rebase --signoff origin/main && git push --force-with-lease

johnnygreco · 2026-06-01T14:41:06Z

/nvskills-ci

Signed-off-by: Johnny Greco <jogreco@nvidia.com>

johnnygreco · 2026-06-02T14:59:55Z

/nvskills-ci

Signed-off-by: Johnny Greco <jogreco@nvidia.com>

johnnygreco · 2026-06-02T15:28:19Z

/nvskills-ci

Signed-off-by: Johnny Greco <jogreco@nvidia.com>

johnnygreco · 2026-06-02T15:34:19Z

/nvskills-ci

Signed-off-by: Johnny Greco <jogreco@nvidia.com>

johnnygreco · 2026-06-02T16:04:33Z

/nvskills-ci

Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>

github-actions · 2026-06-02T16:21:43Z

Thank you for your submission! We ask that you all sign our Developer Certificate of Origin before we can accept your contribution. You can sign the DCO by adding a comment below using this text:

I have read the DCO document and I hereby sign the DCO.

1 out of 2 committers have signed the DCO.
✅ (johnnygreco)[https://github.com/johnnygreco]
❌ @svc-nvskills-signing
_{You can retrigger this bot by commenting recheck in this Pull Request.}_{Posted by the DCO Assistant Lite bot.}

Signed-off-by: Johnny Greco <jogreco@nvidia.com>

johnnygreco · 2026-06-02T16:24:22Z

/nvskills-ci

Signed-off-by: Johnny Greco <jogreco@nvidia.com>

johnnygreco · 2026-06-02T16:36:35Z

/nvskills-ci

Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>

johnnygreco requested a review from a team as a code owner May 29, 2026 20:12

johnnygreco temporarily deployed to agentic-ci May 29, 2026 20:12 — with GitHub Actions Inactive

greptile-apps Bot reviewed May 29, 2026

View reviewed changes

Comment thread skills/data-designer/evals/evals.json Outdated

johnnygreco force-pushed the johnny/chore/data-designer-skill-evals branch from 0a5e916 to b6cd817 Compare May 29, 2026 22:36

calliente mentioned this pull request May 30, 2026

Register Data Designer skill NVIDIA/skills#108

Open

12 tasks

johnnygreco changed the title ~~test: add data-designer skill evals~~ chore: add data-designer skill evals Jun 1, 2026

johnnygreco force-pushed the johnny/chore/data-designer-skill-evals branch 2 times, most recently from d0f0a40 to 467a900 Compare June 1, 2026 14:35

johnnygreco added 2 commits June 2, 2026 14:59

test: add data-designer skill evals

72fbd2d

test: simplify data-designer skill evals

abf988a

Signed-off-by: Johnny Greco <jogreco@nvidia.com>

johnnygreco force-pushed the johnny/chore/data-designer-skill-evals branch from 467a900 to abf988a Compare June 2, 2026 14:59

test: align data-designer skill eval with nvskills

262f6e5

Signed-off-by: Johnny Greco <jogreco@nvidia.com>

greptile-apps Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread skills/data-designer/evals/evals.json

test: keep data-designer workflow wording unchanged

7f4dde8

Signed-off-by: Johnny Greco <jogreco@nvidia.com>

test: add data-designer skill publishing metadata

812a0ad

Signed-off-by: Johnny Greco <jogreco@nvidia.com>

Attach NVSkills validation signatures

7f380e7

Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>

ci: allow nvskills signing account for dco

c1e52e8

Signed-off-by: Johnny Greco <jogreco@nvidia.com>

test: set data-designer skill owner

37a067d

Signed-off-by: Johnny Greco <jogreco@nvidia.com>

Attach NVSkills validation signatures

45c323c

Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>

Conversation

johnnygreco commented May 29, 2026

📋 Summary

🔗 Related Issue

🔄 Changes

🧪 Testing

✅ Checklist

Uh oh!

github-actions Bot commented May 29, 2026

Review: PR #718 — test: add data-designer skill evals

Summary

Findings

Correctness & alignment with SKILL.md

Schema consistency

Behavior assertions — graders need to match

Project conventions

Minor

Risks

Verdict

Uh oh!

greptile-apps Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

johnnygreco commented May 29, 2026

Uh oh!

calliente commented May 30, 2026

Uh oh!

johnnygreco commented Jun 1, 2026

Uh oh!

johnnygreco commented Jun 2, 2026

Uh oh!

johnnygreco commented Jun 2, 2026

Uh oh!

Uh oh!

johnnygreco commented Jun 2, 2026

Uh oh!

johnnygreco commented Jun 2, 2026

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

johnnygreco commented Jun 2, 2026

Uh oh!

johnnygreco commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Review: PR #718 — `test: add data-designer skill evals`

greptile-apps Bot commented May 29, 2026 •

edited

Loading