chore: add data-designer skill evals #718
Review: PR #718 —
Greptile Summary
This PR adds initial eval coverage for the
| Filename | Overview |
|---|---|
| skills/data-designer/evals/evals.json | Adds a single positive eval case as a bare JSON object; the directive question bypasses routing testing, and the single-object structure prevents appending additional cases without restructuring the file. |
| skills/data-designer/BENCHMARK.md | New benchmark report documenting NVSkills-Eval results; explicitly notes 4 evaluation tasks were run but the source dataset was not available in the report payload. |
| skills/data-designer/SKILL.md | Adds license and metadata.owner fields to the frontmatter to address Tier 1 findings; no behavioral changes. |
| skills/data-designer/skill-card.md | New skill card documenting description, owner, license, use case, evaluation results, and ethical considerations; informational only. |
| skills/data-designer/skill.oms.sig | New Sigstore/in-toto signature bundle covering the skill artifact files including evals/evals.json; generated by svc-nvskills-signing. |
| .github/workflows/dco-assistant.yml | Adds svc-nvskills-signing service account to the DCO allowlist so automated skill-signing commits are not blocked by the DCO check. |
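The structural concern in the `evals.json` row can be illustrated with a minimal sketch. It assumes a loader that tolerates both the current bare-object shape and an array of cases; the helper name `load_eval_cases` and the placeholder questions are hypothetical and are not part of NVSkills-Eval or this PR.

```python
import json


def load_eval_cases(text: str) -> list[dict]:
    """Parse eval JSON and always return a list of case objects.

    Hypothetical helper: accepts a bare object (the current file shape)
    or an array of objects (the shape that allows appending cases).
    """
    data = json.loads(text)
    if isinstance(data, dict):
        # A single case stored as a bare object: wrap it in a list.
        return [data]
    if isinstance(data, list) and all(isinstance(c, dict) for c in data):
        return data
    raise ValueError("evals JSON must be an object or an array of objects")


# Placeholder content; the real case has more fields than shown here.
bare = '{"question": "q1", "expected_skill": "data-designer"}'
cases = load_eval_cases(bare)
cases.append({"question": "q2", "expected_skill": "data-designer"})
print(json.dumps(cases, indent=2))
```

Storing the file as an array from the start would make a helper like this unnecessary, since new cases could simply be appended.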
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[evals.json\nsingle eval case] -->|expected_skill| B{Skill routing\ncheck}
A -->|expected_script| C{Output script\ncheck}
A -->|expected_behavior| D{Behavior\ncheck}
A -->|ground_truth| E{Correctness\ncheck}
B -->|question names skill explicitly| F["Always passes\n(routing not tested)"]
C --> G[customer_support_tickets.py]
D --> H[Workflow + person-sampling\nbehavior steps]
E --> I[load_config_builder returns\nDataDesignerConfigBuilder]
J[PR description:\nnegative evals] -.->|not present| A
K[Single JSON object\nnot array] -.->|blocks appending\nadditional cases| A
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
skills/data-designer/evals/evals.json:1-13
**Directive question bypasses the routing check it claims to cover**
The `question` opens with "Use the data-designer skill to create…", so the agent is explicitly told which skill to invoke. As a result, `expected_skill: "data-designer"` will always pass regardless of whether the routing logic is actually correct — a genuine routing failure cannot be caught by this case. The PR description lists "Autopilot routing" as one of the primary behaviors this eval covers, but a prompt that names the skill is an execution test only. A routing-focused case should present a natural task description (e.g. "Generate a synthetic customer support ticket dataset…") and let the harness verify the agent selects the skill autonomously.
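One way to catch this class of problem early is a lint over the eval cases that flags any `question` naming its own `expected_skill`. The sketch below is an assumption about how such a check could look, not part of the harness: `names_skill` and `routing_cases` are hypothetical helpers, and the two questions are shortened stand-ins for the real prompts.

```python
import re


def names_skill(question: str, skill: str) -> bool:
    """Return True if the prompt mentions the skill name explicitly."""
    # Treat hyphens, underscores and spaces as interchangeable separators,
    # so "data-designer", "data designer" and "data_designer" all match.
    parts = [re.escape(p) for p in re.split(r"[-_\s]+", skill) if p]
    pattern = r"\b" + r"[-_\s]+".join(parts) + r"\b"
    return re.search(pattern, question, flags=re.IGNORECASE) is not None


def routing_cases(cases: list[dict]) -> list[dict]:
    """Keep only cases whose question does not name the expected skill."""
    return [
        c for c in cases
        if not names_skill(c["question"], c["expected_skill"])
    ]


cases = [
    {
        "question": "Use the data-designer skill to create a dataset.",
        "expected_skill": "data-designer",
    },
    {
        "question": "Generate a synthetic customer support ticket dataset.",
        "expected_skill": "data-designer",
    },
]
print([c["question"] for c in routing_cases(cases)])
```

Only the second case survives the filter, which matches the review's point: the directive prompt exercises execution, while the natural task description is the one that can actually exercise routing.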
Reviews (14): Last reviewed commit: "Attach NVSkills validation signatures"
0a5e916 to b6cd817
/nvskills-ci
@johnnygreco - I think this is failing because it's missing the DCO sign-off. Run `git rebase --signoff origin/main && git push --force-with-lease`.
d0f0a40 to 467a900
/nvskills-ci
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
467a900 to abf988a
/nvskills-ci
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
/nvskills-ci
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
/nvskills-ci
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
/nvskills-ci
Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>
Thank you for your submission! We ask that you all sign our Developer Certificate of Origin before we can accept your contribution. You can sign the DCO by adding a comment below using this text: I have read the DCO document and I hereby sign the DCO. 1 out of 2 committers have signed the DCO.
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
/nvskills-ci
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
/nvskills-ci
Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>
📋 Summary
Adds targeted eval coverage for the `data-designer` skill so Autopilot routing and skill-specific behaviors are easier to verify. The cases focus on Data Designer workflow use, person sampling, LLM judge score access, sampler params, and unrelated negative prompts.
🔗 Related Issue
N/A
🔄 Changes
- `skills/data-designer/evals/evals.json` with focused positive evals for Autopilot dataset generation scenarios.
🧪 Testing
- `make test` passes: not run; eval JSON only
- `python3 -m json.tool skills/data-designer/evals/evals.json` passes
✅ Checklist