skills/cuopt-numerical-optimization-api-cli/BENCHMARK.md
+11 -19 (11 additions & 19 deletions)
@@ -7,11 +7,11 @@ This benchmark summarizes 3-Tier Evaluation from NVSkills-Eval results for the s
 ## Evaluation Summary
 
 - Skill: `cuopt-numerical-optimization-api-cli`
-- Evaluation date: 2026-05-29
+- Evaluation date: 2026-07-01
 - NVSkills-Eval profile: `external`
-- Environment: `local`
+- Environment: `astra-sandbox`
 - Dataset: 1 evaluation tasks
-- Attempts per task: 2
+- Attempts per task: 1
 - Pass threshold: 50%
 - Overall verdict: PASS
 
@@ -54,34 +54,26 @@ Task composition is derived from the evaluation dataset when possible. Entries w
 
 | Dimension | Num |`claude-code`|`codex`|
 |---|---:|---:|---:|
-| Security |2| 100% (+0%) | 100% (+0%) |
-| Correctness |2|100% (+0%) |97% (+5%) |
-| Discoverability |2|100% (+0%) | 84% (+5%) |
-| Effectiveness |2|78% (+2%) |76% (+4%) |
-| Efficiency |2|93% (-0%) |78% (-0%) |
+| Security |1| 100% (+0%) | 100% (+0%) |
+| Correctness |1|30% (+0%) |77% (+47%) |
+| Discoverability |1|0% (+0%) | 84% (+84%) |
+| Effectiveness |1|39% (-4%) |41% (+4%) |
+| Efficiency |1|27% (-0%) |77% (+49%) |
 
 Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available.
 
 ## Tier 1: Static Validation Summary
 
-Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 8 total findings.
+Tier 1 validation passed with observations. NVSkills-Eval ran 1 checks and found 2 total findings.
 
 Top findings:
 
 - MEDIUM SCHEMA/body_recommended_section: Missing recommended section: '## Instructions' (`skills/cuopt-numerical-optimization-api-cli/SKILL.md`)
-- LOW QUALITY/quality_discoverability: Broad description without negative triggers may cause over-triggering (`skills/cuopt-numerical-optimization-api-cli/SKILL.md`)
-- LOW QUALITY/quality_discoverability: No '## Purpose' section (`skills/cuopt-numerical-optimization-api-cli/SKILL.md`)
-- LOW QUALITY/quality_reliability: No prerequisites/requirements documented (`skills/cuopt-numerical-optimization-api-cli/SKILL.md`)
-- LOW QUALITY/quality_reliability: No limitations documented (`skills/cuopt-numerical-optimization-api-cli/SKILL.md`)
+- LOW SCHEMA/author_format: Author must be of the form 'Name <email@host>' (`skills/cuopt-numerical-optimization-api-cli/SKILL.md`)
 
 ## Tier 2: Deduplication Summary
 
-Tier 2 validation passed. NVSkills-Eval ran 2 checks and found 0 total findings.
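
The changed file notes that each table cell shows the skill-assisted score, with the uplift over the no-skill baseline in parentheses. When reviewing a change like this one, it can help to recover the implied baseline from a cell. The sketch below assumes the uplift is a percentage-point difference (score minus baseline), which the file does not state explicitly; `ScoreCell`, `parse_cell`, and `implied_baseline` are hypothetical helper names, not part of NVSkills-Eval.

```python
import re
from typing import NamedTuple, Optional


class ScoreCell(NamedTuple):
    """One parsed table cell, e.g. '77% (+47%)' (hypothetical helper type)."""
    score: int             # skill-assisted score, in percent
    uplift: Optional[int]  # change versus baseline, or None if absent


# Matches '77% (+47%)', '93% (-0%)', or a bare '84%'.
_CELL = re.compile(r"^\s*(\d+)%\s*(?:\(\s*([+-]\d+)%\s*\))?\s*$")


def parse_cell(cell: str) -> ScoreCell:
    """Parse a score cell into its score and optional uplift."""
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized score cell: {cell!r}")
    score = int(match.group(1))
    uplift = int(match.group(2)) if match.group(2) is not None else None
    return ScoreCell(score, uplift)


def implied_baseline(cell: ScoreCell) -> Optional[int]:
    """Assumption: uplift is in percentage points, so baseline = score - uplift."""
    if cell.uplift is None:
        return None
    return cell.score - cell.uplift
```

Under that assumption, `implied_baseline(parse_cell("77% (+47%)"))` returns 30. The displayed values are rounded (note the `-0%` entries), so any baseline recovered this way is approximate.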