Commit 303b4e7

kvmto authored and vedika-saravanan committed
Restructuring based on sota and Anthropic guidelines
Signed-off-by: kvmto <kmato@nvidia.com>
1 parent e23d64b commit 303b4e7

40 files changed

Lines changed: 5661 additions & 1489 deletions

.claude/evals/README.md

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
# CUDA-QX skill evaluation harness

This tree is **not** a skill. It evaluates the skills under
`.claude/skills/`. Nothing in this tree is referenced from any `SKILL.md`
or any reference file, so an agent loading a CUDA-QX skill will not see
the prompts, assertions, runners, or graders that live here.

## Layout

```
.claude/evals/
├── prompts/       <skill>.evals.json   ← user-facing prompts only
├── assertions/    <skill>.json         ← THE answer key (substring rules)
├── runners/
│   └── runner.py                       ← prompt dump + iteration aggregate
├── graders/
│   ├── programmatic.py                 ← substring + activation grader
│   ├── executable.py                   ← runs code blocks in a sandbox
│   └── judge.py                        ← LLM-as-judge (BYO client)
├── viewer/
│   └── generate_review.py              ← HTML report
├── aggregate.py                        ← cross-grader agreement (Cohen's κ)
├── workspaces/                         ← per-iteration outputs (gitignored)
└── README.md
```

`prompts/` and `assertions/` are split deliberately: an agent answering an
eval prompt must never read the answer key. Keeping the two on opposite
sides of the directory boundary, and keeping both *outside* the skill
folder, is the cheapest way to enforce that.
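
To make "substring rules" concrete, here is a minimal sketch of how one answer-key entry might be applied to a response. It is not the code in `graders/programmatic.py`: the function name, the rule layout, and the case-insensitive matching are assumptions, and the strings are placeholders rather than real CUDA-QX names. Only the `must_include` / `must_not_include` field names come from the scoring rubric later in this README.

```python
def grade_substrings(response, must_include, must_not_include):
    """Report required substrings that are missing and forbidden ones that appear."""
    text = response.lower()  # assumption: matching ignores case
    missing = [s for s in must_include if s.lower() not in text]
    forbidden = [s for s in must_not_include if s.lower() in text]
    return {
        "passed": not missing and not forbidden,
        "missing": missing,
        "forbidden": forbidden,
    }


# Hypothetical answer-key entry; every string below is a placeholder.
rule = {
    "must_include": ["real_api_name", "--documented-flag"],
    "must_not_include": ["made_up_function"],
}

good = grade_substrings("Call real_api_name() with --documented-flag.", **rule)
bad = grade_substrings("Call made_up_function() instead.", **rule)
print(good["passed"], bad["missing"], bad["forbidden"])
```

A check like this can only confirm that specific strings are present or absent, which is why the grader comparison further down rates the programmatic grader as high precision but low recall.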

## Adding a new skill

1. Add the skill alias to `SKILL_DIRS` in `runners/runner.py` and in every
   grader (each grader is standalone and carries its own copy of the map).
2. Drop `<skill>.evals.json` into `prompts/` and `<skill>.json` into
   `assertions/`. Both files use the same `Sxx`/`Axx` ids.
3. Smoke-test:

   ```bash
   python .claude/evals/runners/runner.py prompts --skill <alias>
   ```
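
Because step 2 relies on the two files sharing ids, a quick consistency check can catch a prompt with no answer-key entry, or the reverse. The sketch below is not part of the harness. It assumes that ids are stored under an `"id"` key somewhere in each JSON document and that they look like `S01` or `A01`; the actual schema is not described in this README, so adjust the traversal to match the real files.

```python
import re

# Assumed reading of the `Sxx`/`Axx` convention: a letter followed by digits.
ID_PATTERN = re.compile(r"^[SA]\d+$")


def collect_ids(node):
    """Recursively collect every matching string stored under an "id" key."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "id" and isinstance(value, str) and ID_PATTERN.match(value):
                found.add(value)
            else:
                found |= collect_ids(value)
    elif isinstance(node, list):
        for item in node:
            found |= collect_ids(item)
    return found


def id_mismatch(prompts, assertions):
    """Return (ids lacking an assertion entry, ids lacking a prompt entry)."""
    prompt_ids, assertion_ids = collect_ids(prompts), collect_ids(assertions)
    return sorted(prompt_ids - assertion_ids), sorted(assertion_ids - prompt_ids)


# Hypothetical file contents, already parsed with json.load().
prompts = {"scenarios": [{"id": "S01", "prompt": "example"}, {"id": "A01", "prompt": "example"}]}
assertions = {"scenarios": [{"id": "S01", "must_include": ["example"]}]}

print(id_mismatch(prompts, assertions))
```

In this made-up case the check reports that `A01` has a prompt but no matching answer-key entry.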

## End-to-end loop (per iteration)

```bash
WORKSPACE=.claude/evals/workspaces/$(date +%Y-%m-%d)-iter-1

# 0. Create the iteration directories. workspaces/ is gitignored, and the
#    redirect in step 1 fails if the directory does not exist yet.
mkdir -p "$WORKSPACE/with_skill" "$WORKSPACE/without_skill"

# 1. Dump prompts for your evaluator to consume.
python .claude/evals/runners/runner.py prompts --skill qec --kind all > "$WORKSPACE/qec.prompts.jsonl"

# 2. Your evaluator runs the agent (with and without the skill loaded) and
#    writes responses.json + timing.json into:
#    $WORKSPACE/with_skill/ and $WORKSPACE/without_skill/

# 3. Grade. Each grader is independent; run as many as you have.
python .claude/evals/graders/programmatic.py --skill qec --responses "$WORKSPACE/with_skill/responses.json"
python .claude/evals/graders/programmatic.py --skill qec --responses "$WORKSPACE/without_skill/responses.json"
# (executable.py and judge.py optional; same calling convention.)

# 4. Aggregate per-iteration. Computes deltas between configurations.
python .claude/evals/runners/runner.py aggregate "$WORKSPACE"

# 5. Cross-grader agreement (Cohen's κ between programmatic / judge / etc).
python .claude/evals/aggregate.py "$WORKSPACE"

# 6. Render the HTML viewer for human review.
python .claude/evals/viewer/generate_review.py "$WORKSPACE" --out "$WORKSPACE/report.html"
```

## Why three graders

| Grader | Catches | Cost | Reliability |
| --- | --- | --- | --- |
| `programmatic.py` | API names, exact paths, exact flags | free | high precision, low recall |
| `executable.py` | "does the suggested code actually run and produce the gold answer" | medium (sandbox) | highest signal for buildable skills |
| `judge.py` | correctness, specificity, hallucinations | $$ (LLM call) | best for subjective dimensions |

Run all three and use `aggregate.py` to compute Cohen's κ between
programmatic and judge. Disagreement is exactly the place where the rubric
or the skill needs work.
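
For readers unfamiliar with the statistic, Cohen's κ compares the observed agreement between two graders with the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below shows the standard two-rater formula on pass/fail verdicts. It is not the implementation in `aggregate.py`, and the verdict lists are made-up illustration data, not results from any run.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long lists of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both graders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each grader's marginal rate, per category.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both graders always emit the same single label
    return (observed - expected) / (1 - expected)


# Made-up verdicts (1 = pass, 0 = fail) for ten scenarios.
programmatic = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
judge = [1, 0, 0, 0, 1, 1, 1, 1, 0, 1]

kappa = cohens_kappa(programmatic, judge)
print(round(kappa, 3))
```

Here the graders agree on 8 of 10 items (p_o = 0.8) while chance agreement is 0.52, so κ is about 0.583. The two scenarios where they disagree are the ones worth inspecting by hand.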

## Scoring rubric (human pass)

When a human grader looks at the viewer, score each scenario 0–8:

- **Correctness** (0–2): facts true, paths/APIs real
- **Specificity** (0–2): cites files, exact API names, exact kwargs
- **Coverage** (0–2): hits each `must_include` item
- **No hallucinations** (0–2): no `must_not_include` items

For build-skill responses, add an optional fifth dimension:

- **Action quality** (0–2): did the response identify the right script /
  invocation / fix? (Build prompts often want "run this script" rather
  than a long explanation, and substring scorers undercount that.)

Maximum score: 12 scenarios × 8 + 10 activation = 106, or
12 scenarios × 10 + 10 activation = 130 when action quality is scored.
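
The maximum-score arithmetic can be reproduced with a few lines of Python. This is only a worked check of the numbers above; the constant and function names are illustrative and do not come from the harness.

```python
POINTS_PER_DIMENSION = 2   # each rubric dimension is scored 0-2
BASE_DIMENSIONS = 4        # correctness, specificity, coverage, no hallucinations
SCENARIOS = 12
ACTIVATION_POINTS = 10


def max_score(with_action_quality=False):
    """Maximum total score, optionally including the action-quality dimension."""
    dimensions = BASE_DIMENSIONS + (1 if with_action_quality else 0)
    return SCENARIOS * dimensions * POINTS_PER_DIMENSION + ACTIVATION_POINTS


print(max_score(), max_score(with_action_quality=True))
```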

## Token-cost tracking

A skill that delivers higher accuracy at much higher token cost may not be
worth shipping. The evaluator should write `timing.json` alongside
`responses.json`:

```json
{"total_tokens": 84852, "duration_ms": 23332}
```

The HTML viewer surfaces these alongside the grader scores.
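
As a rough illustration of how an evaluator might produce these files and compare the two configurations, the sketch below writes a `timing.json` into `with_skill/` and `without_skill/` and computes the token ratio between them. The helper names are hypothetical, the `without_skill` numbers are invented for the example, and the harness's own aggregation may report cost differently.

```python
import json
import tempfile
from pathlib import Path


def write_timing(config_dir, total_tokens, duration_ms):
    """Write timing.json in the shape shown above and return its path."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    path = config_dir / "timing.json"
    path.write_text(json.dumps({"total_tokens": total_tokens, "duration_ms": duration_ms}))
    return path


def token_ratio(workspace):
    """Tokens used with the skill divided by tokens used without it."""
    workspace = Path(workspace)
    with_skill = json.loads((workspace / "with_skill" / "timing.json").read_text())
    without_skill = json.loads((workspace / "without_skill" / "timing.json").read_text())
    return with_skill["total_tokens"] / without_skill["total_tokens"]


with tempfile.TemporaryDirectory() as workspace:
    write_timing(Path(workspace) / "with_skill", 84852, 23332)
    # Invented baseline numbers, chosen only to make the ratio easy to read.
    write_timing(Path(workspace) / "without_skill", 42426, 11666)
    ratio = token_ratio(workspace)

print(ratio)
```

A ratio well above 1 means the skill costs noticeably more tokens, which should be weighed against whatever accuracy gain the graders report.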

## Sources by skill

- **build**: `CMakeLists.txt`, `Building.md`, `scripts/build_*.sh`,
  `scripts/ci/*.sh`, `scripts/validation/*`, `docker/build_env/`,
  `libs/{qec,solvers}/pyproject.toml.cu{12,13}`.
- **qec**: `libs/qec/python/cudaq_qec/__init__.py`,
  `libs/qec/python/bindings/`, `libs/qec/include/cudaq/qec/`,
  `docs/sphinx/components/qec/`, `docs/sphinx/examples/qec/`.
- **solvers**: `libs/solvers/python/cudaq_solvers/__init__.py`,
  `libs/solvers/python/bindings/solvers/`, `libs/solvers/include/`,
  `libs/solvers/python/cudaq_solvers/gqe_algorithm/gqe.py`.
