| 1 | +# CUDA-QX skill evaluation harness |
| 2 | + |
| 3 | +This tree is **not** a skill. It evaluates the skills under |
| 4 | +`.claude/skills/`. Nothing in this tree is referenced from any `SKILL.md` |
| 5 | +or any reference file, so an agent loading a CUDA-QX skill will not see |
| 6 | +the prompts, assertions, runners, or graders that live here. |
| 7 | + |
| 8 | +## Layout |
| 9 | + |
| 10 | +``` |
| 11 | +.claude/evals/ |
| 12 | +├── prompts/ <skill>.evals.json ← user-facing prompts only |
| 13 | +├── assertions/ <skill>.json ← THE answer key (substring rules) |
| 14 | +├── runners/ |
| 15 | +│ └── runner.py ← prompt dump + iteration aggregate |
| 16 | +├── graders/ |
| 17 | +│ ├── programmatic.py ← substring + activation grader |
| 18 | +│ ├── executable.py ← runs code blocks in a sandbox |
| 19 | +│ └── judge.py ← LLM-as-judge (BYO client) |
| 20 | +├── viewer/ |
| 21 | +│ └── generate_review.py ← HTML report |
| 22 | +├── aggregate.py ← cross-grader agreement (Cohen's κ) |
| 23 | +├── workspaces/ ← per-iteration outputs (gitignored) |
| 24 | +└── README.md |
| 25 | +``` |
| 26 | + |
| 27 | +`prompts/` and `assertions/` are split deliberately: an agent answering an |
| 28 | +eval prompt must never read the answer key. Keeping the two on opposite |
| 29 | +sides of the directory boundary, and keeping both *outside* the skill |
| 30 | +folder, is the cheapest way to enforce that. |
| 31 | + |
| 32 | +## Adding a new skill |
| 33 | + |
| 34 | +1. Add the skill alias to `SKILL_DIRS` in `runners/runner.py` and in every |
| 35 | + grader (each grader is standalone and carries its own copy of the map). |
| 36 | +2. Drop `<skill>.evals.json` into `prompts/` and `<skill>.json` into |
| 37 | + `assertions/`. Both files use the same `Sxx`/`Axx` ids. |
| 38 | +3. Smoke-test: |
| 39 | + |
| 40 | + ```bash |
| 41 | + python .claude/evals/runners/runner.py prompts --skill <alias> |
| 42 | + ``` |
| 43 | + |
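
Before the smoke test, a quick check that the two files really share ids can catch copy-paste slips. The sketch below is not part of the harness: it assumes ids look like `S01` or `A07` (one letter plus two digits) and appear verbatim in both files, and it scans raw text because the JSON schema is not described here. The inline JSON strings are hypothetical stand-ins for the real files.

```python
import re

# Assumption: ids are "S" or "A" followed by exactly two digits.
ID_PATTERN = re.compile(r"\b[SA]\d{2}\b")


def id_mismatch(prompts_text, assertions_text):
    """Return (ids only in prompts, ids only in assertions), both sorted."""
    prompt_ids = set(ID_PATTERN.findall(prompts_text))
    assertion_ids = set(ID_PATTERN.findall(assertions_text))
    return sorted(prompt_ids - assertion_ids), sorted(assertion_ids - prompt_ids)


# Hypothetical file contents; the real schema may differ.
prompts_text = '[{"id": "S01"}, {"id": "S02"}, {"id": "A01"}]'
assertions_text = '[{"id": "S01"}, {"id": "A01"}, {"id": "A02"}]'

only_in_prompts, only_in_assertions = id_mismatch(prompts_text, assertions_text)
print(only_in_prompts, only_in_assertions)
```

In practice the two strings would come from reading `prompts/<skill>.evals.json` and `assertions/<skill>.json`; two empty lists mean the id sets agree.
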
| 44 | +## End-to-end loop (per iteration) |
| 45 | + |
| 46 | +```bash |
| 47 | +WORKSPACE=.claude/evals/workspaces/$(date +%Y-%m-%d)-iter-1 && mkdir -p "$WORKSPACE"/{with_skill,without_skill} |
| 48 | + |
| 49 | +# 1. Dump prompts for your evaluator to consume. |
| 50 | +python .claude/evals/runners/runner.py prompts --skill qec --kind all > $WORKSPACE/qec.prompts.jsonl |
| 51 | + |
| 52 | +# 2. Your evaluator runs the agent (with and without the skill loaded) and |
| 53 | +# writes responses.json + timing.json into: |
| 54 | +# $WORKSPACE/with_skill/ and $WORKSPACE/without_skill/ |
| 55 | + |
| 56 | +# 3. Grade. Each grader is independent; run as many as you have. |
| 57 | +python .claude/evals/graders/programmatic.py --skill qec --responses $WORKSPACE/with_skill/responses.json |
| 58 | +python .claude/evals/graders/programmatic.py --skill qec --responses $WORKSPACE/without_skill/responses.json |
| 59 | +# (executable.py and judge.py optional; same calling convention.) |
| 60 | + |
| 61 | +# 4. Aggregate per-iteration. Computes deltas between configurations. |
| 62 | +python .claude/evals/runners/runner.py aggregate $WORKSPACE |
| 63 | + |
| 64 | +# 5. Cross-grader agreement (Cohen's κ between programmatic / judge / etc). |
| 65 | +python .claude/evals/aggregate.py $WORKSPACE |
| 66 | + |
| 67 | +# 6. Render the HTML viewer for human review. |
| 68 | +python .claude/evals/viewer/generate_review.py $WORKSPACE --out $WORKSPACE/report.html |
| 69 | +``` |
| 70 | + |
| 71 | +## Why three graders |
| 72 | + |
| 73 | +| Grader | Catches | Cost | Reliability | |
| 74 | +| --- | --- | --- | --- | |
| 75 | +| `programmatic.py` | API names, exact paths, exact flags | free | high precision, low recall | |
| 76 | +| `executable.py` | "does the suggested code actually run and produce the gold answer" | medium (sandbox) | highest signal for buildable skills | |
| 77 | +| `judge.py` | correctness, specificity, hallucinations | $$ (LLM call) | best for subjective dimensions | |
| 78 | + |
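
To make the "high precision, low recall" entry concrete, here is a minimal sketch of a substring check. It is not the code in `programmatic.py`; the function name, the return shape, and the sample strings are all hypothetical, and matching is assumed to be case-sensitive.

```python
def grade_substrings(response, must_include, must_not_include):
    """Pass only if every required string is present and no forbidden one is."""
    missing = [s for s in must_include if s not in response]
    forbidden = [s for s in must_not_include if s in response]
    return {
        "passed": not missing and not forbidden,
        "missing": missing,
        "forbidden": forbidden,
    }


# Hypothetical rule: the answer must name real_function( and must not
# mention made_up_function(.
rule = {"must_include": ["real_function("], "must_not_include": ["made_up_function("]}

exact = grade_substrings("Call real_function(x) to build the decoder.", **rule)
paraphrase = grade_substrings("Call the real function with x to build the decoder.", **rule)

print(exact["passed"], paraphrase["passed"])
```

The paraphrased answer fails even though a human might accept it, which is the low-recall side of the trade-off; a pass, on the other hand, is hard to get by accident.
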
| 79 | +Run all three and use `aggregate.py` to compute Cohen's κ between |
| 80 | +programmatic and judge. Disagreement is exactly the place where the rubric |
| 81 | +or the skill needs work. |
| 82 | + |
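
For readers unfamiliar with Cohen's κ, the sketch below computes it for two graders' pass/fail verdicts over the same scenarios. It illustrates the statistic only and is not taken from `aggregate.py`; the verdict lists are made-up data.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both graders agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each grader's label frequencies, summed.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up verdicts (1 = pass, 0 = fail) for ten scenarios.
programmatic = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

kappa = cohens_kappa(programmatic, judge)
print(round(kappa, 3))
```

Here the graders agree on 8 of 10 scenarios (observed 0.8) against a chance agreement of 0.52, giving κ ≈ 0.583. The two scenarios where they differ are the ones worth opening in the viewer first.
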
| 83 | +## Scoring rubric (human pass) |
| 84 | + |
| 85 | +When a human grader looks at the viewer, score each scenario 0–8: |
| 86 | + |
| 87 | +- **Correctness** (0–2): facts true, paths/APIs real |
| 88 | +- **Specificity** (0–2): cites files, exact API names, exact kwargs |
| 89 | +- **Coverage** (0–2): hits each `must_include` item |
| 90 | +- **No hallucinations** (0–2): no `must_not_include` items |
| 91 | + |
| 92 | +For build-skill responses, add an optional fifth dimension: |
| 93 | + |
| 94 | +- **Action quality** (0–2): did the response identify the right script / |
| 95 | + invocation / fix? (Build prompts often want "run this script" rather |
| 96 | + than a long explanation, and substring scorers undercount that.) |
| 97 | + |
| 98 | +Maximum score: 12 scenarios × 8 points + 10 activation points = 106 (12 × 10 + 10 = 130 with action quality). |
| 99 | + |
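
The totals above follow directly from the rubric. As a small worked check (the function and its parameter names are illustrative, not part of the harness):

```python
def max_score(scenarios=12, dimensions=4, points_per_dimension=2,
              activation=10, action_quality=False):
    """Maximum rubric score; action quality adds a fifth 0-2 dimension."""
    used = dimensions + (1 if action_quality else 0)
    return scenarios * used * points_per_dimension + activation


base = max_score()
with_action = max_score(action_quality=True)
print(base, with_action)
```
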
| 100 | +## Token-cost tracking |
| 101 | + |
| 102 | +A skill that delivers higher accuracy at much higher token cost may not be |
| 103 | +worth shipping. The evaluator should write `timing.json` alongside |
| 104 | +`responses.json`: |
| 105 | + |
| 106 | +```json |
| 107 | +{"total_tokens": 84852, "duration_ms": 23332} |
| 108 | +``` |
| 109 | + |
| 110 | +The HTML viewer surfaces these alongside the grader scores. |
| 111 | + |
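
One way to put a number on the cost side is to subtract the two configurations' `timing.json` files. The sketch below assumes each of `with_skill/` and `without_skill/` holds a `timing.json` with the two keys shown above; the helper names are hypothetical and the `without_skill` figures in the demo are invented.

```python
import json
import tempfile
from pathlib import Path


def load_timing(config_dir):
    """Read timing.json from one configuration directory."""
    return json.loads((Path(config_dir) / "timing.json").read_text())


def cost_delta(workspace):
    """with_skill minus without_skill, per metric."""
    with_skill = load_timing(Path(workspace) / "with_skill")
    without_skill = load_timing(Path(workspace) / "without_skill")
    return {
        key: with_skill[key] - without_skill[key]
        for key in ("total_tokens", "duration_ms")
    }


# Demo against a throwaway workspace; the without_skill numbers are made up.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "with_skill": {"total_tokens": 84852, "duration_ms": 23332},
        "without_skill": {"total_tokens": 60000, "duration_ms": 20000},
    }
    for name, timing in samples.items():
        config_dir = Path(tmp) / name
        config_dir.mkdir()
        (config_dir / "timing.json").write_text(json.dumps(timing))
    delta = cost_delta(tmp)

print(delta)
```

A positive `total_tokens` delta is the extra cost of loading the skill, to be weighed against whatever accuracy gain the graders report.
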
| 112 | +## Sources by skill |
| 113 | + |
| 114 | +- **build**: `CMakeLists.txt`, `Building.md`, `scripts/build_*.sh`, |
| 115 | + `scripts/ci/*.sh`, `scripts/validation/*`, `docker/build_env/`, |
| 116 | + `libs/{qec,solvers}/pyproject.toml.cu{12,13}`. |
| 117 | +- **qec**: `libs/qec/python/cudaq_qec/__init__.py`, |
| 118 | + `libs/qec/python/bindings/`, `libs/qec/include/cudaq/qec/`, |
| 119 | + `docs/sphinx/components/qec/`, `docs/sphinx/examples/qec/`. |
| 120 | +- **solvers**: `libs/solvers/python/cudaq_solvers/__init__.py`, |
| 121 | + `libs/solvers/python/bindings/solvers/`, `libs/solvers/include/`, |
| 122 | + `libs/solvers/python/cudaq_solvers/gqe_algorithm/gqe.py`. |