Commit 153a589

add qec skills and update evaluation
Signed-off-by: vedika-saravanan <vsaravanan@nvidia.com>
1 parent 6ffb0b7 commit 153a589

91 files changed

Lines changed: 8993 additions & 593 deletions

.agents/evals/REPORT.md

Lines changed: 12 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -1,7 +1,7 @@
11
# CUDA-QX skill evaluation — consolidated report
22

33
**Updated:** 2026-05-07
4-
**Scope:** `cuda-qx-qec` (iter 1 → iter 2) and `cuda-qx-solvers` (iter 1).
4+
**Scope:** `cuda-qx-qec-decode` (iter 1 → iter 2) and `cuda-qx-solvers-algorithms` (iter 1).
55
`cuda-qx-build` not yet measured.
66

77
This is the single report covering the eval framework, what was measured,
@@ -75,7 +75,7 @@ Programmatic grader, with-skill vs without-skill, on identical 24-prompt
7575
suites (13 scenarios `S1`–`S13`, 11 activations `A1`–`A11`). Composite
7676
= coverage + purity + activation, max varies per skill.
7777

78-
### `cuda-qx-qec` — iter 1 → iter 2
78+
### `cuda-qx-qec-decode` — iter 1 → iter 2
7979

8080
```
8181
iter 1 (22 prompts) iter 2 (24 prompts)
@@ -91,7 +91,7 @@ composite 61/68 29/68 +47 pts 60/73 58/73 +3 p
9191
the without-skill agent happened to read `libs/qec/` source verbatim,
9292
matching exact substrings the with-skill agent paraphrased. See section 4.
9393

94-
### `cuda-qx-solvers` — iter 1 (first run)
94+
### `cuda-qx-solvers-algorithms` — iter 1 (first run)
9595

9696
```
9797
with without Δ
@@ -136,8 +136,8 @@ Both skills score 11/11 on activation in iter 2. Both baselines score 5/11.
136136
This is not a phrasing artifact — it's structural:
137137

138138
* On in-scope prompts (e.g. "Build a Steane memory experiment"), the
139-
with-skill agent emits the marker (`cuda-qx-qec`); the baseline cannot,
140-
because the skill name lives in `.agents/skills/cuda-qx-qec/` which it
139+
with-skill agent emits the marker (`cuda-qx-qec-decode`); the baseline cannot,
140+
because the skill name lives in `.agents/skills/cuda-qx-qec-decode/` which it
141141
was forbidden to read.
142142
* On out-of-scope prompts (e.g. "Open the cudaq_qec docs offline" — the new
143143
`A11`), the with-skill agent correctly *withholds* the marker and
@@ -210,7 +210,7 @@ Outcome:
210210
recipe ending in a working `python -m http.server -d build/docs/sphinx 8080`.
211211
* QEC `A11`: passes (correctly redirects without naming the skill).
212212
* QEC `S13`: fails on `http.server` only — agent suggested `file://` URLs.
213-
Real content gap; one-sentence fix in `references/triage.md`.
213+
Real content gap; one-sentence fix in `references/decode-triage.md`.
214214

215215
---
216216

@@ -220,8 +220,8 @@ Outcome:
220220
|---|---|---|
221221
| Judge grader | `not_configured` for all 26 scenarios scored | `pip install anthropic && export ANTHROPIC_API_KEY=...` then re-run `judge.py` on the saved `responses.json` (zero new tokens from agents) |
222222
| Executable grader | All 26 scenarios `skipped` (no `executable` rules in any assertions file) | Add 2–3 high-leverage rules (`S5` Steane shape, `S6` `@qec.code` decorator, `S11` `qec.operation` enum). Schema documented at top of `executable.py`. |
223-
| 4 brittle QEC assertions (`S1`, `S3`, `S5`, `S12`) | False-fails as documented in section 4 | One-line edits in `cuda-qx-qec.json` (regex alternatives instead of literal substrings). |
224-
| QEC `references/triage.md` | Missing one-line note on `python -m http.server` for docs | Surgical edit; flips `S13`. |
223+
| 4 brittle QEC assertions (`S1`, `S3`, `S5`, `S12`) | False-fails as documented in section 4 | One-line edits in `cuda-qx-qec-decode.json` (regex alternatives instead of literal substrings). |
224+
| QEC `references/decode-triage.md` | Missing one-line note on `python -m http.server` for docs | Surgical edit; flips `S13`. |
225225
| `cuda-qx-build` skill | Never evaluated end-to-end | Same recipe, `--skill build`; ~24 prompts × 2 subagents. |
226226
| Cohen's κ | `n/a` everywhere because judge has 0 scenarios scored | Falls out of #1 above. |
227227

@@ -231,11 +231,11 @@ Outcome:
231231

232232
In rough cost order, cheapest first:
233233

234-
1. **Patch the 4 brittle assertions** in `cuda-qx-qec.json` (regex alternatives
234+
1. **Patch the 4 brittle assertions** in `cuda-qx-qec-decode.json` (regex alternatives
235235
for `C-order`, `RuntimeError`, the shape variants, `X errors`/`Z errors`).
236236
Re-run `programmatic.py` on the existing `responses.json`. Zero tokens.
237237
Expected with-skill scenario pass rate: 11/13 on the existing data.
238-
2. **Add the `http.server` sentence** to `cuda-qx-qec` `references/triage.md`.
238+
2. **Add the `http.server` sentence** to `cuda-qx-qec-decode` `references/decode-triage.md`.
239239
Doesn't change scoring this iteration, but makes `S13` flip the next
240240
time a fresh agent runs.
241241
3. **Wire the judge grader** (export an API key, run `judge.py` on the
@@ -263,8 +263,8 @@ python -m http.server -d .agents/evals/workspaces/2026-05-07-qec-iter2 8001
263263
# open http://localhost:8001/report.html
264264

265265
# add a new prompt to a skill and re-grade only:
266-
$EDITOR .agents/evals/prompts/cuda-qx-qec.evals.json
267-
$EDITOR .agents/evals/assertions/cuda-qx-qec.json
266+
$EDITOR .agents/evals/prompts/cuda-qx-qec-decode.evals.json
267+
$EDITOR .agents/evals/assertions/cuda-qx-qec-decode.json
268268
python .agents/evals/graders/programmatic.py --skill qec \
269269
--responses .agents/evals/workspaces/2026-05-07-qec-iter2/with_skill/responses.json
270270
python .agents/evals/aggregate.py .agents/evals/workspaces/2026-05-07-qec-iter2
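
The report's cheapest follow-up is to swap literal substrings for regex alternatives in the brittle assertions. The sketch below shows why that helps, under loud assumptions: the helper names, the sample response, and the alternatives in the pattern are all hypothetical, and the diff does not show how `programmatic.py` actually matches strings or whether the assertions schema accepts regex today.

```python
import re


def literal_match(response: str, needle: str) -> bool:
    """Brittle check: passes only if the exact substring appears."""
    return needle in response


def regex_match(response: str, pattern: str) -> bool:
    """Looser check: passes if any alternative in the pattern appears."""
    return re.search(pattern, response, flags=re.IGNORECASE) is not None


# Hypothetical agent answer that is correct but paraphrases the key phrase.
response = "Convert the matrix to row-major (C-contiguous) layout before decoding."

# The literal "C-order" never appears, so the literal check false-fails.
print(literal_match(response, "C-order"))

# Hypothetical alternatives; the real replacement pattern is not in this diff.
print(regex_match(response, r"C-order|C-contiguous|row-major"))
```

The trade-off is that each added alternative widens what counts as a pass, so the alternatives should be limited to phrasings that are unambiguously correct.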

.agents/evals/assertions/cuda-qx-qec.json renamed to .agents/evals/assertions/cuda-qx-qec-decode.json

Lines changed: 13 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,11 +1,21 @@
11
{
2-
"skill_name": "cuda-qx-qec",
2+
"skill_name": "cuda-qx-qec-decode",
33
"_note": "Answer key. NEVER read this file from inside SKILL.md or any references/*.md. Loaded by .agents/evals/graders/ only.",
4-
"activation_marker": "cuda-qx-qec",
4+
"activation_marker": "cuda-qx-qec-decode",
55
"scenarios": {
66
"S1": {
77
"must_include": ["C-order"],
8-
"must_not_include": ["F-order is supported"]
8+
"must_not_include": ["F-order is supported"],
9+
"executable": {
10+
"code_extractor": "first_python_block",
11+
"interpreter": "python3",
12+
"preamble": "import numpy as np\nH = np.zeros((3, 5), dtype=np.uint8, order='F')\nassert H.flags['F_CONTIGUOUS'], 'precondition failed: H must start F-ordered'\n",
13+
"harness": "{code}\nimport numpy as np\n# Probe any numpy array variable in locals() that is now C-contiguous\nresult = any(getattr(v, 'flags', None) is not None and v.flags.get('C_CONTIGUOUS', False) for v in list(locals().values()) if isinstance(v, np.ndarray))\nprint('C_CONTIGUOUS_FOUND' if result else 'NO_C_ORDER_ARRAY')\n",
14+
"expected_stdout_substr": "C_CONTIGUOUS_FOUND",
15+
"expected_exit": 0,
16+
"timeout_s": 15,
17+
"skip_if_missing": ["numpy"]
18+
}
919
},
1020
"S2": {
1121
"must_include": ["cuquantum-python", "26.03.0"],
Lines changed: 95 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,95 @@
1+
{
2+
"skill_name": "cuda-qx-qec-realtime",
3+
"_note": "Answer key. NEVER read this file from inside SKILL.md or any references/*.md. Loaded by .agents/evals/graders/ only.",
4+
"activation_marker": "cuda-qx-qec-realtime",
5+
"scenarios": {
6+
"S1": {
7+
"must_include": [
8+
"z_dem_from_memory_circuit",
9+
"decoder_config",
10+
"multi_decoder_config",
11+
"configure_decoders_from_file",
12+
"reset_decoder",
13+
"enqueue_syndromes",
14+
"get_corrections",
15+
"finalize_decoders"
16+
],
17+
"must_not_include": ["configure_decoders_from_file inside cudaq.run"]
18+
},
19+
"S2": {
20+
"must_include": ["before cudaq.run", "Phase 3"],
21+
"must_not_include": ["this is a known bug", "configure after run is supported"]
22+
},
23+
"S3": {
24+
"must_include": ["block_size = 200", "syndrome_size = 60", "H.shape[1]", "H.shape[0]"],
25+
"must_not_include": ["block_size = 60"],
26+
"executable": {
27+
"code_extractor": "first_python_block",
28+
"interpreter": "python3",
29+
"preamble": "import numpy as np\nclass _Dem: pass\ndem = _Dem()\ndem.detector_error_matrix = np.zeros((60, 200), dtype=np.uint8)\n",
30+
"harness": "{code}\n# After the agent's snippet, both block_size and syndrome_size variables must exist with right values\nimport sys\ntry:\n ok = (int(block_size) == 200) and (int(syndrome_size) == 60)\nexcept NameError:\n ok = False\nprint('SIZES_OK' if ok else 'SIZES_WRONG')\n",
31+
"expected_stdout_substr": "SIZES_OK",
32+
"expected_exit": 0,
33+
"timeout_s": 15,
34+
"skip_if_missing": ["numpy"]
35+
}
36+
},
37+
"S4": {
38+
"must_include": ["num_rounds_plus_one", "num_syndromes_per_round", "+ 1"],
39+
"must_not_include": ["num_rounds directly"]
40+
},
41+
"S5": {
42+
"must_include": ["not real-time eligible", "nv-qldpc-decoder", "multi_error_lut", "single_error_lut"],
43+
"must_not_include": ["tensor_network_decoder is real-time eligible"]
44+
},
45+
"S6": {
46+
"must_include": [
47+
"py_decoding.cpp",
48+
"@cudaq.kernel",
49+
"python_realtime_decoding_api.rst"
50+
],
51+
"must_not_include": ["call from Python top level"]
52+
},
53+
"S7": {
54+
"must_include": ["qec.finalize_decoders", "GPU resources"],
55+
"must_not_include": ["restart Python", "this is unsupported"]
56+
},
57+
"S8": {
58+
"must_include": ["decoder_config.id", "unique", "logical qubit"],
59+
"must_not_include": ["id is optional"]
60+
},
61+
"S9": {
62+
"must_include": ["pcm_is_sorted", "simplify_pcm", "sort_pcm_columns"],
63+
"must_not_include": ["sliding_window auto-sorts"]
64+
},
65+
"S10": {
66+
"must_include": ["emulate=True", "cudaq.set_target(\"quantinuum\"", "CUDAQ_QUANTINUUM_CREDENTIALS"],
67+
"must_not_include": ["go straight to hardware"]
68+
},
69+
"S11": {
70+
"must_include": ["autonomous_decoder", "CRTP", "autonomous_decoder.cuh", "autonomous_decoder_guide.md"],
71+
"must_not_include": ["host code is fine"]
72+
},
73+
"S12": {
74+
"must_include": ["d=7", "d=13", "d=21", "d=31", "T=104"],
75+
"must_not_include": ["any d/T combo works"]
76+
},
77+
"S13": {
78+
"must_include": ["CUDAQ_QEC_DEBUG_DECODER=1"],
79+
"must_not_include": ["no logging available"]
80+
}
81+
},
82+
"activation": {
83+
"A1": {"should_activate": true},
84+
"A2": {"should_activate": true},
85+
"A3": {"should_activate": true},
86+
"A4": {"should_activate": true},
87+
"A5": {"should_activate": false},
88+
"A6": {"should_activate": false},
89+
"A7": {"should_activate": false},
90+
"A8": {"should_activate": true},
91+
"A9": {"should_activate": true},
92+
"A10": {"should_activate": false},
93+
"A11": {"should_activate": false}
94+
}
95+
}
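
To make the fields in this answer key concrete, here is a minimal sketch of how a grader might apply `must_include`, `must_not_include`, and `should_activate`. It assumes case-sensitive literal substring matching and assumes activation is judged by whether the `activation_marker` string appears in the response; the function names are hypothetical, and the actual logic in `programmatic.py` is not shown in this diff.

```python
def grade_scenario(response: str, rule: dict) -> dict:
    """Check one scenario rule against an agent response (assumed semantics)."""
    missing = [s for s in rule.get("must_include", []) if s not in response]
    forbidden = [s for s in rule.get("must_not_include", []) if s in response]
    return {
        "passed": not missing and not forbidden,
        "missing": missing,      # required strings that were absent
        "forbidden": forbidden,  # banned strings that were present
    }


def grade_activation(response: str, marker: str, should_activate: bool) -> bool:
    """Pass when the marker is present exactly when activation is expected."""
    return (marker in response) == should_activate


# Rule S13 copied from the file above.
s13 = {
    "must_include": ["CUDAQ_QEC_DEBUG_DECODER=1"],
    "must_not_include": ["no logging available"],
}

# Hypothetical agent response, for illustration only.
answer = "Export CUDAQ_QEC_DEBUG_DECODER=1 before the run to see decoder logs."
print(grade_scenario(answer, s13))

# An out-of-scope prompt passes only if the marker is withheld.
print(grade_activation(answer, "cuda-qx-qec-realtime", should_activate=False))
```

Literal matching like this is what makes some assertions brittle: a correct paraphrase that avoids the exact string is scored as a miss.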

.agents/evals/assertions/cuda-qx-solvers.json renamed to .agents/evals/assertions/cuda-qx-solvers-algorithms.json

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
{
2-
"skill_name": "cuda-qx-solvers",
2+
"skill_name": "cuda-qx-solvers-algorithms",
33
"_note": "Answer key. NEVER read this file from inside SKILL.md or any references/*.md. The agent under test must not see this content. Loaded by .agents/evals/graders/programmatic.py (and judge.py) only.",
4-
"activation_marker": "cuda-qx-solvers",
4+
"activation_marker": "cuda-qx-solvers-algorithms",
55
"scenarios": {
66
"S1": {
77
"must_include": ["OMP_NUM_THREADS=1", "PySCF", "eigenvalue"],
Lines changed: 82 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,82 @@
1+
{
2+
"skill_name": "cuda-qx-solvers-chemistry",
3+
"_note": "Answer key. NEVER read this file from inside SKILL.md or any references/*.md. Loaded by .agents/evals/graders/ only.",
4+
"activation_marker": "cuda-qx-solvers-chemistry",
5+
"scenarios": {
6+
"S1": {
7+
"must_include": ["Ångström", "Bohr"],
8+
"must_not_include": ["geometry unit is Bohr by default"]
9+
},
10+
"S2": {
11+
"must_include": ["spin=2", "charge=0", "2S", "unpaired"],
12+
"must_not_include": ["spin=1 for triplet"]
13+
},
14+
"S3": {
15+
"must_include": ["MP2=True", "natorb requires MP2"],
16+
"must_not_include": ["set natorb=False to fix"]
17+
},
18+
"S4": {
19+
"must_include": ["12 qubits", "2 * norb_cas", "num_qubits = 2 * num_orbitals"],
20+
"must_not_include": ["6 qubits", "num_qubits = num_orbitals"],
21+
"executable": {
22+
"code_extractor": "first_python_block",
23+
"interpreter": "python3",
24+
"preamble": "# Active-space parameters from the prompt\nnele_cas = 6\nnorb_cas = 6\n",
25+
"harness": "{code}\n# Agent's snippet must compute num_qubits = 12 via 2 * num_orbitals\ntry:\n ok = int(num_qubits) == 12\nexcept NameError:\n ok = False\nprint('QUBITS_OK' if ok else 'QUBITS_WRONG')\n",
26+
"expected_stdout_substr": "QUBITS_OK",
27+
"expected_exit": 0,
28+
"timeout_s": 10,
29+
"skip_if_missing": []
30+
}
31+
},
32+
"S5": {
33+
"must_include": ["num_orbitals", "uccgsd takes num_orbitals", "num_qubits = 2 * num_orbitals"],
34+
"must_not_include": ["uccgsd takes num_qubits"]
35+
},
36+
"S6": {
37+
"must_include": ["fci_energy", "casci=True with no CAS"],
38+
"must_not_include": ["R-CASCI when no active space"]
39+
},
40+
"S7": {
41+
"must_include": ["core_energy", "nuclear_energy", "identity"],
42+
"must_not_include": ["this is a CUDA-QX bug"]
43+
},
44+
"S8": {
45+
"must_include": ["jordan_wigner", "default", "Bravyi-Kitaev"],
46+
"must_not_include": ["always use Bravyi-Kitaev"]
47+
},
48+
"S9": {
49+
"must_include": ["ccsd=True", "R-CCSD", "energies"],
50+
"must_not_include": ["VQE has no classical baseline"]
51+
},
52+
"S10": {
53+
"must_include": ["localhost:8000", "lsof", "cudaq-pyscf"],
54+
"must_not_include": ["reinstall CUDA-Q", "change basis"]
55+
},
56+
"S11": {
57+
"must_include": ["OMP_NUM_THREADS=1", "PySCF", "multithread"],
58+
"must_not_include": ["this is a CUDA-QX bug", "fermion mapping bug"]
59+
},
60+
"S12": {
61+
"must_include": ["CASCI", "CASSCF", "bond breaking", "single-reference"],
62+
"must_not_include": ["CCSD handles bond-breaking correctly"]
63+
},
64+
"S13": {
65+
"must_include": ["build_docs.sh", "http.server", "cuda-qx-build"],
66+
"must_not_include": ["this skill builds the docs"]
67+
}
68+
},
69+
"activation": {
70+
"A1": {"should_activate": true},
71+
"A2": {"should_activate": true},
72+
"A3": {"should_activate": true},
73+
"A4": {"should_activate": true},
74+
"A5": {"should_activate": true},
75+
"A6": {"should_activate": false},
76+
"A7": {"should_activate": false},
77+
"A8": {"should_activate": false},
78+
"A9": {"should_activate": false},
79+
"A10": {"should_activate": true},
80+
"A11": {"should_activate": false}
81+
}
82+
}
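
The `executable` rule for `S4` can be read as a small pipeline: pull the first Python block out of the agent's response, splice it into the harness in place of `{code}`, run preamble plus harness, and compare the exit code and stdout. The sketch below follows that reading under stated assumptions: `first_python_block` and `run_executable_rule` are hypothetical implementations (the schema is documented in `executable.py`, which this diff does not show), and `sys.executable` stands in for the rule's `"interpreter": "python3"` so the example runs anywhere.

```python
import re
import subprocess
import sys

FENCE = "`" * 3  # built at runtime so this sketch can sit inside a fenced block

# Assumed meaning of "first_python_block": body of the first fence tagged python/py.
_BLOCK_RE = re.compile(FENCE + r"(?:python3?|py)[ \t]*\r?\n(.*?)" + FENCE, re.DOTALL)


def first_python_block(response: str):
    """Return the first fenced Python block's body, or None if there is none."""
    match = _BLOCK_RE.search(response)
    return match.group(1) if match else None


# Fields copied from the S4 rule above (harness comment line omitted).
S4_RULE = {
    "preamble": "# Active-space parameters from the prompt\nnele_cas = 6\nnorb_cas = 6\n",
    "harness": (
        "{code}\n"
        "try:\n"
        "    ok = int(num_qubits) == 12\n"
        "except NameError:\n"
        "    ok = False\n"
        "print('QUBITS_OK' if ok else 'QUBITS_WRONG')\n"
    ),
    "expected_stdout_substr": "QUBITS_OK",
    "expected_exit": 0,
    "timeout_s": 10,
}


def run_executable_rule(rule: dict, response: str) -> bool:
    """Run the agent's snippet inside the harness and check exit code and stdout."""
    code = first_python_block(response)
    if code is None:
        return False  # nothing to execute
    script = rule.get("preamble", "") + rule["harness"].replace("{code}", code)
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True,
            text=True,
            timeout=rule.get("timeout_s", 15),
        )
    except subprocess.TimeoutExpired:
        return False  # treat a hang as a failure
    return (
        proc.returncode == rule.get("expected_exit", 0)
        and rule["expected_stdout_substr"] in proc.stdout
    )


# Hypothetical agent responses, for illustration only.
good = "Two spin orbitals per spatial orbital:\n" + FENCE + "python\nnum_orbitals = norb_cas\nnum_qubits = 2 * num_orbitals\n" + FENCE
bad = FENCE + "python\nnum_qubits = norb_cas\n" + FENCE

print(run_executable_rule(S4_RULE, good))
print(run_executable_rule(S4_RULE, bad))
```

Unlike the substring checks, this style of rule accepts any phrasing as long as the code computes the right value, which is why the report lists it as a remedy for brittle literal assertions.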

.agents/evals/graders/executable.py

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -61,9 +61,11 @@
6161
ASSERTIONS_DIR = EVALS_ROOT / "assertions"
6262

6363
SKILL_DIRS = {
64-
"solvers": "cuda-qx-solvers",
65-
"qec": "cuda-qx-qec",
64+
"solvers": "cuda-qx-solvers-algorithms",
65+
"qec": "cuda-qx-qec-decode",
6666
"build": "cuda-qx-build",
67+
"qec-realtime": "cuda-qx-qec-realtime",
68+
"chemistry": "cuda-qx-solvers-chemistry",
6769
}
6870

6971

.agents/evals/graders/judge.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -60,8 +60,8 @@
6060
PROMPTS_DIR = EVALS_ROOT / "prompts"
6161

6262
SKILL_DIRS = {
63-
"solvers": "cuda-qx-solvers",
64-
"qec": "cuda-qx-qec",
63+
"solvers": "cuda-qx-solvers-algorithms",
64+
"qec": "cuda-qx-qec-decode",
6565
"build": "cuda-qx-build",
6666
}
6767

.agents/evals/graders/programmatic.py

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -40,9 +40,11 @@
4040
# assertions file and as the ``skill`` field in the grading output. Add a new
4141
# entry here when introducing a new skill.
4242
SKILL_DIRS = {
43-
"solvers": "cuda-qx-solvers",
44-
"qec": "cuda-qx-qec",
43+
"solvers": "cuda-qx-solvers-algorithms",
44+
"qec": "cuda-qx-qec-decode",
4545
"build": "cuda-qx-build",
46+
"qec-realtime": "cuda-qx-qec-realtime",
47+
"chemistry": "cuda-qx-solvers-chemistry",
4648
}
4749

4850
# Repo-relative location of the assertions tree. Resolved relative to this
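
As a rough illustration of what the `SKILL_DIRS` map is for, the sketch below resolves a short `--skill` key to an assertions file. The map entries and the `assertions` directory name come from this diff; the `EVALS_ROOT` value, the `assertions_file` helper, and the error handling are assumptions, since the surrounding grader code is not shown.

```python
from pathlib import Path

# Assumed repo-relative root; the graders compute their own EVALS_ROOT.
EVALS_ROOT = Path(".agents/evals")
ASSERTIONS_DIR = EVALS_ROOT / "assertions"

# Entries as they appear in programmatic.py after this commit.
SKILL_DIRS = {
    "solvers": "cuda-qx-solvers-algorithms",
    "qec": "cuda-qx-qec-decode",
    "build": "cuda-qx-build",
    "qec-realtime": "cuda-qx-qec-realtime",
    "chemistry": "cuda-qx-solvers-chemistry",
}


def assertions_file(skill_key: str) -> Path:
    """Map a short --skill key to its assertions JSON (hypothetical helper)."""
    try:
        skill_name = SKILL_DIRS[skill_key]
    except KeyError:
        known = ", ".join(sorted(SKILL_DIRS))
        raise ValueError(f"unknown skill {skill_key!r}; expected one of: {known}") from None
    return ASSERTIONS_DIR / f"{skill_name}.json"


print(assertions_file("qec").as_posix())
print(assertions_file("chemistry").as_posix())
```

Because the short keys stay the same across the rename, existing invocations such as `--skill qec` keep working while pointing at the renamed files. The `judge.py` hunk in this commit keeps a three-entry map, so under this sketch a lookup of `qec-realtime` or `chemistry` there would fail.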

.agents/evals/prompts/cuda-qx-qec.evals.json renamed to .agents/evals/prompts/cuda-qx-qec-decode.evals.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
{
2-
"skill_name": "cuda-qx-qec",
2+
"skill_name": "cuda-qx-qec-decode",
33
"_note": "Prompts only. Assertions live in assertions.json so the agent under test never reads the answer key.",
44
"evals": [
55
{
