 the without-skill agent happened to read `libs/qec/` source verbatim,
 matching exact substrings the with-skill agent paraphrased. See section 4.

-### `cuda-qx-solvers` — iter 1 (first run)
+### `cuda-qx-solvers-algorithms` — iter 1 (first run)

 ```
 with without Δ
@@ -136,8 +136,8 @@ Both skills score 11/11 on activation in iter 2. Both baselines score 5/11.
 This is not a phrasing artifact — it's structural:

 * On in-scope prompts (e.g. "Build a Steane memory experiment"), the
-with-skill agent emits the marker (`cuda-qx-qec`); the baseline cannot,
-because the skill name lives in `.agents/skills/cuda-qx-qec/` which it
+with-skill agent emits the marker (`cuda-qx-qec-decode`); the baseline cannot,
+because the skill name lives in `.agents/skills/cuda-qx-qec-decode/` which it
 was forbidden to read.
 * On out-of-scope prompts (e.g. "Open the cudaq_qec docs offline" — the new
 `A11`), the with-skill agent correctly *withholds* the marker and
@@ -210,7 +210,7 @@ Outcome:
 recipe ending in a working `python -m http.server -d build/docs/sphinx 8080`.
 * QEC `A11`: passes (correctly redirects without naming the skill).
 * QEC `S13`: fails on `http.server` only — agent suggested `file://` URLs.
-Real content gap; one-sentence fix in `references/triage.md`.
+Real content gap; one-sentence fix in `references/decode-triage.md`.

 ---

@@ -220,8 +220,8 @@ Outcome:
 |---|---|---|
 | Judge grader | `not_configured` for all 26 scenarios scored | `pip install anthropic && export ANTHROPIC_API_KEY=...` then re-run `judge.py` on the saved `responses.json` (zero new tokens from agents) |
 | Executable grader | All 26 scenarios `skipped` (no `executable` rules in any assertions file) | Add 2–3 high-leverage rules (`S5` Steane shape, `S6` `@qec.code` decorator, `S11` `qec.operation` enum). Schema documented at top of `executable.py`. |
-| 4 brittle QEC assertions (`S1`, `S3`, `S5`, `S12`) | False-fails as documented in section 4 | One-line edits in `cuda-qx-qec.json` (regex alternatives instead of literal substrings). |
-| QEC `references/triage.md` | Missing one-line note on `python -m http.server` for docs | Surgical edit; flips `S13`. |
+| 4 brittle QEC assertions (`S1`, `S3`, `S5`, `S12`) | False-fails as documented in section 4 | One-line edits in `cuda-qx-qec-decode.json` (regex alternatives instead of literal substrings). |
+| QEC `references/decode-triage.md` | Missing one-line note on `python -m http.server` for docs | Surgical edit; flips `S13`. |
 | `cuda-qx-build` skill | Never evaluated end-to-end | Same recipe, `--skill build`; ~24 prompts × 2 subagents. |
 | Cohen's κ | `n/a` everywhere because judge has 0 scenarios scored | Falls out of #1 above. |

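The "regex alternatives instead of literal substrings" fix in the table above can be illustrated with a minimal sketch. The pattern, the helper names, and the sample response are all hypothetical; the real assertions file and `programmatic.py` may represent and evaluate rules differently.

```python
import re

# Hypothetical rule: neither the literal nor the pattern is taken from the
# real assertions file; they only illustrate the brittleness being described.
LITERAL = "C-order"
PATTERN = re.compile(r"C[- ]order|C[- ]contiguous|row[- ]major", re.IGNORECASE)


def passes_literal(response: str) -> bool:
    """Brittle check: the exact substring must appear in the response."""
    return LITERAL in response


def passes_regex(response: str) -> bool:
    """Tolerant check: any of the listed alternatives counts as a match."""
    return PATTERN.search(response) is not None


# A correct answer that paraphrases instead of quoting the expected substring.
paraphrased = "Convert H with np.ascontiguousarray so it is row-major."
print(passes_literal(paraphrased))
print(passes_regex(paraphrased))
```

The literal check false-fails on the paraphrase while the alternation accepts it, which is the failure mode the `S1`, `S3`, `S5`, `S12` row describes. Widening a pattern also raises the risk of false passes, so each alternative should be one that only a correct answer would contain.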
@@ -231,11 +231,11 @@ Outcome:

 In rough cost order, cheapest first:

-1. **Patch the 4 brittle assertions** in `cuda-qx-qec.json` (regex alternatives
+1. **Patch the 4 brittle assertions** in `cuda-qx-qec-decode.json` (regex alternatives
 for `C-order`, `RuntimeError`, the shape variants, `X errors`/`Z errors`).
 Re-run `programmatic.py` on the existing `responses.json`. Zero tokens.
 Expected with-skill scenario pass rate: 11/13 on the existing data.
-2. **Add the `http.server` sentence** to `cuda-qx-qec` `references/triage.md`.
+2. **Add the `http.server` sentence** to `cuda-qx-qec-decode` `references/decode-triage.md`.
 Doesn't change scoring this iteration, but makes `S13` flip the next
 time a fresh agent runs.
 3. **Wire the judge grader** (export an API key, run `judge.py` on the
.agents/evals/assertions/cuda-qx-qec-decode.json
13 additions & 3 deletions
@@ -1,11 +1,21 @@
 {
-  "skill_name": "cuda-qx-qec",
+  "skill_name": "cuda-qx-qec-decode",
   "_note": "Answer key. NEVER read this file from inside SKILL.md or any references/*.md. Loaded by .agents/evals/graders/ only.",
-  "activation_marker": "cuda-qx-qec",
+  "activation_marker": "cuda-qx-qec-decode",
   "scenarios": {
     "S1": {
       "must_include": ["C-order"],
-      "must_not_include": ["F-order is supported"]
+      "must_not_include": ["F-order is supported"],
+      "executable": {
+        "code_extractor": "first_python_block",
+        "interpreter": "python3",
+        "preamble": "import numpy as np\nH = np.zeros((3, 5), dtype=np.uint8, order='F')\nassert H.flags['F_CONTIGUOUS'], 'precondition failed: H must start F-ordered'\n",
+        "harness": "{code}\nimport numpy as np\n# Probe any numpy array variable in locals() that is now C-contiguous\nresult = any(v.flags['C_CONTIGUOUS'] for v in list(locals().values()) if isinstance(v, np.ndarray))\nprint('C_CONTIGUOUS_FOUND' if result else 'NO_C_ORDER_ARRAY')\n",
"harness": "{code}\n# After the agent's snippet, both block_size and syndrome_size variables must exist with right values\nimport sys\ntry:\n ok = (int(block_size) == 200) and (int(syndrome_size) == 60)\nexcept NameError:\n ok = False\nprint('SIZES_OK' if ok else 'SIZES_WRONG')\n",
.agents/evals/assertions/cuda-qx-solvers-algorithms.json
2 additions & 2 deletions
@@ -1,7 +1,7 @@
 {
-  "skill_name": "cuda-qx-solvers",
+  "skill_name": "cuda-qx-solvers-algorithms",
   "_note": "Answer key. NEVER read this file from inside SKILL.md or any references/*.md. The agent under test must not see this content. Loaded by .agents/evals/graders/programmatic.py (and judge.py) only.",
"preamble": "# Active-space parameters from the prompt\nnele_cas = 6\nnorb_cas = 6\n",
+"harness": "{code}\n# Agent's snippet must compute num_qubits = 12 via 2 * num_orbitals\ntry:\n ok = int(num_qubits) == 12\nexcept NameError:\n ok = False\nprint('QUBITS_OK' if ok else 'QUBITS_WRONG')\n",
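The `preamble` and `harness` strings above suggest that an executable rule is evaluated by extracting the agent's first Python block, splicing it into the harness at `{code}`, running the result, and looking for a sentinel on stdout. The sketch below shows one way that could work. The real schema lives at the top of `executable.py` and is not shown in this diff, so `run_rule`, `first_python_block`, the `sentinel` argument, and the sample agent response are all assumptions made for illustration.

```python
import re
import subprocess
import sys
from typing import Optional

FENCE = "`" * 3  # built indirectly so this sketch can itself sit inside a fence
PY_BLOCK = re.compile(FENCE + r"python[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def first_python_block(response: str) -> Optional[str]:
    """Return the body of the first fenced Python block, or None if absent."""
    match = PY_BLOCK.search(response)
    return match.group(1) if match else None


def run_rule(response: str, preamble: str, harness: str, sentinel: str) -> bool:
    """Hypothetical runner: execute preamble + harness with {code} filled in."""
    code = first_python_block(response)
    if code is None:
        return False
    # str.replace rather than str.format, because agent code may contain braces.
    program = preamble + harness.replace("{code}", code)
    proc = subprocess.run(
        [sys.executable, "-c", program],  # the rule names "python3"; assumed equivalent
        capture_output=True, text=True, timeout=30,
    )
    return proc.returncode == 0 and sentinel in proc.stdout


# Preamble and harness modeled on the cuda-qx-solvers-algorithms rule above.
PREAMBLE = "# Active-space parameters from the prompt\nnele_cas = 6\nnorb_cas = 6\n"
HARNESS = (
    "{code}\n"
    "try:\n"
    "    ok = int(num_qubits) == 12\n"
    "except NameError:\n"
    "    ok = False\n"
    "print('QUBITS_OK' if ok else 'QUBITS_WRONG')\n"
)

# Hypothetical agent response containing a single fenced Python block.
response = (
    "Two spin orbitals per spatial orbital, one qubit each:\n"
    f"{FENCE}python\nnum_orbitals = norb_cas\nnum_qubits = 2 * num_orbitals\n{FENCE}\n"
)
print(run_rule(response, PREAMBLE, HARNESS, "QUBITS_OK"))
```

Running the snippet in a subprocess keeps a crashing or hanging agent snippet from taking the grader down with it. It does not sandbox the code, so a real grader would likely want stronger isolation.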