diff --git a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/QUALITATIVE.md b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/QUALITATIVE.md
index 3fd35358..a121f7d5 100644
--- a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/QUALITATIVE.md
+++ b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/QUALITATIVE.md
@@ -1,6 +1,8 @@
 # Qwen3.5-397B-A17B vs Step-3.7-Flash — qualitative differences
 
-**Status: N=1 / provisional. Both 397B arms complete (no-think 8/12, think 7/12); Flash low/med/high complete.**
+**Status: N=3. Both 397B arms complete (no-think 23/36, think 22/36); Flash low/med/high complete.**
+Per-cell citations below may reference N=1 (`_v1`) artifacts where the behavior is identical across reps;
+N=3 pass counts are in findings.md.
 This doc is deliberately *not* a pass/fail scorecard. Pass/fail ties (397B no-think 8/12
 vs Flash 7–8/12) hide the differences that matter — those live here. Every claim cites the
 cell/file it came from so it's reproducible. See SCORECARD/findings for the quantitative table.
@@ -56,32 +58,35 @@ synergies, opaque valuation) — quality parity on the judgment. The form differ
   metric doesn't just mis-score — it **invents the wrong story about why**. The pass/fail bit (FAIL) was
   right by accident; everything it implied about the model was wrong. (`logs/p1_testwrite_397b-think_v1/grade.json`)
 
-## 4. Does thinking help 397B? No — net −1, and the loss is revealing
-**397B no-think 8/12 vs 397B think 7/12.** Eleven of twelve cells are identical between modes; reasoning
-changed exactly one outcome — and made it *worse*:
+## 4. Does thinking help 397B? Net −1 — and N=3 shows it *redistributes*, not "does nothing"
+**No-think 23/36 vs think 22/36** (N=1 was 8/12 vs 7/12; net −1 at both N). At **N=1** the loss looked
+like a single verbosity flip (`p3_doc`), and it was tempting to call thinking "inert everywhere else."
+**N=3 refutes that** — thinking moves three cells, in *both* directions:
 
-| flip | no-think | think | cause |
-|---|---|---|---|
-| **p3_doc** | PASS (692w) | **FAIL (721w)** | identical content, verbosity blew the limit |
+| cell | no-think (N=3) | think (N=3) | Δ | what's happening |
+|---|:--:|:--:|:--:|---|
+| **p3_market** | 1/3 | **3/3** | +2 | thinking *stabilizes* the wobbliest cell — coin-flip → lock, zero runaways |
+| **p3_pm** | 2/3 | **0/3** | −2 | thinking *hurts* project-mgmt synthesis (over-deliberation → worse) |
+| **p3_doc** | 2/3 | 1/3 | −1 | the verbosity story: thinking inflates length, trips the 700-word limit |
 
-`p3_doc` think captured **all 8/8 facts** (`fact_coverage 1.0`), same as no-think — but wrote 721 words
-against a 700-word limit (`within_word_limit: False`) where no-think landed at 692. Thinking did not make
-it less accurate; it made it **less disciplined about the length constraint**, amplifying 397B's existing
-over-documentation tendency (§2). It also spent more turns getting there (35 vs 20).
-(`logs/p3_doc_397b-{nothink,think}_v1/grade.json`) — a clean case of why pass/fail alone misleads: the
-think output is arguably equal in substance and failed on form.
-
-Everywhere else thinking was **inert**: same PASS/FAIL, just more tokens and turns. On this suite,
-reasoning bought 397B nothing.
+The N=1 `p3_doc` flip was real but *not the whole story* — one draw of a three-way swing. The verbosity
+mechanism is well-captured: at N=1, think `p3_doc` hit **all 8/8 facts** (`fact_coverage 1.0`, same as
+no-think) but wrote 721 words vs the 700 limit (no-think: 692) — equal substance, failed on form
+(`logs/p3_doc_397b-{nothink,think}_v1/grade.json`). The new lesson from N=3: **reasoning changes *where*
+397B succeeds, not *how often*** — it buys market-research reliability at the cost of pm/doc, netting to
+−1. A single-N read would have missed both the gain and the symmetry.
 
 **Reasoning shape:** 397B thinks in short targeted bursts (`p1_bugfix` think: 16 of 126 turns carry a
 substantial think block, median ~73 reasoning tokens/turn), not long monologues — but uses more turns
 than no-think on the same task (126 vs 110).
 
-**No runaways, either mode.** All 12 think cells finished `done_signal` (no max_tokens/length failures) —
-including `p3_market` (STRUCTURAL_PASS, 56 iters). Contrast Flash, which **ran away on `p3_market` at low
-effort** (hit max_tokens). 397B is runaway-resistant in both modes; Flash's runaway risk is concentrated
-at low reasoning effort. This is a real reliability edge for 397B.
+**No *runaways*, but no-think *stalls* on market research.** Zero max_tokens/length failures across all
+72 cells — contrast Flash, which **ran away on `p3_market` at low effort** (hit max_tokens). 397B's
+failure mode is the opposite of runaway: 69/72 `done_signal`, and the 3 non-clean exits were all no-think
+— `p3_market` v2/v3 hit the 500-iter *stuck* threshold (spinning without progress) and `p3_pm` v1
+`model_stopped`. The think arm is clean (36/36 `done_signal`), and notably `p3_market` think is 3/3 vs
+no-think's 2-stuck/1-pass: **thinking eliminates the stall.** So the reliability edge over Flash is real
+(no runaways), but the accurate claim is "never runs away, sometimes stalls quietly," not "always finishes."
 
 ## 5. Integration cost (a "messy model" finding in itself)
 - Flash (vLLM) ran the harness out of the box once launched.
@@ -92,10 +97,12 @@ at low reasoning effort. This is a real reliability edge for 397B.
   re-run name collisions). Both are harness/engine-integration bugs, not model quality — but they're
   exactly the "messy" friction MMBT exists to document. PR should fix both in the harness.
 
-## Net take (provisional, no-think only)
+## Net take (N=3, both arms)
 397B no-think is the steady, over-documenting, high-prose-quality one whose misses are omissions; Flash
 is the fast, terse, reasoning-driven one — brilliant when bounded, flaily when not. They agree on
 substance more than the scorecard's "tie" suggests. Flash is the cheaper/faster way to the same band
-(~99 vs ~71 tok/s, one engine vs both GPUs); 397B's case is narrow (long-form synthesis reliability,
-one stable setting, no effort-tuning). The 397B-think arm is the apples-to-apples test against Flash's
-reasoning modes — pending.
+(~99 vs ~71 tok/s, one engine vs both GPUs). 397B's edge is reliability — no max_tokens runaways (its
+worst case is a quiet *stall* on no-think market research, which thinking clears to 3/3) — not raw
+accuracy. Thinking doesn't raise the aggregate (net −1); it *redistributes* (market up, pm/doc down), so
+"turn thinking on" is a per-task call, not a default. Reach for 397B when runaway resistance and
+market-research reliability matter; otherwise Flash wins on speed and cost.
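Reviewer note on the "redistributes, not inert" claim in section 4: the arithmetic can be checked with a short sketch. The pass counts are copied from the section 4 table; the names `PASSES`, `deltas`, `net`, and `gross` are illustrative and not part of the harness.

```python
# Pass counts out of 3 for the three cells that differ between arms
# (copied from the section 4 table in QUALITATIVE.md).
PASSES = {
    "p3_market": {"nothink": 1, "think": 3},
    "p3_pm": {"nothink": 2, "think": 0},
    "p3_doc": {"nothink": 2, "think": 1},
}


def deltas(passes):
    """Return think-minus-nothink pass counts per cell."""
    return {cell: arms["think"] - arms["nothink"] for cell, arms in passes.items()}


d = deltas(PASSES)
net = sum(d.values())                    # signed sum: what the aggregate score sees
gross = sum(abs(v) for v in d.values())  # unsigned sum: how many rep outcomes actually moved

print(d, "net:", net, "gross:", gross)
```

A net of −1 on a gross movement of 5 rep outcomes is what separates "redistributes" from "inert": an inert arm would show a gross of 0.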
diff --git a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/README.md b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/README.md
index d2731022..bc436ab0 100644
--- a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/README.md
+++ b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/README.md
@@ -1,18 +1,22 @@
-# Qwen3.5-397B-A17B on 2× RTX PRO 6000 Blackwell — microbench N=1 (+ Step-3.7-Flash comparison)
+# Qwen3.5-397B-A17B on 2× RTX PRO 6000 Blackwell — microbench N=3 (+ Step-3.7-Flash comparison)
 
-A large dense-MoE (397B total / ~17B active) run as a GGUF on llama.cpp, benched through the MMBT
+A large MoE (397B total / ~17B active) run as a GGUF on llama.cpp, benched through the MMBT
 12-family agentic microbench in two reasoning modes (no-think / think) and compared against the
 Step-3.7-Flash-NVFP4 entry on the same box.
 
-**Provisional, N=1** — one replicate per cell. An N=3 re-run is queued; treat numbers as directional.
+**N=3** — three replicates per cell (phase-1 graded with the fixed `phase1_grade.py`).
 
 ## TL;DR
-- **397B no-think 8/12, think 7/12; Step-3.7-Flash 7–8/12.** Aggregate ties across a ~15× param range.
-- **Thinking didn't help 397B** (net −1) — the one regression is a verbosity-driven word-limit overrun at
-  identical fact coverage, not a reasoning failure.
-- **397B never ran away** (both arms) where Flash did at low effort — a real reliability edge.
+- **397B no-think 23/36, think 22/36 (8/12 and 7/12 by per-task majority); Step-3.7-Flash 7–8/12.** Aggregate ties across a ~15× param range.
+- **Thinking is net −1, but not inert** (N=3 correction): it *redistributes* — stabilizes `p3_market`
+  (1/3→3/3) while hurting `p3_pm` (2/3→0/3) and `p3_doc` (2/3→1/3). It changes *where* 397B succeeds,
+  not how often.
+- **397B never ran away** (zero max_tokens/length failures across 72 cells) where Flash did at low effort.
+  Its only non-clean exits were no-think: two `p3_market` reps hit the 500-iter *stuck* threshold and one
+  `p3_pm` rep ended `model_stopped`. That is stalling, not runaway, and thinking clears the market stall (think market 3/3).
+- **N=3 matters:** `p3_market`/`p3_pm`/`p3_doc` are high-variance; their N=1 verdicts were single draws, not settled results.
 - **Flash is the cheaper/faster default** (~99 vs ~71 tok/s, one engine vs both GPUs); 397B's case is
-  narrow (long-form synthesis reliability, one stable setting).
+  reliability (runaway resistance, thinking-on market research).
 - The substance is qualitative — **read [QUALITATIVE.md](QUALITATIVE.md).**
 
 ## Files
diff --git a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/findings.md b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/findings.md
index bcfc4005..4233898e 100644
--- a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/findings.md
+++ b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/findings.md
@@ -1,8 +1,8 @@
-# Qwen3.5-397B-A17B (GGUF, llama.cpp) — microbench N=1, vs Step-3.7-Flash
+# Qwen3.5-397B-A17B (GGUF, llama.cpp) — microbench N=3, vs Step-3.7-Flash
 
-**Provisional / N=1.** One replicate per cell — directional, not statistically settled. An N=3
-re-run is queued. Pair this with [QUALITATIVE.md](QUALITATIVE.md): the pass/fail table below ties
-across models, and the differences that matter are qualitative.
+**N=3** (three replicates per cell). Pair this with [QUALITATIVE.md](QUALITATIVE.md): the pass/fail
+table below ties across models, and the differences that matter are qualitative. Phase-1 cells are
+graded with the **fixed** `phase1_grade.py` (see grading-correctness note below).
 
 ## Setup
 - **Model:** Qwen3.5-397B-A17B, unsloth UD-Q3_K_XL GGUF (~167 GB on disk, 5 shards).
@@ -14,23 +14,23 @@ across models, and the differences that matter are qualitative.
   `../step3.7-flash-nvfp4-dual-blackwell-2026-05-28/`. Cross-engine + cross-quant: **"best-as-each-ships,"
   not a clean precision study.**
 
-## Scorecard (N=1)
+## Scorecard (N=3, pass count per cell)
 
-| task | 397B no-think | 397B think | Step low/med/high | 27B (ref N=3) | Coder (ref N=3) |
+| task | 397B no-think | 397B think | Step low/med/high | 27B (ref) | Coder (ref) |
 |---|:--:|:--:|:--:|:--:|:--:|
-| p1_bugfix | ✓ | ✓ | ✓/✓/✓ | 3/3 | 2/3 |
-| p1_testwrite † | ✗ | ✗ | ✗/✗/✗ | 0/3 † | 0/3 † |
-| p1_refactor † | ✗ | ✗ | ✓/✗/✗ | 0/3 † | 0/3 † |
-| p2_extract | ✓ | ✓ | ✓/✓/✓ | 3/3 | 3/3 |
-| p2_ci | ✓ | ✓ | ✓/✓/✓ | 3/3 | 3/3 |
-| p2_hallucination | ✓ | ✓ | ✓/✓/✓ | 3/3 | 1/3 |
-| p2_triage | ✓ | ✓ | ~/✓/✓ | 3/3 | 3/3 |
-| p3_doc | ✓ | **✗** | ~/✓/✓ | 0/3 | 2/3 |
-| p3_business | ✓ | ✓ | ✓/~/✗ | 2/3 | 3/3 |
-| p3_market * | ✓ | ✓ | **✗**/✓/✓ | 3/3 * | 0/3 |
-| p3_writing | ✗ | ✗ | ✗/~/✗ | 0/3 | 2/3 |
-| p3_pm | ✗ | ✗ | ✗/~/✓ | 0/3 | 1/3 |
-| **Total** | **8/12** | **7/12** | **7 / 8 / 8** | ~7/12 | ~7/12 |
+| p1_bugfix | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 2/3 |
+| p1_testwrite † | 0/3 | 0/3 | ✗/✗/✗ | 0/3 † | 0/3 † |
+| p1_refactor † | 0/3 | 0/3 | ✓/✗/✗ | 0/3 † | 0/3 † |
+| p2_extract | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 3/3 |
+| p2_ci | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 3/3 |
+| p2_hallucination | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 1/3 |
+| p2_triage | 3/3 | 3/3 | ~/✓/✓ | 3/3 | 3/3 |
+| p3_doc | 2/3 | **1/3** | ~/✓/✓ | 0/3 | 2/3 |
+| p3_business | 3/3 | 3/3 | ✓/~/✗ | 2/3 | 3/3 |
+| p3_market * | **1/3** | **3/3** | ✗/✓/✓ | 3/3 * | 0/3 |
+| p3_writing | 0/3 | 0/3 | ✗/~/✗ | 0/3 | 2/3 |
+| p3_pm | **2/3** | **0/3** | ✗/~/✓ | 0/3 | 1/3 |
+| **Total** | **23/36** | **22/36** | **7 / 8 / 8** | ~7/12 | ~7/12 |
 
 † `p1_refactor` fails on structure (no `output/` subpackage created), not the model's competence at the
 core edit. `p1_testwrite` — see the grading-correctness note below; the earlier "task-design" framing was
@@ -40,41 +40,51 @@ partly a grader artifact. \* `p3_market` is graded STRUCTURAL_PASS (citation val
 A review caught that `phase1_grade.py` read flat keys (`coverage_pct`, `ruff_issues`, `benchmark_s`) while
 `code_task_grader.py` writes nested ones (`coverage.line_coverage_pct`, `ruff.issue_count`,
 `benchmark.elapsed_s`). Effect: `p1_bugfix`'s ruff/benchmark gates were silently always-true, and
-`p1_testwrite`'s coverage gate was always-false. Fixed and **all phase-1 cells regraded**. Outcome:
-- **Totals unchanged (8/12 / 7/12)** — but now *trustworthy*, not coincidental.
-- `p1_bugfix` PASS is now genuinely validated: ruff 2→0 and benchmark **11.2s→0.537s** (the planted O(n²)
-  fix) are real and pass — they were previously ignored.
+`p1_testwrite`'s coverage gate was always-false. Fixed and **all phase-1 cells regraded** (N=1 and N=3):
+- `p1_bugfix` PASS is genuinely validated: ruff 2→0 and benchmark **11.2s→0.537s** (the planted O(n²)
+  fix) are real and pass — they were previously ignored. Consistent 3/3 both arms.
 - `p1_testwrite` still FAILs, but the **reason flips**: think-mode actually achieved **99% coverage / 153
   passing tests** (the broken grader reported `cov=0` and hid it); it fails only on `logalyzer_unchanged`
   (it edited production code, violating the "only /tests/ may differ" rule). The model is *capable* here —
   the task constraint, not incapacity, is what fails it. The inherited † "task-design" footnote on testwrite
   is misleading and should be re-examined for the published 27B/Coder cells too.
-- ⚠️ **The 27B / Coder reference columns in the scorecard above predate this fix.** Their phase-1
-  (bug-fixing / test-writing) numbers came from the same buggy grader, so they may be wrong — testwrite
-  especially is likely a guaranteed-FAIL artifact. **Historical phase-1 scores may need regrading; see
-  tracking issue #29.** Treat the reference columns' p1_* cells as provisional until that lands.
+- ⚠️ **The 27B / Coder reference columns predate this fix.** Their phase-1 (bug-fixing / test-writing)
+  numbers came from the same buggy grader, so they may be wrong — testwrite especially is likely a
+  guaranteed-FAIL artifact. **Historical phase-1 scores may need regrading; see tracking issue #29.**
+  Treat the reference columns' p1_* cells as provisional until that lands.
 
 ## Headline findings
 
-1. **Scale doesn't move the aggregate.** A 397B-param model lands in the *same 7–8/12 band* as a 27B,
-   a ~30B coder, and an ~11B-active Flash. The interesting signal is per-task and qualitative, not the total.
+1. **Scale doesn't move the aggregate.** 397B lands at 23/36 (no-think) and 22/36 (think) — the *same
+   7–8/12 band* (by per-task majority) as a 27B, a ~30B coder, and an ~11B-active Flash. The interesting
+   signal is per-task and qualitative, not the total.
 
-2. **Thinking is net −1 for 397B on this suite** (8/12 → 7/12). 11 of 12 cells are identical between modes;
-   the lone flip is `p3_doc` PASS→FAIL — and it's instructive: *both* modes captured all 8/8 facts
-   (`fact_coverage 1.0`); think just wrote 721 words against a 700-word limit (no-think: 692). Reasoning
-   amplified 397B's verbosity and blew a hard constraint with identical content. Thinking was inert
-   everywhere else — more tokens and turns, same outcomes. **Reasoning bought 397B nothing here.**
+2. **Thinking is net −1 for 397B (8/12→7/12 at N=1, 23/36→22/36 at N=3) — but N=3 overturns the N=1
+   "inert" reading.** At N=1 the loss looked like a single verbosity flip (`p3_doc`). With three
+   replicates, thinking is not inert — it **redistributes**: it *helps* `p3_market` (no-think 1/3 → think
+   **3/3**, stabilizing the wobbliest cell, zero runaways) but *hurts* `p3_pm` (2/3 → **0/3**) and `p3_doc`
+   (2/3 → 1/3, the verbosity-vs-word-limit story). Net −1, but as the sum of real per-task swings (+2, −2, −1),
+   not a no-op. Reasoning changes *where* 397B succeeds while barely changing *how often*.
 
-3. **397B is runaway-resistant; Flash is not (at low effort).** All 24 397B cells (both arms) finished
-   `done_signal` — zero max_tokens/length failures. Step-3.7-Flash **ran away on `p3_market` at low effort**
-   (hit max_tokens). 397B's reliability edge is real and mode-independent.
+3. **N=3 exposes which N=1 verdicts were luck.** Single replicates are noisy on the open-ended cells: in
+   the no-think arm, `p3_market` (N=1 ✓ → N=3 1/3) and `p3_pm` (N=1 ✗ → N=3 2/3) were single-draw
+   artifacts; `p3_doc` (✓ → 2/3) wobbles on the word limit. Zero-variance cells (3/3 or 0/3 across all
+   reps): p1_bugfix, the grounded mid-tier (p2_extract/ci/hallucination/triage), p3_business, and the
+   consistent fails. **Trust the mid-tier and bugfix; treat market/pm/doc as high-variance.**
 
-4. **397B's distinctive lane is long-form synthesis** (`p3_doc`/`p3_business`/`p3_market` all pass in
-   no-think) where the 27B was weak — but it's the **slowest and most expensive** way to reach the shared
-   band (~71 tok/s spanning both GPUs at Q3, vs Flash ~99 tok/s on one engine). Flash is the better default;
-   397B earns its keep only where synthesis reliability and a single stable setting matter.
+4. **397B is runaway-resistant — but no-think market research gets *stuck*.** Zero max_tokens/length
+   runaways across all 72 cells (the failure mode Step-3.7-Flash showed on `p3_market` at low effort).
+   69/72 finished `done_signal`; the 3 non-clean exits were **all no-think**: `p3_market` v2 & v3 hit the
+   500-iter **stuck threshold** (`stuck_no_workspace_change_for_500_iters` — spinning without progress, not
+   over-generating) and `p3_pm` v1 `model_stopped`. So 397B's pathology is *stalling*, not runaway — and
+   **thinking eliminates the market stall**: think `p3_market` is a clean 3/3 `done_signal` vs no-think's
+   1/3 (2 stuck). For market research, reasoning turns a mostly-stuck cell (1/3) into a lock (3/3).
 
-5. **Integration tax (a "messy model" finding).** Flash (vLLM) ran the harness out of the box. 397B
+5. **Cost still favors Flash.** 397B reaches the shared band at ~71 tok/s spanning both GPUs at Q3, vs
+   Flash ~99 tok/s on one engine. Flash is the better default; 397B earns its keep where runaway
+   resistance and thinking-on market-research reliability matter.
+
+6. **Integration tax (a "messy model" finding).** Flash (vLLM) ran the harness out of the box. 397B
    (llama.cpp) needed `--reasoning-format none` (default extracts CoT into `reasoning_content`, leaving
    `content` empty → agent loop reads a thinking turn as "done" and dies at iter ~3 — invisible to a
    thinking-off smoke) plus harness cleanup fixes (non-sudo `rm` on root-owned sandbox/grader leftovers).
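Reviewer note on the N=3 scorecard in findings.md: the totals, the "7–8/12 band (by per-task majority)" statement, and the high-variance list can all be re-derived from the per-cell pass counts. The sketch below does that; the counts are copied from the two 397B columns, and the variable names are illustrative only.

```python
# (no-think passes, think passes) out of 3 reps, copied from the N=3 scorecard.
SCORECARD = {
    "p1_bugfix": (3, 3),
    "p1_testwrite": (0, 0),
    "p1_refactor": (0, 0),
    "p2_extract": (3, 3),
    "p2_ci": (3, 3),
    "p2_hallucination": (3, 3),
    "p2_triage": (3, 3),
    "p3_doc": (2, 1),
    "p3_business": (3, 3),
    "p3_market": (1, 3),
    "p3_writing": (0, 0),
    "p3_pm": (2, 0),
}
REPS = 3

columns = list(zip(*SCORECARD.values()))  # one tuple of counts per arm

# Raw pass totals out of 36 per arm.
totals = tuple(sum(col) for col in columns)

# Per-task majority vote (at least 2 of 3 reps pass), out of 12 per arm.
majority = tuple(sum(1 for n in col if 2 * n > REPS) for col in columns)

# Cells where at least one arm is neither 0/3 nor 3/3.
high_variance = sorted(
    cell for cell, counts in SCORECARD.items() if any(0 < n < REPS for n in counts)
)

print(totals, majority, high_variance)
```

The majority vote reproduces the N=1 headline numbers (8/12 and 7/12), which is why the aggregate looks stable even though three cells moved underneath it.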
diff --git a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/manifest.json b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/manifest.json
index 03e494b3..b89f7c6c 100644
--- a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/manifest.json
+++ b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/manifest.json
@@ -2,8 +2,8 @@
   "$schema_note": "Read this BEFORE drawing conclusions. Machine-readable provenance for the qwen3.5-397b-vs-step3.7-flash-2026-05-29 entry. Companion to README.md / findings.md / QUALITATIVE.md.",
   "bundle_id": "qwen3.5-397b-vs-step3.7-flash-2026-05-29",
   "snapshot_date_utc": "2026-05-29",
-  "kind": "agentic microbench (12 task families) — N=1 provisional, two reasoning arms, + cross-model comparison",
-  "claim_scope": "Quality/behavior of Qwen3.5-397B-A17B (UD-Q3_K_XL GGUF) on llama.cpp through the MMBT 12-family agentic microbench, no-think vs think, on 2x RTX PRO 6000 Blackwell; compared against the Step-3.7-Flash-NVFP4 entry on the same rig. Single rig, ONE replicate per cell (N=1) — directional, not statistically settled. N=3 re-run queued.",
+  "kind": "agentic microbench (12 task families) — N=3, two reasoning arms, + cross-model comparison",
+  "claim_scope": "Quality/behavior of Qwen3.5-397B-A17B (UD-Q3_K_XL GGUF) on llama.cpp through the MMBT 12-family agentic microbench, no-think vs think, on 2x RTX PRO 6000 Blackwell; compared against the Step-3.7-Flash-NVFP4 entry on the same rig. Single rig, three replicates per cell (N=3). Phase-1 cells graded with the fixed phase1_grade.py.",
   "hardware": {
     "host": "Tower2 (ASUS Pro WS WRX90E-SAGE SE, Threadripper PRO 7965WX)",
     "gpus": "2x NVIDIA RTX PRO 6000 Blackwell Workstation Edition, 96 GB each",
@@ -43,20 +43,21 @@
     "max_model_len_flag": "--max-model-len 131072 (matches served -c so request max_tokens math doesn't exceed ctx and 400)",
     "temperature": 0.3,
     "stuck_threshold": 500,
-    "n_per_cell": 1,
+    "n_per_cell": 3,
     "smoke_gate": "PASS (extraction, thinking off): tool-calling proven in harness, all fields correct, done_signal"
   },
   "results": {
-    "397b_no_think_pass": "8/12",
-    "397b_think_pass": "7/12",
+    "397b_no_think_pass": "23/36 (N=3); 8/12 (N=1)",
+    "397b_think_pass": "22/36 (N=3); 7/12 (N=1)",
     "step3p7_flash_pass": {"low": "7/12", "medium": "8/12", "high": "8/12"},
-    "think_vs_nothink": "net -1; only flip p3_doc (PASS->FAIL) = verbosity blew 700-word limit at identical 8/8 fact coverage",
-    "runaways": "none in either 397B arm (all done_signal); Step-3.7-Flash ran away on p3_market at low effort",
+    "think_vs_nothink": "net -1 at both N=1 and N=3, BUT N=3 shows it redistributes (not inert): p3_market +2 (1/3->3/3, stabilizes), p3_pm -2 (2/3->0/3), p3_doc -1 (verbosity vs 700-word limit).",
+    "n3_variance": "no-think single-draw luck exposed at N=3: p3_market N1 PASS->1/3, p3_pm N1 FAIL->2/3, p3_doc N1 PASS->2/3. Zero-variance: p1_bugfix, p2_extract/ci/hallucination/triage, p3_business (3/3); testwrite/refactor/writing (0/3).",
+    "runaways": "zero max_tokens/length runaways across all 72 cells (Step-3.7-Flash ran away on p3_market at low effort). 69/72 done_signal; the 3 non-clean exits were all no-think: p3_market v2/v3 hit the 500-iter stuck threshold, p3_pm v1 model_stopped. Pathology is stalling, not runaway; think arm 36/36 done_signal and think p3_market 3/3 (thinking clears the stall).",
     "known_artifact": "p1_refactor fails on structure (no output/ subpackage). p1_testwrite fails on the 'only /tests/ may differ' rule — NOT incapacity: corrected grading shows think-mode reached 99% coverage / 153 passing tests.",
     "phase1_grader_fix_2026-05-29": "phase1_grade.py read flat keys (coverage_pct/ruff_issues/benchmark_s) vs code_task_grader's nested coverage.line_coverage_pct/ruff.issue_count/benchmark.elapsed_s -> bugfix ruff/bench gates always-true, testwrite coverage gate always-false. Fixed and ALL phase-1 cells regraded; totals unchanged (8/12, 7/12) but now trustworthy."
   },
   "caveats": [
-    "N=1 per cell — directional only; N=3 re-run queued.",
+    "N=3 per cell. Open-ended cells (p3_market/p3_pm/p3_doc) are high-variance — see n3_variance.",
     "Cross-engine (llama.cpp vs vLLM) AND cross-quant (Q3_K_XL vs NVFP4) vs Step-3.7-Flash: 'best-as-each-ships', NOT a clean precision study.",
     "Some graders are binary and can fail high-quality output on format/length (see p3_writing); pair with QUALITATIVE.md."
   ]
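Reviewer note on `phase1_grader_fix_2026-05-29`: the flat-versus-nested key mismatch is easy to reproduce in isolation. The sketch below is not the code in `phase1_grade.py`, which this diff does not show. The `get_nested` helper and the `report` dict are hypothetical, and the values are borrowed from the findings purely for illustration.

```python
from typing import Any


def get_nested(report: dict, dotted_key: str) -> Any:
    """Follow a dotted path such as 'coverage.line_coverage_pct'.

    Raises KeyError if any hop is missing, so a schema mismatch fails
    loudly instead of falling back to a default.
    """
    node: Any = report
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(dotted_key)
        node = node[part]
    return node


# Hypothetical grader output using the nested key names quoted in the manifest.
report = {
    "coverage": {"line_coverage_pct": 99.0},
    "ruff": {"issue_count": 0},
    "benchmark": {"elapsed_s": 0.537},
}

# A flat lookup with a default silently returns the default when the key
# lives one level down; this is how a gate can end up constant.
flat_cov = report.get("coverage_pct", 0)

# The nested lookup returns the real value.
nested_cov = get_nested(report, "coverage.line_coverage_pct")

print("flat:", flat_cov, "nested:", nested_cov)
```

Raising on a missing key rather than defaulting is a design choice: under that assumption, a future schema drift between the two scripts would surface as a grading error rather than as a silently constant gate.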
diff --git a/tooling/scripts/run_microbench.sh b/tooling/scripts/run_microbench.sh
index a5cb191f..3526c967 100755
--- a/tooling/scripts/run_microbench.sh
+++ b/tooling/scripts/run_microbench.sh
@@ -76,6 +76,24 @@ if [ -n "$REASONING_EFFORT" ] && [[ "$LABEL" != *"$REASONING_EFFORT"* ]]; then
   exit 2
 fi
 
+# Same guard for --thinking: run names are keyed by LABEL only, so running
+# --thinking off then on under ONE label silently skips the second arm as
+# "already complete". Require the mode encoded in the label (nothink / think).
+if [ -n "$THINKING" ]; then
+  if [ "$THINKING" = "off" ]; then
+    want="nothink"; [[ "$LABEL" == *nothink* ]] && ok=1 || ok=0
+  else
+    want="think"; [[ "$LABEL" == *think* && "$LABEL" != *nothink* ]] && ok=1 || ok=0
+  fi
+  if [ "$ok" != "1" ]; then
+    echo "ERROR: --thinking '$THINKING' is set but label '$LABEL' does not encode it ('$want')." >&2
+    echo "       Run names are keyed by label only, so --thinking off then on under one label collide" >&2
+    echo "       (the second arm is skipped as already-complete). Put the mode in the label, e.g.:" >&2
+    echo "         $0 $MODEL $PORT 397b-${want} ${N} \"\" ${THINKING} ${MAXLEN}" >&2
+    exit 2
+  fi
+fi
+
 TOOLING="$(cd "$(dirname "$0")/.." && pwd)"
 REPO_ROOT="$(cd "$TOOLING/.." && pwd)"
 cd "$REPO_ROOT"
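Reviewer note on the new `--thinking` guard in `run_microbench.sh`: the matching rule is restated below as a Python predicate so its edge cases are easy to enumerate. This is a hedged port for review purposes, not part of the script. The function name `label_encodes_mode` is made up here, and the port assumes that any non-empty value other than `off` means thinking is enabled, which is how the bash `else` branch behaves.

```python
def label_encodes_mode(label: str, thinking: str) -> bool:
    """Mirror the bash guard's substring checks.

    thinking == "off": label must contain "nothink".
    anything else:     label must contain "think" but not "nothink".
    """
    if thinking == "off":
        return "nothink" in label
    return "think" in label and "nothink" not in label


# Cases a reviewer might want to confirm.
cases = [
    ("397b-nothink", "off", True),
    ("397b-think", "off", False),
    ("397b-think", "on", True),
    ("397b-nothink", "on", False),
    ("397b", "on", False),
]
for label, thinking, expected in cases:
    assert label_encodes_mode(label, thinking) is expected
```

Because the check is a plain substring match, any label that happens to contain `think` (for example a hypothetical `rethink-v2`) would be accepted for the thinking-on arm. That is probably acceptable for a collision guard, but worth knowing.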