Light-Heart-Labs · Lightheartdevs · May 29, 2026
diff --git a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/QUALITATIVE.md b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/QUALITATIVE.md
@@ -1,6 +1,8 @@
 # Qwen3.5-397B-A17B vs Step-3.7-Flash — qualitative differences
 
-**Status: N=1 / provisional. Both 397B arms complete (no-think 8/12, think 7/12); Flash low/med/high complete.**
+**Status: N=3. Both 397B arms complete (no-think 23/36, think 22/36); Flash low/med/high complete.**
+Per-cell citations below may reference N=1 (`_v1`) artifacts where the behavior is identical across reps;
+N=3 pass counts are in findings.md.
 This doc is deliberately *not* a pass/fail scorecard. Pass/fail ties (397B no-think 8/12
 vs Flash 7–8/12) hide the differences that matter — those live here. Every claim cites the
 cell/file it came from so it's reproducible. See SCORECARD/findings for the quantitative table.
@@ -56,32 +58,35 @@ synergies, opaque valuation) — quality parity on the judgment. The form differ
   metric doesn't just mis-score — it **invents the wrong story about why**. The pass/fail bit (FAIL) was
   right by accident; everything it implied about the model was wrong. (`logs/p1_testwrite_397b-think_v1/grade.json`)
 
-## 4. Does thinking help 397B? No — net −1, and the loss is revealing
-**397B no-think 8/12 vs 397B think 7/12.** Eleven of twelve cells are identical between modes; reasoning
-changed exactly one outcome — and made it *worse*:
+## 4. Does thinking help 397B? Net −1 — and N=3 shows it *redistributes*, not "does nothing"
+**No-think 23/36 vs think 22/36** (N=1 was 8/12 vs 7/12; net −1 at both sample sizes). At **N=1** the loss looked
+like a single verbosity flip (`p3_doc`), and it was tempting to call thinking "inert everywhere else."
+**N=3 refutes that** — thinking moves three cells, in *both* directions:
 
-| flip | no-think | think | cause |
-|---|---|---|---|
-| **p3_doc** | PASS (692w) | **FAIL (721w)** | identical content, verbosity blew the limit |
+| cell | no-think (N=3) | think (N=3) | Δ | what's happening |
+|---|:--:|:--:|:--:|---|
+| **p3_market** | 1/3 | **3/3** | +2 | thinking *stabilizes* the wobbliest cell — coin-flip → lock, zero runaways |
+| **p3_pm** | 2/3 | **0/3** | −2 | thinking *hurts* project-mgmt synthesis (over-deliberation → worse) |
+| **p3_doc** | 2/3 | 1/3 | −1 | the verbosity story: thinking inflates length, trips the 700-word limit |
 
-`p3_doc` think captured **all 8/8 facts** (`fact_coverage 1.0`), same as no-think — but wrote 721 words
-against a 700-word limit (`within_word_limit: False`) where no-think landed at 692. Thinking did not make
-it less accurate; it made it **less disciplined about the length constraint**, amplifying 397B's existing
-over-documentation tendency (§2). It also spent more turns getting there (35 vs 20).
-(`logs/p3_doc_397b-{nothink,think}_v1/grade.json`) — a clean case of why pass/fail alone misleads: the
-think output is arguably equal in substance and failed on form.
-
-Everywhere else thinking was **inert**: same PASS/FAIL, just more tokens and turns. On this suite,
-reasoning bought 397B nothing.
+The N=1 `p3_doc` flip was real but *not the whole story* — one draw of a three-way swing. The verbosity
+mechanism is well-captured: at N=1, think `p3_doc` hit **all 8/8 facts** (`fact_coverage 1.0`, same as
+no-think) but wrote 721 words vs the 700 limit (no-think: 692) — equal substance, failed on form
+(`logs/p3_doc_397b-{nothink,think}_v1/grade.json`). The new lesson from N=3: **reasoning changes *where*
+397B succeeds, not *how often*** — it buys market-research reliability at the cost of pm/doc, netting to
+−1. An N=1 read would have missed both the gain and the symmetry.
 
 **Reasoning shape:** 397B thinks in short targeted bursts (`p1_bugfix` think: 16 of 126 turns carry a
 substantial think block, median ~73 reasoning tokens/turn), not long monologues — but uses more turns
 than no-think on the same task (126 vs 110).
 
-**No runaways, either mode.** All 12 think cells finished `done_signal` (no max_tokens/length failures) —
-including `p3_market` (STRUCTURAL_PASS, 56 iters). Contrast Flash, which **ran away on `p3_market` at low
-effort** (hit max_tokens). 397B is runaway-resistant in both modes; Flash's runaway risk is concentrated
-at low reasoning effort. This is a real reliability edge for 397B.
+**No *runaways*, but no-think *stalls* on market research.** Zero max_tokens/length failures across all
+72 cells — contrast Flash, which **ran away on `p3_market` at low effort** (hit max_tokens). 397B's
+failure mode is the opposite of runaway: 69/72 `done_signal`, and the 3 non-clean exits were all no-think
+— `p3_market` v2/v3 hit the 500-iter *stuck* threshold (spinning without progress) and `p3_pm` v1
+`model_stopped`. The think arm is clean (36/36 `done_signal`), and notably `p3_market` think is 3/3 vs
+no-think's 2-stuck/1-pass: **thinking eliminates the stall.** So the reliability edge over Flash is real
+(no runaways), but it means "never runs away," not "always finishes": no-think can still stall quietly.
 
 ## 5. Integration cost (a "messy model" finding in itself)
 - Flash (vLLM) ran the harness out of the box once launched.
@@ -92,10 +97,12 @@ at low reasoning effort. This is a real reliability edge for 397B.
   re-run name collisions). Both are harness/engine-integration bugs, not model quality — but they're
   exactly the "messy" friction MMBT exists to document. PR should fix both in the harness.
 
-## Net take (provisional, no-think only)
+## Net take (N=3, both arms)
 397B no-think is the steady, over-documenting, high-prose-quality one whose misses are omissions; Flash
 is the fast, terse, reasoning-driven one — brilliant when bounded, flaily when not. They agree on
 substance more than the scorecard's "tie" suggests. Flash is the cheaper/faster way to the same band
-(~99 vs ~71 tok/s, one engine vs both GPUs); 397B's case is narrow (long-form synthesis reliability,
-one stable setting, no effort-tuning). The 397B-think arm is the apples-to-apples test against Flash's
-reasoning modes — pending.
+(~99 vs ~71 tok/s, one engine vs both GPUs). 397B's edge is reliability — no max_tokens runaways (its
+worst case is a quiet *stall* on no-think market research, which thinking clears to 3/3) — not raw
+accuracy. Thinking doesn't raise the aggregate (net −1); it *redistributes* (market up, pm/doc down), so
+"turn thinking on" is a per-task call, not a default. Reach for 397B when runaway resistance and
+market-research reliability matter; otherwise Flash wins on speed and cost.
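Reviewer-side sanity check (not part of the patch): the "net −1, but redistributed" claim can be recomputed from the N=3 pass counts in the findings.md scorecard. This is a minimal sketch; the dict layout and function names are hypothetical, and only the counts themselves come from the scorecard.

```python
# Pass counts (out of 3) copied from the N=3 scorecard in findings.md.
# Tuple order: (no-think, think). The data layout is illustrative only.
PASSES = {
    "p1_bugfix": (3, 3),
    "p1_testwrite": (0, 0),
    "p1_refactor": (0, 0),
    "p2_extract": (3, 3),
    "p2_ci": (3, 3),
    "p2_hallucination": (3, 3),
    "p2_triage": (3, 3),
    "p3_doc": (2, 1),
    "p3_business": (3, 3),
    "p3_market": (1, 3),
    "p3_writing": (0, 0),
    "p3_pm": (2, 0),
}
REPS = 3


def totals(passes):
    """Sum pass counts per arm: returns (no_think_total, think_total)."""
    no_think = sum(n for n, _ in passes.values())
    think = sum(t for _, t in passes.values())
    return no_think, think


def majority_score(passes, arm):
    """Count tasks where the given arm (0 or 1) passes in most replicates."""
    return sum(1 for counts in passes.values() if counts[arm] * 2 > REPS)


def deltas(passes):
    """Per-task think minus no-think, keeping only the cells that moved."""
    return {task: t - n for task, (n, t) in passes.items() if t != n}


if __name__ == "__main__":
    print("totals (no-think, think):", totals(PASSES))
    print("majority /12:", majority_score(PASSES, 0), majority_score(PASSES, 1))
    print("moved cells:", deltas(PASSES))
```

Under these counts, three cells move and their deltas sum to −1, which matches the aggregate gap between the two arms.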
diff --git a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/README.md b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/README.md
@@ -1,18 +1,22 @@
-# Qwen3.5-397B-A17B on 2× RTX PRO 6000 Blackwell — microbench N=1 (+ Step-3.7-Flash comparison)
+# Qwen3.5-397B-A17B on 2× RTX PRO 6000 Blackwell — microbench N=3 (+ Step-3.7-Flash comparison)
 
-A large dense-MoE (397B total / ~17B active) run as a GGUF on llama.cpp, benched through the MMBT
+A large MoE (397B total / ~17B active) run as a GGUF on llama.cpp, benched through the MMBT
 12-family agentic microbench in two reasoning modes (no-think / think) and compared against the
 Step-3.7-Flash-NVFP4 entry on the same box.
 
-**Provisional, N=1** — one replicate per cell. An N=3 re-run is queued; treat numbers as directional.
+**N=3** — three replicates per cell (phase-1 graded with the fixed `phase1_grade.py`).
 
 ## TL;DR
-- **397B no-think 8/12, think 7/12; Step-3.7-Flash 7–8/12.** Aggregate ties across a ~15× param range.
-- **Thinking didn't help 397B** (net −1) — the one regression is a verbosity-driven word-limit overrun at
-  identical fact coverage, not a reasoning failure.
-- **397B never ran away** (both arms) where Flash did at low effort — a real reliability edge.
+- **397B no-think 23/36, think 22/36; Step-3.7-Flash 7–8/12.** Aggregate ties across a ~15× param range.
+- **Thinking is net −1, but not inert** (N=3 correction): it *redistributes* — stabilizes `p3_market`
+  (1/3→3/3) while hurting `p3_pm` (2/3→0/3) and `p3_doc` (2/3→1/3). It changes *where* 397B succeeds,
+  not how often.
+- **397B never ran away** (zero max_tokens/length failures across 72 cells) where Flash did at low effort.
+  Its only non-clean exits were no-think: 2 `p3_market` reps hit the 500-iter *stuck* threshold and 1 `p3_pm`
+  rep ended `model_stopped`. That is a stalling pathology, not runaway; thinking clears the market stall (think market 3/3).
+- **N=3 matters:** `p3_market`/`p3_pm`/`p3_doc` are high-variance; their N=1 verdicts were single-draw luck.
 - **Flash is the cheaper/faster default** (~99 vs ~71 tok/s, one engine vs both GPUs); 397B's case is
-  narrow (long-form synthesis reliability, one stable setting).
+  reliability (runaway resistance, thinking-on market research).
 - The substance is qualitative — **read [QUALITATIVE.md](QUALITATIVE.md).**
 
 ## Files

diff --git a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/findings.md b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/findings.md
@@ -1,8 +1,8 @@
-# Qwen3.5-397B-A17B (GGUF, llama.cpp) — microbench N=1, vs Step-3.7-Flash
+# Qwen3.5-397B-A17B (GGUF, llama.cpp) — microbench N=3, vs Step-3.7-Flash
 
-**Provisional / N=1.** One replicate per cell — directional, not statistically settled. An N=3
-re-run is queued. Pair this with [QUALITATIVE.md](QUALITATIVE.md): the pass/fail table below ties
-across models, and the differences that matter are qualitative.
+**N=3** (three replicates per cell). Pair this with [QUALITATIVE.md](QUALITATIVE.md): the pass/fail
+table below ties across models, and the differences that matter are qualitative. Phase-1 cells are
+graded with the **fixed** `phase1_grade.py` (see grading-correctness note below).
 
 ## Setup
 - **Model:** Qwen3.5-397B-A17B, unsloth UD-Q3_K_XL GGUF (~167 GB on disk, 5 shards).
@@ -14,23 +14,23 @@ across models, and the differences that matter are qualitative.
   `../step3.7-flash-nvfp4-dual-blackwell-2026-05-28/`. Cross-engine + cross-quant: **"best-as-each-ships,"
   not a clean precision study.**
 
-## Scorecard (N=1)
+## Scorecard (N=3, pass count per cell)
 
-| task | 397B no-think | 397B think | Step low/med/high | 27B (ref N=3) | Coder (ref N=3) |
+| task | 397B no-think | 397B think | Step low/med/high | 27B (ref) | Coder (ref) |
 |---|:--:|:--:|:--:|:--:|:--:|
-| p1_bugfix | ✓ | ✓ | ✓/✓/✓ | 3/3 | 2/3 |
-| p1_testwrite † | ✗ | ✗ | ✗/✗/✗ | 0/3 † | 0/3 † |
-| p1_refactor † | ✗ | ✗ | ✓/✗/✗ | 0/3 † | 0/3 † |
-| p2_extract | ✓ | ✓ | ✓/✓/✓ | 3/3 | 3/3 |
-| p2_ci | ✓ | ✓ | ✓/✓/✓ | 3/3 | 3/3 |
-| p2_hallucination | ✓ | ✓ | ✓/✓/✓ | 3/3 | 1/3 |
-| p2_triage | ✓ | ✓ | ~/✓/✓ | 3/3 | 3/3 |
-| p3_doc | ✓ | **✗** | ~/✓/✓ | 0/3 | 2/3 |
-| p3_business | ✓ | ✓ | ✓/~/✗ | 2/3 | 3/3 |
-| p3_market * | ✓ | ✓ | **✗**/✓/✓ | 3/3 * | 0/3 |
-| p3_writing | ✗ | ✗ | ✗/~/✗ | 0/3 | 2/3 |
-| p3_pm | ✗ | ✗ | ✗/~/✓ | 0/3 | 1/3 |
-| **Total** | **8/12** | **7/12** | **7 / 8 / 8** | ~7/12 | ~7/12 |
+| p1_bugfix | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 2/3 |
+| p1_testwrite † | 0/3 | 0/3 | ✗/✗/✗ | 0/3 † | 0/3 † |
+| p1_refactor † | 0/3 | 0/3 | ✓/✗/✗ | 0/3 † | 0/3 † |
+| p2_extract | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 3/3 |
+| p2_ci | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 3/3 |
+| p2_hallucination | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 1/3 |
+| p2_triage | 3/3 | 3/3 | ~/✓/✓ | 3/3 | 3/3 |
+| p3_doc | 2/3 | **1/3** | ~/✓/✓ | 0/3 | 2/3 |
+| p3_business | 3/3 | 3/3 | ✓/~/✗ | 2/3 | 3/3 |
+| p3_market * | **1/3** | **3/3** | ✗/✓/✓ | 3/3 * | 0/3 |
+| p3_writing | 0/3 | 0/3 | ✗/~/✗ | 0/3 | 2/3 |
+| p3_pm | **2/3** | **0/3** | ✗/~/✓ | 0/3 | 1/3 |
+| **Total** | **23/36** | **22/36** | **7 / 8 / 8** | ~7/12 | ~7/12 |
 
 † `p1_refactor` fails on structure (no `output/` subpackage created), not the model's competence at the
 core edit. `p1_testwrite` — see the grading-correctness note below; the earlier "task-design" framing was
@@ -40,41 +40,51 @@ partly a grader artifact. \* `p3_market` is graded STRUCTURAL_PASS (citation val
 A review caught that `phase1_grade.py` read flat keys (`coverage_pct`, `ruff_issues`, `benchmark_s`) while
 `code_task_grader.py` writes nested ones (`coverage.line_coverage_pct`, `ruff.issue_count`,
 `benchmark.elapsed_s`). Effect: `p1_bugfix`'s ruff/benchmark gates were silently always-true, and
-`p1_testwrite`'s coverage gate was always-false. Fixed and **all phase-1 cells regraded**. Outcome:
-- **Totals unchanged (8/12 / 7/12)** — but now *trustworthy*, not coincidental.
-- `p1_bugfix` PASS is now genuinely validated: ruff 2→0 and benchmark **11.2s→0.537s** (the planted O(n²)
-  fix) are real and pass — they were previously ignored.
+`p1_testwrite`'s coverage gate was always-false. Fixed and **all phase-1 cells regraded** (N=1 and N=3):
+- `p1_bugfix` PASS is genuinely validated: ruff 2→0 and benchmark **11.2s→0.537s** (the planted O(n²)
+  fix) are real and pass — they were previously ignored. Consistent 3/3 both arms.
 - `p1_testwrite` still FAILs, but the **reason flips**: think-mode actually achieved **99% coverage / 153
   passing tests** (the broken grader reported `cov=0` and hid it); it fails only on `logalyzer_unchanged`
   (it edited production code, violating the "only /tests/ may differ" rule). The model is *capable* here —
   the task constraint, not incapacity, is what fails it. The inherited † "task-design" footnote on testwrite
   is misleading and should be re-examined for the published 27B/Coder cells too.
-- ⚠️ **The 27B / Coder reference columns in the scorecard above predate this fix.** Their phase-1
-  (bug-fixing / test-writing) numbers came from the same buggy grader, so they may be wrong — testwrite
-  especially is likely a guaranteed-FAIL artifact. **Historical phase-1 scores may need regrading; see
-  tracking issue #29.** Treat the reference columns' p1_* cells as provisional until that lands.
+- ⚠️ **The 27B / Coder reference columns predate this fix.** Their phase-1 (bug-fixing / test-writing)
+  numbers came from the same buggy grader, so they may be wrong — testwrite especially is likely a
+  guaranteed-FAIL artifact. **Historical phase-1 scores may need regrading; see tracking issue #29.**
+  Treat the reference columns' p1_* cells as provisional until that lands.
 
 ## Headline findings
 
-1. **Scale doesn't move the aggregate.** A 397B-param model lands in the *same 7–8/12 band* as a 27B,
-   a ~30B coder, and an ~11B-active Flash. The interesting signal is per-task and qualitative, not the total.
+1. **Scale doesn't move the aggregate.** 397B lands at 23/36 (no-think) and 22/36 (think), i.e. 8/12 and
+   7/12 when each task is scored by majority (≥2/3 passes): the *same 7–8/12 band* as a 27B, a ~30B coder,
+   and an ~11B-active Flash. The interesting signal is per-task and qualitative, not the total.
 
-2. **Thinking is net −1 for 397B on this suite** (8/12 → 7/12). 11 of 12 cells are identical between modes;
-   the lone flip is `p3_doc` PASS→FAIL — and it's instructive: *both* modes captured all 8/8 facts
-   (`fact_coverage 1.0`); think just wrote 721 words against a 700-word limit (no-think: 692). Reasoning
-   amplified 397B's verbosity and blew a hard constraint with identical content. Thinking was inert
-   everywhere else — more tokens and turns, same outcomes. **Reasoning bought 397B nothing here.**
+2. **Thinking is net −1 for 397B (8/12→7/12 at N=1, 23/36→22/36 at N=3) — but N=3 overturns the N=1
+   "inert" reading.** At N=1 the loss looked like a single verbosity flip (`p3_doc`). With three
+   replicates, thinking is not inert — it **redistributes**: it *helps* `p3_market` (no-think 1/3 → think
+   **3/3**, stabilizing the wobbliest cell, zero runaways) but *hurts* `p3_pm` (2/3 → **0/3**) and `p3_doc`
+   (2/3 → 1/3, the verbosity-vs-word-limit story). Net −1, but as a wash of real per-task swings, not a
+   no-op. Reasoning changes *where* 397B succeeds without changing *how often*.
 
-3. **397B is runaway-resistant; Flash is not (at low effort).** All 24 397B cells (both arms) finished
-   `done_signal` — zero max_tokens/length failures. Step-3.7-Flash **ran away on `p3_market` at low effort**
-   (hit max_tokens). 397B's reliability edge is real and mode-independent.
+3. **N=3 exposes which N=1 verdicts were luck.** Single replicates are noisy on the open-ended cells: in
+   the no-think arm, `p3_market` (N=1 ✓ → N=3 1/3) and `p3_pm` (N=1 ✗ → N=3 2/3) were single-draw
+   artifacts; `p3_doc` (✓ → 2/3) wobbles on the word limit. Zero-variance cells (3/3 or 0/3 across all
+   reps): p1_bugfix, the grounded mid-tier (p2_extract/ci/hallucination/triage), p3_business, and the
+   consistent fails. **Trust the mid-tier and bugfix; treat market/pm/doc as high-variance.**
 
-4. **397B's distinctive lane is long-form synthesis** (`p3_doc`/`p3_business`/`p3_market` all pass in
-   no-think) where the 27B was weak — but it's the **slowest and most expensive** way to reach the shared
-   band (~71 tok/s spanning both GPUs at Q3, vs Flash ~99 tok/s on one engine). Flash is the better default;
-   397B earns its keep only where synthesis reliability and a single stable setting matter.
+4. **397B is runaway-resistant — but no-think market research gets *stuck*.** Zero max_tokens/length
+   runaways across all 72 cells (the failure mode Step-3.7-Flash showed on `p3_market` at low effort).
+   69/72 finished `done_signal`; the 3 non-clean exits were **all no-think**: `p3_market` v2 & v3 hit the
+   500-iter **stuck threshold** (`stuck_no_workspace_change_for_500_iters` — spinning without progress, not
+   over-generating) and `p3_pm` v1 `model_stopped`. So 397B's pathology is *stalling*, not runaway — and
+   **thinking eliminates the market stall**: think `p3_market` is a clean 3/3 `done_signal` vs no-think's
+   1/3 (2 stuck). For market research, reasoning turns a stuck coin-flip into a lock.
 
-5. **Integration tax (a "messy model" finding).** Flash (vLLM) ran the harness out of the box. 397B
+5. **Cost still favors Flash.** 397B reaches the shared band at ~71 tok/s spanning both GPUs at Q3, vs
+   Flash ~99 tok/s on one engine. Flash is the better default; 397B earns its keep where its runaway
+   resistance and (thinking-on) market-research reliability matter.
+
+6. **Integration tax (a "messy model" finding).** Flash (vLLM) ran the harness out of the box. 397B
    (llama.cpp) needed `--reasoning-format none` (default extracts CoT into `reasoning_content`, leaving
    `content` empty → agent loop reads a thinking turn as "done" and dies at iter ~3 — invisible to a
    thinking-off smoke) plus harness cleanup fixes (non-sudo `rm` on root-owned sandbox/grader leftovers).
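
Reviewer note on the grading-correctness fix: the flat-vs-nested key mismatch described in that note can be sketched as below. This is not the actual `phase1_grade.py` or `code_task_grader.py` code, which the diff does not show. The helper `get_nested` and the sample `grade` dict are hypothetical, and reading flat keys with a default is only one plausible way such a bug stays silent. Only the key names are taken from the note.

```python
from collections.abc import Mapping
from typing import Any

# Hypothetical sample shaped like the nested keys named in the note
# (coverage.line_coverage_pct, ruff.issue_count, benchmark.elapsed_s).
grade = {
    "coverage": {"line_coverage_pct": 99.0},
    "ruff": {"issue_count": 0},
    "benchmark": {"elapsed_s": 0.537},
}


def get_nested(data: Mapping, dotted_key: str) -> Any:
    """Walk a dotted path and raise KeyError if any segment is missing."""
    current: Any = data
    for part in dotted_key.split("."):
        if not isinstance(current, Mapping) or part not in current:
            raise KeyError(f"missing grade key: {dotted_key!r}")
        current = current[part]
    return current


# Assumed buggy pattern: a flat key that is never written, read with a
# default. The lookup silently yields the default, so the gate never sees
# the real measurement.
flat_ruff = grade.get("ruff_issues", 0)    # always 0, so "no issues" always holds
flat_cov = grade.get("coverage_pct", 0.0)  # always 0.0, so the coverage gate always fails

# Strict pattern: read the nested key and fail loudly on a schema mismatch.
ruff_issues = get_nested(grade, "ruff.issue_count")
coverage = get_nested(grade, "coverage.line_coverage_pct")
elapsed_s = get_nested(grade, "benchmark.elapsed_s")
```

A strict lookup like this would turn a producer/consumer schema drift into an immediate `KeyError` at grading time instead of a gate that is quietly always-true or always-false.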