Commit 4b628a4

Merge pull request #31 from Light-Heart-Labs/mmbt-bench-autopilot-2026-05-30
Bench autopilot tooling + 397B N=10 + GPU power analysis + unified cross-model
2 parents aea2585 + e7de54b commit 4b628a4

14 files changed

Lines changed: 3078 additions & 97 deletions

hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/QUALITATIVE.md

Lines changed: 22 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,8 @@
11
# Qwen3.5-397B-A17B vs Step-3.7-Flash — qualitative differences
22

3-
**Status: N=3. Both 397B arms complete (no-think 23/36, think 22/36); Flash low/med/high complete.**
3+
**Status: N=10. Both 397B arms complete (no-think 82/120, think 72/120); Flash low/med/high complete; 27B/Coder via Q4/AWQ refs.**
44
Per-cell citations below may reference N=1 (`_v1`) artifacts where the behavior is identical across reps;
5-
N=3 pass counts are in findings.md.
5+
N=10 pass counts are in findings.md.
66
This doc is deliberately *not* a pass/fail scorecard. Pass/fail ties (397B no-think 8/12
77
vs Flash 7–8/12) hide the differences that matter — those live here. Every claim cites the
88
cell/file it came from so it's reproducible. See SCORECARD/findings for the quantitative table.
@@ -58,34 +58,32 @@ synergies, opaque valuation) — quality parity on the judgment. The form differ
5858
metric doesn't just mis-score — it **invents the wrong story about why**. The pass/fail bit (FAIL) was
5959
right by accident; everything it implied about the model was wrong. (`logs/p1_testwrite_397b-think_v1/grade.json`)
6060

61-
## 4. Does thinking help 397B? Net −1 — and N=3 shows it *redistributes*, not "does nothing"
62-
**No-think 23/36 vs think 22/36** (N=1 was 8/12 vs 7/12 — net −1 both ways). At **N=1** the loss looked
63-
like a single verbosity flip (`p3_doc`), and it was tempting to call thinking "inert everywhere else."
64-
**N=3 refutes that** — thinking moves three cells, in *both* directions:
61+
## 4. Does thinking help 397B? Net −10 at N=10 — it *redistributes*, and on net hurts
62+
**No-think 82/120 vs think 72/120.** The signal sharpened with N: at N=1 the loss looked like a single
63+
verbosity flip (`p3_doc`); N=3 hinted at redistribution (−1); **N=10 makes it decisive** — thinking moves
64+
three cells, hard, in both directions while leaving the other nine identical:
6565

66-
| cell | no-think (N=3) | think (N=3) | Δ | what's happening |
66+
| cell | no-think (N=10) | think (N=10) | Δ | what's happening |
6767
|---|:--:|:--:|:--:|---|
68-
| **p3_market** | 1/3 | **3/3** | +2 | thinking *stabilizes* the wobbliest cell — coin-flip → lock, zero runaways |
69-
| **p3_pm** | 2/3 | **0/3** | −2 | thinking *hurts* project-mgmt synthesis (over-deliberation → worse) |
70-
| **p3_doc** | 2/3 | 1/3 | −1 | the verbosity story: thinking inflates length, trips the 700-word limit |
68+
| **p3_market** | 8/10 | **10/10** | +2 | thinking *stabilizes* the wobbliest cell — clears the no-think stall, zero runaways |
69+
| **p3_pm** | 5/10 | **0/10** | −5 | thinking *destroys* project-mgmt synthesis (over-deliberation) |
70+
| **p3_doc** | 9/10 | **2/10** | −7 | the verbosity story: thinking inflates length, trips the 700-word limit |
7171

72-
The N=1 `p3_doc` flip was real but *not the whole story* — one draw of a three-way swing. The verbosity
73-
mechanism is well-captured: at N=1, think `p3_doc` hit **all 8/8 facts** (`fact_coverage 1.0`, same as
74-
no-think) but wrote 721 words vs the 700 limit (no-think: 692) — equal substance, failed on form
75-
(`logs/p3_doc_397b-{nothink,think}_v1/grade.json`). The new lesson from N=3: **reasoning changes *where*
76-
397B succeeds, not *how often*** — it buys market-research reliability at the cost of pm/doc, netting to
77-
−1. A single-N read would have missed both the gain and the symmetry.
72+
The verbosity mechanism is well-captured: think `p3_doc` hits the same fact coverage as no-think but
73+
overruns the 700-word limit (e.g. 721 vs 692 words at N=1) — equal substance, failed on form. **Lesson:
74+
reasoning changes *where* 397B succeeds, not *how often* — and here the trade is net-negative.** "Turn
75+
thinking on" is a per-task decision; for doc-synthesis and project-mgmt it's actively harmful, for market
76+
research it's a clear win.
7877

7978
**Reasoning shape:** 397B thinks in short targeted bursts (`p1_bugfix` think: 16 of 126 turns carry a
8079
substantial think block, median ~73 reasoning tokens/turn), not long monologues — but uses more turns
8180
than no-think on the same task (126 vs 110).
8281

8382
**No *runaways*, but no-think *stalls* on market research.** Zero max_tokens/length failures across all
84-
72 cells — contrast Flash, which **ran away on `p3_market` at low effort** (hit max_tokens). 397B's
85-
failure mode is the opposite of runaway: 69/72 `done_signal`, and the 3 non-clean exits were all no-think
86-
`p3_market` v2/v3 hit the 500-iter *stuck* threshold (spinning without progress) and `p3_pm` v1
87-
`model_stopped`. The think arm is clean (36/36 `done_signal`), and notably `p3_market` think is 3/3 vs
88-
no-think's 2-stuck/1-pass — **thinking eliminates the stall.** So the reliability edge over Flash is real
83+
240 N=10 cells — contrast Flash, which **ran away on `p3_market` at low effort**, and 27B-Q8 which ran
84+
away 23/36 times. 397B's failure mode is the opposite of runaway: its only non-clean exits are *stalls*
85+
(stuck-loop / `model_stopped`), concentrated in no-think `p3_market`/`p3_pm`. Thinking specifically clears
86+
the no-think market stall (think `p3_market` 10/10). So the reliability edge over Flash is real
8987
(no runaways), but it's "stalls quietly," not "always finishes."
9088

9189
## 5. Integration cost (a "messy model" finding in itself)
@@ -97,12 +95,12 @@ no-think's 2-stuck/1-pass — **thinking eliminates the stall.** So the reliabil
9795
re-run name collisions). Both are harness/engine-integration bugs, not model quality — but they're
9896
exactly the "messy" friction MMBT exists to document. PR should fix both in the harness.
9997

100-
## Net take (N=3, both arms)
98+
## Net take (N=10, both arms)
10199
397B no-think is the steady, over-documenting, high-prose-quality one whose misses are omissions; Flash
102100
is the fast, terse, reasoning-driven one — brilliant when bounded, flaily when not. They agree on
103101
substance more than the scorecard's "tie" suggests. Flash is the cheaper/faster way to the same band
104102
(~99 vs ~71 tok/s, one engine vs both GPUs). 397B's edge is reliability — no max_tokens runaways (its
105-
worst case is a quiet *stall* on no-think market research, which thinking clears to 3/3) — not raw
106-
accuracy. Thinking doesn't raise the aggregate (net −1); it *redistributes* (market up, pm/doc down), so
103+
worst case is a quiet *stall* on no-think market research, which thinking clears to 10/10) — not raw
104+
accuracy. Thinking lowers the aggregate (net −10 at N=10); it *redistributes* (market up, pm/doc down), so
107105
"turn thinking on" is a per-task call, not a default. Reach for 397B when runaway resistance and
108106
market-research reliability matter; otherwise Flash wins on speed and cost.

hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/README.md

Lines changed: 11 additions & 13 deletions
@@ -1,26 +1,24 @@
1-
# Qwen3.5-397B-A17B on 2× RTX PRO 6000 Blackwell — microbench N=3 (+ Step-3.7-Flash comparison)
1+
# Qwen3.5-397B-A17B on 2× RTX PRO 6000 Blackwell — microbench N=10 (+ Step-3.7-Flash + 27B/Coder-Q4 + GPU power)
22

33
A large MoE (397B total / ~17B active) run as a GGUF on llama.cpp, benched through the MMBT
44
12-family agentic microbench in two reasoning modes (no-think / think) and compared against the
55
Step-3.7-Flash-NVFP4 entry on the same box.
66

7-
**N=3**: three replicates per cell (phase-1 graded with the fixed `phase1_grade.py`).
7+
**N=10**: ten replicates per cell, both arms (240 cells; 235 `done_signal`, 5 non-clean exits, all in the no-think arm; phase-1 graded with the fixed `phase1_grade.py`).
88

99
## TL;DR
10-
- **397B no-think 23/36, think 22/36; Step-3.7-Flash 7–8/12.** Aggregate ties across a ~15× param range.
11-
- **Thinking is net −1, but not inert** (N=3 correction): it *redistributes* — stabilizes `p3_market`
12-
(1/3→3/3) while hurting `p3_pm` (2/3→0/3) and `p3_doc` (2/3→1/3). It changes *where* 397B succeeds,
13-
not how often.
14-
- **397B never ran away** (zero max_tokens/length failures across 72 cells) where Flash did at low effort.
15-
Its only non-clean exits: 2 no-think `p3_market` reps hit the 500-iter *stuck* threshold + 1 `p3_pm`
16-
`model_stopped` — a stalling pathology, not runaway; thinking clears the market stall (think market 3/3).
17-
- **N=3 matters:** `p3_market`/`p3_pm`/`p3_doc` are high-variance; their N=1 verdicts were single-draw luck.
18-
- **Flash is the cheaper/faster default** (~99 vs ~71 tok/s, one engine vs both GPUs); 397B's case is
19-
reliability (runaway resistance, thinking-on market research).
10+
- **397B no-think 82/120, think 72/120; Step-3.7-Flash 7–8/12; 27B-Q4 & Coder-Next-Q4 ~7/12.** Aggregate ties across a ~15× param range — scale doesn't move the total (confirmed at N=10).
11+
- **Thinking is net-negative across a ~15× param range — same mechanism.** 397B think 72/120 < no-think 82/120 (−10), and Qwen3.6-27B-Q4 (N=10) ships **86.8% no-think vs 75% thinking** — both worse with thinking, both via the **`p3_doc` word-limit loop** (397B 9/10→2/10; 27B-thinking `wall_killed` ~40%). Reasoning isn't a free upgrade; on constraint-bound synthesis it backfires regardless of size. (Full cross-model think/no-think table in findings.md.)
12+
- **N=10 overturns small-N luck:** `p3_market` no-think flips 1/3 (N=3, looked like a fail) → 8/10 (clear pass) — auto-flagged in the stability table. The headline methodological result.
13+
- **Failure temperament tracks lineage, not size:** 397B + 27B *stall* (never over-generate); Coder-Next + Flash *run away*. Zero max_tokens runaways across all 240 397B cells.
14+
- **Cross-model uses clean Q4/AWQ refs** for 27B/Coder; fresh Q8/FP8 runs excluded as serving failures (documented, not faked).
15+
- **GPU power:** combined both-GPU draw never within 5% of the 1200W cap (median 670W, max 985W=82%); GPU0 leads GPU1 — pipeline alternation. The pair never hits full power together.
2016
- The substance is qualitative — **read [QUALITATIVE.md](QUALITATIVE.md).**
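
The GPU power bullet above reduces to simple arithmetic against the combined cap. A minimal sketch, assuming the 1200W figure is the combined two-GPU cap and reusing the 985W maximum quoted above; `CAP_W`, `pct_of_cap`, and `within_margin_of_cap` are hypothetical names for illustration, not functions from the repo:

```python
# Hypothetical helper names; only the 1200 W cap and the 985 W maximum
# come from the README bullet above.
CAP_W = 1200.0  # combined both-GPU power cap, as stated in the README


def pct_of_cap(watts, cap=CAP_W):
    """Return the draw as a percentage of the combined cap."""
    return 100.0 * watts / cap


def within_margin_of_cap(watts, cap=CAP_W, margin_pct=5.0):
    """True if the draw is within `margin_pct` percent of the cap."""
    return watts >= cap * (1.0 - margin_pct / 100.0)


max_draw_w = 985.0
print(round(pct_of_cap(max_draw_w)))     # percent of cap at the observed maximum
print(within_margin_of_cap(max_draw_w))  # whether that maximum is within 5% of cap
```

With the quoted maximum this gives about 82% of cap, well short of the 95% line (1140W) that "within 5% of the cap" implies.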
2117

2218
## Files
23-
- [findings.md](findings.md) — scorecard + headline findings.
19+
- [findings.md](findings.md) — N=10 scorecard + headline findings + power + cross-model qualitative.
20+
- [findings-n10.md](findings-n10.md) — auto-generated replicate-stability table (flags small-N flips) + finish-reason audit.
21+
- [power-analysis.md](power-analysis.md) — dual-GPU power percentiles, pipeline asymmetry, %-of-cap.
2422
- [QUALITATIVE.md](QUALITATIVE.md) — behavioral analysis beyond pass/fail (token economy, packaging,
2523
failure-mode texture, reasoning shape), every claim cited to a cell/file.
2624
- [manifest.json](manifest.json) — models, quant, engine, launch flags, run inventory, dates.
Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
1+
# Qwen3.5-397B-A17B (GGUF, llama.cpp) — microbench N=10, two arms (think / no-think)
2+
3+
**N=10** scorecard, replicate-stability analysis, think-vs-no-think redistribution, and finish-reason audit — auto-generated by `tooling/bench_report.py` from the per-cell `grade.json` / `summary.json` logs. Partial data is rendered as-is (denominators reflect graded replicates, not the target N).
4+
5+
## Setup
6+
- **Model:** Qwen3.5-397B-A17B, unsloth UD-Q3_K_XL GGUF (5 shards), 2× RTX PRO 6000 Blackwell, llama.cpp pipeline-parallel (`-sm layer -ngl 999 -fa on -c 131072 --jinja --reasoning-format none`).
7+
- **Two arms:** `enable_thinking` off (no-think) and on (think).
8+
- **Comparators:** Step-3.7-Flash-NVFP4 low/med/high, and the published 27B / Coder reference columns (carried from the N=3 entry).
9+
10+
## Scorecard (N=10, pass count per cell)
11+
12+
| task | 397B no-think | 397B think | Step low/med/high | 27B (ref) | Coder (ref) |
13+
|---|:--:|:--:|:--:|:--:|:--:|
14+
| p1_bugfix | 10/10 | 10/10 | ✓/✓/✓ | 3/3 | 2/3 |
15+
| p1_testwrite | 0/10 | 0/10 | ✗/✗/✗ | 0/3 | 0/3 |
16+
| p1_refactor | 0/10 | 0/10 | ✓/✗/✗ | 0/3 | 0/3 |
17+
| p2_extract | 10/10 | 10/10 | ✓/✓/✓ | 3/3 | 3/3 |
18+
| p2_ci | 10/10 | 10/10 | ✓/✓/✓ | 3/3 | 3/3 |
19+
| p2_hallucination | 10/10 | 10/10 | ✓/✓/✓ | 3/3 | 1/3 |
20+
| p2_triage | 10/10 | 10/10 | ~/✓/✓ | 3/3 | 3/3 |
21+
| p3_doc | 9/10 | 2/10 | ~/✓/✓ | 0/3 | 2/3 |
22+
| p3_business | 10/10 | 10/10 | ✓/~/✗ | 2/3 | 3/3 |
23+
| p3_market | 8/10 | 10/10 | ✗/✓/✓ | 3/3 | 0/3 |
24+
| p3_writing | 0/10 | 0/10 | ✗/~/✗ | 0/3 | 2/3 |
25+
| p3_pm | 5/10 | 0/10 | ✗/~/✓ | 0/3 | 1/3 |
26+
| **Total** | **82/120** | **72/120** | **7 / 8 / 8** | ~7/12 | ~7/12 |
27+
28+
Pass = grade.json `verdict` of PASS or STRUCTURAL_PASS. Cells with no grade.json yet are excluded from the denominator (shown as `x/done`, not `x/N`); an em-dash means no graded replicate exists for that cell. Step/27B/Coder columns are the published comparators carried from the N=3 entry (see tracking issue #29 — phase-1 reference cells may be provisional).
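
The pass rule and denominator handling described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `tooling/bench_report.py`: the `(task, verdict)` input shape and the `pass_counts` name are hypothetical, and `None` stands in for a replicate with no `grade.json` yet.

```python
from collections import defaultdict

# Verdicts that count as a pass, per the scorecard note above.
PASSING_VERDICTS = {"PASS", "STRUCTURAL_PASS"}


def pass_counts(grades):
    """Tally passes per task.

    grades: iterable of (task, verdict) pairs, one per replicate, where
    verdict is None when the replicate has not been graded yet.
    Returns {task: (passes, graded)}; ungraded replicates are left out of
    the denominator, so a task with no graded replicate does not appear.
    """
    counts = defaultdict(lambda: [0, 0])
    for task, verdict in grades:
        if verdict is None:
            continue  # not graded yet: excluded from the denominator
        counts[task][1] += 1
        if verdict in PASSING_VERDICTS:
            counts[task][0] += 1
    return {task: (c[0], c[1]) for task, c in counts.items()}


example = [
    ("p3_doc", "PASS"),
    ("p3_doc", "FAIL"),
    ("p3_doc", None),  # pending grade
    ("p1_bugfix", "STRUCTURAL_PASS"),
]
print(pass_counts(example))
```

Here `p3_doc` would render as `1/2` rather than `1/3`, which is the `x/done` behavior the note describes.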
29+
30+
## What N reveals — pass-rate stability across replicates
31+
32+
Per task, the pass count over the first v1–3 / v1–5 / v1–10 replicates of each arm. A **⚑ flip** marks a cell whose small-N *majority verdict* (>50% pass) disagrees with its full-N=10 majority verdict, i.e. a small-N read that would have been overturned. `·` = no graded replicate in that window; a tie (exactly 50%) is treated as no-call and never flagged as a flip.
33+
34+
| task | arm | v1–3 | v1–5 | v1–10 | flip? |
35+
|---|---|:--:|:--:|:--:|:--:|
36+
| p1_bugfix | 397B no-think | 3/3 | 5/5 | 10/10 | |
37+
| p1_bugfix | 397B think | 3/3 | 5/5 | 10/10 | |
38+
| p1_testwrite | 397B no-think | 0/3 | 0/5 | 0/10 | |
39+
| p1_testwrite | 397B think | 0/3 | 0/5 | 0/10 | |
40+
| p1_refactor | 397B no-think | 0/3 | 0/5 | 0/10 | |
41+
| p1_refactor | 397B think | 0/3 | 0/5 | 0/10 | |
42+
| p2_extract | 397B no-think | 3/3 | 5/5 | 10/10 | |
43+
| p2_extract | 397B think | 3/3 | 5/5 | 10/10 | |
44+
| p2_ci | 397B no-think | 3/3 | 5/5 | 10/10 | |
45+
| p2_ci | 397B think | 3/3 | 5/5 | 10/10 | |
46+
| p2_hallucination | 397B no-think | 3/3 | 5/5 | 10/10 | |
47+
| p2_hallucination | 397B think | 3/3 | 5/5 | 10/10 | |
48+
| p2_triage | 397B no-think | 3/3 | 5/5 | 10/10 | |
49+
| p2_triage | 397B think | 3/3 | 5/5 | 10/10 | |
50+
| p3_doc | 397B no-think | 2/3 | 4/5 | 9/10 | |
51+
| p3_doc | 397B think | 1/3 | 1/5 | 2/10 | |
52+
| p3_business | 397B no-think | 3/3 | 5/5 | 10/10 | |
53+
| p3_business | 397B think | 3/3 | 5/5 | 10/10 | |
54+
| p3_market | 397B no-think | 1/3 | 3/5 | 8/10 | ⚑ flip |
55+
| p3_market | 397B think | 3/3 | 5/5 | 10/10 | |
56+
| p3_writing | 397B no-think | 0/3 | 0/5 | 0/10 | |
57+
| p3_writing | 397B think | 0/3 | 0/5 | 0/10 | |
58+
| p3_pm | 397B no-think | 2/3 | 3/5 | 5/10 | |
59+
| p3_pm | 397B think | 0/3 | 0/5 | 0/10 | |
60+
61+
**Flipped cells** (1): p3_market (397B no-think). These are exactly the cells where a small-N verdict would have been luck — trust the high-N read and treat them as high-variance.
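
The flip rule above (small-window majority versus full-N majority, ties as no-call) can be sketched like this. The function names are hypothetical, and the per-replicate orderings are illustrative sequences chosen only to reproduce the window counts in the table; the real ordering lives in the per-cell logs.

```python
def majority(passes, graded):
    """True for a pass majority, False for a fail majority,
    None for no-call (no graded replicates, or an exact 50% tie)."""
    if graded == 0 or 2 * passes == graded:
        return None
    return 2 * passes > graded


def flipped(outcomes, windows=(3, 5), full_n=10):
    """outcomes: list of booleans in replicate order (v1, v2, ...).
    A cell is flagged when any small window's majority call disagrees
    with the full-N call; no-calls on either side never flag."""
    full = outcomes[:full_n]
    full_call = majority(sum(full), len(full))
    if full_call is None:
        return False
    for w in windows:
        small = outcomes[:w]
        small_call = majority(sum(small), len(small))
        if small_call is not None and small_call != full_call:
            return True
    return False


# Illustrative orderings matching 1/3, 3/5, 8/10 and 2/3, 3/5, 5/10.
p3_market_nothink = [True, False, False, True, True, True, True, True, True, True]
p3_pm_nothink = [False, True, True, True, False, True, False, True, False, False]
print(flipped(p3_market_nothink))  # small-N fail vs full-N pass
print(flipped(p3_pm_nothink))      # full-N is a 5/10 tie, so no-call
```

Under these assumptions `p3_market` no-think is flagged and `p3_pm` no-think is not, because its 5/10 full-N result is a tie, which matches the table.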
62+
63+
## Thinking vs no-think — per-task redistribution
64+
65+
Pass-rate delta (think − no-think) per task. To stay honest while the arms are at different depths, each task's delta is computed on its **common window** — the first `k = min(no-think graded, think graded)` replicates of *both* arms — and also reported as a pass-count delta scaled to that common k. A net-zero aggregate can still hide real per-task swings; this surfaces *where* reasoning moves success.
66+
67+
| task | k | no-think (≤k) | think (≤k) | Δ rate (pp) | Δ count |
68+
|---|:--:|:--:|:--:|:--:|:--:|
69+
| p1_bugfix | 10 | 10/10 | 10/10 | +0 | 0 |
70+
| p1_testwrite | 10 | 0/10 | 0/10 | +0 | 0 |
71+
| p1_refactor | 10 | 0/10 | 0/10 | +0 | 0 |
72+
| p2_extract | 10 | 10/10 | 10/10 | +0 | 0 |
73+
| p2_ci | 10 | 10/10 | 10/10 | +0 | 0 |
74+
| p2_hallucination | 10 | 10/10 | 10/10 | +0 | 0 |
75+
| p2_triage | 10 | 10/10 | 10/10 | +0 | 0 |
76+
| p3_doc | 10 | 9/10 | 2/10 | -70 | **-7** |
77+
| p3_business | 10 | 10/10 | 10/10 | +0 | 0 |
78+
| p3_market | 10 | 8/10 | 10/10 | +20 | **+2** |
79+
| p3_writing | 10 | 0/10 | 0/10 | +0 | 0 |
80+
| p3_pm | 10 | 5/10 | 0/10 | -50 | **-5** |
81+
82+
On matched common windows, thinking **helps 1** task(s) and **hurts 2** (net -10 passes over matched cells). Read the per-task Δ, not a single aggregate — the interesting signal is the redistribution. (Raw per-arm totals at the current — possibly unequal — depths are in the scorecard above.)
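
The common-window delta described in this section can be sketched as below. This is a hedged illustration with a hypothetical function name and input shape (one list of pass/fail booleans per arm, in replicate order), not the report generator's actual implementation.

```python
def common_window_delta(no_think, think):
    """Compare two arms on their common window.

    no_think, think: lists of booleans (True = pass) in replicate order.
    Only the first k = min(len(no_think), len(think)) replicates of each
    arm are compared, so unequal depths do not bias the delta.
    """
    k = min(len(no_think), len(think))
    if k == 0:
        return None  # nothing graded in at least one arm
    a = sum(no_think[:k])
    b = sum(think[:k])
    return {
        "k": k,
        "no_think": a,
        "think": b,
        "delta_pp": round(100 * (b - a) / k),  # percentage points
        "delta_count": b - a,
    }


# p3_doc at N=10: 9/10 no-think vs 2/10 think (ordering is illustrative).
doc_nothink = [True] * 9 + [False]
doc_think = [True] * 2 + [False] * 8
print(common_window_delta(doc_nothink, doc_think))
```

Summing `delta_count` over the three tasks that move (−7, +2, −5) gives the net −10 reported above.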
83+
84+
## Runaway / stall summary (finish_reason per arm)
85+
86+
Count of completed cells by `summary.json` `finish_reason`, per arm, over v1–10. `done_signal` = clean agent-declared completion. `model_stopped` = the model ended the turn without signalling done. `*_runaway` / `max_*` = over-generation. `stuck_*` = spinning without workspace progress until the stuck threshold. Anything that is not `done_signal` is a non-clean exit worth a look.
87+
88+
| arm | cells | done_signal | model_stopped | stuck_no_workspace_change_for_500_iters | non-clean |
89+
|---|:--:|:--:|:--:|:--:|:--:|
90+
| 397B no-think | 120 | 115 | 3 | 2 | 5 |
91+
| 397B think | 120 | 120 | 0 | 0 | 0 |
92+
93+
**Non-clean exits:** `p2_extract_397b-nothink_v8` (model_stopped); `p3_market_397b-nothink_v2` (stuck_no_workspace_change_for_500_iters); `p3_market_397b-nothink_v3` (stuck_no_workspace_change_for_500_iters); `p3_pm_397b-nothink_v1` (model_stopped); `p3_pm_397b-nothink_v9` (model_stopped).
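
The finish-reason buckets defined above can be sketched as a small classifier plus tally. The bucket labels and function names here are hypothetical; only the `finish_reason` strings and the no-think counts (115 / 3 / 2) come from the table.

```python
from collections import Counter


def classify(finish_reason):
    """Map a summary.json finish_reason string to a coarse bucket."""
    if finish_reason == "done_signal":
        return "clean"
    if finish_reason == "model_stopped":
        return "model_stopped"
    if finish_reason.startswith("stuck_"):
        return "stall"
    if finish_reason.startswith("max_") or finish_reason.endswith("_runaway"):
        return "runaway"
    return "other"


def audit(finish_reasons):
    """Return (bucket counts, number of non-clean exits)."""
    counts = Counter(classify(r) for r in finish_reasons)
    non_clean = sum(n for label, n in counts.items() if label != "clean")
    return counts, non_clean


# The 397B no-think row from the table above.
nothink = (
    ["done_signal"] * 115
    + ["model_stopped"] * 3
    + ["stuck_no_workspace_change_for_500_iters"] * 2
)
counts, non_clean = audit(nothink)
print(dict(counts), non_clean)
```

Keeping `stall` and `runaway` as separate buckets preserves the distinction the audit cares about: 397B's non-clean exits are stalls or early stops, never over-generation.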
94+
95+
---
96+
Generated by `tooling/bench_report.py` (read-only; does not touch the live autopilot status.json or dashboard).
