1- # Qwen3.5-397B-A17B (GGUF, llama.cpp): microbench N=1, vs Step-3.7-Flash
1+ # Qwen3.5-397B-A17B (GGUF, llama.cpp): microbench N=3, vs Step-3.7-Flash
22
3- **Provisional / N=1.** One replicate per cell: directional, not statistically settled. An N=3
4- re-run is queued. Pair this with [QUALITATIVE.md](QUALITATIVE.md): the pass/fail table below ties
5- across models, and the differences that matter are qualitative.
3+ **N=3** (three replicates per cell). Pair this with [QUALITATIVE.md](QUALITATIVE.md): the pass/fail
4+ table below ties across models, and the differences that matter are qualitative. Phase-1 cells are
5+ graded with the **fixed** `phase1_grade.py` (see grading-correctness note below).
66
77## Setup
88- **Model:** Qwen3.5-397B-A17B, unsloth UD-Q3_K_XL GGUF (~167 GB on disk, 5 shards).
@@ -14,23 +14,23 @@ across models, and the differences that matter are qualitative.
1414 `../step3.7-flash-nvfp4-dual-blackwell-2026-05-28/`. Cross-engine + cross-quant: **"best-as-each-ships,"
1515 not a clean precision study.**
1616
17- ## Scorecard (N=1)
17+ ## Scorecard (N=3, pass count per cell)
1818
19- | task | 397B no-think | 397B think | Step low/med/high | 27B (ref N=3 ) | Coder (ref N=3 ) |
19+ | task | 397B no-think | 397B think | Step low/med/high | 27B (ref) | Coder (ref) |
2020| ---| :--:| :--:| :--:| :--:| :--:|
21- | p1_bugfix | ✓ | ✓ | ✓/✓/✓ | 3/3 | 2/3 |
22- | p1_testwrite † | ✗ | ✗ | ✗/✗/✗ | 0/3 † | 0/3 † |
23- | p1_refactor † | ✗ | ✗ | ✓/✗/✗ | 0/3 † | 0/3 † |
24- | p2_extract | ✓ | ✓ | ✓/✓/✓ | 3/3 | 3/3 |
25- | p2_ci | ✓ | ✓ | ✓/✓/✓ | 3/3 | 3/3 |
26- | p2_hallucination | ✓ | ✓ | ✓/✓/✓ | 3/3 | 1/3 |
27- | p2_triage | ✓ | ✓ | ~/✓/✓ | 3/3 | 3/3 |
28- | p3_doc | ✓ | **✗** | ~/✓/✓ | 0/3 | 2/3 |
29- | p3_business | ✓ | ✓ | ✓/~/✗ | 2/3 | 3/3 |
30- | p3_market * | ✓ | ✓ | **✗**/✓/✓ | 3/3 * | 0/3 |
31- | p3_writing | ✗ | ✗ | ✗/~/✗ | 0/3 | 2/3 |
32- | p3_pm | ✗ | ✗ | ✗/~/✓ | 0/3 | 1/3 |
33- | **Total** | **8/12** | **7/12** | **7 / 8 / 8** | ~7/12 | ~7/12 |
21+ | p1_bugfix | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 2/3 |
22+ | p1_testwrite † | 0/3 | 0/3 | ✗/✗/✗ | 0/3 † | 0/3 † |
23+ | p1_refactor † | 0/3 | 0/3 | ✓/✗/✗ | 0/3 † | 0/3 † |
24+ | p2_extract | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 3/3 |
25+ | p2_ci | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 3/3 |
26+ | p2_hallucination | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 1/3 |
27+ | p2_triage | 3/3 | 3/3 | ~/✓/✓ | 3/3 | 3/3 |
28+ | p3_doc | 2/3 | **1/3** | ~/✓/✓ | 0/3 | 2/3 |
29+ | p3_business | 3/3 | 3/3 | ✓/~/✗ | 2/3 | 3/3 |
30+ | p3_market * | **1/3** | **3/3** | ✗/✓/✓ | 3/3 * | 0/3 |
31+ | p3_writing | 0/3 | 0/3 | ✗/~/✗ | 0/3 | 2/3 |
32+ | p3_pm | **2/3** | **0/3** | ✗/~/✓ | 0/3 | 1/3 |
33+ | **Total** | **23/36** | **22/36** | **7 / 8 / 8** | ~7/12 | ~7/12 |
3434
3535† `p1_refactor` fails on structure (no `output/` subpackage created), not on the model's competence at the
3636core edit. `p1_testwrite`: see the grading-correctness note below; the earlier "task-design" framing was
@@ -40,41 +40,51 @@ partly a grader artifact. \* `p3_market` is graded STRUCTURAL_PASS (citation val
4040A review caught that `phase1_grade.py` read flat keys (`coverage_pct`, `ruff_issues`, `benchmark_s`) while
4141`code_task_grader.py` writes nested ones (`coverage.line_coverage_pct`, `ruff.issue_count`,
4242`benchmark.elapsed_s`). Effect: `p1_bugfix`'s ruff/benchmark gates were silently always-true, and
43- `p1_testwrite`'s coverage gate was always-false. Fixed and **all phase-1 cells regraded**. Outcome:
44- - **Totals unchanged (8/12 / 7/12)**, but now *trustworthy*, not coincidental.
45- - `p1_bugfix` PASS is now genuinely validated: ruff 2→0 and benchmark **11.2s→0.537s** (the planted O(n²)
46- fix) are real and pass; they were previously ignored.
43+ `p1_testwrite`'s coverage gate was always-false. Fixed and **all phase-1 cells regraded** (N=1 and N=3):
44+ - `p1_bugfix` PASS is genuinely validated: ruff 2→0 and benchmark **11.2s→0.537s** (the planted O(n²)
45+ fix) are real and pass; they were previously ignored. Consistent 3/3 in both arms.
4746- `p1_testwrite` still FAILs, but the **reason flips**: think-mode actually achieved **99% coverage / 153
4847 passing tests** (the broken grader reported `cov=0` and hid it); it fails only on `logalyzer_unchanged`
4948 (it edited production code, violating the "only /tests/ may differ" rule). The model is *capable* here:
5049 the task constraint, not incapacity, is what fails it. The inherited † "task-design" footnote on testwrite
5150 is misleading and should be re-examined for the published 27B/Coder cells too.
52- - ⚠️ **The 27B / Coder reference columns in the scorecard above predate this fix.** Their phase-1
53- (bug-fixing / test-writing) numbers came from the same buggy grader, so they may be wrong; testwrite
54- especially is likely a guaranteed-FAIL artifact. **Historical phase-1 scores may need regrading; see
55- tracking issue #29.** Treat the reference columns' p1_* cells as provisional until that lands.
51+ - ⚠️ **The 27B / Coder reference columns predate this fix.** Their phase-1 (bug-fixing / test-writing)
52+ numbers came from the same buggy grader, so they may be wrong; testwrite especially is likely a
53+ guaranteed-FAIL artifact. **Historical phase-1 scores may need regrading; see tracking issue #29.**
54+ Treat the reference columns' p1_* cells as provisional until that lands.
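
The key mismatch described in this note is easy to reproduce in a few lines. This is a minimal sketch, not the actual `phase1_grade.py`: the key names come from the note, while the `report` payload, the gate thresholds, and the helper names (`gates_flat`, `gates_nested`, `require`) are hypothetical. It shows how reading a missing flat key with a permissive default makes two gates always pass and one always fail, and how a strict nested lookup surfaces schema drift instead of hiding it.

```python
from typing import Any

# Hypothetical payload, shaped like the nested keys code_task_grader.py is said to write.
report: dict[str, Any] = {
    "coverage": {"line_coverage_pct": 99.0},
    "ruff": {"issue_count": 0},
    "benchmark": {"elapsed_s": 0.537},
}

# Illustrative thresholds only; the real gate values are not given in the note.
MIN_COVERAGE_PCT = 80.0
MAX_BENCHMARK_S = 2.0


def gates_flat(r: dict[str, Any]) -> dict[str, bool]:
    """Buggy pattern: the flat keys never exist, so the defaults decide every gate."""
    return {
        "ruff": r.get("ruff_issues", 0) == 0,  # missing -> 0 -> always true
        "benchmark": r.get("benchmark_s", 0.0) <= MAX_BENCHMARK_S,  # always true
        "coverage": r.get("coverage_pct", 0.0) >= MIN_COVERAGE_PCT,  # always false
    }


def require(r: dict[str, Any], dotted: str) -> Any:
    """Walk a dotted path and raise KeyError rather than defaulting silently."""
    node: Any = r
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(f"grader report is missing {dotted!r}")
        node = node[part]
    return node


def gates_nested(r: dict[str, Any]) -> dict[str, bool]:
    """Fixed pattern: read the nested keys and fail loudly on a schema mismatch."""
    return {
        "ruff": require(r, "ruff.issue_count") == 0,
        "benchmark": require(r, "benchmark.elapsed_s") <= MAX_BENCHMARK_S,
        "coverage": require(r, "coverage.line_coverage_pct") >= MIN_COVERAGE_PCT,
    }


if __name__ == "__main__":
    print("flat:  ", gates_flat(report))
    print("nested:", gates_nested(report))
```

With the sample payload, the flat reader reports a coverage failure even though coverage is 99%, which mirrors the `cov=0` symptom on `p1_testwrite`. Raising on a missing key is one possible design choice; a schema check at load time would serve the same purpose.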
5655
5756## Headline findings
5857
59- 1. **Scale doesn't move the aggregate.** A 397B-param model lands in the *same 7–8/12 band* as a 27B,
60- a ~30B coder, and an ~11B-active Flash. The interesting signal is per-task and qualitative, not the total.
58+ 1. **Scale doesn't move the aggregate.** 397B lands at 23/36 (no-think) and 22/36 (think), the *same
59+ 7–8/12 band* (by per-task majority) as a 27B, a ~30B coder, and an ~11B-active Flash. The interesting
60+ signal is per-task and qualitative, not the total.
6161
62- 2. **Thinking is net −1 for 397B on this suite** (8/12 → 7/12). 11 of 12 cells are identical between modes;
63- the lone flip is `p3_doc` PASS→FAIL, and it's instructive: *both* modes captured all 8/8 facts
64- (`fact_coverage 1.0`); think just wrote 721 words against a 700-word limit (no-think: 692). Reasoning
65- amplified 397B's verbosity and blew a hard constraint with identical content. Thinking was inert
66- everywhere else: more tokens and turns, same outcomes. **Reasoning bought 397B nothing here.**
62+ 2. **Thinking is net −1 for 397B (8/12→7/12 at N=1, 23/36→22/36 at N=3), but N=3 overturns the N=1
63+ "inert" reading.** At N=1 the loss looked like a single verbosity flip (`p3_doc`). With three
64+ replicates, thinking is not inert; it **redistributes**: it *helps* `p3_market` (no-think 1/3 → think
65+ **3/3**, stabilizing the wobbliest cell, zero runaways) but *hurts* `p3_pm` (2/3 → **0/3**) and `p3_doc`
66+ (2/3 → 1/3, the verbosity-vs-word-limit story). Net −1, but as a wash of real per-task swings, not a
67+ no-op. Reasoning changes *where* 397B succeeds far more than *how often*.
6768
68- 3. **397B is runaway-resistant; Flash is not (at low effort).** All 24 397B cells (both arms) finished
69- `done_signal`, with zero max_tokens/length failures. Step-3.7-Flash **ran away on `p3_market` at low effort**
70- (hit max_tokens). 397B's reliability edge is real and mode-independent.
69+ 3. **N=3 exposes which N=1 verdicts were luck.** Single replicates are noisy on the open-ended cells: in
70+ the no-think arm, `p3_market` (N=1 ✓ → N=3 1/3) and `p3_pm` (N=1 ✗ → N=3 2/3) were single-draw
71+ artifacts; `p3_doc` (✓ → 2/3) wobbles on the word limit. Zero-variance cells (3/3 or 0/3 across all
72+ reps): p1_bugfix, the grounded mid-tier (p2_extract/ci/hallucination/triage), p3_business, and the
73+ consistent fails. **Trust the mid-tier and bugfix; treat market/pm/doc as high-variance.**
7174
72- 4. **397B's distinctive lane is long-form synthesis** (`p3_doc`/`p3_business`/`p3_market` all pass in
73- no-think) where the 27B was weak, but it's the **slowest and most expensive** way to reach the shared
74- band (~71 tok/s spanning both GPUs at Q3, vs Flash ~99 tok/s on one engine). Flash is the better default;
75- 397B earns its keep only where synthesis reliability and a single stable setting matter.
75+ 4. **397B is runaway-resistant, but no-think market research gets *stuck*.** Zero max_tokens/length
76+ runaways across all 72 cells (the failure mode Step-3.7-Flash showed on `p3_market` at low effort).
77+ 69/72 finished `done_signal`; the 3 non-clean exits were **all no-think**: `p3_market` v2 & v3 hit the
78+ 500-iter **stuck threshold** (`stuck_no_workspace_change_for_500_iters`: spinning without progress, not
79+ over-generating) and `p3_pm` v1 ended with `model_stopped`. So 397B's pathology is *stalling*, not runaway, and
80+ **thinking eliminates the market stall**: think `p3_market` is a clean 3/3 `done_signal` vs no-think's
81+ 1/3 (2 stuck). For market research, reasoning turns a stuck coin-flip into a lock.
7682
77- 5. **Integration tax (a "messy model" finding).** Flash (vLLM) ran the harness out of the box. 397B
83+ 5. **Cost still favors Flash.** 397B reaches the shared band at ~71 tok/s spanning both GPUs at Q3, vs
84+ Flash ~99 tok/s on one engine. Flash is the better default; 397B earns its keep where its runaway
85+ resistance + (thinking-on) market-research reliability matter.
86+
87+ 6. **Integration tax (a "messy model" finding).** Flash (vLLM) ran the harness out of the box. 397B
7888 (llama.cpp) needed `--reasoning-format none` (the default extracts CoT into `reasoning_content`, leaving
7989 `content` empty → the agent loop reads a thinking turn as "done" and dies at iter ~3, which is invisible to a
8090 thinking-off smoke test) plus harness cleanup fixes (non-sudo `rm` on root-owned sandbox/grader leftovers).
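
The totals and per-task swings cited in findings 1 and 2 can be recomputed from the N=3 scorecard. The sketch below hard-codes the 397B pass counts from that table; the helper names (`arm_totals`, `majority_wins`, `swings`) are illustrative, and "majority" is assumed to mean at least 2 of 3 replicates passing.

```python
# Passes out of 3 replicates per task, copied from the N=3 scorecard:
# task -> (397B no-think, 397B think).
PASSES = {
    "p1_bugfix": (3, 3),
    "p1_testwrite": (0, 0),
    "p1_refactor": (0, 0),
    "p2_extract": (3, 3),
    "p2_ci": (3, 3),
    "p2_hallucination": (3, 3),
    "p2_triage": (3, 3),
    "p3_doc": (2, 1),
    "p3_business": (3, 3),
    "p3_market": (1, 3),
    "p3_pm": (2, 0),
    "p3_writing": (0, 0),
}
REPS = 3
ARMS = ("no-think", "think")


def arm_totals(passes):
    """Sum raw passes per arm (the denominator is REPS * number of tasks)."""
    return tuple(sum(cell[i] for cell in passes.values()) for i in range(len(ARMS)))


def majority_wins(passes, reps=REPS):
    """Count the tasks each arm passes in a strict majority of replicates."""
    return tuple(
        sum(1 for cell in passes.values() if 2 * cell[i] > reps)
        for i in range(len(ARMS))
    )


def swings(passes):
    """Per-task change from no-think to think, keeping only non-zero deltas."""
    return {t: think - base for t, (base, think) in passes.items() if think != base}


if __name__ == "__main__":
    denom = REPS * len(PASSES)
    for arm, raw, maj in zip(ARMS, arm_totals(PASSES), majority_wins(PASSES)):
        print(f"{arm}: {raw}/{denom} raw, {maj}/{len(PASSES)} by majority")
    deltas = swings(PASSES)
    print(deltas, "net", sum(deltas.values()))
    # no-think: 23/36 raw, 8/12 by majority
    # think: 22/36 raw, 7/12 by majority
    # {'p3_doc': -1, 'p3_market': 2, 'p3_pm': -2} net -1
```

The three non-zero deltas sum to −1, which matches the "redistributes, not inert" reading: thinking moves five passes around across three tasks while the total shifts by one.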