Skip to content

Commit aea2585

Browse files
Merge pull request #30 from Light-Heart-Labs/qwen3.5-397b-n3-results-2026-05-29
[N=3] Qwen3.5-397B-A17B firmed results + P2a guard (#28 merged without it)
2 parents 7a21044 + 4809534 commit aea2585

5 files changed

Lines changed: 122 additions & 82 deletions

File tree

hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/QUALITATIVE.md

Lines changed: 31 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,8 @@
11
# Qwen3.5-397B-A17B vs Step-3.7-Flash — qualitative differences
22

3-
**Status: N=1 / provisional. Both 397B arms complete (no-think 8/12, think 7/12); Flash low/med/high complete.**
3+
**Status: N=3. Both 397B arms complete (no-think 23/36, think 22/36); Flash low/med/high complete.**
4+
Per-cell citations below may reference N=1 (`_v1`) artifacts where the behavior is identical across reps;
5+
N=3 pass counts are in findings.md.
46
This doc is deliberately *not* a pass/fail scorecard. Pass/fail ties (397B no-think 8/12
57
vs Flash 7–8/12) hide the differences that matter — those live here. Every claim cites the
68
cell/file it came from so it's reproducible. See SCORECARD/findings for the quantitative table.
@@ -56,32 +58,35 @@ synergies, opaque valuation) — quality parity on the judgment. The form differ
5658
metric doesn't just mis-score — it **invents the wrong story about why**. The pass/fail bit (FAIL) was
5759
right by accident; everything it implied about the model was wrong. (`logs/p1_testwrite_397b-think_v1/grade.json`)
5860

59-
## 4. Does thinking help 397B? No — net −1, and the loss is revealing
60-
**397B no-think 8/12 vs 397B think 7/12.** Eleven of twelve cells are identical between modes; reasoning
61-
changed exactly one outcome — and made it *worse*:
61+
## 4. Does thinking help 397B? Net −1 — and N=3 shows it *redistributes*, not "does nothing"
62+
**No-think 23/36 vs think 22/36** (N=1 was 8/12 vs 7/12 — net −1 both ways). At **N=1** the loss looked
63+
like a single verbosity flip (`p3_doc`), and it was tempting to call thinking "inert everywhere else."
64+
**N=3 refutes that** — thinking moves three cells, in *both* directions:
6265

63-
| flip | no-think | think | cause |
64-
|---|---|---|---|
65-
| **p3_doc** | PASS (692w) | **FAIL (721w)** | identical content, verbosity blew the limit |
66+
| cell | no-think (N=3) | think (N=3) | Δ | what's happening |
67+
|---|:--:|:--:|:--:|---|
68+
| **p3_market** | 1/3 | **3/3** | +2 | thinking *stabilizes* the wobbliest cell — coin-flip → lock, zero runaways |
69+
| **p3_pm** | 2/3 | **0/3** | −2 | thinking *hurts* project-mgmt synthesis (over-deliberation → worse) |
70+
| **p3_doc** | 2/3 | 1/3 | −1 | the verbosity story: thinking inflates length, trips the 700-word limit |
6671

67-
`p3_doc` think captured **all 8/8 facts** (`fact_coverage 1.0`), same as no-think — but wrote 721 words
68-
against a 700-word limit (`within_word_limit: False`) where no-think landed at 692. Thinking did not make
69-
it less accurate; it made it **less disciplined about the length constraint**, amplifying 397B's existing
70-
over-documentation tendency (§2). It also spent more turns getting there (35 vs 20).
71-
(`logs/p3_doc_397b-{nothink,think}_v1/grade.json`) — a clean case of why pass/fail alone misleads: the
72-
think output is arguably equal in substance and failed on form.
73-
74-
Everywhere else thinking was **inert**: same PASS/FAIL, just more tokens and turns. On this suite,
75-
reasoning bought 397B nothing.
72+
The N=1 `p3_doc` flip was real but *not the whole story* — one draw of a three-way swing. The verbosity
73+
mechanism is well-captured: at N=1, think `p3_doc` hit **all 8/8 facts** (`fact_coverage 1.0`, same as
74+
no-think) but wrote 721 words vs the 700 limit (no-think: 692) — equal substance, failed on form
75+
(`logs/p3_doc_397b-{nothink,think}_v1/grade.json`). The new lesson from N=3: **reasoning changes *where*
76+
397B succeeds, not *how often*** — it buys market-research reliability at the cost of pm/doc, netting to
77+
−1. A single-N read would have missed both the gain and the symmetry.
7678

7779
**Reasoning shape:** 397B thinks in short targeted bursts (`p1_bugfix` think: 16 of 126 turns carry a
7880
substantial think block, median ~73 reasoning tokens/turn), not long monologues — but uses more turns
7981
than no-think on the same task (126 vs 110).
8082

81-
**No runaways, either mode.** All 12 think cells finished `done_signal` (no max_tokens/length failures) —
82-
including `p3_market` (STRUCTURAL_PASS, 56 iters). Contrast Flash, which **ran away on `p3_market` at low
83-
effort** (hit max_tokens). 397B is runaway-resistant in both modes; Flash's runaway risk is concentrated
84-
at low reasoning effort. This is a real reliability edge for 397B.
83+
**No *runaways*, but no-think *stalls* on market research.** Zero max_tokens/length failures across all
84+
72 cells — contrast Flash, which **ran away on `p3_market` at low effort** (hit max_tokens). 397B's
85+
failure mode is the opposite of runaway: 69/72 `done_signal`, and the 3 non-clean exits were all no-think
86+
`p3_market` v2/v3 hit the 500-iter *stuck* threshold (spinning without progress) and `p3_pm` v1
87+
`model_stopped`. The think arm is clean (36/36 `done_signal`), and notably `p3_market` think is 3/3 vs
88+
no-think's 2-stuck/1-pass — **thinking eliminates the stall.** So the reliability edge over Flash is real
89+
(no runaways), but it's "stalls quietly," not "always finishes."
8590

8691
## 5. Integration cost (a "messy model" finding in itself)
8792
- Flash (vLLM) ran the harness out of the box once launched.
@@ -92,10 +97,12 @@ at low reasoning effort. This is a real reliability edge for 397B.
9297
re-run name collisions). Both are harness/engine-integration bugs, not model quality — but they're
9398
exactly the "messy" friction MMBT exists to document. PR should fix both in the harness.
9499

95-
## Net take (provisional, no-think only)
100+
## Net take (N=3, both arms)
96101
397B no-think is the steady, over-documenting, high-prose-quality one whose misses are omissions; Flash
97102
is the fast, terse, reasoning-driven one — brilliant when bounded, flaily when not. They agree on
98103
substance more than the scorecard's "tie" suggests. Flash is the cheaper/faster way to the same band
99-
(~99 vs ~71 tok/s, one engine vs both GPUs); 397B's case is narrow (long-form synthesis reliability,
100-
one stable setting, no effort-tuning). The 397B-think arm is the apples-to-apples test against Flash's
101-
reasoning modes — pending.
104+
(~99 vs ~71 tok/s, one engine vs both GPUs). 397B's edge is reliability — no max_tokens runaways (its
105+
worst case is a quiet *stall* on no-think market research, which thinking clears to 3/3) — not raw
106+
accuracy. Thinking doesn't raise the aggregate (net −1); it *redistributes* (market up, pm/doc down), so
107+
"turn thinking on" is a per-task call, not a default. Reach for 397B when runaway resistance and
108+
market-research reliability matter; otherwise Flash wins on speed and cost.

hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/README.md

Lines changed: 12 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,18 +1,22 @@
1-
# Qwen3.5-397B-A17B on 2× RTX PRO 6000 Blackwell — microbench N=1 (+ Step-3.7-Flash comparison)
1+
# Qwen3.5-397B-A17B on 2× RTX PRO 6000 Blackwell — microbench N=3 (+ Step-3.7-Flash comparison)
22

3-
A large dense-MoE (397B total / ~17B active) run as a GGUF on llama.cpp, benched through the MMBT
3+
A large MoE (397B total / ~17B active) run as a GGUF on llama.cpp, benched through the MMBT
44
12-family agentic microbench in two reasoning modes (no-think / think) and compared against the
55
Step-3.7-Flash-NVFP4 entry on the same box.
66

7-
**Provisional, N=1**one replicate per cell. An N=3 re-run is queued; treat numbers as directional.
7+
**N=3**three replicates per cell (phase-1 graded with the fixed `phase1_grade.py`).
88

99
## TL;DR
10-
- **397B no-think 8/12, think 7/12; Step-3.7-Flash 7–8/12.** Aggregate ties across a ~15× param range.
11-
- **Thinking didn't help 397B** (net −1) — the one regression is a verbosity-driven word-limit overrun at
12-
identical fact coverage, not a reasoning failure.
13-
- **397B never ran away** (both arms) where Flash did at low effort — a real reliability edge.
10+
- **397B no-think 23/36, think 22/36; Step-3.7-Flash 7–8/12.** Aggregate ties across a ~15× param range.
11+
- **Thinking is net −1, but not inert** (N=3 correction): it *redistributes* — stabilizes `p3_market`
12+
(1/3→3/3) while hurting `p3_pm` (2/3→0/3) and `p3_doc` (2/3→1/3). It changes *where* 397B succeeds,
13+
not how often.
14+
- **397B never ran away** (zero max_tokens/length failures across 72 cells) where Flash did at low effort.
15+
Its only non-clean exits: 2 no-think `p3_market` reps hit the 500-iter *stuck* threshold + 1 `p3_pm`
16+
`model_stopped` — a stalling pathology, not runaway; thinking clears the market stall (think market 3/3).
17+
- **N=3 matters:** `p3_market`/`p3_pm`/`p3_doc` are high-variance; their N=1 verdicts were single-draw luck.
1418
- **Flash is the cheaper/faster default** (~99 vs ~71 tok/s, one engine vs both GPUs); 397B's case is
15-
narrow (long-form synthesis reliability, one stable setting).
19+
reliability (runaway resistance, thinking-on market research).
1620
- The substance is qualitative — **read [QUALITATIVE.md](QUALITATIVE.md).**
1721

1822
## Files

hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/findings.md

Lines changed: 52 additions & 42 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,8 @@
1-
# Qwen3.5-397B-A17B (GGUF, llama.cpp) — microbench N=1, vs Step-3.7-Flash
1+
# Qwen3.5-397B-A17B (GGUF, llama.cpp) — microbench N=3, vs Step-3.7-Flash
22

3-
**Provisional / N=1.** One replicate per cell — directional, not statistically settled. An N=3
4-
re-run is queued. Pair this with [QUALITATIVE.md](QUALITATIVE.md): the pass/fail table below ties
5-
across models, and the differences that matter are qualitative.
3+
**N=3** (three replicates per cell). Pair this with [QUALITATIVE.md](QUALITATIVE.md): the pass/fail
4+
table below ties across models, and the differences that matter are qualitative. Phase-1 cells are
5+
graded with the **fixed** `phase1_grade.py` (see grading-correctness note below).
66

77
## Setup
88
- **Model:** Qwen3.5-397B-A17B, unsloth UD-Q3_K_XL GGUF (~167 GB on disk, 5 shards).
@@ -14,23 +14,23 @@ across models, and the differences that matter are qualitative.
1414
`../step3.7-flash-nvfp4-dual-blackwell-2026-05-28/`. Cross-engine + cross-quant: **"best-as-each-ships,"
1515
not a clean precision study.**
1616

17-
## Scorecard (N=1)
17+
## Scorecard (N=3, pass count per cell)
1818

19-
| task | 397B no-think | 397B think | Step low/med/high | 27B (ref N=3) | Coder (ref N=3) |
19+
| task | 397B no-think | 397B think | Step low/med/high | 27B (ref) | Coder (ref) |
2020
|---|:--:|:--:|:--:|:--:|:--:|
21-
| p1_bugfix | | | ✓/✓/✓ | 3/3 | 2/3 |
22-
| p1_testwrite † | | | ✗/✗/✗ | 0/3 † | 0/3 † |
23-
| p1_refactor † | | | ✓/✗/✗ | 0/3 † | 0/3 † |
24-
| p2_extract | | | ✓/✓/✓ | 3/3 | 3/3 |
25-
| p2_ci | | | ✓/✓/✓ | 3/3 | 3/3 |
26-
| p2_hallucination | | | ✓/✓/✓ | 3/3 | 1/3 |
27-
| p2_triage | | | ~/✓/✓ | 3/3 | 3/3 |
28-
| p3_doc | | **** | ~/✓/✓ | 0/3 | 2/3 |
29-
| p3_business | | | ✓/~/✗ | 2/3 | 3/3 |
30-
| p3_market * | | | ****/✓/✓ | 3/3 * | 0/3 |
31-
| p3_writing | | | ✗/~/✗ | 0/3 | 2/3 |
32-
| p3_pm | | | ✗/~/✓ | 0/3 | 1/3 |
33-
| **Total** | **8/12** | **7/12** | **7 / 8 / 8** | ~7/12 | ~7/12 |
21+
| p1_bugfix | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 2/3 |
22+
| p1_testwrite † | 0/3 | 0/3 | ✗/✗/✗ | 0/3 † | 0/3 † |
23+
| p1_refactor † | 0/3 | 0/3 | ✓/✗/✗ | 0/3 † | 0/3 † |
24+
| p2_extract | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 3/3 |
25+
| p2_ci | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 3/3 |
26+
| p2_hallucination | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 1/3 |
27+
| p2_triage | 3/3 | 3/3 | ~/✓/✓ | 3/3 | 3/3 |
28+
| p3_doc | 2/3 | **1/3** | ~/✓/✓ | 0/3 | 2/3 |
29+
| p3_business | 3/3 | 3/3 | ✓/~/✗ | 2/3 | 3/3 |
30+
| p3_market * | **1/3** | **3/3** |/✓/✓ | 3/3 * | 0/3 |
31+
| p3_writing | 0/3 | 0/3 | ✗/~/✗ | 0/3 | 2/3 |
32+
| p3_pm | **2/3** | **0/3** | ✗/~/✓ | 0/3 | 1/3 |
33+
| **Total** | **23/36** | **22/36** | **7 / 8 / 8** | ~7/12 | ~7/12 |
3434

3535
`p1_refactor` fails on structure (no `output/` subpackage created), not the model's competence at the
3636
core edit. `p1_testwrite` — see the grading-correctness note below; the earlier "task-design" framing was
@@ -40,41 +40,51 @@ partly a grader artifact. \* `p3_market` is graded STRUCTURAL_PASS (citation val
4040
A review caught that `phase1_grade.py` read flat keys (`coverage_pct`, `ruff_issues`, `benchmark_s`) while
4141
`code_task_grader.py` writes nested ones (`coverage.line_coverage_pct`, `ruff.issue_count`,
4242
`benchmark.elapsed_s`). Effect: `p1_bugfix`'s ruff/benchmark gates were silently always-true, and
43-
`p1_testwrite`'s coverage gate was always-false. Fixed and **all phase-1 cells regraded**. Outcome:
44-
- **Totals unchanged (8/12 / 7/12)** — but now *trustworthy*, not coincidental.
45-
- `p1_bugfix` PASS is now genuinely validated: ruff 2→0 and benchmark **11.2s→0.537s** (the planted O(n²)
46-
fix) are real and pass — they were previously ignored.
43+
`p1_testwrite`'s coverage gate was always-false. Fixed and **all phase-1 cells regraded** (N=1 and N=3):
44+
- `p1_bugfix` PASS is genuinely validated: ruff 2→0 and benchmark **11.2s→0.537s** (the planted O(n²)
45+
fix) are real and pass — they were previously ignored. Consistent 3/3 both arms.
4746
- `p1_testwrite` still FAILs, but the **reason flips**: think-mode actually achieved **99% coverage / 153
4847
passing tests** (the broken grader reported `cov=0` and hid it); it fails only on `logalyzer_unchanged`
4948
(it edited production code, violating the "only /tests/ may differ" rule). The model is *capable* here —
5049
the task constraint, not incapacity, is what fails it. The inherited † "task-design" footnote on testwrite
5150
is misleading and should be re-examined for the published 27B/Coder cells too.
52-
- ⚠️ **The 27B / Coder reference columns in the scorecard above predate this fix.** Their phase-1
53-
(bug-fixing / test-writing) numbers came from the same buggy grader, so they may be wrong — testwrite
54-
especially is likely a guaranteed-FAIL artifact. **Historical phase-1 scores may need regrading; see
55-
tracking issue #29.** Treat the reference columns' p1_* cells as provisional until that lands.
51+
- ⚠️ **The 27B / Coder reference columns predate this fix.** Their phase-1 (bug-fixing / test-writing)
52+
numbers came from the same buggy grader, so they may be wrong — testwrite especially is likely a
53+
guaranteed-FAIL artifact. **Historical phase-1 scores may need regrading; see tracking issue #29.**
54+
Treat the reference columns' p1_* cells as provisional until that lands.
5655

5756
## Headline findings
5857

59-
1. **Scale doesn't move the aggregate.** A 397B-param model lands in the *same 7–8/12 band* as a 27B,
60-
a ~30B coder, and an ~11B-active Flash. The interesting signal is per-task and qualitative, not the total.
58+
1. **Scale doesn't move the aggregate.** 397B lands at 23/36 (no-think) and 22/36 (think) — the *same
59+
7–8/12 band* (by per-task majority) as a 27B, a ~30B coder, and an ~11B-active Flash. The interesting
60+
signal is per-task and qualitative, not the total.
6161

62-
2. **Thinking is net −1 for 397B on this suite** (8/12 → 7/12). 11 of 12 cells are identical between modes;
63-
the lone flip is `p3_doc` PASS→FAIL — and it's instructive: *both* modes captured all 8/8 facts
64-
(`fact_coverage 1.0`); think just wrote 721 words against a 700-word limit (no-think: 692). Reasoning
65-
amplified 397B's verbosity and blew a hard constraint with identical content. Thinking was inert
66-
everywhere else — more tokens and turns, same outcomes. **Reasoning bought 397B nothing here.**
62+
2. **Thinking is net −1 for 397B (8/12→7/12 at N=1, 23/36→22/36 at N=3) — but N=3 overturns the N=1
63+
"inert" reading.** At N=1 the loss looked like a single verbosity flip (`p3_doc`). With three
64+
replicates, thinking is not inert — it **redistributes**: it *helps* `p3_market` (no-think 1/3 → think
65+
**3/3**, stabilizing the wobbliest cell, zero runaways) but *hurts* `p3_pm` (2/3 → **0/3**) and `p3_doc`
66+
(2/3 → 1/3, the verbosity-vs-word-limit story). Net −1, but as a wash of real per-task swings, not a
67+
no-op. Reasoning changes *where* 397B succeeds without changing *how often*.
6768

68-
3. **397B is runaway-resistant; Flash is not (at low effort).** All 24 397B cells (both arms) finished
69-
`done_signal` — zero max_tokens/length failures. Step-3.7-Flash **ran away on `p3_market` at low effort**
70-
(hit max_tokens). 397B's reliability edge is real and mode-independent.
69+
3. **N=3 exposes which N=1 verdicts were luck.** Single replicates are noisy on the open-ended cells: in
70+
the no-think arm, `p3_market` (N=1 ✓ → N=3 1/3) and `p3_pm` (N=1 ✗ → N=3 2/3) were single-draw
71+
artifacts; `p3_doc` (✓ → 2/3) wobbles on the word limit. Zero-variance cells (3/3 or 0/3 across all
72+
reps): p1_bugfix, the grounded mid-tier (p2_extract/ci/hallucination/triage), p3_business, and the
73+
consistent fails. **Trust the mid-tier and bugfix; treat market/pm/doc as high-variance.**
7174

72-
4. **397B's distinctive lane is long-form synthesis** (`p3_doc`/`p3_business`/`p3_market` all pass in
73-
no-think) where the 27B was weak — but it's the **slowest and most expensive** way to reach the shared
74-
band (~71 tok/s spanning both GPUs at Q3, vs Flash ~99 tok/s on one engine). Flash is the better default;
75-
397B earns its keep only where synthesis reliability and a single stable setting matter.
75+
4. **397B is runaway-resistant — but no-think market research gets *stuck*.** Zero max_tokens/length
76+
runaways across all 72 cells (the failure mode Step-3.7-Flash showed on `p3_market` at low effort).
77+
69/72 finished `done_signal`; the 3 non-clean exits were **all no-think**: `p3_market` v2 & v3 hit the
78+
500-iter **stuck threshold** (`stuck_no_workspace_change_for_500_iters` — spinning without progress, not
79+
over-generating) and `p3_pm` v1 `model_stopped`. So 397B's pathology is *stalling*, not runaway — and
80+
**thinking eliminates the market stall**: think `p3_market` is a clean 3/3 `done_signal` vs no-think's
81+
1/3 (2 stuck). For market research, reasoning turns a stuck coin-flip into a lock.
7682

77-
5. **Integration tax (a "messy model" finding).** Flash (vLLM) ran the harness out of the box. 397B
83+
5. **Cost still favors Flash.** 397B reaches the shared band at ~71 tok/s spanning both GPUs at Q3, vs
84+
Flash ~99 tok/s on one engine. Flash is the better default; 397B earns its keep where its runaway
85+
resistance + (thinking-on) market-research reliability matter.
86+
87+
6. **Integration tax (a "messy model" finding).** Flash (vLLM) ran the harness out of the box. 397B
7888
(llama.cpp) needed `--reasoning-format none` (default extracts CoT into `reasoning_content`, leaving
7989
`content` empty → agent loop reads a thinking turn as "done" and dies at iter ~3 — invisible to a
8090
thinking-off smoke) plus harness cleanup fixes (non-sudo `rm` on root-owned sandbox/grader leftovers).

0 commit comments

Comments
 (0)