@@ -1,6 +1,8 @@
# Qwen3.5-397B-A17B vs Step-3.7-Flash — qualitative differences

**Status: N=1 / provisional. Both 397B arms complete (no-think 8/12, think 7/12); Flash low/med/high complete.**
**Status: N=3. Both 397B arms complete (no-think 23/36, think 22/36); Flash low/med/high complete.**
Per-cell citations below may reference N=1 (`_v1`) artifacts where the behavior is identical across reps;
N=3 pass counts are in findings.md.
This doc is deliberately *not* a pass/fail scorecard. Pass/fail ties (397B no-think 8/12
vs Flash 7–8/12) hide the differences that matter — those live here. Every claim cites the
cell/file it came from so it's reproducible. See SCORECARD/findings for the quantitative table.
@@ -56,32 +58,35 @@ synergies, opaque valuation) — quality parity on the judgment. The form differ
metric doesn't just mis-score — it **invents the wrong story about why**. The pass/fail bit (FAIL) was
right by accident; everything it implied about the model was wrong. (`logs/p1_testwrite_397b-think_v1/grade.json`)

## 4. Does thinking help 397B? No — net −1, and the loss is revealing
**397B no-think 8/12 vs 397B think 7/12.** Eleven of twelve cells are identical between modes; reasoning
changed exactly one outcome — and made it *worse*:
## 4. Does thinking help 397B? Net −1 — and N=3 shows it *redistributes*, not "does nothing"
**No-think 23/36 vs think 22/36** (N=1 was 8/12 vs 7/12 — net −1 both ways). At **N=1** the loss looked
like a single verbosity flip (`p3_doc`), and it was tempting to call thinking "inert everywhere else."
**N=3 refutes that** — thinking moves three cells, in *both* directions:

| flip | no-think | think | cause |
|---|---|---|---|
| **p3_doc** | PASS (692w) | **FAIL (721w)** | identical content, verbosity blew the limit |
| cell | no-think (N=3) | think (N=3) | Δ | what's happening |
|---|:--:|:--:|:--:|---|
| **p3_market** | 1/3 | **3/3** | +2 | thinking *stabilizes* the wobbliest cell — coin-flip → lock, zero runaways |
| **p3_pm** | 2/3 | **0/3** | −2 | thinking *hurts* project-mgmt synthesis (over-deliberation → worse) |
| **p3_doc** | 2/3 | 1/3 | −1 | the verbosity story: thinking inflates length, trips the 700-word limit |

`p3_doc` think captured **all 8/8 facts** (`fact_coverage 1.0`), same as no-think — but wrote 721 words
against a 700-word limit (`within_word_limit: False`) where no-think landed at 692. Thinking did not make
it less accurate; it made it **less disciplined about the length constraint**, amplifying 397B's existing
over-documentation tendency (§2). It also spent more turns getting there (35 vs 20).
(`logs/p3_doc_397b-{nothink,think}_v1/grade.json`) — a clean case of why pass/fail alone misleads: the
think output is arguably equal in substance and failed on form.

Everywhere else thinking was **inert**: same PASS/FAIL, just more tokens and turns. On this suite,
reasoning bought 397B nothing.
The N=1 `p3_doc` flip was real but *not the whole story* — one draw of a three-way swing. The verbosity
mechanism is well-captured: at N=1, think `p3_doc` hit **all 8/8 facts** (`fact_coverage 1.0`, same as
no-think) but wrote 721 words vs the 700 limit (no-think: 692) — equal substance, failed on form
(`logs/p3_doc_397b-{nothink,think}_v1/grade.json`). The new lesson from N=3: **reasoning changes *where*
397B succeeds, not *how often*** — it buys market-research reliability at the cost of pm/doc, netting to
−1. A single-N read would have missed both the gain and the symmetry.
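
The redistribution arithmetic can be checked with a short sketch. The pass counts are copied from the N=3 table above; the variable and function names are illustrative and not part of the harness.

```python
# Pass counts out of 3 replicates, copied from the N=3 table above.
# Only the three cells that differ between modes are listed.
no_think = {"p3_market": 1, "p3_pm": 2, "p3_doc": 2}
think = {"p3_market": 3, "p3_pm": 0, "p3_doc": 1}


def per_cell_delta(base: dict, other: dict) -> dict:
    """Return other - base for every cell in base."""
    return {cell: other[cell] - base[cell] for cell in base}


delta = per_cell_delta(no_think, think)
gains = sum(v for v in delta.values() if v > 0)    # cells where thinking helped
losses = sum(v for v in delta.values() if v < 0)   # cells where thinking hurt
net = gains + losses

print(delta)                # {'p3_market': 2, 'p3_pm': -2, 'p3_doc': -1}
print(gains, losses, net)   # 2 -3 -1
```

The aggregate (−1) hides a +2 and a −3; reporting `gains` and `losses` separately is what makes the redistribution visible.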

**Reasoning shape:** 397B thinks in short targeted bursts (`p1_bugfix` think: 16 of 126 turns carry a
substantial think block, median ~73 reasoning tokens/turn), not long monologues — but uses more turns
than no-think on the same task (126 vs 110).

**No runaways, either mode.** All 12 think cells finished `done_signal` (no max_tokens/length failures) —
including `p3_market` (STRUCTURAL_PASS, 56 iters). Contrast Flash, which **ran away on `p3_market` at low
effort** (hit max_tokens). 397B is runaway-resistant in both modes; Flash's runaway risk is concentrated
at low reasoning effort. This is a real reliability edge for 397B.
**No *runaways*, but no-think *stalls* on market research.** Zero max_tokens/length failures across all
72 cells — contrast Flash, which **ran away on `p3_market` at low effort** (hit max_tokens). 397B's
failure mode is the opposite of runaway: 69/72 `done_signal`, and the 3 non-clean exits were all no-think
— `p3_market` v2/v3 hit the 500-iter *stuck* threshold (spinning without progress) and `p3_pm` v1
`model_stopped`. The think arm is clean (36/36 `done_signal`), and notably `p3_market` think is 3/3 vs
no-think's 2-stuck/1-pass — **thinking eliminates the stall.** So the reliability edge over Flash is real
(no runaways), but it's "stalls quietly," not "always finishes."
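
A minimal sketch of the runaway-versus-stall tally, assuming the exit labels appear in the logs as the strings quoted in this write-up. Grouping `model_stopped` with the stalls is this sketch's reading of the prose, not a harness definition, and the run list is rebuilt from the reported counts rather than read from log files.

```python
from collections import Counter

# Assumed label sets; the strings are the ones quoted in the text.
RUNAWAY = {"max_tokens", "length"}
STALL = {"stuck_no_workspace_change_for_500_iters", "model_stopped"}


def failure_mode(exit_reason: str) -> str:
    """Map a raw exit label to clean / runaway / stall / other."""
    if exit_reason == "done_signal":
        return "clean"
    if exit_reason in RUNAWAY:
        return "runaway"
    if exit_reason in STALL:
        return "stall"
    return "other"


# (arm, exit_reason) per run, rebuilt from the counts reported above:
# think 36/36 clean; no-think 33 clean, 2 stuck, 1 model_stopped.
runs = (
    [("think", "done_signal")] * 36
    + [("no-think", "done_signal")] * 33
    + [("no-think", "stuck_no_workspace_change_for_500_iters")] * 2
    + [("no-think", "model_stopped")]
)

tally = Counter((arm, failure_mode(reason)) for arm, reason in runs)
clean_total = sum(n for (_, mode), n in tally.items() if mode == "clean")
runaway_total = sum(n for (_, mode), n in tally.items() if mode == "runaway")

print(clean_total, len(runs))  # 69 72
print(runaway_total)           # 0
```

Keeping "runaway" and "stall" as separate buckets is the point: a single "non-clean" count would hide that 397B's failures are all of the second kind.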

## 5. Integration cost (a "messy model" finding in itself)
- Flash (vLLM) ran the harness out of the box once launched.
@@ -92,10 +97,12 @@ at low reasoning effort. This is a real reliability edge for 397B.
re-run name collisions). Both are harness/engine-integration bugs, not model quality — but they're
exactly the "messy" friction MMBT exists to document. PR should fix both in the harness.

## Net take (provisional, no-think only)
## Net take (N=3, both arms)
397B no-think is the steady, over-documenting, high-prose-quality one whose misses are omissions; Flash
is the fast, terse, reasoning-driven one: brilliant when bounded, flailing when not. They agree on
substance more than the scorecard's "tie" suggests. Flash is the cheaper/faster way to the same band
(~99 vs ~71 tok/s, one engine vs both GPUs); 397B's case is narrow (long-form synthesis reliability,
one stable setting, no effort-tuning). The 397B-think arm is the apples-to-apples test against Flash's
reasoning modes — pending.
(~99 vs ~71 tok/s, one engine vs both GPUs). 397B's edge is reliability — no max_tokens runaways (its
worst case is a quiet *stall* on no-think market research, which thinking clears to 3/3) — not raw
accuracy. Thinking doesn't raise the aggregate (net −1); it *redistributes* (market up, pm/doc down), so
"turn thinking on" is a per-task call, not a default. Reach for 397B when runaway resistance and
market-research reliability matter; otherwise Flash wins on speed and cost.
20 changes: 12 additions & 8 deletions hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/README.md
@@ -1,18 +1,22 @@
# Qwen3.5-397B-A17B on 2× RTX PRO 6000 Blackwell — microbench N=1 (+ Step-3.7-Flash comparison)
# Qwen3.5-397B-A17B on 2× RTX PRO 6000 Blackwell — microbench N=3 (+ Step-3.7-Flash comparison)

A large dense-MoE (397B total / ~17B active) run as a GGUF on llama.cpp, benched through the MMBT
A large MoE (397B total / ~17B active) run as a GGUF on llama.cpp, benched through the MMBT
12-family agentic microbench in two reasoning modes (no-think / think) and compared against the
Step-3.7-Flash-NVFP4 entry on the same box.

**Provisional, N=1** — one replicate per cell. An N=3 re-run is queued; treat numbers as directional.
**N=3** — three replicates per cell (phase-1 graded with the fixed `phase1_grade.py`).

## TL;DR
- **397B no-think 8/12, think 7/12; Step-3.7-Flash 7–8/12.** Aggregate ties across a ~15× param range.
- **Thinking didn't help 397B** (net −1) — the one regression is a verbosity-driven word-limit overrun at
identical fact coverage, not a reasoning failure.
- **397B never ran away** (both arms) where Flash did at low effort — a real reliability edge.
- **397B no-think 23/36, think 22/36; Step-3.7-Flash 7–8/12.** Aggregate ties across a ~15× param range.
- **Thinking is net −1, but not inert** (N=3 correction): it *redistributes* — stabilizes `p3_market`
(1/3→3/3) while hurting `p3_pm` (2/3→0/3) and `p3_doc` (2/3→1/3). It changes *where* 397B succeeds,
not how often.
- **397B never ran away** (zero max_tokens/length failures across 72 cells) where Flash did at low effort.
Its only non-clean exits: 2 no-think `p3_market` reps hit the 500-iter *stuck* threshold + 1 `p3_pm`
`model_stopped` — a stalling pathology, not runaway; thinking clears the market stall (think market 3/3).
- **N=3 matters:** `p3_market`/`p3_pm`/`p3_doc` are high-variance; their N=1 verdicts were single-draw luck.
- **Flash is the cheaper/faster default** (~99 vs ~71 tok/s, one engine vs both GPUs); 397B's case is
narrow (long-form synthesis reliability, one stable setting).
reliability (runaway resistance, thinking-on market research).
- The substance is qualitative — **read [QUALITATIVE.md](QUALITATIVE.md).**

## Files
94 changes: 52 additions & 42 deletions hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/findings.md
@@ -1,8 +1,8 @@
# Qwen3.5-397B-A17B (GGUF, llama.cpp) — microbench N=1, vs Step-3.7-Flash
# Qwen3.5-397B-A17B (GGUF, llama.cpp) — microbench N=3, vs Step-3.7-Flash

**Provisional / N=1.** One replicate per cell — directional, not statistically settled. An N=3
re-run is queued. Pair this with [QUALITATIVE.md](QUALITATIVE.md): the pass/fail table below ties
across models, and the differences that matter are qualitative.
**N=3** (three replicates per cell). Pair this with [QUALITATIVE.md](QUALITATIVE.md): the pass/fail
table below ties across models, and the differences that matter are qualitative. Phase-1 cells are
graded with the **fixed** `phase1_grade.py` (see grading-correctness note below).

## Setup
- **Model:** Qwen3.5-397B-A17B, unsloth UD-Q3_K_XL GGUF (~167 GB on disk, 5 shards).
@@ -14,23 +14,23 @@ across models, and the differences that matter are qualitative.
`../step3.7-flash-nvfp4-dual-blackwell-2026-05-28/`. Cross-engine + cross-quant: **"best-as-each-ships,"
not a clean precision study.**

## Scorecard (N=1)
## Scorecard (N=3, pass count per cell)

| task | 397B no-think | 397B think | Step low/med/high | 27B (ref N=3) | Coder (ref N=3) |
| task | 397B no-think | 397B think | Step low/med/high | 27B (ref) | Coder (ref) |
|---|:--:|:--:|:--:|:--:|:--:|
| p1_bugfix | | | ✓/✓/✓ | 3/3 | 2/3 |
| p1_testwrite † | | | ✗/✗/✗ | 0/3 † | 0/3 † |
| p1_refactor † | | | ✓/✗/✗ | 0/3 † | 0/3 † |
| p2_extract | | | ✓/✓/✓ | 3/3 | 3/3 |
| p2_ci | | | ✓/✓/✓ | 3/3 | 3/3 |
| p2_hallucination | | | ✓/✓/✓ | 3/3 | 1/3 |
| p2_triage | | | ~/✓/✓ | 3/3 | 3/3 |
| p3_doc | | **** | ~/✓/✓ | 0/3 | 2/3 |
| p3_business | | | ✓/~/✗ | 2/3 | 3/3 |
| p3_market * | | ✓ | **✗**/✓/✓ | 3/3 * | 0/3 |
| p3_writing | | | ✗/~/✗ | 0/3 | 2/3 |
| p3_pm | | | ✗/~/✓ | 0/3 | 1/3 |
| **Total** | **8/12** | **7/12** | **7 / 8 / 8** | ~7/12 | ~7/12 |
| p1_bugfix | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 2/3 |
| p1_testwrite † | 0/3 | 0/3 | ✗/✗/✗ | 0/3 † | 0/3 † |
| p1_refactor † | 0/3 | 0/3 | ✓/✗/✗ | 0/3 † | 0/3 † |
| p2_extract | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 3/3 |
| p2_ci | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 3/3 |
| p2_hallucination | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 1/3 |
| p2_triage | 3/3 | 3/3 | ~/✓/✓ | 3/3 | 3/3 |
| p3_doc | 2/3 | **1/3** | ~/✓/✓ | 0/3 | 2/3 |
| p3_business | 3/3 | 3/3 | ✓/~/✗ | 2/3 | 3/3 |
| p3_market * | **1/3** | **3/3** | ✗/✓/✓ | 3/3 * | 0/3 |
| p3_writing | 0/3 | 0/3 | ✗/~/✗ | 0/3 | 2/3 |
| p3_pm | **2/3** | **0/3** | ✗/~/✓ | 0/3 | 1/3 |
| **Total** | **23/36** | **22/36** | **7 / 8 / 8** | ~7/12 | ~7/12 |

† `p1_refactor` fails on structure (no `output/` subpackage created), not the model's competence at the
core edit. `p1_testwrite` — see the grading-correctness note below; the earlier "task-design" framing was
@@ -40,41 +40,51 @@ partly a grader artifact. \* `p3_market` is graded STRUCTURAL_PASS (citation val
A review caught that `phase1_grade.py` read flat keys (`coverage_pct`, `ruff_issues`, `benchmark_s`) while
`code_task_grader.py` writes nested ones (`coverage.line_coverage_pct`, `ruff.issue_count`,
`benchmark.elapsed_s`). Effect: `p1_bugfix`'s ruff/benchmark gates were silently always-true, and
`p1_testwrite`'s coverage gate was always-false. Fixed and **all phase-1 cells regraded**. Outcome:
- **Totals unchanged (8/12 / 7/12)** — but now *trustworthy*, not coincidental.
- `p1_bugfix` PASS is now genuinely validated: ruff 2→0 and benchmark **11.2s→0.537s** (the planted O(n²)
fix) are real and pass — they were previously ignored.
`p1_testwrite`'s coverage gate was always-false. Fixed and **all phase-1 cells regraded** (N=1 and N=3):
- `p1_bugfix` PASS is genuinely validated: ruff 2→0 and benchmark **11.2s→0.537s** (the planted O(n²)
fix) are real and pass — they were previously ignored. Consistent 3/3 both arms.
- `p1_testwrite` still FAILs, but the **reason flips**: think-mode actually achieved **99% coverage / 153
passing tests** (the broken grader reported `cov=0` and hid it); it fails only on `logalyzer_unchanged`
(it edited production code, violating the "only /tests/ may differ" rule). The model is *capable* here —
the task constraint, not incapacity, is what fails it. The inherited † "task-design" footnote on testwrite
is misleading and should be re-examined for the published 27B/Coder cells too.
- ⚠️ **The 27B / Coder reference columns in the scorecard above predate this fix.** Their phase-1
(bug-fixing / test-writing) numbers came from the same buggy grader, so they may be wrong — testwrite
especially is likely a guaranteed-FAIL artifact. **Historical phase-1 scores may need regrading; see
tracking issue #29.** Treat the reference columns' p1_* cells as provisional until that lands.
- ⚠️ **The 27B / Coder reference columns predate this fix.** Their phase-1 (bug-fixing / test-writing)
numbers came from the same buggy grader, so they may be wrong — testwrite especially is likely a
guaranteed-FAIL artifact. **Historical phase-1 scores may need regrading; see tracking issue #29.**
Treat the reference columns' p1_* cells as provisional until that lands.
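
A minimal sketch of how a flat-versus-nested key mismatch can stay silent, and one way to make it fail loudly. The nested key names are the ones quoted in this note; `get_nested` is a hypothetical helper, the values are illustrative, and whether `phase1_grade.py` used a defaulted lookup like the one shown is an assumption.

```python
from typing import Any


def get_nested(report: dict, dotted_key: str) -> Any:
    """Look up 'a.b.c' in a nested dict; raise KeyError if any part is missing."""
    node: Any = report
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(f"grader output has no key {dotted_key!r}")
        node = node[part]
    return node


# Shape follows the nested keys named above; the values are illustrative.
report = {
    "coverage": {"line_coverage_pct": 99.0},
    "ruff": {"issue_count": 0},
    "benchmark": {"elapsed_s": 0.537},
}

# A defaulted flat lookup never fails, so the gate sees the default, not the data.
flat_value = report.get("coverage_pct", 0)

# The strict nested lookup returns the real value, or raises if the schema drifts.
nested_value = get_nested(report, "coverage.line_coverage_pct")

print(flat_value, nested_value)  # 0 99.0
```

Raising on a missing key turns a schema mismatch into an immediate grader error instead of an always-true or always-false gate.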

## Headline findings

1. **Scale doesn't move the aggregate.** A 397B-param model lands in the *same 7–8/12 band* as a 27B,
a ~30B coder, and an ~11B-active Flash. The interesting signal is per-task and qualitative, not the total.
1. **Scale doesn't move the aggregate.** 397B lands at 23/36 (no-think) and 22/36 (think) — the *same
7–8/12 band* (by per-task majority) as a 27B, a ~30B coder, and an ~11B-active Flash. The interesting
signal is per-task and qualitative, not the total.

2. **Thinking is net −1 for 397B on this suite** (8/12 → 7/12). 11 of 12 cells are identical between modes;
the lone flip is `p3_doc` PASS→FAIL — and it's instructive: *both* modes captured all 8/8 facts
(`fact_coverage 1.0`); think just wrote 721 words against a 700-word limit (no-think: 692). Reasoning
amplified 397B's verbosity and blew a hard constraint with identical content. Thinking was inert
everywhere else — more tokens and turns, same outcomes. **Reasoning bought 397B nothing here.**
2. **Thinking is net −1 for 397B (8/12→7/12 at N=1, 23/36→22/36 at N=3) — but N=3 overturns the N=1
"inert" reading.** At N=1 the loss looked like a single verbosity flip (`p3_doc`). With three
replicates, thinking is not inert — it **redistributes**: it *helps* `p3_market` (no-think 1/3 → think
**3/3**, stabilizing the wobbliest cell, zero runaways) but *hurts* `p3_pm` (2/3 → **0/3**) and `p3_doc`
(2/3 → 1/3, the verbosity-vs-word-limit story). Net −1, but as the sum of real per-task swings (+2, −2,
−1), not a no-op. Reasoning changes *where* 397B succeeds much more than *how often*.

3. **397B is runaway-resistant; Flash is not (at low effort).** All 24 397B cells (both arms) finished
`done_signal` — zero max_tokens/length failures. Step-3.7-Flash **ran away on `p3_market` at low effort**
(hit max_tokens). 397B's reliability edge is real and mode-independent.
3. **N=3 exposes which N=1 verdicts were luck.** Single replicates are noisy on the open-ended cells: in
the no-think arm, `p3_market` (N=1 ✓ → N=3 1/3) and `p3_pm` (N=1 ✗ → N=3 2/3) were single-draw
artifacts; `p3_doc` (✓ → 2/3) wobbles on the word limit. Zero-variance cells (3/3 or 0/3 across all
reps): p1_bugfix, the grounded mid-tier (p2_extract/ci/hallucination/triage), p3_business, and the
consistent fails. **Trust the mid-tier and bugfix; treat market/pm/doc as high-variance.**

4. **397B's distinctive lane is long-form synthesis** (`p3_doc`/`p3_business`/`p3_market` all pass in
no-think) where the 27B was weak — but it's the **slowest and most expensive** way to reach the shared
band (~71 tok/s spanning both GPUs at Q3, vs Flash ~99 tok/s on one engine). Flash is the better default;
397B earns its keep only where synthesis reliability and a single stable setting matter.
4. **397B is runaway-resistant — but no-think market research gets *stuck*.** Zero max_tokens/length
runaways across all 72 cells (the failure mode Step-3.7-Flash showed on `p3_market` at low effort).
69/72 finished `done_signal`; the 3 non-clean exits were **all no-think**: `p3_market` v2 & v3 hit the
500-iter **stuck threshold** (`stuck_no_workspace_change_for_500_iters` — spinning without progress, not
over-generating) and `p3_pm` v1 `model_stopped`. So 397B's pathology is *stalling*, not runaway — and
**thinking eliminates the market stall**: think `p3_market` is a clean 3/3 `done_signal` vs no-think's
1/3 (2 stuck). For market research, reasoning turns a stuck coin-flip into a lock.

5. **Integration tax (a "messy model" finding).** Flash (vLLM) ran the harness out of the box. 397B
5. **Cost still favors Flash.** 397B reaches the shared band at ~71 tok/s spanning both GPUs at Q3, vs
Flash ~99 tok/s on one engine. Flash is the better default; 397B earns its keep where its runaway
resistance + (thinking-on) market-research reliability matter.

6. **Integration tax (a "messy model" finding).** Flash (vLLM) ran the harness out of the box. 397B
(llama.cpp) needed `--reasoning-format none` (default extracts CoT into `reasoning_content`, leaving
`content` empty → agent loop reads a thinking turn as "done" and dies at iter ~3 — invisible to a
thinking-off smoke) plus harness cleanup fixes (non-sudo `rm` on root-owned sandbox/grader leftovers).
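
The empty-`content` failure described in the last item can be guarded against with a check along these lines. This is a sketch under assumptions: the message is taken to be a dict with the `content` and `reasoning_content` fields named above, and the function names are hypothetical; the harness's actual loop is not shown in this document.

```python
def classify_turn(message: dict) -> str:
    """Distinguish a real answer from a reasoning-only turn and a truly empty one."""
    content = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning_content") or "").strip()
    if content:
        return "answer"
    if reasoning:
        return "thinking_only"
    return "empty"


def naive_is_done(message: dict) -> bool:
    # The failure mode described above: empty content is read as "done".
    return not (message.get("content") or "").strip()


def guarded_is_done(message: dict) -> bool:
    # Only stop when the turn carries neither content nor reasoning.
    return classify_turn(message) == "empty"


thinking_turn = {"content": "", "reasoning_content": "Plan: read the failing test first."}

print(naive_is_done(thinking_turn))    # True  (loop would exit early)
print(guarded_is_done(thinking_turn))  # False (loop keeps going)
```

A thinking-on smoke test that feeds one reasoning-only turn through the loop would surface this class of bug, which a thinking-off smoke cannot.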