|
| 1 | +# Qwen3.5-397B-A17B vs Step-3.7-Flash — qualitative differences |
| 2 | + |
| 3 | +**Status: N=1 / provisional. Both 397B arms complete (no-think 8/12, think 7/12); Flash low/med/high complete.** |
| 4 | +This doc is deliberately *not* a pass/fail scorecard. Pass/fail ties (397B no-think 8/12 |
| 5 | +vs Flash 7–8/12) hide the differences that matter — those live here. Every claim cites the |
| 6 | +cell/file it came from so it's reproducible. See SCORECARD/findings for the quantitative table. |
| 7 | + |
| 8 | +Models: |
| 9 | +- **397B** = Qwen3.5-397B-A17B, UD-Q3_K_XL GGUF, llama.cpp b9014, pipeline (`-sm layer`), ctx 131072, no-think arm unless stated. |
| 10 | +- **Flash** = Step-3.7-Flash-NVFP4, 201B MoE / ~11B active, vLLM, native CUTLASS FP4, reasoning levels low/med/high. |
| 11 | +- Cross-engine + cross-quant: "best-as-each-ships," NOT a clean precision study. |
| 12 | + |
| 13 | +## 1. Token / iteration economy — the sharpest split |
| 14 | +Per-cell from `logs/<cell>/transcript.jsonl` (`iter`, `completion_tokens` shown as "ctok", tool calls, `wall_s`): |
| 15 | + |
| 16 | +| task | 397B no-think | Flash medium (Step-3.7) | |
| 17 | +|---|---|---| |
| 18 | +| p1_bugfix (both PASS) | 110 iters, 29.7k ctok, 463s | **333 iters, 107k ctok, 1222s** | |
| 19 | +| p2_extract (both PASS) | 10 iters, 3.7k ctok | **3 iters, 2.0k ctok** | |
| 20 | +| p2_ci | 42 iters, 6.4k ctok | 28 iters, 13.2k ctok | |
| 21 | +| p3_doc | 20 iters, 12.6k ctok | 11 iters, 9.9k ctok | |
| 22 | +| p3_business | 19 iters, 14.4k ctok | 10 iters, 8.0k ctok | |
| 23 | +| p3_market | 75 iters, 17.6k ctok | 96 iters, 28.4k ctok | |
| 24 | + |
| 25 | +**Read:** on the hard open-ended coding task (`p1_bugfix`) both PASS, but Flash-medium burns ~3.6× the tokens, |
| 26 | +~3× the iterations, and ~2.6× the wall time of 397B no-think. On the *trivial* grounded task (`p2_extract`) it inverts: |
| 27 | +Flash is surgical (3 iters) where 397B plods (10). Flash's reasoning is a double-edged sword: |
| 28 | +crisp when the task is well-bounded, flaily when it's open-ended. 397B no-think is steadier across |
| 29 | +the difficulty range. Both hold ~1 tool call/turn (no thrashing). |
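
The ratios quoted in the read can be re-derived from the table. A minimal sketch, assuming only the `p1_bugfix` and `p2_extract` figures shown above; the dict layout and key names are illustrative and are not the harness's transcript schema:

```python
# Figures copied from the p1_bugfix and p2_extract rows of the table above.
# The dict layout and key names are illustrative, not the harness schema.
CELLS = {
    "p1_bugfix": {
        "397b_nothink": {"iters": 110, "ctok": 29_700, "wall_s": 463},
        "flash_medium": {"iters": 333, "ctok": 107_000, "wall_s": 1222},
    },
    "p2_extract": {
        "397b_nothink": {"iters": 10, "ctok": 3_700},
        "flash_medium": {"iters": 3, "ctok": 2_000},
    },
}


def flash_over_397b(task: str) -> dict:
    """Return Flash-medium / 397B-no-think ratios for every metric both arms report."""
    base = CELLS[task]["397b_nothink"]
    flash = CELLS[task]["flash_medium"]
    return {key: round(flash[key] / base[key], 2) for key in base if key in flash}


if __name__ == "__main__":
    for task in CELLS:
        print(task, flash_over_397b(task))
```

A ratio above 1 means Flash-medium spent more than 397B no-think on that metric; the inversion on `p2_extract` shows up as ratios below 1.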
| 30 | + |
| 31 | +## 2. Same conclusions, different packaging |
| 32 | +`p3_business` (Borealis acquisition review, both PASS): both independently recommended **HOLD** and |
| 33 | +cited the *same* core issues (burn-rate/runway math, thin customer validation, unsubstantiated |
| 34 | +synergies, opaque valuation) — quality parity on the judgment. The form differs: |
| 35 | +- **397B is scaffold-heavy:** 15 concerns in 3 severity tiers, two ADRs (incl. a concern-prioritization |
| 36 | + framework), a navigation README, per-deliverable "omissions" decision docs. (`logs/p3_business_397b-nothink_v1`, done_summary + workspace) |
| 37 | +- **Flash is economical:** same substance as tighter flowing prose, fewer artifacts. (`logs/p3_business_step3p7-medium_v1`) |
| 38 | + |
| 39 | +397B over-documents unprompted (useful if you want an audit trail); Flash says it once and moves on. |
| 40 | + |
| 41 | +## 3. Failure-mode texture (from grade.json sub-scores, not just verdict) |
| 42 | +- **397B `p3_pm` FAIL = under-recall, not hallucination.** workstream_recall 6/6, milestone_recall 5/5, |
| 43 | + decision_recall 3/4, but **risk_recall 2/6** in a clipped 373-word output. It drops items when terse; |
| 44 | + it does not fabricate. Benign failure signature. (`logs/p3_pm_397b-nothink_v1/grade.json`) |
| 45 | +- **397B `p3_writing` FAIL ≈ grader strictness, not bad output.** The legal_summary deliverable is |
| 46 | + accurate and audience-aware (correct incident window, tiered impact: 4 automation-failure / ~24 |
| 47 | + enterprise / ~11,400 general accounts, defensible case-by-case credit recommendation) and it wrote |
| 48 | + ADRs documenting deliberate per-audience omissions. The binary grader rejected it anyway — real |
| 49 | + quality runs ahead of pass rate here. Ties to the known binary-grader-misses-quality caveat. |
| 50 | + (`logs/p3_writing_397b-nothink_v1` workspace) |
| 51 | +- **397B `p1_testwrite` (think) FAIL = a *rule* violation hiding real competence.** After a grader-bug |
| 52 | + fix (see findings.md), the corrected metrics show think-mode wrote tests reaching **99% coverage / 153 |
| 53 | + passing** — strong, capable test-writing. It FAILs only because it edited `logalyzer/` production code, |
| 54 | + violating the task's "only /tests/ may differ" rule (`logalyzer_unchanged: False`). The prior grader bug |
| 55 | + reported `cov=0` and made this look like a flat incapacity ("coverage never improves"). Lesson: a broken |
| 56 | + metric doesn't just mis-score — it **invents the wrong story about why**. The pass/fail bit (FAIL) was |
| 57 | + right by accident; everything it implied about the model was wrong. (`logs/p1_testwrite_397b-think_v1/grade.json`) |
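
Reading sub-scores rather than the verdict can be sketched as below, using the `p3_pm` recall figures quoted in the first bullet. This is a rough illustration: the `floor` cutoff and the `(hit, total)` layout are assumptions, and the real `grade.json` structure is not shown in this doc.

```python
# Sub-scores quoted above for p3_pm (397B no-think), as (hit, total) pairs.
# The pair layout is an assumption; grade.json may store these differently.
P3_PM_RECALL = {
    "workstream_recall": (6, 6),
    "milestone_recall": (5, 5),
    "decision_recall": (3, 4),
    "risk_recall": (2, 6),
}


def weak_dimensions(recall: dict, floor: float = 0.5) -> list:
    """Names of recall dimensions whose hit rate is below `floor`.

    `floor` is a hypothetical cutoff for illustration, not the grader's threshold.
    """
    return [name for name, (hit, total) in recall.items() if hit / total < floor]


if __name__ == "__main__":
    print(weak_dimensions(P3_PM_RECALL))
```

With these numbers only `risk_recall` falls below the cutoff, which is the "drops items, does not fabricate" signature described above.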
| 58 | + |
| 59 | +## 4. Does thinking help 397B? No — net −1, and the loss is revealing |
| 60 | +**397B no-think 8/12 vs 397B think 7/12.** Eleven of twelve cells have the same verdict in both modes; reasoning |
| 61 | +changed exactly one outcome, and made it *worse*: |
| 62 | + |
| 63 | +| flip | no-think | think | cause | |
| 64 | +|---|---|---|---| |
| 65 | +| **p3_doc** | PASS (692w) | **FAIL (721w)** | same fact coverage (8/8), verbosity blew the 700-word limit | |
| 66 | + |
| 67 | +`p3_doc` think captured **all 8/8 facts** (`fact_coverage 1.0`), same as no-think, but wrote 721 words |
| 68 | +against a 700-word limit (`within_word_limit: False`) where no-think landed at 692. Thinking did not make |
| 69 | +it less accurate; it made it **less disciplined about the length constraint**, amplifying 397B's existing |
| 70 | +over-documentation tendency (§2). It also spent more turns getting there (35 vs 20). This is a clean case |
| 71 | +of why pass/fail alone misleads: the think output matches no-think on fact coverage and failed only on |
| 72 | +form. (`logs/p3_doc_397b-{nothink,think}_v1/grade.json`) |
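
How a binary gate turns equal substance into different verdicts can be sketched with the `p3_doc` numbers above. This is an assumption-laden illustration: the `DocGrade` class, the "all checks must hold" rule, and the inclusive limit are hypothetical, and the real grader's logic is not reproduced here.

```python
from dataclasses import dataclass

WORD_LIMIT = 700  # limit quoted in the text


@dataclass
class DocGrade:
    facts_found: int
    facts_total: int
    word_count: int

    @property
    def fact_coverage(self) -> float:
        return self.facts_found / self.facts_total

    @property
    def within_word_limit(self) -> bool:
        # Assumed inclusive limit; the real grader may use a strict comparison.
        return self.word_count <= WORD_LIMIT

    def verdict(self) -> str:
        # Assumed rule: every check must hold for a PASS.
        ok = self.fact_coverage == 1.0 and self.within_word_limit
        return "PASS" if ok else "FAIL"


nothink = DocGrade(facts_found=8, facts_total=8, word_count=692)
think = DocGrade(facts_found=8, facts_total=8, word_count=721)

if __name__ == "__main__":
    print("no-think:", nothink.fact_coverage, nothink.verdict())
    print("think:   ", think.fact_coverage, think.verdict())
```

Both arms report the same `fact_coverage`; only `within_word_limit` differs, and under this assumed rule that single check decides the verdict.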
| 73 | + |
| 74 | +Everywhere else thinking was **inert**: same PASS/FAIL, just more tokens and turns. On this suite, |
| 75 | +reasoning bought 397B nothing. |
| 76 | + |
| 77 | +**Reasoning shape:** 397B thinks in short targeted bursts (`p1_bugfix` think: 16 of 126 turns carry a |
| 78 | +substantial think block, median ~73 reasoning tokens/turn), not long monologues — but uses more turns |
| 79 | +than no-think on the same task (126 vs 110). |
| 80 | + |
| 81 | +**No runaways, either mode.** All 12 think cells finished `done_signal` (no max_tokens/length failures) — |
| 82 | +including `p3_market` (STRUCTURAL_PASS, 56 iters). Contrast Flash, which **ran away on `p3_market` at low |
| 83 | +effort** (hit max_tokens). 397B is runaway-resistant in both modes; Flash's runaway risk is concentrated |
| 84 | +at low reasoning effort. This is a real reliability edge for 397B. |
| 85 | + |
| 86 | +## 5. Integration cost (a "messy model" finding in itself) |
| 87 | +- Flash (vLLM) ran the harness out of the box once launched. |
| 88 | +- 397B (llama.cpp) needed two fixes: **`--reasoning-format none`** (the default extracts CoT into |
| 89 | +  `reasoning_content`, leaving `content` empty, so the agent loop reads a thinking turn as "done" and dies |
| 90 | +  at iter ~3; the no-think smoke test could not catch this) and a **sandbox-cleanup workaround** (non-sudo |
| 91 | +  `rm` fails on root-owned workspace leftovers, and sandbox containers are not force-removed on abnormal |
| 92 | +  exit, which causes name collisions on re-run). Both are harness/engine-integration bugs, not model |
| 93 | +  quality, but they're exactly the "messy" friction MMBT exists to document. The PR should fix both in the harness. |
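
One harness-side guard for the empty-`content` failure could look like the sketch below. `classify_turn` is a hypothetical helper, not the harness's actual loop; it assumes an OpenAI-style assistant message dict with the `content` and `reasoning_content` fields named above, plus the standard `tool_calls` field.

```python
def classify_turn(message: dict) -> str:
    """Classify one assistant message from an OpenAI-style chat response.

    Returns "tool_call", "final", "thinking_only", or "empty".
    Hypothetical helper for illustration; not taken from the harness.
    """
    content = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning_content") or "").strip()

    if message.get("tool_calls"):
        return "tool_call"
    if content:
        return "final"
    if reasoning:
        # CoT was split out and nothing user-visible was produced:
        # the loop should continue rather than treat this turn as "done".
        return "thinking_only"
    return "empty"


if __name__ == "__main__":
    print(classify_turn({"content": "", "reasoning_content": "Let me check the test file."}))
```

An agent loop that only stops on `"final"` would not mistake a thinking-only turn for completion, whichever reasoning format the server is launched with.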
| 94 | + |
| 95 | +## Net take (provisional, no-think only) |
| 96 | +397B no-think is the steady, over-documenting, high-prose-quality one whose misses are omissions; Flash |
| 97 | +is the fast, terse, reasoning-driven one — brilliant when bounded, flaily when not. They agree on |
| 98 | +substance more than the scorecard's "tie" suggests. Flash is the cheaper/faster way to the same band |
| 99 | +(~99 vs ~71 tok/s, one engine vs both GPUs); 397B's case is narrow (long-form synthesis reliability, |
| 100 | +one stable setting, no effort-tuning). The 397B-think arm (7/12, §4) is the apples-to-apples test against |
| 101 | +Flash's reasoning modes; that head-to-head is still pending. |