Skip to content

Commit 7a21044

Browse files
Merge pull request #28 from Light-Heart-Labs/add-qwen3.5-397b-microbench-2026-05-29
Qwen3.5-397B-A17B microbench (N=1) + GGUF/thinking tooling support
2 parents 251ed4b + 500c423 commit 7a21044

77 files changed

Lines changed: 2391 additions & 34 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.gitignore

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,2 +1,8 @@
11
# Bytecode in shipped inputs (these get re-generated when the agent runs)
22
tooling/inputs/**/__pycache__/
3+
4+
# Transient run artifacts — raw per-run logs and agent workspaces are not published
5+
tooling/logs/
6+
tooling/workspace/
7+
/logs/
8+
__pycache__/
Lines changed: 101 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,101 @@
1+
# Qwen3.5-397B-A17B vs Step-3.7-Flash — qualitative differences
2+
3+
**Status: N=1 / provisional. Both 397B arms complete (no-think 8/12, think 7/12); Flash low/med/high complete.**
4+
This doc is deliberately *not* a pass/fail scorecard. Pass/fail ties (397B no-think 8/12
5+
vs Flash 7–8/12) hide the differences that matter — those live here. Every claim cites the
6+
cell/file it came from so it's reproducible. See SCORECARD/findings for the quantitative table.
7+
8+
Models:
9+
- **397B** = Qwen3.5-397B-A17B, UD-Q3_K_XL GGUF, llama.cpp b9014, pipeline (`-sm layer`), ctx 131072, no-think arm unless stated.
10+
- **Flash** = Step-3.7-Flash-NVFP4, 201B MoE / ~11B active, vLLM, native CUTLASS FP4, reasoning levels low/med/high.
11+
- Cross-engine + cross-quant: "best-as-each-ships," NOT a clean precision study.
12+
13+
## 1. Token / iteration economy — the sharpest split
14+
Per-cell from `logs/<cell>/transcript.jsonl` (`iter`, `completion_tokens`, tool calls, `wall_s`):
15+
16+
| task | 397B no-think | Step-3.7 medium |
17+
|---|---|---|
18+
| p1_bugfix (both PASS) | 110 iters, 29.7k ctok, 463s | **333 iters, 107k ctok, 1222s** |
19+
| p2_extract (both PASS) | 10 iters, 3.7k ctok | **3 iters, 2.0k ctok** |
20+
| p2_ci | 42i / 6.4k | 28i / 13.2k |
21+
| p3_doc | 20i / 12.6k | 11i / 9.9k |
22+
| p3_business | 19i / 14.4k | 10i / 8.0k |
23+
| p3_market | 75i / 17.6k | 96i / 28.4k |
24+
25+
**Read:** on the hard open-ended coding task both PASS, but Flash-medium burns ~3.6× the tokens
26+
and 3× the iterations of 397B no-think. On the *trivial* grounded task (extraction) it inverts —
27+
Flash is surgical (3 iters) where 397B plods (10). Flash's reasoning is a double-edged sword:
28+
crisp when the task is well-bounded, flaily when it's open-ended. 397B no-think is steadier across
29+
the difficulty range. Both hold ~1 tool call/turn (no thrashing).
30+
31+
## 2. Same conclusions, different packaging
32+
`p3_business` (Borealis acquisition review, both PASS): both independently recommended **HOLD** and
33+
cited the *same* core issues (burn-rate/runway math, thin customer validation, unsubstantiated
34+
synergies, opaque valuation) — quality parity on the judgment. The form differs:
35+
- **397B is scaffold-heavy:** 15 concerns in 3 severity tiers, two ADRs (incl. a concern-prioritization
36+
framework), a navigation README, per-deliverable "omissions" decision docs. (`logs/p3_business_397b-nothink_v1`, done_summary + workspace)
37+
- **Flash is economical:** same substance as tighter flowing prose, fewer artifacts. (`logs/p3_business_step3p7-medium_v1`)
38+
39+
397B over-documents (useful if you want an audit trail, unprompted); Flash says it once and moves on.
40+
41+
## 3. Failure-mode texture (from grade.json sub-scores, not just verdict)
42+
- **397B `p3_pm` FAIL = under-recall, not hallucination.** workstream_recall 6/6, milestone_recall 5/5,
43+
decision_recall 3/4, but **risk_recall 2/6** in a clipped 373-word output. It drops items when terse;
44+
it does not fabricate. Benign failure signature. (`logs/p3_pm_397b-nothink_v1/grade.json`)
45+
- **397B `p3_writing` FAIL ≈ grader strictness, not bad output.** The legal_summary deliverable is
46+
accurate and audience-aware (correct incident window, tiered impact: 4 automation-failure / ~24
47+
enterprise / ~11,400 general accounts, defensible case-by-case credit recommendation) and it wrote
48+
ADRs documenting deliberate per-audience omissions. The binary grader rejected it anyway — real
49+
quality runs ahead of pass rate here. Ties to the known binary-grader-misses-quality caveat.
50+
(`logs/p3_writing_397b-nothink_v1` workspace)
51+
- **397B `p1_testwrite` (think) FAIL = a *rule* violation hiding real competence.** After a grader-bug
52+
fix (see findings.md), the corrected metrics show think-mode wrote tests reaching **99% coverage / 153
53+
passing** — strong, capable test-writing. It FAILs only because it edited `logalyzer/` production code,
54+
violating the task's "only /tests/ may differ" rule (`logalyzer_unchanged: False`). The prior grader bug
55+
reported `cov=0` and made this look like a flat incapacity ("coverage never improves"). Lesson: a broken
56+
metric doesn't just mis-score — it **invents the wrong story about why**. The pass/fail bit (FAIL) was
57+
right by accident; everything it implied about the model was wrong. (`logs/p1_testwrite_397b-think_v1/grade.json`)
58+
59+
## 4. Does thinking help 397B? No — net −1, and the loss is revealing
60+
**397B no-think 8/12 vs 397B think 7/12.** Eleven of twelve cells are identical between modes; reasoning
61+
changed exactly one outcome — and made it *worse*:
62+
63+
| flip | no-think | think | cause |
64+
|---|---|---|---|
65+
| **p3_doc** | PASS (692w) | **FAIL (721w)** | identical content, verbosity blew the limit |
66+
67+
`p3_doc` think captured **all 8/8 facts** (`fact_coverage 1.0`), same as no-think — but wrote 721 words
68+
against a 700-word limit (`within_word_limit: False`) where no-think landed at 692. Thinking did not make
69+
it less accurate; it made it **less disciplined about the length constraint**, amplifying 397B's existing
70+
over-documentation tendency (§2). It also spent more turns getting there (35 vs 20).
71+
(`logs/p3_doc_397b-{nothink,think}_v1/grade.json`) — a clean case of why pass/fail alone misleads: the
72+
think output is arguably equal in substance and failed on form.
73+
74+
Everywhere else thinking was **inert**: same PASS/FAIL, just more tokens and turns. On this suite,
75+
reasoning bought 397B nothing.
76+
77+
**Reasoning shape:** 397B thinks in short targeted bursts (`p1_bugfix` think: 16 of 126 turns carry a
78+
substantial think block, median ~73 reasoning tokens/turn), not long monologues — but uses more turns
79+
than no-think on the same task (126 vs 110).
80+
81+
**No runaways, either mode.** All 12 think cells finished `done_signal` (no max_tokens/length failures) —
82+
including `p3_market` (STRUCTURAL_PASS, 56 iters). Contrast Flash, which **ran away on `p3_market` at low
83+
effort** (hit max_tokens). 397B is runaway-resistant in both modes; Flash's runaway risk is concentrated
84+
at low reasoning effort. This is a real reliability edge for 397B.
85+
86+
## 5. Integration cost (a "messy model" finding in itself)
87+
- Flash (vLLM) ran the harness out of the box once launched.
88+
- 397B (llama.cpp) needed two fixes: **`--reasoning-format none`** (default extracts CoT into
89+
`reasoning_content`, leaving `content` empty → the agent loop reads a thinking turn as "done" and dies
90+
at iter ~3; the no-think smoke could not catch this) and a **sandbox-cleanup workaround** (non-sudo
91+
`rm` fails on root-owned workspace leftovers; sandbox containers not force-removed on abnormal exit →
92+
re-run name collisions). Both are harness/engine-integration bugs, not model quality — but they're
93+
exactly the "messy" friction MMBT exists to document. PR should fix both in the harness.
94+
95+
## Net take (provisional, no-think only)
96+
397B no-think is the steady, over-documenting, high-prose-quality one whose misses are omissions; Flash
97+
is the fast, terse, reasoning-driven one — brilliant when bounded, flaily when not. They agree on
98+
substance more than the scorecard's "tie" suggests. Flash is the cheaper/faster way to the same band
99+
(~99 vs ~71 tok/s, one engine vs both GPUs); 397B's case is narrow (long-form synthesis reliability,
100+
one stable setting, no effort-tuning). The 397B-think arm is the apples-to-apples test against Flash's
101+
reasoning modes — pending.
Lines changed: 53 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
1+
# Qwen3.5-397B-A17B on 2× RTX PRO 6000 Blackwell — microbench N=1 (+ Step-3.7-Flash comparison)
2+
3+
A large dense-MoE (397B total / ~17B active) run as a GGUF on llama.cpp, benched through the MMBT
4+
12-family agentic microbench in two reasoning modes (no-think / think) and compared against the
5+
Step-3.7-Flash-NVFP4 entry on the same box.
6+
7+
**Provisional, N=1** — one replicate per cell. An N=3 re-run is queued; treat numbers as directional.
8+
9+
## TL;DR
10+
- **397B no-think 8/12, think 7/12; Step-3.7-Flash 7–8/12.** Aggregate ties across a ~15× param range.
11+
- **Thinking didn't help 397B** (net −1) — the one regression is a verbosity-driven word-limit overrun at
12+
identical fact coverage, not a reasoning failure.
13+
- **397B never ran away** (both arms) where Flash did at low effort — a real reliability edge.
14+
- **Flash is the cheaper/faster default** (~99 vs ~71 tok/s, one engine vs both GPUs); 397B's case is
15+
narrow (long-form synthesis reliability, one stable setting).
16+
- The substance is qualitative — **read [QUALITATIVE.md](QUALITATIVE.md).**
17+
18+
## Files
19+
- [findings.md](findings.md) — scorecard + headline findings.
20+
- [QUALITATIVE.md](QUALITATIVE.md) — behavioral analysis beyond pass/fail (token economy, packaging,
21+
failure-mode texture, reasoning shape), every claim cited to a cell/file.
22+
- [manifest.json](manifest.json) — models, quant, engine, launch flags, run inventory, dates.
23+
24+
## Reproduce
25+
```bash
26+
# 1. Serve the model (GGUF on llama.cpp). NOTE --reasoning-format none is REQUIRED for the think arm:
27+
# the default extracts chain-of-thought into reasoning_content, leaving content empty, which the
28+
# agentic harness reads as an early stop.
29+
docker run -d --name llama-397b --gpus all --shm-size 16g \
30+
-v $HOME/models:/models:ro -p 127.0.0.1:8001:8000 \
31+
ghcr.io/ggml-org/llama.cpp:server-cuda-b9014 \
32+
-m /models/unsloth-Qwen3.5-397B-A17B-GGUF/UD-Q3_K_XL/Qwen3.5-397B-A17B-UD-Q3_K_XL-00001-of-00005.gguf \
33+
-a qwen3.5-397b-a17b \
34+
-ngl 999 -sm layer -fa on -c 131072 -b 2048 -np 1 --jinja --reasoning-format none \
35+
--host 0.0.0.0 --port 8000
36+
37+
# 2. Smoke-gate (thinking off, declare the served context window):
38+
bash tooling/scripts/smoke_test.sh qwen3.5-397b-a17b 8001 smoke397 "" off 131072
39+
40+
# 3. Both arms (label encodes the mode; --max-model-len matches the served -c):
41+
bash tooling/scripts/run_microbench.sh qwen3.5-397b-a17b 8001 397b-nothink 1 "" off 131072
42+
bash tooling/scripts/run_microbench.sh qwen3.5-397b-a17b 8001 397b-think 1 "" on 131072
43+
44+
# 4. Grade + summarize each label:
45+
bash tooling/scripts/grade_microbench.sh 397b-nothink && bash tooling/scripts/summarize.sh 397b-nothink
46+
bash tooling/scripts/grade_microbench.sh 397b-think && bash tooling/scripts/summarize.sh 397b-think
47+
```
48+
49+
## Hardware / environment
50+
2× NVIDIA RTX PRO 6000 Blackwell Workstation (sm_120, PCIe — no NVLink), TR PRO 7965WX, 252 GB RAM.
51+
llama.cpp pipeline parallel (`-sm layer`); tensor-parallel (`-sm row`) is ~45% slower on decode on
52+
this PCIe-only topology. Single-stream decode ~71–74 tok/s at Q3_K_XL. Power caps lifted (both GPUs
53+
600 W) for the run.
Lines changed: 84 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,84 @@
1+
# Qwen3.5-397B-A17B (GGUF, llama.cpp) — microbench N=1, vs Step-3.7-Flash
2+
3+
**Provisional / N=1.** One replicate per cell — directional, not statistically settled. An N=3
4+
re-run is queued. Pair this with [QUALITATIVE.md](QUALITATIVE.md): the pass/fail table below ties
5+
across models, and the differences that matter are qualitative.
6+
7+
## Setup
8+
- **Model:** Qwen3.5-397B-A17B, unsloth UD-Q3_K_XL GGUF (~167 GB on disk, 5 shards).
9+
- **Engine:** llama.cpp `ghcr.io/ggml-org/llama.cpp:server-cuda-b9014`, pipeline parallel (`-sm layer`),
10+
`-ngl 999 -fa on -c 131072 -b 2048 -np 1 --jinja --reasoning-format none`, 2× RTX PRO 6000 Blackwell.
11+
- **Two arms:** `enable_thinking` off (no-think) and on (think), via the harness `--thinking {on,off}`
12+
flag (sends `chat_template_kwargs.enable_thinking`).
13+
- **Comparison:** Step-3.7-Flash-NVFP4 (vLLM, native CUTLASS FP4) low/med/high — see
14+
`../step3.7-flash-nvfp4-dual-blackwell-2026-05-28/`. Cross-engine + cross-quant: **"best-as-each-ships,"
15+
not a clean precision study.**
16+
17+
## Scorecard (N=1)
18+
19+
| task | 397B no-think | 397B think | Step low/med/high | 27B (ref N=3) | Coder (ref N=3) |
20+
|---|:--:|:--:|:--:|:--:|:--:|
21+
| p1_bugfix ||| ✓/✓/✓ | 3/3 | 2/3 |
22+
| p1_testwrite † ||| ✗/✗/✗ | 0/3 † | 0/3 † |
23+
| p1_refactor † ||| ✓/✗/✗ | 0/3 † | 0/3 † |
24+
| p2_extract ||| ✓/✓/✓ | 3/3 | 3/3 |
25+
| p2_ci ||| ✓/✓/✓ | 3/3 | 3/3 |
26+
| p2_hallucination ||| ✓/✓/✓ | 3/3 | 1/3 |
27+
| p2_triage ||| ~/✓/✓ | 3/3 | 3/3 |
28+
| p3_doc || **** | ~/✓/✓ | 0/3 | 2/3 |
29+
| p3_business ||| ✓/~/✗ | 2/3 | 3/3 |
30+
| p3_market * ||| ****/✓/✓ | 3/3 * | 0/3 |
31+
| p3_writing ||| ✗/~/✗ | 0/3 | 2/3 |
32+
| p3_pm ||| ✗/~/✓ | 0/3 | 1/3 |
33+
| **Total** | **8/12** | **7/12** | **7 / 8 / 8** | ~7/12 | ~7/12 |
34+
35+
`p1_refactor` fails on structure (no `output/` subpackage created), not the model's competence at the
36+
core edit. `p1_testwrite` — see the grading-correctness note below; the earlier "task-design" framing was
37+
partly a grader artifact. \* `p3_market` is graded STRUCTURAL_PASS (citation validity is a hand-grading dimension).
38+
39+
### Grading-correctness fix (post-review, 2026-05-29)
40+
A review caught that `phase1_grade.py` read flat keys (`coverage_pct`, `ruff_issues`, `benchmark_s`) while
41+
`code_task_grader.py` writes nested ones (`coverage.line_coverage_pct`, `ruff.issue_count`,
42+
`benchmark.elapsed_s`). Effect: `p1_bugfix`'s ruff/benchmark gates were silently always-true, and
43+
`p1_testwrite`'s coverage gate was always-false. Fixed and **all phase-1 cells regraded**. Outcome:
44+
- **Totals unchanged (8/12 / 7/12)** — but now *trustworthy*, not coincidental.
45+
- `p1_bugfix` PASS is now genuinely validated: ruff 2→0 and benchmark **11.2s→0.537s** (the planted O(n²)
46+
fix) are real and pass — they were previously ignored.
47+
- `p1_testwrite` still FAILs, but the **reason flips**: think-mode actually achieved **99% coverage / 153
48+
passing tests** (the broken grader reported `cov=0` and hid it); it fails only on `logalyzer_unchanged`
49+
(it edited production code, violating the "only /tests/ may differ" rule). The model is *capable* here —
50+
the task constraint, not incapacity, is what fails it. The inherited † "task-design" footnote on testwrite
51+
is misleading and should be re-examined for the published 27B/Coder cells too.
52+
- ⚠️ **The 27B / Coder reference columns in the scorecard above predate this fix.** Their phase-1
53+
(bug-fixing / test-writing) numbers came from the same buggy grader, so they may be wrong — testwrite
54+
especially is likely a guaranteed-FAIL artifact. **Historical phase-1 scores may need regrading; see
55+
tracking issue #29.** Treat the reference columns' p1_* cells as provisional until that lands.
56+
57+
## Headline findings
58+
59+
1. **Scale doesn't move the aggregate.** A 397B-param model lands in the *same 7–8/12 band* as a 27B,
60+
a ~30B coder, and an ~11B-active Flash. The interesting signal is per-task and qualitative, not the total.
61+
62+
2. **Thinking is net −1 for 397B on this suite** (8/12 → 7/12). 11 of 12 cells are identical between modes;
63+
the lone flip is `p3_doc` PASS→FAIL — and it's instructive: *both* modes captured all 8/8 facts
64+
(`fact_coverage 1.0`); think just wrote 721 words against a 700-word limit (no-think: 692). Reasoning
65+
amplified 397B's verbosity and blew a hard constraint with identical content. Thinking was inert
66+
everywhere else — more tokens and turns, same outcomes. **Reasoning bought 397B nothing here.**
67+
68+
3. **397B is runaway-resistant; Flash is not (at low effort).** All 24 397B cells (both arms) finished
69+
`done_signal` — zero max_tokens/length failures. Step-3.7-Flash **ran away on `p3_market` at low effort**
70+
(hit max_tokens). 397B's reliability edge is real and mode-independent.
71+
72+
4. **397B's distinctive lane is long-form synthesis** (`p3_doc`/`p3_business`/`p3_market` all pass in
73+
no-think) where the 27B was weak — but it's the **slowest and most expensive** way to reach the shared
74+
band (~71 tok/s spanning both GPUs at Q3, vs Flash ~99 tok/s on one engine). Flash is the better default;
75+
397B earns its keep only where synthesis reliability and a single stable setting matter.
76+
77+
5. **Integration tax (a "messy model" finding).** Flash (vLLM) ran the harness out of the box. 397B
78+
(llama.cpp) needed `--reasoning-format none` (default extracts CoT into `reasoning_content`, leaving
79+
`content` empty → agent loop reads a thinking turn as "done" and dies at iter ~3 — invisible to a
80+
thinking-off smoke) plus harness cleanup fixes (non-sudo `rm` on root-owned sandbox/grader leftovers).
81+
All fixed in `tooling/`; see commit history.
82+
83+
See [QUALITATIVE.md](QUALITATIVE.md) for the behavioral analysis (token economy, packaging style,
84+
failure-mode texture, reasoning shape) with per-cell citations.

0 commit comments

Comments
 (0)