diff --git a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/QUALITATIVE.md b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/QUALITATIVE.md index 3fd35358..a121f7d5 100644 --- a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/QUALITATIVE.md +++ b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/QUALITATIVE.md @@ -1,6 +1,8 @@ # Qwen3.5-397B-A17B vs Step-3.7-Flash — qualitative differences -**Status: N=1 / provisional. Both 397B arms complete (no-think 8/12, think 7/12); Flash low/med/high complete.** +**Status: N=3. Both 397B arms complete (no-think 23/36, think 22/36); Flash low/med/high complete.** +Per-cell citations below may reference N=1 (`_v1`) artifacts where the behavior is identical across reps; +N=3 pass counts are in findings.md. This doc is deliberately *not* a pass/fail scorecard. Pass/fail ties (397B no-think 8/12 vs Flash 7–8/12) hide the differences that matter — those live here. Every claim cites the cell/file it came from so it's reproducible. See SCORECARD/findings for the quantitative table. @@ -56,32 +58,35 @@ synergies, opaque valuation) — quality parity on the judgment. The form differ metric doesn't just mis-score — it **invents the wrong story about why**. The pass/fail bit (FAIL) was right by accident; everything it implied about the model was wrong. (`logs/p1_testwrite_397b-think_v1/grade.json`) -## 4. Does thinking help 397B? No — net −1, and the loss is revealing -**397B no-think 8/12 vs 397B think 7/12.** Eleven of twelve cells are identical between modes; reasoning -changed exactly one outcome — and made it *worse*: +## 4. Does thinking help 397B? Net −1 — and N=3 shows it *redistributes*, not "does nothing" +**No-think 23/36 vs think 22/36** (N=1 was 8/12 vs 7/12 — net −1 both ways). At **N=1** the loss looked +like a single verbosity flip (`p3_doc`), and it was tempting to call thinking "inert everywhere else." 
+**N=3 refutes that** — thinking moves three cells, in *both* directions:

-| flip | no-think | think | cause |
-|---|---|---|---|
-| **p3_doc** | PASS (692w) | **FAIL (721w)** | identical content, verbosity blew the limit |
+| cell | no-think (N=3) | think (N=3) | Δ | what's happening |
+|---|:--:|:--:|:--:|---|
+| **p3_market** | 1/3 | **3/3** | +2 | thinking *stabilizes* the wobbliest cell — coin-flip → lock, zero runaways |
+| **p3_pm** | 2/3 | **0/3** | −2 | thinking *hurts* project-mgmt synthesis (over-deliberation → worse) |
+| **p3_doc** | 2/3 | 1/3 | −1 | the verbosity story: thinking inflates length, trips the 700-word limit |

-`p3_doc` think captured **all 8/8 facts** (`fact_coverage 1.0`), same as no-think — but wrote 721 words
-against a 700-word limit (`within_word_limit: False`) where no-think landed at 692. Thinking did not make
-it less accurate; it made it **less disciplined about the length constraint**, amplifying 397B's existing
-over-documentation tendency (§2). It also spent more turns getting there (35 vs 20).
-(`logs/p3_doc_397b-{nothink,think}_v1/grade.json`) — a clean case of why pass/fail alone misleads: the
-think output is arguably equal in substance and failed on form.
-
-Everywhere else thinking was **inert**: same PASS/FAIL, just more tokens and turns. On this suite,
-reasoning bought 397B nothing.
+The N=1 `p3_doc` flip was real but *not the whole story* — one draw of a three-way swing. The verbosity
+mechanism is well-captured: at N=1, think `p3_doc` hit **all 8/8 facts** (`fact_coverage 1.0`, same as
+no-think) but wrote 721 words vs the 700 limit (no-think: 692) — equal substance, failed on form
+(`logs/p3_doc_397b-{nothink,think}_v1/grade.json`). The new lesson from N=3: **reasoning changes *where*
+397B succeeds, not *how often*** — it buys market-research reliability at the cost of pm/doc, netting to
+−1. A single-N read would have missed both the gain and the symmetry.
**Reasoning shape:** 397B thinks in short targeted bursts (`p1_bugfix` think: 16 of 126 turns carry a substantial think block, median ~73 reasoning tokens/turn), not long monologues — but uses more turns than no-think on the same task (126 vs 110). -**No runaways, either mode.** All 12 think cells finished `done_signal` (no max_tokens/length failures) — -including `p3_market` (STRUCTURAL_PASS, 56 iters). Contrast Flash, which **ran away on `p3_market` at low -effort** (hit max_tokens). 397B is runaway-resistant in both modes; Flash's runaway risk is concentrated -at low reasoning effort. This is a real reliability edge for 397B. +**No *runaways*, but no-think *stalls* on market research.** Zero max_tokens/length failures across all +72 cells — contrast Flash, which **ran away on `p3_market` at low effort** (hit max_tokens). 397B's +failure mode is the opposite of runaway: 69/72 `done_signal`, and the 3 non-clean exits were all no-think +— `p3_market` v2/v3 hit the 500-iter *stuck* threshold (spinning without progress) and `p3_pm` v1 +`model_stopped`. The think arm is clean (36/36 `done_signal`), and notably `p3_market` think is 3/3 vs +no-think's 2-stuck/1-pass — **thinking eliminates the stall.** So the reliability edge over Flash is real +(no runaways), but it's "stalls quietly," not "always finishes." ## 5. Integration cost (a "messy model" finding in itself) - Flash (vLLM) ran the harness out of the box once launched. @@ -92,10 +97,12 @@ at low reasoning effort. This is a real reliability edge for 397B. re-run name collisions). Both are harness/engine-integration bugs, not model quality — but they're exactly the "messy" friction MMBT exists to document. PR should fix both in the harness. -## Net take (provisional, no-think only) +## Net take (N=3, both arms) 397B no-think is the steady, over-documenting, high-prose-quality one whose misses are omissions; Flash is the fast, terse, reasoning-driven one — brilliant when bounded, flaily when not. 
They agree on substance more than the scorecard's "tie" suggests. Flash is the cheaper/faster way to the same band -(~99 vs ~71 tok/s, one engine vs both GPUs); 397B's case is narrow (long-form synthesis reliability, -one stable setting, no effort-tuning). The 397B-think arm is the apples-to-apples test against Flash's -reasoning modes — pending. +(~99 vs ~71 tok/s, one engine vs both GPUs). 397B's edge is reliability — no max_tokens runaways (its +worst case is a quiet *stall* on no-think market research, which thinking clears to 3/3) — not raw +accuracy. Thinking doesn't raise the aggregate (net −1); it *redistributes* (market up, pm/doc down), so +"turn thinking on" is a per-task call, not a default. Reach for 397B when runaway resistance and +market-research reliability matter; otherwise Flash wins on speed and cost. diff --git a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/README.md b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/README.md index d2731022..bc436ab0 100644 --- a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/README.md +++ b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/README.md @@ -1,18 +1,22 @@ -# Qwen3.5-397B-A17B on 2× RTX PRO 6000 Blackwell — microbench N=1 (+ Step-3.7-Flash comparison) +# Qwen3.5-397B-A17B on 2× RTX PRO 6000 Blackwell — microbench N=3 (+ Step-3.7-Flash comparison) -A large dense-MoE (397B total / ~17B active) run as a GGUF on llama.cpp, benched through the MMBT +A large MoE (397B total / ~17B active) run as a GGUF on llama.cpp, benched through the MMBT 12-family agentic microbench in two reasoning modes (no-think / think) and compared against the Step-3.7-Flash-NVFP4 entry on the same box. -**Provisional, N=1** — one replicate per cell. An N=3 re-run is queued; treat numbers as directional. +**N=3** — three replicates per cell (phase-1 graded with the fixed `phase1_grade.py`). 
 ## TL;DR
-- **397B no-think 8/12, think 7/12; Step-3.7-Flash 7–8/12.** Aggregate ties across a ~15× param range.
-- **Thinking didn't help 397B** (net −1) — the one regression is a verbosity-driven word-limit overrun at
-  identical fact coverage, not a reasoning failure.
-- **397B never ran away** (both arms) where Flash did at low effort — a real reliability edge.
+- **397B no-think 23/36, think 22/36; Step-3.7-Flash 7–8/12.** Aggregate ties across a ~15× param range.
+- **Thinking is net −1, but not inert** (N=3 correction): it *redistributes* — stabilizes `p3_market`
+  (1/3→3/3) while hurting `p3_pm` (2/3→0/3) and `p3_doc` (2/3→1/3). It changes *where* 397B succeeds,
+  not how often.
+- **397B never ran away** (zero max_tokens/length failures across 72 cells) where Flash did at low effort.
+  Its only non-clean exits: 2 no-think `p3_market` reps hit the 500-iter *stuck* threshold + 1 `p3_pm`
+  `model_stopped` — a stalling pathology, not runaway; thinking clears the market stall (think market 3/3).
+- **N=3 matters:** `p3_market`/`p3_pm`/`p3_doc` are high-variance; their N=1 verdicts were single-draw luck.
 - **Flash is the cheaper/faster default** (~99 vs ~71 tok/s, one engine vs both GPUs); 397B's case is
-  narrow (long-form synthesis reliability, one stable setting).
+  reliability (runaway resistance, thinking-on market research).
- The substance is qualitative — **read [QUALITATIVE.md](QUALITATIVE.md).** ## Files diff --git a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/findings.md b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/findings.md index bcfc4005..4233898e 100644 --- a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/findings.md +++ b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/findings.md @@ -1,8 +1,8 @@ -# Qwen3.5-397B-A17B (GGUF, llama.cpp) — microbench N=1, vs Step-3.7-Flash +# Qwen3.5-397B-A17B (GGUF, llama.cpp) — microbench N=3, vs Step-3.7-Flash -**Provisional / N=1.** One replicate per cell — directional, not statistically settled. An N=3 -re-run is queued. Pair this with [QUALITATIVE.md](QUALITATIVE.md): the pass/fail table below ties -across models, and the differences that matter are qualitative. +**N=3** (three replicates per cell). Pair this with [QUALITATIVE.md](QUALITATIVE.md): the pass/fail +table below ties across models, and the differences that matter are qualitative. Phase-1 cells are +graded with the **fixed** `phase1_grade.py` (see grading-correctness note below). ## Setup - **Model:** Qwen3.5-397B-A17B, unsloth UD-Q3_K_XL GGUF (~167 GB on disk, 5 shards). @@ -14,23 +14,23 @@ across models, and the differences that matter are qualitative. `../step3.7-flash-nvfp4-dual-blackwell-2026-05-28/`. 
   Cross-engine + cross-quant: **"best-as-each-ships," not a clean precision study.**

-## Scorecard (N=1)
+## Scorecard (N=3, pass count per cell)

-| task | 397B no-think | 397B think | Step low/med/high | 27B (ref N=3) | Coder (ref N=3) |
+| task | 397B no-think | 397B think | Step low/med/high | 27B (ref) | Coder (ref) |
 |---|:--:|:--:|:--:|:--:|:--:|
-| p1_bugfix | ✓ | ✓ | ✓/✓/✓ | 3/3 | 2/3 |
-| p1_testwrite † | ✗ | ✗ | ✗/✗/✗ | 0/3 † | 0/3 † |
-| p1_refactor † | ✗ | ✗ | ✓/✗/✗ | 0/3 † | 0/3 † |
-| p2_extract | ✓ | ✓ | ✓/✓/✓ | 3/3 | 3/3 |
-| p2_ci | ✓ | ✓ | ✓/✓/✓ | 3/3 | 3/3 |
-| p2_hallucination | ✓ | ✓ | ✓/✓/✓ | 3/3 | 1/3 |
-| p2_triage | ✓ | ✓ | ~/✓/✓ | 3/3 | 3/3 |
-| p3_doc | ✓ | **✗** | ~/✓/✓ | 0/3 | 2/3 |
-| p3_business | ✓ | ✓ | ✓/~/✗ | 2/3 | 3/3 |
-| p3_market * | ✓ | ✓ | **✗**/✓/✓ | 3/3 * | 0/3 |
-| p3_writing | ✗ | ✗ | ✗/~/✗ | 0/3 | 2/3 |
-| p3_pm | ✗ | ✗ | ✗/~/✓ | 0/3 | 1/3 |
-| **Total** | **8/12** | **7/12** | **7 / 8 / 8** | ~7/12 | ~7/12 |
+| p1_bugfix | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 2/3 |
+| p1_testwrite † | 0/3 | 0/3 | ✗/✗/✗ | 0/3 † | 0/3 † |
+| p1_refactor † | 0/3 | 0/3 | ✓/✗/✗ | 0/3 † | 0/3 † |
+| p2_extract | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 3/3 |
+| p2_ci | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 3/3 |
+| p2_hallucination | 3/3 | 3/3 | ✓/✓/✓ | 3/3 | 1/3 |
+| p2_triage | 3/3 | 3/3 | ~/✓/✓ | 3/3 | 3/3 |
+| p3_doc | 2/3 | **1/3** | ~/✓/✓ | 0/3 | 2/3 |
+| p3_business | 3/3 | 3/3 | ✓/~/✗ | 2/3 | 3/3 |
+| p3_market * | **1/3** | **3/3** | ✗/✓/✓ | 3/3 * | 0/3 |
+| p3_writing | 0/3 | 0/3 | ✗/~/✗ | 0/3 | 2/3 |
+| p3_pm | **2/3** | **0/3** | ✗/~/✓ | 0/3 | 1/3 |
+| **Total** | **23/36** | **22/36** | **7 / 8 / 8** | ~7/12 | ~7/12 |

 † `p1_refactor` fails on structure (no `output/` subpackage created), not the model's competence at the core edit. `p1_testwrite` — see the grading-correctness note below; the earlier "task-design" framing was
@@ -40,41 +40,51 @@ partly a grader artifact.
\* `p3_market` is graded STRUCTURAL_PASS (citation val A review caught that `phase1_grade.py` read flat keys (`coverage_pct`, `ruff_issues`, `benchmark_s`) while `code_task_grader.py` writes nested ones (`coverage.line_coverage_pct`, `ruff.issue_count`, `benchmark.elapsed_s`). Effect: `p1_bugfix`'s ruff/benchmark gates were silently always-true, and -`p1_testwrite`'s coverage gate was always-false. Fixed and **all phase-1 cells regraded**. Outcome: -- **Totals unchanged (8/12 / 7/12)** — but now *trustworthy*, not coincidental. -- `p1_bugfix` PASS is now genuinely validated: ruff 2→0 and benchmark **11.2s→0.537s** (the planted O(n²) - fix) are real and pass — they were previously ignored. +`p1_testwrite`'s coverage gate was always-false. Fixed and **all phase-1 cells regraded** (N=1 and N=3): +- `p1_bugfix` PASS is genuinely validated: ruff 2→0 and benchmark **11.2s→0.537s** (the planted O(n²) + fix) are real and pass — they were previously ignored. Consistent 3/3 both arms. - `p1_testwrite` still FAILs, but the **reason flips**: think-mode actually achieved **99% coverage / 153 passing tests** (the broken grader reported `cov=0` and hid it); it fails only on `logalyzer_unchanged` (it edited production code, violating the "only /tests/ may differ" rule). The model is *capable* here — the task constraint, not incapacity, is what fails it. The inherited † "task-design" footnote on testwrite is misleading and should be re-examined for the published 27B/Coder cells too. -- ⚠️ **The 27B / Coder reference columns in the scorecard above predate this fix.** Their phase-1 - (bug-fixing / test-writing) numbers came from the same buggy grader, so they may be wrong — testwrite - especially is likely a guaranteed-FAIL artifact. **Historical phase-1 scores may need regrading; see - tracking issue #29.** Treat the reference columns' p1_* cells as provisional until that lands. 
+- ⚠️ **The 27B / Coder reference columns predate this fix.** Their phase-1 (bug-fixing / test-writing) + numbers came from the same buggy grader, so they may be wrong — testwrite especially is likely a + guaranteed-FAIL artifact. **Historical phase-1 scores may need regrading; see tracking issue #29.** + Treat the reference columns' p1_* cells as provisional until that lands. ## Headline findings -1. **Scale doesn't move the aggregate.** A 397B-param model lands in the *same 7–8/12 band* as a 27B, - a ~30B coder, and an ~11B-active Flash. The interesting signal is per-task and qualitative, not the total. +1. **Scale doesn't move the aggregate.** 397B lands at 23/36 (no-think) and 22/36 (think) — the *same + 7–8/12 band* (by per-task majority) as a 27B, a ~30B coder, and an ~11B-active Flash. The interesting + signal is per-task and qualitative, not the total. -2. **Thinking is net −1 for 397B on this suite** (8/12 → 7/12). 11 of 12 cells are identical between modes; - the lone flip is `p3_doc` PASS→FAIL — and it's instructive: *both* modes captured all 8/8 facts - (`fact_coverage 1.0`); think just wrote 721 words against a 700-word limit (no-think: 692). Reasoning - amplified 397B's verbosity and blew a hard constraint with identical content. Thinking was inert - everywhere else — more tokens and turns, same outcomes. **Reasoning bought 397B nothing here.** +2. **Thinking is net −1 for 397B (8/12→7/12 at N=1, 23/36→22/36 at N=3) — but N=3 overturns the N=1 + "inert" reading.** At N=1 the loss looked like a single verbosity flip (`p3_doc`). With three + replicates, thinking is not inert — it **redistributes**: it *helps* `p3_market` (no-think 1/3 → think + **3/3**, stabilizing the wobbliest cell, zero runaways) but *hurts* `p3_pm` (2/3 → **0/3**) and `p3_doc` + (2/3 → 1/3, the verbosity-vs-word-limit story). Net −1, but as a wash of real per-task swings, not a + no-op. Reasoning changes *where* 397B succeeds without changing *how often*. -3. 
**397B is runaway-resistant; Flash is not (at low effort).** All 24 397B cells (both arms) finished - `done_signal` — zero max_tokens/length failures. Step-3.7-Flash **ran away on `p3_market` at low effort** - (hit max_tokens). 397B's reliability edge is real and mode-independent. +3. **N=3 exposes which N=1 verdicts were luck.** Single replicates are noisy on the open-ended cells: in + the no-think arm, `p3_market` (N=1 ✓ → N=3 1/3) and `p3_pm` (N=1 ✗ → N=3 2/3) were single-draw + artifacts; `p3_doc` (✓ → 2/3) wobbles on the word limit. Zero-variance cells (3/3 or 0/3 across all + reps): p1_bugfix, the grounded mid-tier (p2_extract/ci/hallucination/triage), p3_business, and the + consistent fails. **Trust the mid-tier and bugfix; treat market/pm/doc as high-variance.** -4. **397B's distinctive lane is long-form synthesis** (`p3_doc`/`p3_business`/`p3_market` all pass in - no-think) where the 27B was weak — but it's the **slowest and most expensive** way to reach the shared - band (~71 tok/s spanning both GPUs at Q3, vs Flash ~99 tok/s on one engine). Flash is the better default; - 397B earns its keep only where synthesis reliability and a single stable setting matter. +4. **397B is runaway-resistant — but no-think market research gets *stuck*.** Zero max_tokens/length + runaways across all 72 cells (the failure mode Step-3.7-Flash showed on `p3_market` at low effort). + 69/72 finished `done_signal`; the 3 non-clean exits were **all no-think**: `p3_market` v2 & v3 hit the + 500-iter **stuck threshold** (`stuck_no_workspace_change_for_500_iters` — spinning without progress, not + over-generating) and `p3_pm` v1 `model_stopped`. So 397B's pathology is *stalling*, not runaway — and + **thinking eliminates the market stall**: think `p3_market` is a clean 3/3 `done_signal` vs no-think's + 1/3 (2 stuck). For market research, reasoning turns a stuck coin-flip into a lock. -5. **Integration tax (a "messy model" finding).** Flash (vLLM) ran the harness out of the box. 
397B +5. **Cost still favors Flash.** 397B reaches the shared band at ~71 tok/s spanning both GPUs at Q3, vs + Flash ~99 tok/s on one engine. Flash is the better default; 397B earns its keep where its runaway + resistance + (thinking-on) market-research reliability matter. + +6. **Integration tax (a "messy model" finding).** Flash (vLLM) ran the harness out of the box. 397B (llama.cpp) needed `--reasoning-format none` (default extracts CoT into `reasoning_content`, leaving `content` empty → agent loop reads a thinking turn as "done" and dies at iter ~3 — invisible to a thinking-off smoke) plus harness cleanup fixes (non-sudo `rm` on root-owned sandbox/grader leftovers). diff --git a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/manifest.json b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/manifest.json index 03e494b3..b89f7c6c 100644 --- a/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/manifest.json +++ b/hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/manifest.json @@ -2,8 +2,8 @@ "$schema_note": "Read this BEFORE drawing conclusions. Machine-readable provenance for the qwen3.5-397b-vs-step3.7-flash-2026-05-29 entry. Companion to README.md / findings.md / QUALITATIVE.md.", "bundle_id": "qwen3.5-397b-vs-step3.7-flash-2026-05-29", "snapshot_date_utc": "2026-05-29", - "kind": "agentic microbench (12 task families) — N=1 provisional, two reasoning arms, + cross-model comparison", - "claim_scope": "Quality/behavior of Qwen3.5-397B-A17B (UD-Q3_K_XL GGUF) on llama.cpp through the MMBT 12-family agentic microbench, no-think vs think, on 2x RTX PRO 6000 Blackwell; compared against the Step-3.7-Flash-NVFP4 entry on the same rig. Single rig, ONE replicate per cell (N=1) — directional, not statistically settled. 
N=3 re-run queued.", + "kind": "agentic microbench (12 task families) — N=3, two reasoning arms, + cross-model comparison", + "claim_scope": "Quality/behavior of Qwen3.5-397B-A17B (UD-Q3_K_XL GGUF) on llama.cpp through the MMBT 12-family agentic microbench, no-think vs think, on 2x RTX PRO 6000 Blackwell; compared against the Step-3.7-Flash-NVFP4 entry on the same rig. Single rig, three replicates per cell (N=3). Phase-1 cells graded with the fixed phase1_grade.py.", "hardware": { "host": "Tower2 (ASUS Pro WS WRX90E-SAGE SE, Threadripper PRO 7965WX)", "gpus": "2x NVIDIA RTX PRO 6000 Blackwell Workstation Edition, 96 GB each", @@ -43,20 +43,21 @@ "max_model_len_flag": "--max-model-len 131072 (matches served -c so request max_tokens math doesn't exceed ctx and 400)", "temperature": 0.3, "stuck_threshold": 500, - "n_per_cell": 1, + "n_per_cell": 3, "smoke_gate": "PASS (extraction, thinking off): tool-calling proven in harness, all fields correct, done_signal" }, "results": { - "397b_no_think_pass": "8/12", - "397b_think_pass": "7/12", + "397b_no_think_pass": "23/36 (N=3); 8/12 (N=1)", + "397b_think_pass": "22/36 (N=3); 7/12 (N=1)", "step3p7_flash_pass": {"low": "7/12", "medium": "8/12", "high": "8/12"}, - "think_vs_nothink": "net -1; only flip p3_doc (PASS->FAIL) = verbosity blew 700-word limit at identical 8/8 fact coverage", - "runaways": "none in either 397B arm (all done_signal); Step-3.7-Flash ran away on p3_market at low effort", + "think_vs_nothink": "net -1 at both N=1 and N=3, BUT N=3 shows it redistributes (not inert): p3_market +2 (1/3->3/3, stabilizes), p3_pm -2 (2/3->0/3), p3_doc -1 (verbosity vs 700-word limit).", + "n3_variance": "no-think single-draw luck exposed at N=3: p3_market N1 PASS->1/3, p3_pm N1 FAIL->2/3, p3_doc N1 PASS->2/3. 
Zero-variance: p1_bugfix, p2_extract/ci/hallucination/triage, p3_business (3/3); testwrite/refactor/writing (0/3).",
+    "runaways": "zero max_tokens/length runaways across all 72 cells (Step-3.7-Flash ran away on p3_market at low effort). 69/72 done_signal; the 3 non-clean exits were all no-think: p3_market v2/v3 hit the 500-iter stuck threshold, p3_pm v1 model_stopped. Pathology is stalling, not runaway; think arm 36/36 done_signal and think p3_market 3/3 (thinking clears the stall).",
     "known_artifact": "p1_refactor fails on structure (no output/ subpackage). p1_testwrite fails on the 'only /tests/ may differ' rule — NOT incapacity: corrected grading shows think-mode reached 99% coverage / 153 passing tests.",
     "phase1_grader_fix_2026-05-29": "phase1_grade.py read flat keys (coverage_pct/ruff_issues/benchmark_s) vs code_task_grader's nested coverage.line_coverage_pct/ruff.issue_count/benchmark.elapsed_s -> bugfix ruff/bench gates always-true, testwrite coverage gate always-false. Fixed and ALL phase-1 cells regraded; totals unchanged (8/12, 7/12) but now trustworthy."
   },
   "caveats": [
-    "N=1 per cell — directional only; N=3 re-run queued.",
+    "N=3 per cell. Open-ended cells (p3_market/p3_pm/p3_doc) are high-variance — see n3_variance.",
     "Cross-engine (llama.cpp vs vLLM) AND cross-quant (Q3_K_XL vs NVFP4) vs Step-3.7-Flash: 'best-as-each-ships', NOT a clean precision study.",
     "Some graders are binary and can fail high-quality output on format/length (see p3_writing); pair with QUALITATIVE.md."
   ]
diff --git a/tooling/scripts/run_microbench.sh b/tooling/scripts/run_microbench.sh
index a5cb191f..3526c967 100755
--- a/tooling/scripts/run_microbench.sh
+++ b/tooling/scripts/run_microbench.sh
@@ -76,6 +76,24 @@ if [ -n "$REASONING_EFFORT" ] && [[ "$LABEL" != *"$REASONING_EFFORT"* ]]; then
   exit 2
 fi

+# Same guard for --thinking: run names are keyed by LABEL only, so running
+# --thinking off then on under ONE label silently skips the second arm as
+# "already complete".
Require the mode encoded in the label (nothink / think).
+if [ -n "$THINKING" ]; then
+  if [ "$THINKING" = "off" ]; then
+    want="nothink"; [[ "$LABEL" == *nothink* ]] && ok=1 || ok=0
+  else
+    want="think"; [[ "$LABEL" == *think* && "$LABEL" != *nothink* ]] && ok=1 || ok=0
+  fi
+  if [ "$ok" != "1" ]; then
+    echo "ERROR: --thinking '$THINKING' is set but label '$LABEL' does not encode it ('$want')." >&2
+    echo " Run names are keyed by label only, so --thinking off then on under one label collide" >&2
+    echo " (the second arm is skipped as already-complete). Put the mode in the label, e.g.:" >&2
+    echo " $0 $MODEL $PORT 397b-${want} ${N} \"\" ${THINKING} ${MAXLEN}" >&2
+    exit 2
+  fi
+fi
+
 TOOLING="$(cd "$(dirname "$0")/.." && pwd)"
 REPO_ROOT="$(cd "$TOOLING/.." && pwd)"
 cd "$REPO_ROOT"
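
The label guard added to `run_microbench.sh` above can be exercised in isolation. This is a minimal sketch, assuming `--thinking` only ever takes `off` or `on`. `label_encodes_mode` is a hypothetical helper name that does not exist in the script, and the labels in the loop are illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of run_microbench.sh): succeeds when the label
# encodes the thinking mode, using the same pattern tests as the guard.
label_encodes_mode() {
  local thinking="$1" label="$2"
  if [ "$thinking" = "off" ]; then
    [[ "$label" == *nothink* ]]
  else
    # "nothink" contains "think", so it must be excluded explicitly.
    [[ "$label" == *think* && "$label" != *nothink* ]]
  fi
}

# Illustrative (mode, label) pairs; only the first two should be accepted.
for pair in "off 397b-nothink" "on 397b-think" "on 397b-nothink" "off 397b"; do
  read -r mode label <<< "$pair"
  if label_encodes_mode "$mode" "$label"; then
    echo "--thinking $mode, label $label: accepted"
  else
    echo "--thinking $mode, label $label: rejected"
  fi
done
```

The explicit `!= *nothink*` test matters because a plain `*think*` match would accept a no-think label for a thinking run, which is the run-name collision the guard is meant to prevent.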