-**397B no-think 82/120, think 72/120; Step-3.7-Flash 7–8/12; 27B-Q4 & Coder-Next-Q4 ~7/12.** Aggregate ties across a ~15× param range — scale doesn't move the total (confirmed at N=10).
-**Thinking is net-negative across a ~15× param range — same mechanism.** 397B think 72/120 < no-think 82/120 (−10), and Qwen3.6-27B-Q4 (N=10) ships **86.8% no-think vs 75% thinking** — both worse with thinking, both via the **`p3_doc` word-limit loop** (397B 9/10→2/10; 27B-thinking `wall_killed`~40%). Reasoning isn't a free upgrade; on constraint-bound synthesis it backfires regardless of size. (Full cross-model think/no-think table in findings.md.)
-**N=10 overturns small-N luck:** `p3_market` no-think flips from 1/3 (N=3, looked like a fail) to 8/10 (clear pass), and is auto-flagged in the stability table. The headline methodological result.
-**Failure temperament tracks lineage, not size:** 397B + 27B *stall* (never over-generate); Coder-Next + Flash *run away*. Zero max_tokens runaways across all 240 397B cells.
-**Cross-model uses clean Q4/AWQ refs** for 27B/Coder; fresh Q8/FP8 runs excluded as serving failures (documented, not faked).
-**GPU power:** combined both-GPU draw never within 5% of the 1200W cap (median 670W, max 985W=82%); GPU0 leads GPU1 — pipeline alternation. The pair never hits full power together.
- The substance is qualitative — **read [QUALITATIVE.md](QUALITATIVE.md).**
**N=10** scorecard, replicate-stability analysis, think-vs-no-think redistribution, and finish-reason audit — auto-generated by `tooling/bench_report.py` from the per-cell `grade.json` / `summary.json` logs. Partial data is rendered as-is (denominators reflect graded replicates, not the target N).
Pass = grade.json `verdict` of PASS or STRUCTURAL_PASS. Cells with no grade.json yet are excluded from the denominator (shown as `x/done`, not `x/N`); an em-dash means no graded replicate exists for that cell. Step/27B/Coder columns are the published comparators carried from the N=3 entry (see tracking issue #29 — phase-1 reference cells may be provisional).
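
The pass rule and denominator convention above can be sketched as follows. This is a minimal illustration, not the code in `tooling/bench_report.py`: the function name `cell_score` and the use of `None` for a replicate with no grade.json yet are assumptions made for the example.

```python
from typing import Optional

# Verdicts that count as a pass, per the definition above.
PASSING = {"PASS", "STRUCTURAL_PASS"}


def cell_score(verdicts: list[Optional[str]]) -> str:
    """Render one scorecard cell as passes/graded.

    `verdicts` holds one entry per replicate; None (an assumption of this
    sketch) stands for a replicate that has no grade.json yet.
    """
    graded = [v for v in verdicts if v is not None]
    if not graded:
        return "\u2014"  # em-dash: no graded replicate exists for the cell
    passes = sum(v in PASSING for v in graded)
    # The denominator is the graded count, not the target N.
    return f"{passes}/{len(graded)}"
```

With three graded replicates out of four, `cell_score(["PASS", "FAIL", None, "STRUCTURAL_PASS"])` renders `2/3` rather than `2/4`, which is the `x/done` behaviour described above.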
## What N reveals — pass-rate stability across replicates
Per task, the pass count over the first v1–3 / v1–5 / v1–10 / v1–10 replicates of each arm. A **⚑ flip** marks a cell whose small-N *majority verdict* (>50% pass) disagrees with its full-N=10 majority verdict — i.e. a small-N read that would have been overturned. `·` = no graded replicate in that window; a tie (exactly 50%) is treated as no-call and never flagged as a flip.
**Flipped cells** (1): p3_market (397B no-think). These are exactly the cells where a small-N verdict would have been luck — trust the high-N read and treat them as high-variance.
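
The flip rule can be sketched as below. This is an illustrative reading of the definition, not the report generator's code; `majority` and `is_flip` are hypothetical names, and the replicate ordering in the example is invented to match the 1/3 and 8/10 counts reported for `p3_market`.

```python
from typing import Optional


def majority(passes: list[bool]) -> Optional[bool]:
    """True = majority pass, False = majority fail, None = no call.

    An empty window or an exact 50% tie yields no call.
    """
    if not passes:
        return None
    doubled = 2 * sum(passes)
    if doubled == len(passes):
        return None
    return doubled > len(passes)


def is_flip(passes: list[bool], window: int) -> bool:
    """Flag a cell whose small-window verdict disagrees with the full-N verdict."""
    small = majority(passes[:window])
    full = majority(passes)
    return small is not None and full is not None and small != full


# Illustrative ordering only: 1 pass in the first 3 replicates, 8 passes in 10.
p3_market = [False, True, False, True, True, True, True, True, True, True]
flipped_at_3 = is_flip(p3_market, 3)
```

Here the N=3 window reads as a majority fail while the full run reads as a majority pass, so `flipped_at_3` is `True`. A window that lands on an exact tie is never flagged, matching the no-call rule.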
## Thinking vs no-think — per-task redistribution
Pass-rate delta (think − no-think) per task. To stay honest while the arms are at different depths, each task's delta is computed on its **common window** — the first `k = min(no-think graded, think graded)` replicates of *both* arms — and also reported as a pass-count delta scaled to that common k. A net-zero aggregate can still hide real per-task swings; this surfaces *where* reasoning moves success.
On matched common windows, thinking **helps 1** task(s) and **hurts 2** (net -10 passes over matched cells). Read the per-task Δ, not a single aggregate — the interesting signal is the redistribution. (Raw per-arm totals at the current — possibly unequal — depths are in the scorecard above.)
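
A minimal sketch of the common-window delta, assuming each arm is a list of per-replicate pass flags in replicate order. The function name and return shape are hypothetical, and the example ordering is invented to match the 9/10 and 2/10 counts reported for 397B on `p3_doc`.

```python
def common_window_delta(no_think: list[bool], think: list[bool]) -> tuple[int, int, float]:
    """Return (k, pass-count delta, pass-rate delta) as think minus no-think.

    Both arms are truncated to their first k = min(graded counts) replicates,
    so a deeper arm is never compared against a shallower one.
    """
    k = min(len(no_think), len(think))
    if k == 0:
        return 0, 0, 0.0  # nothing to compare yet
    nt_passes = sum(no_think[:k])
    th_passes = sum(think[:k])
    count_delta = th_passes - nt_passes
    return k, count_delta, count_delta / k


# Illustrative ordering only: no-think 9/10, think 2/10.
no_think_arm = [True] * 9 + [False]
think_arm = [True, True] + [False] * 8
k, count_delta, rate_delta = common_window_delta(no_think_arm, think_arm)
```

For these inputs the window is `k = 10`, the pass-count delta is `-7`, and the rate delta is `-0.7`. If the thinking arm had only four graded replicates, both arms would be cut to their first four before comparing.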
## Runaway / stall summary (finish_reason per arm)
Count of completed cells by `summary.json` `finish_reason`, per arm, over v1–10. `done_signal` = clean agent-declared completion. `model_stopped` = the model ended the turn without signalling done. `*_runaway` / `max_*` = over-generation. `stuck_*` = spinning without workspace progress until the stuck threshold. Anything that is not `done_signal` is a non-clean exit worth a look.
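
The bucketing described above could be tallied roughly as follows. This is a sketch under assumptions: the bucket labels and function names are invented for illustration, and of the example reasons only `done_signal`, `model_stopped`, `max_tokens`, and `wall_killed` appear in this report; `stuck_no_progress` and `tool_runaway` are hypothetical values chosen to exercise the `stuck_*` and `*_runaway` patterns.

```python
from collections import Counter


def classify(finish_reason: str) -> str:
    """Map a summary.json finish_reason onto the buckets described above."""
    if finish_reason == "done_signal":
        return "clean"
    if finish_reason == "model_stopped":
        return "stopped_without_done"
    if finish_reason.endswith("_runaway") or finish_reason.startswith("max_"):
        return "over_generation"
    if finish_reason.startswith("stuck_"):
        return "stall"
    return "other"  # any reason not covered by the patterns above


def tally(finish_reasons: list[str]) -> Counter:
    """Count completed cells per bucket for one arm."""
    return Counter(classify(r) for r in finish_reasons)


arm = ["done_signal", "done_signal", "model_stopped",
       "max_tokens", "stuck_no_progress", "tool_runaway", "wall_killed"]
counts = tally(arm)
non_clean = sum(n for bucket, n in counts.items() if bucket != "clean")
```

Everything outside the `clean` bucket counts toward `non_clean`, which mirrors the rule that any exit other than `done_signal` deserves a look.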