You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
feat(autodata): powered accept-rate with CI — settle whether the loop reliably discriminates (#44)
Run a fixed K=32 independent slots (16 each over two non-memorized MoE
papers, samples=4) through the causal-challenger -> refine -> accept loop
and report the accepted-rate with a Wilson 95% CI, the per-slot gap
distribution, and the plain-vs-refined gap-widening with a paired-bootstrap
CI. This settles the n=3 noise (one run 2/3, a re-run 0/3): acceptance is a
real 38% rate, CI [23%, 55%] (12/32), not a coin-flip ~0.
- powered.ts: a fixed-slots harness over the existing buildAutodataDataset
(which already runs exactly `target` independent slots); only the
cross-slot aggregation + the two CIs are added. CIs are agent-eval's
published `wilson` / `pairedBootstrap`, never hand-rolled. Stats are
recomputed from the on-disk per-attempt JSONL so an interrupted run loses
no data. Surfaces challenger-stage (LaTeX-in-JSON) failures separately so
they are not silently miscounted as discrimination rejects.
- powered.test.ts: offline unit coverage for analyzeTrails (denominator =
requested target, per-slot best-gap pairing, the accept-rule decomposition,
multi-doc aggregation).
- docs/results/autodata-live.md: replace the noisy n=3 section with the
powered rate + CI, the gap distribution, and two autopsied accepted
examples (real weak-fails-strong-derives on both docs).
|**target=3 — independent re-run**|**0 / 3**| 0.052 → 0.246 (Δ +0.194) | gap widened the same, but **no slot cleared the bar**; weak scored **0.75** on a near-miss — a competent, correct answer, not a struggle |
70
-
71
-
**What reproduces:** the +0.19–0.20 gap-widening from the fold (both runs). **What does not:** the
72
-
accepted count (0 to 2 of 3). The accept bar requires the weak model to *struggle* (< 0.5), and on
73
-
these MoE-reasoning questions `llama-3.1-8b` is too often competent (0.75) to fall below it — so
74
-
acceptance is close to a coin-flip at n=3. Total live spend ≈ **$0.25** across all runs.
64
+
## The powered result — a real ~38% accept-rate
75
65
76
-
## An autopsied accepted example (real discrimination, both answers read)
66
+
**Design (fixed-slots, not until-N-accepted):** run a fixed K = 32 independent slots (each slot = one
67
+
full challenger → refine → accept cycle), split 16 / 16 across the two docs, samples = 4 per solver
68
+
(stabilise the weak mean), maxRetries = 2 (3 challenger attempts per slot). Record each slot's
69
+
outcome (accept / reject) + best gap, so the rate is bounded-cost and unbiased. Runnable:
70
+
`src/autodata/powered.ts`; per-attempt autopsy JSONL per doc; the CIs are agent-eval's published
71
+
estimators (`wilson` for the binomial accept-rate, `pairedBootstrap` for the paired widening).
77
72
78
-
> **Q:** Walk through how the MoE layer processes a single token. If the router's gating network were
79
-
> broken and always output uniform weights (G(x)_i = 1/8 for all 8 experts), how would the layer's
80
-
> output differ from the intended behavior, and why is this failure mode problematic?
81
-
82
-
-**strong (`gemini-2.5-pro`): [1.00, 1.00, 1.00]** — walks through top-2 routing, then derives that
83
-
uniform weights make the layer average ALL 8 experts (dense, no specialization/sparsity), losing
84
-
the point of the MoE. Correct.
85
-
-**weak (`llama-3.1-8b`): [0.21, 0.27], mean 0.24** — restates the routing steps but does NOT derive
86
-
the failure consequence; it never reaches "all experts averaged → specialization lost."
87
-
88
-
When the gap *does* open, it is real discrimination — not a judge artifact (judge verified above) or
89
-
leakage (the answer is not in the context). **But it does not open reliably.** In the independent
90
-
re-run, the analogous near-miss question drew a *competent* weak answer (0.75): `llama-3.1-8b`
91
-
correctly explained that high positional locality routes consecutive tokens to the same expert →
92
-
over-subscription, and that uniform routing would balance the load. On that draw the 8B reasoned
93
-
fine, so weak ≮ 0.5 and nothing was accepted. The weak model's competence on these questions is the
94
-
variance that makes acceptance a coin-flip.
73
+
| metric | value | read |
74
+
|---|---|---|
75
+
|**accept-rate (headline)**|**38% CI [23%, 55%]** (12 / 32) | excludes ~0 → **reliable, not a coin-flip**|
76
+
| accept-rate (producing slots) | 44% CI [28%, 63%] (12 / 27) | excludes the 5 challenger-stage (LaTeX) failures |
77
+
| — mixtral | 19% CI [7%, 43%] (3 / 16) | the harder doc; still excludes 0 |
78
+
| — deepseek-v3 | 56% CI [33%, 77%] (9 / 16) | the easier-to-discriminate doc |
79
+
| best gap / slot (n=27) | min −0.23 · median **0.42** · p90 0.80 · max 0.95 | how far each slot separated the tiers |
80
+
| plain (first-draft) gap / slot | min −0.23 · median 0.19 · p90 0.61 · max 0.95 | the un-refined baseline |
81
+
|**gap-widening Δ (plain → best-refined)**| mean **+0.103** CI [+0.029, +0.193] (paired bootstrap, n=27) | the fold's lift; **excludes 0** (median Δ 0 — it helps a minority) |
82
+
| weak score / attempt (n=33) | min 0.05 · median **0.55** · max 1.00 | the variance source — competent ~half the time |
83
+
| strong score / attempt (n=33) | min 0.21 · median **0.99** · max 1.00 | the strong solver almost always derives |
0 commit comments