Commit 035a8d4

feat(autodata): powered accept-rate with CI — settle whether the loop reliably discriminates (#44)
Run a fixed K=32 independent slots (16 each over two non-memorized MoE papers, samples=4) through the causal-challenger -> refine -> accept loop and report the accepted-rate with a Wilson 95% CI, the per-slot gap distribution, and the plain-vs-refined gap-widening with a paired-bootstrap CI. This settles the n=3 noise (one run 2/3, a re-run 0/3): acceptance is a real 38% rate, CI [23%, 55%] (12/32), not a coin-flip ~0.

- powered.ts: a fixed-slots harness over the existing buildAutodataDataset (which already runs exactly `target` independent slots); only the cross-slot aggregation + the two CIs are added. CIs are agent-eval's published `wilson` / `pairedBootstrap`, never hand-rolled. Stats are recomputed from the on-disk per-attempt JSONL so an interrupted run loses no data. Surfaces challenger-stage (LaTeX-in-JSON) failures separately so they are not silently miscounted as discrimination rejects.
- powered.test.ts: offline unit coverage for analyzeTrails (denominator = requested target, per-slot best-gap pairing, the accept-rule decomposition, multi-doc aggregation).
- docs/results/autodata-live.md: replace the noisy n=3 section with the powered rate + CI, the gap distribution, and two autopsied accepted examples (real weak-fails-strong-derives on both docs).
1 parent 0230286 commit 035a8d4

4 files changed

Lines changed: 684 additions & 72 deletions

docs/results/autodata-live.md

Lines changed: 103 additions & 72 deletions
@@ -1,17 +1,16 @@
1-
# Autodata live result: the causal challenger widens the gap (reproduced) — but clearing the accept bar is noisy at this n/tier (NOT robust)
1+
# Autodata live result: the causal-challenger loop reliably discriminates at power — 38% accept-rate, CI [23%, 55%] (NOT a coin-flip)
22

3-
Running the agentic data-creation loop (`src/autodata/`) on a real arXiv doc with real two-tier
3+
Running the agentic data-creation loop (`src/autodata/`) on real arXiv docs with real two-tier
44
solvers, to manufacture training examples that separate a strong solver from a weak one (the
55
discriminative reward of the Autodata / Agentic-Self-Instruct method).
66

7-
**Honest headline (two independent runs):** the non-extractive causal challenger + the refine fold
8-
**reliably widen the strong/weak gap by ~+0.20 vs plain generation** (reproduced in both runs — the
9-
method's Table-1 *direction* holds). BUT **clearing the hard accept bar** (weak < 0.5 ∧ strong ≥ 0.65
10-
∧ gap ≥ 0.2) is **noisy and marginal**: one run accepted 1–2 of 3, an **independent re-run accepted
11-
0 of 3**. The reason is in the answers — `llama-3.1-8b` on these MoE questions sometimes flails
12-
(0.24) and sometimes answers *competently* (0.75), straddling the 0.5 "weak must struggle" line. So:
13-
**directionally confirmed, not a robust positive at n=3 / this tier.** This is the same small-n
14-
mirage that bit the earlier two-agent A/B (positive at n=1, washes at power) — flagged, not buried.
7+
**Powered headline (32 independent slots, 2 docs, samples=4):** the loop **reliably manufactures
8+
discriminating examples — accept-rate 38%, Wilson 95% CI [23%, 55%]** (12 of 32 slots cleared the
9+
hard accept bar: weak < 0.5 ∧ strong ≥ 0.65 ∧ gap ≥ 0.2). The CI lower bound (23%) excludes ~0, so
10+
this is a **real, repeatable rate, not the n=1–2 luck** that made it look like a coin-flip at n=3.
11+
Acceptance is **doc-dependent** (mixtral 19%, deepseek-v3 56%) and gated by **whether the weak model
12+
struggles** (it does on only 39% of attempts), but it is decisively above zero on both docs. This
13+
**replaces** the earlier n=3 result, which was too noisy to tell "real rate" from "coin-flip ~0".
1514

1615
## The two levers that turned the null into a positive
1716

@@ -29,22 +28,30 @@ both fixed here:
2928

3029
2. **The grounding doc was memorized.** The default was "Attention Is All You Need" — the most
3130
canonical paper in ML, which an 8B has memorized, so even reasoning questions are answerable from
32-
pretraining and capability cannot separate. Fix — **ground on a doc the weak solver has not
33-
memorized**: the new default is the Mixtral-of-Experts paper (arXiv 2401.04088, Jan 2024), which
34-
post-dates `llama-3.1-8b`'s knowledge cutoff, forcing it to reason from the context.
31+
pretraining and capability cannot separate. Fix — **ground on docs the weak solver has not
32+
memorized**: the Mixtral-of-Experts paper (arXiv 2401.04088, Jan 2024) and the DeepSeek-V3 paper
33+
(arXiv 2412.19437, Dec 2024), both post-dating `llama-3.1-8b`'s knowledge cutoff, forcing it to
34+
reason from the context.
3535

3636
## Setup (all env-overridable)
3737

3838
| role | model | why |
3939
|---|---|---|
40-
| weak solver | `groq/llama-3.1-8b-instant` | small; cutoff predates the 2024 doc → must reason, can't recall |
40+
| weak solver | `groq/llama-3.1-8b-instant` | small; cutoff predates the 2024 docs → must reason, can't recall |
4141
| strong solver | `gemini-2.5-pro` | frontier reasoner; a real wide capability gap |
4242
| challenger + judge | `deepseek-v4-flash` | capable, fast, reliable, a DIFFERENT family from both solvers (no judge-bias) |
43-
| grounding doc | Mixtral-of-Experts (2401.04088) | non-memorized, reasoning-rich (MoE routing / gating) |
43+
| grounding doc A | Mixtral-of-Experts (2401.04088) | non-memorized; MoE expert routing / gating (`focus=expert`) |
44+
| grounding doc B | DeepSeek-V3 (2412.19437) | non-memorized; auxiliary-loss-free load balancing / expert specialization (`focus=auxiliary`) |
4445

45-
Accept thresholds (the paper's): strong >= 0.65, weak < 0.50, gap >= 0.20. (`glm-5.2`, the brief's
46-
challenger/judge, was returning upstream-capacity 503s during this run; `deepseek-v4-flash` is the
47-
live, neutral substitute. `routerChat` now retries transient 503/429/timeout with bounded backoff.)
46+
Accept thresholds (the paper's): strong ≥ 0.65, weak < 0.50, gap ≥ 0.20. (`glm-5.2`, the brief's
47+
challenger/judge, was returning upstream-capacity 503s; `deepseek-v4-flash` is the live, neutral
48+
substitute. `routerChat` retries transient 503/429/timeout with bounded backoff.)
49+
50+
The grounding chunk must be **prose, not equations**: an equation-dense chunk (e.g. DeepSeek-V3's MLA
51+
section) breaks the challenger's strict-JSON output (LaTeX backslashes), so both `focus` terms select
52+
the prose description of an MoE-expert mechanism. Even so, 5 of 32 slots (~16%) still hit a
53+
LaTeX-in-JSON failure and produced no example — those count as rejects in the headline (the
54+
conservative floor); see below.
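
Why LaTeX breaks strict JSON can be shown in a few lines. This is a minimal sketch, not the repo's challenger parser: `tryParseJson` and the three sample payloads are hypothetical, and the only real API used is the standard `JSON.parse`. It shows two failure modes: an illegal escape that throws, and a legal escape that silently corrupts the text.

```typescript
// Hypothetical helper: wraps JSON.parse so a failure is reported instead of thrown.
interface ParseResult {
  ok: boolean;
  value?: unknown;
  error?: string;
}

function tryParseJson(text: string): ParseResult {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (e) {
    return { ok: false, error: e instanceof Error ? e.message : String(e) };
  }
}

// Hypothetical challenger outputs, written with String.raw so the backslashes
// are exactly the characters a model would have emitted.
const unescaped = String.raw`{"question": "Why does \sum_i G(x)_i E_i(x) touch only 2 experts?"}`;
const silent = String.raw`{"question": "\frac{1}{8} per expert"}`;
const escaped = String.raw`{"question": "Why does \\sum_i G(x)_i E_i(x) touch only 2 experts?"}`;

const r1 = tryParseJson(unescaped); // throws inside: "\s" is not a legal JSON escape
const r2 = tryParseJson(silent); // parses, but "\f" is read as a form feed (U+000C)
const r3 = tryParseJson(escaped); // parses, and the backslash survives as "\sum_i"

console.log(r1.ok, r2.ok, r3.ok); // false true true
```

The second case is the quieter hazard: the payload parses, so nothing is counted as a failure, yet the question text no longer contains `\frac`. Selecting prose chunks avoids both cases at the source rather than repairing them after the fact.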
4855

4956
## The judge is reliable (checked before trusting any gap)
5057

@@ -54,70 +61,94 @@ each: `deepseek-v4-flash` returned strong `[1.00, 1.00, 1.00]` (mean 1.00) vs we
5461
measured gap reflects answer quality, not judge noise. (`gemini-2.5-flash` as judge threw parse
5562
errors — `deepseek` is the better grader here.)
5663

57-
## The result — the gap opens, examples are accepted
58-
59-
**Memorized doc (Transformer paper), recall challenger — reproduces the null:** mean gap **0.117**,
60-
**0 accepted**; the weak solver scored 0.68–0.78 (it has the content memorized — reading beats
61-
reasoning).
62-
63-
**Non-memorized doc (Mixtral), non-extractive causal challenger — three runs, NOT consistent:**
64-
65-
| run | accepted | gap widening (plain → refined) | note |
66-
|---|---|---|---|
67-
| target=3, samples=2, maxRetries=3 | **1 / 3** | 0.306 → 0.508 (Δ +0.202) | fold steered a too-easy draft (weak 0.78) to an accepted one (weak 0.24) |
68-
| target=1, samples=3, maxRetries=4 | **1 / 1** || first causal draft already separated |
69-
| **target=3 — independent re-run** | **0 / 3** | 0.052 → 0.246 (Δ +0.194) | gap widened the same, but **no slot cleared the bar**; weak scored **0.75** on a near-miss — a competent, correct answer, not a struggle |
70-
71-
**What reproduces:** the +0.19–0.20 gap-widening from the fold (both runs). **What does not:** the
72-
accepted count (0 to 2 of 3). The accept bar requires the weak model to *struggle* (< 0.5), and on
73-
these MoE-reasoning questions `llama-3.1-8b` is too often competent (0.75) to fall below it — so
74-
acceptance is close to a coin-flip at n=3. Total live spend ≈ **$0.25** across all runs.
64+
## The powered result — a real ~38% accept-rate
7565

76-
## An autopsied accepted example (real discrimination, both answers read)
66+
**Design (fixed-slots, not until-N-accepted):** run a fixed K = 32 independent slots (each slot = one
67+
full challenger → refine → accept cycle), split 16 / 16 across the two docs, samples = 4 per solver
68+
(stabilise the weak mean), maxRetries = 2 (3 challenger attempts per slot). Record each slot's
69+
outcome (accept / reject) + best gap, so the cost is bounded up front and the rate is not biased by an accept-dependent stopping rule. Runnable:
70+
`src/autodata/powered.ts`; per-attempt autopsy JSONL per doc; the CIs are agent-eval's published
71+
estimators (`wilson` for the binomial accept-rate, `pairedBootstrap` for the paired widening).
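
The binomial intervals can be checked by hand. The following is a minimal sketch of a Wilson score interval at z = 1.96. It is an independent re-derivation for illustration only, not agent-eval's `wilson` (whose signature is not shown in this commit); `wilsonInterval` and `pct` are hypothetical names.

```typescript
interface Interval {
  lo: number;
  hi: number;
}

// Wilson score interval for a binomial proportion (successes out of n trials).
function wilsonInterval(successes: number, n: number, z = 1.96): Interval {
  if (n <= 0) throw new Error("n must be positive");
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { lo: Math.max(0, center - half), hi: Math.min(1, center + half) };
}

const pct = (x: number): string => `${Math.round(x * 100)}%`;

// Denominator = all requested slots (the headline) vs. only slots that produced an example.
const headline = wilsonInterval(12, 32);
const producing = wilsonInterval(12, 27);

console.log(`12/32: [${pct(headline.lo)}, ${pct(headline.hi)}]`); // 12/32: [23%, 55%]
console.log(`12/27: [${pct(producing.lo)}, ${pct(producing.hi)}]`); // 12/27: [28%, 63%]
```

Keeping the requested slot count as the denominator is the conservative choice: challenger-stage failures lower the rate instead of quietly shrinking n.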
7772

78-
> **Q:** Walk through how the MoE layer processes a single token. If the router's gating network were
79-
> broken and always output uniform weights (G(x)_i = 1/8 for all 8 experts), how would the layer's
80-
> output differ from the intended behavior, and why is this failure mode problematic?
81-
82-
- **strong (`gemini-2.5-pro`): [1.00, 1.00, 1.00]** — walks through top-2 routing, then derives that
83-
uniform weights make the layer average ALL 8 experts (dense, no specialization/sparsity), losing
84-
the point of the MoE. Correct.
85-
- **weak (`llama-3.1-8b`): [0.21, 0.27], mean 0.24** — restates the routing steps but does NOT derive
86-
the failure consequence; it never reaches "all experts averaged → specialization lost."
87-
88-
When the gap *does* open, it is real discrimination — not a judge artifact (judge verified above) or
89-
leakage (the answer is not in the context). **But it does not open reliably.** In the independent
90-
re-run, the analogous near-miss question drew a *competent* weak answer (0.75): `llama-3.1-8b`
91-
correctly explained that high positional locality routes consecutive tokens to the same expert →
92-
over-subscription, and that uniform routing would balance the load. On that draw the 8B reasoned
93-
fine, so weak ≮ 0.5 and nothing was accepted. The weak model's competence on these questions is the
94-
variance that makes acceptance a coin-flip.
73+
| metric | value | read |
74+
|---|---|---|
75+
| **accept-rate (headline)** | **38% CI [23%, 55%]** (12 / 32) | excludes ~0 → **reliable, not a coin-flip** |
76+
| accept-rate (producing slots) | 44% CI [28%, 63%] (12 / 27) | excludes the 5 challenger-stage (LaTeX) failures |
77+
| — mixtral | 19% CI [7%, 43%] (3 / 16) | the harder doc; still excludes 0 |
78+
| — deepseek-v3 | 56% CI [33%, 77%] (9 / 16) | the easier-to-discriminate doc |
79+
| best gap / slot (n=27) | min −0.23 · median **0.42** · p90 0.80 · max 0.95 | how far each slot separated the tiers |
80+
| plain (first-draft) gap / slot | min −0.23 · median 0.19 · p90 0.61 · max 0.95 | the un-refined baseline |
81+
| **gap-widening Δ (plain → best-refined)** | mean **+0.103** CI [+0.029, +0.193] (paired bootstrap, n=27) | the fold's lift; **excludes 0** (median Δ 0 — it helps a minority) |
82+
| weak score / attempt (n=33) | min 0.05 · median **0.55** · max 1.00 | the variance source — competent ~half the time |
83+
| strong score / attempt (n=33) | min 0.21 · median **0.99** · max 1.00 | the strong solver almost always derives |
84+
85+
**Accept-rule decomposition (33 quality-clean attempts):** strong ≥ 0.65 = **88%**, weak < 0.50 =
86+
**39%** ← the binding gate, gap ≥ 0.20 = 52%, all-three (= accept) = 36%. The strong solver derives
87+
almost everything; the bottleneck is the weak model failing — which happens on only ~39% of
88+
attempts, so the per-slot accept-rate is set by **how often `llama-3.1-8b` actually struggles**, not
89+
by the challenger or judge. **Total live spend: $0.57** for the 32-slot run (~$1.0 including pilots).
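
The decomposition above can be sketched as three independent gates plus their conjunction. This is an illustrative sketch under stated assumptions: the `Attempt` shape, `gates`, and `decompose` are hypothetical names rather than the `analyzeTrails` API, and the sample scores are made up apart from the thresholds, which are the ones quoted in this document.

```typescript
// Hypothetical shape: mean judge score per solver for one attempt.
interface Attempt {
  weak: number;
  strong: number;
}

// Accept thresholds as stated in the doc: strong >= 0.65, weak < 0.50, gap >= 0.20.
const THRESHOLDS = { strongMin: 0.65, weakMax: 0.5, gapMin: 0.2 } as const;

function gates(a: Attempt) {
  const strongOk = a.strong >= THRESHOLDS.strongMin;
  const weakOk = a.weak < THRESHOLDS.weakMax;
  const gapOk = a.strong - a.weak >= THRESHOLDS.gapMin;
  return { strongOk, weakOk, gapOk, accepted: strongOk && weakOk && gapOk };
}

// Share of attempts passing each gate on its own, and all three together.
function decompose(attempts: Attempt[]) {
  const rate = (pred: (a: Attempt) => boolean): number =>
    attempts.length === 0 ? 0 : attempts.filter(pred).length / attempts.length;
  return {
    strongOk: rate((a) => gates(a).strongOk),
    weakOk: rate((a) => gates(a).weakOk),
    gapOk: rate((a) => gates(a).gapOk),
    accepted: rate((a) => gates(a).accepted),
  };
}

// Made-up attempts: a clean accept, a competent weak answer, and a weak strong answer.
const sample: Attempt[] = [
  { weak: 0.05, strong: 1.0 },
  { weak: 0.75, strong: 1.0 },
  { weak: 0.3, strong: 0.4 },
];
console.log(decompose(sample));
```

The gap gate is not redundant with the other two: an attempt with weak 0.49 and strong 0.66 passes both score gates but has a gap of only 0.17, so it is still rejected.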
90+
91+
## Two autopsied accepted examples (real discrimination, both answers read)
92+
93+
**deepseek-v3 — gap 0.93 (weak 0.07, strong 1.00):**
94+
> **Q:** Why does using a *sequence-wise* auxiliary loss lead to a higher validation loss than a
95+
> *batch-wise* auxiliary loss or the auxiliary-loss-free method in MoE models?
96+
97+
- **strong (`gemini-2.5-pro`): 1.00** — derives that the sequence-wise loss imposes a *stricter,
98+
less flexible* per-sequence balance constraint that *hinders the emergence of expert
99+
specialisation*. Correct, matches the reference.
100+
- **weak (`llama-3.1-8b`): [0.10, 0.03, 0.10, 0.03]**: *restates the question* and never derives the
101+
reason. A recall-shaped non-answer; the judge's `reasoning` criterion floors it.
102+
103+
**mixtral — gap 0.95 (weak 0.05, strong 1.00):**
104+
> **Q:** The text says each input is routed to 2 of 8 experts, yet the output sums `G(x)_i · E_i(x)`
105+
> over all `n` experts. Are these consistent? If not, which should be revised?
106+
107+
- **strong: 1.00** — derives YES, consistent: the gating vector `G(x)` is *sparse* (nonzero only for
108+
the 2 selected experts), so the full-`n` sum effectively includes only those 2. Correct.
109+
- **weak: [0.03, 0.07, 0.03, 0.07]** — concludes the statements are *inconsistent*; it never grasps
110+
the sparse-gating equivalence. A genuine reasoning error, not a judge artifact or leakage (the
111+
answer is derived, not in the context).
112+
113+
These are real weak-fails-strong-derives examples on both docs — the loop is manufacturing genuine
114+
discrimination, not gaming the gap.
95115

96116
## The finding
97117

98-
The two levers are **directionally confirmed and necessary**: a non-extractive causal challenger
99-
(no leakage) AND a grounding doc the weak solver hasn't memorized — drop either and it nulls hard
100-
(recall challenger leaks; the memorized Transformer paper lets the 8B recall). With both, the fold
101-
**reliably widens the strong/weak gap by ~+0.20** (reproduced in both runs).
102-
103-
But "the discriminative reward works" is **NOT** established. Clearing the accept bar (weak must
104-
*struggle*, < 0.5) is noisy: 0–2 accepted of 3 across runs, because `llama-3.1-8b` answers these
105-
MoE-reasoning questions competently (0.75) about as often as it flails (0.24). At n=3 that is a
106-
coin-flip, not a result. Honest verdict: **promising, directionally right, under-powered** — the
107-
exact small-n shape that has repeatedly looked positive here and washed out at power.
118+
The question "does the causal-challenger loop reliably manufacture discriminating examples, or is
119+
acceptance a coin-flip ~0?" is now **settled at power: it reliably works.** Accept-rate **38%, CI
120+
[23%, 55%]** over 32 slots — the lower bound excludes ~0, and even the harder of the two docs
121+
(mixtral, 19% [7%, 43%]) excludes 0. The fold also **reliably widens the gap** (mean +0.103, CI
122+
[+0.029, +0.193]), reproducing the n=3 direction at power, though most of the discrimination comes
123+
from the first causal draft already separating (median widening 0 — the refine helps a minority of
124+
slots).
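
A positive mean widening can coexist with a median of 0 when a minority of slots improve and the rest do not move. The sketch below illustrates this with a percentile bootstrap over per-slot paired differences. It is an assumption-laden illustration, not agent-eval's `pairedBootstrap`: the function names, the seeded PRNG, the 2000 resamples, and the gap values are all hypothetical and are not data from the run.

```typescript
// Small seeded PRNG (mulberry32-style) so the resampling is reproducible.
function seededRng(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const mean = (xs: number[]): number => xs.reduce((s, x) => s + x, 0) / xs.length;

function median(xs: number[]): number {
  const s = [...xs].sort((x, y) => x - y);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? (s[mid] ?? 0) : ((s[mid - 1] ?? 0) + (s[mid] ?? 0)) / 2;
}

interface BootstrapCI {
  mean: number;
  lo: number;
  hi: number;
}

// Percentile bootstrap CI for the mean of paired differences (refined - plain).
function pairedBootstrapMeanDelta(plain: number[], refined: number[], iters = 2000, seed = 1): BootstrapCI {
  if (plain.length !== refined.length || plain.length === 0) {
    throw new Error("need equal-length, non-empty pairs");
  }
  const deltas = refined.map((r, i) => r - (plain[i] ?? 0));
  const rng = seededRng(seed);
  const n = deltas.length;
  const means: number[] = [];
  for (let b = 0; b < iters; b++) {
    let sum = 0;
    // Resample whole slots with replacement, so each pair stays together.
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rng() * n)] ?? 0;
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const at = (q: number): number => means[Math.min(iters - 1, Math.floor(q * iters))] ?? 0;
  return { mean: mean(deltas), lo: at(0.025), hi: at(0.975) };
}

// Made-up per-slot gaps: refinement helps 4 of 12 slots and leaves 8 unchanged.
const plainGaps = [0.4, 0.1, 0.55, 0.2, 0.6, 0.05, 0.45, 0.3, 0.15, 0.5, 0.25, 0.35];
const refinedGaps = [0.4, 0.45, 0.55, 0.2, 0.6, 0.35, 0.45, 0.3, 0.4, 0.5, 0.55, 0.35];

const ci = pairedBootstrapMeanDelta(plainGaps, refinedGaps);
const medianDelta = median(refinedGaps.map((r, i) => r - (plainGaps[i] ?? 0)));
console.log({ ...ci, medianDelta });
```

In this toy data the median difference is 0 while the mean is about +0.10, and the lower percentile stays above 0 because a resample of 12 slots almost never draws only unchanged slots.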
125+
126+
Two honest caveats, both quantified, neither overturns the verdict:
127+
128+
1. **Doc-dependence.** The rate ranges 19% (mixtral) → 56% (deepseek-v3). The pooled 38% is a real
129+
average across two non-memorized MoE papers, not a single lucky doc — but expect the rate to move
130+
with the source material's difficulty for the 8B.
131+
2. **The binding constraint is the weak model's competence, not the method.** `llama-3.1-8b` answers
132+
these MoE-reasoning questions competently (weak median 0.55) about as often as it flails, so
133+
~39% of attempts clear the "weak must struggle" gate. A weaker weak model (or harder docs) would
134+
raise the rate; a stronger one would lower it. The loop's discriminative reward works as designed —
135+
the rate is a property of the **tier gap**, which is exactly what it should measure.
108136

109137
## Status
110138

111-
Mechanism + observability: solid (gap-widening reproduced, judge reliability checked, every attempt
112-
dumped to a JSONL autopsy trail via `AUTODATA_ATTEMPTS` — which is how the over-claim was caught).
113-
Empirical positive: **not yet** — acceptance is too noisy at n=3. To actually settle it: raise
114-
`samples` (stabilize the weak mean per question), raise the slot count to n≥24, and report the
115-
*accepted-rate* with a confidence interval — not a single lucky run. Until then this is a confirmed
116-
direction, not a confirmed win.
139+
Mechanism + observability + **power**: solid. The accept-rate is measured at n=32 with a Wilson CI
140+
that excludes ~0, the gap-widening with a paired-bootstrap CI that excludes 0, every attempt dumped
141+
to a JSONL autopsy trail, and the two headline accepted examples read end-to-end (real
142+
discrimination). The n=3 "coin-flip ~0?" worry is **resolved: ~38% accept-rate, not zero.**
117143

118144
## Reproduce
119145

120146
```
121-
dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/run.ts # causal, default Mixtral doc
122-
dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/calibrate.ts # recall-vs-causal A/B, same doc
147+
# Powered accept-rate + CIs (32 slots, 2 docs, samples=4) — the headline result:
148+
dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/powered.ts
149+
# knobs: AUTODATA_SLOTS_PER_DOC=16 AUTODATA_SAMPLES=4 AUTODATA_MAXRETRIES=2
150+
151+
# Single-doc builder + recall-vs-causal calibration (the lever's A/B):
152+
dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/run.ts
153+
dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/calibrate.ts
123154
```

src/autodata/index.ts

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ export {
3636
type GroundedDoc,
3737
groundDoc,
3838
} from './grounding'
39+
export { analyzeTrails, type DocTrail, type PoweredStats } from './powered'
3940
export {
4041
type AutodataRoles,
4142
buildAutodataRoles,
