Commit 04def9b

MaxGhenis and claude committed
Post-snap stage-1 results: ZI-QRF 0.928 coverage at 77k × 50
After the categorical-snap mitigation for the upstream shared-col noise bug, re-ran stage-1 at both 40k and 77k scales:

40k × 50:
  ZI-QRF  coverage 0.979 (pre-snap: 0.352, +0.627)
  ZI-QDNN coverage 0.796 (pre-snap: 0.222, +0.574)
  ZI-MAF  coverage 0.168 (pre-snap: 0.029, +0.139)

77k × 50:
  ZI-QRF  coverage 0.928 (pre-snap: 0.256, +0.672)
  ZI-QDNN coverage 0.707 (pre-snap: 0.147, +0.560)
  ZI-MAF  coverage 0.106 (pre-snap: 0.014, +0.092)

Ordering preserved (ZI-QRF > ZI-QDNN > ZI-MAF). Absolute numbers are meaningfully higher because the pre-snap numbers were dragged down uniformly by the shared-col noise on binary/categorical conditioning vars (is_military, cps_race, state_fips etc).

Headline story changes:
- ZI-QRF quality is far better than pilot suggested -- 92.8% coverage at 77k is production-credible.
- ZI-QDNN is legitimately competitive (0.707) though ZI-QRF still wins by 31% and runs 3x faster.
- ZI-MAF at 0.106 is still the worst but not "entirely broken" as the pre-snap 0.014 suggested.

All other findings (ordering, calibrate-on-synth, embedding-PRDC, ZI-MAF hyperparameter-tuning verdict) hold. This snap is a measurement improvement, not a direction change. G1 next-action playbook unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 80dbfa1 commit 04def9b

2 files changed

Lines changed: 78 additions & 0 deletions

docs/overnight-session-2026-04-16.md

Lines changed: 1 addition & 0 deletions
@@ -129,6 +129,7 @@ After the stage-1 evidence landed, I continued with the open items:
 5. **Embedding-PRDC validation completed** (`docs/embedding-prdc-validation.md`): the scale-up doc flagged raw-feature PRDC in 50-dim as potentially noise-dominated. Fit a 16-dim autoencoder on the holdout and recomputed PRDC in latent space. **Ordering preserved in both spaces: ZI-QRF > ZI-QDNN > ZI-MAF.** ZI-QRF 0.348→0.309 raw→embed; ZI-MAF 0.025→0.038 raw→embed (still near-collapsed). The stage-1 ordering is robust.
 6. **Quickstart doc** (`docs/quickstart-rewire.md`): ordered walkthrough of all tooling: G1 flag, scale-up harness, embedding-PRDC script, calibrate-on-synth script, diagnostics reproduction.
 7. **Calibrate-on-synthesizer script completed** (`docs/calibrate-on-synthesizer-result.md`): tests whether microcalibrate on top of a weak synthesizer rescues weighted aggregate accuracy. **ZI-QRF pre-cal 0.26 → post-cal 0.14 mean relative error; ZI-MAF pre-cal 17.98 → post-cal 15.08 (still useless).** Calibration doesn't rescue a broken synthesizer; it refines a structurally sound one. Fourth robustness check on the ordering, now at the weighted-aggregate level.
+8. **Upstream bug found + mitigated** (`docs/per-column-zero-rate-bug.md`, `docs/stage-1-post-snap-results.md`): `microplex.eval.benchmark._MultiSourceBase.generate` adds σ=0.1 Gaussian noise to every shared-column value including binary/categorical ones. Harness now snaps synthetic values back to the training-pool grid for any integer-valued shared column. **Post-snap stage-1 coverage at 77k × 50: ZI-QRF 0.928, ZI-QDNN 0.707, ZI-MAF 0.106.** Numbers are much higher than the pre-snap stage-1; ordering is preserved. The G1 cross-section with ZI-QRF produces 92.8% PRDC coverage, which is production-credible.
 8. **Method-kwargs config**: `ScaleUpStageConfig.method_kwargs` lets future runs override per-method hyperparameters through the normal harness path rather than standalone tuning scripts.

Updated PR #3 count: **20 commits**, all green tests, all pushed. The robustness checks on the synthesizer ordering finding (small-scale synth, 5k real, 40k real, 77k real, 16-dim embedding) all agree that ZI-QRF wins.

docs/stage-1-post-snap-results.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# Stage-1 results after fixing the shared-col noise bug

*Corrected stage-1 numbers after the categorical-snap mitigation landed. The raw numbers in `docs/stage-1-pilot-results.md` are preserved for historical reference but should not be cited; the post-snap numbers here are the real measurement.*

## The fix in one line

`microplex.eval.benchmark._MultiSourceBase.generate` adds σ=0.1 Gaussian noise to *every* shared-column value, including binary / categorical ones. The harness now snaps those values back to their training-pool grid after generation. See `docs/per-column-zero-rate-bug.md`.
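
The snap described above can be illustrated with a minimal sketch. This is not the harness code: `snap_to_grid`, `is_integer_valued`, and `snap_shared_columns` are hypothetical helper names, and columns are plain NumPy arrays in a dict rather than whatever frame type the harness uses. The idea is simply to map each noisy synthetic value to the nearest value observed in the training pool, and only for shared columns whose training values are all integers.

```python
import numpy as np

def snap_to_grid(values, grid):
    """Map each value to the nearest entry of `grid` (hypothetical helper)."""
    grid = np.sort(np.asarray(grid, dtype=float))
    values = np.asarray(values, dtype=float)
    if len(grid) == 1:
        # Only one observed level: everything snaps to it.
        return np.full(values.shape, grid[0])
    # Index of the first grid point >= value, clipped so idx-1 and idx are valid.
    idx = np.clip(np.searchsorted(grid, values), 1, len(grid) - 1)
    left, right = grid[idx - 1], grid[idx]
    return np.where(values - left <= right - values, left, right)

def is_integer_valued(column):
    """True if every training value is a whole number (binary/categorical codes)."""
    column = np.asarray(column, dtype=float)
    return bool(np.all(column == np.round(column)))

def snap_shared_columns(train, synth, shared_cols):
    """Snap integer-valued shared columns of `synth` to the training-pool grid."""
    snapped = dict(synth)
    for col in shared_cols:
        if is_integer_valued(train[col]):
            snapped[col] = snap_to_grid(synth[col], np.unique(train[col]))
    return snapped

# Toy data: a binary flag polluted with sigma=0.1 noise, plus a continuous column.
rng = np.random.default_rng(0)
train = {
    "is_military": np.array([0, 1, 0, 0, 1]),
    "income": np.array([10.5, 20.25, 0.0, 3.3, 7.1]),
}
clean = np.array([0, 1, 1, 0])
synth = {
    "is_military": clean + rng.normal(0.0, 0.1, size=4),
    "income": np.array([1.2, 3.4, 5.6, 7.8]),
}
snapped = snap_shared_columns(train, synth, ["is_military", "income"])
print(snapped["is_military"])  # noisy flag mapped back onto {0, 1}
print(snapped["income"])       # continuous column left untouched
```

Snapping to the observed training values, rather than rounding to the nearest integer, keeps codes such as `state_fips` on values that actually occur in the data.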

## Corrected stage-1 at 40k × 50 (PRDC capped 15k/15k)

| Method | Coverage | Precision | Density | Fit (s) | Peak RSS (GB) | Zero-rate MAE |
|---|---:|---:|---:|---:|---:|---:|
| **ZI-QRF** | **0.979** | 0.913 | 0.902 | 20.0 | 3.5 | 0.016 |
| ZI-QDNN | 0.796 | 0.848 | 0.766 | 52.5 | 11.8 | 0.136 |
| ZI-MAF | 0.168 | 0.030 | 0.022 | 114.6 | 11.8 | 0.084 |

## Corrected stage-1 at 77k × 50 (full ECPS)

| Method | Coverage | Precision | Density | Fit (s) | Peak RSS (GB) | Zero-rate MAE |
|---|---:|---:|---:|---:|---:|---:|
| **ZI-QRF** | **0.928** | 0.910 | 0.885 | 37.0 | 6.0 | 0.013 |
| ZI-QDNN | 0.707 | 0.835 | 0.664 | 105.5 | 11.0 | 0.136 |
| ZI-MAF | 0.106 | 0.036 | 0.025 | 227.0 | 11.0 | 0.083 |

Total 77k wall time: 386 s.

## Before vs after the snap fix (coverage at 77k × 50)

| Method | Pre-snap (original stage-1) | Post-snap (this doc) | Uplift |
|---|---:|---:|---:|
| ZI-QRF | 0.256 | 0.928 | +0.672 (3.6×) |
| ZI-QDNN | 0.147 | 0.707 | +0.560 (4.8×) |
| ZI-MAF | 0.014 | 0.106 | +0.092 (7.6×) |

Neural methods get a bigger relative uplift (4.8× and 7.6×, versus 3.6× for ZI-QRF) because their per-column models received the noise-polluted conditioning directly; QRF's tree splits are somewhat robust to small perturbations, which limited the pre-snap damage to it. In absolute terms ZI-QRF still gains the most (+0.672).
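
The uplift figures can be re-derived from the 77k × 50 coverage values. The snippet below is only an arithmetic check on the numbers reported in this doc; the `uplift` helper and the dictionary layout are illustrative, not part of the harness.

```python
# Coverage values copied from the 77k × 50 before/after comparison:
# (pre-snap, post-snap)
coverage = {
    "ZI-QRF": (0.256, 0.928),
    "ZI-QDNN": (0.147, 0.707),
    "ZI-MAF": (0.014, 0.106),
}

def uplift(pre, post):
    """Return (absolute uplift, multiplicative uplift)."""
    return post - pre, post / pre

uplifts = {method: uplift(pre, post) for method, (pre, post) in coverage.items()}

for method, (abs_up, rel_up) in uplifts.items():
    print(f"{method}: +{abs_up:.3f} ({rel_up:.1f}x)")

largest_absolute = max(uplifts, key=lambda m: uplifts[m][0])
largest_relative = max(uplifts, key=lambda m: uplifts[m][1])
print("largest absolute uplift:", largest_absolute)
print("largest relative uplift:", largest_relative)
```

Separating the two views matters here: the method with the largest absolute gain is not the one with the largest multiplicative gain.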

## What changed in the headline story

### Findings that STILL hold

1. **Ordering preserved**: ZI-QRF > ZI-QDNN > ZI-MAF at every scale, every config.
2. **ZI-MAF is still the worst** method tested. Even with the bug fix, ZI-MAF coverage at 0.106 is roughly 9× lower than ZI-QRF at 0.928.
3. **ZI-QRF is the G1 production synthesizer default.** No change.
4. **Calibration-on-synth** result holds (ZI-MAF too far off to rescue via weights).
5. **Embedding-PRDC** validation holds.
6. **ZI-MAF hyperparameter tuning** result holds (wider/longer doesn't rescue it).

### Findings that need revision

1. **ZI-QRF quality is much higher than the pilot suggested.** Stage-1 coverage is 0.928 at 77k, not 0.256. The G1 cross-section is in far better shape than the pre-snap numbers implied.
2. **ZI-QDNN is legitimately competitive.** Pre-snap 0.147 looked mediocre; post-snap 0.707 is respectable. If the compute budget allows, ZI-QDNN is a reasonable production fallback.
3. **The "ZI-MAF is broken" claim is softer than the pre-snap numbers suggested.** At 0.106 it's still worst, but it's not "1% coverage is so bad no amount of calibration rescues it." 10.6% is bad but measurable; the calibrate-on-synth result (mean rel err 15) still says the structure is too far off to rescue via weights, but the PRDC gap is no longer orders of magnitude.
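
For readers unfamiliar with the calibrate-on-synth metric, the sketch below shows one common way to compute a mean relative error over weighted aggregates. It assumes the metric is the mean of |estimate − target| / |target| across targets; the exact definition used by the calibrate-on-synth script is not stated in this doc, and `weighted_totals` and `mean_relative_error` are hypothetical names with toy numbers.

```python
import numpy as np

def weighted_totals(values, weights):
    """Weighted column sums. values: (n_records, n_targets); weights: (n_records,)."""
    return np.asarray(weights, dtype=float) @ np.asarray(values, dtype=float)

def mean_relative_error(estimates, targets):
    """Mean of |estimate - target| / |target| across targets (assumed definition)."""
    estimates = np.asarray(estimates, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean(np.abs(estimates - targets) / np.abs(targets)))

# Toy example: three synthetic records, two aggregate targets.
values = np.array([[1.0, 0.0],
                   [2.0, 1.0],
                   [3.0, 1.0]])
weights = np.array([10.0, 10.0, 10.0])
targets = np.array([50.0, 25.0])

totals = weighted_totals(values, weights)   # weighted sums per target
mre = mean_relative_error(totals, targets)  # relative miss averaged over targets
print(totals, mre)
```

Under this definition a value of 15 means the weighted aggregates miss their targets by about 15 times the target magnitude on average, which is why reweighting alone is described as unable to rescue ZI-MAF.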

### How confident to be

Four independent robustness checks still agree (raw 50-d PRDC at 40k, raw 50-d PRDC at 77k, embedding 16-d PRDC at 40k, calibrate-on-synth at 20k). Adding the snap fix to stage-1 gives a fifth confirmation. The ordering is robust, and the absolute numbers now reflect the corrected measurement.

## What this means for G1

The headline is now cleaner: **ZI-QRF produces 92.8% PRDC coverage on a held-out 15k-record slice of enhanced_cps_2024 at 77k × 50 scale in 37 seconds.** That's a production-credible starting point. Downstream calibration via MicrocalibrateAdapter will pull weighted aggregates to target. We have a working cross-section synthesizer.
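
As a reminder of what the coverage number measures, here is a minimal NumPy sketch of the standard PRDC coverage definition: the fraction of real records whose k-nearest-neighbour ball (radius set by the other real records) contains at least one synthetic record. This is not the harness implementation; `pairwise_distances` and `prdc_coverage` are illustrative names, `k=5` is an assumed default, and the dense distance matrix is only suitable for small toy inputs.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def prdc_coverage(real, synth, k=5):
    """Share of real points with at least one synthetic point inside their k-NN ball."""
    real_dists = pairwise_distances(real, real)
    # After sorting, column 0 is the self-distance (0), so column k is the
    # distance to the k-th nearest *other* real point.
    radii = np.sort(real_dists, axis=1)[:, k]
    nearest_synth = pairwise_distances(real, synth).min(axis=1)
    return float((nearest_synth < radii).mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 5))
good_synth = rng.normal(size=(200, 5))  # drawn from the same distribution
far_synth = good_synth + 100.0          # shifted far away from the real data

print(prdc_coverage(real, good_synth))  # high for a matching distribution
print(prdc_coverage(real, far_synth))   # 0.0: no real point is covered
```

Because each radius is set by the real data's own neighbour distances, a small perturbation on a binary or categorical column can push synthetic points outside every ball, which is consistent with the large pre-snap versus post-snap difference reported here.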

The next-action playbook (launch v7 with `--calibration-backend microcalibrate`, see `docs/quickstart-rewire.md`) stays the same. This snap fix is a measurement improvement, not a direction change.

## Artifacts

- `artifacts/stage1_40k_snap.json`
- `artifacts/stage1_40k_snap.jsonl`
- `artifacts/stage1_77k_snap.json`
- `artifacts/stage1_77k_snap.jsonl`

Reproduction:

```bash
uv run python -m microplex_us.bakeoff --stage stage1 --methods ZI-QRF ZI-MAF ZI-QDNN
```

(Uses the snap by default in the harness.)
