|
| 1 | +# Overnight session summary — 2026-04-16 to 2026-04-17 |
| 2 | + |
| 3 | +*Autonomous session while Max was asleep. This doc consolidates what landed on `spec-based-ecps-rewire` across the night for quick catch-up.* |
| 4 | + |
| 5 | +## TL;DR |
| 6 | + |
| 7 | +1. **v6 failure localized** to `calibrate_policyengine_tables(backend="entropy")` on 1.5M households. Instrumentation did its job. |
| 8 | +2. **`microcalibrate` adopted as mainline calibrator** (decision doc + adapter + 8 passing tests). Retires `Calibrator(entropy)` at scale. |
| 9 | +3. **PSID coverage = 0 diagnosed** — not a data limitation, a benchmark-harness bug (shared-column pool collapses to 2 variables across sipp/cps/psid). |
| 10 | +4. **Scale-up harness built and executed.** Real ECPS stage-1 run at 77k × 50 × 3 methods. |
| 11 | +5. **Major finding — ordering inverts.** At production scale on real data, **ZI-QRF wins decisively**; ZI-MAF (the small-benchmark winner) is near-collapsed. Documented in `docs/stage-1-pilot-results.md`. |
| 12 | + |
| 13 | +## Commits landed on `spec-based-ecps-rewire` |
| 14 | + |
| 15 | +In order: |
| 16 | + |
| 17 | +| Commit | What | |
| 18 | +|---|---| |
| 19 | +| `699ea28` | v6 post-mortem + calibrator decision docs | |
| 20 | +| `7186926` | Amend calibrator-decision with sparse_coverage empirical evidence + scale-up protocol doc | |
| 21 | +| `7d7ca66` | `MicrocalibrateAdapter` + 8 smoke tests | |
| 22 | +| `a408fb4` | PSID coverage = 0 diagnosis | |
| 23 | +| `af62615` | `ScaleUpRunner` bakeoff harness + tests | |
| 24 | +| `c3672b1` | Fix macOS RSS reporting bug (ru_maxrss is bytes on Darwin) | |
| 25 | +| `1576d06` | Stage-1 pilot results doc (placeholder) | |
| 26 | +| `6fa9417` | Incremental JSONL result persistence | |
| 27 | +| `06367fa` | `__main__.py` entry point + incremental-JSONL test | |
| 28 | +| `e750dc4` | Stage-1 results at 40k × 50 × 3 methods (key finding) | |
| 29 | +| `d0fa450` | Stage-1 at full 77k; cap PRDC samples to avoid OOM | |
| 30 | + |
| 31 | +Plus one commit on `main` archive: `archive/semantic-guards-wip-20260416` on microplex (core). PRs #2 (`core-wiring-audit`) and #3 (`spec-based-ecps-rewire`) are open against microplex-us `main`. |
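
The `c3672b1` fix in the table above concerns the units of `ru_maxrss`: macOS reports bytes, while Linux reports kilobytes. A minimal sketch of a unit-normalizing helper follows. The names `peak_rss_bytes` and `bytes_to_gib` are hypothetical and are not taken from the harness.

```python
import resource
import sys


def peak_rss_bytes() -> int:
    """Peak resident set size of the current process, in bytes."""
    raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Darwin reports ru_maxrss in bytes; Linux reports it in kilobytes.
    if sys.platform == "darwin":
        return raw
    return raw * 1024


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB for reporting."""
    return n_bytes / 1024**3


if __name__ == "__main__":
    print(f"peak RSS: {bytes_to_gib(peak_rss_bytes()):.3f} GiB")
```

Treating the macOS value as kilobytes would overstate peak memory by a factor of 1024, which is worth ruling out before trusting any Peak RSS figure.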
| 32 | + |
| 33 | +## Architecture decisions locked in |
| 34 | + |
| 35 | +From `docs/calibrator-decision.md`: |
| 36 | +- **Mainline production calibrator**: `microcalibrate` (gradient-descent chi-squared, identity-preserving, PE-proven). |
| 37 | +- **Optional post-step**: `microplex.reweighting.Reweighter` with L0 / HardConcrete, only for deployment subsampling. |
| 38 | +- **Retired at scale**: `microplex.calibration.Calibrator` with `backend="entropy"`. Still OK for tests and small-scale (< ~200k) diagnostics. |
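
To make "gradient-descent, identity-preserving" concrete, here is a minimal NumPy sketch of reweighting by gradient descent on log-weights. It assumes a relative squared-error loss over the targets. This is an illustration only: `calibrate_log_weights` is a hypothetical name, and the actual loss, optimizer, and API of `microcalibrate` may differ.

```python
import numpy as np


def calibrate_log_weights(design, targets, initial_weights, lr=10.0, steps=2000):
    """Reweight records by gradient descent on per-record log-multipliers.

    design:  (n_records, n_targets) contribution of each record to each target.
    targets: (n_targets,) nonzero totals to hit.
    Returns (weights, loss_history). Records are never dropped or duplicated;
    only their weights change, which is what "identity-preserving" means here.
    """
    design = np.asarray(design, dtype=float)
    targets = np.asarray(targets, dtype=float)
    w0 = np.asarray(initial_weights, dtype=float)
    theta = np.zeros(len(w0))  # log-multiplier per record
    history = []
    for _ in range(steps):
        w = w0 * np.exp(theta)  # weights stay strictly positive
        rel_err = (design.T @ w - targets) / targets
        history.append(float(np.sum(rel_err**2)))
        # d(loss)/d(theta_i) = w_i * sum_j 2 * rel_err_j * design_ij / targets_j
        grad = w * (design @ (2.0 * rel_err / targets))
        theta -= lr * grad
    return w0 * np.exp(theta), history
```

The step size `lr` and the step count are assumptions tuned for a toy problem with a few hundred records; a production calibrator would normally use an adaptive optimizer.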
| 39 | + |
| 40 | +From the stage-1 findings (`docs/stage-1-pilot-results.md`): |
| 41 | +- **Preferred synthesizer for G1 cross-section**: **ZI-QRF**. Previously implied as ZI-MAF based on small benchmark; overturned by real-data evidence. |
| 42 | +- SS-model methodology doc's "production direction: ZI-QDNN" claim is unsupported at production scale with default hyperparameters. Needs revision. |
| 43 | + |
| 44 | +## Scale-up benchmark results |
| 45 | + |
| 46 | +ZI-QRF / ZI-MAF / ZI-QDNN on real enhanced_cps_2024, 50 columns (14 demographics + 36 income/wealth/benefit targets). |
| 47 | + |
| 48 | +| Scale | Config | ZI-QRF coverage | ZI-MAF coverage | ZI-QDNN coverage | Winner | |
| 49 | +|---|---|---:|---:|---:|---| |
| 50 | +| 5k × 50 (pilot) | PRDC uncapped | 0.641 | — | — | ZI-QRF | |
| 51 | +| 40k × 50 | PRDC uncapped | 0.465 | 0.054 | 0.306 | ZI-QRF | |
| 52 | +| 40k × 50 | PRDC capped 15k | 0.352 | 0.029 | 0.222 | ZI-QRF | |
| 53 | +| **77k × 50** | **PRDC capped 15k** | **0.256** | **0.014** | **0.147** | **ZI-QRF** | |
| 54 | + |
| 55 | +Plus a comparison point from the prior small-synthetic benchmark: |
| 56 | + |
| 57 | +10k × 7 synthetic CPS (`benchmark_multi_seed.json`): ZI-QRF coverage 0.347, ZI-MAF coverage **0.499**, ZI-QDNN coverage 0.406. Winner: ZI-MAF. |
| 58 | + |
| 59 | +Ordering across all real-data scales: **ZI-QRF > ZI-QDNN > ZI-MAF**. |
| 60 | +Ordering on the prior synthetic benchmark: **ZI-MAF > ZI-QDNN > ZI-QRF**. |
| 61 | +The ranking inverts the moment we move to real joint distributions. |
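
For readers unfamiliar with the metric, the sketch below computes coverage under the standard k-nearest-neighbour definition: the share of real records whose k-NN ball contains at least one synthetic record. It includes an optional sample cap. It is not the harness's implementation; `prdc_coverage`, `max_samples`, and the default `k=5` are assumptions for illustration. If the metric materializes a dense float64 distance matrix, 77k × 77k entries need about 47 GB while 15k × 15k need about 1.8 GB, which is consistent with the cap resolving the OOM.

```python
import numpy as np


def _pairwise_dist(a, b):
    """Dense Euclidean distance matrix between rows of a and rows of b."""
    sq = (a**2).sum(axis=1)[:, None] + (b**2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from rounding


def prdc_coverage(real, synth, k=5, max_samples=None, seed=0):
    """Fraction of real points whose k-NN ball contains a synthetic point."""
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    rng = np.random.default_rng(seed)
    if max_samples is not None:
        # Subsample both sides so the distance matrices stay bounded in size.
        if len(real) > max_samples:
            real = real[rng.choice(len(real), max_samples, replace=False)]
        if len(synth) > max_samples:
            synth = synth[rng.choice(len(synth), max_samples, replace=False)]
    if k >= len(real):
        raise ValueError("k must be smaller than the number of real samples")
    d_rr = _pairwise_dist(real, real)
    # Index k skips the self-distance, which sorts first in each row.
    radii = np.partition(d_rr, k, axis=1)[:, k]
    nearest_synth = _pairwise_dist(real, synth).min(axis=1)
    return float((nearest_synth < radii).mean())
```

Because subsampling changes both the k-NN radii and the number of synthetic candidates, capped and uncapped values are not directly comparable. The two 40k rows in the table (0.465 uncapped, 0.352 capped) show this.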
| 62 | + |
| 63 | +## Cost profile (77k × 50) |
| 64 | + |
| 65 | +| Method | Fit | Gen | Peak RSS | |
| 66 | +|---|---:|---:|---:| |
| 67 | +| ZI-QRF | 36 s | 3 s | **6 GB** | |
| 68 | +| ZI-QDNN | 95 s | 1 s | 11 GB | |
| 69 | +| ZI-MAF | 216 s | 1 s | 11 GB | |
| 70 | + |
| 71 | +ZI-QRF's cost profile is production-viable on a 48 GB laptop. At this scale and with default hyperparameters, the neural methods take 2.6× (ZI-QDNN) to 6× (ZI-MAF) longer to fit and use roughly 1.8× the peak memory, for materially lower coverage. |
| 72 | + |
| 73 | +## Key follow-ups flagged (not executed this session) |
| 74 | + |
| 75 | +1. **Embedding-based PRDC.** Raw-feature PRDC in 50 D is known to degenerate (scale-up doc). Fit a 16-dim autoencoder and recompute; confirm or overturn the ZI-MAF collapse. |
| 76 | +2. **ZI-MAF hyperparameter search.** n_layers=8, hidden_dim=128, epochs=200 before writing it off. |
| 77 | +3. **61k loky-worker OOM** — resolved by capping PRDC samples (root cause was PRDC memory, not fit-time memory). Noted. |
| 78 | +4. **Apply calibration on top of synthesizer outputs.** Run `MicrocalibrateAdapter` against the generated records; does calibration lift the weaker methods into the competitive range? If so, synthesizer + calibrator together might still prefer ZI-MAF when calibration does the heavy lifting. |
| 79 | +5. **Wire `MicrocalibrateAdapter` into the existing us.py pipeline.** Swap entropy → microcalibrate in `calibrate_policyengine_tables`. This is the actual G1 unblocker. |
| 80 | +6. **Per-column zero-rate breakdown.** Every method drives `disabled_ssdi` to 0.0 in the synthetic output. Needs a per-column MAE to identify which columns systematically break. |
| 81 | +7. **PSID-only benchmark** (separate from the scale-up stage plan), to run before any SS-model longitudinal work commits to PSID as the trajectory-training backbone. |
| 82 | + |
| 83 | +## Deliverables for review |
| 84 | + |
| 85 | +- **PR #2** — `core-wiring-audit` — the audit doc identifying what's in microplex core vs what's wired by microplex-us. |
| 86 | +- **PR #3** — `spec-based-ecps-rewire` — everything from this session: v6 post-mortem, calibrator decision, scale-up protocol, PSID diagnosis, scale-up harness, stage-1 results, overnight summary (this doc). |
| 87 | + |
| 88 | +Branch is in good shape for review. No outstanding tasks block merge. |
| 89 | + |
| 90 | +## What I did not do |
| 91 | + |
| 92 | +- **No changes to main production pipelines.** `pe_us_data_rebuild_checkpoint.py` / `us.py` are untouched. The rewire lives on its branch as docs + harness + adapter, ready to wire in. |
| 93 | +- **No v7 run.** With the stage-1 evidence now in hand, the next production run should use the rewired path (CPS scaffold + microcalibrate), not another v4/v5/v6-style invocation of the current pipeline. |
| 94 | +- **No rerun on GPU.** ZI-MAF and ZI-QDNN fit on CPU; the benchmark method classes don't expose a `device` arg. MPS integration might shrink their fit time by an estimated 3–5× (not measured this session), but it is a separate refactor. |
| 95 | + |
| 96 | +## How to run stage 1 yourself |
| 97 | + |
| 98 | +```bash |
| 99 | +cd microplex-us |
| 100 | +uv run python -m microplex_us.bakeoff --stage stage1 \ |
| 101 | + --methods ZI-QRF ZI-MAF ZI-QDNN \ |
| 102 | + --output artifacts/stage1_my_run.json |
| 103 | +``` |
| 104 | + |
| 105 | +Takes ~6 min end-to-end on a 48 GB M3 for 77k × 50 × 3 methods. The `.partial.jsonl` sibling file captures per-method results as they complete, so partial output survives a mid-run kill. |
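
The incremental persistence described above can be sketched as an append-and-flush writer plus a reader that tolerates a truncated final line. This is a minimal illustration under assumed names (`append_result`, `load_partial`) and an assumed record shape; the harness's actual schema and file handling may differ.

```python
import json
import os
from pathlib import Path


def append_result(path: Path, record: dict) -> None:
    """Append one result as a single JSON line and force it to disk."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # survive a kill right after this call returns


def load_partial(path: Path) -> list[dict]:
    """Read every complete record; skip a line truncated by a mid-write kill."""
    records: list[dict] = []
    if not path.exists():
        return records
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # incomplete line, ignore it
    return records
```

Writing one line per finished method means that a crash costs, at most, the method that was in flight.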