Amend calibrator-decision with sparse_coverage evidence; add scale-up plan

MaxGhenis · claude · MaxGhenis · commit 7186926c2c7e · 2026-04-16T23:31:40.000-04:00
calibrator-decision.md:
  - Cites microplex/benchmarks/results/sparse_coverage.csv as empirical
    support: sparse L0 drives rare-subpopulation ratios to 0.00-0.02 at
    10%, 2%, and 1% sparsity (elderly_selfemp, young_dividend), while
    generative synthesis keeps them nonzero at ratios of 1.7-32.3.
  - Adds an explicit scale caveat: sparse_coverage evidence is from
    10k-row synthetic data; the structural pattern (L0 zeros records
    exactly) survives scale-up on mathematical grounds even if
    absolute numbers shift.

synthesizer-benchmark-scale-up.md (new):
  - Records what the existing benchmark_multi_seed.json measures:
    10k rows x 7 columns of SYNTHETIC data. The cps/sipp/psid labels
    are partial-observation schemas over one synthetic population, not
    real sources.
  - Production gap: roughly 1,000-8,000x on (rows x columns) plus the
    synthetic-to-real jump.
  - Predicted failure modes per method at scale (QRF compute-bound
    above 1M rows, MAF tail-coverage risk on top income, QDNN needs
    joint zero-mask head at 150 zero-capable vars, PRDC metric
    degenerates in 150D without embedding).
  - Three-stage scale-up protocol (100k x 50, 1M x 50, 3.4M x 155)
    with matched holdouts, rare-cell preservation tracking, and
    wall-time / RSS measurements per method.
  - Ballpark runtime expectations per method per stage on a 48 GB M3.
  - Flags PSID coverage = 0 as unresolved and must-diagnose before
    any SS-model longitudinal work commits to PSID as the backbone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/docs/calibrator-decision.md b/docs/calibrator-decision.md
@@ -99,6 +99,47 @@ Yes, operationally:
 - Tests and small-scale diagnostics: `Calibrator`.
 - No single-pipeline run crosses all three. Each tool has a distinct and non-overlapping job.
 
+## Empirical support: sparse selection annihilates rare subpopulations
+
+The single cleanest empirical argument for this split comes from
+`microplex/benchmarks/results/sparse_coverage.csv`. Measuring rare-subpopulation
+preservation at varying sparsity levels (lower `coverage_median` = closer to
+oracle):
+
+| Method | `coverage_median` | elderly_selfemp_ratio | young_dividend_ratio |
+|---|---:|---:|---:|
+| Oracle (full) | 0.009 | 0.94 | 1.11 |
+| Generative (10%) | 0.53 | 27.7 | 20.6 |
+| Generative (2%) | 0.42 | 22.1 | 32.3 |
+| Generative (1%) | 0.25 | 7.2 | 1.7 |
+| Weighted (10%) | 0.24 | **0.00** | **0.00** |
+| Weighted (2%) | 0.35 | 0.02 | **0.00** |
+| Weighted (1%) | 0.65 | **0.00** | **0.00** |
+
+Sparse L0 weighting drops both rare subpopulations to **zero or near-zero
+representation** (ratios 0.00–0.02) at every sparsity level tested. Generative
+synthesis keeps them present, though over-represented (ratios 1.7–32.3 against an
+oracle near 1). For policy analysis, where rare subpopulations (elderly self-employed,
+young dividend earners, disability recipients, top-1% earners) drive outsized fiscal
+and distributional effects, sparse-as-mainline is non-viable on accuracy grounds alone.
+
+This empirical pattern reinforces the decision above: L0/sparse selection is a
+**post-calibration deployment tool**, not a calibration method. Apply it after
+the mainline `microcalibrate` run has produced a fully-covered adjusted-weight
+artifact, and only when a downstream consumer needs a small subsample.
+
+### Scale caveat
+
+`sparse_coverage.csv` was produced on **10,000-row synthetic data with ~7
+variables**. Production scale is 1.5M rows × 150+ variables on real joint
+microdata. We should not assume the generative ratios of 20–30× seen here hold at
+that scale: the absolute numbers will shift, and rare-subpopulation
+preservation may tighten for both methods. What is expected to hold is the
+structural pattern: sparse L0 exactly zeros out records, generative synthesis
+does not. The argument against sparse-as-mainline survives any plausible
+scale-up because the failure mode (zero representation of rare cells) is not a
+noise issue; it is built into L0 selection, which zeros out dropped records exactly.
+
 ## What this unblocks
 
 - Migration step 2 of `docs/core-wiring-audit.md`: "Adopt `Calibrator` end-to-end" is revised to "Adopt `microcalibrate` end-to-end as the production calibrator." That becomes the first real code change in `spec-based-ecps-rewire`.
diff --git a/docs/synthesizer-benchmark-scale-up.md b/docs/synthesizer-benchmark-scale-up.md
@@ -0,0 +1,170 @@
+# Synthesizer benchmark — what we know, and what scale-up will test
+
+*Draft plan for extending the existing ZI-synthesizer benchmark to production scale.*
+
+## What the existing benchmark tested
+
+Results in `microplex/benchmarks/results/benchmark_multi_seed.json` compare six synthesizers — QRF, ZI-QRF, QDNN, ZI-QDNN, MAF, ZI-MAF — on PRDC coverage across three schemas labeled `cps`, `sipp`, `psid`.
+
+| Method | `cps` schema coverage | `sipp` schema coverage | `psid` schema coverage |
+|---|---:|---:|---:|
+| QRF | 0.337 | 0.938 | 0.000 |
+| ZI-QRF | 0.347 | **0.950** | 0.000 |
+| QDNN | 0.380 | 0.293 | 0.000 |
+| ZI-QDNN | 0.406 | 0.717 | 0.000 |
+| MAF | 0.398 | 0.349 | 0.000 |
+| ZI-MAF | **0.499** | 0.866 | 0.000 |
+
+**Data used**: synthetic population generated by `benchmarks/run_benchmarks.py::generate_realistic_microdata`, 10,000 rows, **4 target variables** (`income`, `assets`, `debt`, `savings`) conditioned on **3 predictors** (`age`, `education`, `region`). The multi-survey fusion setup partially observes this population as different "surveys" (the CPS schema sees one subset of variables, the SIPP schema another, the PSID schema another).
+
+**Important**: the `cps` / `sipp` / `psid` labels in the result JSON are partial-observation schemas over the same synthetic population, not real CPS / SIPP / PSID data.
+
+## Scale gap to production
+
+| Dimension | Existing benchmark | Production (microplex-us G1) | Gap |
+|---|---:|---:|---:|
+| Rows | 10,000 | 430,000 (CPS) – 3,400,000 (ACS scaffold) | 43×–340× |
+| Columns | 7 (3 cond + 4 target) | 150+ joint variables | ~22× |
+| Source realism | Synthetic generator with analytical zero-inflation | Real CPS + PUF + SIPP + SCF joints with real tail structure | Categorical jump |
+| Held-out set | 20% of synthetic population | TBD — ECPS baseline, external targets (SOI, BEA, Census) | — |
+
+Combined row × column gap: **~1,000×–8,000×** (43 × 22 ≈ 950 at the low end, 340 × 22 ≈ 7,500 at the high end). The synthetic-to-real jump comes on top of this and cannot be expressed as a multiplier, because real data has structure the generator cannot produce.
+
+## What we expect to break at scale
+
+### Coverage metric itself
+
+**PRDC k-NN coverage concentrates in high dimensions.** With 150+ features, nearest-neighbor distances bunch up (curse of dimensionality) and a small distance threshold starts excluding almost everything while a larger one starts including almost everything. Raw-feature PRDC above ~50 columns is typically noise-dominated without dimensionality reduction or a learned embedding.
+
+**Mitigation**: compute PRDC in a learned embedding (autoencoder or the synthesizer's latent space) rather than raw features. Or compute per-block PRDC on demographically-stratified cells. Or switch to a metric that scales better with dimension (MMD with an RBF kernel, or mode-wise Wasserstein).
+
+### ZI-QRF training
+
+**Quantile random forests scale poorly in both rows and columns.**
+
+- Row scaling: train time is roughly O(N log N) per tree; memory is O(N × features × n_trees). On 1.5M rows × 150 cols × 100 trees, that is 2.25 × 10^10 stored values, or ~180 GB at 8 bytes each, for naive storage without sparse leaves. Even with efficient implementations (`quantile-forest`, `lightgbm`-style histogram trees), training time is hours-to-days on CPU for a full run.
+- Column scaling: splits over 150+ features search a larger space of candidate splits; conditional coverage on rare variables gets noisier; `max_features` tuning becomes load-bearing.
+
+**Prediction**: ZI-QRF's dominance on small-SIPP is partly because 500-person panels fit neatly into tree leaves. At 1.5M rows, expect the advantage to narrow or invert — partly because QRF hits practical compute limits and has to subsample.
+
+### ZI-MAF training
+
+**Normalizing flows need careful hyperparameter tuning on real data.**
+
+- Mode-collapse risk: ZI-MAF's joint distribution over 150 variables can collapse onto a lower-dimensional manifold, especially when many variables are zero-inflated with correlated zero patterns (same person has zero across many income sources at once).
+- Training time: MAF is GPU-accelerated and scales linearly in rows. 1.5M rows × 150 cols × 200 epochs is feasible on a single H100, ~several hours. On Apple Silicon (Max's 48 GB M3), ~8–16 hours with MPS backend.
+- Conditioning: the existing benchmark uses 3 condition variables. Real microdata conditions on ~10–20 demographics. Adding conditioning dimensions is the easier part of scaling MAF.
+
+**Prediction**: ZI-MAF's lead on CPS should hold or grow at scale (flows scale well with rows). Main risk is tail coverage — top-1% income, extreme wealth — which is exactly where the SS-model application cares most.
+
+### ZI-QDNN training
+
+**Deep quantile networks scale well but need careful tuning at width + depth.**
+
+- Row scaling: straightforward. Cost is O(N) per epoch, with per-step cost linear in batch size.
+- Column scaling: the pinball loss surface gets jagged with many zero-inflated targets; per-target head design matters more at 150 vars than at 4.
+- Zero-inflation head: a single logistic head for `P(zero)` becomes underpowered at 150 zero-capable variables with complex joint zero patterns (observing income=0 informs dividends=0 informs wages=0). Joint zero-mask modeling is probably needed.
+
+**Prediction**: ZI-QDNN as currently implemented will degrade fastest under scale-up without a joint zero-mask head. Worth testing whether a graph-structured zero-mask extension rescues it.
+
+### PRDC coverage = 0 on PSID across all methods
+
+This is unresolved in the existing benchmark and is the single most important thing to diagnose before the SS-model longitudinal extension commits to PSID. Three hypotheses:
+
+1. **Test-setup degeneracy.** PSID-schema's observed-variable mask may overlap with the CPS / SIPP masks in a way that produces an empty held-out set. Check the mask logic.
+2. **Panel structure breaks per-record PRDC.** PSID is a panel; a "record" could mean a person-year or a person. If the test set uses person-year and the synthesizer generates persons, coverage is trivially 0. Fix: switch to a panel-aware metric (per-person trajectory coverage) or generate person-years.
+3. **Real limitation.** Attrition + sparse-year coverage in PSID creates tail records the synthesizers cannot cover. If this is the case, the SS-model trajectory training must either accept this ceiling, use a different panel source (SIPP panel, HRS, NLSY), or augment PSID with synthetic history.
+
+**Action**: diagnose before any PSID-dependent architecture work commits.
+
+## Proposed scale-up experiment protocol
+
+Run three stages, each keeping row count and column count explicit. All stages report three classes of metric: accuracy (coverage), cost (time + memory), and health (convergence + rare-cell preservation).
+
+### Stage 1 — medium rows, medium columns
+
+Scale: **100,000 rows × 50 columns**
+
+Data: subsample enhanced_cps_2024 to 100k persons, select 50 PE-native-relevant columns (income components, demographics, tax inputs, benefit receipts). Use a real subsample, not synthetic.
+
+Purpose: exercise real joint structure (tails, categorical constraints, zero correlations) without the full row cost. Should fit comfortably in 48 GB RAM on CPU, in hours.
+
+Metrics per method:
+- PRDC coverage on 20% holdout (computed in raw features and in a 16-dim PCA embedding)
+- Per-stratum coverage (age × income-bracket × filing-status cells) — specifically flag any cell with <10 records that drops to 0 coverage
+- Rare-subpopulation preservation (elderly self-employed, young dividend, SSDI, top-1% earnings — the `sparse_coverage.csv` pattern)
+- Training wall time
+- Peak RSS during training
+- Generation wall time for 100k samples
+- Zero-rate MAE per variable
+
+### Stage 2 — large rows, medium columns
+
+Scale: **1,000,000 rows × 50 columns**
+
+Data: 10× oversample of stage 1's column set with enhanced_cps_2024 clone-and-assign style replication (as PE-US-data does for local area) to reach 1M rows.
+
+Purpose: expose row-scaling failures before column scaling. ZI-QRF is the most likely to fall off here. ZI-MAF should be OK. ZI-QDNN should scale cleanly.
+
+Same metrics as stage 1.
+
+### Stage 3 — full rows, full columns
+
+Scale: **3,373,378 rows × 155 columns** (exactly the v6 seed-ready shape, so we can compare the post-donor frame at production scale).
+
+Data: the actual v6 seed frame. It was never persisted, so unless it can be recovered from the log, this stage requires regenerating the seed by running donor integration only (~9 hours).
+
+Purpose: verify which synthesizer survives production scale, in what time, at what memory cost.
+
+Same metrics, plus:
+- Time to first valid sample (can we get ANY synthetic records out?)
+- Sample quality trajectory over training time (does it stabilize, or degrade with more training?)
+- Memory peak vs memory average (does it OOM on a 48 GB machine?)
+
+## Runtime expectations (rough a priori)
+
+Order-of-magnitude estimates for training one model to convergence on a 48 GB M3:
+
+| Method | Stage 1 (100k × 50) | Stage 2 (1M × 50) | Stage 3 (3.4M × 155) |
+|---|---|---|---|
+| ZI-QRF | minutes | hours, may OOM | days or infeasible; needs subsample |
+| ZI-MAF | 30 min (CPU) / 5 min (MPS) | few hours (MPS) | 8–16 hours (MPS), needs batch tuning |
+| ZI-QDNN | 15 min (CPU) / 3 min (MPS) | 1–2 hours (MPS) | 4–8 hours (MPS), lowest memory footprint |
+
+These are coarse and based on library benchmarks + extrapolation. The scale-up experiment's actual measurements are what we commit to.
+
+## Evaluation contract — matched-size comparison
+
+To avoid the "we ran ZI-MAF at 1M and ZI-QRF at 100k and declared a winner" trap, all three stages enforce:
+
+- **Same held-out split** across methods per stage (same 20% records).
+- **Same feature set** across methods per stage.
+- **Same wall-time budget** for training. (If ZI-QRF hits the budget without converging, that counts as its stage-3 result — "did not finish.")
+
+Report all three stages in a single table with method × stage × metric cells. Pick production defaults from this table alone, not from the existing 10k-row benchmark.
+
+## What this experiment would actually update
+
+1. **Production synthesizer default for G1.** Currently implied as ZI-MAF from the small benchmark. Scale-up may confirm or overturn.
+2. **SS-model methodology doc's ZI-QDNN production claim.** If ZI-QDNN does not emerge as a clear winner at scale, the doc needs a pointer to this evaluation.
+3. **PSID coverage ceiling.** If PSID coverage-0 is a real limitation, the longitudinal-training plan needs a fallback panel source.
+4. **Compute budget for production runs.** Knowing that ZI-MAF needs 12 hours MPS at production scale changes how often we can iterate on synthesizer hyperparameters.
+
+## Out of scope (for now)
+
+- Training on real-panel data at scale. The stage-3 experiment uses the cross-section; panel synthesis is a separate scale-up that depends on PSID-coverage diagnosis first.
+- Comparing against external non-microplex synthesizers (CTGAN, TVAE, TabDDPM, TabPFN) at full scale. Do after internal best is clear.
+- Runtime on GPU clusters. Local laptop numbers first; remote GPU only if production bottleneck demands it.
+
+## Risks to the experiment itself
+
+1. **Retrieving the v6 seed frame requires rerunning donor integration** (~9h) because v6 never persisted. A cheaper alternative: use the enhanced_cps_2024 HDF5 at its native scale (~400k persons × ~250 columns, wider than stage 3 but with roughly one-eighth of its rows) and adapt the donor conditioning.
+2. **PRDC in 150D is likely noise.** Budget time for the embedding-based variant before committing to any absolute coverage number.
+3. **ZI-QRF may be infeasible at stage 3.** That is itself a finding; have a fallback "QRF on top-20-important-columns" variant ready to report as a scale-constrained baseline.
+4. **The existing synthesizers may not even run at stage 3** without code changes (memory bugs at scale). Budget for 1–2 days of debugging on first attempt.
+
+## Minimum useful subset
+
+If full three-stage execution is too costly as a first pass, the minimum that informs the rearchitecture direction is **stage 1 alone**: 100k real-subsample rows × 50 real-feature columns, running all three ZI variants, reporting coverage + runtime + rare-cell preservation.
+
+That alone would invalidate or confirm the small-benchmark conclusions and give us enough signal to pick a G1 default.