| 1 | +# Synthesizer benchmark — what we know, and what scale-up will test |
| 2 | + |
| 3 | +*Draft plan for extending the existing ZI-synthesizer benchmark to production scale.* |
| 4 | + |
| 5 | +## What the existing benchmark tested |
| 6 | + |
| 7 | +Results in `microplex/benchmarks/results/benchmark_multi_seed.json` compare six synthesizers — QRF, ZI-QRF, QDNN, ZI-QDNN, MAF, ZI-MAF — on PRDC coverage across three schemas labeled `cps`, `sipp`, `psid`. |
| 8 | + |
| 9 | +| Method | `cps` coverage | `sipp` coverage | `psid` coverage | |
| 10 | +|---|---:|---:|---:| |
| 11 | +| QRF | 0.337 | 0.938 | 0.000 | |
| 12 | +| ZI-QRF | 0.347 | **0.950** | 0.000 | |
| 13 | +| QDNN | 0.380 | 0.293 | 0.000 | |
| 14 | +| ZI-QDNN | 0.406 | 0.717 | 0.000 | |
| 15 | +| MAF | 0.398 | 0.349 | 0.000 | |
| 16 | +| ZI-MAF | **0.499** | 0.866 | 0.000 | |
| 17 | + |
| 18 | +**Data used**: synthetic population generated by `benchmarks/run_benchmarks.py::generate_realistic_microdata`, 10,000 rows, **4 target variables** (`income`, `assets`, `debt`, `savings`) conditioned on **3 predictors** (`age`, `education`, `region`). The multi-survey fusion setup partially observes this population as three different "surveys": the CPS, SIPP, and PSID schemas each see a different subset. |
| 19 | + |
| 20 | +**Important**: the `cps` / `sipp` / `psid` labels in the result JSON are partial-observation schemas over the same synthetic population, not real CPS / SIPP / PSID data. |
| 21 | + |
| 22 | +## Scale gap to production |
| 23 | + |
| 24 | +| Dimension | Existing benchmark | Production (microplex-us G1) | Gap | |
| 25 | +|---|---:|---:|---:| |
| 26 | +| Rows | 10,000 | 430,000 (CPS) – 3,400,000 (ACS scaffold) | 43×–340× | |
| 27 | +| Columns | 7 (3 cond + 4 target) | 150+ joint variables | ~22× | |
| 28 | +| Source realism | Synthetic generator with analytical zero-inflation | Real CPS + PUF + SIPP + SCF joints with real tail structure | Categorical jump | |
| 29 | +| Held-out set | 20% of synthetic population | TBD — ECPS baseline, external targets (SOI, BEA, Census) | — | |
| 30 | + |
| 31 | +Combined row × column gap: **~1,000×–8,000×** (43 × 22 ≈ 950 at the low end, 340 × 22 ≈ 7,500 at the high end). The synthetic-to-real jump comes on top of this and cannot be expressed as a multiplier, because real data has structure the generator cannot produce. |
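The multipliers can be rechecked with a few lines of arithmetic. This is only a sketch: it takes "150+" as exactly 150 columns (an assumption), so the combined figures land slightly below the rounded ~1,000×–8,000× range.

```python
# Figures come from the scale-gap table; 150 stands in for "150+" (assumption).
bench_rows, bench_cols = 10_000, 7
prod_rows_low, prod_rows_high = 430_000, 3_400_000
prod_cols = 150

row_gap_low = prod_rows_low / bench_rows      # CPS
row_gap_high = prod_rows_high / bench_rows    # ACS scaffold
col_gap = prod_cols / bench_cols

combined_low = row_gap_low * col_gap
combined_high = row_gap_high * col_gap

print(f"rows: {row_gap_low:.0f}x to {row_gap_high:.0f}x, columns: {col_gap:.1f}x")
print(f"combined: {combined_low:,.0f}x to {combined_high:,.0f}x")
```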
| 32 | + |
| 33 | +## What we expect to break at scale |
| 34 | + |
| 35 | +### Coverage metric itself |
| 36 | + |
| 37 | +**k-NN distances concentrate in high dimensions, which undermines PRDC coverage.** With 150+ features, nearest-neighbor distances bunch up (curse of dimensionality), so a small distance threshold excludes almost everything while a slightly larger one includes almost everything. Above ~50 columns, raw-feature PRDC is typically noise-dominated unless it is computed after dimensionality reduction or in a learned embedding. |
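The concentration effect can be illustrated on toy data. The sketch below uses independent standard-normal features, which is an assumption and not the benchmark data, and measures the relative contrast (farthest minus nearest distance, divided by nearest) from one query point. `relative_contrast` is a hypothetical helper written for this illustration.

```python
import numpy as np

def relative_contrast(dim, n_points=2000, seed=0):
    """(max - min) / min of Euclidean distances from one query to Gaussian points."""
    rng = np.random.default_rng(seed)
    points = rng.standard_normal((n_points, dim))
    query = rng.standard_normal(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# 7 matches the existing benchmark width; 50 and 150 match the planned stages.
contrasts = {dim: relative_contrast(dim) for dim in (2, 7, 50, 150)}
for dim, value in contrasts.items():
    print(f"dim={dim:>3}: relative contrast = {value:.2f}")
```

A contrast well below 1 means the nearest and farthest neighbors are at nearly the same distance, which is the regime where a k-NN radius stops discriminating.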
| 38 | + |
| 39 | +**Mitigation**: three options. Compute PRDC in a learned embedding (an autoencoder or the synthesizer's latent space) rather than on raw features; compute per-block PRDC on demographically stratified cells; or switch to a metric that scales better with dimension (MMD with an RBF kernel, or mode-wise Wasserstein). |
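For reference, here is a minimal NumPy sketch of k-NN coverage, computed on raw features and in a PCA embedding. It is not the benchmark's implementation. The helper names, `k=5`, and the toy Gaussian data are assumptions, and PCA stands in here for the learned embedding. Coverage is taken as the share of real rows whose k-th-nearest-real-neighbor ball contains at least one synthetic row.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distances between every row of a and every row of b (O(n*m*d) memory)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def prdc_coverage(real, synth, k=5):
    """Share of real rows whose k-NN ball (radius set among real rows) holds a synthetic row."""
    d_real = pairwise_distances(real, real)
    # After sorting, column 0 is each row's zero distance to itself,
    # so column k is the distance to the k-th nearest other real row.
    radii = np.sort(d_real, axis=1)[:, k]
    nearest_synth = pairwise_distances(real, synth).min(axis=1)
    return float((nearest_synth < radii).mean())

def fit_pca(real, n_components):
    """Return the mean and top principal directions of the real rows."""
    mean = real.mean(axis=0)
    _, _, vt = np.linalg.svd(real - mean, full_matrices=False)
    return mean, vt[:n_components].T

rng = np.random.default_rng(0)
real = rng.standard_normal((400, 30))
synth = rng.standard_normal((400, 30))   # same distribution as real
synth_far = synth + 50.0                 # shifted far outside the real support

mean, components = fit_pca(real, n_components=16)
embed = lambda x: (x - mean) @ components

print("raw, same distribution:", prdc_coverage(real, synth))
print("16-dim PCA, same distribution:", prdc_coverage(embed(real), embed(synth)))
print("raw, shifted:", prdc_coverage(real, synth_far))
```

The dense distance matrix is only practical for small samples. At stage scale a tree- or index-based neighbor search would be needed, and features would normally be standardized before PCA.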
| 40 | + |
| 41 | +### ZI-QRF training |
| 42 | + |
| 43 | +**Quantile random forests scale poorly in both rows and columns.** |
| 44 | + |
| 45 | +- Row scaling: train time is roughly O(N log N) per tree; memory is O(N × features × n_trees). On 1.5M rows × 150 cols × 100 trees, that's ~180 GB for naive storage at 8 bytes per value, without sparse leaves. Even with efficient implementations (`quantile-forest`, `lightgbm`-style histogram trees), training time is hours-to-days on CPU for a full run. |
| 46 | +- Column scaling: with 150+ features, each split searches a larger candidate space; conditional coverage on rare variables gets noisier; `max_features` tuning becomes load-bearing. |
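The ~180 GB figure in the row-scaling bullet follows from the naive O(N × features × n_trees) bound if each stored value takes 8 bytes. The 8-byte width is an assumption that reproduces the number; it is not a measured property of any implementation. `naive_qrf_memory_gb` is a hypothetical helper.

```python
def naive_qrf_memory_gb(n_rows, n_features, n_trees, bytes_per_value=8):
    """Naive N x features x n_trees storage, in decimal gigabytes."""
    return n_rows * n_features * n_trees * bytes_per_value / 1e9

full = naive_qrf_memory_gb(1_500_000, 150, 100)                       # shape from the bullet
full_f32 = naive_qrf_memory_gb(1_500_000, 150, 100, bytes_per_value=4)
stage1 = naive_qrf_memory_gb(100_000, 50, 100)                        # stage-1 shape, 100 trees

print(f"1.5M x 150 x 100 trees, 8-byte values: {full:.0f} GB")
print(f"same shape, 4-byte values: {full_f32:.0f} GB")
print(f"100k x 50 x 100 trees, 8-byte values: {stage1:.0f} GB")
```

Under the same naive bound, halving the value width still leaves the full shape well above a 48 GB machine, while the stage-1 shape fits easily.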
| 47 | + |
| 48 | +**Prediction**: ZI-QRF's dominance on small-SIPP is partly because 500-person panels fit neatly into tree leaves. At 1.5M rows, expect the advantage to narrow or invert — partly because QRF hits practical compute limits and has to subsample. |
| 49 | + |
| 50 | +### ZI-MAF training |
| 51 | + |
| 52 | +**Normalizing flows need careful hyperparameter tuning on real data.** |
| 53 | + |
| 54 | +- Mode-collapse risk: ZI-MAF's joint distribution over 150 variables can collapse onto a lower-dimensional manifold, especially when many variables are zero-inflated with correlated zero patterns (same person has zero across many income sources at once). |
| 55 | +- Training time: MAF is GPU-accelerated and scales linearly in rows. 1.5M rows × 150 cols × 200 epochs is feasible on a single H100 in several hours. On Apple Silicon (Max's 48 GB M3), expect ~8–16 hours with the MPS backend. |
| 56 | +- Conditioning: the existing benchmark uses 3 condition variables. Real microdata conditions on ~10–20 demographics. Adding conditioning dimensions is the easier part of scaling MAF. |
| 57 | + |
| 58 | +**Prediction**: ZI-MAF's lead on CPS should hold or grow at scale (flows scale well with rows). Main risk is tail coverage — top-1% income, extreme wealth — which is exactly where the SS-model application cares most. |
| 59 | + |
| 60 | +### ZI-QDNN training |
| 61 | + |
| 62 | +**Deep quantile networks scale well but need careful tuning as width and depth grow.** |
| 63 | + |
| 64 | +- Row scaling: straightforward. Cost is O(N) per epoch, and per-step cost is linear in batch size. |
| 65 | +- Column scaling: the pinball loss surface gets jagged with many zero-inflated targets; per-target head design matters more at 150 vars than at 4. |
| 66 | +- Zero-inflation head: a single logistic head for `P(zero)` becomes underpowered at 150 zero-capable variables with complex joint zero patterns (observing income=0 informs dividends=0 informs wages=0). Joint zero-mask modeling is probably needed. |
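To make the pinball-loss point concrete, here is a small NumPy sketch on a toy target that is 70% zeros. The data and the `pinball_loss` helper are illustrative assumptions, not the ZI-QDNN code.

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Mean pinball (quantile) loss at quantile level tau in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

# Toy zero-inflated target: seven zeros and three positive values.
y = np.array([0, 0, 0, 0, 0, 0, 0, 10, 20, 30], dtype=float)

loss_at_zero = pinball_loss(y, np.zeros_like(y), tau=0.5)
loss_at_five = pinball_loss(y, np.full_like(y, 5.0), tau=0.5)
print(f"median loss predicting 0: {loss_at_zero:.1f}")
print(f"median loss predicting 5: {loss_at_five:.1f}")
```

With 70% zeros, every quantile level below 0.7 is best predicted as exactly 0, so the quantile heads carry no information about the positive part at those levels. That is the gap a separate `P(zero)` head is meant to fill.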
| 67 | + |
| 68 | +**Prediction**: ZI-QDNN as currently implemented will degrade fastest under scale-up without a joint zero-mask head. Worth testing whether a graph-structured zero-mask extension rescues it. |
| 69 | + |
| 70 | +### PRDC coverage = 0 on PSID across all methods |
| 71 | + |
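| 72 | +This is unresolved in the existing benchmark and is the single most important thing to diagnose before the SS-model longitudinal extension commits to PSID. Because the `psid` label denotes a partial-observation schema over the synthetic population rather than real PSID data, hypotheses 2 and 3 below apply only to the extent that the schema emulates PSID's panel structure and attrition. Three hypotheses: |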
| 73 | + |
| 74 | +1. **Test-setup degeneracy.** PSID-schema's observed-variable mask may overlap with the CPS / SIPP masks in a way that produces an empty held-out set. Check the mask logic. |
| 75 | +2. **Panel structure breaks per-record PRDC.** PSID is a panel; a "record" could mean a person-year or a person. If the test set uses person-year and the synthesizer generates persons, coverage is trivially 0. Fix: switch to a panel-aware metric (per-person trajectory coverage) or generate person-years. |
| 76 | +3. **Real limitation.** Attrition + sparse-year coverage in PSID creates tail records the synthesizers cannot cover. If this is the case, the SS-model trajectory training must either accept this ceiling, use a different panel source (SIPP panel, HRS, NLSY), or augment PSID with synthetic history. |
| 77 | + |
| 78 | +**Action**: diagnose before any PSID-dependent architecture work commits. |
| 79 | + |
| 80 | +## Proposed scale-up experiment protocol |
| 81 | + |
| 82 | +Run three stages, each keeping row count and column count explicit. All stages report three classes of metric: accuracy (coverage), cost (time + memory), and health (convergence + rare-cell preservation). |
| 83 | + |
| 84 | +### Stage 1 — medium rows, medium columns |
| 85 | + |
| 86 | +Scale: **100,000 rows × 50 columns** |
| 87 | + |
| 88 | +Data: subsample enhanced_cps_2024 to 100k persons, select 50 PE-native-relevant columns (income components, demographics, tax inputs, benefit receipts). Use a real subsample, not synthetic. |
| 89 | + |
| 90 | +Purpose: exercise real joint structure (tails, categorical constraints, zero correlations) without the full row cost. Should fit comfortably in 48 GB RAM on CPU, in hours. |
| 91 | + |
| 92 | +Metrics per method: |
| 93 | +- PRDC coverage on 20% holdout (computed in raw features and in a 16-dim PCA embedding) |
| 94 | +- Per-stratum coverage (age × income-bracket × filing-status cells) — specifically flag any cell with <10 records that drops to 0 coverage |
| 95 | +- Rare-subpopulation preservation (elderly self-employed, young dividend, SSDI, top-1% earnings — the `sparse_coverage.csv` pattern) |
| 96 | +- Training wall time |
| 97 | +- Peak RSS during training |
| 98 | +- Generation wall time for 100k samples |
| 99 | +- Zero-rate MAE per variable |
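One possible reading of "zero-rate MAE per variable" is the absolute gap between real and synthetic zero rates for each variable, averaged across variables for a single summary number. The sketch below follows that reading, which is an assumption. The column names and the `zero_rate_errors` helper are hypothetical.

```python
import numpy as np

def zero_rate_errors(real, synth, names):
    """Absolute gap between real and synthetic zero rates, one entry per variable."""
    real_rate = (real == 0).mean(axis=0)
    synth_rate = (synth == 0).mean(axis=0)
    gaps = np.abs(real_rate - synth_rate)
    return {name: float(gap) for name, gap in zip(names, gaps)}

names = ["income", "dividends"]                      # hypothetical columns
real = np.array([[0, 0], [0, 0], [5, 0], [7, 3]])    # zero rates: 0.50, 0.75
synth = np.array([[0, 0], [4, 0], [5, 0], [6, 0]])   # zero rates: 0.25, 1.00

errors = zero_rate_errors(real, synth, names)
zero_rate_mae = float(np.mean(list(errors.values())))
print(errors, zero_rate_mae)
```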
| 100 | + |
| 101 | +### Stage 2 — large rows, medium columns |
| 102 | + |
| 103 | +Scale: **1,000,000 rows × 50 columns** |
| 104 | + |
| 105 | +Data: stage 1's column set, scaled up about 10× to 1M rows using clone-and-assign style replication of enhanced_cps_2024 (as PE-US-data does for local areas). |
| 106 | + |
| 107 | +Purpose: expose row-scaling failures before column scaling. ZI-QRF is the most likely to fall off here. ZI-MAF should be OK. ZI-QDNN should scale cleanly. |
| 108 | + |
| 109 | +Same metrics as stage 1. |
| 110 | + |
| 111 | +### Stage 3 — full rows, full columns |
| 112 | + |
| 113 | +Scale: **3,373,378 rows × 155 columns** (exactly the v6 seed-ready shape, so we can compare the post-donor frame at production scale). |
| 114 | + |
| 115 | +Data: the actual v6 seed frame. It was never persisted, so unless it can be recovered from the log, this stage requires regenerating the seed by running donor integration only (~9 hours). |
| 116 | + |
| 117 | +Purpose: verify which synthesizer survives production scale, in what time, at what memory cost. |
| 118 | + |
| 119 | +Same metrics, plus: |
| 120 | +- Time to first valid sample (can we get ANY synthetic records out?) |
| 121 | +- Sample quality trajectory over training time (does it stabilize, or degrade with more training?) |
| 122 | +- Memory peak vs memory average (does it OOM on a 48 GB machine?) |
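The cost metrics (wall time and peak RSS) can be captured with the standard library alone. This is a sketch that assumes a Unix-like system, since the `resource` module is not available on Windows. `measure` and `peak_rss_gb` are hypothetical helper names, and `sum` stands in for a training call.

```python
import resource
import sys
import time

def peak_rss_gb():
    """Peak resident set size of this process so far, in decimal GB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    bytes_used = peak if sys.platform == "darwin" else peak * 1024
    return bytes_used / 1e9

def measure(fn, *args, **kwargs):
    """Run fn and return (result, wall seconds, peak RSS in GB)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start, peak_rss_gb()

result, seconds, peak_gb = measure(sum, range(1_000_000))
print(f"result={result}, wall={seconds:.3f}s, peak RSS={peak_gb:.2f} GB")
```

Peak RSS is a high-water mark for the whole process, so each method would need its own fresh process for the numbers to be comparable. Tracking average memory, as opposed to peak, would need periodic sampling, which this sketch does not do.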
| 123 | + |
| 124 | +## Runtime expectations (rough a priori) |
| 125 | + |
| 126 | +Order-of-magnitude estimates for training one model to convergence on a 48 GB M3: |
| 127 | + |
| 128 | +| Method | Stage 1 (100k × 50) | Stage 2 (1M × 50) | Stage 3 (3.4M × 155) | |
| 129 | +|---|---|---|---| |
| 130 | +| ZI-QRF | minutes | hours, may OOM | days or infeasible; needs subsample | |
| 131 | +| ZI-MAF | 30 min (CPU) / 5 min (MPS) | few hours (MPS) | 8–16 hours (MPS), needs batch tuning | |
| 132 | +| ZI-QDNN | 15 min (CPU) / 3 min (MPS) | 1–2 hours (MPS) | 4–8 hours (MPS), lowest memory footprint | |
| 133 | + |
| 134 | +These are coarse and based on library benchmarks + extrapolation. The scale-up experiment's actual measurements are what we commit to. |
| 135 | + |
| 136 | +## Evaluation contract — matched-size comparison |
| 137 | + |
| 138 | +To avoid the "we ran ZI-MAF at 1M and ZI-QRF at 100k and declared a winner" trap, all three stages enforce: |
| 139 | + |
| 140 | +- **Same held-out split** across methods per stage (same 20% records). |
| 141 | +- **Same feature set** across methods per stage. |
| 142 | +- **Same wall-time budget** for training. (If ZI-QRF hits the budget without converging, that counts as its stage-3 result — "did not finish.") |
| 143 | + |
| 144 | +Report all three stages in a single table with method × stage × metric cells. Pick production defaults from this table alone, not from the existing 10k-row benchmark. |
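A minimal sketch of how the contract could be enforced in a harness: one seeded split reused by every method, and a training loop that reports "did not finish" when the wall-time budget runs out. The function names and the step-callback interface are assumptions made for illustration and are not part of the existing benchmark code.

```python
import time
import numpy as np

def holdout_split(n_rows, holdout_frac=0.2, seed=0):
    """Deterministic train/holdout row indices, shared by all methods in a stage."""
    perm = np.random.default_rng(seed).permutation(n_rows)
    n_holdout = int(round(n_rows * holdout_frac))
    return perm[n_holdout:], perm[:n_holdout]

def train_with_budget(step_fn, budget_seconds):
    """Call step_fn(step) until it returns True (converged) or the budget runs out."""
    start = time.perf_counter()
    steps = 0
    while time.perf_counter() - start < budget_seconds:
        steps += 1
        if step_fn(steps):
            return {"status": "converged", "steps": steps}
    return {"status": "did not finish", "steps": steps}

train_idx, holdout_idx = holdout_split(100_000)
fast = train_with_budget(lambda step: step >= 3, budget_seconds=5.0)
slow = train_with_budget(lambda step: False, budget_seconds=0.05)
print(len(train_idx), len(holdout_idx), fast, slow["status"])
```

The budget is only checked between steps, so a single very long step can overrun it. A stricter harness would run training in a subprocess and terminate it at the deadline.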
| 145 | + |
| 146 | +## What this experiment would actually update |
| 147 | + |
| 148 | +1. **Production synthesizer default for G1.** Currently implied as ZI-MAF from the small benchmark. Scale-up may confirm or overturn. |
| 149 | +2. **SS-model methodology doc's ZI-QDNN production claim.** If ZI-QDNN does not emerge as a clear winner at scale, the doc needs a pointer to this evaluation. |
| 150 | +3. **PSID coverage ceiling.** If PSID coverage-0 is a real limitation, the longitudinal-training plan needs a fallback panel source. |
| 151 | +4. **Compute budget for production runs.** Whether ZI-MAF needs on the order of 12 hours on MPS at production scale determines how often we can iterate on synthesizer hyperparameters. |
| 152 | + |
| 153 | +## Out of scope (for now) |
| 154 | + |
| 155 | +- Training on real-panel data at scale. The stage-3 experiment uses the cross-section; panel synthesis is a separate scale-up that depends on PSID-coverage diagnosis first. |
| 156 | +- Comparing against external non-microplex synthesizers (CTGAN, TVAE, TabDDPM, TabPFN) at full scale. Do after internal best is clear. |
| 157 | +- Runtime on GPU clusters. Local laptop numbers first; remote GPU only if production bottleneck demands it. |
| 158 | + |
| 159 | +## Risks to the experiment itself |
| 160 | + |
| 161 | +1. **Retrieving the v6 seed frame requires rerunning donor integration** (~9h) because v6 was never persisted. A cheaper alternative: use the enhanced_cps_2024 HDF5 at its native scale (~400k persons × ~250 columns, which exceeds stage-3 width but has roughly 8× fewer rows) and adapt the donor conditioning. |
| 162 | +2. **PRDC in 150D is likely noise.** Budget time for the embedding-based variant before committing to any absolute coverage number. |
| 163 | +3. **ZI-QRF may be infeasible at stage 3.** That is itself a finding; have a fallback "QRF on top-20-important-columns" variant ready to report as a scale-constrained baseline. |
| 164 | +4. **The existing synthesizers may not even run at stage 3** without code changes (memory bugs at scale). Budget for 1–2 days of debugging on first attempt. |
| 165 | + |
| 166 | +## Minimum useful subset |
| 167 | + |
| 168 | +If full three-stage execution is too costly as a first pass, the minimum that informs the rearchitecture direction is **stage 1 alone**: 100k real-subsample rows × 50 real-feature columns, running all three ZI variants, reporting coverage + runtime + rare-cell preservation. |
| 169 | + |
| 170 | +That alone would invalidate or confirm the small-benchmark conclusions and give us enough signal to pick a G1 default. |