
Commit 7186926

MaxGhenis and claude committed
Amend calibrator-decision with sparse_coverage evidence; add scale-up plan
calibrator-decision.md:
- Cites microplex/benchmarks/results/sparse_coverage.csv as empirical support: sparse L0 drives rare-subpopulation ratios to 0.0 at 10%, 2%, and 1% sparsity (elderly_selfemp, young_dividend both zero), while generative synthesis preserves them at 7-30x oracle ratio.
- Adds an explicit scale caveat: sparse_coverage evidence is from 10k-row synthetic data; the structural pattern (L0 zeros records exactly) survives scale-up on mathematical grounds even if absolute numbers shift.

synthesizer-benchmark-scale-up.md (new):
- Records what the existing benchmark_multi_seed.json measures: 10k rows x 7 columns of SYNTHETIC data. The cps/sipp/psid labels are partial-observation schemas over one synthetic population, not real sources.
- Production gap: 3,000-7,000x on (rows x columns) plus the synthetic-to-real jump.
- Predicted failure modes per method at scale (QRF compute-bound above 1M rows, MAF tail-coverage risk on top income, QDNN needs joint zero-mask head at 150 zero-capable vars, PRDC metric degenerates in 150D without embedding).
- Three-stage scale-up protocol (100k x 50, 1M x 50, 3.4M x 155) with matched holdouts, rare-cell preservation tracking, and wall-time / RSS measurements per method.
- Ballpark runtime expectations per method per stage on a 48 GB M3.
- Diagnoses PSID coverage = 0 as unresolved and must-fix before any SS-model longitudinal work commits to PSID as the backbone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 699ea28 commit 7186926

2 files changed

Lines changed: 211 additions & 0 deletions


docs/calibrator-decision.md

Lines changed: 41 additions & 0 deletions
@@ -99,6 +99,47 @@ Yes, operationally:
- Tests and small-scale diagnostics: `Calibrator`.
- No single-pipeline run crosses all three. Each tool has a distinct and non-overlapping job.

## Empirical support: sparse selection annihilates rare subpopulations

The single cleanest empirical argument for this split comes from
`microplex/benchmarks/results/sparse_coverage.csv`, which measures rare-subpopulation
preservation at varying sparsity levels (lower `coverage_median` = closer to
oracle):

| Method | `coverage_median` | elderly_selfemp_ratio | young_dividend_ratio |
|---|---:|---:|---:|
| Oracle (full) | 0.009 | 0.94 | 1.11 |
| Generative (10%) | 0.53 | 27.7 | 20.6 |
| Generative (2%) | 0.42 | 22.1 | 32.3 |
| Generative (1%) | 0.25 | 7.2 | 1.7 |
| Weighted (10%) | 0.24 | **0.00** | **0.00** |
| Weighted (2%) | 0.35 | 0.02 | **0.00** |
| Weighted (1%) | 0.65 | **0.00** | **0.00** |

Sparse L0 weighting drops rare subpopulations to **zero or near-zero
representation** at every sparsity level tested: five of the six weighted cells
are exactly 0.00 and the sixth is 0.02. Generative synthesis keeps them
represented, at roughly 7–30× the oracle ratio in five of six cells (young
dividend earners at 1% sparsity come in at about 1.5×). For policy analysis,
where rare subpopulations (elderly self-employed, young dividend earners,
disability recipients, top-1% earners) drive outsized fiscal and distributional
effects, sparse-as-mainline is non-viable on accuracy grounds alone.

This empirical pattern reinforces the decision above: L0/sparse selection is a
**post-calibration deployment tool**, not a calibration method. Apply it after
the mainline `microcalibrate` run has produced a fully-covered adjusted-weight
artifact, and only when a downstream consumer needs a small subsample.

### Scale caveat

`sparse_coverage.csv` was produced on **10,000-row synthetic data with ~7
variables**. Production scale is 1.5M rows × 150+ variables on real joint
microdata. We should not assume the 20–30× generative-vs-weighted gap holds at
that scale: the absolute numbers will shift, and rare-subpopulation
preservation may tighten for both methods. What is expected to hold is the
structural pattern: sparse L0 exactly zeros out records, and generative
synthesis does not. The argument against sparse-as-mainline survives any
plausible scale-up because the failure mode (zero representation of rare cells)
is not a noise issue; it is mathematically baked into L0 selection.
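
A minimal sketch of why the zeroing is structural rather than statistical. The unit oracle weights, the 20-record cell, and the choice of which 100 records survive are all hypothetical values chosen for illustration; this is not the L0 implementation or the benchmark code. It only shows that once every record in a cell has weight exactly zero, rescaling the survivors can restore the population total but not the cell.

```python
import numpy as np

def cell_ratio(weights, in_cell, oracle_weights):
    """Weighted cell total relative to the full-sample (oracle) cell total."""
    return weights[in_cell].sum() / oracle_weights[in_cell].sum()

n = 10_000
oracle = np.ones(n)                 # hypothetical unit weights on the full sample
in_cell = np.zeros(n, dtype=bool)
in_cell[:20] = True                 # hypothetical rare cell: 20 of 10,000 records

# Hypothetical sparse selection at 1%: 100 survivors, none from the rare cell.
sparse = np.zeros(n)
sparse[20:120] = n / 100            # survivors rescaled to match the population total

print("population total:", sparse.sum())
print("rare-cell ratio:", cell_ratio(sparse, in_cell, oracle))
```

No amount of reweighting the 100 survivors changes the second number, because no surviving record belongs to the cell.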

## What this unblocks

- Migration step 2 of `docs/core-wiring-audit.md`: "Adopt `Calibrator` end-to-end" is revised to "Adopt `microcalibrate` end-to-end as the production calibrator." That becomes the first real code change in `spec-based-ecps-rewire`.
Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
# Synthesizer benchmark: what we know, and what scale-up will test

*Draft plan for extending the existing ZI-synthesizer benchmark to production scale.*

## What the existing benchmark tested

Results in `microplex/benchmarks/results/benchmark_multi_seed.json` compare six synthesizers (QRF, ZI-QRF, QDNN, ZI-QDNN, MAF, ZI-MAF) on PRDC coverage across three schemas labeled `cps`, `sipp`, `psid`.

| Method | `cps` coverage | `sipp` coverage | `psid` coverage |
|---|---:|---:|---:|
| QRF | 0.337 | 0.938 | 0.000 |
| ZI-QRF | 0.347 | **0.950** | 0.000 |
| QDNN | 0.380 | 0.293 | 0.000 |
| ZI-QDNN | 0.406 | 0.717 | 0.000 |
| MAF | 0.398 | 0.349 | 0.000 |
| ZI-MAF | **0.499** | 0.866 | 0.000 |

**Data used**: synthetic population generated by `benchmarks/run_benchmarks.py::generate_realistic_microdata`, 10,000 rows, **4 target variables** (`income`, `assets`, `debt`, `savings`) conditioned on **3 predictors** (`age`, `education`, `region`). The multi-survey fusion setup partially observes this population as different "surveys" (CPS-schema sees one subset, SIPP-schema sees another, PSID-schema sees another).

**Important**: the `cps` / `sipp` / `psid` labels in the result JSON are partial-observation schemas over the same synthetic population, not real CPS / SIPP / PSID data.

## Scale gap to production

| Dimension | Existing benchmark | Production (microplex-us G1) | Gap |
|---|---:|---:|---:|
| Rows | 10,000 | 430,000 (CPS) – 3,400,000 (ACS scaffold) | 43×–340× |
| Columns | 7 (3 cond + 4 target) | 150+ joint variables | ~22× |
| Source realism | Synthetic generator with analytical zero-inflation | Real CPS + PUF + SIPP + SCF joints with real tail structure | Categorical jump |
| Held-out set | 20% of synthetic population | TBD: ECPS baseline, external targets (SOI, BEA, Census) | |

Combined row × column gap: **~1,000×–8,000×** (43×–340× in rows times ~22× in columns). On top of that sits the synthetic-to-real jump, which cannot be expressed as a multiplier because real data has structure the generator cannot produce.
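
The combined figure follows directly from the table. The sketch below redoes the arithmetic, assuming exactly 150 production columns (the table says "150+", so this is a lower bound on the column gap); it lands at roughly 920× to 7,300×, which is consistent with the rounded ~1,000×–8,000× range.

```python
# Benchmark and production shapes taken from the scale-gap table.
bench_rows, bench_cols = 10_000, 7
prod_rows = (430_000, 3_400_000)   # CPS, ACS scaffold
prod_cols = 150                    # assumption: lower bound of "150+"

row_gap = tuple(r / bench_rows for r in prod_rows)
col_gap = prod_cols / bench_cols
combined = tuple(rg * col_gap for rg in row_gap)

print(f"row gap:      {row_gap[0]:.0f}x to {row_gap[1]:.0f}x")
print(f"column gap:   {col_gap:.1f}x")
print(f"combined gap: {combined[0]:,.0f}x to {combined[1]:,.0f}x")
```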

## What we expect to break at scale

### Coverage metric itself

**PRDC k-NN coverage concentrates in high dimensions.** With 150+ features, nearest-neighbor distances bunch up (curse of dimensionality) and a small distance threshold starts excluding almost everything while a larger one starts including almost everything. Raw-feature PRDC above ~50 columns is typically noise-dominated without dimensionality reduction or a learned embedding.
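
The bunching can be seen on toy data. The sketch below is an illustration under stated assumptions, not part of the benchmark: it draws hypothetical i.i.d. standard-normal points (real microdata is neither Gaussian nor independent across columns) and compares each point's nearest-neighbor distance with its farthest-neighbor distance. The function name `nn_contrast` is made up for this example. A ratio near 0 means neighbors are well separated from non-neighbors; a ratio approaching 1 means every point is about equally far from every other point.

```python
import numpy as np

def nn_contrast(dim, n=500, seed=0):
    """Mean nearest-neighbor distance divided by mean farthest-neighbor distance."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, dim))
    # Pairwise squared distances via the Gram-matrix identity.
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(d2, np.inf)        # exclude self when taking the minimum
    nearest = np.sqrt(np.maximum(d2.min(axis=1), 0.0))
    np.fill_diagonal(d2, -np.inf)       # exclude self when taking the maximum
    farthest = np.sqrt(np.maximum(d2.max(axis=1), 0.0))
    return nearest.mean() / farthest.mean()

for dim in (2, 7, 50, 150):
    print(dim, round(nn_contrast(dim), 3))
```

As the ratio climbs toward 1, a k-NN radius stops discriminating between "covered" and "not covered", which is the concern for raw-feature PRDC at 150+ columns.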

**Mitigation**: compute PRDC in a learned embedding (autoencoder or the synthesizer's latent space) rather than raw features, compute per-block PRDC on demographically stratified cells, or switch to a metric that scales better with dimension (MMD with an RBF kernel, or mode-wise Wasserstein).
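
A minimal sketch of the first option, using PCA as a stand-in for a learned embedding. It assumes the standard PRDC definition of coverage (the fraction of real records whose k-th-nearest-real-neighbor ball contains at least one synthetic record); the function names, `k=5`, and the random toy data are hypothetical and are not taken from the microplex benchmark code.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    d2 = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * (a @ b.T)
    return np.sqrt(np.maximum(d2, 0.0))

def coverage(real, synth, k=5):
    """Share of real rows with at least one synthetic row inside their k-NN radius."""
    rr = pairwise_dist(real, real)
    np.fill_diagonal(rr, np.inf)                 # a row is not its own neighbor
    radii = np.sort(rr, axis=1)[:, k - 1]        # distance to k-th nearest real row
    nearest_synth = pairwise_dist(real, synth).min(axis=1)
    return float((nearest_synth <= radii).mean())

def pca_embed(fit_on, *others, n_components=16):
    """Project all inputs onto the top principal components of fit_on."""
    mean = fit_on.mean(axis=0)
    _, _, vt = np.linalg.svd(fit_on - mean, full_matrices=False)
    basis = vt[:n_components].T
    return [(m - mean) @ basis for m in (fit_on, *others)]

rng = np.random.default_rng(0)
real = rng.standard_normal((400, 50))            # toy stand-in for a holdout set
synth = rng.standard_normal((400, 50))           # toy stand-in for synthesizer output

raw_cov = coverage(real, synth)
real_z, synth_z = pca_embed(real, synth, n_components=16)
emb_cov = coverage(real_z, synth_z)
print("raw-feature coverage:", raw_cov)
print("16-dim PCA coverage: ", emb_cov)
```

The embedding is fitted on the real rows only so the synthetic rows cannot shape the space they are scored in.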

### ZI-QRF training

**Quantile random forests scale poorly in both rows and columns.**

- Row scaling: train time is roughly O(N log N) per tree; memory is O(N × features × n_trees). On 1.5M rows × 150 cols × 100 trees, that's ~180 GB for naive storage at 8 bytes per value, without sparse leaves. Even with efficient implementations (`quantile-forest`, `lightgbm`-style histogram trees), training time is hours-to-days on CPU for a full run.
- Column scaling: each split searches a larger candidate-feature space at 150+ features; conditional coverage on rare variables gets noisier; `max_features` tuning becomes load-bearing.
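
The ~180 GB figure can be reproduced from the O(N × features × n_trees) bound. The 8 bytes per stored value is an assumption (one float64 per entry) that happens to match the stated number; a real implementation's footprint will differ.

```python
def naive_forest_bytes(n_rows, n_features, n_trees, bytes_per_value=8):
    """Upper-bound storage if every tree keeps a full copy of the training matrix."""
    return n_rows * n_features * n_trees * bytes_per_value

gb = naive_forest_bytes(1_500_000, 150, 100) / 1e9
print(f"{gb:.0f} GB")
```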

**Prediction**: ZI-QRF's dominance on small-SIPP is partly because 500-person panels fit neatly into tree leaves. At 1.5M rows, expect the advantage to narrow or invert, partly because QRF hits practical compute limits and has to subsample.

### ZI-MAF training

**Normalizing flows need careful hyperparameter tuning on real data.**

- Mode-collapse risk: ZI-MAF's joint distribution over 150 variables can collapse onto a lower-dimensional manifold, especially when many variables are zero-inflated with correlated zero patterns (the same person has zero across many income sources at once).
- Training time: MAF is GPU-accelerated and scales linearly in rows. 1.5M rows × 150 cols × 200 epochs is feasible on a single H100 in roughly several hours. On Apple Silicon (Max's 48 GB M3), expect ~8–16 hours with the MPS backend.
- Conditioning: the existing benchmark uses 3 condition variables. Real microdata conditions on ~10–20 demographics. Adding conditioning dimensions is the easier part of scaling MAF.

**Prediction**: ZI-MAF's lead on CPS should hold or grow at scale (flows scale well with rows). The main risk is tail coverage (top-1% income, extreme wealth), which is exactly where the SS-model application cares most.

### ZI-QDNN training

**Deep quantile networks scale well but need careful tuning at width + depth.**

- Row scaling: straightforward, O(N) per epoch, linear in batch size.
- Column scaling: the pinball loss surface gets jagged with many zero-inflated targets; per-target head design matters more at 150 vars than at 4.
- Zero-inflation head: a single logistic head for `P(zero)` becomes underpowered at 150 zero-capable variables with complex joint zero patterns (observing income=0 informs dividends=0 informs wages=0). Joint zero-mask modeling is probably needed.
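
For reference, a minimal sketch of the pinball (quantile) loss and of how zero inflation pins a quantile at zero. The toy target (70 zeros, 30 values of 10) is hypothetical, and this is the textbook loss, not the ZI-QDNN training code. For any quantile below the zero share, the best constant prediction is exactly 0, which is one reason a separate model of `P(zero)` is used alongside the quantile heads.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss at quantile q: under-predictions cost q, over-predictions 1 - q."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

# Hypothetical zero-inflated target: 70% exact zeros, 30% positive values.
y = np.concatenate([np.zeros(70), np.full(30, 10.0)])

# Scan constant predictions and find the minimizer of the median (q = 0.5) loss.
grid = np.linspace(0.0, 10.0, 101)
losses = [pinball_loss(y, np.full_like(y, c), q=0.5) for c in grid]
best = float(grid[int(np.argmin(losses))])
print("best constant prediction at q=0.5:", best)
```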

**Prediction**: ZI-QDNN as currently implemented will degrade fastest under scale-up without a joint zero-mask head. Worth testing whether a graph-structured zero-mask extension rescues it.

### PRDC coverage = 0 on PSID across all methods

This is unresolved in the existing benchmark and is the single most important thing to diagnose before the SS-model longitudinal extension commits to PSID. Three hypotheses:

1. **Test-setup degeneracy.** PSID-schema's observed-variable mask may overlap with the CPS / SIPP masks in a way that produces an empty held-out set. Check the mask logic.
2. **Panel structure breaks per-record PRDC.** PSID is a panel; a "record" could mean a person-year or a person. If the test set uses person-year and the synthesizer generates persons, coverage is trivially 0. Fix: switch to a panel-aware metric (per-person trajectory coverage) or generate person-years.
3. **Real limitation.** Attrition + sparse-year coverage in PSID creates tail records the synthesizers cannot cover. If this is the case, the SS-model trajectory training must either accept this ceiling, use a different panel source (SIPP panel, HRS, NLSY), or augment PSID with synthetic history.

**Action**: diagnose before any PSID-dependent architecture work commits.

## Proposed scale-up experiment protocol

Run three stages, each keeping row count and column count explicit. All stages report three classes of metric: accuracy (coverage), cost (time + memory), and health (convergence + rare-cell preservation).

### Stage 1: medium rows, medium columns

Scale: **100,000 rows × 50 columns**

Data: subsample enhanced_cps_2024 to 100k persons, select 50 PE-native-relevant columns (income components, demographics, tax inputs, benefit receipts). Use a real subsample, not synthetic.

Purpose: exercise real joint structure (tails, categorical constraints, zero correlations) without the full row cost. Should fit comfortably in 48 GB RAM on CPU, in hours.

Metrics per method:
- PRDC coverage on 20% holdout (computed in raw features and in a 16-dim PCA embedding)
- Per-stratum coverage (age × income-bracket × filing-status cells), specifically flagging any cell with <10 records that drops to 0 coverage
- Rare-subpopulation preservation (elderly self-employed, young dividend, SSDI, top-1% earnings; the `sparse_coverage.csv` pattern)
- Training wall time
- Peak RSS during training
- Generation wall time for 100k samples
- Zero-rate MAE per variable
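
One way the zero-rate metric could be computed, as a sketch. It assumes "zero-rate MAE" means the absolute gap between the real and synthetic share of exact zeros in each column, averaged across columns for a single summary number; that reading, the function name, and the toy arrays are assumptions rather than the benchmark's definition.

```python
import numpy as np

def zero_rate_abs_error(real, synth):
    """Per-column |share of zeros in real - share of zeros in synth|."""
    return np.abs((real == 0).mean(axis=0) - (synth == 0).mean(axis=0))

# Toy data: rows are records, columns are variables.
real = np.array([[0, 1], [0, 2], [3, 0], [4, 5]], dtype=float)
synth = np.array([[0, 1], [1, 1], [2, 2], [3, 3]], dtype=float)

per_variable = zero_rate_abs_error(real, synth)
print("per-variable error:", per_variable)
print("mean across variables:", per_variable.mean())
```

Reporting the per-variable vector alongside the mean keeps a single badly modeled column from being averaged away.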

### Stage 2: large rows, medium columns

Scale: **1,000,000 rows × 50 columns**

Data: 10× oversample of the stage 1 rows, keeping stage 1's column set, using enhanced_cps_2024 clone-and-assign style replication (as PE-US-data does for local area) to reach 1M rows.

Purpose: expose row-scaling failures before column scaling. ZI-QRF is the most likely to fall off here. ZI-MAF should be OK. ZI-QDNN should scale cleanly.

Same metrics as stage 1.

### Stage 3: full rows, full columns

Scale: **3,373,378 rows × 155 columns** (exactly the v6 seed-ready shape, so we can compare the post-donor frame at production scale).

Data: the v6 seed frame. It was never persisted, so unless it can be recovered from the log, this stage requires regenerating it by running donor integration only (~9 hours).

Purpose: verify which synthesizer survives production scale, in what time, at what memory cost.

Same metrics, plus:
- Time to first valid sample (can we get ANY synthetic records out?)
- Sample quality trajectory over training time (does it stabilize, or degrade with more training?)
- Memory peak vs memory average (does it OOM on a 48 GB machine?)
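
A sketch of how wall time and peak RSS could be captured around a training call using only the standard library. The helper names are hypothetical. Two caveats: `resource` is Unix-only, and `ru_maxrss` is a process-lifetime high-water mark, so each method should run in a fresh process for the peak to be attributable to it. Average memory would need periodic sampling, which this sketch does not do.

```python
import resource
import sys
import time

def peak_rss_mb():
    """Approximate peak resident set size of this process, in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return peak / 1e6 if sys.platform == "darwin" else peak / 1e3

def timed(fn, *args, **kwargs):
    """Run fn and return (result, wall seconds, peak RSS in MB so far)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed, peak_rss_mb()

# Stand-in workload; a real run would pass the synthesizer's fit function.
result, seconds, peak_mb = timed(sum, range(1_000_000))
print(f"wall time: {seconds:.4f} s, peak RSS: {peak_mb:.1f} MB")
```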

## Runtime expectations (rough a priori)

Order-of-magnitude estimates for training one model to convergence on a 48 GB M3:

| Method | Stage 1 (100k × 50) | Stage 2 (1M × 50) | Stage 3 (3.4M × 155) |
|---|---|---|---|
| ZI-QRF | minutes | hours, may OOM | days or infeasible; needs subsample |
| ZI-MAF | 30 min (CPU) / 5 min (MPS) | few hours (MPS) | 8–16 hours (MPS), needs batch tuning |
| ZI-QDNN | 15 min (CPU) / 3 min (MPS) | 1–2 hours (MPS) | 4–8 hours (MPS), lowest memory footprint |

These are coarse and based on library benchmarks + extrapolation. The scale-up experiment's actual measurements are what we commit to.

## Evaluation contract: matched-size comparison

To avoid the "we ran ZI-MAF at 1M and ZI-QRF at 100k and declared a winner" trap, all three stages enforce:

- **Same held-out split** across methods per stage (same 20% records).
- **Same feature set** across methods per stage.
- **Same wall-time budget** for training. (If ZI-QRF hits the budget without converging, that counts as its stage-3 result: "did not finish.")
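
A minimal sketch of the first rule: derive the holdout indices once from a fixed seed and hand the same arrays to every method. The function name, the seed value, and the per-method dictionary are hypothetical; the point is only that the split is computed once per stage and is independent of the method.

```python
import numpy as np

def holdout_split(n_rows, holdout_frac=0.2, seed=0):
    """Return (train_idx, holdout_idx), reproducible for a given seed."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_rows)
    n_holdout = int(round(n_rows * holdout_frac))
    return np.sort(perm[n_holdout:]), np.sort(perm[:n_holdout])

# One split per stage, reused by every method in that stage.
train_idx, holdout_idx = holdout_split(100_000, holdout_frac=0.2, seed=0)
splits = {method: (train_idx, holdout_idx) for method in ("ZI-QRF", "ZI-MAF", "ZI-QDNN")}

print(len(train_idx), len(holdout_idx))
```

Persisting the index arrays next to the results table would let a later rerun confirm it scored the same records.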

Report all three as a single table with method × stage × metric cells. Pick production defaults from this table alone, not from the existing 10k-row benchmark.

## What this experiment would actually update

1. **Production synthesizer default for G1.** Currently implied as ZI-MAF from the small benchmark. Scale-up may confirm or overturn.
2. **SS-model methodology doc's ZI-QDNN production claim.** If ZI-QDNN does not emerge as a clear winner at scale, the doc needs a pointer to this evaluation.
3. **PSID coverage ceiling.** If PSID coverage-0 is a real limitation, the longitudinal-training plan needs a fallback panel source.
4. **Compute budget for production runs.** Knowing that ZI-MAF needs 12 hours MPS at production scale changes how often we can iterate on synthesizer hyperparameters.

## Out of scope (for now)

- Training on real-panel data at scale. The stage-3 experiment uses the cross-section; panel synthesis is a separate scale-up that depends on PSID-coverage diagnosis first.
- Comparing against external non-microplex synthesizers (CTGAN, TVAE, TabDDPM, TabPFN) at full scale. Do after internal best is clear.
- Runtime on GPU clusters. Local laptop numbers first; remote GPU only if production bottleneck demands it.

## Risks to the experiment itself

1. **Retrieving the v6 seed frame requires rerunning donor integration** (~9h) because v6 never persisted. A cheaper alternative: use the enhanced_cps_2024 HDF5 at its native scale (~400k persons × ~250 columns, already close to stage-3 scale) and adapt the donor conditioning.
2. **PRDC in 150D is likely noise.** Budget time for the embedding-based variant before committing to any absolute coverage number.
3. **ZI-QRF may be infeasible at stage 3.** That is itself a finding; have a fallback "QRF on top-20-important-columns" variant ready to report as a scale-constrained baseline.
4. **The existing synthesizers may not even run at stage 3** without code changes (memory bugs at scale). Budget for 1–2 days of debugging on first attempt.

## Minimum useful subset

If full three-stage execution is too costly as a first pass, the minimum that informs the rearchitecture direction is **stage 1 alone**: 100k real-subsample rows × 50 real-feature columns, running all three ZI variants, reporting coverage + runtime + rare-cell preservation.

That alone would invalidate or confirm the small-benchmark conclusions and give us enough signal to pick a G1 default.
