This repository was archived by the owner on Jun 14, 2026. It is now read-only.

Commit ab26608

MaxGhenis and claude committed
Add Quarto paper scaffold with literature survey and main manuscript
paper/
  _quarto.yml            project config, HTML + PDF targets
  AFFILIATION.md         hard rule: Cosilico-only, independent of PolicyEngine
  README.md              build + citation-style notes
  references.bib         37 confirmed BibTeX entries from four parallel lit searches
  literature-review.qmd  standalone survey of tabular synth, calibration, evaluation metrics, and US tax microsim literature
  index.qmd              main manuscript: intro, related work, architecture outline, methods outline, results tables for stage-1 ordering and upstream-bug correction, limitations; Architecture / Methods / Discussion / Conclusion sections marked to-draft
  _output/               quarto build outputs (gitignored)

Four claim axes the paper will defend:
  1. Head-to-head QRF vs neural synth on real US tax microdata (novel cell)
  2. Identity-preserving calibration as explicit architectural requirement (novel framing; precedents cited)
  3. Chained QRF + microcalibrate composition (novel composition; components cited)
  4. Benchmark noise-injection bug diagnosis + upstream fix (real finding, corrected results published)

Cosilico-only affiliation: all author / institutional framing scrubbed of PolicyEngine co-authorship per explicit requirement. PolicyEngine data products and microcalibrate cited as prior work, not co-products.

Quarto renders both files cleanly to HTML (53 KB / 65 KB) with pandoc's default citation style (chicago-author-date); swap in a journal CSL in _quarto.yml once a target venue is chosen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent ddd9ee0 commit ab26608

8 files changed

Lines changed: 891 additions & 0 deletions

File tree

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -5,3 +5,8 @@ artifacts/
 .DS_Store
 __pycache__/
 *.pyc
+
+# Quarto paper build output
+paper/_output/
+paper/*_files/
+.quarto/

paper/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+/.quarto/
+**/*.quarto_ipynb

paper/AFFILIATION.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
1+
# Affiliation and independence — rules for this paper
2+
3+
**Sole affiliation**: Cosilico.
4+
5+
**Not affiliated with PolicyEngine**, for tax and organizational independence reasons. PolicyEngine is cited as prior work and as a benchmark comparator where relevant (e.g., `policyengine-us-data`, Enhanced CPS, `microcalibrate`), but:
6+
7+
- Max Ghenis's affiliation on the author byline is "Cosilico" only.
- No co-authorship with PolicyEngine team members is implied or acknowledged.
- Email is `max@cosilico.ai`, not `max@policyengine.org`.
- Acknowledgments may thank PolicyEngine's published work but must not frame this paper as a joint product.
- Quotes from or comparisons to PE-US-data are framed as "the incumbent public tool we measure against," consistent with how `microplex-us/docs/superseding-policyengine-us-data.md` already treats the relationship.
- Any language in drafts that could read as "built with / in collaboration with PolicyEngine" must be rephrased.
13+
14+
Apply this rule to every section: abstract, introduction, methods, acknowledgments, appendices, captions, and bibliography entries that credit an author affiliation.

paper/README.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
1+
# `microplex-us` paper
2+
3+
Quarto manuscript and supporting materials.
4+
5+
## Affiliation
6+
7+
Cosilico-only. See `AFFILIATION.md` — this work is intentionally independent of PolicyEngine for tax-and-organization reasons.
8+
9+
## Contents
10+
11+
- `_quarto.yml`: project config, HTML + PDF outputs.
- `index.qmd`: main manuscript.
- `literature-review.qmd`: standalone literature survey, cited by the main paper.
- `references.bib`: BibTeX bibliography, confirmed citations only.
- `AFFILIATION.md`: hard rule on affiliation independence. Re-read before adding any acknowledgment or author line.
16+
17+
## Build
18+
19+
```bash
cd paper
quarto render                # both HTML and PDF
quarto render index.qmd      # main paper only
quarto preview               # live-reload local server
```
25+
26+
Output lands in `_output/`.
27+
28+
## Cross-references and figures
29+
30+
Figures and tables are sourced from `../artifacts/` (`stage1_77k_snap.json`, `zi_maf_tuning.json`, `embedding_prdc_compare.json`, `calibrate_on_synthesizer.json`). When final figures land, they should be generated as Quarto chunks rather than hand-placed PNGs so they re-render against the latest artifact set.
31+
32+
## Citation style
33+
34+
Pandoc's default citation style (Chicago author-date) applies, because no CSL file is pinned yet. Set `csl:` in `_quarto.yml` if the target journal has a different requirement.

paper/_quarto.yml

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
project:
  type: default
  output-dir: _output

title: "Identity-preserving synthesis and calibration for US tax-benefit microdata"
author:
  - name: Max Ghenis
    affiliation: Cosilico
    email: max@cosilico.ai

date: last-modified
abstract: |
  Tax and benefit microsimulation depends on synthetic microdata whose accuracy
  must survive both national-scale aggregates and longitudinal extensions.
  We introduce `microplex-us`, a spec-driven US synthesis and calibration
  runtime with three architectural properties: (1) chained quantile-regression-
  forest (QRF) imputation across independent administrative and survey
  sources, (2) identity-preserving gradient-descent chi-squared calibration
  that keeps every record alive through calibration, and (3) sparse L0 record
  selection reserved as an optional post-step for deployment subsamples rather
  than a calibration mainline. We benchmark three zero-inflated synthesizers
  (ZI-QRF, ZI-QDNN, ZI-MAF) on the full PolicyEngine Enhanced CPS 2024 at
  77,006 × 50 scale and find ZI-QRF dominates on PRDC coverage (0.928 vs. 0.707
  for ZI-QDNN and 0.106 for ZI-MAF), with consistent ordering under four
  independent robustness checks. We further document a previously unreported
  noise-injection defect in the `microplex.eval.benchmark` base class that
  systematically biased earlier synthesizer benchmarks on integer-valued
  conditioning variables, and publish corrected results. The paper situates
  these findings in the microsimulation and synthetic-microdata literature,
  identifies where `microplex-us` extends existing techniques, and argues that
  identity preservation is a load-bearing but under-named architectural
  requirement whenever cross-sectional microdata must feed a longitudinal
  policy model.

format:
  html:
    toc: true
    toc-depth: 3
    number-sections: true
    theme: cosmo
    fig-cap-location: bottom
    tbl-cap-location: top
    code-fold: true
  pdf:
    documentclass: article
    geometry:
      - margin=1in
    number-sections: true
    fig-cap-location: bottom
    tbl-cap-location: top

bibliography: references.bib
# csl: chicago-author-date.csl  # opt: pin when a target journal CSL is chosen

execute:
  echo: false
  warning: false
  message: false

filters:
  - quarto

paper/index.qmd

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
---
title: "Identity-preserving synthesis and calibration for US tax-benefit microdata"
short-title: "microplex-us"
author:
  - name: Max Ghenis
    affiliation: Cosilico
    email: max@cosilico.ai
date: last-modified
abstract: |
  Tax and benefit microsimulation depends on synthetic microdata whose accuracy
  must survive both national-scale aggregates and longitudinal extensions. We
  introduce `microplex-us`, a spec-driven US synthesis and calibration runtime
  with three architectural properties: (1) chained quantile-regression-forest
  imputation across independent administrative and survey sources, (2)
  identity-preserving gradient-descent chi-squared calibration that keeps
  every record alive, and (3) sparse L0 record selection reserved as an
  optional post-step rather than a calibration mainline. We benchmark three
  zero-inflated synthesizers on the Enhanced CPS 2024 at 77,006 × 50 scale
  and find ZI-QRF dominates (PRDC coverage 0.928 vs. 0.707 for ZI-QDNN and
  0.106 for ZI-MAF) under four independent robustness checks. We document a
  previously unreported noise-injection defect in a widely used upstream
  benchmark base class that systematically biased earlier synthesizer
  comparisons on integer-valued conditioning variables, and publish corrected
  results.

keywords: [synthetic microdata, survey calibration, microsimulation, tabular
  data synthesis, quantile regression forests, identity-preserving
  calibration]
bibliography: references.bib
format:
  html:
    toc: true
    toc-depth: 3
    number-sections: true
  pdf:
    documentclass: article
    geometry: margin=1in
    number-sections: true
---
40+
41+
# Introduction {#sec-intro}
42+
43+
Tax and benefit microsimulation models rely on microdata that are simultaneously aggregate-accurate (matching IRS Statistics of Income, Census, and administrative targets to tight tolerances) and individually credible (preserving joint structure in incomes, demographics, and wealth). In the US, the available public microdata surfaces — Census's Current Population Survey (CPS), the American Community Survey (ACS), IRS's Statistics of Income Public Use File (PUF), the Survey of Consumer Finances (SCF), and the Survey of Income and Program Participation (SIPP) — each observe only a slice of the variables that an end-to-end tax-benefit simulator requires. Constructing a useful microdata base means combining slices.
44+
45+
The dominant public approach in the US today is the Enhanced CPS [@ghenis2024ecps], which augments CPS ASEC with PUF-imputed tax variables via quantile regression forests and calibrates the result against thousands of IRS, Census, and administrative targets. This paper builds on that lineage rather than making a first attempt at the problem, and it contributes along four axes where the literature is thin:
46+
47+
1. **A spec-driven donor integration runtime** that separates donor-block contracts from backend implementation, allowing independent benchmarking of conditioning, imputer, and entity-projection choices.
2. **Identity-preserving calibration** as an explicit architectural requirement, framed to support longitudinal extensions where records must persist across simulation years.
3. **A head-to-head comparison of QRF-family and neural synthesizers** on real US economic microdata at production scale, a cell of the evaluation matrix that, to our knowledge, no prior published work occupies.
4. **A correction to a benchmark-base-class noise-injection defect** in the upstream `microplex.eval.benchmark` module that had systematically biased earlier synthesizer comparisons on integer-valued conditioning variables.
51+
52+
We do not claim foundational methodological novelty. Every mechanism used below exists in the published literature: quantile regression forests [@meinshausen2006qrf], chained imputation [@vanbuuren2011mice], calibration with range-restricted distances [@deville1992calibration], L0 sparse regularization [@louizos2018l0], and support-based generative evaluation [@naeem2020prdc]. The contribution is in the composition and the empirical evidence that results.
53+
54+
# Background and related work {#sec-related}
55+
56+
A full literature review for this paper is maintained in `literature-review.qmd`. In summary:
57+
58+
Classical survey calibration originates with @deville1992calibration and its generalized-raking extension [@deville1993raking]; range-restricted variants with bounded-positive distance functions guarantee non-negative weights and are reviewed by @haziza2017weights and @kott2016calibration. @devaud2019calibration provides the current treatment of existence conditions.
59+
60+
The synthetic tabular data literature runs from SDV and synthpop [@patki2016sdv; @nowok2016synthpop] through CTGAN/TVAE [@xu2019modeling], TabDDPM [@kotelnikov2023tabddpm], language-model-based approaches [@borisov2023great; @solatorio2023realtabformer], latent-space diffusion [@zhang2024tabsyn], and tabular foundation models [@hollmann2025tabpfn]. Evaluation practice is mapped by benchmarking frameworks including Synthcity [@qian2023synthcity] and is anchored by PRDC metrics [@naeem2020prdc], with documented limitations under heavy tails [@park2023probabilistic] and in high-dimensional feature spaces [@beyer1999nn; @aggarwal2001surprising].
61+
62+
The US tax microsimulation ecosystem is summarized by @toder2024microsim. Alongside Enhanced CPS, it includes TAXSIM [@feenberg1993taxsim], Tax-Calculator [@debacker2019taxcalc], the CBO and Urban-Brookings models, and newer entrants like the Budget Lab at Yale. On synthetic PUF construction, @bowen2022puf is the reference.
63+
64+
Longitudinal microsimulation models, including DYNASIM3 [@favreault2004dynasim], MINT [@smith2013mint], CBOLT [@cbo2018cbolt], and the LIAM2 family [@dementen2014liam2], use dynamic ageing with alignment to external totals. Identity preservation in these pipelines is implicit (records are aged forward, not dropped); we argue for making it explicit in the cross-sectional pipelines that feed them.
65+
66+
# Architecture {#sec-architecture}
67+
68+
*(This section is being written against the `spec-based-ecps-rewire` branch. Concrete subsections to be drafted: source providers, donor blocks as declarative contracts, chained QRF imputation, identity-preserving calibration backend selection, sparse L0 as optional post-step, entity table export.)*
69+
70+
# Benchmark methodology {#sec-methods}
71+
72+
*(Concrete subsections planned: data (enhanced_cps_2024 loaded via entity-broadcast from HDF5), the 50-column curated target-variable set, train/holdout split, PRDC evaluation with sample cap, rare-cell probes, per-column zero-rate breakdown, robustness checks via embedding-PRDC, hyperparameter sensitivity, calibrate-on-synthesizer follow-up.)*
73+
74+
# Results {#sec-results}
75+
76+
## Cross-section synthesizer ordering
77+
78+
On the real Enhanced CPS data (77,006 rows × 50 columns), with a matched 80/20 train/holdout split (seed 42) and PRDC capped at 15,000 samples in each comparison:
79+
80+
| Method   | Coverage  | Precision | Density | Fit (s) | Peak RSS (GB) | Zero-rate MAE |
|----------|----------:|----------:|--------:|--------:|--------------:|--------------:|
| ZI-QRF   | **0.928** | 0.910     | 0.885   | 37.0    | 6.0           | 0.013         |
| ZI-QDNN  | 0.707     | 0.835     | 0.664   | 105.5   | 11.0          | 0.136         |
| ZI-MAF   | 0.106     | 0.036     | 0.025   | 227.0   | 11.0          | 0.083         |
85+
86+
Ordering is preserved under four independent robustness checks: raw 50-dimensional PRDC at 40k, raw 50-dimensional PRDC at 77k, 16-dimensional learned-autoencoder-embedding PRDC at 40k, and weighted-aggregate relative error under subsequent calibration. Expanding the ZI-MAF hyperparameters (from 4 layers × 32 hidden units × 50 epochs to 8 layers × 128 hidden units × 200 epochs, a 14× increase in compute budget) moves ZI-MAF coverage from 0.026 to 0.033, a relative improvement of roughly 27% that still leaves at least a 10× gap to ZI-QRF.
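
For readers unfamiliar with the metric, coverage in the sense of [@naeem2020prdc] is the share of real records whose k-nearest-neighbor ball contains at least one synthetic record. The sketch below is a minimal NumPy illustration under stated assumptions, not the benchmark's implementation: the function names and the default `k=5` are chosen for the example, and the dense distance matrices only suit small samples.

```python
import numpy as np


def _pairwise_dist(a, b):
    """Euclidean distances between rows of a and rows of b."""
    sq = (a**2).sum(axis=1)[:, None] + (b**2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from round-off


def prdc_coverage(real, fake, k=5):
    """Fraction of real rows with a synthetic row inside their k-NN ball.

    The ball radius for each real row is its distance to the k-th nearest
    *other* real row (illustrative helper; name and default k are assumptions).
    """
    real = np.asarray(real, dtype=float)
    fake = np.asarray(fake, dtype=float)
    d_rr = _pairwise_dist(real, real)
    # After sorting, column 0 is the self-distance, so column k is the k-th neighbor.
    radii = np.sort(d_rr, axis=1)[:, k]
    nearest_fake = _pairwise_dist(real, fake).min(axis=1)
    return float(np.mean(nearest_fake < radii))


rng = np.random.default_rng(0)
real = rng.normal(size=(300, 5))
same_dist = rng.normal(size=(300, 5))      # synthetic data from the same distribution
shifted = rng.normal(size=(300, 5)) + 100  # synthetic data far from the real support
print(prdc_coverage(real, same_dist), prdc_coverage(real, shifted))
```

Because coverage counts real records that are "reached" by synthetic ones, it rewards a synthesizer for spanning the real support rather than for concentrating in its densest region.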
87+
88+
## Upstream benchmark defect and correction
89+
90+
During this work we identified a noise-injection defect in `microplex.eval.benchmark._MultiSourceBase.generate`. The routine added σ = 0.1 Gaussian noise to every shared-column value before per-column regeneration, including binary and categorical conditioning variables (`is_female`, `is_military`, `state_fips`, `cps_race`, etc.). Pre-fix, synthetic values never matched the training pool's discrete support on these variables; per-column zero-rate diagnostics appeared broken for every method simultaneously, because `is_military = 1` became continuous floats like `1.04`. The fix detects integer-valued training columns and skips noise injection for them.
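
The shape of the fix can be sketched as follows. This is an illustration of the idea described above, not the upstream `microplex` code: the function name `add_conditioning_noise`, its signature, and the integer-valued test are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def add_conditioning_noise(shared, train, sigma=0.1, rng=None):
    """Add Gaussian noise to shared columns, skipping integer-valued ones.

    Hypothetical helper: a column counts as integer-valued when every
    training value is finite and equal to its rounded value, which covers
    binary flags and integer-coded categoricals.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = shared.copy()
    for col in out.columns:
        values = train[col].to_numpy(dtype=float)
        integer_valued = bool(
            np.all(np.isfinite(values)) and np.all(values == np.round(values))
        )
        if integer_valued:
            continue  # keep the discrete support intact
        out[col] = out[col].astype(float) + rng.normal(0.0, sigma, size=len(out))
    return out


train = pd.DataFrame(
    {"is_military": [0, 1, 0, 1], "state_fips": [6, 36, 48, 6], "wage": [10.5, 20.25, 0.0, 31.1]}
)
noisy = add_conditioning_noise(train, train, rng=np.random.default_rng(0))
print(noisy)
```

In this sketch `is_military` and `state_fips` pass through unchanged, so downstream zero-rate and support checks compare like with like, while the continuous `wage` column still receives noise.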
91+
92+
Pre-fix vs. post-fix PRDC coverage on matched runs:
93+
94+
| Method  | Pre-fix | Post-fix |      Δ |
|---------|--------:|---------:|-------:|
| ZI-QRF  |   0.256 |    0.928 | +0.672 |
| ZI-QDNN |   0.147 |    0.707 | +0.560 |
| ZI-MAF  |   0.014 |    0.106 | +0.092 |
99+
100+
Ordering is preserved across the fix; absolute numbers are meaningfully higher. Earlier published synthesizer benchmarks that used the same base class report low PRDC coverages against real data; those figures should be treated as lower bounds rather than ground-truth measurements. The fix is merged upstream.
101+
102+
## Rare-cell preservation
103+
104+
*(To be populated with the per-rare-cell ratio table from `artifacts/stage1_40k_all.jsonl` including `elderly_self_employed`, `young_dividend`, `disabled_ssdi`, `top_1pct_employment`.)*
105+
106+
## Calibration on synthesizer output
107+
108+
Identity-preserving gradient-descent chi-squared calibration applied to the 36 target-column sums of each synthesizer's output, with holdout totals as targets:
109+
110+
| Method   | Pre-cal mean rel. err. | Post-cal mean rel. err. |
|----------|-----------------------:|------------------------:|
| ZI-QRF   |                  0.256 |                   0.141 |
| ZI-QDNN  |                  0.388 |                   0.327 |
| ZI-MAF   |                  17.98 |                   15.08 |
115+
116+
Calibration refines structurally sound synthesizer output; it cannot rescue a broken one.
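
To make "identity-preserving" concrete, the sketch below calibrates weights by gradient descent while keeping every record's weight strictly positive. It is a toy under stated assumptions rather than the calibration backend used for the table: the loss is mean squared relative error on the target totals (used here as a stand-in, since this section does not spell out the chi-squared loss), positivity comes from a log-weight parameterization, and the step size uses a simple curvature bound. The function name and all parameters are hypothetical.

```python
import numpy as np


def calibrate_identity_preserving(X, targets, w0, steps=2000):
    """Reweight records to hit column totals without dropping any record.

    Weights are w = w0 * exp(u), so they stay strictly positive and the
    output has one weight per input record (assumed parameterization).
    """
    X = np.asarray(X, dtype=float)
    targets = np.asarray(targets, dtype=float)
    w0 = np.asarray(w0, dtype=float)
    Z = X / targets  # scaled so that Z.T @ w equals 1 at a perfect fit
    m = len(targets)
    u = np.zeros_like(w0)
    for _ in range(steps):
        w = w0 * np.exp(u)
        rel = Z.T @ w - 1.0           # relative error per target
        A = Z * w[:, None]            # d rel_j / d u_i, shape (n, m)
        grad = (2.0 / m) * (A @ rel)  # gradient of mean(rel**2) w.r.t. u
        # Frobenius-norm bound on the Gauss-Newton curvature gives a safe step.
        lipschitz = (2.0 / m) * np.sum(A**2) + 1e-12
        u -= grad / lipschitz
    return w0 * np.exp(u)


rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.uniform(size=200)])  # count and one amount column
targets = X.T @ rng.uniform(0.5, 1.5, size=200)             # feasible target totals
w = calibrate_identity_preserving(X, targets, np.ones(200))
print(np.abs(X.T @ w / targets - 1.0).max(), w.min())
```

The design choice that matters for longitudinal use is the parameterization: because no weight can reach zero, record identities survive calibration, in contrast to an L0-style selection step that removes records.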
117+
118+
# Discussion {#sec-discussion}
119+
120+
*(To be drafted. Key themes: why QRF dominance on heavy-tailed conditional distributions is expected theoretically; interpretation of the ZI-MAF collapse with hyperparameter expansion; limits of PRDC in high dimensions; the calibrate-on-synth finding as practical guidance.)*
121+
122+
# Limitations {#sec-limits}
123+
124+
The cross-section benchmark uses PolicyEngine's Enhanced CPS as both the input substrate and the source of held-out evaluation samples; it is not a test of generalization across CPS vintages. The 77k-record scale is more than an order of magnitude (roughly 20×) below production-scale local-area microdata (~1.5M households). Nearest-neighbor distances are known to concentrate in 50 dimensions, which weakens PRDC coverage as a discriminator; we report robustness to a learned-embedding variant but do not establish invariance to all reasonable metric choices. ZI-MAF and ZI-QDNN hyperparameters were fixed to method-class defaults, with one follow-up expansion sweep on ZI-MAF that did not close the gap; a full NAS-style search could find configurations we did not. Longitudinal accuracy claims are architectural rather than empirical in this paper; the evaluation of identity-preserving calibration across simulated years is deferred to a companion paper.
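
The high-dimensional caveat can be illustrated with a small experiment in the spirit of [@beyer1999nn]. The sketch uses uniform random points, which is an assumption for illustration only and says nothing about the Enhanced CPS columns; the function name and sample size are likewise hypothetical.

```python
import numpy as np


def relative_contrast(dim, n=500, seed=0):
    """(farthest - nearest) / nearest distance from one query to n random points."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n, dim))
    query = rng.uniform(size=dim)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()


# Contrast shrinks as dimension grows, so "nearest" becomes less informative.
for dim in (2, 10, 50):
    print(dim, round(relative_contrast(dim), 2))
```

When the nearest and farthest neighbors sit at similar distances, the k-NN balls behind PRDC carry less information, which is why the embedding-based check is reported alongside the raw 50-dimensional metric.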
125+
126+
# Conclusion {#sec-conclusion}
127+
128+
*(To be drafted after Results is complete.)*
129+
130+
# Acknowledgments {-}
131+
132+
The empirical work benefited from access to public data products maintained by the US Census Bureau (CPS ASEC, ACS, SIPP), the Internal Revenue Service (Statistics of Income Public Use File), and the Federal Reserve Board (SCF). Where data-loading and entity-table construction code from the open-source `policyengine-us-data` project is used as a reference, it is cited in the methods section; this paper is independent research not conducted in collaboration with PolicyEngine.
133+
134+
# References {-}
