-Tax-benefit microsimulation models typically operate at the national level, using household survey weights calibrated to aggregate population targets. Subnational analysis---at the level of states, congressional districts, or local authorities---requires datasets that simultaneously satisfy geographic distributional constraints while preserving household-level detail. We present a method based on $L_0$ regularization that jointly optimizes survey weight magnitudes and sparsity to produce calibrated subnational microsimulation datasets.
+Subnational microsimulation requires survey microdata that reproduce administrative totals across
+nested geographies while remaining usable in a policy model. In the United States, that means
+calibrating mixed count and dollar targets for district-level units, states, and the nation from a
+single microdata pipeline. Classical calibration methods provide important reference points, but
+they do not naturally cover the full production problem: generalized regression (GREG) can
+produce negative weights and becomes difficult to use in very large, collinear systems, while
+iterative proportional fitting (IPF, or raking) is most natural for count-style margins.

-Our approach builds on the Hard Concrete distribution \citep{louizos2018}, which induces exact sparsity by multiplying each household's weight by a learned stochastic gate that collapses to a deterministic zero or one at inference time. We parameterize each gate with a log-alpha and temperature parameter, and jointly optimize these alongside log-transformed weight magnitudes using a single loss function combining scale-invariant relative calibration error, an $L_0$ sparsity penalty on the expected count of active households, and a light $L_2$ regularizer on weight magnitudes.
-
-The pipeline begins with the US Current Population Survey. Each household record is cloned multiple times and assigned to random census blocks drawn from a population-weighted distribution. Program participation indicators are re-randomized per geographic assignment using local take-up rates. Each clone is then run through \policyengine{}'s tax-benefit microsimulation engine to generate geography-specific outputs. The $L_0$ optimizer selects which household-geography combinations to retain, calibrating simultaneously against approximately 37,800 targets across three geographic levels. The sparsity penalty is configurable: a higher penalty produces a compact national dataset of approximately 50,000 records, while a lower penalty yields a larger dataset of approximately 3--4 million records covering all 436 congressional districts and 50 states individually. The method is implemented as the open-source \texttt{l0-python} PyTorch package.
+We present an $L_0$-regularized calibration pipeline built in PolicyEngine's US data workflow. The
+pipeline clones CPS households across sampled geographies, constructs a sparse calibration matrix
+from tax-benefit simulations, and jointly optimizes positive weights and Hard Concrete gates. The
+gates make sparsity explicit, so the same framework can support compact national datasets and
+larger subnational datasets. The empirical sections benchmark $L_0$ against GREG and IPF on shared
+exported calibration packages, moving from a tractable comparison tier to a scaling frontier and a
-The calibration pipeline draws targets from seven administrative sources across three geographic levels. The database contains 37,758 active targets in total. The following tables list every target domain included in the \texttt{policy\_data.db} database, grouped by geographic level.
+The calibration pipeline draws targets from seven administrative sources across three geographic
+levels. The pipeline stores these targets in a target database, \texttt{policy\_data.db}, which
+contains 37,758 active targets in total. The following tables list every target domain included in
+that database, grouped by geographic level.

-\subsection{Congressional district targets (33,572)}
+\subsection{District-level targets (33,572)}

 \begin{table}[H]
 \centering
@@ -47,24 +51,26 @@ \subsection{Congressional district targets (33,572)}
 Target domain & Type & Count \\
 \midrule
 \multicolumn{3}{l}{\textit{Census ACS S0101}} \\
-Person count by age band (18 bands $\times$ 436 CDs) & count & 7{,}848 \\
+Person count by age band (18 bands $\times$ 436 district-level units) & count & 7{,}848 \\
 \midrule
 \multicolumn{3}{l}{\textit{IRS SOI}} \\
-Person count by AGI bracket (9 bins $\times$ 436 CDs) & count & 3{,}924 \\
-EITC dollars by qualifying children (4 bins $\times$ 436 CDs) & \$ & 1{,}744 \\
-Tax unit count by qualifying children (4 bins $\times$ 436 CDs) & count & 1{,}744 \\
-\caption{Congressional district calibration targets (436 CDs). Each row is replicated across all 436 districts. IRS SOI provides paired dollar and count targets for each income/deduction domain.}
+\caption{District-level calibration targets. The 436 district-level units correspond to the 435
+congressional districts plus the District of Columbia. IRS SOI provides paired dollar and count
-State income tax collections ($\times$51 states) & \$ & 51 \\
+State income tax collections ($\times$50 states + DC) & \$ & 51 \\
 \midrule
 & & \textbf{4{,}080} \\
 \bottomrule
 \end{tabular}
 }
-\caption{State-level calibration targets (50 states + DC). IRS SOI variables mirror the district structure. USDA provides both SNAP spending and household counts; CMS provides Medicaid enrollment.}
+\caption{State-level calibration targets for the 50 states plus the District of Columbia. IRS SOI
+variables mirror the district-level structure. USDA provides both SNAP spending and household
-\caption{National-level calibration targets. CBO, JCT, SSA, CMS, and Census values are curated from the cited administrative sources and stored in the ETL pipeline. Dollar values are inflation-adjusted to the calibration year.}
+\caption{National-level calibration targets. CBO, JCT, SSA, CMS, and Census values are curated from
+the cited administrative sources and stored in the ETL pipeline. Dollar values are
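Note on the abstract text touched by this diff: both versions describe weights multiplied by Hard Concrete gates and a loss combining relative calibration error, an $L_0$ penalty on the expected number of active records, and a light $L_2$ term. The following is a minimal pure-Python sketch of that idea, not the `l0-python` API. All function names are hypothetical. The stretch limits and temperature are the defaults from the Hard Concrete paper and are assumed here, not taken from the pipeline. The squared form of the relative error is likewise an assumption. For simplicity the sketch evaluates the loss with the deterministic inference-time gate, whereas training would sample stochastic gates.

```python
import math

# Stretch limits and temperature: defaults from the Hard Concrete paper,
# assumed here rather than read from the pipeline configuration.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def deterministic_gate(log_alpha: float) -> float:
    """Inference-time gate: stretched sigmoid clipped to [0, 1]."""
    stretched = sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA
    return min(1.0, max(0.0, stretched))


def prob_active(log_alpha: float) -> float:
    """Probability that the gate is nonzero (one term of the expected L0 count)."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def loss(log_w, log_alpha, matrix, targets, lam_l0, lam_l2):
    """Relative calibration error + L0 penalty + L2 penalty (hypothetical form).

    matrix[j][i] is record i's contribution to target j; targets are assumed nonzero.
    """
    # Positive weight magnitudes come from exponentiating log-weights.
    weights = [math.exp(lw) * deterministic_gate(la)
               for lw, la in zip(log_w, log_alpha)]
    calib = 0.0
    for row, target in zip(matrix, targets):
        estimate = sum(m * w for m, w in zip(row, weights))
        calib += ((estimate - target) / target) ** 2  # scale-invariant error
    calib /= len(targets)
    expected_l0 = sum(prob_active(la) for la in log_alpha)
    l2 = sum(math.exp(lw) ** 2 for lw in log_w)
    return calib + lam_l0 * expected_l0 + lam_l2 * l2
```

With a large positive `log_alpha` the gate clips to exactly one and the record is kept; with a large negative value it clips to exactly zero and the record drops out of the dataset. Raising `lam_l0` pushes more gates toward zero, which is how a single framework could yield either a compact or a large dataset.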