Commit 84a5614

Merge pull request #644 from PolicyEngine/maria/methodology-docs
Update pipeline documentation, both public facing and internal
2 parents 93f1acc + 05a5c6f commit 84a5614

24 files changed

Lines changed: 4872 additions & 1406 deletions

Makefile

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ all: data test
 
 format:
 	ruff format .
+	mdformat --wrap 100 docs/
 
 test:
 	pytest
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Add `docs/internals/` developer reference: three notebooks covering all nine pipeline stages (Stage 1 data build, Stage 2 calibration matrix assembly, Stages 3–4 L0 optimization and H5 assembly) plus a README with pipeline orchestration reference, run ID format, Modal volume layout, and HuggingFace artifact paths.
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Update public-facing methodology and data documentation to reflect the current pipeline implementation; pipeline now uploads validation diagnostics to HuggingFace after H5 builds complete.

docs/README.md

Lines changed: 7 additions & 1 deletion
@@ -5,10 +5,12 @@ This project uses [MyST Markdown](https://mystmd.org/) for documentation.
 ## Building Locally
 
 ### Requirements
+
 - Python 3.14+ with dev dependencies: `uv pip install -e .[dev] --system`
 - Node.js 20+ (required by MyST)
 
 ### Commands
+
 ```bash
 make documentation # Build static HTML files
 make documentation-serve # Serve locally on http://localhost:8080
@@ -21,7 +23,8 @@ make documentation-serve # Serve locally on http://localhost:8080
 - `_build/html/` - **Static HTML files (use for GitHub Pages deployment)**
 - `_build/site/` - Dynamic content for `myst start` development server only
 
-**GitHub Pages must deploy `_build/html/`**, not `_build/site/`. The `_build/site/` directory contains JSON files for MyST's development server and will result in a blank page on GitHub Pages.
+**GitHub Pages must deploy `_build/html/`**, not `_build/site/`. The `_build/site/` directory
+contains JSON files for MyST's development server and will result in a blank page on GitHub Pages.
 
 ## GitHub Pages Deployment
 
@@ -33,14 +36,17 @@ make documentation-serve # Serve locally on http://localhost:8080
 ## Troubleshooting
 
 **Blank page after deployment:**
+
 - Check that workflow deploys `folder: docs/_build/html` (not `_build/site`)
 - Wait 5-10 minutes for GitHub Pages propagation
 - Hard refresh browser (Ctrl+Shift+R / Cmd+Shift+R)
 
 **Build fails in CI:**
+
 - Ensure Node.js setup step exists in workflow (MyST requires Node.js)
 - Never add timeouts or `|| true` to build commands - they mask failures
 
 **Missing index.html:**
+
 - MyST auto-generates index.html in `_build/html/`
 - Do not create manual index.html in docs/

docs/abstract.md

Lines changed: 14 additions & 9 deletions
@@ -1,11 +1,16 @@
 # Abstract
 
-We present a methodology for creating enhanced microsimulation datasets by combining the
-Current Population Survey (CPS) with the IRS Public Use File (PUF). Our approach uses
-quantile regression forests to impute 67 tax variables from the PUF onto CPS records,
-preserving distributional characteristics while maintaining household composition and member
-relationships. The imputation process alone does not guarantee consistency with official
-statistics, necessitating a reweighting step to align the combined dataset with known
-population totals and administrative benchmarks. We apply a reweighting algorithm that calibrates the dataset to 2,813 targets from the IRS Statistics of Income, Census population projections, Congressional Budget Office benefit program estimates, Treasury expenditure data, Joint Committee on Taxation tax expenditure estimates, healthcare spending patterns, and other benefit program costs. The reweighting employs dropout-regularized gradient descent optimization to ensure consistency with administrative benchmarks. The dataset maintains the CPS's demographic detail and geographic granularity while
-incorporating tax reporting data from administrative sources. We release the enhanced
-dataset, source code, and documentation to support policy analysis.
+We present a methodology for creating enhanced microsimulation datasets by combining the Current
+Population Survey (CPS) with the IRS Public Use File (PUF). Our approach uses quantile regression
+forests to impute 67 tax variables from the PUF onto CPS records, preserving distributional
+characteristics while maintaining household composition and member relationships. The imputation
+process alone does not guarantee consistency with official statistics, necessitating a reweighting
+step to align the combined dataset with known population totals and administrative benchmarks. We
+apply a reweighting algorithm that calibrates the dataset to 2,813 targets from the IRS Statistics
+of Income, Census population projections, Congressional Budget Office benefit program estimates,
+Treasury expenditure data, Joint Committee on Taxation tax expenditure estimates, healthcare
+spending patterns, and other benefit program costs. The reweighting employs dropout-regularized
+gradient descent optimization to ensure consistency with administrative benchmarks. The dataset
+maintains the CPS's demographic detail and geographic granularity while incorporating tax reporting
+data from administrative sources. We release the enhanced dataset, source code, and documentation to
+support policy analysis.

docs/appendix.md

Lines changed: 8 additions & 2 deletions
@@ -4,7 +4,8 @@
 
 ### A.1 Quantile Regression Forest Implementation
 
-The following code demonstrates the implementation of Quantile Regression Forests for variable imputation:
+The following code demonstrates the implementation of Quantile Regression Forests for variable
+imputation:
 
 ```python
 from quantile_forest import RandomForestQuantileRegressor
@@ -49,6 +50,7 @@ for iteration in range(5000):
 #### Variables Imputed from IRS Public Use File (67 variables)
 
 **Income Variables:**
+
 - employment_income
 - partnership_s_corp_income
 - social_security
@@ -75,6 +77,7 @@ for iteration in range(5000):
 - salt_refund_income
 
 **Deductions and Adjustments:**
+
 - interest_deduction
 - unreimbursed_business_employee_expenses
 - pre_tax_contributions
@@ -92,6 +95,7 @@ for iteration in range(5000):
 - deductible_mortgage_interest
 
 **Tax Credits:**
+
 - cdcc_relevant_expenses
 - foreign_tax_credit
 - american_opportunity_credit
@@ -104,6 +108,7 @@ for iteration in range(5000):
 - other_credits
 
 **Qualified Business Income Variables:**
+
 - w2_wages_from_qualified_business
 - unadjusted_basis_qualified_property
 - business_is_sstb
@@ -118,6 +123,7 @@ for iteration in range(5000):
 - self_employment_income_would_be_qualified
 
 **Other Tax Variables:**
+
 - traditional_ira_contributions
 - qualified_tuition_expenses
 - casualty_loss
@@ -137,4 +143,4 @@ for iteration in range(5000):
 #### Variables Imputed from American Community Survey (2 variables)
 
 - rent
-- real_estate_taxes
+- real_estate_taxes

docs/background.md

Lines changed: 26 additions & 6 deletions
@@ -2,16 +2,36 @@
 
 ## The Microsimulation Landscape
 
-Tax and benefit microsimulation models play a role in policy analysis by projecting the distributional and revenue impacts of proposed reforms. Institutions maintaining these models include government agencies like the Congressional Budget Office (CBO), Joint Committee on Taxation (JCT), and Treasury's Office of Tax Analysis (OTA), as well as non-governmental organizations including the Urban-Brookings Tax Policy Center (TPC), Tax Foundation, Penn Wharton Budget Model (PWBM), Institute on Taxation and Economic Policy (ITEP), Yale Budget Lab, and the open-source Policy Simulation Library (PSL). Each model serves specific institutional needs but faces common data challenges.
-
-The core challenges these models face stem from the tradeoff between data comprehensiveness and accessibility. Administrative tax data provides income reporting but lacks the household context that models need to analyze benefit programs and family-level impacts {cite:p}`sabelhaus2020`. Survey data captures household relationships and program participation but suffers from income underreporting that worsens at higher income levels {cite:p}`meyer2021`. The need to protect taxpayer privacy limits data availability because administrators cannot publicly release microdata.
+Tax and benefit microsimulation models play a role in policy analysis by projecting the
+distributional and revenue impacts of proposed reforms. Institutions maintaining these models
+include government agencies like the Congressional Budget Office (CBO), Joint Committee on Taxation
+(JCT), and Treasury's Office of Tax Analysis (OTA), as well as non-governmental organizations
+including the Urban-Brookings Tax Policy Center (TPC), Tax Foundation, Penn Wharton Budget Model
+(PWBM), Institute on Taxation and Economic Policy (ITEP), Yale Budget Lab, and the open-source
+Policy Simulation Library (PSL). Each model serves specific institutional needs but faces common
+data challenges.
+
+The core challenges these models face stem from the tradeoff between data comprehensiveness and
+accessibility. Administrative tax data provides income reporting but lacks the household context
+that models need to analyze benefit programs and family-level impacts {cite:p}`sabelhaus2020`.
+Survey data captures household relationships and program participation but suffers from income
+underreporting that worsens at higher income levels {cite:p}`meyer2021`. The need to protect
+taxpayer privacy limits data availability because administrators cannot publicly release microdata.
 
 ## Data Enhancement Approaches
 
 Different microsimulation models use various approaches to enhance their underlying data:
 
-Government models (CBO, JCT, Treasury) have access to confidential administrative data but cannot share their enhanced microdata. Non-governmental models work with public data, leading to various enhancement strategies. Some organizations use proprietary extracts of tax returns, while others enhance survey data with various methods.
+Government models (CBO, JCT, Treasury) have access to confidential administrative data but cannot
+share their enhanced microdata. Non-governmental models work with public data, leading to various
+enhancement strategies. Some organizations use proprietary extracts of tax returns, while others
+enhance survey data with various methods.
 
-Our enhanced dataset provides an open-source methodology with state identifiers and calibration to state-level targets. This enables analysis of federal-state tax interactions. Researchers can use the dataset with PolicyEngine or other microsimulation models.
+Our enhanced dataset provides an open-source methodology with state identifiers and calibration to
+state-level targets. This enables analysis of federal-state tax interactions. Researchers can use
+the dataset with PolicyEngine or other microsimulation models.
 
-The open-source nature promotes methodological transparency. The modular design allows researchers to substitute alternative imputation or calibration methods while maintaining the overall framework. Regular updates as new CPS and administrative data become available ensure the dataset remains current.
+The open-source nature promotes methodological transparency. The modular design allows researchers
+to substitute alternative imputation or calibration methods while maintaining the overall framework.
+Regular updates as new CPS and administrative data become available ensure the dataset remains
+current.
