Commit 8a515ab

Add dataset READMEs with codebook links for CPS, SIPP, SCF, and ORG (#793)
Adds README.md files to policyengine_us_data/datasets/{cps,sipp,scf,org}/ linking to the canonical codebooks, data dictionaries, and landing pages for each underlying public-use microdata source:

- CPS ASEC: Census Bureau ASEC data dictionaries (2023-2025).
- SIPP: Census Bureau 2023 data dictionary and user's guide.
- SCF: Federal Reserve Board 2016/2019/2022 codebooks and summary-extract macro.
- ORG: Census Bureau 2024 basic monthly record layout and documentation landing pages.

Each README also describes what the folder's Python modules do and where the raw data come from, mirroring the style of the existing acs/README.md and puf/README.md.

Closes #221.
1 parent 62e8d78 commit 8a515ab

5 files changed

Lines changed: 160 additions & 0 deletions

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
Add README files with codebook and documentation links to the cps, sipp, scf, and org dataset folders.
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
# Current Population Survey (CPS ASEC)

This folder contains the tooling that ingests the Census Bureau's Current
Population Survey Annual Social and Economic Supplement (CPS ASEC) into
PolicyEngine's US microdata pipeline (`census_cps.py`, `cps.py`,
`enhanced_cps.py`, `extended_cps.py`, `small_enhanced_cps.py`, `takeup.py`,
and `tipped_occupation.py`).

The CPS ASEC is the Census Bureau / Bureau of Labor Statistics' primary
source of annual demographic and income data for the US civilian
noninstitutional population. PolicyEngine uses it as the demographic
backbone of the Enhanced CPS; tax-return detail from the IRS PUF is then
merged onto each CPS record.

## Documentation

The Census Bureau publishes a data dictionary and technical documentation
for each ASEC vintage. These are the canonical reference for every
variable name, code, and SPM/tax-unit construction used by the code in
this folder:

- [2023 ASEC data dictionary (full PDF)](https://www2.census.gov/programs-surveys/cps/datasets/2023/march/asec2023_ddl_pub_full.pdf)
- [2024 ASEC data dictionary (full PDF)](https://www2.census.gov/programs-surveys/cps/datasets/2024/march/asec2024_ddl_pub_full.pdf)
- [2025 ASEC data dictionary (full PDF)](https://www2.census.gov/programs-surveys/cps/datasets/2025/march/asec2025_ddl_pub_full.pdf)

See also:

- [CPS ASEC landing page](https://www.census.gov/programs-surveys/cps.html)
- [CPS ASEC technical documentation](https://www.census.gov/programs-surveys/cps/technical-documentation.html)
- [CPS ASEC public-use microdata datasets](https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.html)

The exact Census URLs the pipeline downloads for each ASEC year are
enumerated in `CPS_URL_BY_YEAR` inside `census_cps.py`.
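
The three data-dictionary links in the Documentation list share one URL pattern, which the sketch below captures. It is illustrative only: `asec_data_dictionary_url`, `ASEC_DDL_URL_TEMPLATE`, and `LINKED_VINTAGES` are hypothetical names that do not exist in `census_cps.py`, and the sketch says nothing about the download URLs held in `CPS_URL_BY_YEAR`, whose contents are not shown here. Other vintages may not follow the same pattern, so the helper refuses them.

```python
# Hypothetical helper for illustration; not part of census_cps.py.
ASEC_DDL_URL_TEMPLATE = (
    "https://www2.census.gov/programs-surveys/cps/datasets/"
    "{year}/march/asec{year}_ddl_pub_full.pdf"
)

# Only the vintages linked in this README are known to follow the pattern.
LINKED_VINTAGES = (2023, 2024, 2025)


def asec_data_dictionary_url(year: int) -> str:
    """Return the data-dictionary URL for one of the linked ASEC vintages."""
    if year not in LINKED_VINTAGES:
        raise ValueError(f"No data dictionary is linked for ASEC {year}")
    return ASEC_DDL_URL_TEMPLATE.format(year=year)


print(asec_data_dictionary_url(2024))
```

Rejecting unlisted years keeps the helper from producing a plausible-looking URL that was never checked.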

## Data products in this folder

- `census_cps.py` — downloads and stages the raw ASEC
person/family/household tables from Census for a given ASEC year.
- `cps.py` — derives the PolicyEngine `CPS` dataset (PolicyEngine variable
names, entity structure, SPM units, tax units) from the Census tables.
- `enhanced_cps.py`, `extended_cps.py`, `small_enhanced_cps.py` —
downstream enhanced datasets that merge PUF-based tax-return detail and
imputed variables onto the CPS backbone.
- `takeup.py` — program take-up anchoring against reported CPS recipiency.
- `tipped_occupation.py` — Treasury tipped-occupation code derivation.
- `imputation_parameters.yaml` — hyperparameters for the quantile
regression forest (QRF) imputations used by the enhanced CPS pipeline.
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
# CPS Outgoing Rotation Group (ORG)

This folder contains the tooling that builds a labor-market donor frame
from the CPS basic monthly public-use files (`org.py`).

The CPS Outgoing Rotation Group (ORG) earnings questions are asked only
of the one-quarter of the sample that is rotating out in a given month.
Pooling the twelve monthly ORG samples for a year yields a donor frame
PolicyEngine uses to impute wage, hourly-pay, and union variables onto
the CPS ASEC records.

The checked-in code does not vendor the donor file itself. Instead,
`org.py` builds `census_cps_org_2024_wages.csv.gz` on demand by
downloading the twelve official CPS basic monthly public-use CSVs for
`ORG_YEAR` (currently 2024) directly from the Census Bureau and filtering
each file to the ORG rotations.

## Documentation

The Census Bureau and BLS publish a data dictionary and users' guide for
the CPS basic monthly public-use microdata. These are the canonical
reference for every variable name and earnings-recipiency code used by
the code in this folder:

- [2024 CPS basic monthly public-use record layout (TXT)](https://www2.census.gov/programs-surveys/cps/datasets/2024/basic/2024_Basic_CPS_Public_Use_Record_Layout_plus_IO_Code_list.txt)
- [CPS basic monthly documentation landing page](https://www.census.gov/data/datasets/time-series/demo/cps/cps-basic.html)

See also:

- [CPS technical documentation](https://www.census.gov/programs-surveys/cps/technical-documentation.html)

## Data products in this folder

- `org.py` — downloads the twelve monthly CSVs, filters to the MIS-4 and
MIS-8 outgoing rotations (`HRMIS`), and caches the combined ORG donor
frame. It also trains a quantile regression forest (QRF) model to impute
`wage_income`, `hourly_wage`, and union-coverage variables onto the CPS
ASEC records used by the Enhanced CPS pipeline.
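
The filter-then-pool step described above can be sketched with the standard library alone. This is a minimal illustration, not the code in `org.py`: `filter_org_rows` and `pool_months` are hypothetical names, the toy CSV strings stand in for the downloaded monthly files, and `toy_weekly_earnings` is a made-up column. Only the `HRMIS` column name and the month-in-sample values 4 and 8 come from the description above.

```python
import csv
import io

# Month-in-sample values for the outgoing rotations (MIS-4 and MIS-8).
ORG_ROTATIONS = {"4", "8"}


def filter_org_rows(csv_text: str) -> list[dict]:
    """Keep only the rows whose HRMIS marks an outgoing rotation."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["HRMIS"].strip() in ORG_ROTATIONS]


def pool_months(monthly_csv_texts: list[str]) -> list[dict]:
    """Filter each monthly file, then stack the survivors into one frame."""
    pooled = []
    for text in monthly_csv_texts:
        pooled.extend(filter_org_rows(text))
    return pooled


# Toy stand-ins for two monthly files; real files have many more columns.
january = "HRMIS,toy_weekly_earnings\n1,0\n4,850\n8,1200\n"
february = "HRMIS,toy_weekly_earnings\n4,900\n5,0\n"

donor = pool_months([january, february])
print(len(donor))  # 3 rows survive: two from January, one from February
```

Filtering each month before pooling keeps memory use proportional to the ORG subsample rather than to the full monthly files.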
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
# Survey of Consumer Finances (SCF)

This folder contains the tooling that ingests the Federal Reserve Board's
Survey of Consumer Finances (SCF) summary extract into PolicyEngine's US
microdata pipeline (`fed_scf.py`, `scf.py`).

The SCF is the Fed's triennial household-level survey of wealth, debt,
and income. PolicyEngine uses the summary extract to inform net-worth
and asset-related calibration targets.

## Documentation

The Federal Reserve Board publishes a codebook for each SCF survey wave
describing every summary variable, derivation, and weight. These are the
canonical reference for the code in this folder:

- [2022 SCF main-survey codebook (TXT)](https://www.federalreserve.gov/econres/files/codebk2022.txt)
- [2019 SCF main-survey codebook (TXT)](https://www.federalreserve.gov/econres/files/codebk2019.txt)
- [2016 SCF main-survey codebook (TXT)](https://www.federalreserve.gov/econres/files/codebk2016.txt)
- [SCF summary-extract variable-definition macro (bulletin.macro.txt)](https://www.federalreserve.gov/econres/files/bulletin.macro.txt)

See also:

- [SCF landing page](https://www.federalreserve.gov/econres/scfindex.htm)
- [SCF documentation (working papers, methodology)](https://www.federalreserve.gov/econres/scf-documentation.htm)

## Data products in this folder

- `fed_scf.py` — downloads the Fed's SAS summary-extract ZIPs
(`SummarizedFedSCF_2016`, `SummarizedFedSCF_2019`, `SummarizedFedSCF_2022`)
and reads them into a pandas DataFrame.
- `scf.py` — wraps the raw summary extract in a PolicyEngine `Dataset`
(`SCF`) with the standard ARRAYS format used downstream.
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
# Survey of Income and Program Participation (SIPP)

This folder contains the tooling that uses the Census Bureau's Survey of
Income and Program Participation (SIPP) as a donor source for imputations
onto the CPS (`sipp.py`).

PolicyEngine currently uses SIPP to train quantile regression forest
(QRF) imputation models for tip income (using the SIPP job-level
tip-amount columns) and for household-level asset categories (bank,
stock, bond, vehicle). These models are then applied to the CPS-based
Enhanced CPS to obtain person-level tip income and household-level
countable resources that the CPS itself does not capture.

## Documentation

The Census Bureau publishes a users' guide and data dictionary for each
SIPP panel wave. These are the canonical reference for every variable
name, value code, and weighting construct used by the code in this
folder:

- [SIPP 2023 public-use data dictionary (PDF)](https://www2.census.gov/programs-surveys/sipp/tech-documentation/data-dictionaries/2023/2023_SIPP_Data_Dictionary.pdf)
- [SIPP 2023 users' guide (PDF, Aug 2026 revision)](https://www2.census.gov/programs-surveys/sipp/tech-documentation/methodology/2023_SIPP_Users_Guide_AUG26.pdf)

See also:

- [SIPP landing page](https://www.census.gov/programs-surveys/sipp.html)
- [SIPP technical documentation](https://www.census.gov/programs-surveys/sipp/tech-documentation.html)
- [SIPP public-use datasets](https://www.census.gov/programs-surveys/sipp/data/datasets.html)

## Data products in this folder

- `sipp.py` — trains and caches QRF imputation models (`get_tip_model`,
`get_asset_model`, `get_vehicle_model`) from SIPP 2023 person-month
data. The training frame is filtered to `MONTHCODE == 12` (December)
so every row represents one person-year rather than twelve annualized
months.
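
The December filter described above can be illustrated on toy data. This is a standard-library sketch under stated assumptions, not the code in `sipp.py`: `december_rows` is a hypothetical helper and `toy_person_id` is a made-up column. Only the `MONTHCODE` column name and the value 12 come from the description above.

```python
import csv
import io


def december_rows(csv_text: str) -> list[dict]:
    """Keep only the person-month rows recorded for December."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row["MONTHCODE"]) == 12]


# Toy person-month data: two people, two months each.
toy = "toy_person_id,MONTHCODE\nA,11\nA,12\nB,12\nB,1\n"

rows = december_rows(toy)
print([row["toy_person_id"] for row in rows])  # one row per person: A, B
```

Keeping a single month per person avoids counting the same person up to twelve times in the training frame.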

The raw SIPP CSVs (`pu2023.csv` and the slim variant `pu2023_slim.csv`)
are mirrored on the `PolicyEngine/policyengine-us-data` HuggingFace model
repo and downloaded on demand when a training run is needed. They are
not vendored in this Git repository.
