
Commit 23ce32c

MaxGhenis and claude authored
Clarify SIPP is public-use; only IRS-PUF is access-restricted (#809)
John Sabelhaus corrected a licensing overclaim in the 2026-04-21 meeting: the SIPP vintage we consume (Census public-use SIPP) has no per-user license, data-use agreement, or registration requirement.

Of the six upstream sources the pipeline ingests (CPS, ACS, SCF, ORG, SIPP, IRS-PUF), only IRS-PUF has a genuine access restriction. The HuggingFace mirror of pu2023.csv is a caching convenience, not an access-restriction workaround.

This matters for TRACE / reproducibility writeups: overstating which inputs are restricted distorts the institutional-certification story.

Fixes #808.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 2db4cc7 commit 23ce32c

2 files changed

Lines changed: 20 additions & 0 deletions


Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Clarified SIPP licensing language in `policyengine_us_data/datasets/sipp/README.md`: SIPP public-use data is unrestricted (no per-user license, agreement, or registration). Of the six upstream microdata sources the Enhanced CPS pipeline ingests (CPS, ACS, SCF, ORG, SIPP, IRS-PUF), only IRS-PUF has a genuine access restriction. Fixes #808.

policyengine_us_data/datasets/sipp/README.md

Lines changed: 19 additions & 0 deletions
@@ -39,3 +39,22 @@ The raw SIPP CSVs (`pu2023.csv` and the slim variant `pu2023_slim.csv`)
 are mirrored on the `PolicyEngine/policyengine-us-data` HuggingFace model
 repo and downloaded on demand when a training run is needed. They are
 not vendored in this Git repository.
+
+## Licensing
+
+SIPP public-use files are, as the name implies, **public-use data** — no
+per-user license, data-use agreement, or registration is required to
+download or redistribute them. We mirror them on our HuggingFace model
+repo purely as a caching convenience (Census's own hosting is slow and
+occasionally unavailable), not to work around any access restriction.
+
+This matters because PolicyEngine's enhanced CPS pipeline ingests several
+different upstream microdata sources, and only **one** of them —
+**IRS Public Use File (PUF)** — has any genuine access restriction. PUF
+requires agreeing to IRS's terms of use before download, even though the
+file is itself intended for public release. CPS, ACS, SCF, ORG, and SIPP
+are all unrestricted public-use. If you are writing about the pipeline's
+licensing posture (for a paper, replication packet, or TRACE TRO), only
+IRS-PUF should appear in the restricted column.
+
+See issue #808 for the background on this correction.
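
For anyone drafting the "restricted column" the README mentions, the licensing posture described in this commit can be written down as a small lookup table. The sketch below is illustrative only: `UpstreamSource`, `SOURCES`, `restricted_sources`, and `licensing_table` are hypothetical names that do not exist in `policyengine_us_data`, and the notes are paraphrased from the README text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpstreamSource:
    """One upstream microdata source and its access status (hypothetical type)."""
    name: str
    restricted: bool  # True only when access carries a genuine restriction
    note: str


# Hypothetical registry transcribed from the README wording in this commit.
SOURCES = (
    UpstreamSource("CPS", False, "public-use"),
    UpstreamSource("ACS", False, "public-use"),
    UpstreamSource("SCF", False, "public-use"),
    UpstreamSource("ORG", False, "public-use"),
    UpstreamSource("SIPP", False, "public-use; HuggingFace mirror is a cache only"),
    UpstreamSource("IRS-PUF", True, "requires agreeing to IRS terms of use"),
)


def restricted_sources(sources=SOURCES):
    """Return the names of the sources that belong in the restricted column."""
    return [s.name for s in sources if s.restricted]


def licensing_table(sources=SOURCES):
    """Render a plain-text table suitable for pasting into a writeup."""
    width = max(len(s.name) for s in sources)
    lines = [f"{'Source'.ljust(width)}  Access"]
    for s in sources:
        access = "restricted" if s.restricted else "unrestricted"
        lines.append(f"{s.name.ljust(width)}  {access} ({s.note})")
    return "\n".join(lines)


if __name__ == "__main__":
    print(licensing_table())
```

Keeping the status as a single boolean per source makes an overclaim like the one corrected here easy to spot in review, since any change to the restricted column shows up as a one-line diff.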
