
Commit 98fc990

Merge pull request #335 from chusloj/matching_docs
Documentation for CPS/PUF matching file [Review]
2 parents f716ee6 + efb4f7a commit 98fc990

4 files changed

Lines changed: 41 additions & 2 deletions

File tree

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -21,3 +21,6 @@ cps_data/pycps/cps_raw.csv.gz
 cpsmar*.sas
 cpsmar*.csv
 *.dat
+
+# pickle
+cps*.pkl

puf_data/README.md

Lines changed: 1 addition & 1 deletion
@@ -15,5 +15,5 @@ This directory contains the following script:
 Documentation
 -------------

-The SAS scripts used to prepare the cps-matched-puf.csv file are
+The Python scripts used to prepare the cps-matched-puf.csv file are
 described in the [StatMatch subdirectory](StatMatch/README.md).

puf_data/StatMatch/MATCH.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
# CPS Match File documentation

The `cps-matched-puf.csv` file is the result of a statistical match performed between the 2016 Current Population Survey Annual Social and Economic Supplement and the 2011 IRS-SOI Public Use File. See the [Datasets documentation](/datasets.md#input-files) for more information.

## Process

Data from the 2016 CPS is first collected and organized. Tax filing units are then constructed from this CPS data, missing data is imputed from similar records through [Predictive Mean Matching](https://stefvanbuuren.name/fimd/sec-pmm.html), and the statistical match is performed using nearest-neighbor distance as the criterion for matching filing units. [`runmatch.py`](Matching/runmatch.py) is responsible for running the match and compiling a final production file.
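
As a rough illustration of the nearest-neighbor criterion mentioned above, the following sketch pairs each record in one file with the closest record in the other by Euclidean distance. It is not taken from `runmatch.py`: the function name, the two toy variables (income in thousands and filing-unit size), and the data are all assumptions made for the example, and it ignores weights entirely.

```python
import numpy as np


def nearest_neighbor_match(recipients, donors):
    """Return, for each recipient row, the index of the closest donor row.

    Hypothetical helper for illustration only; not part of runmatch.py.
    """
    recipients = np.asarray(recipients, dtype=float)
    donors = np.asarray(donors, dtype=float)
    # Pairwise squared Euclidean distances, shape (n_recipients, n_donors).
    diffs = recipients[:, None, :] - donors[None, :, :]
    dist2 = (diffs ** 2).sum(axis=2)
    return dist2.argmin(axis=1)


# Toy filing units: [income in thousands, number of people in the unit].
cps_units = [[30.0, 1.0], [80.0, 3.0]]
puf_units = [[78.0, 3.0], [31.0, 1.0], [55.0, 2.0]]

matches = nearest_neighbor_match(cps_units, puf_units)
for cps_idx, puf_idx in enumerate(matches):
    print(f"CPS unit {cps_idx} -> PUF unit {puf_idx}")
```

In practice the matching variables would normally be put on a comparable scale before computing distances; that step is omitted here to keep the sketch short.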

#### Pros/Cons of Matching Methodology

The statistical match performed to create this file is a "constrained" match. This process pairs similar filing units from the CPS and PUF such that the weights on the matched filing units from each file sum to that file's total weight (e.g., the individual weights of all CPS filing units sum to the total weight of the CPS file). The benefit of this constrained approach is that it retains the original distributions of variables within the CPS and PUF files. There are two principal drawbacks to this method: (1) the solver may sometimes match filing units that are not very similar statistically, and (2) the method is computationally expensive.
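
To make the weight constraint concrete, here is a minimal sketch of one simple way to satisfy it: sort both files by a similarity score and split record weights so that every unit of weight in each file is used exactly once. This rank-order weight-splitting procedure is swapped in for illustration and is not the solver-based method the text refers to; the function name, record layout `(id, score, weight)`, and toy data are assumptions.

```python
def constrained_match(cps, puf, tol=1e-9):
    """Pair records in score order, splitting weights so each file's
    record weights are fully preserved.

    cps, puf: lists of (record_id, score, weight) with equal total weight.
    Hypothetical helper for illustration only; not part of runmatch.py.
    """
    cps = sorted(cps, key=lambda r: r[1])
    puf = sorted(puf, key=lambda r: r[1])
    i = j = 0
    cps_left = cps[0][2]  # unused weight on the current CPS record
    puf_left = puf[0][2]  # unused weight on the current PUF record
    pairs = []
    while i < len(cps) and j < len(puf):
        w = min(cps_left, puf_left)
        pairs.append((cps[i][0], puf[j][0], w))
        cps_left -= w
        puf_left -= w
        if cps_left <= tol:  # CPS record exhausted, move to the next one
            i += 1
            if i < len(cps):
                cps_left = cps[i][2]
        if puf_left <= tol:  # PUF record exhausted, move to the next one
            j += 1
            if j < len(puf):
                puf_left = puf[j][2]
    return pairs


# Toy files whose weights both total 150.
cps_file = [("c1", 10, 100.0), ("c2", 20, 50.0)]
puf_file = [("p1", 12, 60.0), ("p2", 18, 90.0)]

pairs = constrained_match(cps_file, puf_file)
for cps_id, puf_id, weight in pairs:
    print(cps_id, puf_id, weight)
```

Note that record `c1` is split across two PUF records. That splitting is what keeps each file's weights, and therefore its variable distributions, intact, and it is also why a constrained match can pair units that are less similar than an unconstrained nearest-neighbor match would choose.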

## Files Used

`cps-matched-puf.csv` is created by running [`runmatch.py`](Matching/runmatch.py). See the [README](README.md) for more information on the scripts used by [`runmatch.py`](Matching/runmatch.py).

The output of [`runmatch.py`](Matching/runmatch.py) is used in [`stage2.py`](/puf_stage2/stage2.py) and [`stage3.py`](/puf_stage3/stage3.py). More information on the Stage 2 and Stage 3 files can be found in [this document](/puf_stage3/doc/puf_stage3.md), and the statistical matching process is outlined in detail in [this document](doc/MatchingDocumentationRevised.pdf).

## Contributors

- Matt Jensen
- Peter Metz
- Anderson Frailey
- Martin Holmer
- Max Ghenis

puf_data/StatMatch/README.md

Lines changed: 3 additions & 1 deletion
@@ -6,7 +6,9 @@ using the March 2016 CPS and 2011 IRS SOI Public Use File (PUF) datasets.
 ## Usage

 The entire statistical match can be run by using `python runmatch.py` in the
-command line. Three optional arguments can be included as well:
+command line. The output of this process should be `cps-matched-puf.csv` (not in the repo because `runmatch.py` requires restricted files).
+
+Three optional arguments can be included as well:

 * `-c`, `--cps` takes the path to an already created .CSV version of the CPS.
 Including this will allow the program to simply read in the CSV, rather than
