Merge pull request #335 from chusloj/matching_docs

andersonfrailey · web-flow · commit 98fc99011b08 · 2020-08-10T16:17:10.000-05:00
Documentation for CPS/PUF matching file [Review]
diff --git a/.gitignore b/.gitignore
@@ -21,3 +21,6 @@ cps_data/pycps/cps_raw.csv.gz
 cpsmar*.sas
 cpsmar*.csv
 *.dat
+
+# pickle
+cps*.pkl
diff --git a/puf_data/README.md b/puf_data/README.md
@@ -15,5 +15,5 @@ This directory contains the following script:
 Documentation
 -------------
 
-The SAS scripts used to prepare the cps-matched-puf.csv file are
+The Python scripts used to prepare the cps-matched-puf.csv file are
 described in the [StatMatch subdirectory](StatMatch/README.md).
diff --git a/puf_data/StatMatch/MATCH.md b/puf_data/StatMatch/MATCH.md
@@ -0,0 +1,34 @@
+# CPS Match File documentation
+
+The `cps-matched-puf.csv` file is the result of a statistical match performed between the 2016 Current Population Survey (CPS) Annual Social and Economic Supplement and the 2011 IRS-SOI Public Use File (PUF). See the [Datasets documentation](/datasets.md#input-files) for more information.
+
+
+## Process
+
+Data from the 2016 CPS is first collected and organized. Tax filing units are then constructed from this CPS data, missing data is imputed from similar data through [Predictive Mean Matching](https://stefvanbuuren.name/fimd/sec-pmm.html), and the statistical match is performed using nearest-neighbor distance as the criterion for matching filing units. [`runmatch.py`](Matching/runmatch.py) is responsible for running the match and compiling a final production file.
+
+
+
+#### Pros/Cons of Matching Methodology
+
+The statistical match performed to create this file is a "constrained" match. This is a process that matches similar filing units from the CPS and PUF such that the weights on individual filing units within a file sum to the total weight of that file (e.g., all individual weights in the CPS file sum to the weight of the CPS file itself). The benefit of this constrained matching approach is that it retains the original distributions of variables from within the CPS and PUF files. There are two principal drawbacks to this method: (1) the solver may sometimes match filing units that are not very similar statistically, and (2) the method is computationally expensive.
+
+
+
+## Files Used
+
+`cps-matched-puf.csv` is created by running [`runmatch.py`](Matching/runmatch.py). See the [README](README.md) for more information on the scripts used by [`runmatch.py`](Matching/runmatch.py).
+
+The output of [`runmatch.py`](Matching/runmatch.py) is used in [`stage2.py`](/puf_stage2/stage2.py) and [`stage3.py`](/puf_stage3/stage3.py). More information on the Stage 2 and Stage 3 files can be found in [this document](/puf_stage3/doc/puf_stage3.md), and the statistical matching process is outlined in detail in [this document](doc/MatchingDocumentationRevised.pdf).
+
+
+
+## Contributors
+
+- Matt Jensen
+- Peter Metz
+- Anderson Frailey
+- Martin Holmer
+- Max Ghenis
+
+
diff --git a/puf_data/StatMatch/README.md b/puf_data/StatMatch/README.md
@@ -6,7 +6,9 @@ using the March 2016 CPS and 2011 IRS SOI Public Use File (PUF) datasets.
 ## Usage
 
 The entire statistical match can be run by using `python runmatch.py` in the
-command line. Three optional arguments can be included as well:
+command line. The output of this process should be `cps-matched-puf.csv` (not in the repo because `runmatch.py` requires restricted files).
+
+Three optional arguments can be included as well:
 
 * `-c`, `--cps` takes the path to an already created .CSV version of the CPS.
 Including this will allow the program to simply read in the CSV, rather than