|
| 1 | +# PSID coverage = 0 in `benchmark_multi_seed.json`: diagnosed |
| 2 | + |
| 3 | +*Closes the open question raised in `docs/synthesizer-benchmark-scale-up.md`.* |
| 4 | + |
| 5 | +## Summary |
| 6 | + |
| 7 | +PSID coverage is 0.0 for all 6 methods (QRF, ZI-QRF, QDNN, ZI-QDNN, MAF, ZI-MAF) on all 10 seeds. **The cause is not that PSID is unsynthesizable: the benchmark harness collapses PSID conditioning to 2 variables** (`is_male` and `age`) when it computes the shared-column pool.
| 8 | + |
| 9 | +This is a benchmark-architecture bug, not a data limitation. PSID is still a viable backbone for the SS-model longitudinal extension, conditional on fixing or bypassing this specific benchmark setup. |
| 10 | + |
| 11 | +## Reproduction |
| 12 | + |
| 13 | +Input: `microplex/data/stacked_comprehensive.parquet` (630,216 rows, 38 cols, stacks sipp + cps + psid). |
| 14 | + |
| 15 | +Benchmark setup (`microplex/scripts/run_benchmark.py` + `microplex/src/microplex/eval/benchmark.py`): |
| 16 | + |
| 17 | +1. For each source, keep only numeric columns with <5 % NaN, then `dropna()`. |
| 18 | +2. Compute `shared_cols` = columns present in ALL sources with <5 % NaN each. |
| 19 | +3. Each synthesizer is trained as a multi-source fusion: pool `shared_cols` across sources, fit a per-column model for each non-shared column on only the source that has it. |
| 20 | +4. At generation: sample a shared-column record, then predict each non-shared column from its per-source model conditioned on the shared columns. |
| 21 | +5. Per-source PRDC coverage: the holdout uses that source's full column set, the synthetic side uses the generated records restricted to the same columns, and the `prdc` library computes coverage with k=5.
| 22 | + |
| 23 | +Diagnostic script (the parquet path is relative, so this assumes the working directory is `microplex/`; runs in a few seconds):
| 24 | + |
| 25 | +```python |
| 26 | +import pandas as pd |
| 27 | +import numpy as np |
| 28 | + |
| 29 | +df = pd.read_parquet("data/stacked_comprehensive.parquet") |
| 30 | +numeric_dtypes = [np.float64, np.int64, np.float32, np.int32] |
| 31 | +exclude = {"weight", "person_id", "household_id", "interview_number"} |
| 32 | + |
| 33 | +survey_dfs = {} |
| 34 | +for src in ["sipp", "cps", "psid"]: |
| 35 | +    sub = df[df["_survey"] == src].drop(columns=["_survey"]).copy()
| 36 | +    num = [c for c in sub.columns
| 37 | +           if sub[c].dtype in numeric_dtypes and sub[c].isna().mean() < 0.05]
| 38 | +    survey_dfs[src] = sub[num].dropna().reset_index(drop=True)
| 39 | +    print(src, len(survey_dfs[src]), num)
| 40 | + |
| 41 | +first = next(iter(survey_dfs.values())) |
| 42 | +shared = [c for c in first.columns |
| 43 | +          if c not in exclude and all(c in d.columns for d in survey_dfs.values())]
| 44 | +print("shared_cols:", shared) |
| 45 | +``` |
| 46 | + |
| 47 | +Output: |
| 48 | + |
| 49 | +| Source | Rows after dropna | Low-NaN numeric columns | |
| 50 | +|---|---:|---| |
| 51 | +| SIPP | 476,744 | hispanic, race, is_male, wave, job_gain, age, job_loss, weight, month | |
| 52 | +| CPS | 144,265 | state_fips, is_male, dividend_income, farm_income, age, self_employment_income, weight, rental_income, wage_income, interest_income | |
| 53 | +| PSID | 9,207 | state_fips, food_stamps, total_family_income, is_male, marital_status, year, dividend_income, taxable_income, age, weight, rental_income, wage_income, interview_number, social_security, interest_income | |
| 54 | + |
| 55 | +**Intersection after excluding `{weight, person_id, household_id, interview_number}`: `['is_male', 'age']` — 2 columns.** |
| 56 | + |
| 57 | +## Why this gives PSID coverage 0 |
| 58 | + |
| 59 | +- PSID has the **most** non-shared columns (13 of its 15), each modeled per-column on only 9,207 rows and conditioned on just the 2 shared variables.
| 60 | +- PRDC for PSID is computed on PSID's full 15-column feature space. The synthesizer's predicted values for the 13 non-shared columns are drawn from a model that's severely under-conditioned (2D conditioning on 13 target dimensions, each with a per-column RF or flow trained on 9,207 rows). |
| 61 | +- k-NN coverage with k=5 in 15D looks for any synthetic record within the k-th nearest-neighbor distance of each real holdout record. With under-conditioned predictions the synthetic records cluster around model means and rarely fall within the real holdout's neighborhood ball. Coverage → 0. |
| 62 | +- CPS has 10 total columns, 8 of them non-shared (ratio 8/10 = 0.80), and 144,265 rows → coverage ~0.34–0.50 (mediocre but non-zero). SIPP has 9 total columns, 7 of them non-shared (ratio 7/9 ≈ 0.78), and 476,744 rows → coverage ~0.72–0.95 (highest). **The pattern tracks non-shared column ratio and row count.** PSID is worst because its non-shared ratio is highest (13/15 ≈ 0.87) and its row count is lowest (9,207).
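The coverage mechanism described in the bullets above can be illustrated on toy data. The sketch below is a NumPy re-implementation of the k-NN coverage definition, not a call into the `prdc` library, and the 2D ring data is made up purely for illustration. The helper names `knn_coverage` and `ring` are hypothetical. It compares synthetic points drawn from the same distribution as the real data against synthetic points that collapse onto the real data's mean.

```python
import numpy as np

def knn_coverage(real: np.ndarray, fake: np.ndarray, k: int = 5) -> float:
    """Fraction of real rows with at least one fake row inside their k-NN ball."""
    # Pairwise distances among real rows. After sorting each row, column 0 is the
    # zero self-distance, so column k is the distance to the k-th other real row.
    d_real = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    radii = np.sort(d_real, axis=1)[:, k]
    # Distance from each real row to its nearest fake row.
    d_cross = np.linalg.norm(real[:, None, :] - fake[None, :, :], axis=-1)
    return float((d_cross.min(axis=1) < radii).mean())

def ring(n: int, rng: np.random.Generator) -> np.ndarray:
    """Toy 'real' distribution: noisy points on a circle of radius 10."""
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    points = 10.0 * np.column_stack([np.cos(theta), np.sin(theta)])
    return points + rng.normal(scale=0.1, size=(n, 2))

rng = np.random.default_rng(0)
real = ring(200, rng)
matched = ring(200, rng)                          # same distribution as real
collapsed = rng.normal(scale=0.1, size=(200, 2))  # clustered at the ring's mean

cov_matched = knn_coverage(real, matched)
cov_collapsed = knn_coverage(real, collapsed)
print(f"matched:   {cov_matched:.2f}")
print(f"collapsed: {cov_collapsed:.2f}")
```

The collapsed points sit about 10 units from every real point, far outside any real point's 5-NN radius, so no real record is covered. This is only an analogy for the PSID case: it shows how mean-clustered output produces zero coverage, not that the benchmark's synthetic PSID records are distributed this way.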
| 63 | + |
| 64 | +## Why this is a benchmark bug, not a PSID limitation |
| 65 | + |
| 66 | +The benchmark implicitly assumes sources share rich conditioning information. Here the `<5 % NaN` filter removes many latently-shared columns from individual sources. For example, `wage_income` appears in both CPS (144,265 non-null) and PSID (9,207 non-null) but NOT in SIPP — so it's excluded from `shared_cols`. If the benchmark harmonized the column schema across sources before applying the NaN filter (either by imputing cross-source or by using an intersection-of-non-null-across-sources strategy), `shared_cols` would be much richer and all sources would benefit. |
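To make the loss concrete, the following sketch recomputes column overlaps from the low-NaN column lists in the table above. The pairwise comparison is only an illustration of what an all-source intersection discards; it is not something the harness currently does, and the variable names are assumptions.

```python
from itertools import combinations

# Low-NaN numeric columns per source, copied from the diagnostic output table.
low_nan_cols = {
    "sipp": {"hispanic", "race", "is_male", "wave", "job_gain", "age",
             "job_loss", "weight", "month"},
    "cps": {"state_fips", "is_male", "dividend_income", "farm_income", "age",
            "self_employment_income", "weight", "rental_income", "wage_income",
            "interest_income"},
    "psid": {"state_fips", "food_stamps", "total_family_income", "is_male",
             "marital_status", "year", "dividend_income", "taxable_income",
             "age", "weight", "rental_income", "wage_income",
             "interview_number", "social_security", "interest_income"},
}
exclude = {"weight", "person_id", "household_id", "interview_number"}

# Drop identifier and weight columns before comparing sources.
usable = {src: cols - exclude for src, cols in low_nan_cols.items()}

# The all-source intersection, as in the diagnostic script.
shared_all = set.intersection(*usable.values())
print("all sources:", sorted(shared_all))

# Pairwise intersections show what the all-source rule throws away.
pairwise = {(a, b): usable[a] & usable[b] for a, b in combinations(usable, 2)}
for (a, b), cols in pairwise.items():
    print(f"{a} & {b}: {len(cols)} columns {sorted(cols)}")
```

CPS and PSID share 7 usable columns (`age`, `is_male`, `state_fips`, and four income columns), while each pairing with SIPP shares only `age` and `is_male`. SIPP is therefore the source that reduces the pool to 2 columns.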
| 67 | + |
| 68 | +PSID itself has 15 low-NaN columns — more than either SIPP (9) or CPS (10). On a **PSID-only** benchmark (train on PSID, test on PSID holdout), coverage would likely be competitive with SIPP's. |
| 69 | + |
| 70 | +## Implications for the architecture work |
| 71 | + |
| 72 | +### For synthesizer selection (G1 cross-section) |
| 73 | + |
| 74 | +- **The benchmark's PSID=0 verdict should not influence cross-section synthesizer choice.** G1 works with the CPS-core scaffold, not PSID, so the issue doesn't propagate. My earlier recommendation of ZI-MAF for cross-section and ZI-QRF for panel stands.
| 75 | + |
| 76 | +### For SS-model longitudinal extension (G3) |
| 77 | + |
| 78 | +- **PSID can still be the trajectory-training backbone.** The SS-model methodology doc's plan to use PSID (1968–present) for lifetime earnings trajectories is not invalidated by this benchmark. |
| 79 | +- However, before committing compute, run a **PSID-only synthesizer benchmark**: train ZI-MAF / ZI-QRF / ZI-QDNN on PSID alone, test on PSID holdout. That is the relevant evaluation for the SS-model use case. The existing multi-source benchmark result for PSID is not the relevant number. |
| 80 | +- If PSID-only benchmarks still show low coverage, the real issue may be attrition-induced sparsity in PSID's joint feature space (a real data limitation). That is a separate investigation.
| 81 | + |
| 82 | +### For the benchmark harness itself (deprioritized) |
| 83 | + |
| 84 | +- The benchmark's `find_shared_cols` policy is brittle at the intersection: any source with a different NaN rate on a column knocks that column out of the shared pool for every source. For future benchmark work, consider: |
| 85 | + - Lift the NaN filter or pre-impute cross-source. |
| 86 | + - Report results **per-source** on same-source train/test splits, not cross-source. |
| 87 | + - Report `shared_cols` and per-source `non_shared_cols` counts alongside coverage so reviewers can see the conditioning bottleneck. |
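The last suggestion could look roughly like the sketch below. `conditioning_report` is a hypothetical helper, not an existing function in the harness. It assumes a `survey_dfs` dict of per-source DataFrames shaped like the one built in the diagnostic script, and it counts non-shared columns as total minus shared, matching the "13 of 15" counting used above. The demo frames are made up only to exercise the function.

```python
import pandas as pd

def conditioning_report(survey_dfs, exclude=frozenset()):
    """Summarize per-source row counts and shared vs. non-shared column counts."""
    frames = list(survey_dfs.values())
    # Same rule as the diagnostic script: a column is shared only if every
    # source has it and it is not in the exclusion set.
    shared = [c for c in frames[0].columns
              if c not in exclude and all(c in d.columns for d in frames)]
    rows = []
    for src, d in survey_dfs.items():
        n_cols = d.shape[1]
        n_non_shared = n_cols - len(shared)
        rows.append({
            "source": src,
            "n_rows": len(d),
            "n_cols": n_cols,
            "n_shared": len(shared),
            "n_non_shared": n_non_shared,
            "non_shared_ratio": round(n_non_shared / n_cols, 2),
        })
    return pd.DataFrame(rows).set_index("source")

# Made-up frames standing in for real survey data.
demo = {
    "a": pd.DataFrame({"age": [30, 41], "is_male": [1, 0], "x": [1.0, 2.0]}),
    "b": pd.DataFrame({"age": [25, 52], "is_male": [0, 1],
                       "y": [3.0, 4.0], "z": [5.0, 6.0]}),
}
report = conditioning_report(demo)
print(report)
```

Joining this table to the per-source coverage numbers would put the conditioning bottleneck next to each score.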
| 88 | + |
| 89 | +## Action items |
| 90 | + |
| 91 | +1. **Update `docs/synthesizer-benchmark-scale-up.md`** to note this finding — the PSID=0 line in the initial summary should be annotated, not taken as evidence that PSID is unusable. |
| 92 | +2. **Before any SS-model work commits compute to PSID-based trajectory training**, run a PSID-only synthesizer benchmark. That is roughly a day of work in `experiments/` using the existing method classes.
| 93 | +3. **No change to G1 plan.** Cross-section proceeds with CPS-scaffold as planned; PSID is not on the G1 critical path. |
| 94 | + |
| 95 | +## What was reliable in the original PSID=0 signal |
| 96 | + |
| 97 | +- The signal is genuine in one respect: this specific multi-source fusion benchmark cannot cover PSID well. Consumers of that benchmark output (e.g., the paper draft in `microplex/paper/paper_results.py`) need to adjust their claims accordingly. It is not valid to say "all methods fail on PSID." The valid claim is "cross-source fusion with 2 shared variables fails on PSID, in a way that tracks non-shared column ratio."