
Commit a408fb4

MaxGhenis and claude committed
Diagnose PSID coverage = 0 in benchmark_multi_seed.json
Root cause: the multi-source fusion benchmark harness in microplex (scripts/run_benchmark.py + src/microplex/eval/benchmark.py) collapses the shared-column pool across sipp/cps/psid to exactly 2 variables (is_male, age) because of a <5% NaN filter applied per-source before intersection. PSID has the highest ratio of non-shared columns (13 of 15) and the smallest row count (9,207), so its per-column models are the most under-conditioned. PRDC k-NN coverage collapses to 0 because synthetic records cluster around model means and miss the real holdout neighborhoods.

Key facts:
- shared_cols intersection for the benchmark is literally ['is_male', 'age']
- SIPP (9 cols, 7 non-shared, 476k rows): coverage 0.29-0.95
- CPS (10 cols, 8 non-shared, 144k rows): coverage 0.34-0.50
- PSID (15 cols, 13 non-shared, 9k rows): coverage 0.00 uniformly
- Pattern tracks non-shared-ratio and row count, not method choice

Implications:
- G1 cross-section synthesizer choice: unaffected, continue with ZI-MAF for CPS-style, ZI-QRF for panel
- SS-model longitudinal work: PSID is NOT ruled out as trajectory training backbone; the benchmark verdict is not the relevant evaluation. A PSID-only benchmark is needed before committing.
- Paper claims depending on PSID=0 need qualification: valid claim is "cross-source fusion with 2 shared vars fails on PSID" not "all methods fail on PSID"

Reproduction script included in the doc (runs in seconds).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 7d7ca66 commit a408fb4

1 file changed

Lines changed: 97 additions & 0 deletions

@@ -0,0 +1,97 @@
# PSID coverage = 0 in `benchmark_multi_seed.json`: diagnosed

*Closes the open question raised in `docs/synthesizer-benchmark-scale-up.md`.*

## Summary

PSID coverage is 0.0 across all 6 methods (QRF, ZI-QRF, QDNN, ZI-QDNN, MAF, ZI-MAF) for all 10 seeds **not because PSID is unsynthesizable, but because the benchmark harness collapses PSID conditioning to 2 variables** (`is_male` and `age`) when it computes the shared-column pool.

This is a benchmark-architecture bug, not a data limitation. PSID is still a viable backbone for the SS-model longitudinal extension, conditional on fixing or bypassing this specific benchmark setup.

## Reproduction

Input: `microplex/data/stacked_comprehensive.parquet` (630,216 rows, 38 cols, stacks sipp + cps + psid).

Benchmark setup (`microplex/scripts/run_benchmark.py` + `microplex/src/microplex/eval/benchmark.py`):

1. For each source, keep only numeric columns with <5% NaN, then `dropna()`.
2. Compute `shared_cols` = columns present in ALL sources with <5% NaN each.
3. Each synthesizer is trained as a multi-source fusion: pool `shared_cols` across sources, fit a per-column model for each non-shared column on only the source that has it.
4. At generation: sample a shared-column record, then predict each non-shared column from its per-source model conditioned on the shared columns.
5. Per-source PRDC coverage: holdout = that source's full column set; synthetic = generated records' intersecting column set; the `prdc` library computes coverage with k=5.

Diagnostic script (runs in a few seconds):

```python
import pandas as pd
import numpy as np

df = pd.read_parquet("data/stacked_comprehensive.parquet")
numeric_dtypes = [np.float64, np.int64, np.float32, np.int32]
exclude = {"weight", "person_id", "household_id", "interview_number"}

survey_dfs = {}
for src in ["sipp", "cps", "psid"]:
    sub = df[df["_survey"] == src].drop(columns=["_survey"]).copy()
    # Keep numeric columns with <5% NaN in this source, then drop incomplete rows.
    num = [c for c in sub.columns
           if sub[c].dtype in numeric_dtypes and sub[c].isna().mean() < 0.05]
    survey_dfs[src] = sub[num].dropna().reset_index(drop=True)
    print(src, len(survey_dfs[src]), num)

# Shared pool: non-excluded columns that survived the filter in every source.
first = next(iter(survey_dfs.values()))
shared = [c for c in first.columns
          if c not in exclude and all(c in d.columns for d in survey_dfs.values())]
print("shared_cols:", shared)
```

Output:

| Source | Rows after dropna | Low-NaN numeric columns |
|---|---:|---|
| SIPP | 476,744 | hispanic, race, is_male, wave, job_gain, age, job_loss, weight, month |
| CPS | 144,265 | state_fips, is_male, dividend_income, farm_income, age, self_employment_income, weight, rental_income, wage_income, interest_income |
| PSID | 9,207 | state_fips, food_stamps, total_family_income, is_male, marital_status, year, dividend_income, taxable_income, age, weight, rental_income, wage_income, interview_number, social_security, interest_income |

**Intersection after excluding `{weight, person_id, household_id, interview_number}`: `['is_male', 'age']`, i.e. 2 columns.**

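The intersection can also be checked by hand from the column lists in the output table, without loading the parquet file. The snippet below is a minimal sketch that hard-codes those three lists as reported above; the names `low_nan_cols` and `non_shared` are illustrative and are not part of the benchmark harness.

```python
# Column lists copied from the output table; variable names are illustrative only.
low_nan_cols = {
    "sipp": ["hispanic", "race", "is_male", "wave", "job_gain", "age",
             "job_loss", "weight", "month"],
    "cps": ["state_fips", "is_male", "dividend_income", "farm_income", "age",
            "self_employment_income", "weight", "rental_income", "wage_income",
            "interest_income"],
    "psid": ["state_fips", "food_stamps", "total_family_income", "is_male",
             "marital_status", "year", "dividend_income", "taxable_income",
             "age", "weight", "rental_income", "wage_income",
             "interview_number", "social_security", "interest_income"],
}
exclude = {"weight", "person_id", "household_id", "interview_number"}

# Columns present in every source, minus the excluded identifiers and weights.
shared = set.intersection(*(set(cols) for cols in low_nan_cols.values())) - exclude
print("shared_cols:", sorted(shared))

# Per-source count of columns that fall outside the shared pool.
for src, cols in low_nan_cols.items():
    non_shared = [c for c in cols if c not in shared]
    print(src, "total:", len(cols), "non-shared:", len(non_shared))
```

The per-source counts printed here (total and non-shared) are the ones quoted in the sections that follow.
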
## Why this gives PSID coverage 0

- PSID has the **most** non-shared columns (13 of its 15), each modeled per-column on only 9,207 rows and conditioned on just the 2 shared variables.
- PRDC for PSID is computed on PSID's full 15-column feature space. The synthesizer's predicted values for the 13 non-shared columns are drawn from a model that's severely under-conditioned (2D conditioning on 13 target dimensions, each with a per-column RF or flow trained on 9,207 rows).
- k-NN coverage with k=5 in 15D looks for any synthetic record within the k-th nearest-neighbor distance of each real holdout record. With under-conditioned predictions the synthetic records cluster around model means and rarely fall within the real holdout's neighborhood ball. Coverage → 0.
- CPS has 10 total columns with 8 non-shared and 144,265 rows → coverage ~0.34–0.50 (mediocre but non-zero). SIPP has 9 total columns with 7 non-shared and 476,744 rows → coverage ~0.72–0.95 (highest). **The pattern tracks column-uniqueness ratio and row count.** PSID is worst because its non-shared ratio is highest and its row count is lowest.

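To make the coverage mechanism concrete, here is a minimal NumPy sketch of k-NN coverage as described in the bullets above: the fraction of real records that have at least one synthetic record inside their k-th nearest-neighbor ball. It is an illustration on toy 2D Gaussian data, not the `prdc` library's code and not the benchmark's data; the function name `knn_coverage` and the toy arrays are assumptions made for this example.

```python
import numpy as np

def knn_coverage(real, synth, k=5):
    """Fraction of real points with a synthetic point inside their k-NN ball."""
    # Pairwise distances among real points; column 0 after sorting is the self-distance.
    d_rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    radii = np.sort(d_rr, axis=1)[:, k]
    # Distance from each real point to its closest synthetic point.
    d_rs = np.linalg.norm(real[:, None, :] - synth[None, :, :], axis=-1)
    return float((d_rs.min(axis=1) < radii).mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 2))  # toy stand-in for a holdout set

# Synthetic records that all sit at the column means (extreme "cluster at the mean" case).
collapsed = np.tile(real.mean(axis=0), (len(real), 1))
# Synthetic records far away from every real record.
shifted = real + 100.0

print("identical:", knn_coverage(real, real))
print("collapsed to mean:", knn_coverage(real, collapsed))
print("shifted away:", knn_coverage(real, shifted))
```

In this toy setup only the few real points that happen to lie next to the mean are covered by the collapsed synthetic set, which is the same failure shape the bullets describe for under-conditioned per-column models.
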
## Why this is a benchmark bug, not a PSID limitation

The benchmark implicitly assumes sources share rich conditioning information. Here the `<5% NaN` filter removes many latently shared columns from individual sources. For example, `wage_income` appears in both CPS (144,265 non-null) and PSID (9,207 non-null) but NOT in SIPP, so it's excluded from `shared_cols`. If the benchmark harmonized the column schema across sources before applying the NaN filter (either by imputing cross-source or by using an intersection-of-non-null-across-sources strategy), `shared_cols` would be much richer and all sources would benefit.

PSID itself has 15 low-NaN columns, more than either SIPP (9) or CPS (10). On a **PSID-only** benchmark (train on PSID, test on PSID holdout), coverage would likely be competitive with SIPP's.

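The `wage_income` example generalizes: several columns are shared by CPS and PSID but dropped from the pool because SIPP lacks them. The sketch below computes pairwise overlaps from the low-NaN column lists reported in the Reproduction output. It only counts columns; it is not a proposal the harness implements, and it makes no claim about how much coverage would improve. The names `low_nan_cols` and `pairwise_shared` are illustrative.

```python
from itertools import combinations

# Low-NaN column lists as reported in the Reproduction output.
low_nan_cols = {
    "sipp": ["hispanic", "race", "is_male", "wave", "job_gain", "age",
             "job_loss", "weight", "month"],
    "cps": ["state_fips", "is_male", "dividend_income", "farm_income", "age",
            "self_employment_income", "weight", "rental_income", "wage_income",
            "interest_income"],
    "psid": ["state_fips", "food_stamps", "total_family_income", "is_male",
             "marital_status", "year", "dividend_income", "taxable_income",
             "age", "weight", "rental_income", "wage_income",
             "interview_number", "social_security", "interest_income"],
}
exclude = {"weight", "person_id", "household_id", "interview_number"}

# Overlap for every pair of sources, instead of one intersection across all three.
pairwise_shared = {
    (a, b): sorted((set(low_nan_cols[a]) & set(low_nan_cols[b])) - exclude)
    for a, b in combinations(low_nan_cols, 2)
}
for pair, cols in pairwise_shared.items():
    print(pair, len(cols), cols)
```

On these lists the CPS and PSID pair shares seven columns (including `wage_income`), while either pair involving SIPP shares only `is_male` and `age`.
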
## Implications for the architecture work

### For synthesizer selection (G1 cross-section)

- **The benchmark's PSID=0 verdict should not influence cross-section synthesizer choice.** G1 works with the CPS-core scaffold, not PSID, so the issue doesn't propagate. My earlier recommendation of ZI-MAF for cross-section and ZI-QRF for panel stands.

### For SS-model longitudinal extension (G3)

- **PSID can still be the trajectory-training backbone.** The SS-model methodology doc's plan to use PSID (1968–present) for lifetime earnings trajectories is not invalidated by this benchmark.
- However, before committing compute, run a **PSID-only synthesizer benchmark**: train ZI-MAF / ZI-QRF / ZI-QDNN on PSID alone, test on PSID holdout. That is the relevant evaluation for the SS-model use case. The existing multi-source benchmark result for PSID is not the relevant number.
- If PSID-only benchmarks still show low coverage, the real issue may be the attrition-induced sparsity in PSID's joint feature space (real data limitation). That is a separate investigation.

### For the benchmark harness itself (deprioritized)

- The benchmark's `find_shared_cols` policy is brittle at the intersection: any source with a different NaN rate on a column knocks that column out of the shared pool for every source. For future benchmark work, consider:
  - Lift the NaN filter or pre-impute cross-source.
  - Report results **per-source** on same-source train/test splits, not cross-source.
  - Report `shared_cols` and per-source `non_shared_cols` counts alongside coverage so reviewers can see the conditioning bottleneck.

## Action items

1. **Update `docs/synthesizer-benchmark-scale-up.md`** to note this finding: the PSID=0 line in the initial summary should be annotated, not taken as evidence that PSID is unusable.
2. **Before any SS-model work commits compute to PSID-based trajectory training**, run a PSID-only synthesizer benchmark. That is roughly a day of work on `experiments/` with existing method classes.
3. **No change to G1 plan.** Cross-section proceeds with CPS-scaffold as planned; PSID is not on the G1 critical path.

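As a starting point for action item 2, the shape of a same-source benchmark can be sketched as follows. Everything here is hypothetical: `same_source_benchmark`, `bootstrap_placeholder`, and the toy Gaussian array stand in for the real PSID frame and the existing method classes, and `knn_coverage` is a plain NumPy re-statement of k-NN coverage rather than a call into `prdc`. The point is only the structure: split one source, fit on the train part, score against that source's own holdout.

```python
import numpy as np

def knn_coverage(real, synth, k=5):
    """Fraction of real points with a synthetic point inside their k-NN ball."""
    d_rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    radii = np.sort(d_rr, axis=1)[:, k]  # index 0 is the self-distance
    d_rs = np.linalg.norm(real[:, None, :] - synth[None, :, :], axis=-1)
    return float((d_rs.min(axis=1) < radii).mean())

def same_source_benchmark(data, synthesize, holdout_frac=0.2, k=5, seed=0):
    """Split one source into train/holdout, synthesize from train, score on holdout."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_hold = int(len(data) * holdout_frac)
    holdout, train = data[idx[:n_hold]], data[idx[n_hold:]]
    synth = synthesize(train, n_hold, rng)
    return knn_coverage(holdout, synth, k=k)

def bootstrap_placeholder(train, n, rng):
    """Hypothetical stand-in for a real synthesizer: resample training rows."""
    return train[rng.integers(0, len(train), size=n)]

# Toy data in place of the PSID low-NaN columns.
toy = np.random.default_rng(1).normal(size=(2000, 3))
score = same_source_benchmark(toy, bootstrap_placeholder)
print("same-source coverage on toy data:", score)
```

A real run would replace `bootstrap_placeholder` with each candidate method and `toy` with the PSID rows, and would likely need feature scaling before distances are computed.
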
## What was reliable in the original PSID=0 signal

- It remains true that the specific multi-source fusion benchmark here cannot cover PSID well. Consumers who use that benchmark output (e.g., the paper draft in `microplex/paper/paper_results.py`) need to adjust claims accordingly: it is not valid to say "all methods fail on PSID." The valid claim is "cross-source fusion with 2 shared variables fails on PSID, in a way that tracks non-shared column ratio."
