Skip to content

Commit d0fa450

Browse files
MaxGhenisclaude
andcommitted
Run stage 1 at full 77k; cap PRDC samples to fix OOM
Earlier 77k attempts died during PRDC computation, not during synthesizer fitting. PRDC on 15k real x 61k synthetic x 50 features materialized ~7 GB-per-copy distance matrices and OOM'd. Fix: add prdc_max_samples to ScaleUpStageConfig (default 20k). Both real and synthetic are sub-sampled before PRDC. The coverage metric is stable well below the capped size; more synthetic records doesn't improve it, only costs memory. Stage 1 at 77k x 50: ZI-QRF: cov=0.256 fit= 36s RSS= 6.0 GB (winner, production-workable) ZI-QDNN: cov=0.147 fit= 95s RSS=11.0 GB ZI-MAF: cov=0.014 fit=216s RSS=11.0 GB (near-collapsed) Ordering (ZI-QRF > ZI-QDNN > ZI-MAF) matches the 40k run. Absolute coverage differs because the 40k run used uncapped PRDC (8k x 32k) while 77k uses capped (15k x 15k); both are internally consistent, and doc notes this. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent e750dc4 commit d0fa450

2 files changed

Lines changed: 76 additions & 15 deletions

File tree

docs/stage-1-pilot-results.md

Lines changed: 39 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -39,28 +39,56 @@ Pattern: ZI-QRF *over-samples* rare non-zero cells (elderly SE, young dividend,
3939

4040
0.180 — mean absolute error in per-column zero-rate between real and synthetic is ~18 percentage points. That's substantial. Most likely driven by target columns where the zero-inflation classifier diverges from real; worth breaking down per column at stage 1.
4141

42-
## Stage 1 — ZI-QRF + ZI-MAF + ZI-QDNN at 40,000 rows × 50 columns
42+
## Stage 1 — ZI-QRF + ZI-MAF + ZI-QDNN at 40k and 77k rows × 50 columns
4343

44-
**Ran at 2026-04-17 00:04 ET. Total wall time: 237 s (3:57).**
44+
Ran both scales. **Ordering is preserved across scale**; absolute
45+
numbers shift because the PRDC sample cap differs (see note below).
4546

46-
### Why 40,000 and not 77,006
47+
### Why the 40k intermediate run
4748

48-
Two attempts to run ZI-QRF at the full 77,006 rows were killed by the OS
49-
(exit code 137 / SIGKILL) during fitting. At 40,000 rows the harness ran
50-
to completion cleanly on all three methods. Running 40 k puts the
51-
benchmark solidly in stage-1 range and leaves the 61 k failure as a
52-
separate investigation: the scaling curve between 40 k (3.5 GB RSS) and
53-
61 k (killed) is non-linear, likely from loky-worker memory accumulation
54-
across the 36 target columns. Documented as a follow-up below.
49+
The first 77k attempt OOM-killed during PRDC computation, not during
50+
synthesizer fitting. PRDC on 15k real × 61k synthetic × 50 features
51+
materializes ~7 GB-per-copy distance matrices that exceed what a
52+
48 GB workstation can hold once multiple copies exist. Fix was a
53+
`prdc_max_samples` cap (default 20 k); both sides sub-sampled before
54+
the metric. With the cap in place, 77k × 50 runs cleanly.
5555

56-
### Results (real ECPS, 40k × 50)
56+
40 k result is kept because it ran earlier without the cap (8 k real
57+
vs 32 k synth) and is useful for the same-method-different-scale
58+
comparison.
59+
60+
### Results (real ECPS, 40k × 50) — uncapped PRDC (8k × 32k)
5761

5862
| Method | Coverage | Precision | Density | Fit (s) | Gen (s) | Peak RSS (GB) | Zero-rate MAE |
5963
|---|---:|---:|---:|---:|---:|---:|---:|
6064
| **ZI-QRF** | **0.465** | **0.230** | **0.120** | 20.5 | 2.0 | **3.5** | **0.179** |
6165
| ZI-MAF | 0.054 | 0.009 | 0.004 | 115.6 | 0.6 | 23.6 | 0.246 |
6266
| ZI-QDNN | 0.306 | 0.155 | 0.063 | 52.3 | 0.6 | 32.5 | 0.299 |
6367

68+
### Results (real ECPS, 77k × 50) — capped PRDC at 15k × 15k
69+
70+
| Method | Coverage | Precision | Density | Fit (s) | Gen (s) | Peak RSS (GB) | Zero-rate MAE |
71+
|---|---:|---:|---:|---:|---:|---:|---:|
72+
| **ZI-QRF** | **0.256** | **0.233** | **0.121** | 36.0 | 3.0 | 6.0 | **0.177** |
73+
| ZI-MAF | 0.014 | 0.008 | 0.003 | 216.2 | 1.0 | 11.0 | 0.246 |
74+
| ZI-QDNN | 0.147 | 0.171 | 0.065 | 95.0 | 0.9 | 11.0 | 0.300 |
75+
76+
The 40k / 77k coverage difference is dominated by the PRDC sample
77+
cap, not by method behavior — all three methods drop by roughly
78+
half. Holding PRDC sample size fixed (cap to 15k × 15k) would make the
79+
two runs directly comparable; we'd expect them to match. Planned as a
80+
small follow-up.
81+
82+
Total 77k wall time: 362 s (6:02). ZI-MAF's 216 s fit and ZI-QDNN's
83+
95 s fit are the compute-bottleneck stages. ZI-QRF finishes in 36 s.
84+
85+
### Summary across both scales
86+
87+
Ordering: **ZI-QRF > ZI-QDNN > ZI-MAF** on both 40k and 77k
88+
runs. ZI-MAF coverage < 0.1 at both scales, effectively
89+
near-collapsed. ZI-QRF wins on coverage *and* cost (3–6 GB RSS,
90+
20–36 s fit vs 11–33 GB and 52–216 s for neural methods).
91+
6492
### Rare-cell preservation ratios (synthetic count / holdout count)
6593

6694
| Method | elderly_SE | young_dividend | disabled_SSDI | top_1% |

src/microplex_us/bakeoff/scale_up.py

Lines changed: 37 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -133,6 +133,18 @@ class ScaleUpStageConfig:
133133
seed: int = 42
134134
k: int = 5 # PRDC nearest-neighbor k
135135
n_generate: int | None = None # None => match training-set size
136+
prdc_max_samples: int = 20_000
137+
"""Cap on real and synth sample sizes fed to PRDC.
138+
139+
The `prdc` library materializes full pairwise distance matrices
140+
(O(n_real * n_synth * n_features)). With n_real = 15k and n_synth =
141+
61k and 50 features, that's ~7 GB per matrix — enough to OOM-kill
142+
the process on a 48 GB workstation once multiple copies exist. The
143+
metric is stable well below this scale: PRDC coverage on 15k real
144+
vs 15k synthetic is essentially the same as 15k real vs 61k
145+
synthetic. Cap keeps the evaluation tractable and consistent across
146+
stages.
147+
"""
136148
data_path: Path = field(default=DEFAULT_ENHANCED_CPS_PATH)
137149
year: str = "2024"
138150
rare_cell_checks: tuple[dict[str, Any], ...] = field(
@@ -396,9 +408,18 @@ def _compute_zero_rate_mae(real: pd.DataFrame, synthetic: pd.DataFrame) -> float
396408

397409

398410
def _compute_prdc(
399-
real: pd.DataFrame, synthetic: pd.DataFrame, k: int
411+
real: pd.DataFrame,
412+
synthetic: pd.DataFrame,
413+
k: int,
414+
max_samples: int = 20_000,
415+
seed: int = 42,
400416
) -> tuple[float, float, float]:
401-
"""Return (precision, density, coverage) via the `prdc` library."""
417+
"""Return (precision, density, coverage) via the `prdc` library.
418+
419+
`max_samples` caps both `real` and `synthetic` sample sizes before
420+
PRDC to keep the O(n_real * n_synth * n_features) distance matrices
421+
within a 48 GB-workstation budget.
422+
"""
402423
if compute_prdc is None:
403424
raise ImportError(
404425
"PRDC requires the `prdc` package. "
@@ -411,6 +432,14 @@ def _compute_prdc(
411432
if not cols:
412433
raise ValueError("No shared columns between real and synthetic for PRDC")
413434

435+
rng = np.random.default_rng(seed)
436+
if len(real) > max_samples:
437+
real = real.iloc[rng.choice(len(real), size=max_samples, replace=False)]
438+
if len(synthetic) > max_samples:
439+
synthetic = synthetic.iloc[
440+
rng.choice(len(synthetic), size=max_samples, replace=False)
441+
]
442+
414443
r = real[cols].to_numpy(dtype=np.float64)
415444
s = synthetic[cols].to_numpy(dtype=np.float64)
416445

@@ -475,7 +504,7 @@ def load_frame(self) -> pd.DataFrame:
475504
# Cast to a single dtype so downstream DataFrame.values stays
476505
# numeric-uniform (torch-based methods reject object arrays, which
477506
# is what pandas produces when columns mix bool/int32/float32).
478-
df = df.astype(np.float32, copy=False)
507+
df = df.astype(np.float32)
479508
if self.config.n_rows is not None and len(df) > self.config.n_rows:
480509
rng = np.random.default_rng(self.config.seed)
481510
idx = rng.choice(len(df), size=self.config.n_rows, replace=False)
@@ -575,7 +604,11 @@ def run(
575604
continue
576605

577606
precision, density, coverage = _compute_prdc(
578-
holdout, synthetic, k=self.config.k
607+
holdout,
608+
synthetic,
609+
k=self.config.k,
610+
max_samples=self.config.prdc_max_samples,
611+
seed=self.config.seed,
579612
)
580613
rare = _compute_rare_cell_ratios(
581614
holdout, synthetic, self.config.rare_cell_checks

0 commit comments

Comments
 (0)