Run stage 1 at full 77k; cap PRDC samples to fix OOM

MaxGhenis · claude · MaxGhenis · commit d0fa450c6f5e · 2026-04-17T00:26:22.000-04:00
Earlier 77k attempts died during PRDC computation, not during
synthesizer fitting. PRDC on 15k real x 61k synthetic x 50 features
materialized ~7 GB-per-copy distance matrices and OOM'd.

Fix: add prdc_max_samples to ScaleUpStageConfig (default 20k). Both
real and synthetic are sub-sampled before PRDC. The coverage metric is
stable well below the capped size; more synthetic records don't
improve it, they only cost memory.
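
The arithmetic behind the ~7 GB figure can be sketched as below. This
is a rough estimate, assuming one dense float64 real-by-synthetic
distance matrix; `dense_matrix_gb` is a hypothetical helper, not part
of the harness or of `prdc`, and it ignores any other allocations the
library makes.

```python
def dense_matrix_gb(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Size in GB (1e9 bytes) of a dense n_rows x n_cols matrix."""
    return n_rows * n_cols * bytes_per_value / 1e9


# Sizes quoted in this commit: 15k real holdout rows, 61k synthetic rows.
n_real, n_synth = 15_000, 61_000
cap = 20_000  # default prdc_max_samples

uncapped = dense_matrix_gb(n_real, n_synth)
capped = dense_matrix_gb(min(n_real, cap), min(n_synth, cap))

print(f"uncapped: {uncapped:.2f} GB, capped: {capped:.2f} GB")
# prints: uncapped: 7.32 GB, capped: 2.40 GB
```

The feature count does not enter the matrix size; it only affects the
cost of computing each distance.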

Stage 1 at 77k x 50:
  ZI-QRF:   cov=0.256 fit= 36s RSS= 6.0 GB (winner, production-workable)
  ZI-QDNN:  cov=0.147 fit= 95s RSS=11.0 GB
  ZI-MAF:   cov=0.014 fit=216s RSS=11.0 GB (near-collapsed)

Ordering (ZI-QRF > ZI-QDNN > ZI-MAF) matches the 40k run.
Absolute coverage differs because the 40k run used uncapped PRDC
(8k x 32k) while 77k uses capped (15k x 15k); both are internally
consistent, and the doc notes this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/docs/stage-1-pilot-results.md b/docs/stage-1-pilot-results.md
@@ -39,28 +39,56 @@ Pattern: ZI-QRF *over-samples* rare non-zero cells (elderly SE, young dividend,
 
 0.180 — mean absolute error in per-column zero-rate between real and synthetic is ~18 percentage points. That's substantial. Most likely driven by target columns where the zero-inflation classifier diverges from real; worth breaking down per column at stage 1.
 
-## Stage 1 — ZI-QRF + ZI-MAF + ZI-QDNN at 40,000 rows × 50 columns
+## Stage 1 — ZI-QRF + ZI-MAF + ZI-QDNN at 40k and 77k rows × 50 columns
 
-**Ran at 2026-04-17 00:04 ET. Total wall time: 237 s (3:57).**
+Ran both scales. **Ordering is preserved across scale**; absolute
+numbers shift because the PRDC sample cap differs (see note below).
 
-### Why 40,000 and not 77,006
+### Why the 40k intermediate run
 
-Two attempts to run ZI-QRF at the full 77,006 rows were killed by the OS
-(exit code 137 / SIGKILL) during fitting. At 40,000 rows the harness ran
-to completion cleanly on all three methods. Running 40 k puts the
-benchmark solidly in stage-1 range and leaves the 61 k failure as a
-separate investigation: the scaling curve between 40 k (3.5 GB RSS) and
-61 k (killed) is non-linear, likely from loky-worker memory accumulation
-across the 36 target columns. Documented as a follow-up below.
+The first 77k attempt was OOM-killed during PRDC computation, not during
+synthesizer fitting. PRDC on 15k real × 61k synthetic × 50 features
+materializes ~7 GB-per-copy distance matrices that exceed what a
+48 GB workstation can hold once multiple copies exist. The fix was a
+`prdc_max_samples` cap (default 20k); both sides are sub-sampled before
+the metric is computed. With the cap in place, 77k × 50 runs cleanly.
 
-### Results (real ECPS, 40k × 50)
+The 40k result is kept because it ran earlier without the cap (8k real
+vs 32k synth) and is useful for the same-method-different-scale
+comparison.
+
+### Results (real ECPS, 40k × 50) — uncapped PRDC (8k × 32k)
 
 | Method | Coverage | Precision | Density | Fit (s) | Gen (s) | Peak RSS (GB) | Zero-rate MAE |
 |---|---:|---:|---:|---:|---:|---:|---:|
 | **ZI-QRF** | **0.465** | **0.230** | **0.120** | 20.5 | 2.0 | **3.5** | **0.179** |
 | ZI-MAF | 0.054 | 0.009 | 0.004 | 115.6 | 0.6 | 23.6 | 0.246 |
 | ZI-QDNN | 0.306 | 0.155 | 0.063 | 52.3 | 0.6 | 32.5 | 0.299 |
 
+### Results (real ECPS, 77k × 50) — capped PRDC at 15k × 15k
+
+| Method | Coverage | Precision | Density | Fit (s) | Gen (s) | Peak RSS (GB) | Zero-rate MAE |
+|---|---:|---:|---:|---:|---:|---:|---:|
+| **ZI-QRF** | **0.256** | **0.233** | **0.121** | 36.0 | 3.0 | 6.0 | **0.177** |
+| ZI-MAF | 0.014 | 0.008 | 0.003 | 216.2 | 1.0 | 11.0 | 0.246 |
+| ZI-QDNN | 0.147 | 0.171 | 0.065 | 95.0 | 0.9 | 11.0 | 0.300 |
+
+The 40k / 77k coverage difference is dominated by the PRDC sample
+cap, not by method behavior: ZI-QRF and ZI-QDNN drop by roughly half
+and ZI-MAF by about three-quarters. Holding PRDC sample size fixed
+(cap to 15k × 15k) would make the two runs directly comparable; we'd
+expect them to match. Planned as a small follow-up.
+
+Total 77k wall time: 362 s (6:02). ZI-MAF's 216 s fit and ZI-QDNN's
+95 s fit are the compute-bottleneck stages. ZI-QRF finishes in 36 s.
+
+### Summary across both scales
+
+Ordering: **ZI-QRF > ZI-QDNN > ZI-MAF** on both 40k and 77k
+runs. ZI-MAF coverage < 0.1 at both scales, effectively
+near-collapsed. ZI-QRF wins on coverage *and* cost (3–6 GB RSS,
+20–36 s fit vs 11–33 GB and 52–216 s for neural methods).
+
 ### Rare-cell preservation ratios (synthetic count / holdout count)
 
 | Method | elderly_SE | young_dividend | disabled_SSDI | top_1% |
diff --git a/src/microplex_us/bakeoff/scale_up.py b/src/microplex_us/bakeoff/scale_up.py
@@ -133,6 +133,18 @@ class ScaleUpStageConfig:
     seed: int = 42
     k: int = 5  # PRDC nearest-neighbor k
     n_generate: int | None = None  # None => match training-set size
+    prdc_max_samples: int = 20_000
+    """Cap on real and synth sample sizes fed to PRDC.
+
+    The `prdc` library materializes full pairwise distance matrices
+    (O(n_real * n_synth) entries). With n_real = 15k and n_synth = 61k
+    in float64, that's ~7 GB per matrix, enough to OOM-kill
+    the process on a 48 GB workstation once multiple copies exist. The
+    metric is stable well below this scale: PRDC coverage on 15k real
+    vs 15k synthetic is essentially the same as 15k real vs 61k
+    synthetic. Cap keeps the evaluation tractable and consistent across
+    stages.
+    """
     data_path: Path = field(default=DEFAULT_ENHANCED_CPS_PATH)
     year: str = "2024"
     rare_cell_checks: tuple[dict[str, Any], ...] = field(
@@ -396,9 +408,18 @@ def _compute_zero_rate_mae(real: pd.DataFrame, synthetic: pd.DataFrame) -> float
 
 
 def _compute_prdc(
-    real: pd.DataFrame, synthetic: pd.DataFrame, k: int
+    real: pd.DataFrame,
+    synthetic: pd.DataFrame,
+    k: int,
+    max_samples: int = 20_000,
+    seed: int = 42,
 ) -> tuple[float, float, float]:
-    """Return (precision, density, coverage) via the `prdc` library."""
+    """Return (precision, density, coverage) via the `prdc` library.
+
+    `max_samples` caps both `real` and `synthetic` sample sizes before
+    PRDC to keep the O(n_real * n_synth) distance matrices
+    within a 48 GB-workstation budget.
+    """
     if compute_prdc is None:
         raise ImportError(
             "PRDC requires the `prdc` package. "
@@ -411,6 +432,14 @@ def _compute_prdc(
     if not cols:
         raise ValueError("No shared columns between real and synthetic for PRDC")
 
+    rng = np.random.default_rng(seed)
+    if len(real) > max_samples:
+        real = real.iloc[rng.choice(len(real), size=max_samples, replace=False)]
+    if len(synthetic) > max_samples:
+        synthetic = synthetic.iloc[
+            rng.choice(len(synthetic), size=max_samples, replace=False)
+        ]
+
     r = real[cols].to_numpy(dtype=np.float64)
     s = synthetic[cols].to_numpy(dtype=np.float64)
 
@@ -475,7 +504,7 @@ def load_frame(self) -> pd.DataFrame:
         # Cast to a single dtype so downstream DataFrame.values stays
         # numeric-uniform (torch-based methods reject object arrays, which
         # is what pandas produces when columns mix bool/int32/float32).
-        df = df.astype(np.float32, copy=False)
+        df = df.astype(np.float32)
         if self.config.n_rows is not None and len(df) > self.config.n_rows:
             rng = np.random.default_rng(self.config.seed)
             idx = rng.choice(len(df), size=self.config.n_rows, replace=False)
@@ -575,7 +604,11 @@ def run(
                 continue
 
             precision, density, coverage = _compute_prdc(
-                holdout, synthetic, k=self.config.k
+                holdout,
+                synthetic,
+                k=self.config.k,
+                max_samples=self.config.prdc_max_samples,
+                seed=self.config.seed,
             )
             rare = _compute_rare_cell_ratios(
                 holdout, synthetic, self.config.rare_cell_checks