Commit e750dc4

MaxGhenis and claude committed
Stage 1 scale-up results: small-benchmark ordering inverts at real scale
Ran ZI-QRF, ZI-MAF, ZI-QDNN on 40,000 rows x 50 columns of real
enhanced_cps_2024 and compared against the existing 10k x 7 synthetic
benchmark_multi_seed result.

Small (10k x 7 synthetic CPS)    Stage 1 (40k x 50 real ECPS)
ZI-MAF   0.499 (winner)          ZI-MAF   0.054 (near-collapsed)
ZI-QDNN  0.406                   ZI-QDNN  0.306 (mid-pack)
ZI-QRF   0.347                   ZI-QRF   0.465 (winner)

Rare-cell preservation:
  ZI-QRF:  modest over-sampling (2-4x), disabled_ssdi -> 0.0
  ZI-MAF:  elderly_self_employed -> 103x (zero-inflation classifier
           miscalibrated on real data), disabled_ssdi -> 0.0
  ZI-QDNN: elderly_self_employed -> 116x, disabled_ssdi -> 0.0

RSS cost:
  ZI-QRF   3.5 GB (production-workable on a 48 GB machine)
  ZI-MAF   23.5 GB (marginal)
  ZI-QDNN  32.5 GB (marginal; 1.6 TB naive extrapolation at 3.4M rows)

Harness fix: cast loaded DataFrame to float32. Column dtype mix
(bool / int32 / float32) previously caused torch-based methods to fail
with "can't convert np.ndarray of type numpy.object_".

Implications:
- Revises the G1 cross-section synthesizer default: ZI-QRF, not ZI-MAF
  (the small-benchmark winner).
- SS-model methodology doc's "production direction: ZI-QDNN" claim does
  not survive this stage. Needs revision.
- ZI-MAF + ZI-QDNN might recover with hyperparameter tuning, but at the
  default settings in the benchmark classes they are not competitive.

Not resolved:
- 61k rows OOM-kills ZI-QRF (SIGKILL, no output). Scaling is clean to
  40k. Cause likely loky worker accumulation across 36 target columns.
- PRDC in 50D may be degenerate; the scale-up doc flagged this as a
  risk. Needs embedding-based PRDC to confirm or deny the ordering.

uv.lock regenerated after the earlier Python >= 3.13 bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 06367fa commit e750dc4

3 files changed

Lines changed: 283 additions & 990 deletions

docs/stage-1-pilot-results.md

Lines changed: 118 additions & 23 deletions
@@ -39,38 +39,133 @@ Pattern: ZI-QRF *over-samples* rare non-zero cells (elderly SE, young dividend,
3939

4040
0.180 — mean absolute error in per-column zero-rate between real and synthetic is ~18 percentage points. That's substantial. Most likely driven by target columns where the zero-inflation classifier diverges from real; worth breaking down per column at stage 1.
4141

42-
## Stage 1 — ZI-QRF + ZI-MAF + ZI-QDNN at 77,006 rows × 50 columns
42+
## Stage 1 — ZI-QRF + ZI-MAF + ZI-QDNN at 40,000 rows × 50 columns
4343

44-
**Status: running at 2026-04-16 23:50 ET.** Results will be appended here when the job completes.
44+
**Ran at 2026-04-17 00:04 ET. Total wall time: 237 s (3:57).**
4545

46-
Expected completion based on ballpark from `docs/synthesizer-benchmark-scale-up.md`:
46+
### Why 40,000 and not 77,006
4747

48-
- ZI-QRF fit: ~15 minutes (36 target cols × ~25s each on 61k rows × 100 trees)
49-
- ZI-MAF fit: probably 45 min – 2 hours on CPU (no MPS integration in the benchmark class; one flow per column × 50 epochs × 256 batch size)
50-
- ZI-QDNN fit: ~20 min (smaller network, CPU-friendly)
51-
- Generation: 5–15 min per method
48+
Two attempts to run ZI-QRF at the full 77,006 rows were killed by the OS
49+
(exit code 137 / SIGKILL) during fitting. At 40,000 rows the harness ran
50+
to completion cleanly on all three methods. Running 40 k puts the
51+
benchmark solidly in stage-1 range and leaves the 61 k failure as a
52+
separate investigation: the scaling curve between 40 k (3.5 GB RSS) and
53+
61 k (killed) is non-linear, likely from loky-worker memory accumulation
54+
across the 36 target columns. Documented as a follow-up below.
5255

53-
Total stage 1 wall time: 1–3 hours.
56+
### Results (real ECPS, 40k × 50)
5457

55-
Output: `artifacts/scale_up_stage1.json`, `artifacts/scale_up_stage1.log`.
56-
57-
### Results (TO BE POPULATED)
58-
59-
Template table — update in place once the job completes:
60-
61-
| Method | Coverage | Precision | Density | Fit (s) | Gen (s) | Peak RSS | Zero-rate MAE |
58+
| Method | Coverage | Precision | Density | Fit (s) | Gen (s) | Peak RSS (GB) | Zero-rate MAE |
6259
|---|---:|---:|---:|---:|---:|---:|---:|
63-
| ZI-QRF | | | | | | | |
64-
| ZI-MAF | | | | | | | |
65-
| ZI-QDNN | | | | | | | |
60+
| **ZI-QRF** | **0.465** | **0.230** | **0.120** | 20.5 | 2.0 | **3.5** | **0.179** |
61+
| ZI-MAF | 0.054 | 0.009 | 0.004 | 115.6 | 0.6 | 23.6 | 0.246 |
62+
| ZI-QDNN | 0.306 | 0.155 | 0.063 | 52.3 | 0.6 | 32.5 | 0.299 |
6663

67-
### Rare-cell preservation ratios (TO BE POPULATED)
64+
### Rare-cell preservation ratios (synthetic count / holdout count)
6865

69-
| Method | elderly_SE | young_div | disabled_SSDI | top_1% |
66+
| Method | elderly_SE | young_dividend | disabled_SSDI | top_1% |
7067
|---|---:|---:|---:|---:|
71-
| ZI-QRF |||||
72-
| ZI-MAF |||||
73-
| ZI-QDNN |||||
68+
| ZI-QRF | 2.4 | 3.8 | **0.0** | 3.95 |
69+
| ZI-MAF | 103.6 | 3.8 | **0.0** | 3.95 |
70+
| ZI-QDNN | 116.7 | 3.4 | **0.0** | 3.95 |
71+
72+
Neural methods severely over-produce `elderly_self_employed` (100×+) —
73+
suggests their zero-inflation classifiers are fundamentally
74+
miscalibrated for this cell on real data. Every method drives
75+
`disabled_ssdi` to 0.0, consistent with the pilot finding. Every method
76+
over-produces top-1% employment at ~4×.
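A minimal sketch of how a preservation ratio like the ones in the table could be computed. The function name, the column names (`age`, `self_employment_income`) and the cell definition are hypothetical; the actual cell definitions live in the benchmark code and are not shown in this commit. The sketch also assumes the synthetic and holdout sets have the same number of rows, so raw counts are comparable.

```python
from typing import Callable, Iterable, Mapping

Record = Mapping[str, float]


def preservation_ratio(
    synthetic: Iterable[Record],
    holdout: Iterable[Record],
    in_cell: Callable[[Record], bool],
) -> float:
    """Synthetic count / holdout count for one rare cell.

    Returns NaN when the holdout contains no member of the cell,
    because the ratio is undefined in that case.
    """
    synthetic_count = sum(1 for row in synthetic if in_cell(row))
    holdout_count = sum(1 for row in holdout if in_cell(row))
    if holdout_count == 0:
        return float("nan")
    return synthetic_count / holdout_count


# Hypothetical cell definition, for illustration only.
def elderly_self_employed(row: Record) -> bool:
    return row["age"] >= 65 and row["self_employment_income"] > 0


holdout = [
    {"age": 70, "self_employment_income": 12_000.0},
    {"age": 40, "self_employment_income": 0.0},
    {"age": 68, "self_employment_income": 0.0},
    {"age": 30, "self_employment_income": 5_000.0},
]
synthetic = [
    {"age": 72, "self_employment_income": 9_000.0},
    {"age": 66, "self_employment_income": 1_500.0},
    {"age": 45, "self_employment_income": 0.0},
    {"age": 33, "self_employment_income": 0.0},
]

ratio = preservation_ratio(synthetic, holdout, elderly_self_employed)
undefined = preservation_ratio(synthetic, [], elderly_self_employed)
```

A ratio of 1.0 means the cell is preserved, values above 1.0 mean over-production (as with `elderly_self_employed` for the neural methods), and 0.0 means the cell vanished from the synthetic data (as with `disabled_ssdi`).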
77+
78+
## Major finding: the small-benchmark ordering inverts at production scale
79+
80+
| Method | 10k × 7 synthetic (benchmark_multi_seed, CPS column) | 40k × 50 real ECPS |
81+
|---|---:|---:|
82+
| ZI-MAF | 0.499 ← winner | **0.054** |
83+
| ZI-QDNN | 0.406 | 0.306 |
84+
| ZI-QRF | 0.347 | **0.465** ← winner |
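The 40k × 50 figures in this table match the Coverage column of the stage-1 results. For readers unfamiliar with the metric, here is a brute-force sketch of PRDC-style coverage under its usual definition: the fraction of real points whose k-nearest-neighbour ball (measured among real points) contains at least one synthetic point. Whether the benchmark's implementation matches this in every detail (choice of k, strict vs non-strict inequality, feature scaling) is an assumption. The function names are illustrative.

```python
import math
from typing import Sequence

Point = Sequence[float]


def kth_neighbor_radius(points: Sequence[Point], index: int, k: int) -> float:
    """Distance from points[index] to its k-th nearest other point."""
    distances = sorted(
        math.dist(points[index], other)
        for j, other in enumerate(points)
        if j != index
    )
    return distances[k - 1]


def coverage(real: Sequence[Point], synthetic: Sequence[Point], k: int = 5) -> float:
    """Fraction of real points with a synthetic point inside their k-NN ball."""
    covered = 0
    for i, center in enumerate(real):
        radius = kth_neighbor_radius(real, i, k)
        if any(math.dist(center, s) < radius for s in synthetic):
            covered += 1
    return covered / len(real)


# Two well-separated clusters of real points; the synthetic data only
# reaches the first cluster, so half of the real points are covered.
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
partial = coverage(real, [(0.5, 0.5)], k=1)
missed = coverage(real, [(100.0, 100.0)], k=1)
```

Read this way, a coverage of 0.054 means only about 5% of holdout records have any synthetic record inside their neighbourhood. The loop is quadratic in the number of points, so it is meant only to illustrate the definition, not to run at 40 k rows.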
85+
86+
**Read this result before trusting any small-scale benchmark.** The
87+
published ranking put ZI-MAF first, and the SS-model doc by implication
88+
named ZI-QDNN the near-term production direction. That ranking reversed
89+
completely as soon as we moved to:
90+
91+
1. Real joint distributions instead of analytically-generated synthetic.
92+
2. 50 columns instead of 7 (~7× feature dimensionality).
93+
3. 40 k rows instead of 10 k (4× data).
94+
95+
## Interpretation
96+
97+
1. **ZI-MAF at 0.054 is near-collapsed.** Not merely "third-best" — it's
98+
producing samples that aren't close to any holdout record. Three
99+
plausible causes, any combination of which might be active:
100+
- Default hyperparameters (n_layers=4, hidden_dim=32, 50 epochs) are
101+
too small for 50-dim targets. The network is a per-column flow, so
102+
each of the 36 flows has only ~1k–5k effective parameters. May be
103+
fundamentally under-capacity.
104+
- Zero-inflation handling in ZI-MAF combines a classifier (RF, 50
105+
trees) for P(zero) with a MAF for nonzero values. When the
106+
classifier is imprecise on rare non-zero cells, the MAF has very
107+
few positive samples to train on, and mode-collapses.
108+
- The loss log-transforms positive values and standardizes; for
109+
heavy-tailed distributions (top-1 % income) this degrades
110+
conditional tail estimation.
111+
2. **ZI-QDNN at 0.306 is mid-pack.** Better than ZI-MAF but materially
112+
worse than ZI-QRF. Suggests the quantile DNN's conditional
113+
estimates are reasonable but not tree-accurate. Worth noting RSS
114+
was 32 GB — highest of the three — which would OOM on a typical
115+
workstation without swap. Not a production-ready cost profile
116+
without batch-size or architecture tuning.
117+
3. **ZI-QRF at 0.465 is the clear winner.** 3.5 GB RSS, 20-second fit,
118+
and about 1.5× ZI-QDNN's coverage (0.465 vs 0.306). This is the production default for
119+
the rewire's cross-section synthesizer step.
120+
121+
## Implications for the SS-model methodology doc
122+
123+
The SS-model methodology doc's "production direction: ZI-QDNN" claim
124+
does not survive this benchmark. At production scale on real data with
125+
default hyperparameters, neither ZI-MAF nor ZI-QDNN is competitive with
126+
ZI-QRF. The doc should be updated to note this finding, and the
127+
longitudinal extension should treat ZI-QRF as at minimum a strong
128+
baseline.
129+
130+
Two caveats keep the SS-model direction alive (items 1 and 2); item 3 is a cost concern that cuts the other way:
131+
132+
1. Hyperparameter-tuned ZI-MAF / ZI-QDNN *might* beat ZI-QRF. The
133+
scale-up doc listed "ZI-MAF needs careful hyperparameter tuning on
134+
real data" as a known risk; stage-1 confirms the risk.
135+
2. Trajectory / pathwise generation is a different problem from
136+
cross-sectional conditional modeling. A sequence-model win at
137+
longitudinal need not follow from cross-sectional results.
138+
3. The neural methods peaked at 23.6 GB (ZI-MAF) and 32.5 GB (ZI-QDNN)
139+
RSS; at the 3.4 M row v6 scale the naive extrapolation for ZI-QDNN is ~1.6 TB. Tree methods'
140+
modest memory profile may be decisive on a workstation regardless
141+
of quality.
142+
143+
## Follow-up work flagged by this run
144+
145+
1. **61k ZI-QRF OOM diagnosis.** Scaling is clean up to 40 k (3.5 GB
146+
RSS). 61 k fails silently in < 2 min with SIGKILL. Most likely
147+
cause: loky workers accumulating memory across the 36 target
148+
columns. Fix paths: `n_jobs=4` instead of `-1`, or a
149+
worker-recycling wrapper, or just disable parallelism and accept
150+
slower fit.
151+
2. **ZI-MAF hyperparameter search.** Before accepting
152+
ZI-MAF-is-not-viable as the final answer, run with n_layers=8,
153+
hidden_dim=128, epochs=200 and see if coverage recovers. One
154+
evening of tuning could either rescue the method or definitively
155+
rule it out.
156+
3. **Embedding-based PRDC.** Raw-feature PRDC in 50 dimensions is
157+
predicted by the scale-up doc to degenerate. Fit a 16-dim
158+
autoencoder on holdout, re-run PRDC in that space, and check
159+
whether the method ordering changes. If it does, the 40 k × 50 result
160+
is a metric artifact, not a method verdict.
161+
4. **Per-column zero-rate breakdown.** All three methods drive
162+
`disabled_ssdi` to 0.0 synthetic count. Needs per-column MAE
163+
reporting to identify which other columns systematically break.
164+
5. **`microcalibrate` applied on top.** The synthesizer results above
165+
are uncalibrated. The mainline pipeline runs synthesis then
166+
calibration. Worth repeating stage 1 with `MicrocalibrateAdapter`
167+
applied to the generated records and measuring whether calibration
168+
lifts ZI-MAF / ZI-QDNN coverage back into the competitive range.
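For follow-up item 4, a per-column breakdown of the zero-rate error could look like the sketch below. It assumes the zero-rate MAE reported above is the plain mean, over columns, of the absolute difference between the real and synthetic share of exact zeros. The function names and the toy columns are illustrative and are not taken from the harness.

```python
from typing import Mapping, Sequence

Columns = Mapping[str, Sequence[float]]


def zero_rate(values: Sequence[float]) -> float:
    """Share of entries that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)


def zero_rate_errors(real: Columns, synthetic: Columns) -> dict[str, float]:
    """Absolute zero-rate gap per column, keyed by column name."""
    return {
        name: abs(zero_rate(real[name]) - zero_rate(synthetic[name]))
        for name in real
    }


def zero_rate_mae(errors: Mapping[str, float]) -> float:
    """Mean of the per-column gaps (the single number in the results table)."""
    return sum(errors.values()) / len(errors)


# Toy data: two columns, four rows each.
real = {
    "ssdi": [0.0, 0.0, 0.0, 500.0],
    "wages": [0.0, 30_000.0, 45_000.0, 52_000.0],
}
synthetic = {
    "ssdi": [0.0, 0.0, 0.0, 0.0],
    "wages": [0.0, 0.0, 0.0, 60_000.0],
}

errors = zero_rate_errors(real, synthetic)
mae = zero_rate_mae(errors)
# Worst offenders first, which is the view the follow-up asks for.
ranked = sorted(errors.items(), key=lambda item: item[1], reverse=True)
```

Sorting by the per-column gap shows which columns drive the aggregate, which a single MAE figure cannot.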
74169

75170
## Interpretation guide (for when results land)
76171

src/microplex_us/bakeoff/scale_up.py

Lines changed: 4 additions & 0 deletions
@@ -472,6 +472,10 @@ def load_frame(self) -> pd.DataFrame:
472472
        self.logger.info(
473473
            "loaded enhanced_cps: %d rows, %d cols", len(df), len(df.columns)
474474
        )
475+
        # Cast to a single dtype so downstream DataFrame.values stays
476+
        # numeric-uniform (torch-based methods reject object arrays, which
477+
        # is what pandas produces when columns mix bool/int32/float32).
478+
        df = df.astype(np.float32, copy=False)
475479
        if self.config.n_rows is not None and len(df) > self.config.n_rows:
476480
            rng = np.random.default_rng(self.config.seed)
477481
            idx = rng.choice(len(df), size=self.config.n_rows, replace=False)
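The effect of the cast can be reproduced on a toy frame. This is a sketch, not harness code: the column names are made up, and the only thing taken from the commit is the bool / int32 / float32 dtype mix. It needs `numpy` and `pandas` installed.

```python
import numpy as np
import pandas as pd

# Toy frame with the dtype mix described in the commit message.
df = pd.DataFrame(
    {
        "is_disabled": np.array([True, False, False]),
        "age": np.array([70, 41, 29], dtype=np.int32),
        "wages": np.array([0.0, 52_000.0, 31_500.0], dtype=np.float32),
    }
)

# With a bool column mixed in, pandas has no common numeric dtype and
# falls back to object when building the 2-D array.
mixed_dtype = df.to_numpy().dtype

# After the cast every column is float32, so the array is float32 too.
uniform = df.astype(np.float32)
uniform_dtype = uniform.to_numpy().dtype

# Caveat: float32 represents integers exactly only up to 2**24.
largest_exact = 2**24
loses_precision = np.float32(largest_exact + 1) == np.float32(largest_exact)
```

Because of the last point, it may be worth checking whether any int32 column in the loaded frame (identifiers, large counts) holds values above 2**24 = 16,777,216, since those would be rounded by the cast.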
