
Commit 298d915

MaxGhenis and claude committed
ZI-MAF hyperparameter tuning result: 10x gap to ZI-QRF not closeable
Four ZI-MAF configurations ran at 40k x 50 real ECPS:

  default   (4L, 32h, 50e):            coverage=0.026 fit=124s
  wide      (4L, 128h, 50e):           coverage=0.029 fit=228s
  long      (4L, 32h, 200e):           coverage=0.032 fit=467s
  wide+long (8L, 128h, 200e, lr=5e-4): coverage=0.033 fit=1711s

ZI-QRF on the same data at the same PRDC cap: coverage=0.352 in 19s.

14x the compute budget moves ZI-MAF from 0.026 -> 0.033, a 25% relative improvement that does not close the 10x gap to ZI-QRF. Stage-1 verdict stands: ZI-QRF is the production synthesizer, ZI-MAF is confirmed non-competitive at this scale with the current method-class architecture.

Diagnosis (docs/zi-maf-hyperparameter-search.md):
- Per-column independent flows can't capture cross-target correlations.
- Zero-inflation RF classifier + MAF combination is biased on rare cells.
- Log-transform + standardization compresses heavy tails.
- Rescuing ZI-MAF plausibly requires joint-target architecture, which is a week of implementation that may still not close the gap.

SS-model methodology doc's "production direction: ZI-QDNN" claim remains overturned; stage-1 ZI-QDNN was mid-pack (0.147 at 77k) and this tuning exercise doesn't revisit it.

Artifact: artifacts/zi_maf_tuning.json

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 3d1ab93 commit 298d915

2 files changed

Lines changed: 95 additions & 5 deletions

docs/overnight-session-2026-04-16.md

Lines changed: 5 additions & 5 deletions
@@ -125,12 +125,12 @@ After the stage-1 evidence landed, I continued with the open items:
synthetic, and reports PRDC both in raw 50-dim space and in the
learned 16-dim latent space. Settles whether the stage-1 ordering
is metric-driven or method-driven. Not yet executed.
-4. **ZI-MAF hyperparameter tuning run in progress**: four configs
-(default, wide, long, wide+long). Running at 40k × 50. Job started
-07:16 ET and is still progressing; will land in a separate doc
-update once complete.
+4. **ZI-MAF hyperparameter tuning completed** (`docs/zi-maf-hyperparameter-search.md`): four configs ran on 40k × 50. Coverage goes from 0.026 (default) to 0.033 (wide+long, 16× params + 8 layers, 28 min fit). ZI-QRF on the same data gets 0.352 in 19 s. **ZI-MAF confirmed non-competitive** at stage-1 scale; no amount of tuning within the method-class architecture closes a 10× gap.
+5. **Quickstart doc** (`docs/quickstart-rewire.md`): ordered walkthrough of all tooling: G1 flag, scale-up harness, embedding-PRDC script, calibrate-on-synth script, diagnostics reproduction.
+6. **Scripts for follow-on experiments**: `scripts/embedding_prdc_compare.py` (PRDC in learned 16-dim latent vs raw 50-dim) and `scripts/calibrate_on_synthesizer.py` (does calibration rescue weak synthesis?). Both executable, not yet run.
+7. **Method-kwargs config**: `ScaleUpStageConfig.method_kwargs` lets future runs override per-method hyperparameters through the normal harness path rather than standalone tuning scripts.

-Updated PR #3 count: **15 commits**, all green tests, all pushed.
+Updated PR #3 count: **19 commits**, all green tests, all pushed.

## How to run stage 1 yourself

docs/zi-maf-hyperparameter-search.md

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
# ZI-MAF hyperparameter search: does tuning rescue the method?

*Direct test of the stage-1 follow-up flagged in `docs/stage-1-pilot-results.md`.*

## Setup

40,000 rows × 50 columns of real `enhanced_cps_2024` data (identical to stage 1). ZI-MAF is trained at four progressively larger configurations on the same seed and split. PRDC is evaluated in the 50-dim raw feature space, capped at 15k × 15k samples (the same cap as the stage-1 77k run).
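
For readers less familiar with the metric: PRDC coverage is the share of real points whose k-nearest-neighbour ball contains at least one synthetic point. The following is a minimal pure-Python sketch of that definition, assuming Euclidean distance. The function names, the toy 2-D data, and `k=2` are illustrative assumptions; this is not the `prdc` package's implementation and it would be far too slow at 15k × 15k.

```python
import math

def knn_radius(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour among the other points."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]

def coverage(real, synth, k=5):
    """Share of real points with at least one synthetic point inside their k-NN ball."""
    hits = 0
    for i, x in enumerate(real):
        radius = knn_radius(real, i, k)
        nearest_synth = min(math.dist(x, s) for s in synth)
        hits += nearest_synth < radius
    return hits / len(real)

# Toy data: one tight cluster near the origin and one far away.
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0), (11.0, 10.0)]
good = [(0.5, 0.5), (10.5, 10.0)]   # synthetic points in both clusters
collapsed = [(0.5, 0.5)]            # synthetic points in one cluster only

print(coverage(real, good, k=2))       # every real point is covered
print(coverage(real, collapsed, k=2))  # the far cluster is missed
```

A synthesizer that concentrates samples around the mode behaves like `collapsed`: it can look plausible point by point while leaving most of the real support uncovered.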

| Config | n_layers | hidden_dim | epochs | batch | lr | Scale vs. default |
|---|---:|---:|---:|---:|---:|---|
| default | 4 | 32 | 50 | 256 | 1e-3 | baseline |
| wide | 4 | 128 | 50 | 256 | 1e-3 | 4× width |
| long | 4 | 32 | 200 | 256 | 1e-3 | 4× epochs |
| wide+long | 8 | 128 | 200 | 256 | 5e-4 | 4× width, 4× epochs, 2× layers |
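
The four rows can also be written down as data, which makes the scale factors explicit. This is a sketch only: `MAFConfig` is a hypothetical container whose field names are taken from the table headers, not the constructor signature of `ZIMAFMethod`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MAFConfig:
    """Hypothetical holder for one row of the configuration table."""
    name: str
    n_layers: int
    hidden_dim: int
    epochs: int
    batch: int = 256
    lr: float = 1e-3

CONFIGS = [
    MAFConfig("default", 4, 32, 50),
    MAFConfig("wide", 4, 128, 50),
    MAFConfig("long", 4, 32, 200),
    MAFConfig("wide+long", 8, 128, 200, lr=5e-4),
]

base = CONFIGS[0]
for c in CONFIGS:
    # Ratios of width, epochs, and depth relative to the default row.
    print(c.name, c.hidden_dim / base.hidden_dim, c.epochs / base.epochs, c.n_layers / base.n_layers)
```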

## Results

| Config | Coverage | Precision | Density | Fit (s) | Gen (s) |
|---|---:|---:|---:|---:|---:|
| default | 0.0262 | 0.0083 | 0.0038 | 124 | 0.7 |
| wide | 0.0293 | 0.0088 | 0.0043 | 228 | 0.8 |
| long | 0.0318 | 0.0097 | 0.0048 | 467 | 0.6 |
| wide+long | **0.0328** | 0.0107 | 0.0050 | 1,711 | 1.0 |

Moving coverage from 0.026 to 0.033 takes about 14× the fit time (124 s to 1,711 s). ZI-QRF on the same data at the same PRDC cap reaches **coverage 0.352 in 19 s**.
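
The headline ratios follow directly from the coverage and fit-time figures reported above. A short check, using only those numbers (the variable names are illustrative):

```python
# (coverage, fit seconds) per ZI-MAF config, as reported in the results table.
results = {
    "default": (0.0262, 124),
    "wide": (0.0293, 228),
    "long": (0.0318, 467),
    "wide+long": (0.0328, 1711),
}
zi_qrf_cov, zi_qrf_fit = 0.352, 19

base_cov, base_fit = results["default"]
best_cov, best_fit = results["wide+long"]

rel_gain = best_cov / base_cov - 1      # about 0.25: the 25% relative improvement
compute_ratio = best_fit / base_fit     # about 13.8: the "14x" fit-time cost
coverage_gap = zi_qrf_cov / best_cov    # about 10.7: the "10x" gap to ZI-QRF
fit_ratio = best_fit / zi_qrf_fit       # about 90: ZI-QRF fits in roughly 1/90 the time

print(f"{rel_gain:.1%} {compute_ratio:.1f}x {coverage_gap:.1f}x {fit_ratio:.0f}x")
```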

## Verdict

**ZI-MAF is confirmed non-competitive at stage-1 scale with the current method-class architecture.** Widening the network (4× hidden width), training longer (4× epochs), and doing both with twice the depth (8 layers instead of 4) moves coverage from 0.026 to 0.033, a 25% relative improvement. ZI-QRF's 0.352 is roughly 10× higher at about 1/90 of the wide+long fit time.

The stage-1 finding stands: ZI-QRF is the production synthesizer, not ZI-MAF. Nothing in these four runs suggests that hyperparameter tuning within the current architecture will close a 10× gap.

## Why ZI-MAF fails here

Hypotheses, ordered by how plausible they seem on this evidence:

1. **Per-column independence.** `ZIMAFMethod` trains one `ConditionalMAF` per target column independently. With 36 target columns, each of the 36 flows learns only `P(col_i | conditioning)`, so there is no mechanism to capture cross-target correlations (e.g., someone with high wage income also has zero SNAP). Joint-target flows would be architecturally different but expensive. Tree methods (ZI-QRF) implicitly capture some of these correlations via the conditioning features, and their per-column independence is less damaging because each tree doesn't try to encode a full joint distribution.

2. **Zero-inflation classifier + flow combo.** The method first classifies P(zero) via a 50-tree RF, then trains a flow on the non-zero subset. If the classifier over-predicts zero on rare non-zero cells (see stage-1's `disabled_ssdi` ratio = 0, `elderly_self_employed` ratio = 100+), the flow is trained on a biased subset and produces samples that don't cover the missing support.

3. **Log-transform + standardization on heavy-tailed targets.** The flow log-transforms positive values (`np.log1p(y[y>0])`) and standardizes. For variables with extreme tails (top-1% employment income, net-worth-level wealth), this compresses the tail and the flow produces samples concentrated around the mode; sparse tail coverage is exactly what PRDC measures.

4. **No conditional target structure.** MAF learns `P(y | x)` where `x` is the shared demographics. 14 conditioning dims predicting 36 target dims (each modeled as a 1-dim marginal flow conditional on the 14) may be under-identified at 40k rows × 36 targets.
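
The first hypothesis can be illustrated with a toy example. The sketch below is not `ZIMAFMethod`: it assumes an invented rule (SNAP is positive only when wages are below 20,000) and mimics per-column generation by drawing each column from its own marginal, with no conditioning features at all. The real method conditions on demographics, so it would lose only the dependence those features do not explain.

```python
import random

rng = random.Random(0)

# Toy "real" data: SNAP is positive only when wages are low (assumed rule).
real = []
for _ in range(10_000):
    wage = rng.lognormvariate(10.5, 1.0)
    snap = rng.uniform(1_000, 3_000) if wage < 20_000 else 0.0
    real.append((wage, snap))

# Per-column sampling: each column keeps its marginal but ignores the other column.
wages = [w for w, _ in real]
snaps = [s for _, s in real]
rng.shuffle(snaps)
independent = list(zip(wages, snaps))

def impossible_share(rows, wage_cut=20_000):
    """Share of rows that pair a high wage with positive SNAP."""
    return sum(w >= wage_cut and s > 0 for w, s in rows) / len(rows)

print(impossible_share(real))         # 0.0 by construction
print(impossible_share(independent))  # a sizeable share of impossible rows
```

Both datasets have identical marginals, so any per-column check would pass; only a joint metric such as PRDC in the full feature space sees the difference.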

## What would change my mind

Any one of the following could lift ZI-MAF into competitive range:

- **Joint-target flow**: one flow over all 36 target columns simultaneously, not 36 independent flows. Direction matches the SS-model methodology doc's "pathwise / trajectory" framing for longitudinal work.
- **Better zero-inflation handling**: a joint zero-mask model (which 36-dim binary vector does this person have?) instead of 36 independent RF classifiers. Training signal correlates zero patterns across targets.
- **Embedding-based PRDC**: the validation run flagged in `stage-1-pilot-results.md` could show ZI-MAF produces structurally-right samples that raw-feature PRDC misses. Separate investigation.

None of these are in the current `ZIMAFMethod` class. Building them is a materially different project.
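
To make the zero-mask idea concrete, here is a toy comparison between independent per-column zero models and the simplest possible joint model, which resamples whole masks from the empirical distribution. Everything here is an assumption for illustration: three targets instead of 36, two invented mask patterns, and no conditioning on demographics. A real joint zero-mask model would need to be conditional and to generalize beyond observed patterns.

```python
import random
from collections import Counter

rng = random.Random(0)

# Toy zero masks over 3 targets (1 = non-zero). Only two patterns ever occur:
# wage earners (1, 0, 0) and benefit recipients (0, 1, 1).
real_masks = [(1, 0, 0) if rng.random() < 0.7 else (0, 1, 1) for _ in range(5_000)]
seen = set(real_masks)

# Independent per-column model: one Bernoulli rate per target.
rates = [sum(m[j] for m in real_masks) / len(real_masks) for j in range(3)]
indep = [tuple(int(rng.random() < p) for p in rates) for _ in range(5_000)]

# Joint model: resample whole masks in proportion to their observed frequency.
patterns, weights = zip(*Counter(real_masks).items())
joint = rng.choices(patterns, weights=weights, k=5_000)

def unseen_share(masks):
    """Share of sampled masks that never appear in the real data."""
    return sum(m not in seen for m in masks) / len(masks)

print(unseen_share(indep))  # most independent draws are patterns never observed
print(unseen_share(joint))  # 0.0: joint resampling only emits observed patterns
```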

## Implication for the SS-model methodology doc

The doc names ZI-QDNN as the production direction with ZI-MAF as a reasonable alternative. Neither holds up on the stage-1 evidence: ZI-QDNN was mid-pack in stage 1 (coverage 0.147 at 77k) and is not revisited here, and ZI-MAF stays non-competitive after tuning. The near-term cross-section synthesizer default on the rewire is **ZI-QRF**; any future trajectory-based modeling for the longitudinal extension will need a materially different architecture from per-column independent flows.

## Where this leaves us

- **G1 cross-section default**: ZI-QRF. Locked in.
- **ZI-MAF / ZI-QDNN**: not dead as research directions, but dead as production defaults in their current `microplex.eval.benchmark` implementations.
- **Follow-up worth trying before fully ruling out neural methods**: joint-target flow + joint zero-mask model. Needs about a week of implementation and may still not close the gap.

## Reproducibility

```bash
uv run python -c "
import json, time, numpy as np, pandas as pd
from microplex_us.bakeoff import ScaleUpRunner, ScaleUpStageConfig, DEFAULT_CONDITION_COLS, DEFAULT_TARGET_COLS, stage1_config
from microplex.eval.benchmark import ZIMAFMethod
from prdc import compute_prdc
from sklearn.preprocessing import StandardScaler

# Reuse the stage-1 data path and year.
base = stage1_config()
cfg = ScaleUpStageConfig(
    stage='zi_maf_tuning', n_rows=40000, methods=('ZI-QRF',),
    condition_cols=DEFAULT_CONDITION_COLS, target_cols=DEFAULT_TARGET_COLS,
    holdout_frac=0.2, seed=42, k=5, n_generate=32000,
    data_path=base.data_path, year=base.year, rare_cell_checks=(),
    prdc_max_samples=15000,
)
# Load and split the frame once; every ZI-MAF config shares this seed and split.
runner = ScaleUpRunner(cfg)
df = runner.load_frame()
train, holdout = runner.split(df)
# ... fit and evaluate each config ...
"
```

Full results are in `artifacts/zi_maf_tuning.json`. Wall time for all four configs: ~43 min.
