|
| 1 | +# Embedding-PRDC validation — is the stage-1 ordering real? |
| 2 | + |
| 3 | +*Settles the open question flagged in `docs/synthesizer-benchmark-scale-up.md`: is PRDC in 50-dim raw feature space too noisy to trust? Answer: the ordering is preserved.* |
| 4 | + |
| 5 | +## Setup |
| 6 | + |
| 7 | +40,000 rows × 50 columns of real enhanced_cps_2024. Same setup as stage-1. |
| 8 | + |
| 9 | +Autoencoder: 50 → 64 → 64 → **16** → 64 → 64 → 50 (2 hidden layers encoder + decoder, ReLU activations). Fit on holdout only (not on synthetic) for 200 epochs, batch 256, lr 1e-3. Final reconstruction MSE loss: 0.054. |
| 10 | + |
| 11 | +For each method (ZI-QRF / ZI-MAF / ZI-QDNN) at default hyperparameters: fit on 32k train, generate 32k synthetic, compute PRDC on 15k/15k samples (capped) in both the raw 50-dim feature space and the 16-dim latent space. |
| 12 | + |
| 13 | +## Results |
| 14 | + |
| 15 | +| Method | Raw-50 coverage | Raw-50 precision | Raw-50 density | Emb-16 coverage | Emb-16 precision | Emb-16 density | |
| 16 | +|---|---:|---:|---:|---:|---:|---:| |
| 17 | +| ZI-QRF | **0.348** | 0.229 | 0.118 | **0.309** | 0.291 | 0.133 | |
| 18 | +| ZI-QDNN | 0.219 | 0.156 | 0.063 | 0.222 | 0.241 | 0.088 | |
| 19 | +| ZI-MAF | 0.025 | 0.008 | 0.003 | 0.038 | 0.024 | 0.010 | |
| 20 | + |
| 21 | +**Ordering preserved in both spaces: ZI-QRF > ZI-QDNN > ZI-MAF.** |
| 22 | + |
| 23 | +## Observations |
| 24 | + |
| 25 | +1. **The stage-1 verdict is not a metric artifact.** The concern in the scale-up protocol doc was that raw-feature PRDC in 50 dimensions concentrates distances and becomes noise-dominated. The embedding variant has 16 dimensions with more informative axes (learned from the data), which is where PRDC is known to behave best. The ordering is the same. So the 10× gap between ZI-QRF and ZI-MAF is a real quality gap, not a measurement artifact. |
| 26 | + |
| 27 | +2. **Precision rises in embedding space for all three methods.** The AE compresses noise: random synthetic variation that looked far from real records in 50-dim now falls near them in 16-dim. This improves precision but slightly reduces coverage because the metric's radius tightens. |
| 28 | + |
| 29 | +3. **ZI-QRF's edge narrows slightly.** 0.348 → 0.309 in raw → embed is a modest drop. ZI-QDNN held steady (0.219 → 0.222). ZI-MAF bumped up (0.025 → 0.038). So in the embedding space the gap compressed somewhat, but ZI-QRF is still 8× ZI-MAF (down from 14× in raw). |
| 30 | + |
| 31 | +4. **ZI-MAF is still near-collapsed.** Even in the generous embedding space, ZI-MAF coverage is 0.038 — roughly an order of magnitude below the other two. Hyperparameter tuning (see `docs/zi-maf-hyperparameter-search.md`) doesn't close this at the architectural level. |
| 32 | + |
| 33 | +## Interpretation |
| 34 | + |
| 35 | +The ZI-QRF / ZI-QDNN / ZI-MAF ranking is robust across: |
| 36 | + |
| 37 | +- **Scale**: small synthetic (10 k × 7) → 5 k × 50 real → 40 k × 50 real → 77 k × 50 real. |
| 38 | +- **PRDC sample cap**: uncapped (8 k × 32 k) and capped (15 k × 15 k). |
| 39 | +- **Feature space**: 50 raw features and 16 learned latent dimensions. |
| 40 | + |
| 41 | +That's four independent robustness checks. The production default for G1 cross-section synthesis is **ZI-QRF**. |
| 42 | + |
| 43 | +## One thing this does not settle |
| 44 | + |
| 45 | +Neither raw-50 nor embed-16 PRDC weighs rare cells more than bulk cells. The `sparse_coverage.csv` finding — sparse L0 selection drives rare-cell ratios to 0 — is a different failure mode that neither PRDC variant measures. That finding still drives the calibrator decision (microcalibrate as mainline, not sparse reweighting). Both findings hold independently. |
| 46 | + |
| 47 | +## Artifact |
| 48 | + |
| 49 | +`artifacts/embedding_prdc_compare.json` — full per-method raw and embed PRDC dicts. |
| 50 | + |
| 51 | +Reproduction: |
| 52 | + |
| 53 | +```bash |
| 54 | +uv run python scripts/embedding_prdc_compare.py --n-rows 40000 --output artifacts/embedding_prdc_compare.json |
| 55 | +``` |
| 56 | + |
| 57 | +~5 minutes on a 48 GB M3. |
0 commit comments