|
| 1 | +# Consolidated referee review and revision plan |
| 2 | + |
| 3 | +*Five subagent referee reviews ran in parallel on the evening of 2026-04-17 against the paper scaffold. This doc synthesizes their findings into an ordered revision plan.* |
| 4 | + |
| 5 | +## Reviewer verdicts |
| 6 | + |
| 7 | +| Reviewer | Verdict | Main issue | |
| 8 | +|---|---|---| |
| 9 | +| Citation | Minor revisions | Synthcity author mismatch; identity-preservation framing overstated vs Dekkers 2015 | |
| 10 | +| Methodology | Major revisions | Single-seed, non-converged calibration presented as final; correlated "robustness checks" | |
| 11 | +| Domain | Major revisions | 36 "target columns" are inputs not policy outputs; ecosystem under-represented | |
| 12 | +| Stylistic | Major revisions | 4 of 7 body sections are stubs; solo-authored "we"; documentation register | |
| 13 | +| Reproducibility | Major revisions | No code/data availability statement; 2 of 4 robustness checks used pre-snap data | |
| 14 | + |
| 15 | +Four of five reviewers recommend major revisions. The draft is not submittable in its current state but is recoverable within 2–3 weeks of focused work (see the budget estimate under Revision order). |
| 16 | + |
| 17 | +## Critical findings (blockers before submission) |
| 18 | + |
| 19 | +### B1. Two "independent robustness checks" used the pre-snap broken pipeline |
| 20 | + |
| 21 | +The reproducibility reviewer identified that `artifacts/embedding_prdc_compare.json` (Apr 17 08:03) and `artifacts/calibrate_on_synthesizer.json` (Apr 17 08:06) predate the snap fixes (harness-side at 12:06, upstream-core at 12:20). Both scripts call `method.fit` and `method.generate` directly without invoking `_snap_categorical_shared_cols`. The numbers they report are under the broken noise-injection regime. |
| 22 | + |
| 23 | +The paper's claim that "ordering is preserved under four independent robustness checks" technically still holds, since ZI-QRF beats ZI-MAF under the broken pipeline too. But the framing obscures that two of the four checks measure a pipeline the paper itself diagnoses as broken. |
| 24 | + |
| 25 | +**Action**: rerun `scripts/embedding_prdc_compare.py` and `scripts/calibrate_on_synthesizer.py` with either (a) the upstream `microplex` fix merged into the sibling clone or (b) the scripts rewritten to call `ScaleUpRunner.fit_and_generate` which applies `_snap_categorical_shared_cols`. Update artifacts. This is the first thing to do when resuming paper work. |
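
Once the reruns are done, a quick freshness check can confirm that no artifact still predates the snap fixes. The sketch below is a minimal illustration: the `stale_artifacts` helper and the `SNAP_FIX_CUTOFF` constant are hypothetical names, the cutoff assumes the later (upstream-core, 12:20) fix time in local time, and file modification time is only a proxy, since a fresh checkout can reset it.

```python
from datetime import datetime
from pathlib import Path

# Assumed cutoff: the later of the two snap fixes (upstream-core at 12:20
# on 2026-04-17), interpreted in local time.
SNAP_FIX_CUTOFF = datetime(2026, 4, 17, 12, 20)


def stale_artifacts(paths, cutoff=SNAP_FIX_CUTOFF):
    """Return the paths whose last-modified time is earlier than cutoff."""
    stale = []
    for path in map(Path, paths):
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if modified < cutoff:
            stale.append(path)
    return stale
```

Calling `stale_artifacts(["artifacts/embedding_prdc_compare.json", "artifacts/calibrate_on_synthesizer.json"])` from the repository root should return an empty list after a successful rerun.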
| 26 | + |
| 27 | +### B2. The 36 "target columns" are input variables, not policy outputs |
| 28 | + |
| 29 | +The domain reviewer's single most important finding: the paper uses `employment_income_last_year`, `snap_reported`, `ssi_reported`, etc. — CPS-reported amounts — as "targets." A tax-microsim reviewer expects "targets" to mean policy outputs: federal income tax liability, state income tax, computed EITC/CTC, SNAP benefits under program rules, SSI amounts. |
| 30 | + |
| 31 | +Two options: |
| 32 | + |
| 33 | +- **Rename**. Call them "conditioning income and benefit columns" or "target income components." Do this at minimum; the current language is misleading. |
| 34 | +- **Add downstream validation**. Run `policyengine-us` (and/or TAXSIM, Tax-Calculator, TPC — whichever the reviewer population cares about most) on microplex-us output data and report computed federal tax, EITC disbursed, CTC disbursed, SNAP/SSI/ACA PTC aggregates against external benchmarks (IRS SOI tables, USDA SNAP totals, SSA SSI totals, CBO SNAP outlays). This is the test a tax-microsim reviewer actually wants. |
| 35 | + |
| 36 | +Recommendation: do both. Rename immediately; add the downstream validation as a major new results subsection. |
| 37 | + |
| 38 | +### B3. Four of seven body sections are stubs |
| 39 | + |
| 40 | +Architecture (§3), Methods (§4), Discussion (§6), and Conclusion (§8) are either parenthetical placeholders or explicit TBD, as is the rare-cell subsection (§5.3). Not submittable in this state. |
| 41 | + |
| 42 | +**Action**: work through these in order. Methods first (reviewer can't evaluate anything else until they know what was done). Architecture second. Results-rare-cell third. Discussion and Conclusion last. |
| 43 | + |
| 44 | +### B4. No Code and Data Availability statement |
| 45 | + |
| 46 | +Standard requirement at every target venue. Must state data source (HuggingFace URL with pinned revision), code repository, software versions, Python version, OS tested, hardware, expected wall time, license. |
| 47 | + |
| 48 | +**Action**: add `## Code and Data Availability` section after Limitations. One paragraph. |
| 49 | + |
| 50 | +### B5. Conflicts of Interest disclosure missing |
| 51 | + |
| 52 | +Author founded PolicyEngine and previously led Enhanced CPS work (cited extensively in this paper). The `AFFILIATION.md` rule is followed in the byline and acknowledgments, but silence on the prior affiliation is a disclosure gap. Per domain reviewer: "Silence on the question will read worse than acknowledgement." |
| 53 | + |
| 54 | +**Action**: add explicit COI statement. Template: "The author founded PolicyEngine and previously led work on Enhanced CPS [@ghenis2024ecps]. The present work is conducted at Cosilico, an independent commercial entity, and is not a joint product with PolicyEngine. PolicyEngine's Enhanced CPS is cited as the incumbent public tool against which microplex-us is measured." |
| 55 | + |
| 56 | +## High-priority revisions (before review circulation) |
| 57 | + |
| 58 | +### H1. Convert first-person plural to first-person singular (or third-person) |
| 59 | + |
| 60 | +The solo-authored paper uses "we" throughout both documents (`index.qmd` and `literature-review.qmd`). Per the project's global style rule and the target venues' conventions, this should be "I" or a third-person recast. The stylistic reviewer identified ~20 instances needing judgment-based conversion (global find-and-replace won't work). |
| 61 | + |
| 62 | +### H2. Self-contain the Related Work section |
| 63 | + |
| 64 | +Line 56 of `index.qmd` says "A full literature review for this paper is maintained in `literature-review.qmd`." This is a documentation move, not an academic one. Self-contain §2 with 400–600 words of prose. Keep `literature-review.qmd` as supplementary material. |
| 65 | + |
| 66 | +### H3. Remove all documentation-register artifacts |
| 67 | + |
| 68 | +- `*(This section is being written against the spec-based-ecps-rewire branch...)*` — convert to outline-as-prose. |
| 69 | +- `[report low]` editorial marker at line ~100 — resolve. |
| 70 | +- `77,006 × 50 scale` — rewrite as "77,006 records across 50 columns." |
| 71 | +- "keeps every record alive" — "preserves all records" or "retains positive weight on every record." |
| 72 | +- "mainline" — "primary calibration mechanism." |
| 73 | +- Artifact paths referenced in body text — remove. |
| 74 | + |
| 75 | +### H4. Tables need captions, numbers, cross-reference labels |
| 76 | + |
| 77 | +All three tables are bare Markdown pipe-tables with no caption, no number, no Quarto `{#tbl-...}` label. Required for IJM / NTJ / JASA. |
| 78 | + |
| 79 | +### H5. Add at least one figure |
| 80 | + |
| 81 | +Pipeline schematic (source providers → donor blocks → chained QRF → calibration → L0 post-step) is the obvious first figure. Methods papers at the target tier with zero figures are unusual. |
| 82 | + |
| 83 | +### H6. Quantify or soften "widely-used upstream benchmark base class" |
| 84 | + |
| 85 | +Abstract claims the noise-injection defect "systematically biased earlier synthesizer comparisons." Evidence cited is one pre/post table on three methods using one base class. Either name the affected published benchmarks or soften to "introduced systematic bias into synthesizer comparisons using this base class." |
| 86 | + |
| 87 | +### H7. Citation form consistency |
| 88 | + |
| 89 | +Audit every `[@key]` vs `@key` for correct parenthetical vs textual intent. Pandoc renders them differently. |
| 90 | + |
| 91 | +## Medium-priority revisions (quality improvements) |
| 92 | + |
| 93 | +### M1. Uncertainty quantification |
| 94 | + |
| 95 | +Every headline table is a single-seed point estimate. Methodology reviewer correctly notes this is weak for a methods paper. ZI-QRF runs in 37 seconds — running 5-10 seeds is trivial compute. Report means with standard errors, or at least ordering-stability counts ("ordering preserved in 10/10 seeds"). |
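
A minimal sketch of the multi-seed summary, assuming per-seed metric values have already been collected into a dict keyed by method name. The `summarize_seeds` helper, the lower-is-better convention, and the numbers are illustrative assumptions, not output from the harness.

```python
import math
from statistics import mean, stdev


def summarize_seeds(scores, expected_order, lower_is_better=True):
    """Return per-method mean/SE and how many seeds reproduce expected_order.

    scores: dict mapping method name -> list of per-seed metric values
            (all lists the same length; index k corresponds to seed k).
    expected_order: method names from best to worst.
    """
    n_seeds = len(next(iter(scores.values())))
    if any(len(vals) != n_seeds for vals in scores.values()):
        raise ValueError("every method needs one score per seed")

    summary = {
        name: {
            "mean": mean(vals),
            # Standard error of the mean across seeds.
            "se": stdev(vals) / math.sqrt(n_seeds),
        }
        for name, vals in scores.items()
    }

    # Count the seeds in which the per-seed ranking matches expected_order.
    preserved = 0
    for k in range(n_seeds):
        ranked = sorted(scores, key=lambda m: scores[m][k],
                        reverse=not lower_is_better)
        if ranked == list(expected_order):
            preserved += 1
    return summary, preserved, n_seeds


# Placeholder values for illustration only; not measured results.
example_scores = {
    "ZI-QRF": [0.10, 0.11, 0.09, 0.10, 0.12],
    "ZI-QDNN": [0.15, 0.14, 0.16, 0.15, 0.13],
    "ZI-MAF": [0.20, 0.22, 0.19, 0.21, 0.20],
}
summary, preserved, n = summarize_seeds(
    example_scores, ["ZI-QRF", "ZI-QDNN", "ZI-MAF"]
)
print(f"ordering preserved in {preserved}/{n} seeds")
```

Reporting the count alongside the means keeps the ordering claim honest even when the standard errors of adjacent methods overlap.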
| 96 | + |
| 97 | +### M2. Rerun with calibration converged |
| 98 | + |
| 99 | +All three entries in `artifacts/calibrate_on_synthesizer.json` have `"calibration_converged": false` at 200 epochs. The docs acknowledge this; the paper does not. Rerun at 1000-2000 epochs or report the epoch budget and frame as "fraction of pre-cal gap closed" rather than absolute post-cal error. |
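
One way to express the "fraction of pre-cal gap closed" framing, sketched under the assumption that the gap is the pre-calibration error measured against a zero-error target. The function name and the numbers are hypothetical; the artifact's actual field names are not assumed here.

```python
def fraction_of_gap_closed(pre_cal_error: float, post_cal_error: float) -> float:
    """Share of the pre-calibration error removed by calibration.

    1.0 means the gap is fully closed, 0.0 means no improvement,
    and a negative value means calibration made the error worse.
    """
    if pre_cal_error <= 0:
        raise ValueError("pre-calibration error must be positive")
    return (pre_cal_error - post_cal_error) / pre_cal_error


# Placeholder numbers for illustration only; not values from the artifact.
closed = fraction_of_gap_closed(pre_cal_error=0.5, post_cal_error=0.125)
print(f"{closed:.0%} of the pre-calibration gap closed at a 200-epoch budget")
```

Because the ratio is relative to each method's own starting error, it should always be reported together with the epoch budget and the pre-calibration error it was computed from.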
| 100 | + |
| 101 | +### M3. Formal definition of identity preservation |
| 102 | + |
| 103 | +Currently asserted as an architectural property but never defined. Add Definition 1 in §3: *A weight-adjustment procedure $\phi: w \to w'$ is identity-preserving if $\forall i: w_i' > 0$ and $\phi$ does not drop records.* Either cite that `microcalibrate`'s gradient step satisfies this, or prove it. |
| 104 | + |
| 105 | +### M4. Embedding-PRDC circularity |
| 106 | + |
| 107 | +Autoencoder is fit on holdout only. Potential bias toward methods that match holdout idiosyncrasies. Re-run with AE fit on train (or an independent third partition). Report both. |
| 108 | + |
| 109 | +### M5. Soften "novel to PolicyEngine" Forbes claim |
| 110 | + |
| 111 | +Domain reviewer identified the SCF + Forbes precedent: Bricker-Henriques-Hansen-Moore (2016), Vermeulen (2018), Kennickell (2019). The tax-microsim integration remains novel; the broader pattern has precedent. Rewrite (in first-person singular, per H1): "While top-wealth augmentation from Forbes-style lists is established practice in distributional national accounts [cites], its integration into a production tax-microsim pipeline is, to my knowledge, first done in policyengine-us-data." |
| 112 | + |
| 113 | +### M6. Cross-sectional motivation for identity preservation |
| 114 | + |
| 115 | +Domain reviewer: "Identity preservation also matters cross-sectionally for interpretability, subgroup analysis, confidentiality auditing, reproducibility and provenance." Add two paragraphs in Discussion making the cross-section case alongside the longitudinal case. |
| 116 | + |
| 117 | +### M7. ZI-QRF substrate circularity |
| 118 | + |
| 119 | +ECPS itself is QRF-constructed. ZI-QRF's win may be partly method-substrate match. Either add a non-ECPS robustness check (raw CPS ASEC or SCF) or explicitly note the circularity as a limitation. |
| 120 | + |
| 121 | +### M8. Target-set expansion |
| 122 | + |
| 123 | +Add Medicaid/CHIP, ACA PTC, mortgage interest, charitable contributions, medical expenses, property tax. Rerun at the expanded target set. |
| 124 | + |
| 125 | +### M9. Snap heuristic cardinality guard |
| 126 | + |
| 127 | +Stylistic and methodology reviewers flag that `_snap_categorical_shared_cols` fires on any integer-valued column, which could accidentally snap continuous-but-rounded columns (currency stored in dollars). Add cardinality threshold (e.g., snap only when `n_unique <= 50`). |
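
A minimal sketch of such a guard, written as a standalone predicate rather than as a patch to `_snap_categorical_shared_cols`, whose real signature is not assumed here. The `should_snap` name, the default threshold of 50, and the example column names are all hypothetical.

```python
import numpy as np
import pandas as pd


def should_snap(column: pd.Series, max_unique: int = 50) -> bool:
    """Decide whether a shared column looks categorical enough to snap.

    A column qualifies only if it is numeric, every non-missing value is
    integer-valued, and it has at most max_unique distinct values.
    """
    values = column.dropna()
    if values.empty or not pd.api.types.is_numeric_dtype(values):
        return False
    as_float = values.to_numpy(dtype=float)
    integer_valued = bool(np.all(as_float == np.round(as_float)))
    return integer_valued and values.nunique() <= max_unique


# Illustrative data: a low-cardinality code and a currency column in whole dollars.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "household_size": rng.integers(1, 8, size=1_000),
    "wage_dollars": rng.integers(0, 200_000, size=1_000),
})
for name in df.columns:
    print(name, should_snap(df[name]))
```

With the guard in place, the whole-dollar currency column is left continuous even though every value is an integer, while the small code column is still snapped.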
| 128 | + |
| 129 | +### M10. Decouple PRDC seed from split seed |
| 130 | + |
| 131 | +Currently both are `self.config.seed`. Use `seed + k` for the PRDC subsample. Average PRDC over 5+ subsample seeds per split to separate metric noise from split noise. |
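
The decoupling can be sketched as follows, assuming the metric is exposed as a callable over two NumPy arrays. `prdc_subsample_seeds`, `average_metric_over_subsamples`, and the toy metric are hypothetical stand-ins, not the harness's PRDC code.

```python
import numpy as np


def prdc_subsample_seeds(split_seed: int, n_subsamples: int = 5) -> list[int]:
    """Derive subsample seeds as split_seed + k, with k starting at 1."""
    return [split_seed + k for k in range(1, n_subsamples + 1)]


def average_metric_over_subsamples(metric_fn, real, synth, split_seed,
                                   n_subsamples=5, subsample_size=1000):
    """Average metric_fn over several subsamples drawn with derived seeds.

    Returns the mean and sample standard deviation across subsample seeds,
    which estimates metric noise for a fixed train/holdout split.
    """
    values = []
    for seed in prdc_subsample_seeds(split_seed, n_subsamples):
        rng = np.random.default_rng(seed)
        real_idx = rng.choice(len(real), size=min(subsample_size, len(real)),
                              replace=False)
        synth_idx = rng.choice(len(synth), size=min(subsample_size, len(synth)),
                               replace=False)
        values.append(metric_fn(real[real_idx], synth[synth_idx]))
    return float(np.mean(values)), float(np.std(values, ddof=1))


# Toy stand-in for PRDC: absolute difference in overall means.
def toy_metric(real_sample, synth_sample):
    return abs(real_sample.mean() - synth_sample.mean())


data_rng = np.random.default_rng(123)
real = data_rng.normal(size=(5000, 3))
synth = data_rng.normal(loc=0.1, size=(5000, 3))
avg, sd = average_metric_over_subsamples(toy_metric, real, synth, split_seed=42)
print(f"metric mean {avg:.4f}, sd across subsample seeds {sd:.4f}")
```

If the split seeds are themselves consecutive integers, `seed + k` can coincide with a neighbouring split's seed; `numpy.random.SeedSequence(split_seed).spawn(n)` is one alternative that avoids the overlap.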
| 132 | + |
| 133 | +## Low-priority revisions (cosmetic) |
| 134 | + |
| 135 | +### L1. Fix citation errors |
| 136 | + |
| 137 | +- Synthcity: author list should be Qian, Davis, van der Schaar for the NeurIPS 2023 D&B paper (not Cebere). Citation reviewer flagged as MAJOR but fix is trivial. |
| 138 | +- Add TabPFGen (Ma et al., arXiv 2406.05216, 2024) — referenced in lit review but not cited. |
| 139 | +- Add CTAB-GAN+ (Zhao et al. 2023, Frontiers in Big Data). |
| 140 | +- Add Auten-Splinter (2024) as DINA counterweight to PSZ 2018. |
| 141 | +- Add Meyer-Mok-Sullivan on CPS benefit under-reporting. |
| 142 | +- Add Czajka-Hirabayashi-Moffitt-Scholz (1992) for statistical matching lineage. |
| 143 | +- Add Ruggles (2025 PNAS) as engagement point. |
| 144 | +- Remove `zhang2017privbayes` (unused) or cite. |
| 145 | + |
| 146 | +### L2. URL / DOI completeness |
| 147 | + |
| 148 | +Add URLs/DOIs for: patki2016sdv (IEEE DOI 10.1109/DSAA.2016.49), xu2019modeling (NeurIPS proceedings), naeem2020prdc (PMLR), kotelnikov2023tabddpm (PMLR), borisov2023great (OpenReview), and others listed by the citation reviewer. |
| 149 | + |
| 150 | +### L3. Bibliography cleanup |
| 151 | + |
| 152 | +- `solatorio2023realtabformer` should be `@misc` not `@article` with `journal = {arXiv preprint}`. |
| 153 | +- `dementen2014liam2` needs `{de Menten}, Gaetan` brace protection. |
| 154 | +- Standardize URL-only vs DOI-only policy (document the rule once). |
| 155 | + |
| 156 | +### L4. Table formatting |
| 157 | + |
| 158 | +- Pick one bolding rule (all best-per-column or none). |
| 159 | +- Spell out abbreviated headers ("Fit (s)" → "Fit time (s)") or footnote them. |
| 160 | +- Expand "Pre-cal" / "Post-cal" to "Before calibration" / "After calibration." |
| 161 | + |
| 162 | +### L5. Abstract cleanup |
| 163 | + |
| 164 | +- Expand ZI-QRF / ZI-QDNN / ZI-MAF / PRDC on first use. |
| 165 | +- Replace "keeps every record alive," "mainline," "77,006 × 50 scale" per H3. |
| 166 | +- Either support or drop "widely-used" (H6). |
| 167 | + |
| 168 | +### L6. Remove unused references from `.bib` |
| 169 | + |
| 170 | +`ruggles2025synth` (cited in lit review but not index.qmd; consider citing in index.qmd per domain reviewer M1), `zhang2017privbayes`. |
| 171 | + |
| 172 | +### L7. Cite each data product on first reference |
| 173 | + |
| 174 | +CPS ASEC, ACS, PUF, SCF, SIPP need primary-source citations on first use. |
| 175 | + |
| 176 | +### L8. Repository hygiene |
| 177 | + |
| 178 | +- Add `LICENSE` file at repo root. |
| 179 | +- Add regression test for ordering (e.g., `test_stage1_10k_ordering`). |
| 180 | +- Move paper tables to Quarto chunks that read from `../artifacts/*.json` to auto-update. |
| 181 | + |
| 182 | +## Revision order |
| 183 | + |
| 184 | +Roughly the sequence to work through: |
| 185 | + |
| 186 | +1. **Rerun pre-snap artifacts** (B1). Half-hour compute. |
| 187 | +2. **Rename target columns + add downstream tax-output validation** (B2). Several days; the downstream run is non-trivial. |
| 188 | +3. **Draft §4 Methods** (B3). One day. |
| 189 | +4. **Draft §3 Architecture** (B3). One to two days. |
| 190 | +5. **Add Code and Data Availability statement + COI** (B4, B5). One hour. |
| 191 | +6. **Convert voice to first-person singular** (H1). Several hours, judgment-by-judgment. |
| 192 | +7. **Self-contain Related Work** (H2). Half-day. |
| 193 | +8. **Strip documentation register** (H3). Hours. |
| 194 | +9. **Table captions, numbering, labels** (H4). Hour. |
| 195 | +10. **Pipeline diagram** (H5). Hour (one TikZ / mermaid / svg figure). |
| 196 | +11. **Soften the "widely-used" claim** (H6). Minutes. |
| 197 | +12. **Citation form audit** (H7). Hour. |
| 198 | +13. **Draft §5.3 rare-cell + §6 Discussion + §8 Conclusion** (B3 cont.). Two days. |
| 199 | +14. **Medium-priority revisions** (M1–M10). Several days. |
| 200 | +15. **Low-priority / cosmetic** (L1–L8). Final pass. |
| 201 | + |
| 202 | +Total budget estimate: 2–3 weeks to a submittable draft, assuming the downstream tax-output validation is the bottleneck. |
| 203 | + |
| 204 | +## What the reviewers got wrong |
| 205 | + |
| 206 | +Two minor issues where the reviews overstated the gap: |
| 207 | + |
| 208 | +- Reproducibility reviewer said `zi_maf_tuning.json` is missing; it is present at `artifacts/zi_maf_tuning.json` (verified). The reviewer's grep missed it. |
| 209 | +- Citation reviewer flagged the identity-preservation framing as overstating the gap vs Dekkers (2015). Dekkers does discuss identity under static vs dynamic ageing; what the paper claims is novel is the cross-sectional calibration-layer framing, which Dekkers does NOT discuss. But the reviewer's point stands that the literature review should cite Dekkers and clarify which layer the claim refers to. |
| 210 | + |
| 211 | +## Reviews kept for reference |
| 212 | + |
| 213 | +Full reviewer outputs are preserved under the `a*` agent IDs recorded by the subagent framework. If a rebuttal is needed later, those sessions can be resumed via `SendMessage`. |