
Commit fa959d3

MaxGhenis and claude committed
Add consolidated referee review and revision plan
Five subagent reviewers (citation, methodology, domain, stylistic, reproducibility) ran in parallel on the paper scaffold. Four of five returned Major Revisions; one returned Minor. Consensus verdict: the draft has good bones but is not submittable in current state.

Five BLOCKER findings that must land before any review circulation:

B1. Two of four "independent robustness checks" were generated before the snap fix (embedding_prdc_compare.json Apr 17 08:03 and calibrate_on_synthesizer.json Apr 17 08:06 both predate the snap-fix commits at 12:06 / 12:20). Must rerun the scripts through ScaleUpRunner.fit_and_generate or with the upstream fix applied.

B2. The 36 "target columns" are CPS-reported inputs, not policy outputs. Tax-microsim reviewers expect targets = federal tax, EITC, CTC, etc. Fix: rename at minimum; ideally add a downstream tax-aggregate validation running policyengine-us (or Tax-Calculator / TAXSIM) on microplex-us output and compare against IRS SOI / USDA / SSA / CBO administrative totals.

B3. Four body sections (Architecture, Methods, rare-cell, Discussion, Conclusion) are stubs. Submission-blocking.

B4. No Code and Data Availability statement. Required at every target venue; HuggingFace URL with pinned revision + license + software versions + hardware.

B5. No Conflicts of Interest disclosure. Author founded PolicyEngine and led Enhanced CPS work cited extensively. Silence reads worse than acknowledgement given the field size.

High-priority (H1-H7): first-person conversion, self-contain Related Work, strip documentation register, table captions, at least one figure, "widely-used" claim, citation form audit.

Medium-priority (M1-M10): uncertainty quantification, calibration convergence, formal identity-preservation definition, embedding-PRDC circularity, Forbes claim softening, cross-sectional identity-preservation motivation, substrate circularity, target-set expansion, snap cardinality guard, PRDC/split seed decoupling.

Low-priority (L1-L8): Synthcity citation error, TabPFGen / CTAB-GAN+ / Auten-Splinter / Meyer-Mok-Sullivan / Czajka additions, URL/DOI completeness, bibliography cleanup, table formatting, abstract cleanup, unused-ref removal, data-product citations, LICENSE file, regression test for ordering, Quarto-chunk-ified tables.

Revision order and time budget: ~2-3 weeks to submittable draft, with the downstream tax-output validation as the main bottleneck. Detailed sequence in the doc.

Noted two places where reviewers over-called:
- zi_maf_tuning.json exists (reproducibility reviewer missed it)
- Identity-preservation framing is defensible if scoped to the cross-section calibration layer (citation reviewer cited Dekkers 2015, which is about ageing not calibration)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent ab26608 commit fa959d3

1 file changed

Lines changed: 213 additions & 0 deletions


paper/REVIEW-RESPONSE.md

Lines changed: 213 additions & 0 deletions
@@ -0,0 +1,213 @@
1+
# Consolidated referee review and revision plan
2+
3+
*Five subagent referee reviews ran in parallel on the paper scaffold on the evening of 2026-04-17. This doc synthesizes their findings into an ordered revision plan.*
4+
5+
## Reviewer verdicts
6+
7+
| Reviewer | Verdict | Main issue |
8+
|---|---|---|
9+
| Citation | Minor revisions | Synthcity author mismatch; identity-preservation framing overstated vs Dekkers 2015 |
10+
| Methodology | Major revisions | Single-seed results; non-converged calibration presented as final; correlated "robustness checks" |
11+
| Domain | Major revisions | 36 "target columns" are inputs not policy outputs; ecosystem under-represented |
12+
| Stylistic | Major revisions | 4 of 7 body sections are stubs; solo-authored "we"; documentation register |
13+
| Reproducibility | Major revisions | No code/data availability statement; 2 of 4 robustness checks used pre-snap data |
14+
15+
Four of five reviewers recommend Major Revisions. The draft is not submittable in its current state but is recoverable within 2–3 weeks of focused work.
16+
17+
## Critical findings (blockers before submission)
18+
19+
### B1. Two "independent robustness checks" used the pre-snap broken pipeline
20+
21+
The reproducibility reviewer identified that `artifacts/embedding_prdc_compare.json` (Apr 17 08:03) and `artifacts/calibrate_on_synthesizer.json` (Apr 17 08:06) predate the snap fixes (harness-side at 12:06, upstream-core at 12:20). Both scripts call `method.fit` and `method.generate` directly without invoking `_snap_categorical_shared_cols`, so the numbers they report were produced under the broken noise-injection regime.
22+
23+
The paper's claim that "ordering is preserved under four independent robustness checks" technically still holds, since ZI-QRF beats ZI-MAF under the broken pipeline too. But the framing obscures that two of the four checks are measurements of a system that we ourselves diagnosed as broken.
24+
25+
**Action**: rerun `scripts/embedding_prdc_compare.py` and `scripts/calibrate_on_synthesizer.py` with either (a) the upstream `microplex` fix merged into the sibling clone or (b) the scripts rewritten to call `ScaleUpRunner.fit_and_generate` which applies `_snap_categorical_shared_cols`. Update artifacts. This is the first thing to do when resuming paper work.
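
A quick way to catch this class of problem in future is to compare artifact modification times against the time the fix landed. The sketch below is a hypothetical helper (`stale_artifacts` is not part of the repo) and assumes file mtimes are trustworthy, which may not hold after a fresh `git clone` or checkout.

```python
from datetime import datetime
from pathlib import Path


def stale_artifacts(paths, fix_time):
    """Return the paths whose modification time is earlier than fix_time.

    fix_time is a naive datetime in local time, matching what
    datetime.fromtimestamp returns for the file mtime.
    """
    stale = []
    for path in paths:
        modified = datetime.fromtimestamp(Path(path).stat().st_mtime)
        if modified < fix_time:
            stale.append(str(path))
    return stale


if __name__ == "__main__":
    # Cutoff taken from the upstream-core fix time quoted above (12:20 on 2026-04-17).
    cutoff = datetime(2026, 4, 17, 12, 20)
    artifact_dir = Path("artifacts")
    if artifact_dir.is_dir():
        for name in stale_artifacts(sorted(artifact_dir.glob("*.json")), cutoff):
            print("predates snap fix:", name)
```

Recording the generating commit hash inside each JSON artifact would be a sturdier check than mtimes, but that requires changing the scripts themselves.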
26+
27+
### B2. The 36 "target columns" are input variables, not policy outputs
28+
29+
The domain reviewer's single most important finding: the paper uses `employment_income_last_year`, `snap_reported`, `ssi_reported`, etc. — CPS-reported amounts — as "targets." A tax-microsim reviewer expects "targets" to mean policy outputs: federal income tax liability, state income tax, computed EITC/CTC, SNAP benefits under program rules, SSI amounts.
30+
31+
Two options:
32+
33+
- **Rename**. Call them "conditioning income and benefit columns" or "target income components." Do this at minimum; the current language is misleading.
34+
- **Add downstream validation**. Run `policyengine-us` (and/or TAXSIM, Tax-Calculator, TPC — whichever the reviewer population cares about most) on microplex-us output data and report computed federal tax, EITC disbursed, CTC disbursed, SNAP/SSI/ACA PTC aggregates against external benchmarks (IRS SOI tables, USDA SNAP totals, SSA SSI totals, CBO SNAP outlays). This is the test a tax-microsim reviewer actually wants.
35+
36+
Recommendation: do both. Rename immediately; add the downstream validation as a major new results subsection.
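
The reporting side of the downstream validation can be sketched independently of which tax model produces the aggregates. The helper below is hypothetical and makes no assumption about the `policyengine-us` API; it only takes two dictionaries of program totals. The numbers in the usage example are placeholders, not real administrative totals.

```python
def relative_errors(computed, benchmarks):
    """Map each benchmarked program to (computed - benchmark) / benchmark.

    Programs with a benchmark but no computed total are skipped.
    """
    errors = {}
    for name, benchmark in benchmarks.items():
        if name not in computed:
            continue  # no simulated total available for this benchmark
        errors[name] = (computed[name] - benchmark) / benchmark
    return errors


if __name__ == "__main__":
    # Placeholder values for illustration only.
    computed = {"eitc": 110.0, "snap": 90.0}
    benchmarks = {"eitc": 100.0, "snap": 100.0, "ssi": 50.0}
    for program, err in relative_errors(computed, benchmarks).items():
        print(f"{program}: {err:+.1%}")
```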
37+
38+
### B3. Four of seven body sections are stubs
39+
40+
Four sections, Architecture (§3), Methods (§4), Discussion (§6), and Conclusion (§8), plus the rare-cell subsection (§5.3), are either parenthetical placeholders or explicitly marked TBD. The draft is not submittable in this state.
41+
42+
**Action**: work through these in order. Methods first (a reviewer cannot evaluate anything else until they know what was done). Architecture second. Results-rare-cell third. Discussion and Conclusion last.
43+
44+
### B4. No Code and Data Availability statement
45+
46+
Standard requirement at every target venue. Must state data source (HuggingFace URL with pinned revision), code repository, software versions, Python version, OS tested, hardware, expected wall time, license.
47+
48+
**Action**: add `## Code and Data Availability` section after Limitations. One paragraph.
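
Most of the environment facts the statement needs can be collected programmatically rather than typed by hand. A minimal sketch, assuming the package names passed in are the ones the paper depends on (the list in the usage example is a guess, not the project's actual dependency list):

```python
import platform
from importlib import metadata


def environment_summary(packages):
    """Collect interpreter, OS, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Package names here are assumptions for illustration.
    summary = environment_summary(["microplex", "policyengine-us", "numpy"])
    for key, value in summary.items():
        print(key, value)
```

Hardware details, wall time, the pinned HuggingFace revision, and the license still have to be stated by hand.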
49+
50+
### B5. Conflicts of Interest disclosure missing
51+
52+
Author founded PolicyEngine and previously led Enhanced CPS work (cited extensively in this paper). The `AFFILIATION.md` rule is followed in the byline and acknowledgments, but silence on the prior affiliation is a disclosure gap. Per domain reviewer: "Silence on the question will read worse than acknowledgement."
53+
54+
**Action**: add explicit COI statement. Template: "The author founded PolicyEngine and previously led work on Enhanced CPS [@ghenis2024ecps]. The present work is conducted at Cosilico, an independent commercial entity, and is not a joint product with PolicyEngine. PolicyEngine's Enhanced CPS is cited as the incumbent public tool against which microplex-us is measured."
55+
56+
## High-priority revisions (before review circulation)
57+
58+
### H1. Convert first-person plural to first-person singular (or third-person)
59+
60+
Solo-authored paper uses "we" throughout both documents. Per the project's global style rule and the target venues' conventions, this should be "I" or third-person recast. The stylistic reviewer identified ~20 instances needing judgment-based conversion (global find-and-replace won't work).
61+
62+
### H2. Self-contain the Related Work section
63+
64+
Line 56 of `index.qmd` says "A full literature review for this paper is maintained in `literature-review.qmd`." This is a documentation move, not an academic one. Self-contain §2 with 400–600 words of prose. Keep `literature-review.qmd` as supplementary material.
65+
66+
### H3. Remove all documentation-register artifacts
67+
68+
- `*(This section is being written against the spec-based-ecps-rewire branch...)*` — convert to outline-as-prose.
69+
- `[report low]` editorial marker at line ~100 — resolve.
70+
- `77,006 × 50 scale` — rewrite as "77,006 records across 50 columns."
71+
- "keeps every record alive" — "preserves all records" or "retains positive weight on every record."
72+
- "mainline" — "primary calibration mechanism."
73+
- Artifact paths referenced in body text — remove.
74+
75+
### H4. Tables need captions, numbers, cross-reference labels
76+
77+
All three tables are bare Markdown pipe-tables with no caption, no number, no Quarto `{#tbl-...}` label. Required for IJM / NTJ / JASA.
78+
79+
### H5. Add at least one figure
80+
81+
Pipeline schematic (source providers → donor blocks → chained QRF → calibration → L0 post-step) is the obvious first figure. Methods papers at the target tier with zero figures are unusual.
82+
83+
### H6. Quantify or soften "widely-used upstream benchmark base class"
84+
85+
Abstract claims the noise-injection defect "systematically biased earlier synthesizer comparisons." Evidence cited is one pre/post table on three methods using one base class. Either name the affected published benchmarks or soften to "introduced systematic bias into synthesizer comparisons using this base class."
86+
87+
### H7. Citation form consistency
88+
89+
Audit every `[@key]` vs `@key` for correct parenthetical vs textual intent. Pandoc renders them differently.
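
A rough first pass at this audit can be scripted. The sketch below uses a simplified regex, not Pandoc's full citation grammar, and will also pick up Quarto cross-references such as `@tbl-...`; treat its output as a list of places to inspect, not a verdict.

```python
import re

# Simplified citation-key pattern; the lookbehind skips e-mail addresses.
CITE_KEY = re.compile(r"(?<![\w.])@([A-Za-z][\w:-]*)")
BRACKETED = re.compile(r"\[[^\[\]]*\]")


def classify_citations(text):
    """Split citation keys into bracketed (parenthetical) and bare (textual)."""
    parenthetical = []
    for span in BRACKETED.findall(text):
        parenthetical.extend(CITE_KEY.findall(span))
    # Blank out bracketed spans so only bare @key citations remain.
    outside = BRACKETED.sub(" ", text)
    textual = CITE_KEY.findall(outside)
    return {"parenthetical": parenthetical, "textual": textual}


if __name__ == "__main__":
    sample = "As @naeem2020prdc show, this holds [@patki2016sdv; @xu2019modeling]."
    print(classify_citations(sample))
```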
90+
91+
## Medium-priority revisions (quality improvements)
92+
93+
### M1. Uncertainty quantification
94+
95+
Every headline table is a single-seed point estimate. Methodology reviewer correctly notes this is weak for a methods paper. ZI-QRF runs in 37 seconds — running 5-10 seeds is trivial compute. Report means with standard errors, or at least ordering-stability counts ("ordering preserved in 10/10 seeds").
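
Both reporting options can come out of one small summary function. This is a sketch assuming each method has one score per seed, all lists have equal length, and higher is better by default; method names and scores in the usage example are placeholders.

```python
import math
import statistics


def summarize_seeds(scores_by_method, higher_is_better=True):
    """Return per-method (mean, standard error), the ranking by mean,
    and the number of seeds whose ranking matches the ranking by mean."""
    n_seeds = len(next(iter(scores_by_method.values())))
    summary = {}
    for method, scores in scores_by_method.items():
        mean = statistics.fmean(scores)
        se = statistics.stdev(scores) / math.sqrt(len(scores))
        summary[method] = (mean, se)

    def ranking(score_of):
        return sorted(scores_by_method, key=score_of, reverse=higher_is_better)

    reference = ranking(lambda m: summary[m][0])
    stable = sum(
        ranking(lambda m, k=k: scores_by_method[m][k]) == reference
        for k in range(n_seeds)
    )
    return summary, reference, stable


if __name__ == "__main__":
    # Placeholder scores for three hypothetical methods over three seeds.
    scores = {"A": [0.9, 0.8, 0.85], "B": [0.5, 0.6, 0.55], "C": [0.7, 0.55, 0.6]}
    summary, order, stable = summarize_seeds(scores)
    print(order, f"ordering preserved in {stable}/3 seeds")
```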
96+
97+
### M2. Rerun with calibration converged
98+
99+
All three entries in `artifacts/calibrate_on_synthesizer.json` have `"calibration_converged": false` at 200 epochs. The docs acknowledge this; the paper does not. Either rerun at 1000–2000 epochs, or report the epoch budget and frame the result as "fraction of pre-cal gap closed" rather than absolute post-cal error.
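
One reading of "fraction of pre-cal gap closed" is the share of the pre-calibration error removed by calibration. The sketch below assumes that reading and a scalar, positive error measure; the paper may settle on a different definition.

```python
def fraction_gap_closed(pre_error, post_error):
    """Share of the pre-calibration error removed by calibration.

    1.0 means the gap is fully closed, 0.0 means no improvement,
    and negative values mean calibration made the error worse.
    """
    if pre_error <= 0:
        raise ValueError("pre-calibration error must be positive")
    return (pre_error - post_error) / pre_error
```

Because the ratio is relative to each method's own starting point, it stays comparable across methods even when the optimizer has not converged, as long as every method gets the same epoch budget.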
100+
101+
### M3. Formal definition of identity preservation
102+
103+
Identity preservation is currently asserted as an architectural property but never defined. Add Definition 1 in §3: *A weight-adjustment procedure $\phi: w \mapsto w'$ is identity-preserving if $w'$ is indexed by the same records as $w$ (that is, $\phi$ does not drop records) and $w_i' > 0$ for all $i$.* Either cite that `microcalibrate`'s gradient step satisfies this, or prove it.
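
Alongside the formal definition, an empirical check on calibration output is cheap to write. This is a hypothetical helper, assuming records carry stable identifiers before and after calibration; it tests the property on one run and is not a proof.

```python
def is_identity_preserving(ids_before, ids_after, weights_after):
    """True if every input record id survives with a strictly positive weight."""
    if len(ids_after) != len(weights_after):
        raise ValueError("ids_after and weights_after must align")
    after = dict(zip(ids_after, weights_after))
    # A missing id is treated as weight 0.0, i.e. a dropped record.
    return all(after.get(record_id, 0.0) > 0 for record_id in ids_before)
```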
104+
105+
### M4. Embedding-PRDC circularity
106+
107+
The autoencoder is fit on the holdout set only, which may bias the metric toward methods that match holdout idiosyncrasies. Rerun with the autoencoder fit on the training set (or an independent third partition) and report both.
108+
109+
### M5. Soften "novel to PolicyEngine" Forbes claim
110+
111+
Domain reviewer identified the SCF + Forbes precedent: Bricker-Henriques-Hansen-Moore (2016), Vermeulen (2018), Kennickell (2019). The tax-microsim integration remains novel; the broader pattern has precedent. Rewrite: "While top-wealth augmentation from Forbes-style lists is established practice in distributional national accounts [cites], its integration into a production tax-microsim pipeline is to our knowledge first done in policyengine-us-data."
112+
113+
### M6. Cross-sectional motivation for identity preservation
114+
115+
Domain reviewer: "Identity preservation also matters cross-sectionally for interpretability, subgroup analysis, confidentiality auditing, reproducibility and provenance." Add two paragraphs in Discussion making the cross-section case alongside the longitudinal case.
116+
117+
### M7. ZI-QRF substrate circularity
118+
119+
ECPS itself is QRF-constructed. ZI-QRF's win may be partly method-substrate match. Either add a non-ECPS robustness check (raw CPS ASEC or SCF) or explicitly note the circularity as a limitation.
120+
121+
### M8. Target-set expansion
122+
123+
Add Medicaid/CHIP, ACA PTC, mortgage interest, charitable contributions, medical expenses, property tax. Rerun with the expanded target set.
124+
125+
### M9. Snap heuristic cardinality guard
126+
127+
Stylistic and methodology reviewers flag that `_snap_categorical_shared_cols` fires on any integer-valued column, which could accidentally snap continuous-but-rounded columns (currency stored in dollars). Add cardinality threshold (e.g., snap only when `n_unique <= 50`).
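
The guard amounts to one extra condition on the snap decision. The sketch below is a standalone illustration with a hypothetical `should_snap` helper; it does not reproduce the actual logic of `_snap_categorical_shared_cols`, and the default of 50 is only the example threshold suggested above.

```python
def should_snap(values, max_unique=50):
    """Treat a column as categorical only if it is integer-valued
    AND has at most max_unique distinct values."""
    integer_valued = all(float(v).is_integer() for v in values)
    return integer_valued and len(set(values)) <= max_unique


if __name__ == "__main__":
    household_size = [1, 2, 2, 4, 3, 1]
    income_dollars = list(range(0, 200_000, 137))  # rounded currency, many values
    print(should_snap(household_size))  # low cardinality: snap
    print(should_snap(income_dollars))  # high cardinality: leave continuous
```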
128+
129+
### M10. Decouple PRDC seed from split seed
130+
131+
Currently both are `self.config.seed`. Use `seed + k` for the PRDC subsample. Average PRDC over 5+ subsample seeds per split to separate metric noise from split noise.
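
A sketch of the decoupling, with `metric_fn` standing in for the PRDC computation and all names hypothetical. Note that plain `seed + k` offsets can collide with neighbouring split seeds (the `k = 1` subsample for split seed 42 shares a seed with split 43), so a larger stride or a hashed seed may be preferable.

```python
import random
import statistics


def averaged_metric(metric_fn, data, split_seed, n_subsamples=5, subsample_size=100):
    """Average metric_fn over several subsamples whose seeds are offset
    from split_seed, so subsample noise is not tied to the split itself."""
    values = []
    for k in range(1, n_subsamples + 1):
        rng = random.Random(split_seed + k)  # k >= 1, never split_seed itself
        subsample = rng.sample(data, min(subsample_size, len(data)))
        values.append(metric_fn(subsample))
    return statistics.fmean(values), values
```

The spread of the returned per-subsample values gives a direct estimate of metric noise at a fixed split.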
132+
133+
## Low-priority revisions (cosmetic)
134+
135+
### L1. Fix citation errors
136+
137+
- Synthcity: author list should be Qian, Davis, van der Schaar for the NeurIPS 2023 D&B paper (not Cebere). Citation reviewer flagged as MAJOR but fix is trivial.
138+
- Add TabPFGen (Ma et al., arXiv 2406.05216, 2024) — referenced in lit review but not cited.
139+
- Add CTAB-GAN+ (Zhao et al. 2023, Frontiers in Big Data).
140+
- Add Auten-Splinter (2024) as DINA counterweight to PSZ 2018.
141+
- Add Meyer-Mok-Sullivan on CPS benefit under-reporting.
142+
- Add Czajka-Hirabayashi-Moffitt-Scholz (1992) for statistical matching lineage.
143+
- Add Ruggles (2025 PNAS) as engagement point.
144+
- Remove `zhang2017privbayes` (unused) or cite.
145+
146+
### L2. URL / DOI completeness
147+
148+
Add URLs/DOIs for: patki2016sdv (IEEE DOI 10.1109/DSAA.2016.49), xu2019modeling (NeurIPS proceedings), naeem2020prdc (PMLR), kotelnikov2023tabddpm (PMLR), borisov2023great (OpenReview), and others listed by the citation reviewer.
149+
150+
### L3. Bibliography cleanup
151+
152+
- `solatorio2023realtabformer` should be `@misc` not `@article` with `journal = {arXiv preprint}`.
153+
- `dementen2014liam2` needs `{de Menten}, Gaetan` brace protection.
154+
- Standardize URL-only vs DOI-only policy (document the rule once).
155+
156+
### L4. Table formatting
157+
158+
- Pick one bolding rule (all best-per-column or none).
159+
- Spell out abbreviated headers ("Fit (s)" → "Fit time (s)") or footnote them.
160+
- Expand "Pre-cal" / "Post-cal" to "Before calibration" / "After calibration."
161+
162+
### L5. Abstract cleanup
163+
164+
- Expand ZI-QRF / ZI-QDNN / ZI-MAF / PRDC on first use.
165+
- Replace "keeps every record alive," "mainline," "77,006 × 50 scale" per H3.
166+
- Either support or drop "widely-used" (H6).
167+
168+
### L6. Remove unused references from `.bib`
169+
170+
Candidates: `zhang2017privbayes`, and `ruggles2025synth` (cited in the literature review but not in `index.qmd`; consider citing it in `index.qmd` instead, per the domain reviewer's own finding M1).
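
Unused entries can be listed mechanically by comparing keys defined in the `.bib` file against keys cited in the source documents. A minimal sketch with simplified regexes (it ignores `nocite` metadata and unusual BibTeX formatting):

```python
import re

BIB_ENTRY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")
CITE_KEY = re.compile(r"(?<![\w.])@([A-Za-z][\w:-]*)")


def unused_bib_keys(bib_text, document_texts):
    """Return bib keys, in file order, that no document cites."""
    defined = BIB_ENTRY.findall(bib_text)
    cited = set()
    for text in document_texts:
        cited.update(CITE_KEY.findall(text))
    return [key for key in defined if key not in cited]
```

Passing only the text of `index.qmd` versus both `.qmd` files distinguishes "unused in the paper" from "unused anywhere".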
171+
172+
### L7. Cite each data product on first reference
173+
174+
CPS ASEC, ACS, PUF, SCF, SIPP need primary-source citations on first use.
175+
176+
### L8. Repository hygiene
177+
178+
- Add `LICENSE` file at repo root.
179+
- Add regression test for ordering (e.g., `test_stage1_10k_ordering`).
180+
- Move paper tables to Quarto chunks that read from `../artifacts/*.json` to auto-update.
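
The ordering regression test suggested above could take roughly this shape. The scores are hard-coded placeholders, not results; a real test would obtain them from a small pipeline run or a checked-in artifact, and the metric direction is an assumption.

```python
def assert_ordering(scores, expected_best_to_worst, higher_is_better=True):
    """Fail if ranking methods by score does not match the expected order."""
    ranked = sorted(scores, key=scores.get, reverse=higher_is_better)
    assert ranked == list(expected_best_to_worst), f"ordering changed: {ranked}"


def test_stage1_10k_ordering():
    # Placeholder values for illustration only; replace with scores
    # loaded from an actual run before relying on this test.
    scores = {"zi_qrf": 0.9, "zi_maf": 0.5}
    assert_ordering(scores, ["zi_qrf", "zi_maf"])
```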
181+
182+
## Revision order
183+
184+
Roughly the sequence to work through:
185+
186+
1. **Rerun pre-snap artifacts** (B1). Half-hour compute.
187+
2. **Rename target columns + add downstream tax-output validation** (B2). Several days; the downstream run is non-trivial.
188+
3. **Draft §4 Methods** (B3). One day.
188+
4. **Draft §3 Architecture** (B3). One to two days.
190+
5. **Add Code and Data Availability statement + COI** (B4, B5). One hour.
191+
6. **Convert voice to first-person singular** (H1). Several hours, judgment-by-judgment.
192+
7. **Self-contain Related Work** (H2). Half-day.
193+
8. **Strip documentation register** (H3). Hours.
194+
9. **Table captions, numbering, labels** (H4). Hour.
195+
10. **Pipeline diagram** (H5). Hour (one TikZ / mermaid / svg figure).
196+
11. **Soften the "widely-used" claim** (H6). Minutes.
197+
12. **Citation form audit** (H7). Hour.
198+
13. **Draft §5.3 rare-cell + §6 Discussion + §8 Conclusion** (B3 cont.). Two days.
199+
14. **Medium-priority revisions** (M1–M10). Several days.
200+
15. **Low-priority / cosmetic** (L1–L8). Final pass.
201+
202+
Total budget estimate: 2–3 weeks to a submittable draft, assuming the downstream tax-output validation is the bottleneck.
203+
204+
## What the reviewers got wrong
205+
206+
Two minor issues where the reviews overstated the gap:
207+
208+
- Reproducibility reviewer said `zi_maf_tuning.json` is missing; it is present at `artifacts/zi_maf_tuning.json` (verified). The reviewer's grep missed it.
209+
- Citation reviewer flagged the identity-preservation framing as overstating the gap vs Dekkers (2015). Dekkers does discuss identity under static vs dynamic ageing; what the paper claims is novel is the cross-sectional calibration-layer framing, which Dekkers does NOT discuss. But the reviewer's point stands that the literature review should cite Dekkers and clarify which layer the claim refers to.
210+
211+
## Reviews kept for reference
212+
213+
Full reviewer outputs are preserved under the `a*` agent IDs noted by the subagent framework. If a rebuttal is needed later, those sessions can be resumed via `SendMessage`.
