Add v6 post-mortem and calibrator decision for spec-based-ecps-rewire

MaxGhenis · claude · MaxGhenis · commit 699ea280be4c · 2026-04-16T23:10:12.000-04:00
Two docs that anchor the rewire direction with specific evidence from
today's run:

docs/v6-postmortem.md
  - Timeline of v6 from launch to OOM kill
  - Stage-marker localization of the killer:
    calibrate_policyengine_tables with backend=entropy on
    1.5M households × ~1.2k constraints on a 48 GB workstation
  - rusage comparison to v4 (nearly identical signature: 22 GB max RSS,
    293 GB peak phys_footprint)
  - What v6 ruled IN as working at scale (donor integration, tables build)
  - What v6 ruled OUT as the killer (synthesis, support enforcement,
    tables build)
  - How this becomes evidence for the rewire rather than against it

docs/calibrator-decision.md
  - Mainline: microcalibrate (gradient-descent chi-squared, identity
    preserving, production-proven by PE-US-data, aligns with SS-model
    longitudinal plan)
  - Optional sparse deployment step after mainline: microplex.reweighting
    (L0 / HardConcrete, for web-app-sized subsamples only)
  - Retire Calibrator(backend=entropy) at scales above ~200k records
  - Revises migration step 2 of core-wiring-audit accordingly

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/docs/calibrator-decision.md b/docs/calibrator-decision.md
@@ -0,0 +1,113 @@
+# Calibrator decision
+
+*Decided: 2026-04-16. Applies to `spec-based-ecps-rewire` and every microplex-us pipeline that follows.*
+
+## Context
+
+Three calibration systems exist in the microplex / PolicyEngine ecosystem:
+
+| System | Location | Method | Scale notes |
+|---|---|---|---|
+| `microplex.calibration.Calibrator` | microplex core, ~2011 lines | Classical IPF / chi-square / entropy balancing, with `LinearConstraint` for explicit constraint rows | Entropy backend just killed v6 at 1.5M households |
+| `microplex.reweighting.Reweighter` | microplex core, 506 lines | Sparse L0/L1/L2 with scipy and cvxpy backends | Unused in production; designed for geographic-hierarchy reweighting; enforces sparsity by construction |
+| `microcalibrate` | PolicyEngine external package | Gradient-descent chi-squared with soft penalties and optional feasibility filtering | Used by PE-US-data for its main calibration; has production track record |
+
+v6 died inside `Calibrator.fit_transform(..., backend="entropy")` on a 1.5M-household frame. The underlying problem is not the Calibrator code — it is that entropy calibration instantiates dense-ish structures at `(n_households × n_constraints)` scale, and with ~1,255 constraints that exceeds what a 48 GB machine can hold once scratch memory is included.
+
+## Decision
+
+**Mainline calibrator for all production runs: `microcalibrate` (gradient-descent chi-squared).**
+
+**Optional sparse deployment selector applied *after* mainline calibration: `microplex.reweighting.Reweighter` with L0/HardConcrete backend**, used only when a deployment artifact (web app, embedded tool) needs a ~50k-record subsample of a national build.
+
+**Retire for production use: `microplex.calibration.Calibrator` with `backend="entropy"` at scales above ~200k records.** The classical Calibrator's IPF and chi-square backends stay available for small-scale work, diagnostics, and test harnesses where their explicit constraint semantics are convenient.
+
+## Why `microcalibrate` and not core `Calibrator`
+
+1. **Identity preservation.** `microcalibrate` adjusts per-record weights via gradient descent without materializing dense constraint Jacobians, so every input record survives to the output with a new weight. The rearchitecture's longitudinal extension (SS-model) requires stable entity identity across years, which makes identity preservation non-negotiable.
+2. **Scalability at the target scale.** `microcalibrate` is the calibration stack PE-US-data actually uses for production enhanced-CPS builds at full scale. v6's death at 1.5M is direct evidence the entropy path doesn't scale; `microcalibrate`'s gradient-descent pattern does.
+3. **Soft-penalty feasibility handling.** The 2026-03-30 review flagged that v2's calibration dropped 65% of constraints as infeasible and then scored against the full target set, which systematically inflated the reported loss. `microcalibrate` supports soft penalty weights on targets the solver cannot feasibly hit, so infeasible targets are down-weighted in a principled way rather than dropped outright.
+4. **Alignment with the longitudinal plan.** The SS-model methodology doc explicitly names `microcalibrate` as the calibration tool for the longitudinal extension. Picking it now aligns the cross-section pipeline with the planned longitudinal path.
+
+## Why `Reweighter` stays as a post-mainline optional stage
+
+1. **L0 sparsity serves deployment, not accuracy.** The right use of L0 is to produce a small subsample of a well-calibrated national dataset for constrained deployment targets (web app UI, mobile, static hosting). It is the wrong tool for "calibrate to hit targets" because it sacrifices exact match for sparsity.
+2. **Apply after, not instead of, the mainline.** The mainline run produces ~1.5M records with adjusted weights. If a deployment needs 50k records, apply `Reweighter` with appropriate L0 λ as a second pass. The mainline artifact remains the ground-truth output for analysis.
+3. **HardConcrete is the preferred sparse backend.** The `SparseCalibrator` + `HardConcreteCalibrator` analysis from the `codex/core-semantic-guards` paper work showed that HardConcrete dominates the sparse-calibration Pareto frontier, so when the sparse step does run, it should use HardConcrete. Core already ships this with multi-seed evaluation.
+
+## Why `Calibrator` is retired at scale
+
+1. v6 shows that `Calibrator(backend="entropy")` OOMs at 1.5M × 1.2k-constraint scale on a 48 GB workstation. v4 died with a nearly identical memory signature at a similar scale.
+2. No architectural fix is cheap. To make entropy work at that scale we would have to rewrite the backend to use sparse constraint matrices and streaming gradient, which is effectively reimplementing `microcalibrate`.
+3. `Calibrator` stays available and useful for small-scale test harnesses. It is still the right tool for `n < ~200k`, for unit tests of the calibration layer, and for explicit-constraint diagnostics (the `LinearConstraint` API is clean).
+
+## Implementation implication
+
+The rewired pipeline in `spec-based-ecps-rewire` will import `microcalibrate` as a required (not optional) dependency. This is a net-new dependency for microplex-us. The audit entry that proposed "retire `microcalibrate` if `Calibrator` covers the scalability requirement" is overruled by v6's evidence.
+
+## Calibration architecture, in order
+
+```
+raw seed data  ─►  donor integration  ─►  seed_ready
+                                          │
+                                          ▼
+                                  synthesize (seed backend = copy)
+                                          │
+                                          ▼
+                                  support enforcement
+                                          │
+                                          ▼
+                                  policyengine entity tables (households, persons, tax_units, ...)
+                                          │
+                                          ▼
+                      ┌──────────────────┴──────────────────┐
+                      │  MAINLINE (every run)               │
+                      │  microcalibrate.Calibrator          │
+                      │    - chi-squared distance           │
+                      │    - gradient descent               │
+                      │    - soft penalty for infeasibles   │
+                      │    - preserves all record IDs       │
+                      │                                     │
+                      │  Hierarchical in later phases:      │
+                      │    national → state → stratum       │
+                      └───────────────────┬─────────────────┘
+                                          │
+                                          ▼
+                                  calibrated artifact (full scale)
+                                          │
+                                          ▼
+                      ┌───────────────────┴─────────────────┐
+                      │  OPTIONAL SPARSE DEPLOYMENT STEP    │
+                      │  microplex.reweighting.Reweighter   │
+                      │    - L0 / HardConcrete              │
+                      │    - deployment-scale subsample     │
+                      │  Only when a deployment artifact    │
+                      │  needs to be small.                 │
+                      └─────────────────────────────────────┘
+```
+
+## Hierarchical calibration — separate decision, deferred
+
+This decision only picks the calibration *backend*. Hierarchical geographic calibration (national → state → stratum, with spatial smoothness priors, optional Fay-Herriot small-area composites) is a structure layered on top of `microcalibrate` and will be decided in its own doc at the start of the local-area gate (G2). Cross-section gate (G1) calibrates at national scale first.
+
+## Does this close out the three-way overlap?
+
+Yes, operationally:
+
+- Production runs: `microcalibrate`.
+- Deployment subsampling: `Reweighter`.
+- Tests and small-scale diagnostics: `Calibrator`.
+- No single-pipeline run crosses all three. Each tool has a distinct and non-overlapping job.
+
+## What this unblocks
+
+- Migration step 2 of `docs/core-wiring-audit.md`: "Adopt `Calibrator` end-to-end" is revised to "Adopt `microcalibrate` end-to-end as the production calibrator." That becomes the first real code change in `spec-based-ecps-rewire`.
+- The rewired cross-section pipeline can start being written against a concrete calibration contract.
+
+## Revisit conditions
+
+Revisit this decision if any of the following becomes true:
+
+1. A benchmark shows `microcalibrate` produces materially worse loss than a refactored `Calibrator` on representative constraint matrices. (Unlikely — PE uses it successfully.)
+2. Licensing / availability of `microcalibrate` becomes a blocker for external consumers of microplex-us. (Mitigate by forking the needed subset into microplex core.)
+3. The SS-model longitudinal extension requires a calibration primitive that `microcalibrate` does not provide (e.g., explicit spatial smoothness, per-year temporal regularization). Add the primitive at microplex level rather than swapping backends.
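
Reviewer note on `docs/calibrator-decision.md`: the identity-preservation and scalability arguments rest on the general shape of gradient-descent reweighting, which a toy sketch can make concrete. The block below is an illustration only, not `microcalibrate`'s API or implementation. The function names, the relative squared-error loss (standing in for the chi-squared distance), the log-scale parameterization, and the four-record example are all assumptions made for this sketch. Each record keeps only its non-zero constraint entries, so no dense records × constraints matrix is ever built, and every record leaves with a positive weight.

```python
import math


def estimate_totals(records, weights, n_targets):
    """Weighted totals per constraint, using sparse per-record rows."""
    totals = [0.0] * n_targets
    for weight, row in zip(weights, records):
        for j, value in row:
            totals[j] += weight * value
    return totals


def calibrate_chi2(records, base_weights, targets, lr=0.5, steps=5000):
    """Toy gradient-descent reweighting (hypothetical, for illustration).

    records[i] is a sparse row: a list of (constraint_index, value) pairs.
    Weights are base_i * exp(u_i), so they stay positive and no record
    is ever dropped.
    """
    log_scale = [0.0] * len(records)
    for _ in range(steps):
        weights = [b * math.exp(u) for b, u in zip(base_weights, log_scale)]
        estimates = estimate_totals(records, weights, len(targets))
        # Derivative of sum_j ((est_j - target_j) / target_j)^2 w.r.t. est_j.
        dloss = [2.0 * (e - t) / (t * t) for e, t in zip(estimates, targets)]
        for i, (weight, row) in enumerate(zip(weights, records)):
            # Chain rule through weight_i = base_i * exp(u_i).
            grad = weight * sum(dloss[j] * value for j, value in row)
            log_scale[i] -= lr * grad
    return [b * math.exp(u) for b, u in zip(base_weights, log_scale)]


# Four records, two constraints: constraint 0 counts every record,
# constraint 1 counts only the first two.
records = [
    [(0, 1.0), (1, 1.0)],
    [(0, 1.0), (1, 1.0)],
    [(0, 1.0)],
    [(0, 1.0)],
]
targets = [10.0, 4.0]
weights = calibrate_chi2(records, [1.0] * 4, targets)
estimates = estimate_totals(records, weights, len(targets))
```

Per-step working memory here is one weight vector plus one value per constraint, which is the property the decision relies on. A production tool would add batching, penalty weights for infeasible targets, and a real optimizer.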
diff --git a/docs/v6-postmortem.md b/docs/v6-postmortem.md
@@ -0,0 +1,77 @@
+# v6 post-mortem — 2026-04-16
+
+Record of the `broader-donors-puf-native-challenger-v6` run (launched 2026-04-16 10:20:10 ET, died 22:56:05 ET).
+
+## Outcome
+
+**RUN_EXIT status=1** after 12h 36m of wall time. Killed by the kernel during entropy calibration. No artifact directory created; no final dataset persisted.
+
+## Timeline of the post-donor window
+
+The post-donor stage instrumentation (commit `960ac2f`) was the single highest-value diagnostic change of the session. It let us localize the OOM to a specific named stage for the first time.
+
+| Time (ET) | Stage marker |
+|---|---|
+| 10:20:10 | RUN_START |
+| ~19:29 (9h 9m in) | last donor block complete (`scf_2022/social_security_pension_income`) |
+| 21:04:03 | `seed ready` → `targets start`/`complete` → `synthesis variables ready` → `synthesis start`/`complete` → `support enforcement start`/`complete` → `policyengine tables start` (all in one burst; synthesis backend = seed-copy so the burst is dominated by the strip+cap pass between donor integration and tables) |
+| ~22:25 | `policyengine tables complete` [households=1,505,108, persons=3,373,378] |
+| ~22:25 | `policyengine calibration start [backend=entropy]` |
+| 22:56:05 | RUN_EXIT status=1, kernel signal (macOS `time -l` reported "signal: Invalid argument" on the wrapper) |
+
+## Memory signature
+
+From macOS `time -l` rusage at exit:
+
+| Metric | v6 | v4 (previous run) |
+|---|---|---|
+| Wall time | 45,355 s (12h 36m) | 39,476 s (10h 58m) |
+| Max RSS | 22.0 GB | 20.5 GB |
+| Peak phys_footprint | 293 GB | 287 GB |
+| Instructions retired | 614 T | 612 T |
+| Involuntary context switches | 317 K | 264 K |
+
+v6's signature is nearly identical to v4's — same killer, same point.
+
+## Diagnosis
+
+**`calibrate_policyengine_tables` with `backend=entropy` on 1.5M households is the OOM killer.**
+
+Proximate cause: a 48 GB machine cannot hold the working set the entropy solver needs for that scale. Peak phys_footprint of 293 GB on 48 GB RAM implies heavy compression and swap pressure; eventually the kernel kills the process.
+
+Likely underlying structural cost (not measured, but fits the profile):
+
+- Entropy calibration materializes a dense Jacobian-like matrix roughly `(n_households × n_constraints)` in float64.
+- With 1,505,108 households and ~1,255 constraints post-feasibility-filter (from the 2026-03-30 review), that's 15 GB for a single copy. Multiple working copies (gradient, Hessian approximation, line-search scratch) easily exceed RAM.
+- `_evaluate_policyengine_target_fit_context` then runs a full PolicyEngine simulation on the calibrated frame, which adds its own memory cost on top.
+
+## What survived
+
+v6 demonstrated that the **tables-build phase works at scale**: `build_policyengine_entity_tables` successfully produced a 1.5M-household × 3.4M-person entity bundle. This was an open question after v4. The stage isn't free (roughly 1h 25m at 180–210% CPU, RSS oscillating 0.2–16%), but it doesn't OOM.
+
+The donor integration also ran clean. All 129 donor blocks across CPS ASEC, IRS SOI PUF, SIPP tips, SIPP assets, and SCF completed without failure. The tax-unit entity-bundle construction took ~89 min (one-time cost per run). Multi-source donor imputation is not the bottleneck.
+
+## What v6 ruled out as the killer
+
+The initial v4 diagnosis hypothesized the silent post-donor window might be in synthesis, support enforcement, or tables-build. v6's instrumentation showed those all complete instantly or within ~1.5 hours. The killer is specifically **entropy calibration**, not an earlier stage.
+
+## What this means for the architecture direction
+
+v6 is an evidence point *for* the `spec-based-ecps-rewire` direction rather than against it:
+
+1. **Entropy calibration on a 1.5M-household monolithic solve is a dead end on a 48 GB machine.** The rearchitecture's hierarchical / identity-preserving calibration pattern (national → state → stratum, `microcalibrate`-style chi-squared) avoids the dense-matrix blow-up by chunking over strata.
+2. **Scaffold scale is the real lever.** The 3.4M-row ACS scaffold drives both tables-build size and calibration-matrix size. CPS-core at ~430k persons cuts this at the source.
+3. **The instrumentation pattern is reusable.** Keeping named stage markers at every pipeline boundary in the new pipeline will make any future OOM localizable in a single run rather than requiring multiple exploratory runs.
+
+## What v6 does NOT tell us
+
+- Whether the imputation quality would have beaten `enhanced_cps_2024` on PE-native broad loss had it finished. No parity artifact was produced.
+- Whether the `pe_plus_puf_native_challenger` condition selection is an improvement. Moot now that the pipeline direction is changing.
+- The `Calibrator`'s actual numerical behavior on 1.5M households. The failure came before any numerical calibration work: the process died while setting up the constraint matrices.
+
+## Status of v6 artifacts
+
+- Log file: `artifacts/live_pe_us_data_rebuild_checkpoint_20260414_pe_plus_puf_native_challenger_broader/broader-donors-puf-native-challenger-v6.log` (~2,224 lines)
+- No output artifact directory (build never completed persistence step)
+- tmux session: cleaned up
+- No action required on artifacts — they stay on disk as part of the experiment trail.
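
Reviewer note on `docs/v6-postmortem.md`: the "15 GB for a single copy" figure in the Diagnosis section can be reproduced with a few lines of arithmetic. The sketch below assumes a fully dense float64 matrix (8 bytes per value), as the post-mortem does. The helper name is hypothetical, and how many copies the solver actually holds was not measured, so the copy counts are purely illustrative.

```python
def dense_matrix_gb(n_rows, n_cols, bytes_per_value=8):
    """Size in decimal GB of a dense n_rows x n_cols matrix (float64 by default)."""
    return n_rows * n_cols * bytes_per_value / 1e9


N_HOUSEHOLDS = 1_505_108  # from the `policyengine tables complete` marker
N_CONSTRAINTS = 1_255     # approximate post-feasibility-filter count
RAM_GB = 48

single_copy_gb = dense_matrix_gb(N_HOUSEHOLDS, N_CONSTRAINTS)
# Full copies that fit in RAM if nothing else were resident.
max_copies = int(RAM_GB // single_copy_gb)
# The same matrix at the ~200k-record retirement threshold.
threshold_gb = dense_matrix_gb(200_000, N_CONSTRAINTS)

for copies in range(1, 5):
    total_gb = copies * single_copy_gb
    verdict = "fits" if total_gb <= RAM_GB else "exceeds RAM"
    print(f"{copies} dense copies: {total_gb:.1f} GB ({verdict})")
print(f"at 200k records: {threshold_gb:.2f} GB per copy")
```

Under these assumptions a fourth working copy alone overruns 48 GB before counting the entity tables or the Python runtime, while the same structure at 200k records costs about 2 GB per copy.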