|
| 1 | +# Calibrator decision |
| 2 | + |
| 3 | +*Decided: 2026-04-16. Applies to `spec-based-ecps-rewire` and every microplex-us pipeline that follows.* |
| 4 | + |
| 5 | +## Context |
| 6 | + |
| 7 | +Three calibration systems exist in the microplex / PolicyEngine ecosystem: |
| 8 | + |
| 9 | +| System | Location | Method | Scale notes | |
| 10 | +|---|---|---|---| |
| 11 | +| `microplex.calibration.Calibrator` | microplex core, ~2011 lines | Classical IPF / chi-square / entropy balancing, with `LinearConstraint` for explicit constraint rows | Entropy backend ran out of memory on the v6 run at 1.5M households | |
| 12 | +| `microplex.reweighting.Reweighter` | microplex core, 506 lines | Sparse L0/L1/L2 with scipy and cvxpy backends | Unused in production; designed for geographic-hierarchy reweighting; enforces sparsity by construction | |
| 13 | +| `microcalibrate` | PolicyEngine external package | Gradient-descent chi-squared with soft penalties and optional feasibility filtering | Used by PE-US-data for its main calibration; has production track record | |
| 14 | + |
| 15 | +v6 ran out of memory inside `Calibrator.fit_transform(..., backend="entropy")` on a 1.5M-household frame. The underlying problem is not the `Calibrator` code itself: entropy calibration instantiates near-dense structures at `(n_households × n_constraints)` scale, and with ~1,255 constraints that exceeds what a 48 GB machine can hold once scratch memory is included. |
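
A back-of-envelope check of the memory claim above, as a minimal sketch. It assumes a fully dense float64 matrix of shape `(n_households, n_constraints)`, which is an upper-bound simplification of the structures the entropy backend builds. The helper name `dense_matrix_gb` is illustrative and not part of any microplex API.

```python
def dense_matrix_gb(n_records: int, n_constraints: int, bytes_per_value: int = 8) -> float:
    """Size in GB (10**9 bytes) of one dense n_records x n_constraints array."""
    return n_records * n_constraints * bytes_per_value / 1e9


# Figures from the v6 run: 1.5M households, ~1,255 constraints, 48 GB machine.
one_copy_gb = dense_matrix_gb(1_500_000, 1_255)

# How many full dense copies fit before the machine is out of memory.
# Assumption: float64 values and no other memory use at all.
copies_that_fit = int(48 // one_copy_gb)

print(f"one dense float64 copy: {one_copy_gb:.2f} GB")
print(f"full copies that fit in 48 GB: {copies_that_fit}")
```

Under these assumptions one copy is about 15 GB, so three copies nearly fill 48 GB before counting the frame itself or solver scratch space.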
| 16 | + |
| 17 | +## Decision |
| 18 | + |
| 19 | +**Mainline calibrator for all production runs: `microcalibrate` (gradient-descent chi-squared).** |
| 20 | + |
| 21 | +**Optional sparse deployment selector applied *after* mainline calibration: `microplex.reweighting.Reweighter` with L0/HardConcrete backend**, used only when a deployment artifact (web app, embedded tool) needs a ~50k-record subsample of a national build. |
| 22 | + |
| 23 | +**Retire for production use: `microplex.calibration.Calibrator` with `backend="entropy"` at scales above ~200k records.** The classical Calibrator's IPF and chi-square backends stay available for small-scale work, diagnostics, and test harnesses where their explicit constraint semantics are convenient. |
| 24 | + |
| 25 | +## Why `microcalibrate` and not core `Calibrator` |
| 26 | + |
| 27 | +1. **Identity preservation.** `microcalibrate` adjusts per-record weights via gradient descent without materializing dense constraint Jacobians. Every input record survives to the output with a new weight. The rearchitecture's longitudinal extension (SS-model) requires stable entity identity across years, so identity preservation is non-negotiable. |
| 28 | +2. **Scalability at the target scale.** `microcalibrate` is the calibration stack PE-US-data actually uses for production enhanced-CPS builds at full scale. v6's death at 1.5M is direct evidence the entropy path doesn't scale; `microcalibrate`'s gradient-descent pattern does. |
| 29 | +3. **Soft-penalty feasibility handling.** The 2026-03-30 review flagged that v2's calibration dropped 65% of constraints as infeasible and then scored against the full target set, which systematically inflated the reported loss. `microcalibrate` supports soft penalty weights on targets the solver cannot feasibly hit, so infeasible targets are down-weighted in a principled way instead of being dropped outright. |
| 30 | +4. **Alignment with the longitudinal path.** The SS-model methodology doc explicitly names `microcalibrate` as the calibration tool for the longitudinal extension. Picking it now aligns the cross-section pipeline with the planned longitudinal path. |
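
The combination of points 1 and 3 can be illustrated with a toy NumPy sketch. This is not `microcalibrate`'s API or implementation: the function `calibrate_soft`, the log-weight parameterization, the synthetic data, and the step size are all assumptions chosen for illustration. It shows gradient descent on a penalty-weighted relative squared error where every record keeps a positive weight and a conflicting target is down-weighted instead of dropped.

```python
import numpy as np


def calibrate_soft(X, targets, w0, penalties, lr=20.0, steps=2000):
    """Gradient descent on sum_k penalties[k] * rel_err[k]**2.

    X has shape (n_records, n_targets). Weights are w0 * exp(u), so they
    stay positive and no record is ever removed. The default lr is tuned
    to this toy problem only.
    """
    u = np.zeros(len(w0))
    for _ in range(steps):
        w = w0 * np.exp(u)
        rel_err = (X.T @ w - targets) / targets
        # d loss / d w_i = sum_k 2 * p_k * rel_err_k * X[i, k] / targets_k
        grad_w = X @ (2.0 * penalties * rel_err / targets)
        u -= lr * grad_w * w  # chain rule: dw/du = w
    return w0 * np.exp(u)


# Synthetic frame (hypothetical data): 200 households with a uniform income.
rng = np.random.default_rng(0)
n = 200
income = rng.uniform(20_000, 100_000, n)

# Columns: household count, total income, and a second household count
# whose target (3,000) contradicts the first (2,200), so both cannot be hit.
X = np.column_stack([np.ones(n), income, np.ones(n)])
w0 = np.full(n, 10.0)
targets = np.array([2_200.0, 1.05 * (w0 @ income), 3_000.0])

# The conflicting target gets a small penalty instead of being dropped.
penalties = np.array([1.0, 1.0, 0.01])

weights = calibrate_soft(X, targets, w0, penalties)
rel_err = (X.T @ weights - targets) / targets
print("relative errors:", np.round(rel_err, 4))
```

In this sketch the two consistent targets end up close to their values while the conflicting one absorbs most of the residual, and the output has exactly as many weights as the input has records.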
| 31 | + |
| 32 | +## Why `Reweighter` stays as a post-mainline optional stage |
| 33 | + |
| 34 | +1. **L0 sparsity serves deployment, not accuracy.** The right use of L0 is to produce a small subsample of a well-calibrated national dataset for constrained deployment targets (web app UI, mobile, static hosting). It is the wrong tool for "calibrate to hit targets" because it sacrifices exact match for sparsity. |
| 35 | +2. **Apply after, not instead of, the mainline.** The mainline run produces ~1.5M records with adjusted weights. If a deployment needs 50k records, apply `Reweighter` with appropriate L0 λ as a second pass. The mainline artifact remains the ground-truth output for analysis. |
| 36 | +3. **HardConcrete is the preferred sparse backend.** The `SparseCalibrator` + `HardConcreteCalibrator` analysis in the `codex/core-semantic-guards` paper work showed that HardConcrete dominates the sparse-calibration Pareto frontier, so when the sparse step does run, it should use HardConcrete. Core already ships this with multi-seed evaluation. |
| 37 | + |
| 38 | +## Why `Calibrator` is retired at scale |
| 39 | + |
| 40 | +1. v6 shows that `Calibrator(backend="entropy")` OOMs at 1.5M × 1.2k-constraint scale on a 48 GB workstation. v4 showed the same failure at 1.5M records and a similar constraint count. |
| 41 | +2. No architectural fix is cheap. To make entropy work at that scale we would have to rewrite the backend to use sparse constraint matrices and streaming gradient, which is effectively reimplementing `microcalibrate`. |
| 42 | +3. `Calibrator` stays available and useful for small-scale test harnesses. It is still the right tool for `n < ~200k`, for unit tests of the calibration layer, and for explicit-constraint diagnostics (the `LinearConstraint` API is clean). |
| 43 | + |
| 44 | +## Implementation implication |
| 45 | + |
| 46 | +The rewired pipeline in `spec-based-ecps-rewire` will import `microcalibrate` as a required (not optional) dependency. This is a net-new dependency for microplex-us. The audit entry that proposed "retire `microcalibrate` if `Calibrator` covers the scalability requirement" is overruled by v6's evidence. |
| 47 | + |
| 48 | +## Calibration architecture, in order |
| 49 | + |
| 50 | +``` |
| 51 | +raw seed data ─► donor integration ─► seed_ready |
| 52 | + │ |
| 53 | + ▼ |
| 54 | + synthesize (seed backend = copy) |
| 55 | + │ |
| 56 | + ▼ |
| 57 | + support enforcement |
| 58 | + │ |
| 59 | + ▼ |
| 60 | + policyengine entity tables (households, persons, tax_units, ...) |
| 61 | + │ |
| 62 | + ▼ |
| 63 | + ┌──────────────────┴──────────────────┐ |
| 64 | + │ MAINLINE (every run) │ |
| 65 | + │ microcalibrate.Calibrator │ |
| 66 | + │ - chi-squared distance │ |
| 67 | + │ - gradient descent │ |
| 68 | + │ - soft penalty for infeasibles │ |
| 69 | + │ - preserves all record IDs │ |
| 70 | + │ │ |
| 71 | + │ Hierarchical in later phases: │ |
| 72 | + │ national → state → stratum │ |
| 73 | + └───────────────────┬─────────────────┘ |
| 74 | + │ |
| 75 | + ▼ |
| 76 | + calibrated artifact (full scale) |
| 77 | + │ |
| 78 | + ▼ |
| 79 | + ┌───────────────────┴─────────────────┐ |
| 80 | + │ OPTIONAL SPARSE DEPLOYMENT STEP │ |
| 81 | + │ microplex.reweighting.Reweighter │ |
| 82 | + │ - L0 / HardConcrete │ |
| 83 | + │ - deployment-scale subsample │ |
| 84 | + │ Only when a deployment artifact │ |
| 85 | + │ needs to be small. │ |
| 86 | + └─────────────────────────────────────┘ |
| 87 | +``` |
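
The ordering in the diagram can be sketched as a small orchestration function. All names here (`CalibrationResult`, `run_calibration`, `keep_top_k`) are hypothetical, and the top-k selector is only a stand-in for the L0 / HardConcrete step, not an implementation of it. The point is the control flow: the mainline stage always runs, and the sparse stage consumes its output without replacing it.

```python
from dataclasses import dataclass
from typing import Callable, Optional

import numpy as np

Stage = Callable[[np.ndarray], np.ndarray]


@dataclass
class CalibrationResult:
    mainline_weights: np.ndarray                     # full-scale artifact, always produced
    deployment_weights: Optional[np.ndarray] = None  # sparse artifact, only on request


def run_calibration(initial_weights: np.ndarray,
                    mainline: Stage,
                    sparse_selector: Optional[Stage] = None) -> CalibrationResult:
    """Run the mainline stage, then the optional sparse stage on its output."""
    mainline_weights = mainline(initial_weights)
    deployment = None
    if sparse_selector is not None:
        # Work on a copy so the mainline artifact stays the reference output.
        deployment = sparse_selector(mainline_weights.copy())
    return CalibrationResult(mainline_weights, deployment)


def keep_top_k(k: int) -> Stage:
    """Stand-in selector: keep the k largest weights, rescale to the same total."""
    def select(w: np.ndarray) -> np.ndarray:
        sparse = np.zeros_like(w)
        top = np.argsort(w)[-k:]
        sparse[top] = w[top]
        return sparse * (w.sum() / sparse.sum())
    return select


# Toy usage: a fake mainline that scales weights by 1.5, then a 2-record subsample.
result = run_calibration(np.array([1.0, 2.0, 3.0, 4.0]),
                         mainline=lambda w: w * 1.5,
                         sparse_selector=keep_top_k(2))
full_only = run_calibration(np.array([1.0, 2.0, 3.0, 4.0]), mainline=lambda w: w * 1.5)
```

Keeping both arrays on the result object mirrors the rule that the mainline artifact remains the ground-truth output even when a deployment subsample is produced.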
| 88 | + |
| 89 | +## Hierarchical calibration — separate decision, deferred |
| 90 | + |
| 91 | +This decision picks only the calibration *backend*. Hierarchical geographic calibration (national → state → stratum, with spatial smoothness priors and optional Fay-Herriot small-area composites) is a structure layered on top of `microcalibrate` and will be decided in its own doc at the start of the local-area gate (G2). The cross-section gate (G1) calibrates at national scale first. |
| 92 | + |
| 93 | +## Does this close out the three-way overlap? |
| 94 | + |
| 95 | +Yes, operationally: |
| 96 | + |
| 97 | +- Production runs: `microcalibrate`. |
| 98 | +- Deployment subsampling: `Reweighter`. |
| 99 | +- Tests and small-scale diagnostics: `Calibrator`. |
| 100 | +- No single-pipeline run crosses all three. Each tool has a distinct and non-overlapping job. |
| 101 | + |
| 102 | +## What this unblocks |
| 103 | + |
| 104 | +- Migration step 2 of `docs/core-wiring-audit.md`: "Adopt `Calibrator` end-to-end" is revised to "Adopt `microcalibrate` end-to-end as the production calibrator." That becomes the first real code change in `spec-based-ecps-rewire`. |
| 105 | +- The rewired cross-section pipeline can start being written against a concrete calibration contract. |
| 106 | + |
| 107 | +## Revisit conditions |
| 108 | + |
| 109 | +Revisit this decision if any of the following becomes true: |
| 110 | + |
| 111 | +1. A benchmark shows `microcalibrate` produces materially worse loss than a refactored `Calibrator` on representative constraint matrices. (Unlikely — PE uses it successfully.) |
| 112 | +2. Licensing / availability of `microcalibrate` becomes a blocker for external consumers of microplex-us. (Mitigate by forking the needed subset into microplex core.) |
| 113 | +3. The SS-model longitudinal extension requires a calibration primitive that `microcalibrate` does not provide (e.g., explicit spatial smoothness, per-year temporal regularization). Add the primitive at microplex level rather than swapping backends. |