This repository was archived by the owner on Jun 14, 2026. It is now read-only.

Commit 699ea28

MaxGhenis and claude committed
Add v6 post-mortem and calibrator decision for spec-based-ecps-rewire
Two docs that anchor the rewire direction with specific evidence from today's run:

docs/v6-postmortem.md
- Timeline of v6 from launch to OOM kill
- Stage-marker localization of the killer: calibrate_policyengine_tables with backend=entropy on 1.5M households × ~1.2k constraints on a 48 GB workstation
- rusage comparison to v4 (nearly identical signature: 22 GB max RSS, 293 GB peak phys_footprint)
- What v6 ruled IN as working at scale (donor integration, tables build)
- What v6 ruled OUT as the killer (synthesis, support enforcement, tables build)
- How this becomes evidence for the rewire rather than against it

docs/calibrator-decision.md
- Mainline: microcalibrate (gradient-descent chi-squared, identity preserving, production-proven by PE-US-data, aligns with SS-model longitudinal plan)
- Optional sparse deployment step after mainline: microplex.reweighting (L0 / HardConcrete, for web-app-sized subsamples only)
- Retire Calibrator(backend=entropy) at scales above ~200k records
- Revises migration step 2 of core-wiring-audit accordingly

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 9c553d1 commit 699ea28

2 files changed

Lines changed: 190 additions & 0 deletions


docs/calibrator-decision.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
1+
# Calibrator decision
2+
3+
*Decided: 2026-04-16. Applies to `spec-based-ecps-rewire` and every microplex-us pipeline that follows.*
4+
5+
## Context
6+
7+
Three calibration systems exist in the microplex / PolicyEngine ecosystem:
8+
9+
| System | Location | Method | Scale notes |
|---|---|---|---|
| `microplex.calibration.Calibrator` | microplex core, ~2011 lines | Classical IPF / chi-square / entropy balancing, with `LinearConstraint` for explicit constraint rows | Entropy backend just killed v6 at 1.5M households |
| `microplex.reweighting.Reweighter` | microplex core, 506 lines | Sparse L0/L1/L2 with scipy and cvxpy backends | Unused in production; designed for geographic-hierarchy reweighting; enforces sparsity by construction |
| `microcalibrate` | PolicyEngine external package | Gradient-descent chi-squared with soft penalties and optional feasibility filtering | Used by PE-US-data for its main calibration; has production track record |
14+
15+
v6 died inside `Calibrator.fit_transform(..., backend="entropy")` on a 1.5M-household frame. The underlying problem is not the Calibrator code itself. Entropy calibration instantiates near-dense structures at `(n_households × n_constraints)` scale, and with ~1,255 constraints those structures exceed what a 48 GB machine can hold once scratch memory is included.
16+
17+
## Decision
18+
19+
**Mainline calibrator for all production runs: `microcalibrate` (gradient-descent chi-squared).**
20+
21+
**Optional sparse deployment selector applied *after* mainline calibration: `microplex.reweighting.Reweighter` with L0/HardConcrete backend**, used only when a deployment artifact (web app, embedded tool) needs a ~50k-record subsample of a national build.
22+
23+
**Retire for production use: `microplex.calibration.Calibrator` with `backend="entropy"` at scales above ~200k records.** The classical Calibrator's IPF and chi-square backends stay available for small-scale work, diagnostics, and test harnesses where their explicit constraint semantics are convenient.
24+
25+
## Why `microcalibrate` and not core `Calibrator`
26+
27+
1. **Identity preservation.** `microcalibrate` adjusts per-record weights via gradient descent without materializing dense constraint Jacobians. Every input record survives to the output with a new weight. The rearchitecture's longitudinal extension (SS-model) requires stable entity identity across years, so identity preservation is non-negotiable.
28+
2. **Scalability at the target scale.** `microcalibrate` is the calibration stack PE-US-data actually uses for production enhanced-CPS builds at full scale. v6's death at 1.5M is direct evidence the entropy path doesn't scale; `microcalibrate`'s gradient-descent pattern does.
29+
3. **Soft-penalty feasibility handling.** The 2026-03-30 review flagged that v2's calibration dropped 65% of constraints as infeasible and then scored against the full target set, which systematically inflated the reported loss. `microcalibrate` supports soft penalty weights on targets the solver cannot feasibly hit, so infeasible targets are down-weighted in a principled way rather than dropped outright.
30+
4. **Alignment with the longitudinal plan.** The SS-model methodology doc explicitly names `microcalibrate` as the calibration tool for the longitudinal extension. Picking it now aligns the cross-section pipeline with the planned longitudinal path.
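
The identity-preserving, gradient-descent pattern described in the list above can be sketched in a few lines. This is a minimal illustration in plain NumPy/SciPy under stated assumptions, not the `microcalibrate` API: the function names, the log-weight parameterization, the backtracking step rule, and the synthetic data are all hypothetical. It only shows that weights can be fit to targets using a sparse constraint matrix while every input record keeps a positive weight.

```python
import numpy as np
from scipy import sparse


def relative_sq_loss(A, weights, targets):
    """Mean squared relative error of weighted totals against targets."""
    rel_err = (A.T @ weights - targets) / targets
    return float(np.mean(rel_err ** 2)), rel_err


def calibrate(A, w0, targets, n_steps=200, step=1.0):
    """Gradient descent on log-weight offsets; every record keeps a positive weight."""
    theta = np.zeros_like(w0)
    weights = w0.copy()
    loss, rel_err = relative_sq_loss(A, weights, targets)
    for _ in range(n_steps):
        # d(loss)/d(theta_i) = w_i * (2/m) * sum_j A_ij * rel_err_j / t_j
        grad = weights * (2.0 / len(targets)) * (A @ (rel_err / targets))
        # Backtracking: halve the step until the loss actually decreases.
        while True:
            with np.errstate(over="ignore", invalid="ignore"):
                cand_theta = theta - step * grad
                cand_w = w0 * np.exp(cand_theta)
                cand_loss, cand_err = relative_sq_loss(A, cand_w, targets)
            if cand_loss < loss or step < 1e-12:
                break
            step *= 0.5
        if not cand_loss < loss:
            break  # no improving step found; stop early
        theta, weights, loss, rel_err = cand_theta, cand_w, cand_loss, cand_err
        step *= 2.0  # try a bolder step on the next iteration
    return weights, loss


# Synthetic, feasible problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
n_records, n_targets = 2_000, 5
mask = rng.random((n_records, n_targets)) < 0.3
A = sparse.csr_matrix(rng.random((n_records, n_targets)) * mask)
w0 = np.full(n_records, 100.0)
targets = A.T @ (w0 * np.exp(rng.normal(0.0, 0.3, n_records)))

initial_loss, _ = relative_sq_loss(A, w0, targets)
weights, final_loss = calibrate(A, w0, targets)
print(f"loss {initial_loss:.2e} -> {final_loss:.2e}; "
      f"records kept: {np.count_nonzero(weights)} of {n_records}")
```

The only per-iteration work is two sparse matrix-vector products, so memory stays proportional to the number of nonzeros in `A` rather than to `n_records × n_targets`.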
31+
32+
## Why `Reweighter` stays as a post-mainline optional stage
33+
34+
1. **L0 sparsity serves deployment, not accuracy.** The right use of L0 is to produce a small subsample of a well-calibrated national dataset for constrained deployment targets (web app UI, mobile, static hosting). It is the wrong tool for "calibrate to hit targets" because it sacrifices exact match for sparsity.
35+
2. **Apply after, not instead of, the mainline.** The mainline run produces ~1.5M records with adjusted weights. If a deployment needs 50k records, apply `Reweighter` with appropriate L0 λ as a second pass. The mainline artifact remains the ground-truth output for analysis.
36+
3. **HardConcrete is the preferred sparse backend.** The `SparseCalibrator` + `HardConcreteCalibrator` analysis in the `codex/core-semantic-guards` paper work showed that HardConcrete dominates the sparse-calibration Pareto frontier, so when the sparse step does run, it should use HardConcrete. Core already ships this with multi-seed evaluation.
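
For readers unfamiliar with the technique, the following is a rough NumPy sketch of a hard concrete gate as described in the L0-regularization literature (Louizos et al.). It is not the `Reweighter` implementation: the stretch constants, function names, and toy values are assumptions chosen for illustration. It shows why the method yields a subsample: gates that clip to exactly zero drop records, and the expected number of open gates is the penalty that the L0 λ scales.

```python
import numpy as np

# Commonly used stretch and temperature constants (assumed, not read from microplex).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sample_gates(log_alpha, rng):
    """Stochastic training-time gates in [0, 1], one per record."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = sigmoid((np.log(u) - np.log1p(-u) + log_alpha) / BETA)
    return np.clip(s * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)


def deterministic_gates(log_alpha):
    """Gates used after training; exact zeros drop records from the subsample."""
    return np.clip(sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)


def expected_l0(log_alpha):
    """Expected number of open gates: the differentiable L0 penalty term."""
    return float(np.sum(sigmoid(log_alpha - BETA * np.log(-GAMMA / ZETA))))


# Toy example: three records with very negative, neutral, and very positive gate logits.
log_alpha = np.array([-10.0, 0.0, 10.0])
calibrated_weights = np.array([120.0, 80.0, 200.0])  # hypothetical mainline weights
gates = deterministic_gates(log_alpha)
sparse_weights = calibrated_weights * gates
print("gates:", gates)
print("records kept:", np.count_nonzero(sparse_weights))
print("expected L0:", round(expected_l0(log_alpha), 3))
```

In a real sparse pass the surviving weights would also be re-fit to the targets; the sketch only covers the gating step.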
37+
38+
## Why `Calibrator` is retired at scale
39+
40+
1. v6 proves `Calibrator(backend="entropy")` OOMs at 1.5M × 1.2k-constraint scale on a 48 GB workstation. v4's nearly identical memory signature indicates it failed the same way at a similar scale.
41+
2. No architectural fix is cheap. To make entropy work at that scale we would have to rewrite the backend to use sparse constraint matrices and streaming gradient, which is effectively reimplementing `microcalibrate`.
42+
3. `Calibrator` stays available and useful for small-scale test harnesses. It is still the right tool for `n < ~200k`, for unit tests of the calibration layer, and for explicit-constraint diagnostics (the `LinearConstraint` API is clean).
43+
44+
## Implementation implication
45+
46+
The rewired pipeline in `spec-based-ecps-rewire` will import `microcalibrate` as a real dependency (not optional). This is a net-new dependency for microplex-us. The audit entry that proposed "retire `microcalibrate` if `Calibrator` covers the scalability requirement" is overruled by v6's evidence.
47+
48+
## Calibration architecture, in order
49+
50+
```
raw seed data ─► donor integration ─► seed_ready
                        │
                        ▼
        synthesize (seed backend = copy)
                        │
                        ▼
               support enforcement
                        │
                        ▼
           policyengine entity tables
      (households, persons, tax_units, ...)
                        │
                        ▼
     ┌──────────────────┴──────────────────┐
     │ MAINLINE (every run)                │
     │ microcalibrate.Calibrator           │
     │ - chi-squared distance              │
     │ - gradient descent                  │
     │ - soft penalty for infeasibles      │
     │ - preserves all record IDs          │
     │                                     │
     │ Hierarchical in later phases:       │
     │   national → state → stratum        │
     └──────────────────┬──────────────────┘
                        │
                        ▼
        calibrated artifact (full scale)
                        │
                        ▼
     ┌──────────────────┴──────────────────┐
     │ OPTIONAL SPARSE DEPLOYMENT STEP     │
     │ microplex.reweighting.Reweighter    │
     │ - L0 / HardConcrete                 │
     │ - deployment-scale subsample        │
     │ Only when a deployment artifact     │
     │ needs to be small.                  │
     └─────────────────────────────────────┘
```
88+
89+
## Hierarchical calibration — separate decision, deferred
90+
91+
This decision only picks the calibration *backend*. Hierarchical geographic calibration (national → state → stratum, with spatial smoothness priors, optional Fay-Herriot small-area composites) is a structure layered on top of `microcalibrate` and will be decided in its own doc at the start of the local-area gate (G2). Cross-section gate (G1) calibrates at national scale first.
92+
93+
## Does this close out the three-way overlap?
94+
95+
Yes, operationally:
96+
97+
- Production runs: `microcalibrate`.
98+
- Deployment subsampling: `Reweighter`.
99+
- Tests and small-scale diagnostics: `Calibrator`.
100+
- No single-pipeline run crosses all three. Each tool has a distinct and non-overlapping job.
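
The division of labor above could be enforced with a small guard at the point where a pipeline picks its calibration tool. The sketch below is purely illustrative: `Purpose`, `select_calibrator`, and `ENTROPY_MAX_RECORDS` are hypothetical names that exist in neither microplex nor `microcalibrate`, and the threshold is the approximate ~200k figure from this decision.

```python
from enum import Enum

# Approximate ceiling for the classical Calibrator, taken from this decision doc.
ENTROPY_MAX_RECORDS = 200_000


class Purpose(Enum):
    PRODUCTION = "production"
    DEPLOYMENT_SUBSAMPLE = "deployment_subsample"
    DIAGNOSTIC = "diagnostic"


def select_calibrator(purpose: Purpose, n_records: int) -> str:
    """Return the name of the tool this decision assigns to a given job."""
    if purpose is Purpose.PRODUCTION:
        return "microcalibrate"
    if purpose is Purpose.DEPLOYMENT_SUBSAMPLE:
        return "microplex.reweighting.Reweighter"
    # Tests and small-scale diagnostics: classical Calibrator, small frames only.
    if n_records > ENTROPY_MAX_RECORDS:
        raise ValueError(
            f"{n_records:,} records exceeds the ~{ENTROPY_MAX_RECORDS:,} ceiling "
            "for microplex.calibration.Calibrator; use microcalibrate instead"
        )
    return "microplex.calibration.Calibrator"


print(select_calibrator(Purpose.PRODUCTION, 1_505_108))
print(select_calibrator(Purpose.DIAGNOSTIC, 50_000))
```

Failing loudly on an oversized diagnostic run is the design choice here: it turns a 12-hour OOM into an immediate error.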
101+
102+
## What this unblocks
103+
104+
- Migration step 2 of `docs/core-wiring-audit.md`: "Adopt `Calibrator` end-to-end" is revised to "Adopt `microcalibrate` end-to-end as the production calibrator." That becomes the first real code change in `spec-based-ecps-rewire`.
105+
- The rewired cross-section pipeline can start being written against a concrete calibration contract.
106+
107+
## Revisit conditions
108+
109+
Revisit this decision if any of the following becomes true:
110+
111+
1. A benchmark shows `microcalibrate` produces materially worse loss than a refactored `Calibrator` on representative constraint matrices. (Unlikely — PE uses it successfully.)
112+
2. Licensing / availability of `microcalibrate` becomes a blocker for external consumers of microplex-us. (Mitigate by forking the needed subset into microplex core.)
113+
3. The SS-model longitudinal extension requires a calibration primitive that `microcalibrate` does not provide (e.g., explicit spatial smoothness, per-year temporal regularization). Add the primitive at microplex level rather than swapping backends.

docs/v6-postmortem.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
1+
# v6 post-mortem — 2026-04-16
2+
3+
Record of the `broader-donors-puf-native-challenger-v6` run (launched 2026-04-16 10:20:10 ET, died 22:56:05 ET).
4+
5+
## Outcome
6+
7+
**RUN_EXIT status=1** after 12h 36m of wall time. Killed by the kernel during entropy calibration. No artifact directory created; no final dataset persisted.
8+
9+
## Timeline of the post-donor window
10+
11+
The post-donor stage instrumentation (commit `960ac2f`) was the single highest-value diagnostic change of the session. It let us localize the OOM to a specific named stage for the first time.
12+
13+
| Time (ET) | Stage marker |
|---|---|
| 10:20:10 | RUN_START |
| ~19:29 (9h 9m in) | last donor block complete (`scf_2022/social_security_pension_income`) |
| 21:04:03 | `seed ready`, `targets start`/`complete`, `synthesis variables ready`, `synthesis start`/`complete`, `support enforcement start`/`complete`, `policyengine tables start` (all in one burst; synthesis backend = seed-copy so the burst is dominated by the strip+cap pass between donor integration and tables) |
| ~22:25 | `policyengine tables complete` [households=1,505,108, persons=3,373,378] |
| ~22:25 | `policyengine calibration start [backend=entropy]` |
| 22:56:05 | RUN_EXIT status=1, kernel signal (macOS `time -l` reported "signal: Invalid argument" on the wrapper) |
21+
22+
## Memory signature
23+
24+
From macOS `time -l` rusage at exit:
25+
26+
| Metric | v6 | v4 (previous run) |
|---|---|---|
| Wall time | 45,355 s (12h 36m) | 39,476 s (10h 58m) |
| Max RSS | 22.0 GB | 20.5 GB |
| Peak phys_footprint | 293 GB | 287 GB |
| Instructions retired | 614 T | 612 T |
| Involuntary context switches | 317 K | 264 K |
33+
34+
v6's signature is nearly identical to v4's — same killer, same point.
35+
36+
## Diagnosis
37+
38+
**`calibrate_policyengine_tables` with `backend=entropy` on 1.5M households is the OOM killer.**
39+
40+
Proximate cause: a 48 GB machine cannot hold the working set the entropy solver needs for that scale. Peak phys_footprint of 293 GB on 48 GB RAM implies heavy compression and swap pressure; eventually the kernel kills the process.
41+
42+
Likely underlying structural cost (not measured, but fits the profile):
43+
44+
- Entropy calibration materializes a dense Jacobian-like matrix roughly `(n_households × n_constraints)` in float64.
45+
- With 1,505,108 households and ~1,255 constraints post-feasibility-filter (from the 2026-03-30 review), that is about 15.1 GB for a single copy (1,505,108 × 1,255 × 8 bytes). Multiple working copies (gradient, Hessian approximation, line-search scratch) easily exceed RAM.
46+
- `_evaluate_policyengine_target_fit_context` then runs a full PolicyEngine simulation on the calibrated frame, which adds its own memory cost on top.
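
The single-copy estimate in the list above is simple arithmetic, reproduced here as a quick check. The constraint count is the approximate figure from the review, gigabytes are decimal (1 GB = 1e9 bytes), and the helper name is made up for this sketch.

```python
N_HOUSEHOLDS = 1_505_108   # from the v6 "policyengine tables complete" marker
N_CONSTRAINTS = 1_255      # approximate, post-feasibility-filter
BYTES_PER_FLOAT64 = 8
RAM_GB = 48


def dense_matrix_gb(n_rows: int, n_cols: int, itemsize: int = BYTES_PER_FLOAT64) -> float:
    """Size of a dense n_rows x n_cols matrix in decimal gigabytes."""
    return n_rows * n_cols * itemsize / 1e9


single_copy_gb = dense_matrix_gb(N_HOUSEHOLDS, N_CONSTRAINTS)
copies_that_fit = int(RAM_GB // single_copy_gb)

print(f"single dense copy: {single_copy_gb:.1f} GB")
print(f"whole copies that fit in {RAM_GB} GB, ignoring all other memory use: {copies_that_fit}")
```

Three whole copies is an upper bound that ignores the entity tables and the Python process itself, which is consistent with a solver that holds several working copies running out of memory.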
47+
48+
## What survived
49+
50+
v6 demonstrated that the **tables-build phase works at scale**: `build_policyengine_entity_tables` successfully produced a 1.5M-household × 3.4M-person entity bundle. This was an open question after v4. The stage isn't free (roughly 1h 25m at 180–210% CPU, RSS oscillating 0.2–16%), but it doesn't OOM.
51+
52+
The donor integration also ran cleanly. All 129 donor blocks across CPS ASEC, IRS SOI PUF, SIPP tips, SIPP assets, and SCF completed without failure. The tax-unit entity-bundle construction took ~89 min (one-time cost per run). Multi-source donor imputation is not the bottleneck.
53+
54+
## What v6 ruled out as the killer
55+
56+
The initial v4 diagnosis hypothesized the silent post-donor window might be in synthesis, support enforcement, or tables-build. v6's instrumentation showed those all complete instantly or within ~1.5 hours. The killer is specifically **entropy calibration**, not an earlier stage.
57+
58+
## What this means for the architecture direction
59+
60+
v6 is an evidence point *for* the `spec-based-ecps-rewire` direction rather than against it:
61+
62+
1. **Entropy calibration on a 1.5M-household monolithic solve is a dead end on a 48 GB machine.** The rearchitecture's hierarchical / identity-preserving calibration pattern (national → state → stratum, `microcalibrate`-style chi-squared) avoids the dense-matrix blow-up by chunking over strata.
63+
2. **Scaffold scale is the real lever.** The 3.4M-row ACS scaffold drives both tables-build size and calibration-matrix size. CPS-core at ~430k persons cuts this at the source.
64+
3. **The instrumentation pattern is reusable.** Keeping named stage markers at every pipeline boundary in the new pipeline will make any future OOM localizable in a single run rather than requiring multiple exploratory runs.
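
One possible shape for the reusable stage markers mentioned in point 3 is a context manager that logs a `start` line before the stage body runs and a `complete` line afterwards. This is a sketch, not the instrumentation from commit `960ac2f`: the logger name, the `stage` helper, and the message format are assumptions modeled on the markers in the timeline. It uses the standard-library `resource` module, which is available on macOS and Linux only.

```python
import logging
import resource
import sys
import time
from contextlib import contextmanager

logger = logging.getLogger("pipeline.stages")  # hypothetical logger name


def peak_rss_gb() -> float:
    """Peak resident set size of this process so far, in decimal GB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    peak_bytes = peak if sys.platform == "darwin" else peak * 1024
    return peak_bytes / 1e9


@contextmanager
def stage(name: str, **details):
    """Log '<name> start [k=v]' before the body and '<name> complete' after it."""
    suffix = ""
    if details:
        suffix = " [" + ", ".join(f"{k}={v}" for k, v in details.items()) + "]"
    logger.info("%s start%s", name, suffix)
    started = time.monotonic()
    try:
        yield
    except BaseException:
        logger.error("%s failed after %.1fs (peak RSS %.2f GB)",
                     name, time.monotonic() - started, peak_rss_gb())
        raise
    logger.info("%s complete after %.1fs (peak RSS %.2f GB)",
                name, time.monotonic() - started, peak_rss_gb())


logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
with stage("policyengine calibration", backend="entropy"):
    pass  # the real stage body would run here
```

A kernel OOM kill never reaches the `except` branch, so the useful signal is a `start` line with no matching `complete` line, which is how v6 localized the failure.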
65+
66+
## What v6 does NOT tell us
67+
68+
- Whether the imputation quality would have beaten `enhanced_cps_2024` on PE-native broad loss had it finished. No parity artifact was produced.
69+
- Whether the `pe_plus_puf_native_challenger` condition selection is an improvement. Moot now that the pipeline direction is changing.
70+
- The Calibrator's actual numerical behavior on 1.5M households. The failure came before any numerical Calibrator work: the process died while setting up the constraint matrices.
71+
72+
## Status of v6 artifacts
73+
74+
- Log file: `artifacts/live_pe_us_data_rebuild_checkpoint_20260414_pe_plus_puf_native_challenger_broader/broader-donors-puf-native-challenger-v6.log` (~2,224 lines)
75+
- No output artifact directory (build never completed persistence step)
76+
- tmux session: cleaned up
77+
- No action required on artifacts — they stay on disk as part of the experiment trail.
