|
| 1 | +# Core wiring audit |
| 2 | + |
| 3 | +*Snapshot: 2026-04-16. Audit of `microplex` core against the H+ rearchitecture proposal for `microplex-us`.* |
| 4 | + |
| 5 | +## TL;DR |
| 6 | + |
| 7 | +The architectural thinking already happened. `microplex` core has ~80% of the primitives the rearchitecture needs — most of them unused. `microplex-us` has grown a parallel set of donor-block, calibration, and entity-table machinery that duplicates what core already provides. |
| 8 | + |
| 9 | +The project is **wire + complete + deprecate**, not **design + build**: |
| 10 | + |
| 11 | +1. Wire `microplex-us` pipelines to use existing core primitives. |
| 12 | +2. Complete half-baked primitives where "thought went in" but production-readiness did not. |
| 13 | +3. Deprecate `microplex-us` duplicates as each replacement lands. |
| 14 | + |
| 15 | +**Blocker:** `microplex` core is on a stale `codex/core-semantic-guards` branch (last commit 2026-04-02) with ~200 uncommitted/deleted files. Nothing destructive should land in core until that state resolves. |
| 16 | + |
| 17 | +## What exists in core (status by category) |
| 18 | + |
| 19 | +Legend: |
| 20 | + |
| 21 | +- **WIRED** — used by microplex-us today |
| 22 | +- **READY** — implemented, untested in production, no obvious gaps |
| 23 | +- **PARTIAL** — implemented with gaps or known rough edges |
| 24 | +- **PROTOTYPE** — substantial design but probably needs finishing for production |
| 25 | +- **UNKNOWN** — needs hands-on testing to classify |
| 26 | + |
| 27 | +### Spec primitives (`microplex.core`) |
| 28 | + |
| 29 | +| Primitive | File | Status | Notes | |
| 30 | +|---|---|---|---| |
| 31 | +| `Period`, `PeriodType` | `core/periods.py` | READY | Pydantic. DAY/MONTH/QUARTER/YEAR arithmetic and containment. Not yet used by microplex-us. |
| 32 | +| `EntityType` | `core/entities.py` | WIRED | — | |
| 33 | +| `SourceArchetype`, `TimeStructure`, `Shareability` | `core/sources.py` | WIRED | Country-agnostic source taxonomy. `LONGITUDINAL_SOCIOECONOMIC`, `PANEL`, `EVENT_HISTORY` values already defined. | |
| 34 | +| `SourceProvider`, `SourceQuery`, `ObservationFrame` | `core/sources.py` | WIRED | — | |
| 35 | +| `SourceManifest` | `core/source_manifests.py` | WIRED | — | |
| 36 | +| `FrameSemanticTransform` | `core/semantics.py` | PARTIAL | Declarative frame transforms with POST_SYNTHESIS / POST_IMPUTATION / POST_DONOR_INTEGRATION / POST_CALIBRATION / POST_EXPORT stages. microplex-us imports the module but coverage is unclear. | |
| 37 | +| `SourceVariableCapability` | `core/variables.py` | WIRED | — | |
| 38 | + |
| 39 | +### Transitions (`microplex.transitions`) |
| 40 | + |
| 41 | +| Primitive | File | Status | Notes | |
| 42 | +|---|---|---|---| |
| 43 | +| `Mortality` | `transitions/mortality.py` | PROTOTYPE | Hardcoded SSA 2021 period life tables (male + female qx arrays for ages 0–119). US-specific data in a "generic" module — likely belongs in microplex-us or a country-pack seam. | |
| 44 | +| `MarriageTransition`, `DivorceTransition` | `transitions/demographic.py` | PROTOTYPE | Hardcoded rate tables from CPS/ACS. Same US-specificity concern. | |
| 45 | +| `DisabilityOnset`, `DisabilityRecovery`, `DisabilityTransitionModel` | `transitions/disability.py` | PROTOTYPE | Hardcoded SSA DI rates. | |
| 46 | + |
| 47 | +**Decision point:** the hardcoded US rates in `microplex.transitions` violate the core/country split. Either (a) move these to microplex-us and leave core as pure interface, or (b) make the rate tables pluggable with country-specific providers. |
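
Option (b) could look roughly like the sketch below. It is an illustration only: `MortalityRateProvider`, `TableMortalityRates`, and `expected_deaths` are hypothetical names, not existing `microplex` APIs, and the rate values are made-up placeholders rather than SSA figures.

```python
from typing import Protocol, Sequence


class MortalityRateProvider(Protocol):
    """Hypothetical seam: core would depend only on this interface."""

    def qx(self, age: int, sex: str) -> float:
        """Return the one-year death probability for a given age and sex."""
        ...


class TableMortalityRates:
    """Country-pack implementation backed by per-sex qx arrays indexed by age."""

    def __init__(self, table: dict[str, Sequence[float]]) -> None:
        self._table = table

    def qx(self, age: int, sex: str) -> float:
        rates = self._table[sex]
        # Clamp ages beyond the end of the table to its last entry.
        return rates[min(age, len(rates) - 1)]


def expected_deaths(
    people: Sequence[tuple[int, str, float]], rates: MortalityRateProvider
) -> float:
    """Weighted expected deaths; each person is (age, sex, weight)."""
    return sum(weight * rates.qx(age, sex) for age, sex, weight in people)


# Placeholder values for illustration only; NOT real life-table rates.
toy_rates = TableMortalityRates(
    {"male": [0.01, 0.02, 0.5], "female": [0.01, 0.015, 0.4]}
)
people = [(0, "male", 100.0), (1, "female", 200.0), (7, "male", 10.0)]
total = expected_deaths(people, toy_rates)
```

Under this shape, a country package would supply its own table-backed provider and core would carry no rate data at all.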
| 48 | + |
| 49 | +### Neural trajectory models (`microplex.models`) |
| 50 | + |
| 51 | +| Primitive | File | Status | Notes | |
| 52 | +|---|---|---|---| |
| 53 | +| `TrajectoryTransformer` | `models/trajectory_transformer.py` | PROTOTYPE | Autoregressive Transformer for panel synthesis. ZI-QDNN candidate per SS-model docs. | |
| 54 | +| `TrajectoryVAE` | `models/trajectory_vae.py` | PROTOTYPE | — | |
| 55 | +| `SequenceSynthesizer` | `models/sequence_synthesizer.py` | PROTOTYPE | Variable-length sequence synthesizer. | |
| 56 | +| `PanelEvolutionModel` | `models/panel_evolution.py` | PROTOTYPE | **Unified autoregressive replacement** for separate `transitions/*` classes. Docstring explicitly frames it as the replacement: `state[t+1] ~ state[t], state[t-1], ..., X`. | |
| 57 | +| `BaseSynthesisModel`, `BaseTrajectoryModel`, `BaseGraphModel` | `models/base.py` | PROTOTYPE | Abstract bases. | |
| 58 | + |
| 59 | +**Decision point:** `transitions/*` classes and `PanelEvolutionModel` overlap. If `PanelEvolutionModel` is the intended canonical form, the separate transitions either (a) become feature-engineering helpers for it or (b) are deleted. Right now both coexist and microplex-us uses neither. |
| 60 | + |
| 61 | +### Fusion (`microplex.fusion`) |
| 62 | + |
| 63 | +| Primitive | File | Status | Notes | |
| 64 | +|---|---|---|---| |
| 65 | +| `FusionPlan`, `VariableCoverage` | `fusion/planning.py` | WIRED | microplex-us already uses for planning. Good design: tracks source-by-variable coverage, shareability, time structure. | |
| 66 | +| `MaskedMAF` | `fusion/masked_maf.py` | PROTOTYPE | Masked normalizing flow over stacked multi-survey data with per-record observed masks. Country-agnostic. | |
| 67 | +| `MultiSourceFusion` | `fusion/multi_source_fusion.py` | PROTOTYPE | Per-source + cross-source + unified three-model pipeline. Direct alternative to microplex-us's donor-block system. | |
| 68 | +| `harmonize_surveys`, `stack_surveys`, `COMMON_SCHEMA` | `fusion/harmonize.py` | PROTOTYPE | CPS/PUF-specific mappings baked in — needs generalization before being called "core". | |
| 69 | +| `FusionSynthesizer`, `FusionConfig`, `FusionResult` | `fusion/pipeline.py` | PROTOTYPE | High-level convenience over MaskedMAF. | |
| 70 | + |
| 71 | +**Decision point:** the `harmonize.py` `COMMON_SCHEMA` has US-specific variable names. Either move to microplex-us or make the mappings country-configurable. |
| 72 | + |
| 73 | +### Calibration (three modules, overlapping) |
| 74 | + |
| 75 | +| Primitive | File | Status | Notes | |
| 76 | +|---|---|---|---| |
| 77 | +| `Calibrator` (IPF, chi-square, entropy) | `calibration.py` (2011 lines) | WIRED | Core calibration class. Classical survey calibration. | |
| 78 | +| `LinearConstraint` | `calibration.py` | WIRED | Explicit linear constraint rows. | |
| 79 | +| `Reweighter` (L0/L1/L2 sparse) | `reweighting.py` (506 lines) | PROTOTYPE | Sparse L0/L1/L2 with scipy and cvxpy backends. Geographic hierarchy support. | |
| 80 | +| `microcalibrate` (external) | PolicyEngine package | WIRED (via microplex-us callers externally) | PolicyEngine's gradient-descent chi-squared library. | |
| 81 | + |
| 82 | +**Decision point (load-bearing):** three calibrators partly cover the same problem. |
| 83 | + |
| 84 | +- **Recommendation:** `Calibrator` (classical, identity-preserving) is the mainline for the cross-section pipeline, because it preserves all entity IDs by construction. `Reweighter` is the **optional sparse deployment selector** applied *after* Calibrator to produce a web-app-sized subsample. `microcalibrate` stays as an external dependency only if it offers something `Calibrator` does not (gradient-descent scalability beyond ~1M rows?) — otherwise retire it. |
| 85 | +- **Must settle before any wiring commit lands** because migration step 2 depends on choosing the mainline. |
| 86 | + |
| 87 | +### Hierarchical synthesis (`microplex.hierarchical`) |
| 88 | + |
| 89 | +| Primitive | File | Status | Notes | |
| 90 | +|---|---|---|---| |
| 91 | +| `HouseholdSchema`, hierarchical household→person two-pass | `hierarchical.py` (1155 lines) | PROTOTYPE | Not to be confused with "hierarchical calibration": this is two-pass synthesis, with the household skeleton generated first and person attributes then conditioned on household context. |
| 92 | +| `TaxUnitOptimizer` | `hierarchical.py` | WIRED | Already used by microplex-us. | |
| 93 | + |
| 94 | +### Geography (`microplex.geography`) |
| 95 | + |
| 96 | +| Primitive | File | Status | Notes | |
| 97 | +|---|---|---|---| |
| 98 | +| `AtomicGeographyCrosswalk` | `geography.py` | WIRED | — | |
| 99 | +| `GeographyProvider`, `StaticGeographyProvider` | `geography.py` | WIRED | — | |
| 100 | +| `ProbabilisticAtomicGeographyAssigner` | `geography.py` | WIRED | — | |
| 101 | +| `GeographyAssignmentPlan` | `geography.py` | WIRED | — | |
| 102 | + |
| 103 | +**Note:** US-specific GEOID constants (`STATE_LEN`, `COUNTY_LEN`, `TRACT_LEN`, `BLOCK_LEN`) are in core. Comment says "kept as compatibility constants" — probably deletable after UK port proves the abstraction is truly country-agnostic. |
| 104 | + |
| 105 | +### Generative building blocks |
| 106 | + |
| 107 | +| Primitive | File | Status | Notes | |
| 108 | +|---|---|---|---| |
| 109 | +| `Synthesizer` | `synthesizer.py` (728 lines) | WIRED | Main conditional synthesis class. Uses normalizing flows. | |
| 110 | +| `ConditionalMAF` | `flows.py` (526 lines) | PROTOTYPE | Conditional MAF normalizing flow primitive. | |
| 111 | +| `DGP` learning | `dgp.py`, `dgp_methods.py` | UNKNOWN | Learns a population data-generating process from multiple partial surveys. Distinct from fusion; the module claims to be "not statistical matching" and "not imputation" but to learn the true joint distribution. |
| 112 | +| `StatMatchSynthesizer` | `statmatch_backend.py` | PROTOTYPE | Wraps py-statmatch NND hot-deck. Useful for PUMS ↔ CPS graft. | |
| 113 | +| `MultiVariableTransformer` | `transforms.py` | WIRED | — | |
| 114 | +| `BinaryModel`, `DiscreteModelCollection` | `discrete.py` | WIRED | — | |
| 115 | + |
| 116 | +### Data sources (`microplex.data_sources`) |
| 117 | + |
| 118 | +| Source | Location | Country-appropriate? | Notes | |
| 119 | +|---|---|---|---| |
| 120 | +| `cps`, `cps_mappings`, `cps_transform` | `data_sources/cps.py` et al | **No** (US-specific) | Should move to microplex-us. | |
| 121 | +| `puf` | `data_sources/puf.py` | **No** (US-specific) | Should move to microplex-us. | |
| 122 | +| `psid` | `data_sources/psid.py` | **No** (US-specific) | Should move to microplex-us. | |
| 123 | + |
| 124 | +**Cleanup:** these three belong in `microplex-us/src/microplex_us/data_sources/` (where microplex-us already has its own `cps.py`, `puf.py`, etc.). Core has US-specific data loaders sitting in what should be a country-agnostic package. |
| 125 | + |
| 126 | +### Validation (`microplex.validation`) |
| 127 | + |
| 128 | +| Primitive | File | Country-appropriate? | Notes | |
| 129 | +|---|---|---|---| |
| 130 | +| `baseline` | `validation/baseline.py` | Likely generic | Needs review. | |
| 131 | +| `soi` | `validation/soi.py` | **No** (US-specific) | Should move to microplex-us. | |
| 132 | + |
| 133 | +### Targets (`microplex.targets`) |
| 134 | + |
| 135 | +| Primitive | File | Status | Notes | |
| 136 | +|---|---|---|---| |
| 137 | +| `TargetSpec`, `TargetSet` | `targets/spec.py` | WIRED | — | |
| 138 | +| `TargetProvider` protocol | `targets/provider.py` | WIRED | — | |
| 139 | +| `TargetQuery` | `targets/provider.py` | WIRED | — | |
| 140 | +| `assert_valid_benchmark_artifact_manifest` | `targets/artifacts.py` | WIRED | — | |
| 141 | +| `rac_mapping`, `database`, `bundles`, `benchmarking` | `targets/*` | UNKNOWN | Need review. | |
| 142 | + |
| 143 | +## What microplex-us currently imports from core |
| 144 | + |
| 145 | +Used (from grep of imports): |
| 146 | + |
| 147 | +``` |
| 148 | +microplex.calibration (Calibrator, LinearConstraint) |
| 149 | +microplex.core (EntityType, ObservationFrame, SourceProvider, SourceQuery, |
| 150 | + SourceManifest, SourceArchetype, SourceVariableCapability) |
| 151 | +microplex.core.semantics (subset of exports) |
| 152 | +microplex.fusion (FusionPlan only — not the actual fusion synthesizers) |
| 153 | +microplex.geography (subset) |
| 154 | +microplex.hierarchical (TaxUnitOptimizer) |
| 155 | +microplex.synthesizer (Synthesizer base) |
| 156 | +microplex.targets (TargetQuery, TargetSpec, TargetSet, |
| 157 | + assert_valid_benchmark_artifact_manifest) |
| 158 | +``` |
| 159 | + |
| 160 | +Unused but implemented in core: |
| 161 | + |
| 162 | +``` |
| 163 | +microplex.transitions (all of it — Mortality, Marriage, Divorce, Disability) |
| 164 | +microplex.models (all trajectory / panel evolution models) |
| 165 | +microplex.fusion.MaskedMAF (neural fusion synthesizer) |
| 166 | +microplex.fusion.MultiSourceFusion |
| 167 | +microplex.fusion.harmonize (stack_surveys, harmonize_surveys) |
| 168 | +microplex.reweighting (Reweighter — sparse L0) |
| 169 | +microplex.statmatch_backend (StatMatchSynthesizer — for PUMS graft) |
| 170 | +microplex.hierarchical (HouseholdSchema, hierarchical synthesis pipeline) |
| 171 | +microplex.core.periods.Period (period axis) |
| 172 | +microplex.data_sources.psid |
| 173 | +microplex.dgp (DGP learning) |
| 174 | +``` |
| 175 | + |
| 176 | +## Gaps — what genuinely needs to be built |
| 177 | + |
| 178 | +Against the H+ proposal, what is NOT already in core (in any form): |
| 179 | + |
| 180 | +1. **Identity-preserving calibrator protocol.** Concept only exists as a note; `Calibrator` and `Reweighter` are concrete classes with different contracts. A shared protocol that declares "output retains all input entity IDs" is missing. |
| 181 | +2. **Spatial smoothness regularization** for local-area calibration. Neither `Calibrator` nor `Reweighter` currently penalizes weight differences across adjacent geographies. |
| 182 | +3. **Fay-Herriot / composite estimator** for small-area estimation. Not present. |
| 183 | +4. **Held-out target evaluation harness.** Calibrate-on vs validate-on split is not a first-class concept in the existing harness. |
| 184 | +5. **Forbes backbone integration** for top-income records. PE is adding this upstream; microplex has no equivalent. |
| 185 | +6. **`TemporalDonorSpec` unification.** `transitions/*` classes and `PanelEvolutionModel` are two overlapping takes; a reconciled canonical abstraction does not exist. |
| 186 | + |
| 187 | +Everything else in the H+ proposal is at minimum a PROTOTYPE in core. |
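
Gap 1 can be made concrete with a small sketch. The names below (`IdentityPreservingCalibrator`, `check_identity_preserved`, `ScaleToTotal`, `DropSmallest`) are hypothetical and do not exist in `microplex`; the sketch only shows one way a shared contract plus a runtime check might be expressed, assuming weights are keyed by entity ID.

```python
from typing import Hashable, Mapping, Protocol


class IdentityPreservingCalibrator(Protocol):
    """Hypothetical contract: output carries exactly the input entity IDs."""

    def calibrate(
        self, weights: Mapping[Hashable, float]
    ) -> Mapping[Hashable, float]:
        ...


def check_identity_preserved(
    before: Mapping[Hashable, float], after: Mapping[Hashable, float]
) -> None:
    """Raise if any entity ID was dropped or invented by a calibrator."""
    missing = set(before) - set(after)
    extra = set(after) - set(before)
    if missing or extra:
        raise ValueError(f"identity not preserved: missing={missing}, extra={extra}")


class ScaleToTotal:
    """Toy calibrator: rescales all weights to hit a target total."""

    def __init__(self, target_total: float) -> None:
        self.target_total = target_total

    def calibrate(self, weights: Mapping[Hashable, float]) -> dict[Hashable, float]:
        factor = self.target_total / sum(weights.values())
        return {entity_id: w * factor for entity_id, w in weights.items()}


class DropSmallest:
    """Toy subsampler that violates the contract by removing one record."""

    def calibrate(self, weights: Mapping[Hashable, float]) -> dict[Hashable, float]:
        smallest = min(weights, key=weights.get)
        return {k: w for k, w in weights.items() if k != smallest}


initial = {"p1": 10.0, "p2": 20.0, "p3": 70.0}
scaled = ScaleToTotal(200.0).calibrate(initial)
check_identity_preserved(initial, scaled)  # passes: same IDs, new weights

try:
    check_identity_preserved(initial, DropSmallest().calibrate(initial))
    violation_detected = False
except ValueError:
    violation_detected = True
```

A check of this kind would let the pipeline treat any sparse selector as a separate, explicitly non-identity-preserving step.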
| 188 | + |
| 189 | +## Three-way calibration overlap — decision required |
| 190 | + |
| 191 | +``` |
| 192 | +microplex.calibration.Calibrator classical: IPF / chi-square / entropy WIRED |
| 193 | +microplex.reweighting.Reweighter sparse: L0 / L1 / L2 UNUSED |
| 194 | +microcalibrate (external) gradient-descent chi-squared UNUSED |
| 195 | +``` |
| 196 | + |
| 197 | +Recommended resolution: |
| 198 | + |
| 199 | +- **Mainline:** `Calibrator` (identity-preserving, classical). Used for every production calibration. |
| 200 | +- **Optional sparse post-step:** `Reweighter` (L0). Applied after `Calibrator` when a deployment subsample is needed (e.g., 50k-record web app artifact). |
| 201 | +- **Retire:** `microcalibrate` external dependency, unless benchmarking shows it does something `Calibrator` does not (e.g., gradient-descent scalability past ~1M rows on realistic constraint matrices). |
| 202 | + |
| 203 | +This choice is load-bearing for migration step 2. It needs a yes/no before any wiring commits land. |
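
To illustrate why classical calibration is identity-preserving, here is a minimal textbook raking (IPF) sketch in plain Python. It is not the `Calibrator` implementation; the `rake` function, the record layout, and the toy targets are assumptions made for the example.

```python
def rake(records, weights, targets, n_iter=50):
    """Iterative proportional fitting over categorical margins.

    records: list of dicts, one per entity
    weights: initial weights, aligned with records
    targets: {variable: {category: target weighted total}}
    Returns a new weight list of the same length as the input.
    """
    w = list(weights)
    for _ in range(n_iter):
        for var, cat_totals in targets.items():
            # Current weighted total in each category of this margin.
            current = {cat: 0.0 for cat in cat_totals}
            for rec, wi in zip(records, w):
                current[rec[var]] += wi
            # Scale each record so its category matches the target total.
            w = [
                wi * cat_totals[rec[var]] / current[rec[var]]
                for rec, wi in zip(records, w)
            ]
    return w


records = [
    {"id": 1, "sex": "m", "region": "n"},
    {"id": 2, "sex": "m", "region": "s"},
    {"id": 3, "sex": "f", "region": "n"},
    {"id": 4, "sex": "f", "region": "s"},
    {"id": 5, "sex": "m", "region": "n"},
]
targets = {"sex": {"m": 60.0, "f": 40.0}, "region": {"n": 50.0, "s": 50.0}}
new_weights = rake(records, [10.0] * len(records), targets)


def margin(var, cat):
    """Weighted total of records in a given category after raking."""
    return sum(w for rec, w in zip(records, new_weights) if rec[var] == cat)
```

Every record keeps a strictly positive weight because raking only multiplies weights by positive factors, so no entity ID can disappear. A sparse selector, by contrast, is free to set weights to zero.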
| 204 | + |
| 205 | +## Migration order |
| 206 | + |
| 207 | +| # | Swap | Gate | Blocked by | |
| 208 | +|---|---|---|---| |
| 209 | +| 0 | Resolve `codex/core-semantic-guards` branch state in microplex | microplex core tree clean on main | — | |
| 210 | +| 1 | Adopt `microplex.core.periods.Period` in microplex-us | microplex-us compiles with single period type | 0 | |
| 211 | +| 2 | Adopt `Calibrator` end-to-end, retire staged solve_now/solve_later | Cross-section beats current checkpoint on PE-native loss | 0, calibrator decision | |
| 212 | +| 3 | Adopt `MultiSourceFusion` + `MaskedMAF`; retire donor-block system | Neural fusion parity-evaluated vs block donors | 2 | |
| 213 | +| 4 | Adopt `statmatch_backend` for ACS PUMS ↔ CPS graft | PUMA-level local scaffold exists | 3 | |
| 214 | +| 5 | Adopt `Reweighter` as optional sparse deployment selector | 50k-record web-app artifact | 4 | |
| 215 | +| 6 | Adopt `transitions/*` for Phase 2 trivial forward projection | 1-year forward projection runs | 5 | |
| 216 | +| 7 | Consolidate `transitions/*` and `PanelEvolutionModel` into one canonical form | Unified AR model beats separate hazards on PSID validation | 6 | |
| 217 | +| 8 | Adopt `TrajectoryTransformer` / `TrajectoryVAE` | Neural trajectory beats interval-specific QRF on age-earnings | 7 | |
| 218 | + |
| 219 | +Steps 1–2 alone could clear G1 (national cross-section beats ECPS). |
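
The "Blocked by" column can be checked mechanically. The sketch below encodes the table as a dependency graph using the standard-library `graphlib` module; the `"decision"` node standing in for the calibrator decision and the `ready_steps` helper are assumptions added for illustration.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that block it (from the table above).
BLOCKED_BY = {
    "decision": set(),  # calibrator decision: not a numbered step
    "0": set(),
    "1": {"0"},
    "2": {"0", "decision"},
    "3": {"2"},
    "4": {"3"},
    "5": {"4"},
    "6": {"5"},
    "7": {"6"},
    "8": {"7"},
}


def ready_steps(done):
    """Steps not yet done whose blockers have all been completed."""
    return sorted(
        step
        for step, blockers in BLOCKED_BY.items()
        if step not in done and blockers <= done
    )


# static_order() raises CycleError if the table contains a dependency cycle.
order = list(TopologicalSorter(BLOCKED_BY).static_order())

after_branch_cleanup = ready_steps({"0"})
after_decision = ready_steps({"0", "decision"})
```

With only step 0 done, step 1 is unblocked but step 2 still waits on the calibrator decision, which matches the note that the decision must be settled before step 2 can start.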
| 220 | + |
| 221 | +## Prerequisite cleanup (microplex core) |
| 222 | + |
| 223 | +Before any wiring commits land in core: |
| 224 | + |
| 225 | +1. **Review `codex/core-semantic-guards` branch** (last commit 2026-04-02). It has useful-looking work (semantic transforms, sparse calibration frontier analysis, referee feedback) but ~200 uncommitted/deleted files. Either: |
| 226 | + - Land the useful pieces, or |
| 227 | + - Hard-reset to clean origin/main and cherry-pick, or |
| 228 | + - Abandon the branch and start fresh. |
| 229 | +2. **Relocate US-specific code out of core:** |
| 230 | + - `microplex/data_sources/cps*`, `puf.py`, `psid.py` → microplex-us |
| 231 | + - `microplex/validation/soi.py` → microplex-us |
| 232 | + - SSA hardcoded tables in `transitions/*` → microplex-us (or make pluggable) |
| 233 | + - GEOID length constants in `geography.py` → microplex-us or private helper |
| 234 | +3. **Delete the compatibility shims** in core root (`unified_calibration.py`, `target_registry.py`, `pe_targets.py`, `data.py`, `cps_synthetic.py`, `calibration_harness.py`) once all callers have migrated to microplex-us imports. Right now they stay as shims. |
| 235 | + |
| 236 | +## Risks |
| 237 | + |
| 238 | +1. **"Unused" ≠ "ready."** Every PROTOTYPE entry above likely has at least one production-blocking gap. Expect 20–40% of wiring effort to be "finish the core primitive" rather than "integrate." |
| 239 | +2. **US-specific rates baked into "generic" core.** `transitions/*` has SSA life tables and CPS rates hardcoded at core level. Wiring microplex-us to those is easy; porting microplex-uk to them is impossible without first decoupling. |
| 240 | +3. **Three-way calibrator overlap may hide performance differences.** Before choosing `Calibrator` as mainline, run one apples-to-apples benchmark against `Reweighter` and `microcalibrate` on a representative constraint matrix. |
| 241 | +4. **`codex/core-semantic-guards` abandonment.** The stale branch may contain work that materially improves these primitives, and a hard reset would discard it. Reviewing before discarding is cheap insurance. |
| 242 | + |
| 243 | +## Concrete next actions |
| 244 | + |
| 245 | +1. Decide the codex branch's fate (land / reset and cherry-pick / abandon). |
| 246 | +2. Settle the three-way calibrator question (benchmark or decision document). |
| 247 | +3. Write PSID → ObservationFrame adapter in microplex-us data_sources (if not already done — needs check). |
| 248 | +4. Prototype migration steps 2–3 on a small slice: CPS + QRF via `MultiSourceFusion` + `Calibrator` → compare to current microplex-us pipeline at 2000-record smoke scale. |
| 249 | +5. Once smoke passes, land step 1 (Period adoption) as the first wiring commit. |
| 250 | + |
| 251 | +## Provenance |
| 252 | + |
| 253 | +This audit reads core as of commit `71f270e` on branch `codex/core-semantic-guards` (microplex core). It does not execute any of the primitives, so READY / PARTIAL / PROTOTYPE classifications are based on interface inspection and file-size heuristics. Each classification needs empirical confirmation before commitment. |