Commit aeb24f7

MaxGhenis and claude authored
Add core wiring audit for microplex primitives (#2)
Maps microplex core primitives to microplex-us usage. Classifies each as WIRED / READY / PARTIAL / PROTOTYPE / UNKNOWN and identifies the three-way calibrator overlap as the load-bearing decision point. Flags US-specific code sitting in generic core (transition rates, data_sources/cps|puf|psid, validation/soi) that should move to microplex-us. Lays out the migration order for the spec-based-ecps-rewire path.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 9c553d1 commit aeb24f7

1 file changed

Lines changed: 253 additions & 0 deletions

docs/core-wiring-audit.md

@@ -0,0 +1,253 @@
# Core wiring audit

*Snapshot: 2026-04-16. Audit of `microplex` core against the H+ rearchitecture proposal for `microplex-us`.*

## TL;DR

The architectural thinking already happened. `microplex` core has ~80% of the primitives the rearchitecture needs, most of them unused. `microplex-us` has grown a parallel set of donor-block, calibration, and entity-table machinery that duplicates what core already provides.

The project is **wire + complete + deprecate**, not **design + build**:

1. Wire `microplex-us` pipelines to use existing core primitives.
2. Complete half-baked primitives where "thought went in" but production-readiness did not.
3. Deprecate `microplex-us` duplicates as each replacement lands.

**Blocker:** `microplex` core is on a stale `codex/core-semantic-guards` branch (last commit 2026-04-02) with ~200 uncommitted/deleted files. Nothing destructive should land in core until that state resolves.
## What exists in core (status by category)

Legend:

- **WIRED**: used by microplex-us today
- **READY**: implemented, untested in production, no obvious gaps
- **PARTIAL**: implemented with gaps or known rough edges
- **PROTOTYPE**: substantial design but probably needs finishing for production
- **UNKNOWN**: needs hands-on testing to classify
### Spec primitives (`microplex.core`)

| Primitive | File | Status | Notes |
|---|---|---|---|
| `Period`, `PeriodType` | `core/periods.py` | READY | Pydantic. DAY/MONTH/QUARTER/YEAR arithmetic and containment. Not used by microplex-us. |
| `EntityType` | `core/entities.py` | WIRED | |
| `SourceArchetype`, `TimeStructure`, `Shareability` | `core/sources.py` | WIRED | Country-agnostic source taxonomy. `LONGITUDINAL_SOCIOECONOMIC`, `PANEL`, `EVENT_HISTORY` values already defined. |
| `SourceProvider`, `SourceQuery`, `ObservationFrame` | `core/sources.py` | WIRED | |
| `SourceManifest` | `core/source_manifests.py` | WIRED | |
| `FrameSemanticTransform` | `core/semantics.py` | PARTIAL | Declarative frame transforms with POST_SYNTHESIS / POST_IMPUTATION / POST_DONOR_INTEGRATION / POST_CALIBRATION / POST_EXPORT stages. microplex-us imports the module but coverage is unclear. |
| `SourceVariableCapability` | `core/variables.py` | WIRED | |
### Transitions (`microplex.transitions`)

| Primitive | File | Status | Notes |
|---|---|---|---|
| `Mortality` | `transitions/mortality.py` | PROTOTYPE | Hardcoded SSA 2021 period life tables (male + female qx arrays for ages 0–119). US-specific data in a "generic" module; likely belongs in microplex-us or a country-pack seam. |
| `MarriageTransition`, `DivorceTransition` | `transitions/demographic.py` | PROTOTYPE | Hardcoded rate tables from CPS/ACS. Same US-specificity concern. |
| `DisabilityOnset`, `DisabilityRecovery`, `DisabilityTransitionModel` | `transitions/disability.py` | PROTOTYPE | Hardcoded SSA DI rates. |

**Decision point:** the hardcoded US rates in `microplex.transitions` violate the core/country split. Either (a) move these to microplex-us and leave core as pure interface, or (b) make the rate tables pluggable with country-specific providers.
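Option (b) could look roughly like the sketch below. It is an illustration under stated assumptions, not existing `microplex` code: `MortalityRateProvider`, `TableMortalityRates`, and `expected_deaths` are hypothetical names, and the rates are toy values rather than SSA figures.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol, Sequence, Tuple


class MortalityRateProvider(Protocol):
    """Hypothetical seam: a country pack supplies annual death probabilities."""

    def qx(self, sex: str, age: int) -> float:
        ...


@dataclass(frozen=True)
class TableMortalityRates:
    """Provider backed by per-age arrays (index = age in years)."""

    male_qx: Sequence[float]
    female_qx: Sequence[float]

    def qx(self, sex: str, age: int) -> float:
        table = self.male_qx if sex == "male" else self.female_qx
        # Clamp to the tabulated range so ages past the last row reuse it.
        return table[min(max(age, 0), len(table) - 1)]


def expected_deaths(
    provider: MortalityRateProvider,
    people: Iterable[Tuple[str, int, float]],
) -> float:
    """Weighted expected deaths for (sex, age, weight) records."""
    return sum(weight * provider.qx(sex, age) for sex, age, weight in people)


# Toy two-row tables, NOT real life-table values.
toy_rates = TableMortalityRates(male_qx=[0.01, 0.02], female_qx=[0.005, 0.01])
print(expected_deaths(toy_rates, [("male", 0, 100.0), ("female", 5, 200.0)]))
```

Under this shape, core would own only the protocol and the transition logic, while each country package would construct its own provider from its own tables.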
### Neural trajectory models (`microplex.models`)

| Primitive | File | Status | Notes |
|---|---|---|---|
| `TrajectoryTransformer` | `models/trajectory_transformer.py` | PROTOTYPE | Autoregressive Transformer for panel synthesis. ZI-QDNN candidate per SS-model docs. |
| `TrajectoryVAE` | `models/trajectory_vae.py` | PROTOTYPE | |
| `SequenceSynthesizer` | `models/sequence_synthesizer.py` | PROTOTYPE | Variable-length sequence synthesizer. |
| `PanelEvolutionModel` | `models/panel_evolution.py` | PROTOTYPE | **Unified autoregressive replacement** for separate `transitions/*` classes. Docstring explicitly frames it as the replacement: `state[t+1] ~ state[t], state[t-1], ..., X`. |
| `BaseSynthesisModel`, `BaseTrajectoryModel`, `BaseGraphModel` | `models/base.py` | PROTOTYPE | Abstract bases. |

**Decision point:** `transitions/*` classes and `PanelEvolutionModel` overlap. If `PanelEvolutionModel` is the intended canonical form, the separate transitions either become (a) feature-engineering helpers for it, or (b) deleted. Right now both coexist, neither is wired, and microplex-us uses neither.
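To make the autoregressive form `state[t+1] ~ state[t], state[t-1], ..., X` concrete, here is a minimal rollout sketch. It does not use `PanelEvolutionModel` itself, whose interface this audit did not exercise; `rollout` and `toy_step` are hypothetical names, and the step function is a deterministic placeholder where a real model would sample from a learned conditional distribution.

```python
import random
from typing import Callable, Dict, List, Sequence

StepFn = Callable[[Sequence[float], Dict[str, float], random.Random], float]


def rollout(
    step: StepFn,
    history: Sequence[float],
    covariates: Dict[str, float],
    n_periods: int,
    seed: int = 0,
) -> List[float]:
    """Extend a state history by repeatedly conditioning on everything so far."""
    rng = random.Random(seed)
    states = list(history)
    for _ in range(n_periods):
        # Each new state sees the full history plus fixed covariates X.
        states.append(step(states, covariates, rng))
    return states


def toy_step(states: Sequence[float], covariates: Dict[str, float], rng: random.Random) -> float:
    # Placeholder second-order rule; ignores rng so the output is deterministic.
    return states[-1] + states[-2] + covariates["drift"]


print(rollout(toy_step, [1.0, 1.0], {"drift": 0.0}, n_periods=3))
```

A separate hazard class such as a mortality transition fits the same signature as a `step` that looks only at the latest state, which is one way to read option (a).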
### Fusion (`microplex.fusion`)

| Primitive | File | Status | Notes |
|---|---|---|---|
| `FusionPlan`, `VariableCoverage` | `fusion/planning.py` | WIRED | microplex-us already uses for planning. Good design: tracks source-by-variable coverage, shareability, time structure. |
| `MaskedMAF` | `fusion/masked_maf.py` | PROTOTYPE | Masked normalizing flow over stacked multi-survey data with per-record observed masks. Country-agnostic. |
| `MultiSourceFusion` | `fusion/multi_source_fusion.py` | PROTOTYPE | Per-source + cross-source + unified three-model pipeline. Direct alternative to microplex-us's donor-block system. |
| `harmonize_surveys`, `stack_surveys`, `COMMON_SCHEMA` | `fusion/harmonize.py` | PROTOTYPE | CPS/PUF-specific mappings baked in; needs generalization before being called "core". |
| `FusionSynthesizer`, `FusionConfig`, `FusionResult` | `fusion/pipeline.py` | PROTOTYPE | High-level convenience over MaskedMAF. |

**Decision point:** the `harmonize.py` `COMMON_SCHEMA` has US-specific variable names. Either move to microplex-us or make the mappings country-configurable.
### Calibration (three modules, overlapping)

| Primitive | File | Status | Notes |
|---|---|---|---|
| `Calibrator` (IPF, chi-square, entropy) | `calibration.py` (2011 lines) | WIRED | Core calibration class. Classical survey calibration. |
| `LinearConstraint` | `calibration.py` | WIRED | Explicit linear constraint rows. |
| `Reweighter` (L0/L1/L2 sparse) | `reweighting.py` (506 lines) | PROTOTYPE | Sparse L0/L1/L2 with scipy and cvxpy backends. Geographic hierarchy support. |
| `microcalibrate` (external) | PolicyEngine package | WIRED (via microplex-us callers externally) | PolicyEngine's gradient-descent chi-squared library. |

**Decision point (load-bearing):** three calibrators partly cover the same problem.

- **Recommendation:** `Calibrator` (classical, identity-preserving) is the mainline for the cross-section pipeline, because it preserves all entity IDs by construction. `Reweighter` is the **optional sparse deployment selector** applied *after* Calibrator to produce a web-app-sized subsample. `microcalibrate` stays as an external dependency only if it offers something `Calibrator` does not (gradient-descent scalability beyond ~1M rows?); otherwise retire it.
- **Must settle before any wiring commit lands** because migration step 2 depends on choosing the mainline.
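The "identity-preserving by construction" property is easiest to see in a bare IPF loop: raking only rescales weights, so every input record keeps a positive weight and none is dropped. The sketch below is a generic textbook IPF in plain Python, not the `Calibrator` API; `rake` and its argument names are hypothetical, and it assumes strictly positive starting weights and targets.

```python
from typing import Dict, List, Sequence


def rake(
    weights: Sequence[float],
    labels_by_margin: Sequence[Sequence[str]],
    targets_by_margin: Sequence[Dict[str, float]],
    iterations: int = 50,
) -> List[float]:
    """Iterative proportional fitting over categorical margins."""
    w = list(weights)
    for _ in range(iterations):
        for labels, targets in zip(labels_by_margin, targets_by_margin):
            # Current weighted total per category of this margin.
            totals: Dict[str, float] = {}
            for label, wi in zip(labels, w):
                totals[label] = totals.get(label, 0.0) + wi
            # Multiplicative update: rows are rescaled, never removed.
            w = [wi * targets[label] / totals[label] for label, wi in zip(labels, w)]
    return w


sex = ["m", "m", "f", "f"]
region = ["a", "b", "a", "b"]
calibrated = rake(
    [1.0, 1.0, 1.0, 1.0],
    [sex, region],
    [{"m": 60.0, "f": 40.0}, {"a": 50.0, "b": 50.0}],
)
print(calibrated)
```

A sparse L0 selector, by contrast, is allowed to drive weights to exactly zero, which is why the recommendation places it after the mainline calibration rather than in place of it.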
### Hierarchical synthesis (`microplex.hierarchical`)

| Primitive | File | Status | Notes |
|---|---|---|---|
| `HouseholdSchema`, hierarchical household→person two-pass | `hierarchical.py` (1155 lines) | PROTOTYPE | Not the same thing as "hierarchical calibration." This is two-pass synthesis: household skeleton first, then person attributes conditioned on household context. |
| `TaxUnitOptimizer` | `hierarchical.py` | WIRED | Already used by microplex-us. |
### Geography (`microplex.geography`)

| Primitive | File | Status | Notes |
|---|---|---|---|
| `AtomicGeographyCrosswalk` | `geography.py` | WIRED | |
| `GeographyProvider`, `StaticGeographyProvider` | `geography.py` | WIRED | |
| `ProbabilisticAtomicGeographyAssigner` | `geography.py` | WIRED | |
| `GeographyAssignmentPlan` | `geography.py` | WIRED | |

**Note:** US-specific GEOID constants (`STATE_LEN`, `COUNTY_LEN`, `TRACT_LEN`, `BLOCK_LEN`) are in core. Comment says "kept as compatibility constants"; probably deletable after the UK port proves the abstraction is truly country-agnostic.
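One way such constants could leave core is to pass the layout in as data. The sketch below assumes the standard 15-character Census block GEOID (2-digit state, 3-digit county, 6-digit tract, 4-digit block); `US_BLOCK_GEOID_LAYOUT` and `split_geoid` are hypothetical names, and the audit did not check how the existing constants are actually defined or used.

```python
from typing import Dict, Sequence, Tuple

# Component widths of a US Census block GEOID; this would live in microplex-us.
US_BLOCK_GEOID_LAYOUT: Tuple[Tuple[str, int], ...] = (
    ("state", 2),
    ("county", 3),
    ("tract", 6),
    ("block", 4),
)


def split_geoid(geoid: str, layout: Sequence[Tuple[str, int]]) -> Dict[str, str]:
    """Split a fixed-width geography code using a country-supplied layout."""
    expected = sum(width for _, width in layout)
    if len(geoid) != expected:
        raise ValueError(f"expected {expected} characters, got {len(geoid)}")
    parts: Dict[str, str] = {}
    position = 0
    for name, width in layout:
        parts[name] = geoid[position:position + width]
        position += width
    return parts


print(split_geoid("060371234561234", US_BLOCK_GEOID_LAYOUT))
```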
### Generative building blocks

| Primitive | File | Status | Notes |
|---|---|---|---|
| `Synthesizer` | `synthesizer.py` (728 lines) | WIRED | Main conditional synthesis class. Uses normalizing flows. |
| `ConditionalMAF` | `flows.py` (526 lines) | PROTOTYPE | Conditional MAF normalizing flow primitive. |
| `DGP` learning | `dgp.py`, `dgp_methods.py` | UNKNOWN | Population data-generating-process learning from multiple partial surveys. Distinct from fusion; claims to be "not statistical matching" and "not imputation" but to learn the true joint distribution. |
| `StatMatchSynthesizer` | `statmatch_backend.py` | PROTOTYPE | Wraps py-statmatch NND hot-deck. Useful for PUMS ↔ CPS graft. |
| `MultiVariableTransformer` | `transforms.py` | WIRED | |
| `BinaryModel`, `DiscreteModelCollection` | `discrete.py` | WIRED | |
### Data sources (`microplex.data_sources`)

| Source | Location | Country-appropriate? | Notes |
|---|---|---|---|
| `cps`, `cps_mappings`, `cps_transform` | `data_sources/cps.py` et al. | **No** (US-specific) | Should move to microplex-us. |
| `puf` | `data_sources/puf.py` | **No** (US-specific) | Should move to microplex-us. |
| `psid` | `data_sources/psid.py` | **No** (US-specific) | Should move to microplex-us. |

**Cleanup:** these three belong in `microplex-us/src/microplex_us/data_sources/` (where microplex-us already has its own `cps.py`, `puf.py`, etc.). Core has US-specific data loaders sitting in what should be a country-agnostic package.
### Validation (`microplex.validation`)

| Primitive | File | Country-appropriate? | Notes |
|---|---|---|---|
| `baseline` | `validation/baseline.py` | Likely generic | Needs review. |
| `soi` | `validation/soi.py` | **No** (US-specific) | Should move to microplex-us. |
### Targets (`microplex.targets`)

| Primitive | File | Status | Notes |
|---|---|---|---|
| `TargetSpec`, `TargetSet` | `targets/spec.py` | WIRED | |
| `TargetProvider` protocol | `targets/provider.py` | WIRED | |
| `TargetQuery` | `targets/provider.py` | WIRED | |
| `assert_valid_benchmark_artifact_manifest` | `targets/artifacts.py` | WIRED | |
| `rac_mapping`, `database`, `bundles`, `benchmarking` | `targets/*` | UNKNOWN | Need review. |
## What microplex-us currently imports from core

Used (from grep of imports):

```
microplex.calibration       (Calibrator, LinearConstraint)
microplex.core              (EntityType, ObservationFrame, SourceProvider, SourceQuery,
                             SourceManifest, SourceArchetype, SourceVariableCapability)
microplex.core.semantics    (subset of exports)
microplex.fusion            (FusionPlan only; not the actual fusion synthesizers)
microplex.geography         (subset)
microplex.hierarchical      (TaxUnitOptimizer)
microplex.synthesizer       (Synthesizer base)
microplex.targets           (TargetQuery, TargetSpec, TargetSet,
                             assert_valid_benchmark_artifact_manifest)
```

Unused but implemented in core:

```
microplex.transitions            (all of it: Mortality, Marriage, Divorce, Disability)
microplex.models                 (all trajectory / panel evolution models)
microplex.fusion.MaskedMAF       (neural fusion synthesizer)
microplex.fusion.MultiSourceFusion
microplex.fusion.harmonize       (stack_surveys, harmonize_surveys)
microplex.reweighting            (Reweighter, sparse L0)
microplex.statmatch_backend      (StatMatchSynthesizer, for PUMS graft)
microplex.hierarchical           (HouseholdSchema, hierarchical synthesis pipeline)
microplex.core.periods.Period    (period axis)
microplex.data_sources.psid
microplex.dgp                    (DGP learning)
```
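A grep can miss multi-line imports or match comments, so the usage list above could be re-derived from the syntax tree instead. The following is a minimal sketch using only the standard-library `ast` module; `microplex_imports` is a hypothetical helper, not something that exists in either repository.

```python
import ast
from typing import Dict, Set


def microplex_imports(source: str, package: str = "microplex") -> Dict[str, Set[str]]:
    """Map each imported `package` module to the names imported from it."""
    prefix = package + "."
    found: Dict[str, Set[str]] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # The exact-or-dotted check keeps `microplex_us` out of the result.
            if node.module == package or node.module.startswith(prefix):
                found.setdefault(node.module, set()).update(a.name for a in node.names)
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name == package or alias.name.startswith(prefix):
                    found.setdefault(alias.name, set())
    return found


example = (
    "from microplex.calibration import (\n"
    "    Calibrator,\n"
    "    LinearConstraint,\n"
    ")\n"
    "import microplex.geography\n"
    "from microplex_us.data_sources import cps\n"
)
print(microplex_imports(example))
```

Running this over every `.py` file under the microplex-us source tree and merging the dictionaries would give the "used" list; the "unused" list is then the core export surface minus that set.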
## Gaps: what genuinely needs to be built

Against the H+ proposal, what is NOT already in core (in any form):

1. **Identity-preserving calibrator protocol.** The concept exists only as a note; `Calibrator` and `Reweighter` are concrete classes with different contracts. A shared protocol that declares "output retains all input entity IDs" is missing.
2. **Spatial smoothness regularization** for local-area calibration. Neither `Calibrator` nor `Reweighter` currently penalizes weight differences across adjacent geographies.
3. **Fay-Herriot / composite estimator** for small-area estimation. Not present.
4. **Held-out target evaluation harness.** Calibrate-on vs validate-on split is not a first-class concept in the existing harness.
5. **Forbes backbone integration** for top-income records. PolicyEngine (PE) is adding this upstream; microplex has no equivalent.
6. **`TemporalDonorSpec` unification.** `transitions/*` classes and `PanelEvolutionModel` are two overlapping takes; a reconciled canonical abstraction does not exist.

Everything else in the H+ proposal is at minimum a PROTOTYPE in core.
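For gap 1, the missing contract could be as small as a protocol plus a runtime check. This is a sketch of one possible shape, not a proposal for the final interface: `IdentityPreservingCalibrator`, `checked_calibrate`, and both toy classes are hypothetical, and weights are modeled as a plain ID-to-weight mapping for brevity.

```python
from typing import Dict, Mapping, Protocol


class IdentityPreservingCalibrator(Protocol):
    """Contract: the output carries exactly the entity IDs of the input."""

    def calibrate(self, weights: Mapping[str, float]) -> Mapping[str, float]:
        ...


def checked_calibrate(
    calibrator: IdentityPreservingCalibrator,
    weights: Mapping[str, float],
) -> Dict[str, float]:
    """Run a calibrator and fail loudly if any entity ID is lost or invented."""
    result = dict(calibrator.calibrate(weights))
    if set(result) != set(weights):
        missing = sorted(set(weights) - set(result))
        extra = sorted(set(result) - set(weights))
        raise ValueError(f"entity IDs changed: missing={missing}, extra={extra}")
    return result


class UniformScaler:
    """Toy calibrator that honors the contract: scales to a target total."""

    def __init__(self, target_total: float) -> None:
        self.target_total = target_total

    def calibrate(self, weights: Mapping[str, float]) -> Mapping[str, float]:
        factor = self.target_total / sum(weights.values())
        return {entity_id: w * factor for entity_id, w in weights.items()}


class DropZeroWeights:
    """Toy sparse selector that violates the contract by dropping records."""

    def calibrate(self, weights: Mapping[str, float]) -> Mapping[str, float]:
        return {entity_id: w for entity_id, w in weights.items() if w > 0}


print(checked_calibrate(UniformScaler(100.0), {"a": 1.0, "b": 3.0}))
```

A check like this would let the mainline pipeline accept any calibrator that satisfies the protocol while rejecting a sparse selector used in the wrong position.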
## Three-way calibration overlap: decision required

```
microplex.calibration.Calibrator    classical: IPF / chi-square / entropy    WIRED
microplex.reweighting.Reweighter    sparse: L0 / L1 / L2                     UNUSED
microcalibrate (external)           gradient-descent chi-squared             UNUSED
```

Recommended resolution:

- **Mainline:** `Calibrator` (identity-preserving, classical). Used for every production calibration.
- **Optional sparse post-step:** `Reweighter` (L0). Applied after `Calibrator` when a deployment subsample is needed (e.g., 50k-record web app artifact).
- **Retire:** `microcalibrate` external dependency, unless benchmarking shows it does something `Calibrator` does not (e.g., gradient-descent scalability past ~1M rows on realistic constraint matrices).

This choice is load-bearing for migration step 2. It needs a yes/no before any wiring commits land.
## Migration order

| # | Swap | Gate | Blocked by |
|---|---|---|---|
| 0 | Resolve `codex/core-semantic-guards` branch state in microplex | microplex core tree clean on main | |
| 1 | Adopt `microplex.core.periods.Period` in microplex-us | microplex-us compiles with single period type | 0 |
| 2 | Adopt `Calibrator` end-to-end, retire staged `solve_now`/`solve_later` | Cross-section beats current checkpoint on PE-native loss | 0, calibrator decision |
| 3 | Adopt `MultiSourceFusion` + `MaskedMAF`; retire donor-block system | Neural fusion parity-evaluated vs block donors | 2 |
| 4 | Adopt `statmatch_backend` for ACS PUMS ↔ CPS graft | PUMA-level local scaffold exists | 3 |
| 5 | Adopt `Reweighter` as optional sparse deployment selector | 50k-record web-app artifact | 4 |
| 6 | Adopt `transitions/*` for Phase 2 trivial forward projection | 1-year forward projection runs | 5 |
| 7 | Consolidate `transitions/*` and `PanelEvolutionModel` into one canonical form | Unified AR model beats separate hazards on PSID validation | 6 |
| 8 | Adopt `TrajectoryTransformer` / `TrajectoryVAE` | Neural trajectory beats interval-specific QRF on age-earnings | 7 |

Steps 1–2 alone could clear G1 (national cross-section beats ECPS).
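The "Blocked by" column is a small dependency graph, and encoding it makes it easy to ask which steps are currently unblocked. The sketch below transcribes the table into a dictionary and uses the standard-library `graphlib`; the `ready_steps` helper and the choice to model the calibrator decision as its own node are illustrative assumptions.

```python
from graphlib import TopologicalSorter
from typing import Dict, Hashable, List, Set

DECISION = "calibrator decision"

# step -> set of prerequisites, transcribed from the migration table.
BLOCKED_BY: Dict[Hashable, Set[Hashable]] = {
    DECISION: set(),
    0: set(),
    1: {0},
    2: {0, DECISION},
    3: {2},
    4: {3},
    5: {4},
    6: {5},
    7: {6},
    8: {7},
}


def ready_steps(done: Set[Hashable]) -> List[Hashable]:
    """Steps not yet done whose prerequisites are all satisfied."""
    return [
        step
        for step, prerequisites in BLOCKED_BY.items()
        if step not in done and prerequisites <= done
    ]


# One valid execution order; graphlib raises CycleError if the table is inconsistent.
order = list(TopologicalSorter(BLOCKED_BY).static_order())
print(order)
print(ready_steps({0, DECISION}))
```

Once step 0 and the calibrator decision are done, steps 1 and 2 are the only ones unblocked, which matches the claim that they can proceed independently of the rest of the chain.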
## Prerequisite cleanup (microplex core)

Before any wiring commits land in core:

1. **Review `codex/core-semantic-guards` branch** (last commit 2026-04-02). It has useful-looking work (semantic transforms, sparse calibration frontier analysis, referee feedback) but ~200 uncommitted/deleted files. Either:
   - Land the useful pieces, or
   - Hard-reset to clean origin/main and cherry-pick, or
   - Abandon the branch and start fresh.
2. **Relocate US-specific code out of core:**
   - `microplex/data_sources/cps*`, `puf.py`, `psid.py` → microplex-us
   - `microplex/validation/soi.py` → microplex-us
   - SSA hardcoded tables in `transitions/*` → microplex-us (or make pluggable)
   - GEOID length constants in `geography.py` → microplex-us or private helper
3. **Delete the compatibility shims** in core root (`unified_calibration.py`, `target_registry.py`, `pe_targets.py`, `data.py`, `cps_synthetic.py`, `calibration_harness.py`) once all callers have migrated to microplex-us imports. Right now they stay as shims.
## Risks

1. **"Unused" ≠ "ready."** Every PROTOTYPE entry above likely has at least one production-blocking gap. Expect 20–40% of wiring effort to be "finish the core primitive" rather than "integrate."
2. **US-specific rates baked into "generic" core.** `transitions/*` has SSA life tables and CPS rates hardcoded at core level. Wiring microplex-us to those is easy; porting microplex-uk to them is impossible without first decoupling.
3. **Three-way calibrator overlap may hide performance differences.** Before choosing `Calibrator` as mainline, run one apples-to-apples benchmark against `Reweighter` and `microcalibrate` on a representative constraint matrix.
4. **`codex/core-semantic-guards` abandonment.** The stale branch may contain work that materially improves these primitives. Losing it to a hard reset would discard that work. Reviewing before discarding is cheap insurance.
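For risk 3, "apples-to-apples" mostly means every candidate sees the same starting weights and constraint rows and is scored on the same metrics. A minimal harness might look like the sketch below. The solver signature is an assumption made for illustration: each real calibrator would need a thin adapter to fit it, and the two candidates shown are toy stand-ins, not `Calibrator`, `Reweighter`, or `microcalibrate`.

```python
import time
from typing import Callable, Dict, List, Sequence

Solver = Callable[[List[float], Sequence[Sequence[float]], Sequence[float]], List[float]]


def max_relative_error(
    weights: Sequence[float],
    constraint_rows: Sequence[Sequence[float]],
    targets: Sequence[float],
) -> float:
    """Worst |achieved - target| / |target| across all linear constraints."""
    worst = 0.0
    for row, target in zip(constraint_rows, targets):
        achieved = sum(a * w for a, w in zip(row, weights))
        worst = max(worst, abs(achieved - target) / abs(target))
    return worst


def benchmark(
    candidates: Dict[str, Solver],
    weights: Sequence[float],
    constraint_rows: Sequence[Sequence[float]],
    targets: Sequence[float],
) -> Dict[str, Dict[str, float]]:
    """Score every candidate on identical inputs: time, fit, and sparsity."""
    results: Dict[str, Dict[str, float]] = {}
    for name, solve in candidates.items():
        start = time.perf_counter()
        out = solve(list(weights), constraint_rows, targets)
        elapsed = time.perf_counter() - start
        results[name] = {
            "seconds": elapsed,
            "max_rel_error": max_relative_error(out, constraint_rows, targets),
            "nonzero": float(sum(1 for w in out if w > 0)),
        }
    return results


# Toy stand-ins for the three real calibrators.
toy_candidates: Dict[str, Solver] = {
    "identity": lambda w, rows, t: w,
    "scale_to_first_target": lambda w, rows, t: [x * t[0] / sum(w) for x in w],
}
report = benchmark(toy_candidates, [1.0] * 4, [[1.0] * 4], [8.0])
print(report)
```

The `nonzero` count matters here because the sparse selector is expected to trade some fit for fewer retained records, so fit alone would not be a fair comparison.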
## Concrete next actions

1. Decide the codex branch's fate (land / reset and cherry-pick / abandon).
2. Settle the three-way calibrator question (benchmark or decision document).
3. Write PSID → ObservationFrame adapter in microplex-us data_sources (if not already done; needs check).
4. Prototype migration step 2 on a small slice: CPS + QRF via `MultiSourceFusion` + `Calibrator` → compare to current microplex-us pipeline at 2000-record smoke scale.
5. Once the smoke test passes, land step 1 (Period adoption) as the first wiring commit.

## Provenance

This audit reads core as of commit `71f270e` on branch `codex/core-semantic-guards` (microplex core). It does not execute any of the primitives, so READY / PARTIAL / PROTOTYPE classifications are based on interface inspection and file-size heuristics. Each classification needs empirical confirmation before commitment.
