microplex-us is moving away from pipeline-level source special cases and
toward declarative source and variable semantics.
There are two different questions for any variable in any source:
- Is this source authoritative for that variable?
- Is that variable safe to use as a donor conditioning feature?
Those are not the same question.
Examples:
- A source can be authoritative for a variable while still being a bad shared conditioning feature.
- A variable can exist in two sources but be semantically incompatible as a conditioning feature because one side is derived or placeholder-filled.
Core type:
microplex.core.SourceVariableCapability
Attached through:
microplex.core.SourceDescriptor.variable_capabilities
Current public helpers on SourceDescriptor:
capability_for(variable_name)is_authoritative_for(variable_name)allows_conditioning_on(variable_name)
Country-specific policy lives in:
src/microplex_us/source_registry.py
Main types:
SourceVariablePolicySourceVariablePolicySpecresolve_source_variable_capabilities(...)
This lets a source provider declare policy without embedding donor logic in the pipeline itself.
Generic atomic-vs-derived semantics live in:
src/microplex_us/variables.py
Main types and helpers:
VariableSemanticSpecDonorImputationBlockSpecDonorMatchStrategyVariableSupportFamilyVARIABLE_SEMANTIC_SPECSresolve_variable_semantic_capabilities(...)prune_redundant_variables(...)donor_imputation_blocks(...)donor_imputation_block_specs(...)
The donor path also now selects conditioning features per donor block rather than using the same shared-variable set for every imputation target. There are two distinct selection modes:
- generic selection:
- start from variables allowed by source and variable capability metadata
- score shared variables against the donor block
- keep the strongest conditioning features instead of every available overlap
pe_prespecifiedselection:- build a PE-style structural predictor surface when the variable semantics ask for it
- use the variable's declared
preferred_condition_varsas the structural backbone - optionally admit a narrow
supplemental_shared_condition_varsset from the actual shared overlap, instead of reopening the full common-predictor pool
pe_plus_puf_native_challengerselection:- keep the same PE structural backbone
- for the explicitly marked problematic PUF tax-leaf blocks only, append a narrow set of source-native raw-overlap predictors declared in semantics
- treat that lane as an opt-in challenger, not as a PE-alignment update
For the problematic PUF tax-leaf family, the PE-aligned default is still the
structural backbone only. The local policyengine-us-data
calibration/puf_impute.py path trains the PUF clone QRF on demographic /
tax-unit-role predictors only, and the PUF source policy intentionally marks
derived convenience columns like income, employment_status, and synthetic
state_fips as not usable for donor conditioning.
That keeps the donor path closer to the intended Microplex shape:
- declarative semantics define what is valid
- the pipeline chooses what is useful from the data
- source-specific predictor policy lives in semantics metadata rather than expanding ad hoc pipeline branches
The donor blocks themselves are also now declarative:
- native entity
- allowed condition entities
- projection aggregation for person-native controls when projected to a group
- block model variables
- restored output variables
- match strategy per modeled variable
- preferred conditioning variables
- supplemental shared conditioning variables
- optional frame preparation / restoration hooks
That means us.py now executes donor block specs rather than deciding inline
which blocks need special handling.
Artifacts now also record synthesis.donor_conditioning_diagnostics for each
executed donor block, including:
- donor source
- modeled/restored variables
- raw shared overlap before block preparation
- block-level shared overlap after model-variable exclusion
- whether entity projection ran, and which shared vars survived projection
- selected condition vars
- shared vars that were available but dropped
- requested supplemental shared vars
- requested challenger shared vars
- raw-stage supplemental rejection reasons
- raw-stage challenger rejection reasons
- prepared-stage supplemental rejection reasons
- prepared-stage challenger rejection reasons
- whether the block used a prepared condition surface
Use python -m microplex_us.pipelines.summarize_donor_conditioning <artifact>
to inspect those diagnostics from a finished artifact.
When a donor block declares a non-person native entity and those IDs are available in the working frame, the pipeline now:
- projects scaffold and donor rows to that entity
- filters donor conditioning features through the block's declared
condition_entitiespolicy - projects person-native conditioning variables using their semantic aggregation rule instead of blindly taking the first row
- fits the donor block once per native entity
- broadcasts imputed values back to person rows after matching
Current example:
dividend_incomeandordinary_dividend_incomeare treated as derived whenqualified_dividend_incomeandnon_qualified_dividend_incomeare present.
That means the system can automatically:
- avoid learning redundant totals as donor targets
- avoid using redundant totals as donor conditioning features
- keep the atomic basis as the source of truth
- distinguish variable families that should use household-only controls from those that should use person + household + native-entity controls
- respect tax-unit-native donor blocks without forcing all tax-unit variables through the same condition policy
This is the beginning of the general rule we want:
- source-specific policy should be declarative
- variable-level atomic/derived semantics should be generic
- the donor integration pipeline should consume metadata, not source names
That is what makes the approach portable to future country packs.
Current PUF policy expresses that:
state_fipsis not a real usable donor geography in the current buildtenureis scaffold fillerincomeis a derived convenience field, not an atomic donor targetemployment_statusis derived, not directly observedemployment_incomeis source-native but should not be used as a shared donor condition
CPS now resolves capabilities through the same registry path, so it also picks up generic variable semantics such as redundant dividend totals.
When adding a new source, prefer:
- Declare source-specific overrides in the source registry.
- Declare atomic-vs-derived relationships in variable semantics.
- Let the pipeline consume those capabilities generically.
Avoid:
- source-name
if/elsebranches in the donor path - learning overlapping derived variables independently
- using placeholder or derived variables as donor conditions just because they are numeric and present in both tables
- forcing tax-unit-native donor variables through person-native conditioning only because the seed frame happens to be person-indexed
- assuming every tax-unit-native variable should share the same household-only or person-level conditioning policy