Add source-backed PE target profile

MaxGhenis · MaxGhenis · commit b4140de395b6 · 2026-05-21T21:57:01.000-04:00
diff --git a/docs/arch-target-gap-queue.md b/docs/arch-target-gap-queue.md
@@ -40,65 +40,85 @@ source aging, reconciliation, activation, and model-variable aliases remain in
 `gap_category` for agent routing and `loader_status` for debugging why a cell
 landed there.
 
-## Current PolicyEngine Broad Profile Boundary
-
-The current Arch-backed PE broad profile coverage intentionally stops before
-survey-heavy or model-input cells such as rent, net worth, child support,
-medical-premium subcomponents, SPM expenses, and `ssn_card_type`. Those rows are
-not ready for automated source-loader agents under the primary-source-first
-policy.
+## Current PolicyEngine Profile Boundary
+
+`pe_native_broad` keeps the raw PolicyEngine parity surface intact. It includes
+all currently tracked broad target cells, including survey/model-input rows and
+cells whose publisher-source semantics still need review.
+
+`pe_native_broad_source_backed` is the Arch-backed calibration/profile boundary.
+It excludes only cells with explicit reasons in
+`src/microplex_us/policyengine/target_profiles.py`, such as:
+
+- SOI multi-domain cells that would require joint AGI, filing status, and
+  positive income-tax-before-credits facts not currently published by the loaded
+  SOI packages
+- survey-heavy or model-input cells such as rent, net worth, child support,
+  medical-premium subcomponents, SPM capped expenses, and `ssn_card_type`
+- source-near but non-equivalent rows such as `childcare_expenses`, where IRS
+  credit expenses and W-2 dependent-care benefits are narrower tax concepts
+- pregnancy stock by state, where live births are a flow rather than a direct
+  source fact for the PolicyEngine target
 
 ## Current Local Snapshot
 
-Snapshot date: 2026-05-19.
+Snapshot date: 2026-05-22.
 
 Inputs:
 
 - `/Users/maxghenis/CosilicoAI/arch/arch/fixtures/consumer_facts.jsonl`
 - `/Users/maxghenis/CosilicoAI/arch/macro/targets.db`
+- `/tmp/arch-suite-hhs-acf-tanf-caseload-2024/consumer_facts.jsonl`
+- `/tmp/arch-suite-soi-historic-table-2-2022/consumer_facts.jsonl`
+- `/tmp/arch-suite-hhs-acf-liheap-fy2024-national-profile/consumer_facts.jsonl`
+- `/tmp/arch-suite-soi-historic-table-2-state-agi-2022/consumer_facts.jsonl`
+- `/tmp/arch-suite-soi-w2-statistics-2020/consumer_facts.jsonl`
+- `/tmp/arch-suite-soi-table-1-4-2023/consumer_facts.jsonl`
 
 Command:
 
 ```bash
 uv run microplex-us-arch-target-refresh \
-  --artifact-root /Users/maxghenis/CosilicoAI/arch \
+  --arch-targets-db /Users/maxghenis/CosilicoAI/arch/arch/fixtures/consumer_facts.jsonl \
+  --arch-targets-db /Users/maxghenis/CosilicoAI/arch/macro/targets.db \
+  --arch-targets-db /tmp/arch-suite-hhs-acf-tanf-caseload-2024/consumer_facts.jsonl \
+  --arch-targets-db /tmp/arch-suite-soi-historic-table-2-2022/consumer_facts.jsonl \
+  --arch-targets-db /tmp/arch-suite-hhs-acf-liheap-fy2024-national-profile/consumer_facts.jsonl \
+  --arch-targets-db /tmp/arch-suite-soi-historic-table-2-state-agi-2022/consumer_facts.jsonl \
+  --arch-targets-db /tmp/arch-suite-soi-w2-statistics-2020/consumer_facts.jsonl \
+  --arch-targets-db /tmp/arch-suite-soi-table-1-4-2023/consumer_facts.jsonl \
   --period 2024 \
-  --profile pe_native_broad \
+  --profile pe_native_broad_source_backed \
   --output-dir artifacts/arch-target-coverage
 ```
 
 Coverage:
 
-- 189 target cells in `pe_native_broad`
-- 138 covered
-- 51 uncovered
-- 73.0% coverage
-- national: 79 of 116 covered
-- state: 59 of 73 covered
+- 172 target cells in `pe_native_broad_source_backed`
+- 172 covered
+- 0 uncovered
+- 100.0% coverage
 
-Gap categories:
+The raw `pe_native_broad` profile stands at 172 of 189 covered, with 17
+explicitly reviewed rows outside the source-backed boundary:
 
 | Category | Rows |
 | --- | ---: |
-| `source_mapping_review` | 26 |
 | `survey_or_model_input_deprioritized` | 12 |
-| `adapter_or_constraint_review` | 10 |
-| `ready_rollup_or_geography` | 3 |
+| `adapter_or_constraint_review` | 3 |
+| `source_mapping_review` | 2 |
 
 Generated outputs:
 
-- `artifacts/arch-target-coverage/pe_native_broad_2024_coverage.json`
-- `artifacts/arch-target-coverage/pe_native_broad_2024_gaps.json`
-- `artifacts/arch-target-coverage/pe_native_broad_2024_gaps.csv`
-- `artifacts/arch-target-coverage/pe_native_broad_2024_summary.md`
+- `artifacts/arch-target-coverage/pe_native_broad_source_backed_2024_coverage.json`
+- `artifacts/arch-target-coverage/pe_native_broad_source_backed_2024_gaps.json`
+- `artifacts/arch-target-coverage/pe_native_broad_source_backed_2024_gaps.csv`
+- `artifacts/arch-target-coverage/pe_native_broad_source_backed_2024_summary.md`
 
 Remaining work is concentrated in:
 
-- source-mapping review for the newly expanded PE parity cells, especially
-  domains whose expected Arch concept is not yet encoded in the gap taxonomy
-- adapter or constraint review where Arch has the variable at the right
-  geography but the Microplex adapter does not yet match the PE target cell
-- a small rollup/geography queue for variables loaded in Arch but not at the
-  requested national or state target geography
-- survey/model-input proxy cells that remain deprioritized until a primary
-  publisher source is identified
+- the raw `pe_native_broad` cells excluded from the source-backed profile, if a
+  future primary publisher source can support them without changing semantics
+- UK profile parity, which should follow the same pattern: keep the raw PE
+  target surface intact and expose a source-backed profile with explicit
+  exclusions where source equivalence is not defensible
diff --git a/src/microplex_us/policyengine/target_profiles.py b/src/microplex_us/policyengine/target_profiles.py
@@ -23,6 +23,18 @@ def to_provider_filter(self) -> dict[str, str | None]:
         }
 
 
+PolicyEngineUSTargetCellKey = tuple[str, str | None, str | None, str | None]
+
+
+def _target_cell_key(cell: PolicyEngineUSTargetCell) -> PolicyEngineUSTargetCellKey:
+    return (
+        cell.variable,
+        cell.geo_level,
+        cell.domain_variable,
+        cell.geographic_id,
+    )
+
+
 PE_NATIVE_BROAD_TARGET_CELLS: tuple[PolicyEngineUSTargetCell, ...] = (
     PolicyEngineUSTargetCell(
         "aca_ptc", geo_level="national", domain_variable="aca_ptc"
@@ -694,18 +706,200 @@ def to_provider_filter(self) -> dict[str, str | None]:
 PE_NATIVE_BROAD_NO_STATE_ACA_TARGET_CELLS: tuple[PolicyEngineUSTargetCell, ...] = tuple(
     cell
     for cell in PE_NATIVE_BROAD_TARGET_CELLS
-    if (
-        cell.variable,
-        cell.geo_level,
-        cell.domain_variable,
-        cell.geographic_id,
-    )
-    not in _PE_NATIVE_BROAD_NO_STATE_ACA_EXCLUDED_CELLS
+    if _target_cell_key(cell) not in _PE_NATIVE_BROAD_NO_STATE_ACA_EXCLUDED_CELLS
+)
+
+PE_NATIVE_BROAD_SOURCE_BACKED_EXCLUDED_CELL_REASONS: dict[
+    PolicyEngineUSTargetCellKey,
+    str,
+] = {
+    (
+        "adjusted_gross_income",
+        "national",
+        "adjusted_gross_income,filing_status,income_tax_before_credits",
+        None,
+    ): (
+        "SOI source packages currently loaded by Arch do not publish adjusted "
+        "gross income jointly by AGI band, filing status, and returns with "
+        "positive income tax before credits."
+    ),
+    (
+        "adjusted_gross_income",
+        "national",
+        "adjusted_gross_income,income_tax_before_credits",
+        None,
+    ): (
+        "SOI source packages currently loaded by Arch publish AGI bands and "
+        "income-tax-before-credits returns separately, not AGI amounts "
+        "restricted to returns with positive income tax before credits."
+    ),
+    (
+        "tax_unit_count",
+        "national",
+        "adjusted_gross_income,filing_status,income_tax_before_credits",
+        None,
+    ): (
+        "SOI Historic Table 2 does not provide the full AGI by filing-status "
+        "by positive-income-tax-before-credits joint count required by this "
+        "PolicyEngine cell."
+    ),
+    (
+        "person_count",
+        "national",
+        "ssn_card_type",
+        None,
+    ): (
+        "PolicyEngine ssn_card_type is a modeled legal-status input; no "
+        "accepted primary aggregate source mapping is encoded for Arch."
+    ),
+    (
+        "person_count",
+        "state",
+        "is_pregnant",
+        None,
+    ): (
+        "The PolicyEngine cell is a pregnancy stock by state; live births are "
+        "a flow and are not a defensible direct source fact for this target."
+    ),
+    (
+        "alimony_expense",
+        "national",
+        None,
+        None,
+    ): (
+        "No accepted primary source mapping is encoded for this "
+        "survey/model-input expense variable."
+    ),
+    (
+        "child_support_expense",
+        "national",
+        None,
+        None,
+    ): (
+        "No accepted primary source mapping is encoded for this "
+        "survey/model-input expense variable."
+    ),
+    (
+        "child_support_received",
+        "national",
+        None,
+        None,
+    ): (
+        "No accepted primary source mapping is encoded for this "
+        "survey/model-input receipt variable."
+    ),
+    (
+        "childcare_expenses",
+        "national",
+        None,
+        None,
+    ): (
+        "IRS child-care credit expenses and W-2 dependent-care benefits are "
+        "narrower tax concepts than PolicyEngine childcare_expenses, so they "
+        "are not treated as source-equivalent."
+    ),
+    (
+        "health_insurance_premiums_without_medicare_part_b",
+        "national",
+        None,
+        None,
+    ): (
+        "This premium component is a modeled/survey input; no accepted primary "
+        "aggregate source mapping is encoded for Arch."
+    ),
+    (
+        "medicare_part_b_premiums",
+        "national",
+        None,
+        None,
+    ): (
+        "PolicyEngine Medicare Part B premiums depend on person-level "
+        "enrollment and IRMAA status; no accepted aggregate source fact is "
+        "encoded for this modeled input."
+    ),
+    (
+        "net_worth",
+        "national",
+        None,
+        None,
+    ): (
+        "Net worth is a wealth survey/model input; no accepted primary "
+        "administrative aggregate source mapping is encoded for Arch."
+    ),
+    (
+        "other_medical_expenses",
+        "national",
+        None,
+        None,
+    ): (
+        "This out-of-pocket medical expense component is a survey/model input "
+        "without an accepted primary aggregate source mapping."
+    ),
+    (
+        "over_the_counter_health_expenses",
+        "national",
+        None,
+        None,
+    ): (
+        "This out-of-pocket medical expense component is a survey/model input "
+        "without an accepted primary aggregate source mapping."
+    ),
+    (
+        "rent",
+        "national",
+        None,
+        None,
+    ): (
+        "PolicyEngine rent is a household survey/model input; ACS rent tables "
+        "do not provide a direct aggregate source fact for this exact variable."
+    ),
+    (
+        "spm_unit_capped_housing_subsidy",
+        "national",
+        None,
+        None,
+    ): (
+        "This is a capped SPM model amount rather than a direct publisher "
+        "source fact."
+    ),
+    (
+        "spm_unit_capped_work_childcare_expenses",
+        "national",
+        None,
+        None,
+    ): (
+        "This is a capped SPM model amount rather than a direct publisher "
+        "source fact."
+    ),
+}
+
+PE_NATIVE_BROAD_SOURCE_BACKED_TARGET_CELLS: tuple[
+    PolicyEngineUSTargetCell, ...
+] = tuple(
+    cell
+    for cell in PE_NATIVE_BROAD_TARGET_CELLS
+    if _target_cell_key(cell)
+    not in PE_NATIVE_BROAD_SOURCE_BACKED_EXCLUDED_CELL_REASONS
 )
 
 _TARGET_PROFILES: dict[str, tuple[PolicyEngineUSTargetCell, ...]] = {
     "pe_native_broad": PE_NATIVE_BROAD_TARGET_CELLS,
     "pe_native_broad_no_state_aca": PE_NATIVE_BROAD_NO_STATE_ACA_TARGET_CELLS,
+    "pe_native_broad_source_backed": PE_NATIVE_BROAD_SOURCE_BACKED_TARGET_CELLS,
+}
+
+_TARGET_PROFILE_EXCLUSION_REASONS: dict[
+    str,
+    dict[PolicyEngineUSTargetCellKey, str],
+] = {
+    "pe_native_broad": {},
+    "pe_native_broad_no_state_aca": {
+        cell_key: "State ACA cells are excluded from this profile variant."
+        for cell_key in _PE_NATIVE_BROAD_NO_STATE_ACA_EXCLUDED_CELLS
+    },
+    "pe_native_broad_source_backed": (
+        PE_NATIVE_BROAD_SOURCE_BACKED_EXCLUDED_CELL_REASONS
+    ),
 }
 
 
@@ -723,3 +917,14 @@ def resolve_policyengine_us_target_profile(
         raise ValueError(
             f"Unknown PolicyEngine US target profile '{name}'. Known profiles: {known}"
         ) from exc
+
+
+def policyengine_us_target_profile_exclusion_reasons(
+    name: str,
+) -> dict[PolicyEngineUSTargetCellKey, str]:
+    if name not in _TARGET_PROFILES:
+        known = ", ".join(policyengine_us_target_profile_names())
+        raise ValueError(
+            f"Unknown PolicyEngine US target profile '{name}'. Known profiles: {known}"
+        )
+    return dict(_TARGET_PROFILE_EXCLUSION_REASONS.get(name, {}))
diff --git a/tests/policyengine/test_target_profiles.py b/tests/policyengine/test_target_profiles.py