This repository was archived by the owner on Jun 14, 2026. It is now read-only.

Commit b4140de

Add source-backed PE target profile
1 parent 7e22bb9 commit b4140de

3 files changed

Lines changed: 310 additions & 39 deletions

docs/arch-target-gap-queue.md

Lines changed: 52 additions & 32 deletions
@@ -40,65 +40,85 @@ source aging, reconciliation, activation, and model-variable aliases remain in
 `gap_category` for agent routing and `loader_status` for debugging why a cell
 landed there.
 
-## Current PolicyEngine Broad Profile Boundary
-
-The current Arch-backed PE broad profile coverage intentionally stops before
-survey-heavy or model-input cells such as rent, net worth, child support,
-medical-premium subcomponents, SPM expenses, and `ssn_card_type`. Those rows are
-not ready for automated source-loader agents under the primary-source-first
-policy.
+## Current PolicyEngine Profile Boundary
+
+`pe_native_broad` keeps the raw PolicyEngine parity surface intact. It includes
+all currently tracked broad target cells, including survey/model-input rows and
+cells whose publisher-source semantics still need review.
+
+`pe_native_broad_source_backed` is the Arch-backed calibration/profile boundary.
+It excludes only cells with explicit reasons in
+`src/microplex_us/policyengine/target_profiles.py`, such as:
+
+- SOI multi-domain cells that would require joint AGI, filing status, and
+  positive income-tax-before-credits facts not currently published by the loaded
+  SOI packages
+- survey-heavy or model-input cells such as rent, net worth, child support,
+  medical-premium subcomponents, SPM capped expenses, and `ssn_card_type`
+- source-near but non-equivalent rows such as `childcare_expenses`, where IRS
+  credit expenses and W-2 dependent-care benefits are narrower tax concepts
+- pregnancy stock by state, where live births are a flow rather than a direct
+  source fact for the PolicyEngine target
 
 ## Current Local Snapshot
 
-Snapshot date: 2026-05-19.
+Snapshot date: 2026-05-22.
 
 Inputs:
 
 - `/Users/maxghenis/CosilicoAI/arch/arch/fixtures/consumer_facts.jsonl`
 - `/Users/maxghenis/CosilicoAI/arch/macro/targets.db`
+- `/tmp/arch-suite-hhs-acf-tanf-caseload-2024/consumer_facts.jsonl`
+- `/tmp/arch-suite-soi-historic-table-2-2022/consumer_facts.jsonl`
+- `/tmp/arch-suite-hhs-acf-liheap-fy2024-national-profile/consumer_facts.jsonl`
+- `/tmp/arch-suite-soi-historic-table-2-state-agi-2022/consumer_facts.jsonl`
+- `/tmp/arch-suite-soi-w2-statistics-2020/consumer_facts.jsonl`
+- `/tmp/arch-suite-soi-table-1-4-2023/consumer_facts.jsonl`
 
 Command:
 
 ```bash
 uv run microplex-us-arch-target-refresh \
-  --artifact-root /Users/maxghenis/CosilicoAI/arch \
+  --arch-targets-db /Users/maxghenis/CosilicoAI/arch/arch/fixtures/consumer_facts.jsonl \
+  --arch-targets-db /Users/maxghenis/CosilicoAI/arch/macro/targets.db \
+  --arch-targets-db /tmp/arch-suite-hhs-acf-tanf-caseload-2024/consumer_facts.jsonl \
+  --arch-targets-db /tmp/arch-suite-soi-historic-table-2-2022/consumer_facts.jsonl \
+  --arch-targets-db /tmp/arch-suite-hhs-acf-liheap-fy2024-national-profile/consumer_facts.jsonl \
+  --arch-targets-db /tmp/arch-suite-soi-historic-table-2-state-agi-2022/consumer_facts.jsonl \
+  --arch-targets-db /tmp/arch-suite-soi-w2-statistics-2020/consumer_facts.jsonl \
+  --arch-targets-db /tmp/arch-suite-soi-table-1-4-2023/consumer_facts.jsonl \
   --period 2024 \
-  --profile pe_native_broad \
+  --profile pe_native_broad_source_backed \
   --output-dir artifacts/arch-target-coverage
 ```
 
 Coverage:
 
-- 189 target cells in `pe_native_broad`
-- 138 covered
-- 51 uncovered
-- 73.0% coverage
-- national: 79 of 116 covered
-- state: 59 of 73 covered
+- 172 target cells in `pe_native_broad_source_backed`
+- 172 covered
+- 0 uncovered
+- 100.0% coverage
 
-Gap categories:
+The raw `pe_native_broad` profile remains at 172 of 189 covered with 17
+explicitly reviewed rows outside the source-backed boundary:
 
 | Category | Rows |
 | --- | ---: |
-| `source_mapping_review` | 26 |
 | `survey_or_model_input_deprioritized` | 12 |
-| `adapter_or_constraint_review` | 10 |
-| `ready_rollup_or_geography` | 3 |
+| `adapter_or_constraint_review` | 3 |
+| `source_mapping_review` | 2 |
 
 Generated outputs:
 
-- `artifacts/arch-target-coverage/pe_native_broad_2024_coverage.json`
-- `artifacts/arch-target-coverage/pe_native_broad_2024_gaps.json`
-- `artifacts/arch-target-coverage/pe_native_broad_2024_gaps.csv`
-- `artifacts/arch-target-coverage/pe_native_broad_2024_summary.md`
+- `artifacts/arch-target-coverage/pe_native_broad_source_backed_2024_coverage.json`
+- `artifacts/arch-target-coverage/pe_native_broad_source_backed_2024_gaps.json`
+- `artifacts/arch-target-coverage/pe_native_broad_source_backed_2024_gaps.csv`
+- `artifacts/arch-target-coverage/pe_native_broad_source_backed_2024_summary.md`
 
 Remaining work is concentrated in:
 
-- source-mapping review for the newly expanded PE parity cells, especially
-  domains whose expected Arch concept is not yet encoded in the gap taxonomy
-- adapter or constraint review where Arch has the variable at the right
-  geography but the Microplex adapter does not yet match the PE target cell
-- a small rollup/geography queue for variables loaded in Arch but not at the
-  requested national or state target geography
-- survey/model-input proxy cells that remain deprioritized until a primary
-  publisher source is identified
+- the raw `pe_native_broad` cells excluded from the source-backed profile, if a
+  future primary publisher source can support them without changing semantics
+- UK profile parity, which should follow the same pattern: keep the raw PE
+  target surface intact and expose a source-backed profile with explicit
+  exclusions where source equivalence is not defensible
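
Reviewer note: the coverage figures in the updated `docs/arch-target-gap-queue.md` can be cross-checked with a few lines of arithmetic. The sketch below is not part of the commit; `coverage_pct` and the variable names are hypothetical, and the counts are copied from the doc text in this diff.

```python
# Hypothetical consistency check for the counts quoted in the doc diff.
# Nothing here is imported from microplex_us; all names are illustrative.


def coverage_pct(covered: int, total: int) -> float:
    """Return coverage as a percentage rounded to one decimal place."""
    return round(100 * covered / total, 1)


RAW_TOTAL = 189  # target cells in pe_native_broad

# Rows outside the source-backed boundary, by gap category (new table).
EXCLUDED_BY_CATEGORY = {
    "survey_or_model_input_deprioritized": 12,
    "adapter_or_constraint_review": 3,
    "source_mapping_review": 2,
}

excluded = sum(EXCLUDED_BY_CATEGORY.values())
source_backed_total = RAW_TOTAL - excluded

print(excluded, source_backed_total)
print(coverage_pct(source_backed_total, source_backed_total))
print(coverage_pct(138, RAW_TOTAL))  # previous snapshot: 138 of 189
```

The category rows sum to the 17 excluded cells, which leaves 172 cells in the source-backed profile, matching both the "172 of 189" statement and the 17 keys in `PE_NATIVE_BROAD_SOURCE_BACKED_EXCLUDED_CELL_REASONS`.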

src/microplex_us/policyengine/target_profiles.py

Lines changed: 212 additions & 7 deletions
@@ -23,6 +23,18 @@ def to_provider_filter(self) -> dict[str, str | None]:
         }
 
 
+PolicyEngineUSTargetCellKey = tuple[str, str | None, str | None, str | None]
+
+
+def _target_cell_key(cell: PolicyEngineUSTargetCell) -> PolicyEngineUSTargetCellKey:
+    return (
+        cell.variable,
+        cell.geo_level,
+        cell.domain_variable,
+        cell.geographic_id,
+    )
+
+
 PE_NATIVE_BROAD_TARGET_CELLS: tuple[PolicyEngineUSTargetCell, ...] = (
     PolicyEngineUSTargetCell(
         "aca_ptc", geo_level="national", domain_variable="aca_ptc"
@@ -694,18 +706,200 @@ def to_provider_filter(self) -> dict[str, str | None]:
 PE_NATIVE_BROAD_NO_STATE_ACA_TARGET_CELLS: tuple[PolicyEngineUSTargetCell, ...] = tuple(
     cell
     for cell in PE_NATIVE_BROAD_TARGET_CELLS
-    if (
-        cell.variable,
-        cell.geo_level,
-        cell.domain_variable,
-        cell.geographic_id,
-    )
-    not in _PE_NATIVE_BROAD_NO_STATE_ACA_EXCLUDED_CELLS
+    if _target_cell_key(cell) not in _PE_NATIVE_BROAD_NO_STATE_ACA_EXCLUDED_CELLS
+)
+
+PE_NATIVE_BROAD_SOURCE_BACKED_EXCLUDED_CELL_REASONS: dict[
+    PolicyEngineUSTargetCellKey,
+    str,
+] = {
+    (
+        "adjusted_gross_income",
+        "national",
+        "adjusted_gross_income,filing_status,income_tax_before_credits",
+        None,
+    ): (
+        "SOI source packages currently loaded by Arch do not publish adjusted "
+        "gross income jointly by AGI band, filing status, and returns with "
+        "positive income tax before credits."
+    ),
+    (
+        "adjusted_gross_income",
+        "national",
+        "adjusted_gross_income,income_tax_before_credits",
+        None,
+    ): (
+        "SOI source packages currently loaded by Arch publish AGI bands and "
+        "income-tax-before-credits returns separately, not AGI amounts "
+        "restricted to returns with positive income tax before credits."
+    ),
+    (
+        "tax_unit_count",
+        "national",
+        "adjusted_gross_income,filing_status,income_tax_before_credits",
+        None,
+    ): (
+        "SOI Historic Table 2 does not provide the full AGI by filing-status "
+        "by positive-income-tax-before-credits joint count required by this "
+        "PolicyEngine cell."
+    ),
+    (
+        "person_count",
+        "national",
+        "ssn_card_type",
+        None,
+    ): (
+        "PolicyEngine ssn_card_type is a modeled legal-status input; no "
+        "accepted primary aggregate source mapping is encoded for Arch."
+    ),
+    (
+        "person_count",
+        "state",
+        "is_pregnant",
+        None,
+    ): (
+        "The PolicyEngine cell is a pregnancy stock by state; live births are "
+        "a flow and are not a defensible direct source fact for this target."
+    ),
+    (
+        "alimony_expense",
+        "national",
+        None,
+        None,
+    ): (
+        "No accepted primary source mapping is encoded for this "
+        "survey/model-input expense variable."
+    ),
+    (
+        "child_support_expense",
+        "national",
+        None,
+        None,
+    ): (
+        "No accepted primary source mapping is encoded for this "
+        "survey/model-input expense variable."
+    ),
+    (
+        "child_support_received",
+        "national",
+        None,
+        None,
+    ): (
+        "No accepted primary source mapping is encoded for this "
+        "survey/model-input receipt variable."
+    ),
+    (
+        "childcare_expenses",
+        "national",
+        None,
+        None,
+    ): (
+        "IRS child-care credit expenses and W-2 dependent-care benefits are "
+        "narrower tax concepts than PolicyEngine childcare_expenses, so they "
+        "are not treated as source-equivalent."
+    ),
+    (
+        "health_insurance_premiums_without_medicare_part_b",
+        "national",
+        None,
+        None,
+    ): (
+        "This premium component is a modeled/survey input; no accepted primary "
+        "aggregate source mapping is encoded for Arch."
+    ),
+    (
+        "medicare_part_b_premiums",
+        "national",
+        None,
+        None,
+    ): (
+        "PolicyEngine Medicare Part B premiums depend on person-level "
+        "enrollment and IRMAA status; no accepted aggregate source fact is "
+        "encoded for this modeled input."
+    ),
+    (
+        "net_worth",
+        "national",
+        None,
+        None,
+    ): (
+        "Net worth is a wealth survey/model input; no accepted primary "
+        "administrative aggregate source mapping is encoded for Arch."
+    ),
+    (
+        "other_medical_expenses",
+        "national",
+        None,
+        None,
+    ): (
+        "This out-of-pocket medical expense component is a survey/model input "
+        "without an accepted primary aggregate source mapping."
+    ),
+    (
+        "over_the_counter_health_expenses",
+        "national",
+        None,
+        None,
+    ): (
+        "This out-of-pocket medical expense component is a survey/model input "
+        "without an accepted primary aggregate source mapping."
+    ),
+    (
+        "rent",
+        "national",
+        None,
+        None,
+    ): (
+        "PolicyEngine rent is a household survey/model input; ACS rent tables "
+        "do not provide a direct aggregate source fact for this exact variable."
+    ),
+    (
+        "spm_unit_capped_housing_subsidy",
+        "national",
+        None,
+        None,
+    ): (
+        "This is a capped SPM model amount rather than a direct publisher "
+        "source fact."
+    ),
+    (
+        "spm_unit_capped_work_childcare_expenses",
+        "national",
+        None,
+        None,
+    ): (
+        "This is a capped SPM model amount rather than a direct publisher "
+        "source fact."
+    ),
+}
+
+PE_NATIVE_BROAD_SOURCE_BACKED_TARGET_CELLS: tuple[
+    PolicyEngineUSTargetCell, ...
+] = tuple(
+    cell
+    for cell in PE_NATIVE_BROAD_TARGET_CELLS
+    if _target_cell_key(cell)
+    not in PE_NATIVE_BROAD_SOURCE_BACKED_EXCLUDED_CELL_REASONS
 )
 
 _TARGET_PROFILES: dict[str, tuple[PolicyEngineUSTargetCell, ...]] = {
     "pe_native_broad": PE_NATIVE_BROAD_TARGET_CELLS,
     "pe_native_broad_no_state_aca": PE_NATIVE_BROAD_NO_STATE_ACA_TARGET_CELLS,
+    "pe_native_broad_source_backed": PE_NATIVE_BROAD_SOURCE_BACKED_TARGET_CELLS,
+}
+
+_TARGET_PROFILE_EXCLUSION_REASONS: dict[
+    str,
+    dict[PolicyEngineUSTargetCellKey, str],
+] = {
+    "pe_native_broad": {},
+    "pe_native_broad_no_state_aca": {
+        cell_key: "State ACA cells are excluded from this profile variant."
+        for cell_key in _PE_NATIVE_BROAD_NO_STATE_ACA_EXCLUDED_CELLS
+    },
+    "pe_native_broad_source_backed": (
+        PE_NATIVE_BROAD_SOURCE_BACKED_EXCLUDED_CELL_REASONS
+    ),
 }
 
 
@@ -723,3 +917,14 @@ def resolve_policyengine_us_target_profile(
         raise ValueError(
             f"Unknown PolicyEngine US target profile '{name}'. Known profiles: {known}"
         ) from exc
+
+
+def policyengine_us_target_profile_exclusion_reasons(
+    name: str,
+) -> dict[PolicyEngineUSTargetCellKey, str]:
+    if name not in _TARGET_PROFILES:
+        known = ", ".join(policyengine_us_target_profile_names())
+        raise ValueError(
+            f"Unknown PolicyEngine US target profile '{name}'. Known profiles: {known}"
+        )
+    return dict(_TARGET_PROFILE_EXCLUSION_REASONS.get(name, {}))