|
| 1 | +# Arch Target Gap Queue |
| 2 | + |
| 3 | +The Arch target gap queue is a Microplex-side review tool. It compares a |
| 4 | +Microplex target profile to a queryable Arch target DB and emits rows that help |
| 5 | +humans or agents decide what Arch source work is missing. |
| 6 | + |
| 7 | +The queue does not make Arch own Microplex target selection. Profile membership, |
| 8 | +source aging, reconciliation, activation, and model-variable aliases remain in |
| 9 | +`microplex-us`. |
| 10 | + |
| 11 | +## Boundary Rules |
| 12 | + |
| 13 | +- Arch stores publisher/source facts with provenance, constraints, periods, |
| 14 | + geography, and source lineage. |
| 15 | +- Arch should not duplicate a source fact only because Microplex names a model |
| 16 | + variable differently. |
| 17 | +- Microplex adapters may map one Arch source fact into simulator-specific target |
| 18 | + semantics. For example, Arch |
| 19 | + `irs_soi.returns_with_income_tax_after_credits` can satisfy the |
| 20 | + PolicyEngine `income_tax_positive` count target because SOI Table 1.1 reports |
| 21 | + the count of returns with positive income tax after credits. |
| 22 | +- A gap row is an authoring hint, not proof that a source exists. |
| 23 | +- Rows marked as source-mapping review or deprioritized must be reviewed before |
| 24 | + assigning loader work to agents. |
| 25 | + |
| 26 | +## Categories |
| 27 | + |
| 28 | +`gap_category` is the high-level agent-readiness taxonomy: |
| 29 | + |
| 30 | +| Category | Meaning | Default action | |
| 31 | +| --- | --- | --- | |
| 32 | +| `covered` | An Arch target record already satisfies the target cell. | No task. | |
| 33 | +| `ready_primary_loader` | The expected publisher source and Arch variable shape are known, but the record is missing. | Assign source-loader/spec work. | |
| 34 | +| `ready_rollup_or_geography` | The Arch variable exists but not at the requested geography. | Add rollup/geography records or review source geography. | |
| 35 | +| `adapter_or_constraint_review` | The Arch variable exists at the geography, but filters or adapter matching do not cover the cell. | Review constraints and adapter mapping. | |
| 36 | +| `source_mapping_review` | The queue cannot identify a defensible source fact or Arch variable shape. | Human source-mapping review first. | |
| 37 | +| `survey_or_model_input_deprioritized` | The cell is currently treated as a survey/model-input proxy rather than a primary administrative source task. | Defer unless a primary source is identified. | |
| 38 | + |
| 39 | +`loader_status` is the lower-level diagnostic used to derive the category. Use |
| 40 | +`gap_category` for agent routing and `loader_status` for debugging why a cell |
| 41 | +landed there. |
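
The routing rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the queue's actual implementation: `route_gap_row`, `DEFAULT_ACTIONS`, and `REVIEW_FIRST` are hypothetical names, and the only field names taken from this document are `gap_category` and `loader_status`. The default actions are copied from the table above, and the review-first set follows the last boundary rule.

```python
# Default actions copied from the category table in this document.
DEFAULT_ACTIONS = {
    "covered": "No task.",
    "ready_primary_loader": "Assign source-loader/spec work.",
    "ready_rollup_or_geography": (
        "Add rollup/geography records or review source geography."
    ),
    "adapter_or_constraint_review": "Review constraints and adapter mapping.",
    "source_mapping_review": "Human source-mapping review first.",
    "survey_or_model_input_deprioritized": (
        "Defer unless a primary source is identified."
    ),
}

# Categories that the boundary rules say must be reviewed before
# loader work is assigned to agents.
REVIEW_FIRST = {
    "source_mapping_review",
    "survey_or_model_input_deprioritized",
}


def route_gap_row(row: dict) -> dict:
    """Route one gap row by `gap_category`; keep `loader_status` for debugging.

    Hypothetical helper: the real queue may expose different fields.
    """
    category = row["gap_category"]
    if category not in DEFAULT_ACTIONS:
        raise ValueError(f"unknown gap_category: {category!r}")
    return {
        "gap_category": category,
        "default_action": DEFAULT_ACTIONS[category],
        "needs_review_first": category in REVIEW_FIRST,
        # Carried through only as a diagnostic, never used for routing.
        "loader_status": row.get("loader_status"),
    }
```

Keeping `loader_status` out of the routing decision mirrors the split described above: the category drives assignment, and the status only explains how a cell got there.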
| 42 | + |
| 43 | +## Current PolicyEngine Profile Boundary |
| 44 | + |
| 45 | +`pe_native_broad` keeps the raw PolicyEngine parity surface intact. It includes |
| 46 | +all currently tracked broad target cells, including survey/model-input rows and |
| 47 | +cells whose publisher-source semantics still need review. |
| 48 | + |
| 49 | +`pe_native_broad_source_backed` is the Arch-backed calibration/profile boundary. |
| 50 | +It excludes only cells with explicit reasons in |
| 51 | +`src/microplex_us/policyengine/target_profiles.py`, such as: |
| 52 | + |
| 53 | +- SOI multi-domain cells that would require joint AGI, filing status, and |
| 54 | + positive income-tax-before-credits facts not currently published by the loaded |
| 55 | + SOI packages |
| 56 | +- survey-heavy or model-input cells such as rent, child support, |
| 57 | + non-Part-B medical premium/expense components, SPM capped expenses, and |
| 58 | + `ssn_card_type` |
| 59 | +- source-near but non-equivalent rows such as `childcare_expenses`, where IRS |
| 60 | + credit expenses and W-2 dependent-care benefits are narrower tax concepts |
| 61 | +- pregnancy stock by state, where live births are a flow rather than a direct |
| 62 | + source fact for the PolicyEngine target |
| 63 | + |
| 64 | +## Current Local Snapshot |
| 65 | + |
| 66 | +Snapshot date: 2026-05-22. |
| 67 | + |
| 68 | +Inputs: |
| 69 | + |
| 70 | +- `/Users/maxghenis/CosilicoAI/arch/arch/fixtures/consumer_facts.jsonl` |
| 71 | +- `/Users/maxghenis/CosilicoAI/arch/macro/targets.db` |
| 72 | +- `/tmp/arch-suite-hhs-acf-tanf-caseload-2024/consumer_facts.jsonl` |
| 73 | +- `/tmp/arch-suite-soi-historic-table-2-2022/consumer_facts.jsonl` |
| 74 | +- `/tmp/arch-suite-hhs-acf-liheap-fy2024-national-profile/consumer_facts.jsonl` |
| 75 | +- `/tmp/arch-suite-soi-historic-table-2-state-agi-2022/consumer_facts.jsonl` |
| 76 | +- `/tmp/arch-suite-soi-w2-statistics-2020/consumer_facts.jsonl` |
| 77 | +- `/tmp/arch-suite-soi-table-1-4-2023/consumer_facts.jsonl` |
| 78 | +- `/tmp/arch-suite-federal-reserve-z1-household-net-worth/consumer_facts.jsonl` |
| 79 | +- `/tmp/arch-suite-cms-medicare-trustees-report-2025-part-b-premium-income/consumer_facts.jsonl` |
| 80 | + |
| 81 | +Command: |
| 82 | + |
| 83 | +```bash |
| 84 | +uv run --extra policyengine microplex-us-arch-target-refresh \ |
| 85 | + --arch-targets-db /Users/maxghenis/CosilicoAI/arch/arch/fixtures/consumer_facts.jsonl \ |
| 86 | + --arch-targets-db /Users/maxghenis/CosilicoAI/arch/macro/targets.db \ |
| 87 | + --arch-targets-db /tmp/arch-suite-hhs-acf-tanf-caseload-2024/consumer_facts.jsonl \ |
| 88 | + --arch-targets-db /tmp/arch-suite-soi-historic-table-2-2022/consumer_facts.jsonl \ |
| 89 | + --arch-targets-db /tmp/arch-suite-hhs-acf-liheap-fy2024-national-profile/consumer_facts.jsonl \ |
| 90 | + --arch-targets-db /tmp/arch-suite-soi-historic-table-2-state-agi-2022/consumer_facts.jsonl \ |
| 91 | + --arch-targets-db /tmp/arch-suite-soi-w2-statistics-2020/consumer_facts.jsonl \ |
| 92 | + --arch-targets-db /tmp/arch-suite-soi-table-1-4-2023/consumer_facts.jsonl \ |
| 93 | + --arch-targets-db /tmp/arch-suite-federal-reserve-z1-household-net-worth/consumer_facts.jsonl \ |
| 94 | + --arch-targets-db /tmp/arch-suite-cms-medicare-trustees-report-2025-part-b-premium-income/consumer_facts.jsonl \ |
| 95 | + --period 2024 \ |
| 96 | + --profile pe_native_broad_source_backed \ |
| 97 | + --output-dir artifacts/arch-target-coverage-source-backed |
| 98 | +``` |
| 99 | + |
| 100 | +Coverage: |
| 101 | + |
| 102 | +- 174 target cells in `pe_native_broad_source_backed` |
| 103 | +- 174 covered |
| 104 | +- 0 uncovered |
| 105 | +- 100.0% coverage |
| 106 | + |
| 107 | +The raw `pe_native_broad` profile has 174 of 189 cells covered. The remaining 15 |
| 108 | +explicitly reviewed rows sit outside the source-backed boundary and are broken |
| 109 | +down by category in the table below. Federal Reserve Z.1 household net worth and |
| 110 | +CMS Medicare Trustees Report Part B premium income are now source-backed. |
| 111 | + |
| 112 | +| Category | Rows | |
| 113 | +| --- | ---: | |
| 114 | +| `adapter_or_constraint_review` | 3 | |
| 115 | +| `source_mapping_review` | 2 | |
| 116 | +| `survey_or_model_input_deprioritized` | 10 | |
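
As a cross-check on the numbers above, the category counts can be tallied and turned into a coverage percentage. The sketch below is hedged: it assumes the gaps CSV has a header row with a `gap_category` column and that covered rows are included in the rows being tallied, neither of which this document confirms. The helper names are hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def load_rows(csv_path: str) -> list[dict]:
    """Read a gaps CSV into a list of dicts (assumes a header row)."""
    with Path(csv_path).open(newline="") as handle:
        return list(csv.DictReader(handle))


def tally_gap_categories(rows: list[dict]) -> Counter:
    """Count rows per `gap_category` value."""
    return Counter(row["gap_category"] for row in rows)


def coverage_percent(counts: Counter) -> float:
    """Share of rows whose category is `covered`, as a percentage."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * counts.get("covered", 0) / total


# Rebuild the raw-profile snapshot from the counts reported above:
# 174 covered cells plus the 15 reviewed rows in the table.
snapshot_rows = (
    [{"gap_category": "covered"}] * 174
    + [{"gap_category": "adapter_or_constraint_review"}] * 3
    + [{"gap_category": "source_mapping_review"}] * 2
    + [{"gap_category": "survey_or_model_input_deprioritized"}] * 10
)
snapshot_counts = tally_gap_categories(snapshot_rows)
```

With these counts, `sum(snapshot_counts.values())` equals 189, matching the raw `pe_native_broad` cell count, and `coverage_percent(snapshot_counts)` gives the raw-profile coverage implied by 174 of 189.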
| 117 | + |
| 118 | +Generated outputs: |
| 119 | + |
| 120 | +- `artifacts/arch-target-coverage-source-backed/pe_native_broad_source_backed_2024_coverage.json` |
| 121 | +- `artifacts/arch-target-coverage-source-backed/pe_native_broad_source_backed_2024_gaps.json` |
| 122 | +- `artifacts/arch-target-coverage-source-backed/pe_native_broad_source_backed_2024_gaps.csv` |
| 123 | +- `artifacts/arch-target-coverage-source-backed/pe_native_broad_source_backed_2024_summary.md` |
| 124 | +- `artifacts/arch-target-coverage-broad-plus/pe_native_broad_2024_coverage.json` |
| 125 | +- `artifacts/arch-target-coverage-broad-plus/pe_native_broad_2024_gaps.json` |
| 126 | +- `artifacts/arch-target-coverage-broad-plus/pe_native_broad_2024_gaps.csv` |
| 127 | +- `artifacts/arch-target-coverage-broad-plus/pe_native_broad_2024_summary.md` |
| 128 | + |
| 129 | +Remaining work is concentrated in: |
| 130 | + |
| 131 | +- the raw `pe_native_broad` cells excluded from the source-backed profile, if a |
| 132 | + future primary publisher source can support them without changing semantics |
| 133 | +- keeping the UK source-backed/raw boundary aligned with the same rule: leave |
| 134 | + raw PE target rows visible, and exclude only rows where source equivalence is |
| 135 | + not defensible |