| 1 | +# Build Outputs Stage AI Guide |
| 2 | + |
| 3 | +This guide is for AI agents and maintainers modifying Stage 4 |
| 4 | +(`4_build_outputs`) code. Stage 4 turns calibrated and staged pipeline artifacts |
| 5 | +into publishable outputs, including local-area H5 files, national H5 files, |
| 6 | +diagnostics, and release-staging artifacts. |
| 7 | + |
| 8 | +The active local H5 seams live under `policyengine_us_data/build_outputs/`. |
| 9 | +Treat this package as the place for reusable Stage 4 library boundaries. Keep |
| 10 | +Modal orchestration, worker entrypoints, and release promotion behavior outside |
| 11 | +these library seams unless a stage plan explicitly says otherwise. |
| 12 | + |
| 13 | +## Local H5 Build Path |
| 14 | + |
| 15 | +The transitional runtime entrypoint is still |
| 16 | +`policyengine_us_data.calibration.publish_local_area.build_h5()`. It should stay |
| 17 | +as a facade while Stage 4 is being migrated. New implementation logic should |
| 18 | +move behind narrower build-output library seams instead of growing this facade. |
| 19 | + |
| 20 | +The current in-memory local H5 path is: |
| 21 | + |
| 22 | +1. `AreaSelector` selects active clone-household rows from clone weights and |
| 23 | + geography filters. |
| 24 | +2. `EntityReindexer` creates output household, person, and subentity IDs. |
| 25 | +3. `VariableCloner` copies allowed source variables into a period-grouped |
| 26 | + payload. |
| 27 | +4. `LocalAreaDatasetBuilder` applies payload postprocessors in declared order. |
| 28 | +5. `H5Writer` writes the final `H5Payload` and verifies summary counts. |
| 29 | + |
| 30 | +When adding behavior to this path, decide whether it is a selection, reindexing, |
| 31 | +source-variable cloning, postprocessing, or writing concern. Do not place |
| 32 | +country-specific payload mutation in `build_h5()` when it can be represented as |
| 33 | +a postprocessor. |
| 34 | + |
| 35 | +## Payload Postprocessors |
| 36 | + |
| 37 | +Payload postprocessors are ordered, country- or product-specific transformations |
| 38 | +that consume an `H5Payload` and return either another `H5Payload` or a structured |
| 39 | +result object exposing a `.payload` attribute. |
| 40 | + |
| 41 | +Use a postprocessor when the operation: |
| 42 | + |
| 43 | +- Mutates or adds payload variables after generic source-variable cloning. |
| 44 | +- Depends on country-specific business rules. |
| 45 | +- Needs focused unit tests independent of the full H5 builder. |
| 46 | +- Must run after a specific earlier payload construction step. |
| 47 | + |
| 48 | +Do not use a postprocessor for: |
| 49 | + |
| 50 | +- Selecting active clones. Use `AreaSelector`. |
| 51 | +- Reindexing entities. Use `EntityReindexer`. |
| 52 | +- Copying source variables unchanged. Use `VariableCloner`. |
| 53 | +- Writing H5 files. Use `H5Writer`. |
| 54 | +- Modal orchestration, volume setup, or publication promotion. |
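
The return contract described in this section (a bare payload, or a result object exposing `.payload`) can be illustrated with a minimal sketch. `Payload`, `TakeupResult`, `unwrap_payload`, `apply_in_order`, and the `takes_up` variable are hypothetical stand-ins invented for this example; they are not the real `H5Payload` or builder API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Payload:
    """Hypothetical stand-in for H5Payload: variable name -> values."""

    variables: Dict[str, List[Any]] = field(default_factory=dict)


@dataclass(frozen=True)
class TakeupResult:
    """Hypothetical structured result: metadata plus the updated payload."""

    payload: Payload
    variables_written: tuple


def unwrap_payload(result: Any) -> Payload:
    """Accept either a bare payload or an object exposing `.payload`."""
    if isinstance(result, Payload):
        return result
    payload = getattr(result, "payload", None)
    if isinstance(payload, Payload):
        return payload
    raise TypeError(f"{type(result).__name__} is not a payload and has no .payload")


def apply_in_order(
    payload: Payload, postprocessors: Sequence[Callable[[Payload], Any]]
) -> Payload:
    """Run each postprocessor in declared order, unwrapping its result."""
    for process in postprocessors:
        payload = unwrap_payload(process(payload))
    return payload


def add_weight(payload: Payload) -> Payload:
    # Returns a bare payload.
    return Payload({**payload.variables, "household_weight": [1.0, 2.0]})


def add_takeup(payload: Payload) -> TakeupResult:
    # Returns a structured result that wraps the payload.
    updated = Payload({**payload.variables, "takes_up": [True, False]})
    return TakeupResult(payload=updated, variables_written=("takes_up",))


final = apply_in_order(Payload(), [add_weight, add_takeup])
```

Unwrapping in one place keeps each postprocessor free to return extra metadata without the caller needing to know which form it chose.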
| 55 | + |
| 56 | +## Postprocessor Spec Contract |
| 57 | + |
| 58 | +Every postprocessor should expose a stable `spec`: |
| 59 | + |
| 60 | +```python |
| 61 | +spec = PayloadPostProcessorSpec( |
| 62 | + key="stable_unique_key", |
| 63 | + requires=("upstream_key",), |
| 64 | +) |
| 65 | +``` |
| 66 | + |
| 67 | +The `key` is a durable identifier for the processing step. Prefer short, |
| 68 | +stage-specific names such as `us_entity`, `us_geography`, or `us_takeup`. |
| 69 | +Do not use display names, class names, or generated values as the key when the |
| 70 | +processor is part of a stable runtime path. |
| 71 | + |
| 72 | +The `requires` tuple lists postprocessor keys that must already have run. This |
| 73 | +declares ordering explicitly. It is not a substitute for validating the concrete |
| 74 | +payload fields the postprocessor consumes. |
| 75 | + |
| 76 | +`LocalAreaDatasetBuilder` validates the configured postprocessor sequence before |
| 77 | +building: |
| 78 | + |
| 79 | +- Duplicate `spec.key` values are rejected. |
| 80 | +- A postprocessor whose `requires` keys have not appeared earlier is rejected. |
| 81 | +- Processors without an explicit `spec` receive a fallback key based on class |
| 82 | + name. This fallback is for tests or transitional code only; production |
| 83 | + postprocessors should define stable keys. |
| 84 | + |
| 85 | +If a processor consumes fields written by an earlier processor, define both the |
| 86 | +dependency and a payload validation. The dependency catches bad builder |
| 87 | +configuration early; payload validation catches direct processor use and |
| 88 | +malformed payloads. |
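
The two sequence checks listed above (duplicate keys and unmet `requires`) could look roughly like the following. This is a sketch under assumptions: `Spec` and `validate_sequence` are hypothetical names standing in for `PayloadPostProcessorSpec` and whatever validation `LocalAreaDatasetBuilder` actually performs.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Spec:
    """Hypothetical stand-in for PayloadPostProcessorSpec."""

    key: str
    requires: Tuple[str, ...] = ()


def validate_sequence(specs: Iterable[Spec]) -> None:
    """Reject duplicate keys and dependencies that have not run earlier."""
    seen = set()
    for spec in specs:
        if spec.key in seen:
            raise ValueError(f"duplicate postprocessor key: {spec.key}")
        missing = [key for key in spec.requires if key not in seen]
        if missing:
            raise ValueError(f"{spec.key} requires {missing} to run earlier")
        seen.add(spec.key)


# Keys and dependencies mirror the US sequence documented in this guide.
us_sequence = [
    Spec("us_entity"),
    Spec("us_geography"),
    Spec("us_takeup", requires=("us_entity", "us_geography")),
]
validate_sequence(us_sequence)  # no exception: the order satisfies every dependency
```

Because the check only looks at keys seen so far, moving `us_takeup` ahead of either dependency fails before any payload is built.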
| 89 | + |
| 90 | +## Current US Postprocessors |
| 91 | + |
| 92 | +The production US postprocessor sequence is defined by |
| 93 | +`default_us_postprocessors()`: |
| 94 | + |
| 95 | +1. `USEntityPostProcessor` |
| 96 | + - Key: `us_entity` |
| 97 | + - Dependencies: none |
| 98 | + - Adds output entity IDs and `household_weight`. |
| 99 | + |
| 100 | +2. `USGeographyPostProcessor` |
| 101 | + - Key: `us_geography` |
| 102 | + - Dependencies: none |
| 103 | + - Derives geography from selected block GEOIDs and writes geography |
| 104 | + variables such as `state_fips`, `county_fips`, `zip_code`, and |
| 105 | + `congressional_district_geoid`. |
| 106 | + |
| 107 | +3. `USTakeupPostProcessor` |
| 108 | + - Key: `us_takeup` |
| 109 | + - Dependencies: `us_entity`, `us_geography` |
| 110 | + - Applies take-up draws and writes take-up variables. |
| 111 | + - Validates that required reindexed subentities exist. |
| 112 | + - Validates that `state_fips` exists in the payload. |
| 113 | + - Validates that `person_tax_unit_id` and `tax_unit_id` exist when reported |
| 114 | + ACA anchors are present. |
| 115 | + |
| 116 | +Keep this ordering unless you also update specs, structural validations, and |
| 117 | +unit tests. |
| 118 | + |
| 119 | +## Adding A Postprocessor |
| 120 | + |
| 121 | +When adding a postprocessor: |
| 122 | + |
| 123 | +1. Define a result dataclass if callers need metadata beyond the payload. |
| 124 | +2. Define a stable `PayloadPostProcessorSpec`. |
| 125 | +3. Add direct payload precondition checks for every field the processor consumes. |
| 126 | +4. Preserve the incoming payload's `time_period`, `entity_lengths`, and |
| 127 | + `variable_entities` unless intentionally changing them. |
| 128 | +5. When adding variables with non-obvious entity lengths, update |
| 129 | + `variable_entities` so `H5Payload` can validate their shapes. |
| 130 | +6. Add the postprocessor to the production factory only if it belongs in the |
| 131 | + runtime path. |
| 132 | +7. Add unit tests for the processor in `tests/unit/build_outputs/`. |
| 133 | +8. Add or update a builder-order test if the processor has dependencies. |
| 134 | + |
| 135 | +Prefer dependency injection for expensive or external behavior. For example, |
| 136 | +`USTakeupPostProcessor` accepts a `takeup_applier` so unit tests can verify the |
| 137 | +contract without loading rates or running the full pipeline. |
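
As an illustration of combining dependency injection with direct precondition checks, here is a minimal sketch. `TakeupStep`, `fake_applier`, and the `takes_up` variable are hypothetical; the real `USTakeupPostProcessor` signature and payload type may differ, and a plain dict stands in for the payload.

```python
from typing import Callable, Dict, List

Variables = Dict[str, List]


class TakeupStep:
    """Hypothetical processor; not the real USTakeupPostProcessor."""

    key = "us_takeup"
    required_fields = ("state_fips",)

    def __init__(self, takeup_applier: Callable[[Variables], Variables]):
        # The expensive behavior is injected so tests can replace it.
        self._apply = takeup_applier

    def __call__(self, variables: Variables) -> Variables:
        # Precondition check: fail early if a consumed field is absent.
        missing = [name for name in self.required_fields if name not in variables]
        if missing:
            raise ValueError(f"{self.key} is missing payload fields: {missing}")
        # Return a new mapping instead of mutating the input.
        return {**variables, **self._apply(variables)}


def fake_applier(variables: Variables) -> Variables:
    """Test double: deterministic output, no rates loaded."""
    return {"takes_up": [True] * len(variables["state_fips"])}


step = TakeupStep(takeup_applier=fake_applier)
result = step({"state_fips": [6, 36]})
```

With the applier injected, a unit test can assert on both the written variables and the missing-field failure without touching the full pipeline.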
| 138 | + |
| 139 | +## Testing Expectations |
| 140 | + |
| 141 | +Unit tests should cover each new postprocessor directly. At minimum, test: |
| 142 | + |
| 143 | +- The variables it writes. |
| 144 | +- The payload fields it consumes. |
| 145 | +- Its declared `spec.key` and `spec.requires` ordering. |
| 146 | +- Failure for missing required payload fields. |
| 147 | +- Failure for wrong-length generated arrays when the output entity is known. |
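
The wrong-length case in the list above might be exercised along these lines. This sketch assumes `variable_entities` maps variable names to entity names and `entity_lengths` maps entity names to row counts; `check_shapes` is a hypothetical helper, not the validation that `H5Payload` actually implements.

```python
from typing import Dict, Sequence


def check_shapes(
    variables: Dict[str, Sequence],
    variable_entities: Dict[str, str],
    entity_lengths: Dict[str, int],
) -> None:
    """Raise if a variable with a known entity has the wrong length."""
    for name, values in variables.items():
        entity = variable_entities.get(name)
        if entity is None:
            continue  # entity unknown, so the length cannot be checked
        expected = entity_lengths[entity]
        if len(values) != expected:
            raise ValueError(
                f"{name}: expected {expected} {entity} values, got {len(values)}"
            )


def test_wrong_length_is_rejected() -> None:
    entity_lengths = {"household": 2, "person": 3}
    variable_entities = {"household_weight": "household"}
    # Correct length passes silently.
    check_shapes({"household_weight": [1.0, 2.0]}, variable_entities, entity_lengths)
    # A person-length array on a household variable must fail.
    try:
        check_shapes(
            {"household_weight": [1.0, 2.0, 3.0]}, variable_entities, entity_lengths
        )
    except ValueError:
        return
    raise AssertionError("expected ValueError for wrong-length array")


test_wrong_length_is_rejected()
```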
| 148 | + |
| 149 | +Builder tests should cover: |
| 150 | + |
| 151 | +- Missing dependency rejection. |
| 152 | +- Duplicate postprocessor key rejection. |
| 153 | +- Result recording through `PayloadPostProcessorRun`. |
| 154 | + |
| 155 | +Integration tests should only be added when the behavior crosses module or |
| 156 | +runtime boundaries that unit tests cannot represent. Do not add a second |
| 157 | +integration test that proves the same seam. |
| 158 | + |
| 159 | +## Documentation Expectations |
| 160 | + |
| 161 | +When Stage 4 behavior changes, update the durable documentation surface: |
| 162 | + |
| 163 | +- Add or update `@pipeline_node` metadata for new stable library seams. |
| 164 | +- Update `docs/pipeline_map.yaml` when the stage graph or durable artifacts |
| 165 | + change. |
| 166 | +- Keep generated docs out of manual PR edits unless the repository workflow |
| 167 | + specifically requires them. |
| 168 | + |
| 169 | +Do not put PR-specific rationale in docstrings. Put durable behavior in source |
| 170 | +docs and put review or migration rationale in PR descriptions, issues, or stage |
| 171 | +planning docs. |