Commit 3f1802a

Document build outputs postprocessor specs
1 parent 610cc70 commit 3f1802a

2 files changed

Lines changed: 175 additions & 0 deletions

docs/engineering/skills/README.md

Lines changed: 4 additions & 0 deletions
@@ -20,3 +20,7 @@ Current skills:
  pipeline status and durable error records.
- `testing.md`: test layout, fixture scope, helper placement, and quality guard
  expectations.

Stage-specific AI-facing engineering guides live under `docs/engineering/stages/`.
Use them alongside these cross-cutting skills when modifying a stage-specific
pipeline path.
Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
# Build Outputs Stage AI Guide

This guide is for AI agents and maintainers modifying Stage 4
(`4_build_outputs`) code. Stage 4 turns calibrated and staged pipeline artifacts
into publishable outputs, including local-area H5 files, national H5 files,
diagnostics, and release-staging artifacts.

The active local H5 seams live under `policyengine_us_data/build_outputs/`.
Treat this package as the place for reusable Stage 4 library boundaries. Keep
Modal orchestration, worker entrypoints, and release promotion behavior outside
these library seams unless a stage plan explicitly says otherwise.

## Local H5 Build Path

The transitional runtime entrypoint is still
`policyengine_us_data.calibration.publish_local_area.build_h5()`. It should stay
as a facade while Stage 4 is being migrated. New implementation logic should
move behind narrower build-output library seams instead of growing this facade.

The current in-memory local H5 path is:

1. `AreaSelector` selects active clone-household rows from clone weights and
   geography filters.
2. `EntityReindexer` creates output household, person, and subentity IDs.
3. `VariableCloner` copies allowed source variables into a period-grouped
   payload.
4. `LocalAreaDatasetBuilder` applies payload postprocessors in declared order.
5. `H5Writer` writes the final `H5Payload` and verifies summary counts.
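
The five steps above can be read as a linear composition. The sketch below is illustrative only: every function, class, and data shape in it is a hypothetical stand-in, and none of it reflects the real `AreaSelector`, `EntityReindexer`, `VariableCloner`, `LocalAreaDatasetBuilder`, or `H5Writer` signatures.

```python
# Hypothetical stand-ins for the five seams; the real signatures may differ.
from dataclasses import dataclass, field


@dataclass
class Payload:
    """Toy period-grouped payload: {variable: {period: list_of_values}}."""

    variables: dict = field(default_factory=dict)
    household_ids: list = field(default_factory=list)


def select_rows(clone_weights):
    # Stand-in for AreaSelector: keep clone-household rows with nonzero weight.
    return [i for i, weight in enumerate(clone_weights) if weight > 0]


def reindex(rows):
    # Stand-in for EntityReindexer: assign dense output household IDs.
    return {row: new_id for new_id, row in enumerate(rows)}


def clone_variables(source, rows, period):
    # Stand-in for VariableCloner: copy source values for the selected rows.
    return Payload(
        {name: {period: [values[r] for r in rows]} for name, values in source.items()}
    )


def apply_postprocessors(payload, postprocessors):
    # Stand-in for LocalAreaDatasetBuilder: run processors in declared order.
    for process in postprocessors:
        payload = process(payload)
    return payload


def summarize(payload):
    # Stand-in for the writer's summary check: count values per variable.
    return {
        name: sum(len(values) for values in by_period.values())
        for name, by_period in payload.variables.items()
    }


def build(source, clone_weights, period, postprocessors):
    rows = select_rows(clone_weights)
    new_ids = reindex(rows)
    payload = clone_variables(source, rows, period)
    payload.household_ids = [new_ids[r] for r in rows]
    payload = apply_postprocessors(payload, postprocessors)
    return payload, summarize(payload)
```

Keeping each step a separate callable is what lets a new behavior be assigned to exactly one concern instead of being added to the facade.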

When adding behavior to this path, decide whether it is a selection, reindexing,
source-variable cloning, postprocessing, or writing concern. Do not place
country-specific payload mutation in `build_h5()` when it can be represented as
a postprocessor.

## Payload Postprocessors

Payload postprocessors are ordered, country- or product-specific transformations
that consume an `H5Payload` and return either another `H5Payload` or a structured
result object exposing a `.payload` attribute.
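
One way a caller might normalize the two return shapes is sketched below. `H5Payload` here is a simplified stand-in for the real class, and `TakeupResult` and `unwrap_payload` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class H5Payload:
    """Simplified stand-in for the real H5Payload."""

    variables: dict


@dataclass(frozen=True)
class TakeupResult:
    """Hypothetical structured result: metadata plus the payload."""

    payload: H5Payload
    variables_written: tuple


def unwrap_payload(result):
    # Accept either a bare payload or a result object exposing `.payload`.
    if isinstance(result, H5Payload):
        return result
    payload = getattr(result, "payload", None)
    if not isinstance(payload, H5Payload):
        raise TypeError(
            f"postprocessor returned {type(result).__name__} without an H5Payload"
        )
    return payload
```

Failing loudly on any other return type keeps a misbehaving processor from silently dropping the payload.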

Use a postprocessor when the operation:

- Mutates or adds payload variables after generic source-variable cloning.
- Depends on country-specific business rules.
- Needs focused unit tests independent of the full H5 builder.
- Should run after some other payload construction step.

Do not use a postprocessor for:

- Selecting active clones. Use `AreaSelector`.
- Reindexing entities. Use `EntityReindexer`.
- Copying source variables unchanged. Use `VariableCloner`.
- Writing H5 files. Use `H5Writer`.
- Modal orchestration, volume setup, or publication promotion.

## Postprocessor Spec Contract

Every postprocessor should expose a stable `spec`:

```python
spec = PayloadPostProcessorSpec(
    key="stable_unique_key",
    requires=("upstream_key",),
)
```

The `key` is a durable identifier for the processing step. Prefer short,
stage-specific names such as `us_entity`, `us_geography`, or `us_takeup`.
Do not use display names, class names, or generated values as the key when the
processor is part of a stable runtime path.

The `requires` tuple lists postprocessor keys that must already have run. This
declares ordering explicitly. It is not a substitute for validating the concrete
payload fields the postprocessor consumes.

`LocalAreaDatasetBuilder` validates the configured postprocessor sequence before
building:

- Duplicate `spec.key` values are rejected.
- A postprocessor whose `requires` keys have not appeared earlier is rejected.
- Processors without an explicit `spec` receive a fallback key based on class
  name. This fallback is for tests or transitional code only; production
  postprocessors should define stable keys.
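
The three validation rules above can be sketched as a single pass over the configured sequence. This is an assumption-laden illustration rather than the builder's actual code: `PayloadPostProcessorSpec` is redefined here as a stand-in with only the two fields shown earlier, `validate_sequence` is a hypothetical helper, and both the `ValueError` type and the exact fallback key format are guesses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PayloadPostProcessorSpec:
    """Stand-in exposing only the two fields shown in this guide."""

    key: str
    requires: tuple = ()


def validate_sequence(postprocessors):
    """Return resolved keys in order, or raise ValueError on a bad sequence."""
    seen = []
    for processor in postprocessors:
        spec = getattr(processor, "spec", None)
        if spec is None:
            # Assumed fallback: derive a key from the class name.
            spec = PayloadPostProcessorSpec(key=type(processor).__name__)
        if spec.key in seen:
            raise ValueError(f"duplicate postprocessor key: {spec.key!r}")
        missing = [key for key in spec.requires if key not in seen]
        if missing:
            raise ValueError(f"{spec.key!r} requires {missing} to run earlier")
        seen.append(spec.key)
    return seen


# Toy processors that carry only a spec, using keys named in this guide.
class Entity:
    spec = PayloadPostProcessorSpec("us_entity")


class Geography:
    spec = PayloadPostProcessorSpec("us_geography")


class Takeup:
    spec = PayloadPostProcessorSpec("us_takeup", requires=("us_entity", "us_geography"))
```

Because the check runs before any payload is touched, a misordered configuration fails without doing any build work.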

If a processor consumes fields written by an earlier processor, declare the
dependency in `requires` and also validate the payload fields directly. The
dependency catches bad builder configuration early; payload validation catches
direct processor use and malformed payloads.
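
A direct payload check might look like the following sketch. The payload is modeled as a plain mapping of variable name to data, and `MissingPayloadFieldError`, `require_fields`, `apply_toy_takeup`, and the `takes_up_toy_program` variable are all hypothetical names, not part of the real package.

```python
class MissingPayloadFieldError(ValueError):
    """Hypothetical error type for a payload missing required variables."""


def require_fields(variables, required, *, processor_key):
    # `variables` is assumed to be a mapping of variable name -> data.
    missing = sorted(name for name in required if name not in variables)
    if missing:
        raise MissingPayloadFieldError(
            f"{processor_key}: payload is missing required fields {missing}"
        )


def apply_toy_takeup(variables):
    # Validate the payload directly, independent of builder ordering checks.
    require_fields(variables, ("state_fips",), processor_key="us_takeup")
    out = dict(variables)
    # Toy output variable sized to match the validated input field.
    out["takes_up_toy_program"] = [True] * len(variables["state_fips"])
    return out
```

This check still fires when the processor is called directly in a test or script, where no builder has validated `requires`.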

## Current US Postprocessors

The production US postprocessor sequence is defined by
`default_us_postprocessors()`:

1. `USEntityPostProcessor`
   - Key: `us_entity`
   - Dependencies: none
   - Adds output entity IDs and `household_weight`.

2. `USGeographyPostProcessor`
   - Key: `us_geography`
   - Dependencies: none
   - Derives geography from selected block GEOIDs and writes geography
     variables such as `state_fips`, `county_fips`, `zip_code`, and
     `congressional_district_geoid`.

3. `USTakeupPostProcessor`
   - Key: `us_takeup`
   - Dependencies: `us_entity`, `us_geography`
   - Applies take-up draws and writes take-up variables.
   - Validates that required reindexed subentities exist.
   - Validates that `state_fips` exists in the payload.
   - Validates that `person_tax_unit_id` and `tax_unit_id` exist when reported
     ACA anchors are present.

Keep this ordering unless you also update specs, structural validations, and
unit tests.

## Adding A Postprocessor

When adding a postprocessor:

1. Define a result dataclass if callers need metadata beyond the payload.
2. Define a stable `PayloadPostProcessorSpec`.
3. Add direct payload precondition checks for every field the processor consumes.
4. Preserve the incoming payload's `time_period`, `entity_lengths`, and
   `variable_entities` unless intentionally changing them.
5. When adding variables with non-obvious entity lengths, update
   `variable_entities` so `H5Payload` can validate their shapes.
6. Add the postprocessor to the production factory only if it belongs in the
   runtime path.
7. Add unit tests for the processor in `tests/unit/build_outputs/`.
8. Add or update a builder-order test if the processor has dependencies.

Prefer dependency injection for expensive or external behavior. For example,
`USTakeupPostProcessor` accepts a `takeup_applier` so unit tests can verify the
contract without loading rates or running the full pipeline.
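
The injection pattern can be sketched with a toy processor and a recording test double. Only the `takeup_applier` parameter name comes from this guide; the constructor shape, the callable's signature, the class names, and the `takes_up_toy_program` variable are assumptions made for illustration.

```python
class ToyTakeupPostProcessor:
    """Toy processor; the real constructor and call signature may differ."""

    def __init__(self, takeup_applier):
        # The expensive behavior is supplied by the caller, not built here.
        self._takeup_applier = takeup_applier

    def __call__(self, variables):
        out = dict(variables)
        out.update(self._takeup_applier(variables))
        return out


class RecordingApplier:
    """Test double that records inputs and returns fixed take-up values."""

    def __init__(self, result):
        self.result = result
        self.calls = []

    def __call__(self, variables):
        self.calls.append(sorted(variables))
        return self.result


applier = RecordingApplier({"takes_up_toy_program": [True, False]})
processor = ToyTakeupPostProcessor(takeup_applier=applier)
output = processor({"state_fips": [6, 36]})
```

A unit test can then assert on both `output` and `applier.calls` without loading any rates.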

## Testing Expectations

Unit tests should cover each new postprocessor directly. At minimum, test:

- The variables it writes.
- The payload fields it consumes.
- Its declared `spec.key` and `spec.requires` ordering.
- Failure for missing required payload fields.
- Failure for wrong-length generated arrays when the output entity is known.

Builder tests should cover:

- Missing dependency rejection.
- Duplicate postprocessor key rejection.
- Result recording through `PayloadPostProcessorRun`.

Integration tests should only be added when the behavior crosses module or
runtime boundaries that unit tests cannot represent. Do not add a second
integration test that proves the same seam.

## Documentation Expectations

When Stage 4 behavior changes, update the durable documentation surface:

- Add or update `@pipeline_node` metadata for new stable library seams.
- Update `docs/pipeline_map.yaml` when the stage graph or durable artifacts
  change.
- Keep generated docs out of manual PR edits unless the repository workflow
  specifically requires them.

Do not put PR-specific rationale in docstrings. Put durable behavior in source
docs and put review or migration rationale in PR descriptions, issues, or stage
planning docs.
