Commit 937404a

Merge pull request #1038 from PolicyEngine/agent/stage-1/pr-1-specs-foundation
Add Stage 1 dataset build specs
2 parents 1017bb8 + ec1b9c7 commit 937404a

16 files changed

Lines changed: 760 additions & 230 deletions

changelog.d/1036.added

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Added canonical Stage 1 dataset-build substep and artifact specifications.

docs/engineering/pipeline-map.md

Lines changed: 16 additions & 0 deletions
@@ -1362,6 +1362,22 @@ def impute_source_variables(data: Dict[str, Dict[int, np.ndarray]], state_fips:
 
 Re-impute ACS/SIPP/ORG/SCF variables from donor surveys.
 
+### `policyengine_us_data.build_datasets.artifacts.stage_1_artifact_specs`
+
+```python
+def stage_1_artifact_specs() -> tuple[DatasetArtifactSpec, ...]
+```
+
+Return all artifact specs known to the Stage 1 dataset build.
+
+### `policyengine_us_data.build_datasets.specs.stage_1_step_specs`
+
+```python
+def stage_1_step_specs() -> tuple[DatasetBuildStepSpec, ...]
+```
+
+Return the canonical Stage 1 dataset-build substage specs.
+
 ### `policyengine_us_data.calibration.unified_matrix_builder.UnifiedMatrixBuilder`
 
 ```python

docs/engineering/skills/documentation_review.md

Lines changed: 3 additions & 3 deletions
@@ -44,9 +44,9 @@ Check that changed pipeline behavior has a durable documentation surface:
 - Edges describe real data, artifact, validation, or orchestration relationships.
 - `status` and `stability` values are honest for transitional code.
 - `validation_commands` are focused and point to existing tests or scripts.
-- Generated docs build when decorator, Pydoc, or map source changes. PRs do not
-  need to refresh checked-in generated artifacts manually; the push workflow
-  publishes those artifacts from automation.
+- Generated docs build when decorator, Pydoc, or map source changes. PRs that
+  change decorator metadata, Pydoc-facing source, or `docs/pipeline_map.yaml`
+  should refresh the checked-in generated artifacts in the same change.
 - Stale architecture names, folder names, and artifact names are not preserved in
   durable documentation sources or generated output.
 

docs/engineering/skills/pipeline_docs.md

Lines changed: 13 additions & 8 deletions
@@ -16,10 +16,10 @@ those flows.
 - `docs/engineering/pipeline-map.md`
 
 The generated JSON and Markdown files are published artifacts, not hand-authored
-source. PRs should update decorators, docstrings, and `docs/pipeline_map.yaml`;
-CI checks that the generated artifacts build. On pushes to `main`, automation
-regenerates and commits the published artifacts with the version/changelog
-commit.
+source. PRs should update decorators, docstrings, and `docs/pipeline_map.yaml`,
+then regenerate the checked-in artifacts in the same change so reviewers see the
+pipeline docs that will ship. On pushes to `main`, automation may refresh those
+artifacts again with the version/changelog commit.
 
 ## Annotation Rules
 
@@ -50,10 +50,15 @@ waypoint is being migrated, set `status="transitional"` and use
 
 ## Update Workflow
 
-After adding or changing annotations or `docs/pipeline_map.yaml`, rely on the PR
-`Pipeline docs build` check to prove the generated artifacts can be produced. To
-inspect the generated outputs locally without touching tracked files, write them
-to a temporary directory:
+After adding or changing annotations or `docs/pipeline_map.yaml`, regenerate the
+tracked pipeline docs:
+
+```bash
+uv run --no-sync --with pyyaml python scripts/extract_pipeline_docs.py
+```
+
+If you only need to inspect the generated outputs locally without touching
+tracked files, write them to a temporary directory:
 
 ```bash
 out_dir="$(mktemp -d)"
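
The "regenerate in the same change" rule in this file could also be checked mechanically by comparing freshly generated outputs against the tracked copies. The following is only a sketch under stated assumptions: the helper name `stale_artifacts` and the two-directory layout are hypothetical and not part of this commit, and it assumes the generator has already written its outputs to a scratch directory.

```python
import filecmp
import tempfile
from pathlib import Path


def stale_artifacts(tracked_dir: Path, generated_dir: Path, names: list[str]) -> list[str]:
    """Return names whose tracked copy differs from, or cannot be compared to, the fresh copy."""
    _match, mismatch, errors = filecmp.cmpfiles(
        tracked_dir, generated_dir, names, shallow=False
    )
    # `errors` holds names that could not be compared, for example a missing file.
    return sorted(mismatch + errors)


# Self-contained demonstration with throwaway directories (contents are made up).
with tempfile.TemporaryDirectory() as tracked, tempfile.TemporaryDirectory() as fresh:
    for name, old, new in [
        ("pipeline_map.json", "{}", "{}"),
        ("pipeline_api.json", '{"line": 569}', '{"line": 536}'),
    ]:
        Path(tracked, name).write_text(old)
        Path(fresh, name).write_text(new)
    stale = stale_artifacts(
        Path(tracked), Path(fresh), ["pipeline_map.json", "pipeline_api.json"]
    )
    print(stale)
```

A non-empty result would mean the checked-in artifacts were not refreshed alongside the source change.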

docs/generated/pipeline_api.json

Lines changed: 78 additions & 2 deletions
@@ -474,7 +474,7 @@
   "docstring": "Build all datasets with preemption-resilient checkpointing.\n\nArgs:\n upload: Whether to upload completed datasets.\n branch: Git branch to build from.\n sequential: Use sequential (non-parallel) execution.\n clear_checkpoints: Clear existing checkpoints before starting.\n skip_tests: Skip running the test suite (useful for calibration runs).\n skip_enhanced_cps: Skip enhanced_cps.py and small_enhanced_cps.py\n (useful for calibration runs that only need source_imputed H5).\n skip_stage_5: Skip source-imputed CPS and small enhanced CPS after\n enhanced_cps_2024.h5 is built.\n stage_only: Upload to HF staging only, without promoting a release.\n version: policyengine-us-data package version used for staging and\n dataset-build contracts.",
   "id": "build_datasets",
   "kind": "function",
-  "line": 569,
+  "line": 536,
   "metadata": {
     "api_refs": [
       "modal_app.data_build.build_datasets"
@@ -999,7 +999,7 @@
   "docstring": "Build CPS before PUF because PUF pension imputation loads CPS_2024.",
   "id": "cps_puf_build_phase",
   "kind": "function",
-  "line": 437,
+  "line": 404,
   "metadata": {
     "api_refs": [
       "modal_app.data_build.run_cps_then_puf_phase"
@@ -3463,6 +3463,82 @@
   "signature": "def reconcile_ss_subcomponents(data: Dict[str, Dict[int, np.ndarray]], n_cps: int, time_period: int) -> None",
   "source_file": "policyengine_us_data/calibration/puf_impute.py"
 },
+"stage_1_dataset_artifact_specs": {
+  "docstring": "Return all artifact specs known to the Stage 1 dataset build.",
+  "id": "stage_1_dataset_artifact_specs",
+  "kind": "function",
+  "line": 230,
+  "metadata": {
+    "api_refs": [
+      "policyengine_us_data.build_datasets.artifacts.stage_1_artifact_specs"
+    ],
+    "artifacts_out": [
+      "uprating_factors.csv",
+      "acs_2022.h5",
+      "irs_puf_2015.h5",
+      "cps_2024.h5",
+      "puf_2024.h5",
+      "extended_cps_2024.h5",
+      "enhanced_cps_2024.h5",
+      "enhanced_cps_2024.clone_diagnostics.json",
+      "calibration_log.csv",
+      "stratified_extended_cps_2024.h5",
+      "source_imputed_stratified_extended_cps_2024.h5",
+      "small_enhanced_cps_2024.h5",
+      "source_imputed_stratified_extended_cps.h5",
+      "policy_data.db",
+      "build_log.txt",
+      "data_build_checkpoint_stats.json"
+    ],
+    "description": "Canonical artifact inventory for Stage 1 dataset-build outputs.",
+    "id": "stage_1_dataset_artifact_specs",
+    "label": "Stage 1 Dataset Artifact Specs",
+    "node_type": "library",
+    "pathways": [
+      "data_build",
+      "stage_contracts",
+      "pipeline_docs"
+    ],
+    "source_file": "policyengine_us_data/build_datasets/artifacts.py",
+    "stability": "stable",
+    "status": "current",
+    "validation_commands": [
+      "uv run pytest tests/unit/test_build_dataset_specs.py"
+    ]
+  },
+  "object_path": "policyengine_us_data.build_datasets.artifacts.stage_1_artifact_specs",
+  "signature": "def stage_1_artifact_specs() -> tuple[DatasetArtifactSpec, ...]",
+  "source_file": "policyengine_us_data/build_datasets/artifacts.py"
+},
+"stage_1_dataset_build_specs": {
+  "docstring": "Return the canonical Stage 1 dataset-build substage specs.",
+  "id": "stage_1_dataset_build_specs",
+  "kind": "function",
+  "line": 87,
+  "metadata": {
+    "api_refs": [
+      "policyengine_us_data.build_datasets.specs.stage_1_step_specs"
+    ],
+    "description": "Canonical substage taxonomy for Stage 1 dataset-build contracts, step manifests, and pipeline documentation.",
+    "id": "stage_1_dataset_build_specs",
+    "label": "Stage 1 Dataset Build Specs",
+    "node_type": "library",
+    "pathways": [
+      "data_build",
+      "stage_contracts",
+      "pipeline_docs"
+    ],
+    "source_file": "policyengine_us_data/build_datasets/specs.py",
+    "stability": "stable",
+    "status": "current",
+    "validation_commands": [
+      "uv run pytest tests/unit/test_build_dataset_specs.py"
+    ]
+  },
+  "object_path": "policyengine_us_data.build_datasets.specs.stage_1_step_specs",
+  "signature": "def stage_1_step_specs() -> tuple[DatasetBuildStepSpec, ...]",
+  "source_file": "policyengine_us_data/build_datasets/specs.py"
+},
 "staging_upload": {
   "docstring": "Upload files to HuggingFace staging only.\n\nGCS is updated during promote_publish, not here.\nPromote must be run separately via promote_publish.",
   "id": "staging_upload",

docs/generated/pipeline_map.json

Lines changed: 58 additions & 0 deletions
@@ -1498,6 +1498,64 @@
     "uv run pytest tests/unit/calibration/test_source_impute.py"
   ]
 },
+{
+  "api_refs": [
+    "policyengine_us_data.build_datasets.artifacts.stage_1_artifact_specs"
+  ],
+  "artifacts_out": [
+    "uprating_factors.csv",
+    "acs_2022.h5",
+    "irs_puf_2015.h5",
+    "cps_2024.h5",
+    "puf_2024.h5",
+    "extended_cps_2024.h5",
+    "enhanced_cps_2024.h5",
+    "enhanced_cps_2024.clone_diagnostics.json",
+    "calibration_log.csv",
+    "stratified_extended_cps_2024.h5",
+    "source_imputed_stratified_extended_cps_2024.h5",
+    "small_enhanced_cps_2024.h5",
+    "source_imputed_stratified_extended_cps.h5",
+    "policy_data.db",
+    "build_log.txt",
+    "data_build_checkpoint_stats.json"
+  ],
+  "description": "Canonical artifact inventory for Stage 1 dataset-build outputs.",
+  "id": "stage_1_dataset_artifact_specs",
+  "label": "Stage 1 Dataset Artifact Specs",
+  "node_type": "library",
+  "pathways": [
+    "data_build",
+    "stage_contracts",
+    "pipeline_docs"
+  ],
+  "source_file": "policyengine_us_data/build_datasets/artifacts.py",
+  "stability": "stable",
+  "status": "current",
+  "validation_commands": [
+    "uv run pytest tests/unit/test_build_dataset_specs.py"
+  ]
+},
+{
+  "api_refs": [
+    "policyengine_us_data.build_datasets.specs.stage_1_step_specs"
+  ],
+  "description": "Canonical substage taxonomy for Stage 1 dataset-build contracts, step manifests, and pipeline documentation.",
+  "id": "stage_1_dataset_build_specs",
+  "label": "Stage 1 Dataset Build Specs",
+  "node_type": "library",
+  "pathways": [
+    "data_build",
+    "stage_contracts",
+    "pipeline_docs"
+  ],
+  "source_file": "policyengine_us_data/build_datasets/specs.py",
+  "stability": "stable",
+  "status": "current",
+  "validation_commands": [
+    "uv run pytest tests/unit/test_build_dataset_specs.py"
+  ]
+},
 {
   "api_refs": [
     "policyengine_us_data.calibration.unified_matrix_builder.UnifiedMatrixBuilder"

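The two nodes added to `pipeline_map.json` share one shape, so a reviewer could sanity-check new nodes with a few lines of Python. This is a minimal sketch, not the project's validator: the `REQUIRED_KEYS` set and the `node_problems` helper are assumptions chosen for illustration, and the node below is an abridged copy of the step-specs node from the hunk above.

```python
import json

# Assumed set of keys every node should carry; the real schema may differ.
REQUIRED_KEYS = {"id", "label", "node_type", "source_file", "stability", "status"}

node = json.loads(
    """
    {
      "api_refs": ["policyengine_us_data.build_datasets.specs.stage_1_step_specs"],
      "id": "stage_1_dataset_build_specs",
      "label": "Stage 1 Dataset Build Specs",
      "node_type": "library",
      "pathways": ["data_build", "stage_contracts", "pipeline_docs"],
      "source_file": "policyengine_us_data/build_datasets/specs.py",
      "stability": "stable",
      "status": "current"
    }
    """
)


def node_problems(node: dict) -> list[str]:
    """List missing required keys and any artifact name that appears twice."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - node.keys())]
    artifacts = node.get("artifacts_out", [])
    duplicates = sorted({name for name in artifacts if artifacts.count(name) > 1})
    problems += [f"duplicate artifact: {name}" for name in duplicates]
    return problems


print(node_problems(node))
```

The duplicate check is worth having because `artifacts_out` lists both `source_imputed_stratified_extended_cps_2024.h5` and `source_imputed_stratified_extended_cps.h5`, which are distinct names that are easy to confuse by eye.
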
modal_app/data_build.py

Lines changed: 4 additions & 37 deletions
@@ -22,6 +22,7 @@
 
 from modal_app.images import cpu_image as image  # noqa: E402
 from policyengine_us_data.__version__ import __version__ as DATA_PACKAGE_VERSION  # noqa: E402
+from policyengine_us_data.build_datasets import stage_1_script_outputs  # noqa: E402
 from policyengine_us_data.pipeline_metadata import pipeline_node  # noqa: E402
 from policyengine_us_data.pipeline_schema import PipelineNode  # noqa: E402
 from policyengine_us_data.stage_contracts import (  # noqa: E402
@@ -95,43 +96,9 @@ def snapshot(self) -> dict[str, int]:
         }
 
 
-# Script to output file mapping for checkpointing
-# Values can be a single file path (str) or a list of file paths
-SCRIPT_OUTPUTS = {
-    "policyengine_us_data/utils/uprating.py": (
-        "policyengine_us_data/storage/uprating_factors.csv"
-    ),
-    "policyengine_us_data/datasets/acs/acs.py": (
-        "policyengine_us_data/storage/acs_2022.h5"
-    ),
-    "policyengine_us_data/datasets/puf/irs_puf.py": (
-        "policyengine_us_data/storage/irs_puf_2015.h5"
-    ),
-    "policyengine_us_data/datasets/cps/cps.py": (
-        "policyengine_us_data/storage/cps_2024.h5"
-    ),
-    "policyengine_us_data/datasets/puf/puf.py": (
-        "policyengine_us_data/storage/puf_2024.h5"
-    ),
-    "policyengine_us_data/datasets/cps/extended_cps.py": (
-        "policyengine_us_data/storage/extended_cps_2024.h5"
-    ),
-    # enhanced_cps.py produces both the dataset and calibration log
-    "policyengine_us_data/datasets/cps/enhanced_cps.py": [
-        "policyengine_us_data/storage/enhanced_cps_2024.h5",
-        "policyengine_us_data/storage/enhanced_cps_2024.clone_diagnostics.json",
-        "calibration_log.csv",
-    ],
-    "policyengine_us_data/calibration/create_stratified_cps.py": (
-        "policyengine_us_data/storage/stratified_extended_cps_2024.h5"
-    ),
-    "policyengine_us_data/calibration/create_source_imputed_cps.py": (
-        "policyengine_us_data/storage/source_imputed_stratified_extended_cps_2024.h5"
-    ),
-    "policyengine_us_data/datasets/cps/small_enhanced_cps.py": (
-        "policyengine_us_data/storage/small_enhanced_cps_2024.h5"
-    ),
-}
+# Script to output file mapping for checkpointing.
+# Values can be a single file path (str) or a list of file paths.
+SCRIPT_OUTPUTS = stage_1_script_outputs()
 
 CPS_BUILD_SCRIPT = "policyengine_us_data/datasets/cps/cps.py"
 PUF_BUILD_SCRIPT = "policyengine_us_data/datasets/puf/puf.py"
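
The body of `stage_1_script_outputs()` is not shown in this diff. One plausible way to derive the removed literal from artifact specs is to group artifact paths by the script that produces them. The sketch below assumes exactly that: `ArtifactSpecSketch`, its `path` and `producer_script` fields, and `script_outputs_sketch` are hypothetical names, and treating `build_log.txt` as having no producing script is also an assumption. The paths are copied from the removed `SCRIPT_OUTPUTS` dict.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class ArtifactSpecSketch:
    """Hypothetical stand-in for DatasetArtifactSpec; both fields are assumed."""

    path: str
    producer_script: Optional[str] = None  # None: not written by a build script


ENHANCED_CPS = "policyengine_us_data/datasets/cps/enhanced_cps.py"
CPS = "policyengine_us_data/datasets/cps/cps.py"

SPECS = (
    ArtifactSpecSketch("policyengine_us_data/storage/cps_2024.h5", CPS),
    ArtifactSpecSketch("policyengine_us_data/storage/enhanced_cps_2024.h5", ENHANCED_CPS),
    ArtifactSpecSketch(
        "policyengine_us_data/storage/enhanced_cps_2024.clone_diagnostics.json",
        ENHANCED_CPS,
    ),
    ArtifactSpecSketch("calibration_log.csv", ENHANCED_CPS),
    ArtifactSpecSketch("build_log.txt"),  # assumed to have no producing script
)


def script_outputs_sketch(specs) -> dict[str, Union[str, list[str]]]:
    """Group artifact paths by producing script, preserving spec order."""
    grouped: dict[str, list[str]] = {}
    for spec in specs:
        if spec.producer_script is not None:
            grouped.setdefault(spec.producer_script, []).append(spec.path)
    # Keep the old literal's shape: a bare string for one output, a list for several.
    return {
        script: paths[0] if len(paths) == 1 else paths
        for script, paths in grouped.items()
    }


outputs = script_outputs_sketch(SPECS)
print(outputs[CPS])
```

Keeping the "string or list" value shape means existing checkpoint code that reads `SCRIPT_OUTPUTS` would not need to change, which matches the comment retained in the new code.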

modal_app/step_manifests/specs.py

Lines changed: 9 additions & 37 deletions
@@ -5,6 +5,11 @@
 from dataclasses import dataclass
 from typing import TypeAlias
 
+from policyengine_us_data.build_datasets import (
+    STAGE_1_BUILD_DATASETS,
+    STAGE_1_BUILD_STEP_SPECS,
+)
+
 
 @dataclass(frozen=True)
 class PipelineSubstepSpec:
@@ -32,44 +37,11 @@ def _substep(id: str, title: str, parent_id: str) -> PipelineSubstepSpec:
 
 
 BUILD_DATASETS = PipelineStepSpec(
-    id="1_build_datasets",
+    id=STAGE_1_BUILD_DATASETS,
     title="Build datasets",
-    substeps=(
-        _substep(
-            "1a_raw_data_download",
-            "Raw data download",
-            "1_build_datasets",
-        ),
-        _substep(
-            "1b_base_dataset_construction",
-            "Base dataset construction",
-            "1_build_datasets",
-        ),
-        _substep(
-            "1c_extended_cps_puf_clone",
-            "Extended CPS PUF clone",
-            "1_build_datasets",
-        ),
-        _substep(
-            "1d_enhanced_cps_reweighting",
-            "Enhanced CPS reweighting",
-            "1_build_datasets",
-        ),
-        _substep(
-            "1e_stratified_cps",
-            "Stratified CPS",
-            "1_build_datasets",
-        ),
-        _substep(
-            "1f_source_imputation",
-            "Source imputation",
-            "1_build_datasets",
-        ),
-        _substep(
-            "1g_stage_base_datasets",
-            "Stage base datasets",
-            "1_build_datasets",
-        ),
+    substeps=tuple(
+        _substep(spec.id, spec.title, spec.parent_id)
+        for spec in STAGE_1_BUILD_STEP_SPECS
     ),
 )
 RAW_DATA_DOWNLOAD = BUILD_DATASETS.substeps[0]
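
Because `RAW_DATA_DOWNLOAD = BUILD_DATASETS.substeps[0]` indexes by position, the refactor only preserves behavior if `STAGE_1_BUILD_STEP_SPECS` keeps the order of the removed literal. The sketch below illustrates that dependency under stated assumptions: `StepSpecSketch` and `SubstepSketch` are hypothetical stand-ins for `DatasetBuildStepSpec` and `PipelineSubstepSpec`, and the ids and titles are copied from the removed lines on the assumption that the canonical specs carry the same values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepSpecSketch:
    """Hypothetical stand-in for DatasetBuildStepSpec."""

    id: str
    title: str
    parent_id: str


@dataclass(frozen=True)
class SubstepSketch:
    """Hypothetical stand-in for PipelineSubstepSpec."""

    id: str
    title: str
    parent_id: str


PARENT = "1_build_datasets"  # assumed value of STAGE_1_BUILD_DATASETS

STEP_SPECS = (
    StepSpecSketch("1a_raw_data_download", "Raw data download", PARENT),
    StepSpecSketch("1b_base_dataset_construction", "Base dataset construction", PARENT),
    StepSpecSketch("1c_extended_cps_puf_clone", "Extended CPS PUF clone", PARENT),
    StepSpecSketch("1d_enhanced_cps_reweighting", "Enhanced CPS reweighting", PARENT),
    StepSpecSketch("1e_stratified_cps", "Stratified CPS", PARENT),
    StepSpecSketch("1f_source_imputation", "Source imputation", PARENT),
    StepSpecSketch("1g_stage_base_datasets", "Stage base datasets", PARENT),
)

# Same derivation as the new `substeps=tuple(...)` expression in the diff.
substeps = tuple(
    SubstepSketch(spec.id, spec.title, spec.parent_id) for spec in STEP_SPECS
)

# Positional lookup, as in `RAW_DATA_DOWNLOAD = BUILD_DATASETS.substeps[0]`.
raw_data_download = substeps[0]
print(raw_data_download.id)
```

A unit test that pins the first and last ids would catch an accidental reordering of the canonical tuple.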
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+"""Canonical Stage 1 dataset-build specifications."""
+
+from .artifacts import (
+    DatasetArtifactSpec,
+    STAGE_1_ARTIFACT_SPECS,
+    stage_1_artifact_specs,
+    stage_1_contract_artifact_specs,
+    stage_1_script_outputs,
+)
+from .specs import (
+    DatasetBuildStepSpec,
+    STAGE_1_BUILD_DATASETS,
+    STAGE_1_BUILD_STEP_SPECS,
+    stage_1_step_specs,
+)
+
+__all__ = [
+    "DatasetArtifactSpec",
+    "DatasetBuildStepSpec",
+    "STAGE_1_ARTIFACT_SPECS",
+    "STAGE_1_BUILD_DATASETS",
+    "STAGE_1_BUILD_STEP_SPECS",
+    "stage_1_artifact_specs",
+    "stage_1_contract_artifact_specs",
+    "stage_1_script_outputs",
+    "stage_1_step_specs",
+]
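
A package that re-exports names through `__all__` can drift: a name listed there but never imported only fails on `from package import *`. A small guard for that could look like the sketch below. The helper `missing_exports` is hypothetical and not part of this commit, and the demonstration uses a throwaway module object rather than the real package so that it runs anywhere.

```python
import types


def missing_exports(module: types.ModuleType) -> list[str]:
    """Return names listed in `__all__` that the module does not actually define."""
    return [
        name for name in getattr(module, "__all__", []) if not hasattr(module, name)
    ]


# Throwaway module standing in for a re-exporting package.
demo = types.ModuleType("build_datasets_demo")
demo.stage_1_step_specs = lambda: ()
demo.__all__ = ["stage_1_step_specs", "stage_1_script_outputs"]

missing = missing_exports(demo)
print(missing)
```

In a real test the same helper could be pointed at the result of `importlib.import_module("policyengine_us_data.build_datasets")` and asserted to return an empty list.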
