| 1 | +# Stage 1: Build Datasets |
| 2 | + |
| 3 | +Stage 1 builds the public dataset artifacts consumed by later pipeline stages. |
| 4 | +Its public status boundary is organized around the `1a_` through `1f_` |
| 5 | +substeps, while the transitional Modal runtime still executes several |
| 6 | +command-backed units inside some of those public substeps. |
| 7 | + |
| 8 | +## Rerun And Reuse Model |
| 9 | + |
| 10 | +Checkpoint reuse has two gates: |
| 11 | + |
| 12 | +- The semantic gate compares current `Stage1IdentityMaterial` with a persisted |
| 13 | + identity from the checkpoint-scoped Stage 1 reuse manifest. |
| 14 | +- The physical gate verifies that every expected checkpoint output exists and is |
| 15 | + non-empty before a unit is restored. |
| 16 | + |
| 17 | +The physical checkpoint layout remains `/checkpoints/{branch}/{commit_sha}`. |
| 18 | +The Stage 1 reuse manifest is adapter state in that same scope. Missing, |
| 19 | +malformed, or unreadable manifest content must fail closed to recompute; it must |
| 20 | +not authorize reuse by itself. |
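The two gates and the fail-closed manifest rule can be sketched as follows. This is a minimal illustration, not the adapter's actual code: the function names, the manifest shape (a JSON object mapping a key to a record with an `identity` field), and the use of a string identity are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Iterable, Optional


def load_reuse_manifest(path: Path) -> Optional[dict]:
    """Return the parsed manifest, or None when it cannot be trusted."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing, unreadable, or malformed content: fail closed.
        return None
    # A manifest that is not a JSON object is treated as malformed too.
    return data if isinstance(data, dict) else None


def semantic_gate(manifest: Optional[dict], key: str, current_identity: str) -> str:
    """Return "reuse" only on a positive identity match, else "recompute"."""
    if manifest is None:
        return "recompute"
    previous = manifest.get(key)
    if not isinstance(previous, dict):
        return "recompute"
    if previous.get("identity") != current_identity:
        return "recompute"
    return "reuse"


def physical_gate(checkpoint_outputs: Iterable[Path]) -> bool:
    """True when every expected checkpoint output exists and is non-empty."""
    return all(p.is_file() and p.stat().st_size > 0 for p in checkpoint_outputs)
```

Note that a missing manifest never yields `"reuse"`: the only path to reuse is an explicit identity match followed by a passing physical check.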
| 21 | + |
| 22 | +Keep reuse explanations in the existing Stage 1 contract metadata under |
| 23 | +`dataset_build_output.json -> metadata.stage_1_status.reuse_reasoning`. |
| 24 | +That metadata should distinguish semantic identity results from physical |
| 25 | +checkpoint availability, including missing prior identity, identity mismatch, |
| 26 | +identity match, missing checkpoint output, empty checkpoint output, and restored |
| 27 | +checkpoint output. |
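One way to keep the semantic and physical outcomes distinct is to model them as separate enumerations. The sketch below is illustrative only: the enum names, the string values, and the `semantic`/`physical` field names are hypothetical and are not taken from the emitted `reuse_reasoning` schema.

```python
from enum import Enum
from typing import Optional


class SemanticResult(str, Enum):
    MISSING_PRIOR_IDENTITY = "missing_prior_identity"
    IDENTITY_MISMATCH = "identity_mismatch"
    IDENTITY_MATCH = "identity_match"


class PhysicalResult(str, Enum):
    MISSING_CHECKPOINT_OUTPUT = "missing_checkpoint_output"
    EMPTY_CHECKPOINT_OUTPUT = "empty_checkpoint_output"
    RESTORED_CHECKPOINT_OUTPUT = "restored_checkpoint_output"


def reuse_reasoning_entry(
    semantic: SemanticResult,
    physical: Optional[PhysicalResult] = None,
) -> dict:
    """Build one reasoning record (hypothetical shape).

    `physical` stays None when the semantic gate already forced a
    recompute, because checkpoints are not consulted in that case.
    """
    return {
        "semantic": semantic.value,
        "physical": physical.value if physical is not None else None,
    }
```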
| 28 | + |
| 29 | +## Identity Granularity |
| 30 | + |
| 31 | +`substep_id` is the public reporting group. It is not always the right durable |
| 32 | +manifest lookup key, because transitional Stage 1 substeps can contain multiple |
| 33 | +independently runnable command or script units. For example, raw-data download |
| 34 | +and uprating both report through `1a_raw_data_download`, while the base dataset |
| 35 | +substep can run several dataset builders. |
| 36 | + |
| 37 | +When persisting or looking up reuse identities for a command-backed unit, use a |
| 38 | +stable execution identity key that is unique within the checkpoint scope. The |
| 39 | +key should include the public `substep_id` plus enough stable execution material |
| 40 | +to distinguish the command or script and its expected reusable outputs. Keep |
| 41 | +`substep_id` on the record for public status grouping. |
| 42 | + |
| 43 | +Do not key multiple manifest records only by `substep_id` unless the record |
| 44 | +represents an intentionally aggregated identity for the whole public substep. |
| 45 | +Otherwise, later units in the same substep can overwrite earlier units and make |
| 46 | +future reruns recompute despite valid checkpoints. |
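A rough sketch of such a key, and of the overwrite problem it avoids, is shown below. The helper name `execution_identity_key`, the hashing scheme, and the script and output names are hypothetical; only `1a_raw_data_download` comes from this guide.

```python
import hashlib
import json
from typing import Dict, List


def execution_identity_key(
    substep_id: str, command: List[str], expected_outputs: List[str]
) -> str:
    """Combine the public substep id with stable execution material."""
    material = json.dumps(
        {"command": command, "outputs": sorted(expected_outputs)},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]
    return f"{substep_id}:{digest}"


# Two units that both report through the same public substep.
# The script and file names here are made up for illustration.
download_key = execution_identity_key(
    "1a_raw_data_download", ["python", "download_raw.py"], ["raw.h5"]
)
uprating_key = execution_identity_key(
    "1a_raw_data_download", ["python", "uprating.py"], ["uprating.csv"]
)

# Keyed by execution identity: both records survive.
manifest: Dict[str, dict] = {}
manifest[download_key] = {"substep_id": "1a_raw_data_download", "identity": "id-a"}
manifest[uprating_key] = {"substep_id": "1a_raw_data_download", "identity": "id-b"}

# Keyed only by substep_id: the second write replaces the first.
naive: Dict[str, dict] = {}
naive["1a_raw_data_download"] = {"identity": "id-a"}
naive["1a_raw_data_download"] = {"identity": "id-b"}
```

The `substep_id` prefix and the `substep_id` field on each record keep public status grouping possible without making it the lookup key.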
| 47 | + |
| 48 | +## Conditional Running |
| 49 | + |
| 50 | +Unit-level conditional running is the compatibility path while Stage 1 is still |
| 51 | +command-backed: |
| 52 | + |
| 53 | +1. Build current identity material for the runnable unit. |
| 54 | +2. Compare it with the previous manifest identity for that unit's identity key. |
| 55 | +3. Consult physical checkpoints only when the semantic decision is `reuse`. |
| 56 | +4. Restore and skip only that unit when both gates pass. |
| 57 | +5. Otherwise recompute the unit. In either case, update the manifest only |
| 58 | + after a successful output restoration or a successful checkpoint save. |
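The steps above can be sketched as a single function. This is a simplified illustration under stated assumptions: the `Unit` dataclass, the `restore` and `compute_and_save` callables, and the manifest record shape are all hypothetical stand-ins for whatever the Modal adapter actually uses.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class Unit:
    substep_id: str
    identity_key: str
    identity: str  # stand-in for serialized identity material
    checkpoint_outputs: List[Path]


def run_unit(
    unit: Unit,
    manifest: Dict[str, dict],
    restore: Callable[[Unit], None],
    compute_and_save: Callable[[Unit], None],
) -> str:
    # Steps 1-2: compare current identity with the record for this unit's key.
    previous = manifest.get(unit.identity_key)
    semantic_reuse = (
        isinstance(previous, dict) and previous.get("identity") == unit.identity
    )
    # Step 3: consult physical checkpoints only when the semantic gate passed.
    physical_ok = semantic_reuse and all(
        p.is_file() and p.stat().st_size > 0 for p in unit.checkpoint_outputs
    )
    if physical_ok:
        restore(unit)  # Step 4: restore and skip only this unit.
        status = "reused"
    else:
        compute_and_save(unit)  # Step 5: recompute and save a checkpoint.
        status = "completed"
    # Reached only if restore or compute_and_save did not raise.
    manifest[unit.identity_key] = {
        "substep_id": unit.substep_id,
        "identity": unit.identity,
    }
    return status
```

Because the manifest write comes last, a failed restore or a failed build leaves the previous record untouched.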
| 59 | + |
| 60 | +Public substep status should be aggregated from its unit results. A public |
| 61 | +substep is fully `reused` only when every required unit in that substep was |
| 62 | +reused. If any unit recomputes successfully, report the substep as completed |
| 63 | +with reuse reasoning that explains the mixed path. |
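A minimal sketch of that aggregation follows. It assumes every unit finished successfully with a status of either `reused` or `completed`; failure handling is omitted, and `1b_example_substep` is a made-up substep id.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate_substeps(
    unit_results: Iterable[Tuple[str, str]],
) -> Dict[str, str]:
    """Map each substep_id to "reused" or "completed" from its unit statuses."""
    by_substep: Dict[str, List[str]] = defaultdict(list)
    for substep_id, status in unit_results:
        by_substep[substep_id].append(status)
    return {
        substep_id: (
            "reused" if all(s == "reused" for s in statuses) else "completed"
        )
        for substep_id, statuses in by_substep.items()
    }


statuses = aggregate_substeps(
    [
        ("1a_raw_data_download", "reused"),
        ("1a_raw_data_download", "completed"),  # mixed path
        ("1b_example_substep", "reused"),
        ("1b_example_substep", "reused"),
    ]
)
```

Here `1a_raw_data_download` is reported as `completed` because one of its units recomputed, while the hypothetical `1b_example_substep` is `reused`.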
| 64 | + |
| 65 | +Stage-level conditional running is the same idea one level higher. Stage 1 may |
| 66 | +skip all builder execution only when every required unit for the requested run |
| 67 | +flags has a matching semantic identity and valid physical checkpoint outputs. |
| 68 | +Until the canonical Stage 1 coordinator owns whole-stage planning, do not infer |
| 69 | +stage-level reuse from a single substep or unit record. |
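As a rough illustration of the whole-stage rule, the check below requires a passing result for every required unit and treats a missing record as a failure. The function name and the `(semantic_ok, physical_ok)` tuple shape are assumptions for the example, as is the choice to fail closed when no units are listed.

```python
from typing import Dict, Iterable, Tuple


def stage_can_skip(
    required_keys: Iterable[str],
    gate_results: Dict[str, Tuple[bool, bool]],
) -> bool:
    """True only when every required unit passed both gates."""
    keys = list(required_keys)
    if not keys:
        # Nothing planned: fail closed rather than skip (assumption).
        return False
    # A key absent from gate_results yields None, which never equals
    # (True, True), so a single matching record cannot authorize a skip.
    return all(gate_results.get(key) == (True, True) for key in keys)
```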
| 70 | + |
| 71 | +## Documentation Expectations |
| 72 | + |
| 73 | +When changing Stage 1 identity material, checkpoint reuse decisions, artifact |
| 74 | +outputs, substep aggregation, or contract metadata, keep the durable |
| 75 | +documentation surface synchronized: |
| 76 | + |
| 77 | +- Update this guide when the Stage 1 rerun or checkpoint model changes. |
| 78 | +- Update `docs/pipeline_map.yaml` and regenerate generated pipeline docs when |
| 79 | + the stage graph, artifact names, or pipeline-node metadata change. |
| 80 | +- Keep `dataset_build_output.json` metadata documentation aligned with the |
| 81 | + status and reuse reasoning actually emitted by the Modal adapter. |
| 82 | +- Put PR-specific migration rationale in the PR description, not in durable |
| 83 | + docs or docstrings. |
| 84 | + |
| 85 | +Tests for Stage 1 reuse changes should cover missing and malformed manifests, |
| 86 | +semantic mismatch, physical checkpoint miss or empty output, same-public-substep |
| 87 | +units with distinct identity keys, and contract metadata explaining both gates. |