Commit 95a15b9

Add Stage 1 checkpoint reuse boundary
1 parent 0924580 commit 95a15b9

16 files changed

Lines changed: 2077 additions & 71 deletions

changelog.d/1074.added

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
Added Stage 1 checkpoint adapter and rerun reuse planning boundaries.

docs/engineering/skills/README.md

Lines changed: 2 additions & 0 deletions
@@ -29,6 +29,8 @@ pipeline path.
 
 Current stage guides:
 
+- `build_datasets.md`: Stage 1 build-dataset identity, checkpoint reuse,
+  conditional running, and contract metadata guidance.
 - `build_outputs.md`: Stage 4 output-build library boundaries and test
   expectations.
 - `release_promotion.md`: Stage 5 release candidate identity, validation-report
Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# Stage 1: Build Datasets

Stage 1 builds the public dataset artifacts consumed by later pipeline stages.
Its public status boundary is organized around the `1a_` through `1f_`
substeps, while the transitional Modal runtime still executes several
command-backed units inside some of those public substeps.

## Rerun And Reuse Model

Checkpoint reuse has two gates:

- The semantic gate compares current `Stage1IdentityMaterial` with a persisted
  identity from the checkpoint-scoped Stage 1 reuse manifest.
- The physical gate verifies that every expected checkpoint output exists and is
  non-empty before a unit is restored.
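
The two gates can be sketched roughly as follows. This is an illustrative sketch, not the adapter's actual code: `GateResult`, `semantic_gate`, `physical_gate`, and the reason strings are hypothetical names, and identity material is modelled as a plain mapping rather than the real `Stage1IdentityMaterial` type.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class GateResult:
    """Outcome of one gate (hypothetical type, for illustration only)."""

    reuse: bool
    reason: str


def semantic_gate(
    current: Mapping[str, str], persisted: Optional[Mapping[str, str]]
) -> GateResult:
    """Compare current identity material with the persisted identity."""
    if persisted is None:
        return GateResult(False, "missing_prior_identity")
    if dict(current) != dict(persisted):
        return GateResult(False, "identity_mismatch")
    return GateResult(True, "identity_match")


def physical_gate(expected_outputs: Sequence[Path]) -> GateResult:
    """Require every expected checkpoint output to exist and be non-empty."""
    for path in expected_outputs:
        if not path.is_file():
            return GateResult(False, "missing_checkpoint_output")
        if path.stat().st_size == 0:
            return GateResult(False, "empty_checkpoint_output")
    return GateResult(True, "checkpoint_outputs_present")
```

Keeping the gates as separate functions makes it easy to report which one blocked reuse.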

The physical checkpoint layout remains `/checkpoints/{branch}/{commit_sha}`.
The Stage 1 reuse manifest is adapter state in that same scope. Missing,
malformed, or unreadable manifest content must fail closed to recompute; it must
not authorize reuse by itself.
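
A fail-closed manifest read might look like the sketch below. The function name `load_reuse_manifest` and the top-level `records` key are assumptions made for illustration; the guide does not specify the manifest file name or schema.

```python
import json
from pathlib import Path
from typing import Optional


def load_reuse_manifest(manifest_path: Path) -> Optional[dict]:
    """Return manifest records, or None when the manifest cannot be trusted.

    None means "no prior identity", so callers fall through to recompute.
    """
    try:
        data = json.loads(manifest_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing or unreadable file (OSError) and malformed JSON or bad
        # encoding (ValueError subclasses) all fail closed.
        return None
    # Assumed schema: {"records": {identity_key: record, ...}}.
    if not isinstance(data, dict) or not isinstance(data.get("records"), dict):
        return None
    return data["records"]
```

Returning `None` rather than raising keeps every failure mode on the recompute path, so a damaged manifest can never be read as permission to reuse.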

Keep reuse explanations in the existing Stage 1 contract metadata under
`dataset_build_output.json -> metadata.stage_1_status.reuse_reasoning`.
That metadata should distinguish semantic identity results from physical
checkpoint availability, including missing prior identity, identity mismatch,
identity match, missing checkpoint output, empty checkpoint output, and restored
checkpoint output.
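
One way to keep the two kinds of result distinct is to record them in separate fields. The sketch below is hypothetical: the enum names, the entry shape, and the `reuse_reasoning_entry` helper are illustrative assumptions, and only the six outcomes themselves come from the list above.

```python
from enum import Enum
from typing import Dict, Optional


class SemanticResult(str, Enum):
    MISSING_PRIOR_IDENTITY = "missing_prior_identity"
    IDENTITY_MISMATCH = "identity_mismatch"
    IDENTITY_MATCH = "identity_match"


class PhysicalResult(str, Enum):
    MISSING_CHECKPOINT_OUTPUT = "missing_checkpoint_output"
    EMPTY_CHECKPOINT_OUTPUT = "empty_checkpoint_output"
    RESTORED_CHECKPOINT_OUTPUT = "restored_checkpoint_output"


def reuse_reasoning_entry(
    substep_id: str,
    semantic: SemanticResult,
    physical: Optional[PhysicalResult] = None,
) -> Dict[str, Optional[str]]:
    """Build one JSON-serializable reasoning entry (assumed shape)."""
    if physical is not None and semantic is not SemanticResult.IDENTITY_MATCH:
        raise ValueError(
            "physical checkpoints are consulted only after an identity match"
        )
    return {
        "substep_id": substep_id,
        "semantic": semantic.value,
        # None means the physical gate was never consulted.
        "physical": physical.value if physical is not None else None,
    }
```

For example, `reuse_reasoning_entry("1a_raw_data_download", SemanticResult.IDENTITY_MISMATCH)` yields an entry whose `physical` field is `None`, which makes it explicit that recompute was decided before any checkpoint was inspected.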

## Identity Granularity

`substep_id` is the public reporting group. It is not always the right durable
manifest lookup key, because transitional Stage 1 substeps can contain multiple
independently runnable command or script units. For example, raw-data download
and uprating both report through `1a_raw_data_download`, while the base dataset
substep can run several dataset builders.

When persisting or looking up reuse identities for a command-backed unit,
`identity_key` is the stable execution identity key within the checkpoint scope.
It includes the public `substep_id` plus enough stable execution material to
distinguish the command or script and its expected reusable outputs. Keep
`substep_id` on the record for public status grouping.

Do not key multiple manifest records only by `substep_id` unless the record
represents an intentionally aggregated identity for the whole public substep.
Otherwise, later units in the same substep can overwrite earlier units and make
future reruns recompute despite valid checkpoints.
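
The overwrite hazard is easy to see in a small sketch. The key format below (the `substep_id` plus a short hash of the command and expected outputs) is only one possible construction, and `make_identity_key`, the script names, and the output file names are hypothetical.

```python
import hashlib
import json
from typing import Dict, Sequence


def make_identity_key(
    substep_id: str, command: Sequence[str], expected_outputs: Sequence[str]
) -> str:
    """Derive a stable per-unit key from the substep and execution material."""
    material = json.dumps(
        {"command": list(command), "outputs": sorted(expected_outputs)},
        sort_keys=True,
    )
    digest = hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]
    return f"{substep_id}:{digest}"


# Two hypothetical units that report through the same public substep.
units = [
    ("1a_raw_data_download", ["python", "download_raw_data.py"], ["raw.h5"]),
    ("1a_raw_data_download", ["python", "uprate.py"], ["uprated.h5"]),
]

by_substep: Dict[str, dict] = {}
by_identity_key: Dict[str, dict] = {}
for substep_id, command, outputs in units:
    record = {"substep_id": substep_id, "command": command}
    by_substep[substep_id] = record  # the second unit overwrites the first
    by_identity_key[make_identity_key(substep_id, command, outputs)] = record

print(len(by_substep), len(by_identity_key))
```

Keyed by `substep_id`, the manifest retains only the last unit's record; keyed by `identity_key`, both units keep their own record while `substep_id` stays on each record for status grouping.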

## Conditional Running

Unit-level conditional running is the compatibility path while Stage 1 is still
command-backed:

1. Build current identity material for the runnable unit.
2. Compare it with the previous manifest identity for that unit's identity key.
3. Consult physical checkpoints only when the semantic decision is `reuse`.
4. Restore and skip only that unit when both gates pass.
5. Recompute the unit and update the manifest only after successful output
   restoration or successful checkpoint save.
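
The steps above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the adapter's implementation: `Unit`, `run_unit`, the injected callables, and the record shape are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Mapping


@dataclass(frozen=True)
class Unit:
    identity_key: str
    substep_id: str
    identity: Mapping[str, str]  # step 1: current identity material


def run_unit(
    unit: Unit,
    manifest: Dict[str, dict],
    checkpoints_valid: Callable[[Unit], bool],
    restore: Callable[[Unit], None],
    recompute_and_save: Callable[[Unit], None],
) -> str:
    """Reuse or recompute one unit; returns "reused" or "recomputed"."""
    # Step 2: semantic comparison by the unit's identity key.
    previous = manifest.get(unit.identity_key)
    semantic_reuse = (
        previous is not None and previous.get("identity") == dict(unit.identity)
    )
    # Step 3: `and` short-circuits, so the physical gate runs only when the
    # semantic decision is reuse.
    if semantic_reuse and checkpoints_valid(unit):
        restore(unit)  # step 4
        outcome = "reused"
    else:
        recompute_and_save(unit)
        outcome = "recomputed"
    # Step 5: reached only if restore or recompute-and-save did not raise.
    manifest[unit.identity_key] = {
        "substep_id": unit.substep_id,
        "identity": dict(unit.identity),
    }
    return outcome
```

Because the manifest write is the last statement, an exception during restoration or checkpoint save leaves the previous manifest record untouched.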

Public substep status should be aggregated from its unit results. A public
substep is fully `reused` only when every required unit in that substep was
reused. If any unit recomputes successfully, report the substep as completed
with reuse reasoning that explains the mixed path.
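
The aggregation rule is small enough to state as code. In this sketch the outcome strings `"reused"` and `"recomputed"` and the function name are assumptions; failure handling is deliberately left out because the guide does not describe it here.

```python
from typing import Sequence


def aggregate_substep_status(unit_outcomes: Sequence[str]) -> str:
    """Collapse successful unit outcomes into one public substep status."""
    if not unit_outcomes:
        raise ValueError("a public substep needs at least one unit outcome")
    unknown = set(unit_outcomes) - {"reused", "recomputed"}
    if unknown:
        raise ValueError(f"unexpected unit outcomes: {sorted(unknown)}")
    # Fully reused only when every required unit was reused.
    if all(outcome == "reused" for outcome in unit_outcomes):
        return "reused"
    # Any successful recompute makes the substep "completed" (mixed path).
    return "completed"
```

A caller would pair a `"completed"` result with per-unit reuse reasoning so that readers can see which units were restored and which were rebuilt.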

Stage-level conditional running is the same idea one level higher. Stage 1 may
skip all builder execution only when every required unit for the requested run
flags has a matching semantic identity and valid physical checkpoint outputs.
Until the canonical Stage 1 coordinator owns whole-stage planning, do not infer
stage-level reuse from a single substep or unit record.
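
A stage-level check would need a decision for every required unit, not just one. The sketch below assumes a hypothetical `unit_decisions` mapping keyed by identity key with `semantic_match` and `physical_valid` flags; treating an empty requirement list as "do not skip" is an added fail-closed assumption, not something the guide states.

```python
from typing import Iterable, Mapping


def stage_can_skip_all_builders(
    required_keys: Iterable[str],
    unit_decisions: Mapping[str, Mapping[str, bool]],
) -> bool:
    """True only when every required unit passed both gates."""
    required = list(required_keys)
    if not required:
        return False  # assumption: nothing to prove reuse with, so fail closed
    for key in required:
        decision = unit_decisions.get(key)
        if decision is None:
            return False  # a missing record never authorizes reuse
        if not decision.get("semantic_match", False):
            return False
        if not decision.get("physical_valid", False):
            return False
    return True
```

The list of required keys would be derived from the requested run flags, so a unit that is not required for a given run does not block the skip.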

## Documentation Expectations

When changing Stage 1 identity material, checkpoint reuse decisions, artifact
outputs, substep aggregation, or contract metadata, keep the durable
documentation surface synchronized:

- Update this guide when the Stage 1 rerun or checkpoint model changes.
- Update `docs/pipeline_map.yaml` and regenerate generated pipeline docs when
  the stage graph, artifact names, or pipeline-node metadata change.
- Keep `dataset_build_output.json` metadata documentation aligned with the
  status and reuse reasoning actually emitted by the Modal adapter.
- Put PR-specific migration rationale in the PR description, not in durable
  docs or docstrings.

Tests for Stage 1 reuse changes should cover missing and malformed manifests,
semantic mismatch, physical checkpoint miss or empty output, same-public-substep
units with distinct identity keys, and contract metadata explaining both gates.
