Skip to content

feat(checkpoint): add file-path identity checkpoint mode#6869

Open
chenghuichen wants to merge 12 commits into
Eventual-Inc:mainfrom
chenghuichen:path-checkpoint
Open

feat(checkpoint): add file-path identity checkpoint mode#6869
chenghuichen wants to merge 12 commits into
Eventual-Inc:mainfrom
chenghuichen:path-checkpoint

Conversation

@chenghuichen
Copy link
Copy Markdown
Contributor

@chenghuichen chenghuichen commented May 2, 2026

Changes Made

Add a file-path checkpoint mode as a fast-path that derives checkpoint keys from scan-task metadata (file paths + chunk specs like row groups and byte ranges) instead of requiring a user-specified key column. Activated when on= is omitted from CheckpointConfig.

For the full design doc, see #6446 (comment)

@chenghuichen chenghuichen requested a review from a team as a code owner May 2, 2026 19:02
@chenghuichen chenghuichen marked this pull request as draft May 2, 2026 19:02
@github-actions github-actions Bot added the feat label May 2, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 2, 2026

Greptile Summary

This PR adds a file-path checkpoint mode that derives checkpoint keys from scan-task metadata (file paths + chunk specs) instead of requiring a user-supplied on= column, making any file-based source checkpointable out of the box. The implementation correctly threads a FilePathRegistry from ScanTaskSource (where atom keys are registered) through to the sink/terminus (where they are staged and sealed), and properly enables filter/projection pushdown that was previously blocked in row-level mode.

  • P1 — stage_checkpoint_keys.rs (distributed): all errors from get_checkpointed_keys() are silently swallowed with if let Ok(...); on a transient store failure the checkpoint set is treated as empty and every task re-executes, risking duplicate writes for non-idempotent sinks. The local-execution path propagates these errors with ?.

Confidence Score: 4/5

Safe to merge after addressing the silent error swallowing in the distributed file-path checkpoint path.

One P1 finding in the distributed path where store errors are silently ignored, which can cause duplicate processing on transient failures. All other components are well-structured with good test coverage.

src/daft-distributed/src/pipeline_node/stage_checkpoint_keys.rs — error handling in the file-path checkpoint set loading.

Important Files Changed

Filename Overview
src/common/checkpoint-config/src/lib.rs Introduces CheckpointKeyMode enum, migrates key_column to key_mode, and adds FilePathRegistry with solid unit tests including a concurrency test.
src/daft-local-execution/src/sources/scan_task.rs Adds file-path checkpoint filtering in ScanTaskSource; correctly propagates store errors with ?. Uses task.sources.first() for key registration (pre-existing caveat, previously flagged).
src/daft-distributed/src/pipeline_node/stage_checkpoint_keys.rs Distributed file-path checkpoint filtering silently swallows all store errors, unlike the local path that propagates them — can cause silent duplicate writes on transient failures.
src/daft-local-execution/src/pipeline.rs Refactors BuilderContext.checkpoint from a tuple to CheckpointPipelineState enum; cleanly skips SCKO node for file-path mode and wires FilePathRegistry to scan source and sink.
src/daft-logical-plan/src/optimization/rules/rewrite_checkpoint_source.rs Splits into rewrite_row_level and rewrite_file_path; file-path path correctly omits the anti-join and wraps Source directly in StageCheckpointKeys.
src/daft-scan/src/lib.rs Adds source_atom_keys() covering whole-file, Parquet row-group, and byte-range chunk specs, with comprehensive unit tests.

Reviews (2): Last reviewed commit: "file-path identity checkpoint mode" | Re-trigger Greptile

Comment thread src/daft-local-execution/src/checkpoint_terminus.rs Outdated
Comment thread src/daft-local-execution/src/sources/scan_task.rs Outdated
Comment thread src/daft-scan/src/lib.rs
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 3, 2026

Merging this PR will not alter performance

✅ 40 untouched benchmarks
⏩ 10 skipped benchmarks1


Comparing chenghuichen:path-checkpoint (8bb7207) with main (044b833)

Open in CodSpeed

Footnotes

  1. 10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@chenghuichen chenghuichen marked this pull request as ready for review May 4, 2026 11:18
Comment thread src/daft-distributed/src/pipeline_node/stage_checkpoint_keys.rs
@madvart madvart requested a review from rohitkulshreshtha May 12, 2026 20:05
@rohitkulshreshtha
Copy link
Copy Markdown
Contributor

#6446 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants