Incremental scan #2337

Open
xanderbailey wants to merge 26 commits into apache:main from xanderbailey:xb/incremental_read

Conversation

@xanderbailey (Contributor) commented Apr 15, 2026

Which issue does this PR close?

  • Closes #.

What changes are included in this PR?

Adds incremental append scan support to iceberg-rust, allowing users to read only newly added data files between two snapshots. This is the Rust equivalent of Java's BaseIncrementalAppendScan.

Core scan module (crates/iceberg/src/scan/)

  • New incremental.rs module containing:
    • AppendSnapshotSet: walks the snapshot ancestry chain between from_snapshot_id and to_snapshot_id, validates connectivity, and collects only APPEND operation snapshot IDs (skipping overwrite/delete/compaction — matching Java's BaseIncrementalAppendScan behavior)
    • IncrementalAppendScanBuilder: builder with the same configuration options as TableScanBuilder (column selection, predicates, concurrency limits, row group filtering, etc.)
    • Supports both exclusive (default, like Java) and inclusive from_snapshot semantics
  • Refactored shared scan-build logic into ScanConfig + build_table_scan() to eliminate duplication between TableScanBuilder and IncrementalAppendScanBuilder
  • Added ManifestFileFilter and ManifestEntryFilter callback types to PlanContext, used by incremental scans to:
    • Skip delete manifests and data manifests whose added_snapshot_id is outside the scan range
    • Include only entries with status == Added and snapshot_id within the append set
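The ancestry walk described above can be sketched with the standard library alone. This is a minimal illustration, not the code in `incremental.rs`: the `Snapshot` and `Operation` types, the `append_snapshot_ids` function, and the `sample_history` helper are hypothetical stand-ins, and the error handling is an assumption about how a disconnected range might be reported.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified snapshot operation (not the iceberg-rust type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Operation {
    Append,
    Overwrite,
}

/// Hypothetical, simplified snapshot record.
#[derive(Debug, Clone)]
struct Snapshot {
    parent_id: Option<i64>,
    operation: Operation,
}

/// Walks parent pointers from `to` back to `from` and collects the IDs of
/// APPEND snapshots. Non-APPEND snapshots are skipped, and an error is
/// returned if `from` is not an ancestor of `to`.
fn append_snapshot_ids(
    snapshots: &HashMap<i64, Snapshot>,
    from: i64,
    to: i64,
    from_inclusive: bool,
) -> Result<HashSet<i64>, String> {
    let mut ids = HashSet::new();
    let mut current = Some(to);
    while let Some(id) = current {
        let snap = snapshots
            .get(&id)
            .ok_or_else(|| format!("unknown snapshot {}", id))?;
        let at_start = id == from;
        // The starting snapshot is only collected in inclusive mode.
        if (!at_start || from_inclusive) && snap.operation == Operation::Append {
            ids.insert(id);
        }
        if at_start {
            return Ok(ids);
        }
        current = snap.parent_id;
    }
    Err(format!("snapshot {} is not an ancestor of {}", from, to))
}

/// Example history: 1 (append) <- 2 (append) <- 3 (overwrite) <- 4 (append).
fn sample_history() -> HashMap<i64, Snapshot> {
    vec![
        (1, None, Operation::Append),
        (2, Some(1), Operation::Append),
        (3, Some(2), Operation::Overwrite),
        (4, Some(3), Operation::Append),
    ]
    .into_iter()
    .map(|(id, parent_id, operation)| (id, Snapshot { parent_id, operation }))
    .collect()
}
```

With this sample history, an exclusive scan from 1 to 4 collects snapshots 2 and 4, while the inclusive variant also collects snapshot 1. Snapshot 3 is skipped in both cases because it is an overwrite.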

Table API (crates/iceberg/src/table.rs)

  • Table::incremental_append_scan(from, to) — exclusive, matches Java's newIncrementalAppendScan()
  • Table::incremental_append_scan_inclusive(from, to) — inclusive variant
  • Same methods on StaticTable

DataFusion integration (crates/integrations/datafusion/)

  • New ScanRange enum replacing the previous Option<i64> snapshot ID, supporting Latest, PointInTime, and Incremental variants
  • IcebergStaticTableProvider gains three new constructors:
    • try_new_incremental(table, from, to) — exclusive
    • try_new_incremental_inclusive(table, from, to) — inclusive
    • try_new_appends_after(table, from) — exclusive, scans to current snapshot
  • get_batch_stream updated to dispatch to the appropriate scan builder based on ScanRange
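As a rough picture of the `ScanRange` dispatch, here is a standalone sketch. The variant names come from the description above, but the variant payloads, the `from_snapshot_id` helper, and the `describe_scan` function are assumptions made for illustration. The real `get_batch_stream` builds a scan rather than a string.

```rust
/// Hypothetical shape of `ScanRange`; the payload fields are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScanRange {
    /// Scan the table's current snapshot.
    Latest,
    /// Scan a single snapshot by ID.
    PointInTime(i64),
    /// Scan only data appended between two snapshots.
    Incremental { from: i64, to: i64, from_inclusive: bool },
}

impl ScanRange {
    /// Maps the previous `Option<i64>` representation onto the enum.
    fn from_snapshot_id(snapshot_id: Option<i64>) -> Self {
        match snapshot_id {
            Some(id) => ScanRange::PointInTime(id),
            None => ScanRange::Latest,
        }
    }
}

/// Stand-in for the dispatch step: reports which kind of scan each variant
/// would select.
fn describe_scan(range: ScanRange) -> String {
    match range {
        ScanRange::Latest => "table scan of the current snapshot".to_string(),
        ScanRange::PointInTime(id) => format!("table scan of snapshot {}", id),
        ScanRange::Incremental { from, to, from_inclusive } => {
            let bound = if from_inclusive { "inclusive" } else { "exclusive" };
            format!("incremental append scan from {} ({}) to {}", from, bound, to)
        }
    }
}
```

Modelling the range as an enum keeps the match exhaustive, so adding a further scan kind later forces every dispatch site to handle it.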

Are these changes tested?

Yes

Java comparison notes:

  • AppendSnapshotSet::build mirrors Java's BaseIncrementalAppendScan.snapshotsBetween(): only APPEND snapshots are collected, and non-APPEND snapshots are silently skipped
  • Exclusive from_snapshot semantics match Java's default IncrementalAppendScan behavior
  • Inclusive mode has no direct Java equivalent but follows the same ancestry walking logic
  • Manifest file filtering (skip deletes, check added_snapshot_id) and entry filtering (status == ADDED, snapshot_id in range) match BaseIncrementalAppendScan.doPlanFiles() in Java
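The two filtering rules in the last note can be sketched as a pair of closures. This is a simplified illustration under stated assumptions: `ManifestFile`, `ManifestEntry`, `ManifestContent`, and `EntryStatus` are pared-down stand-ins for the real types, the entry filter takes `&ManifestEntry` rather than `&ManifestEntryRef`, and `incremental_filters` is a hypothetical helper name.

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Hypothetical, pared-down stand-ins for the iceberg-rust manifest types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ManifestContent {
    Data,
    Deletes,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum EntryStatus {
    Existing,
    Added,
    Deleted,
}

struct ManifestFile {
    content: ManifestContent,
    added_snapshot_id: i64,
}

struct ManifestEntry {
    status: EntryStatus,
    snapshot_id: Option<i64>,
}

type ManifestFileFilter = Arc<dyn Fn(&ManifestFile) -> bool + Send + Sync>;
type ManifestEntryFilter = Arc<dyn Fn(&ManifestEntry) -> bool + Send + Sync>;

/// Builds both filters from the set of APPEND snapshot IDs in the scan range.
fn incremental_filters(append_ids: HashSet<i64>) -> (ManifestFileFilter, ManifestEntryFilter) {
    // Share one copy of the ID set between the two closures.
    let ids = Arc::new(append_ids);
    let file_ids = Arc::clone(&ids);

    // Keep data manifests that were added by a snapshot in the range.
    let file_filter: ManifestFileFilter = Arc::new(move |m: &ManifestFile| {
        m.content == ManifestContent::Data && file_ids.contains(&m.added_snapshot_id)
    });

    // Keep entries that were added by a snapshot in the range.
    let entry_filter: ManifestEntryFilter = Arc::new(move |e: &ManifestEntry| {
        e.status == EntryStatus::Added && e.snapshot_id.map_or(false, |id| ids.contains(&id))
    });

    (file_filter, entry_filter)
}
```

Filtering at the manifest-file level first avoids fetching manifests that cannot contribute, and the entry-level check then drops `Existing` and `Deleted` rows carried along inside a manifest that passed the first filter.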

@xanderbailey marked this pull request as draft April 15, 2026 22:56
Comment on lines +36 to +43
/// Filter applied to each [`ManifestFile`] before fetching it.
/// Returns `true` to include the manifest, `false` to skip it.
pub(crate) type ManifestFileFilter = Arc<dyn Fn(&ManifestFile) -> bool + Send + Sync>;

/// Filter applied to each manifest entry after loading a manifest.
/// Returns `true` to include the entry, `false` to skip it.
pub(crate) type ManifestEntryFilter = Arc<dyn Fn(&ManifestEntryRef) -> bool + Send + Sync>;

Contributor Author

I think this is a nice way to inject these filters, and it should extend to other future scans.

    pub manifest_entry_filter: Option<ManifestEntryFilter>,
}

impl std::fmt::Debug for PlanContext {
Contributor Author

ManifestFileFilter and ManifestEntryFilter are Arc<dyn Fn> types, which don't implement Debug, so PlanContext can't derive Debug and needs a manual impl.
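For context, a manual `Debug` impl for a struct holding a closure can look like the sketch below. `PlanContextSketch`, `EntryFilter`, and the `"<filter>"` placeholder are hypothetical and only illustrate the pattern. The actual `PlanContext` has more fields and may format the filters differently.

```rust
use std::fmt;
use std::sync::Arc;

/// Hypothetical filter type; `dyn Fn` trait objects do not implement `Debug`.
type EntryFilter = Arc<dyn Fn(i64) -> bool + Send + Sync>;

/// Hypothetical stand-in for `PlanContext`, reduced to two fields.
struct PlanContextSketch {
    snapshot_id: i64,
    manifest_entry_filter: Option<EntryFilter>,
}

impl fmt::Debug for PlanContextSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("PlanContextSketch")
            .field("snapshot_id", &self.snapshot_id)
            // A closure has no useful textual form, so record only whether
            // a filter is present.
            .field(
                "manifest_entry_filter",
                &self.manifest_entry_filter.as_ref().map(|_| "<filter>"),
            )
            .finish()
    }
}
```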

pub(crate) struct AppendSnapshotSet {
    /// Snapshot IDs in the range
    snapshot_ids: HashSet<i64>,
}
Collaborator

Contributor Author

ancestors_between is used here. We need to add validation and operation-type checking on top of it; we're not re-implementing the traversal. Does that make sense?

@xanderbailey marked this pull request as ready for review April 26, 2026 14:36
///
/// Use [`Table::incremental_append_scan`] or
/// [`Table::incremental_append_scan_inclusive`] to create an instance.
pub struct IncrementalAppendScanBuilder<'a> {
Contributor Author

There is duplication here between the scan builders, which I've left for now. Once we have a third variant, we can pick out the common patterns a bit better.
