Skip to content

Core, Spark: Fix invalid snapshot ID filtering in RewriteTablePath copy plan#16220

Open
krisnaru wants to merge 1 commit into
apache:mainfrom
krisnaru:copy-table-snapshot-filter
Open

Core, Spark: Fix invalid snapshot ID filtering in RewriteTablePath copy plan#16220
krisnaru wants to merge 1 commit into
apache:mainfrom
krisnaru:copy-table-snapshot-filter

Conversation

@krisnaru

@krisnaru krisnaru commented May 5, 2026

Copy link
Copy Markdown

Fixes #14458

Summary

  • Removes broken snapshotIds.contains(entry.snapshotId()) filter from RewriteTablePathUtil that silently dropped live, un replicated files when their adding snapshot was expired
  • Replaces snapshot-ID-based entry filtering with entry.isLive() only — all live content files are included in the copy plan
  • Adds anti-join dedup for incremental copies: filters out previously-replicated files by comparing against the start version's content files via contentFileDS
  • Fixes both full copy (expired snapshot IDs missing from endMetadata.snapshots()) and incremental copy (expired snapshot IDs missing from deltaSnapshotIds after RewriteManifests + ExpireSnapshots)

Problem

RewriteTablePathUtil.writeDataFileEntry filtered copy plan entries using snapshotIds.contains(entry.snapshotId()). This is fundamentally broken because entry.snapshotId() reflects the snapshot that originally added the file, which can be expired at any time by table maintenance. When RewriteManifests reorganizes entries from expired snapshots into new manifests, the manifest is correctly selected for processing but the entry-level filter drops the files — causing silent data loss at the target.

Scenario:
Append (S3) → RewriteManifests (S4) → Expire S3 → Incremental replication builds deltaSnapshotIds = {S4}, but file C has entry.snapshotId() = S3 → {S4}.contains(S3) == false → file C excluded from copy plan.

Approach

  1. Entry level: Include all live entries in copy plan (no snapshot ID check)
  2. Incremental dedup: Anti-join against start version's content files using contentFileDS from BaseSparkAction
  3. Manifest selection: Unchanged — manifestsToRewrite still uses deltaSnapshotIds from version history

Test plan

  • Verify existing TestRewriteTablePathsAction tests pass (full copy, incremental, expire scenarios)
  • Verify existing TestRewriteTablePathUtil tests pass
  • Add integration test: incremental replication after append + RewriteManifests + ExpireSnapshots
  • Verify incremental copy produces correct file counts (no duplicate copies of already-replicated files)
  • Verify full copy after many snapshots with expired snapshot IDs includes all live data

@krisnaru krisnaru force-pushed the copy-table-snapshot-filter branch 2 times, most recently from fc3db69 to 805b520 Compare May 5, 2026 22:07
@krisnaru krisnaru changed the title Fix invalid snapshot ID filtering in RewriteTablePath copy plan Core, Spark: Fix invalid snapshot ID filtering in RewriteTablePath copy plan May 5, 2026
@krisnaru krisnaru force-pushed the copy-table-snapshot-filter branch 3 times, most recently from f9503fd to ce254d6 Compare May 6, 2026 03:56

@anuragmantri anuragmantri left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convention we follow is that we raise the initial PR against the latest Spark branch. After the first review we can bulk backport to rest of the branches.

@krisnaru - Could you create Spark 4.1 only PR out of this for easy review, please?


List<Object[]> actual = rowsSorted(targetTableLocation(), "c1");
List<Object[]> expected = rowsSorted(location, "c1");
assertEquals(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth asserting the incremental result only contains 1 new file (fileC) to validate dedup is actually filtering.

.copyPlan()
.removeIf(pair -> previouslyCopiedPaths.contains(pair.first()));
} catch (Exception e) {
LOG.warn("Unable to read content files from start version for incremental filtering", e);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add the consequence of this to the warning - falling back to full copy plan.

@krisnaru krisnaru force-pushed the copy-table-snapshot-filter branch from ce254d6 to 3f72d71 Compare June 1, 2026 23:47
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove incorrect snapshotId filtering in RewriteTablePathSparkAction

2 participants