Skip to content

fix(scan): use current_schema for default-snapshot column validation#2566

Open
nazq wants to merge 1 commit into
apache:mainfrom
nazq:scan-current-schema-on-evolution
Open

fix(scan): use current_schema for default-snapshot column validation#2566
nazq wants to merge 1 commit into
apache:mainfrom
nazq:scan-current-schema-on-evolution

Conversation

@nazq

@nazq nazq commented Jun 2, 2026

Copy link
Copy Markdown

Which issue does this PR close?

What changes are included in this PR?

TableScanBuilder::build() currently validates caller-supplied column names against snapshot.schema(metadata), which returns the schema the snapshot was written under. For default scans (no explicit snapshot_id) on a table whose current schema differs from the snapshot's schema — i.e. any table that's been through an UpdateSchemaAction commit — this rejects columns that are valid against the post-evolution schema:

DataInvalid => Column note not found in table. Schema: <pre-evolution schema>

Fix

crates/iceberg/src/scan/mod.rs:221:

- let schema = snapshot.schema(self.table.metadata())?;
+ let schema = if self.snapshot_id.is_some() {
+     snapshot.schema(self.table.metadata())?
+ } else {
+     self.table.metadata().current_schema().clone()
+ };
  • Explicit snapshot_id (time-travel): keep the snapshot-time vocabulary. A caller asking for "the table as it existed at snapshot N" should see schema N's columns.
  • Default scan (no snapshot_id): use the table's current schema. Field IDs are stable across schemas, so the downstream Parquet projection (get_arrow_projection_mask_with_field_ids, which reads PARQUET:field_id metadata embedded by the writer) still finds the right on-disk columns even when the file's column names differ from the current schema's column names.

This matches PyIceberg's approach in pyiceberg/io/pyarrow.py::_task_to_record_batches — project by field ID, rename the arrow batch on the way out.

The column-name validation loop (lines 224–237) and the subsequent field_id_by_name lookup (lines 256–261) share the same schema variable, so the fix is one assignment with a 12-line comment explaining the invariant.

Why this wasn't caught earlier

UpdateSchemaAction (#2120) shipped with metadata-only tests in crates/catalog/loader/tests/schema_update_suite.rs — none of them exercise table.scan().select_columns(...) after the schema commit. The pre-existing crates/integration_tests/tests/read_evolved_schema.rs only uses table.scan().build() with no select_columns, which bypasses the validation loop entirely. So the regression has been latent in the add/delete path since #2120 merged; it's just easier to trip with rename.

Are these changes tested?

Yes — three regression tests in scan::tests, all on the same minimal fixture (make_schema_evolved_table) with current-schema-id=1 (id, value) and a sole snapshot at schema-id=0 (id, tmp):

Test What it covers
test_default_scan_uses_current_schema_after_evolution select(["id", "value"]) succeeds in the default scan — the post-evolution column name resolves
test_default_scan_rejects_old_name_after_rename select(["id", "tmp"]) fails with DataInvalid — the pre-evolution name is no longer part of the public vocabulary in a default scan
test_snapshot_id_scan_uses_snapshot_schema snapshot_id(1).select(["id", "tmp"]) succeeds (time-travel keeps the old vocabulary); snapshot_id(1).select(["id", "value"]) fails

All 34 existing scan::tests still pass. Full iceberg lib suite: 1299/1299. Clippy + rustfmt clean.

@viirya viirya left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantics here match PyIceberg's DataScan.projection() exactly — current schema for a default scan, the snapshot's schema when an explicit snapshot_id pins a point in history — so this brings us in line with an established reference implementation.

One thing worth calling out for other reviewers: the shared schema variable means this changes more than the validation loop — it also feeds predicate binding, field-id resolution, and PlanContext.snapshot_schema, which ends up as FileScanTask.schema and drives the reader's batch transformation. I traced each of these and they should all follow the scan's vocabulary schema, with field-id-based projection handling files written under older schemas, so sharing the variable is correct end-to-end.

Small naming nit: after this change PlanContext.snapshot_schema (and the same-named fields on ManifestFileContext/ManifestEntryContext) holds the current schema for default scans, which the name no longer reflects. A rename to something like scan_schema, or a short doc comment, would help future readers. Fine as a follow-up.

@nazq

nazq commented Jun 10, 2026

Copy link
Copy Markdown
Author

The nit is fair @viirya want me to push a small fix ?

@viirya

viirya commented Jun 10, 2026

Copy link
Copy Markdown
Member

The nit is fair @viirya want me to push a small fix ?

It would be good to have. Thanks.

A default table scan (no explicit snapshot_id) currently validates
caller-supplied column names against the snapshot's schema_id, not the
table's current schema. After an UpdateSchemaAction commit changes the
current schema (rename/add/delete column), pre-existing snapshots still
point at the old schema_id, so the validation loop in
TableScanBuilder::build rejects names that are valid against the
post-evolution schema with:

    DataInvalid => Column <new_name> not found in table. Schema: <old>

The downstream Parquet projection
(arrow/reader/projection.rs::get_arrow_projection_mask_with_field_ids)
already maps field IDs to on-disk column names via PARQUET:field_id
metadata, so resolving names against the current schema is safe
end-to-end — field IDs are stable across schema versions, and the
file's original column names live in the parquet metadata until the
file is rewritten.

Fix: branch on whether the caller asked for a specific snapshot.
Explicit snapshot_id (time-travel) keeps the snapshot-time vocabulary;
default scan uses the table's current schema.

Tests: three regression tests on a fixture with current-schema-id=1
(id, value) and a sole snapshot at schema-id=0 (id, tmp):

* test_default_scan_uses_current_schema_after_evolution — select(['id','value'])
  succeeds in the default scan
* test_default_scan_rejects_old_name_after_rename — select(['id','tmp'])
  fails with DataInvalid in the default scan
* test_snapshot_id_scan_uses_snapshot_schema — snapshot_id(1).select(['id','tmp'])
  succeeds (time-travel), and snapshot_id(1).select(['id','value']) fails

All 1299 iceberg lib tests pass (37 in scan::tests = 34 existing + 3
new). Clippy + rustfmt clean.
@nazq nazq force-pushed the scan-current-schema-on-evolution branch from 5f6afb2 to e9dc388 Compare June 10, 2026 17:20
@nazq

nazq commented Jun 10, 2026

Copy link
Copy Markdown
Author

Done — pushed the rename. PlanContext/ManifestFileContext/ManifestEntryContext now carry scan_schema instead of snapshot_schema, and I added a doc comment on PlanContext::scan_schema spelling out the current-vs-snapshot distinction you described. Kept the rename scoped to the scan-context chain; the visitor/transformer params downstream still take a snapshot_schema arg since those genuinely receive whatever schema the caller passes. Rebased onto current main (which picked up the new FileScanTask::builder() form) and re-ran the scan tests — all green. Thanks for the thorough trace and the approval! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Table scan rejects current-schema column names after UpdateSchemaAction commit

2 participants