fix(scan): use current_schema for default-snapshot column validation by nazq · Pull Request #2566 · apache/iceberg-rust

nazq · 2026-06-02T00:19:49Z

Which issue does this PR close?

Closes Table scan rejects current-schema column names after UpdateSchemaAction commit #2565.

What changes are included in this PR?

TableScanBuilder::build() currently validates caller-supplied column names against snapshot.schema(metadata), which returns the schema the snapshot was written under. For default scans (no explicit snapshot_id) on a table whose current schema differs from the snapshot's schema — i.e. any table that's been through an UpdateSchemaAction commit — this rejects columns that are valid against the post-evolution schema:

DataInvalid => Column note not found in table. Schema: <pre-evolution schema>

Fix

crates/iceberg/src/scan/mod.rs:221:

- let schema = snapshot.schema(self.table.metadata())?;
+ let schema = if self.snapshot_id.is_some() {
+     snapshot.schema(self.table.metadata())?
+ } else {
+     self.table.metadata().current_schema().clone()
+ };

Explicit snapshot_id (time-travel): keep the snapshot-time vocabulary. A caller asking for "the table as it existed at snapshot N" should see schema N's columns.
Default scan (no snapshot_id): use the table's current schema. Field IDs are stable across schemas, so the downstream Parquet projection (get_arrow_projection_mask_with_field_ids, which reads PARQUET:field_id metadata embedded by the writer) still finds the right on-disk columns even when the file's column names differ from the current schema's column names.

This matches PyIceberg's approach in pyiceberg/io/pyarrow.py::_task_to_record_batches — project by field ID, rename the arrow batch on the way out.

The column-name validation loop (lines 224–237) and the subsequent field_id_by_name lookup (lines 256–261) share the same schema variable, so the fix is one assignment with a 12-line comment explaining the invariant.

Why this wasn't caught earlier

UpdateSchemaAction (#2120) shipped with metadata-only tests in crates/catalog/loader/tests/schema_update_suite.rs — none of them exercise table.scan().select_columns(...) after the schema commit. The pre-existing crates/integration_tests/tests/read_evolved_schema.rs only uses table.scan().build() with no select_columns, which bypasses the validation loop entirely. So the regression has been latent in the add/delete path since #2120 merged; it's just easier to trip with rename.

Are these changes tested?

Yes — three regression tests in scan::tests, all on the same minimal fixture (make_schema_evolved_table) with current-schema-id=1 (id, value) and a sole snapshot at schema-id=0 (id, tmp):

Test	What it covers
`test_default_scan_uses_current_schema_after_evolution`	`select(["id", "value"])` succeeds in the default scan — the post-evolution column name resolves
`test_default_scan_rejects_old_name_after_rename`	`select(["id", "tmp"])` fails with `DataInvalid` — the pre-evolution name is no longer part of the public vocabulary in a default scan
`test_snapshot_id_scan_uses_snapshot_schema`	`snapshot_id(1).select(["id", "tmp"])` succeeds (time-travel keeps the old vocabulary); `snapshot_id(1).select(["id", "value"])` fails

All 34 existing scan::tests still pass. Full iceberg lib suite: 1299/1299. Clippy + rustfmt clean.

viirya

The semantics here match PyIceberg's DataScan.projection() exactly — current schema for a default scan, the snapshot's schema when an explicit snapshot_id pins a point in history — so this brings us in line with an established reference implementation.

One thing worth calling out for other reviewers: the shared schema variable means this changes more than the validation loop — it also feeds predicate binding, field-id resolution, and PlanContext.snapshot_schema, which ends up as FileScanTask.schema and drives the reader's batch transformation. I traced each of these and they should all follow the scan's vocabulary schema, with field-id-based projection handling files written under older schemas, so sharing the variable is correct end-to-end.

Small naming nit: after this change PlanContext.snapshot_schema (and the same-named fields on ManifestFileContext/ManifestEntryContext) holds the current schema for default scans, which the name no longer reflects. A rename to something like scan_schema, or a short doc comment, would help future readers. Fine as a follow-up.

nazq · 2026-06-10T12:52:09Z

The nit is fair @viirya want me to push a small fix ?

viirya · 2026-06-10T16:57:28Z

The nit is fair @viirya want me to push a small fix ?

It would be good to have. Thanks.

A default table scan (no explicit snapshot_id) currently validates caller-supplied column names against the snapshot's schema_id, not the table's current schema. After an UpdateSchemaAction commit changes the current schema (rename/add/delete column), pre-existing snapshots still point at the old schema_id, so the validation loop in TableScanBuilder::build rejects names that are valid against the post-evolution schema with: DataInvalid => Column <new_name> not found in table. Schema: <old> The downstream Parquet projection (arrow/reader/projection.rs::get_arrow_projection_mask_with_field_ids) already maps field IDs to on-disk column names via PARQUET:field_id metadata, so resolving names against the current schema is safe end-to-end — field IDs are stable across schema versions, and the file's original column names live in the parquet metadata until the file is rewritten. Fix: branch on whether the caller asked for a specific snapshot. Explicit snapshot_id (time-travel) keeps the snapshot-time vocabulary; default scan uses the table's current schema. Tests: three regression tests on a fixture with current-schema-id=1 (id, value) and a sole snapshot at schema-id=0 (id, tmp): * test_default_scan_uses_current_schema_after_evolution — select(['id','value']) succeeds in the default scan * test_default_scan_rejects_old_name_after_rename — select(['id','tmp']) fails with DataInvalid in the default scan * test_snapshot_id_scan_uses_snapshot_schema — snapshot_id(1).select(['id','tmp']) succeeds (time-travel), and snapshot_id(1).select(['id','value']) fails All 1299 iceberg lib tests pass (37 in scan::tests = 34 existing + 3 new). Clippy + rustfmt clean.

nazq · 2026-06-10T17:20:54Z

Done — pushed the rename. PlanContext/ManifestFileContext/ManifestEntryContext now carry scan_schema instead of snapshot_schema, and I added a doc comment on PlanContext::scan_schema spelling out the current-vs-snapshot distinction you described. Kept the rename scoped to the scan-context chain; the visitor/transformer params downstream still take a snapshot_schema arg since those genuinely receive whatever schema the caller passes. Rebased onto current main (which picked up the new FileScanTask::builder() form) and re-ran the scan tests — all green. Thanks for the thorough trace and the approval! 🙏

viirya approved these changes Jun 10, 2026

View reviewed changes

nazq force-pushed the scan-current-schema-on-evolution branch from 5f6afb2 to e9dc388 Compare June 10, 2026 17:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(scan): use current_schema for default-snapshot column validation#2566

fix(scan): use current_schema for default-snapshot column validation#2566
nazq wants to merge 1 commit into
apache:mainfrom
nazq:scan-current-schema-on-evolution

nazq commented Jun 2, 2026

Uh oh!

viirya left a comment

Uh oh!

nazq commented Jun 10, 2026

Uh oh!

viirya commented Jun 10, 2026

Uh oh!

nazq commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

nazq commented Jun 2, 2026

Which issue does this PR close?

What changes are included in this PR?

Fix

Why this wasn't caught earlier

Are these changes tested?

Uh oh!

viirya left a comment

Choose a reason for hiding this comment

Uh oh!

nazq commented Jun 10, 2026

Uh oh!

viirya commented Jun 10, 2026

Uh oh!

nazq commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants