fix(scan): use current_schema for default-snapshot column validation#2566
fix(scan): use current_schema for default-snapshot column validation#2566nazq wants to merge 1 commit into
Conversation
viirya
left a comment
There was a problem hiding this comment.
The semantics here match PyIceberg's DataScan.projection() exactly — current schema for a default scan, the snapshot's schema when an explicit snapshot_id pins a point in history — so this brings us in line with an established reference implementation.
One thing worth calling out for other reviewers: the shared schema variable means this changes more than the validation loop — it also feeds predicate binding, field-id resolution, and PlanContext.snapshot_schema, which ends up as FileScanTask.schema and drives the reader's batch transformation. I traced each of these and they should all follow the scan's vocabulary schema, with field-id-based projection handling files written under older schemas, so sharing the variable is correct end-to-end.
Small naming nit: after this change PlanContext.snapshot_schema (and the same-named fields on ManifestFileContext/ManifestEntryContext) holds the current schema for default scans, which the name no longer reflects. A rename to something like scan_schema, or a short doc comment, would help future readers. Fine as a follow-up.
|
The nit is fair @viirya want me to push a small fix ? |
It would be good to have. Thanks. |
A default table scan (no explicit snapshot_id) currently validates
caller-supplied column names against the snapshot's schema_id, not the
table's current schema. After an UpdateSchemaAction commit changes the
current schema (rename/add/delete column), pre-existing snapshots still
point at the old schema_id, so the validation loop in
TableScanBuilder::build rejects names that are valid against the
post-evolution schema with:
DataInvalid => Column <new_name> not found in table. Schema: <old>
The downstream Parquet projection
(arrow/reader/projection.rs::get_arrow_projection_mask_with_field_ids)
already maps field IDs to on-disk column names via PARQUET:field_id
metadata, so resolving names against the current schema is safe
end-to-end — field IDs are stable across schema versions, and the
file's original column names live in the parquet metadata until the
file is rewritten.
Fix: branch on whether the caller asked for a specific snapshot.
Explicit snapshot_id (time-travel) keeps the snapshot-time vocabulary;
default scan uses the table's current schema.
Tests: three regression tests on a fixture with current-schema-id=1
(id, value) and a sole snapshot at schema-id=0 (id, tmp):
* test_default_scan_uses_current_schema_after_evolution — select(['id','value'])
succeeds in the default scan
* test_default_scan_rejects_old_name_after_rename — select(['id','tmp'])
fails with DataInvalid in the default scan
* test_snapshot_id_scan_uses_snapshot_schema — snapshot_id(1).select(['id','tmp'])
succeeds (time-travel), and snapshot_id(1).select(['id','value']) fails
All 1299 iceberg lib tests pass (37 in scan::tests = 34 existing + 3
new). Clippy + rustfmt clean.
5f6afb2 to
e9dc388
Compare
|
Done — pushed the rename. |
Which issue does this PR close?
UpdateSchemaActioncommit #2565.What changes are included in this PR?
TableScanBuilder::build()currently validates caller-supplied column names againstsnapshot.schema(metadata), which returns the schema the snapshot was written under. For default scans (no explicitsnapshot_id) on a table whose current schema differs from the snapshot's schema — i.e. any table that's been through anUpdateSchemaActioncommit — this rejects columns that are valid against the post-evolution schema:Fix
crates/iceberg/src/scan/mod.rs:221:snapshot_id(time-travel): keep the snapshot-time vocabulary. A caller asking for "the table as it existed at snapshot N" should see schema N's columns.snapshot_id): use the table's current schema. Field IDs are stable across schemas, so the downstream Parquet projection (get_arrow_projection_mask_with_field_ids, which readsPARQUET:field_idmetadata embedded by the writer) still finds the right on-disk columns even when the file's column names differ from the current schema's column names.This matches PyIceberg's approach in
pyiceberg/io/pyarrow.py::_task_to_record_batches— project by field ID, rename the arrow batch on the way out.The column-name validation loop (lines 224–237) and the subsequent
field_id_by_namelookup (lines 256–261) share the sameschemavariable, so the fix is one assignment with a 12-line comment explaining the invariant.Why this wasn't caught earlier
UpdateSchemaAction(#2120) shipped with metadata-only tests incrates/catalog/loader/tests/schema_update_suite.rs— none of them exercisetable.scan().select_columns(...)after the schema commit. The pre-existingcrates/integration_tests/tests/read_evolved_schema.rsonly usestable.scan().build()with noselect_columns, which bypasses the validation loop entirely. So the regression has been latent in the add/delete path since #2120 merged; it's just easier to trip with rename.Are these changes tested?
Yes — three regression tests in
scan::tests, all on the same minimal fixture (make_schema_evolved_table) withcurrent-schema-id=1(id, value) and a sole snapshot atschema-id=0(id, tmp):test_default_scan_uses_current_schema_after_evolutionselect(["id", "value"])succeeds in the default scan — the post-evolution column name resolvestest_default_scan_rejects_old_name_after_renameselect(["id", "tmp"])fails withDataInvalid— the pre-evolution name is no longer part of the public vocabulary in a default scantest_snapshot_id_scan_uses_snapshot_schemasnapshot_id(1).select(["id", "tmp"])succeeds (time-travel keeps the old vocabulary);snapshot_id(1).select(["id", "value"])failsAll 34 existing
scan::testsstill pass. Full iceberg lib suite: 1299/1299. Clippy + rustfmt clean.