Spark: Add vectorized Parquet reads for variant columns by nssalian · Pull Request #16292 · apache/iceberg

nssalian · 2026-05-11T19:48:51Z

Follow-up to #16087: fixes vectorized support for variant columns so that the temporary patches can be removed.

Rationale for this Change

Variant columns currently force the entire table into row-at-a-time reads because the vectorized reader doesn't handle them. This PR fixes that by reading variant's metadata and value children as Arrow VarBinary batches, with per-file detection so shredded files automatically fall back to row reads.

What changes are included in this PR?

Vectorized variant read path:

- VectorizedReaderBuilder: adds variantVisitor(), which creates a VectorizedVariantVisitor scoped to each variant column's Parquet path
- VectorizedVariantVisitor: walks the variant's internal structure and creates Arrow readers for the metadata and value leaves
- VectorizedArrowReader.VectorizedVariantReader: composes two child readers and delegates read/setRowGroupInfo/setBatchSize/close to them
- VectorHolder.VariantVectorHolder: carries both child holders through the batch pipeline
- VariantColumnVector (new): Spark ColumnVector implementing getChild(0) = value, getChild(1) = metadata per Spark's getVariant() contract
- ColumnVectorBuilder: dispatches VariantVectorHolder before the isDummy() check
- ColumnVectorWithFilter: adds a VariantType branch to getChild() so variant + DV/position deletes work with vectorization
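
The child-ordinal contract described for VariantColumnVector can be illustrated without Spark or Arrow on the classpath. The sketch below is not the PR's code: BinaryColumn and VariantColumn are hypothetical stand-ins, and the only detail taken from the description is that child 0 carries the value and child 1 carries the metadata.

```java
import java.util.List;

/** Hypothetical stand-in for a binary column vector; not a Spark or Iceberg class. */
final class BinaryColumn {
  private final List<byte[]> cells;

  BinaryColumn(List<byte[]> cells) {
    this.cells = cells;
  }

  byte[] getBinary(int rowId) {
    return cells.get(rowId);
  }

  int numRows() {
    return cells.size();
  }
}

/** Hypothetical composite column: child 0 is the value, child 1 is the metadata. */
final class VariantColumn {
  static final int VALUE_ORDINAL = 0;
  static final int METADATA_ORDINAL = 1;

  private final BinaryColumn value;
  private final BinaryColumn metadata;

  VariantColumn(BinaryColumn value, BinaryColumn metadata) {
    // Both children describe the same rows, so their lengths must agree.
    if (value.numRows() != metadata.numRows()) {
      throw new IllegalArgumentException("value and metadata must have the same row count");
    }
    this.value = value;
    this.metadata = metadata;
  }

  BinaryColumn getChild(int ordinal) {
    switch (ordinal) {
      case VALUE_ORDINAL:
        return value;
      case METADATA_ORDINAL:
        return metadata;
      default:
        throw new IllegalArgumentException("Variant has only two children, got ordinal " + ordinal);
    }
  }
}
```

A caller that assembles a variant for a row would read `getChild(0).getBinary(rowId)` and `getChild(1).getBinary(rowId)` together, which is why the two children must always cover the same rows.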

Shredded-file detection at scan plan:

- SparkBatch.supportsParquetBatchReads(ScanTask): per-file lowerBounds.containsKey(variantFieldId) check; presence indicates a shredded payload, and batch reads are disabled for that scan
- SparkBatch.supportsParquetBatchReads(NestedField): falls back to row reads when the variant column's metrics mode is None or Counts (bounds aren't trustworthy for shredded detection)
- SparkScanBuilder: opts into variant-column stats for both buildIcebergBatchScan and buildIcebergIncrementalAppendScan so that lowerBounds is loaded at scan planning without opening Parquet footers
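
The two gates above can be sketched in isolation. This is a simplified illustration under stated assumptions, not the SparkBatch implementation: ShreddedVariantGate and its method names are hypothetical, each file is modeled only by its lowerBounds map keyed by field ID, and a scan task is assumed to be rejected as soon as any of its files looks shredded.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; the real checks live in SparkBatch. */
final class ShreddedVariantGate {
  /** Simplified mirror of Iceberg's metrics modes (none, counts, truncate, full). */
  enum MetricsMode { NONE, COUNTS, TRUNCATE, FULL }

  /** Column-level gate: with none or counts, bounds are never written, so absence proves nothing. */
  static boolean boundsAreRecorded(MetricsMode mode) {
    return mode != MetricsMode.NONE && mode != MetricsMode.COUNTS;
  }

  /** File-level gate: a lower bound recorded for the variant field is treated as a sign of shredding. */
  static boolean looksShredded(Map<Integer, ByteBuffer> lowerBounds, int variantFieldId) {
    return lowerBounds != null && lowerBounds.containsKey(variantFieldId);
  }

  /** Batch reads are allowed only if bounds are trustworthy and no file in the task looks shredded. */
  static boolean supportsBatchReads(
      MetricsMode mode, List<Map<Integer, ByteBuffer>> perFileLowerBounds, int variantFieldId) {
    if (!boundsAreRecorded(mode)) {
      return false; // conservative fallback to row reads
    }
    for (Map<Integer, ByteBuffer> lowerBounds : perFileLowerBounds) {
      if (looksShredded(lowerBounds, variantFieldId)) {
        return false;
      }
    }
    return true;
  }
}
```

The ordering matters in this sketch: the metrics-mode check runs first because a missing bound is only meaningful when bounds would have been written in the first place.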

The changes apply to both Spark 4.0 and 4.1.

Limitations

- Shredded variant columns are not vectorized. The per-file lowerBounds check detects them and falls back to row-at-a-time reads.
- Variant inside structs/lists/maps still falls back to row-at-a-time reads (pre-existing limitation for all complex types).
- When write.metadata.metrics.default is set to none or counts for a variant column, bounds aren't recorded, so detection falls back conservatively to row reads.

Are these changes tested?

TestSparkVariantRead (v4.0 + v4.1)
- All existing tests now run with both vectorized=false and vectorized=true. Previously, the vectorized=true cases were skipped
- testVariantReadAfterDelete - variant column with DV deletes under vectorization
- testReadShreddedAfterPropertyToggled - writes shredded data with write.parquet.shred-variants=true, toggles the property to false, then reads. Verifies the per-file lowerBounds check forces row reads on the existing shredded files (parameterized over vectorized=false/true)
- testReadShreddedWithMetricsDisabled - shredded write with write.metadata.metrics.default=none and =counts. Verifies the metrics-mode gate forces row reads when bounds aren't recorded (parameterized over both modes)
TestVariantShredding (v4.0 + v4.1) - table created with PARQUET_SHRED_VARIANTS=true; SparkBatch correctly detects and falls back
TestSnapshotTableProcedure (v4.0 + v4.1) - external Parquet imports with variant columns lacking the VARIANT annotation now read correctly with vectorization on by default. The previous manual read.parquet.vectorization.enabled=false workaround is removed

Are there any user-facing changes?

- Vectorization is now enabled for variant columns on tables that don't shred. Performance benefits flow through automatically.
- For tables with shredded variant data, batch reads transparently fall back to row reads on a per-file basis. No user configuration is required.
- Tables that disable variant column metrics (write.metadata.metrics.default=none or counts) also fall back to row reads to avoid silent data loss.

Performance

Measured includeColumnStats(variantColumns) scan-plan overhead at 10/100/1000 files (5 iterations + 3 warmups, two independent runs, local SSD, hadoop catalog). Per-file delta is roughly 1-2 microseconds and within run-to-run noise at 1000 files. The opt-in only fires for projections containing variant columns; non-variant scans are unchanged.
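
For readers who want to reproduce this kind of measurement, the following is a minimal sketch of a warmup-then-measure harness. It is not the benchmark used for the numbers above: ScanPlanTiming is a hypothetical helper, and taking the median of the measured runs is an assumption, since the description does not say how the iterations were aggregated.

```java
import java.util.Arrays;

/** Hypothetical timing helper; not part of the PR. */
final class ScanPlanTiming {
  /** Runs the task unmeasured for the warmups, then returns the median measured time in nanoseconds. */
  static long medianNanos(Runnable task, int warmups, int iterations) {
    if (iterations <= 0) {
      throw new IllegalArgumentException("iterations must be positive");
    }
    for (int i = 0; i < warmups; i++) {
      task.run();
    }
    long[] samples = new long[iterations];
    for (int i = 0; i < iterations; i++) {
      long start = System.nanoTime();
      task.run();
      samples[i] = System.nanoTime() - start;
    }
    Arrays.sort(samples);
    int mid = iterations / 2;
    return iterations % 2 == 1 ? samples[mid] : (samples[mid - 1] + samples[mid]) / 2;
  }

  /** Converts the difference between two plan times into microseconds of overhead per file. */
  static double perFileDeltaMicros(long baselineNanos, long withStatsNanos, int fileCount) {
    return (withStatsNanos - baselineNanos) / 1_000.0 / fileCount;
  }
}
```

One would call `medianNanos` once with a task that plans the scan without column stats and once with stats enabled, both with 3 warmups and 5 iterations, then pass the two medians and the file count to `perFileDeltaMicros`.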

nssalian · 2026-05-22T18:43:03Z

@pvary @huaxingao @singhpk234 PTAL

huan233usc

Some additional small comments

huan233usc

LGTM, thanks.

huaxingao · 2026-06-11T01:50:11Z


  private boolean supportsParquetBatchReads(Types.NestedField field) {
+    if (field.type().isVariantType()) {
+      return !PropertyUtil.propertyAsBoolean(


This gates batch reads on the write.parquet.shred-variants property. The property reflects the current write config, not what's in existing files — so a table that's currently false but still has shredded files (property toggled later, or files written elsewhere) would take the batch path and silently drop typed_value data. Is "property=false -> no shredded files" a safe assumption? If so, worth a short comment noting it.

nssalian · Jun 11, 2026

Good catch. This might need a file-level check for typed_value fields. Let me find a clean way to add this.

nssalian · Jun 12, 2026

This is non-trivial. I'm working on it so that it doesn't hit more edge cases and stays in line with the interfaces. I'll surface it once I have it working cleanly locally.

nssalian · Jun 22, 2026

I moved the detection from the table property write.parquet.shred-variants to a per-file lowerBounds.containsKey(variantFieldId) check on the manifest entry, so toggling the property after writing shredded files no longer drops typed_value data on the batch path. SparkScanBuilder opts into variant-column stats for both the batch and incremental scan paths so the check is available without opening any Parquet footers. I added a test as well.

nssalian · 2026-06-22T19:01:52Z

fixing the tests

Spark, Arrow: Add vectorized Parquet reads for variant columns

7c91c45

github-actions Bot added spark arrow labels May 11, 2026

nssalian changed the title ~~Spark, Arrow: Add vectorized Parquet reads for variant columns~~ Spark,Arrow: Add vectorized Parquet reads for variant columns May 11, 2026

nssalian changed the title ~~Spark,Arrow: Add vectorized Parquet reads for variant columns~~ Spark: Add vectorized Parquet reads for variant columns May 11, 2026

nssalian added 5 commits May 11, 2026 13:02

Modify test from previous patch

d8dd3c7

disable vectorized read for shredded variant

d0cf8dd

Merge remote-tracking branch 'apache/main' into variant-vectorized-re…

020cf79

…ader

Fall back to row-at-a-time reads for shredded variant columns

4164a4b

Merge remote-tracking branch 'apache/main' into variant-vectorized-re…

6a66e06

…ader

nssalian marked this pull request as ready for review May 13, 2026 15:44

huan233usc reviewed May 29, 2026

Comment thread ...k/v4.1/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VariantColumnVector.java

nssalian added 2 commits June 1, 2026 09:05

Fix vectorized variant read with deletes

253ac64

Merge remote-tracking branch 'apache/main' into variant-vectorized-re…

2a7f0ca

…ader

nssalian requested a review from huaxingao June 1, 2026 17:06

huan233usc approved these changes Jun 2, 2026

Comment thread arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java Outdated

Comment thread spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatch.java Outdated

nssalian added 2 commits June 3, 2026 16:12

PR comments

f5a7439

Merge remote-tracking branch 'apache/main' into variant-vectorized-re…

a4a0b54

…ader

nssalian requested review from Fokko and huan233usc June 4, 2026 22:33

huan233usc approved these changes Jun 5, 2026

huaxingao reviewed Jun 11, 2026

Comment thread ...extensions/src/test/java/org/apache/iceberg/spark/extensions/TestSnapshotTableProcedure.java Outdated

AdamGS mentioned this pull request Jun 16, 2026

spark: Add support for Variant for Spark with Vortex spiraldb/iceberg#34

Merged

backport added 3 commits June 21, 2026 20:09

Merge remote-tracking branch 'apache/main' into variant-vectorized-re…

933ea41

…ader

Spark: Detect shredded variant per-file via manifest bounds

6f5eb4b

Merge remote-tracking branch 'apache/main' into variant-vectorized-re…

2bfc609

…ader

Fix tests

bed7967

nssalian requested review from huaxingao and singhpk234 June 23, 2026 03:56

nssalian wants to merge 14 commits into apache:main from nssalian:variant-vectorized-reader
