Optimizing Variant read path with lazy caching by nssalian · Pull Request #3481 · apache/parquet-java

nssalian · 2026-04-15T14:52:23Z

Rationale for this change

Profiling in #3452 identified Variant.getFieldAtIndex() and metadata string lookups as hotspots during variant reads. Every call to getFieldByKey, getFieldAtIndex, and getElementAtIndex re-parses headers and re-allocates objects that could be cached.

What changes are included in this PR?

Adds lazy caching to Variant.java for metadata strings, object headers, and array headers. Field lookups in getFieldByKey now defer value construction until a match is found, and child Variants share the parent's metadata cache. Also removes two unused static helper methods.
Adds readUnsignedLittleEndian to VariantUtil for bulk ByteBuffer reads (adapted from Iceberg), replacing byte-at-a-time readUnsigned calls in the Variant read path. Buffers are set to LITTLE_ENDIAN byte order in the constructor to support this.
Concurrency javadoc added; getMetadataKeyCached captures all state in locals to ensure thread safety with at worst redundant decoding.
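
As an illustration of the bulk-read idea above, here is a minimal sketch, not the code in this PR. The class name, the method signature, and the byte-wise comparison helper are assumptions made for the example; only the name readUnsignedLittleEndian and the LITTLE_ENDIAN buffer order come from the description. It assumes the buffer's order is already LITTLE_ENDIAN and that widths are 1 to 4 bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch; the real VariantUtil method may differ in signature and behavior. */
final class LittleEndianReads {
  private LittleEndianReads() {}

  /**
   * Reads an unsigned little-endian integer of 1 to 4 bytes at an absolute position.
   * Assumes buf.order() is ByteOrder.LITTLE_ENDIAN, so getShort/getInt decode correctly.
   */
  static int readUnsignedLittleEndian(ByteBuffer buf, int pos, int numBytes) {
    switch (numBytes) {
      case 1:
        return buf.get(pos) & 0xFF;
      case 2:
        return buf.getShort(pos) & 0xFFFF;
      case 3:
        // One 16-bit read for the low two bytes, one byte read for the high byte.
        return (buf.getShort(pos) & 0xFFFF) | ((buf.get(pos + 2) & 0xFF) << 16);
      case 4:
        // A value above Integer.MAX_VALUE would come back negative; callers must allow for that.
        return buf.getInt(pos);
      default:
        throw new IllegalArgumentException("numBytes must be 1..4, got " + numBytes);
    }
  }

  /** Byte-at-a-time equivalent, kept only to compare against the bulk version. */
  static int readUnsignedByteWise(ByteBuffer buf, int pos, int numBytes) {
    int result = 0;
    for (int i = 0; i < numBytes; i++) {
      result |= (buf.get(pos + i) & 0xFF) << (8 * i);
    }
    return result;
  }

  public static void main(String[] args) {
    ByteBuffer buf =
        ByteBuffer.wrap(new byte[] {0x01, 0x02, 0x03, 0x04}).order(ByteOrder.LITTLE_ENDIAN);
    System.out.println(readUnsignedLittleEndian(buf, 0, 3)); // prints 197121 (0x030201)
  }
}
```

The absolute-position getters do not move the buffer's position, so a read like this has no side effects on shared buffers beyond the byte order set up front.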

Includes @steveloughran's string converter optimization from #3452: VariantBuilder.appendAsString(Binary) and its use in VariantConverters.

Are these changes tested?

Ran the benchmarks from #3452 locally:
~3.2x variant deserialization speedup for the Flat case and ~3.1x for Nested (ratios of the baseline and PR scores below)

Command run:

java -jar parquet-benchmarks/target/parquet-benchmarks.jar "VariantBuilderBenchmark.deserializeVariant" -f 1 -wi 3 -i 500 -t 1

Baseline

Master branch plus the benchmark code from #3452, to get the current numbers:

Benchmark                                   (depth)  (fieldCount)  Mode  Cnt      Score     Error  Units
VariantBuilderBenchmark.deserializeVariant     Flat           200    ss  500   8875.229 ± 116.528  us/op
VariantBuilderBenchmark.deserializeVariant   Nested           200    ss  500  11728.253 ± 107.884  us/op

This PR changes:

Benchmark                                   (depth)  (fieldCount)  Mode  Cnt     Score     Error  Units
VariantBuilderBenchmark.deserializeVariant     Flat           200    ss  500  2765.264 ±  97.714  us/op
VariantBuilderBenchmark.deserializeVariant   Nested           200    ss  500  3802.665 ± 102.249  us/op

Are there any user-facing changes?

No.

steveloughran

code looks good; made some minor changes.

This should make a very big difference when selectively retrieving multiple fields within a single variant, or within a variant and nested children.

I do worry about concurrency now. The existing Variant didn't have issues here precisely because it recalculated everything.

We have to be confident that even if concurrent access triggers a duplicate cache operation, there's no harm in this. Otherwise cache access will have to be synchronized.
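
To make the "duplicate cache operation is harmless" condition concrete, here is a minimal sketch of a racy but idempotent lazy cache. It is not the code in this PR: LazyKeyCache, getKey, and the byte[][] input are hypothetical names chosen for the example. The assumptions are that the encoded input never changes after construction and that the cached values are immutable (String here), so two threads decoding the same id produce equal results and the only cost of a race is repeated work.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of an unsynchronized, idempotent lazy cache; names are illustrative. */
final class LazyKeyCache {
  private final byte[][] encodedKeys; // assumed never modified after construction
  private String[] cache;             // deliberately neither volatile nor synchronized

  LazyKeyCache(byte[][] encodedKeys) {
    this.encodedKeys = encodedKeys;
  }

  String getKey(int id) {
    // Read the shared field exactly once; everything below works on the local copy,
    // so a concurrent writer cannot change what this call sees halfway through.
    String[] local = cache;
    if (local == null) {
      local = new String[encodedKeys.length];
      cache = local; // a racing thread may replace this array; only cached work is lost
    }
    String key = local[id];
    if (key == null) {
      // Decoding is a pure function of immutable input, so a duplicate decode
      // by another thread yields an equal String.
      key = new String(encodedKeys[id], StandardCharsets.UTF_8);
      local[id] = key;
    }
    return key;
  }

  public static void main(String[] args) {
    LazyKeyCache keys = new LazyKeyCache(new byte[][] {
      "id".getBytes(StandardCharsets.UTF_8), "name".getBytes(StandardCharsets.UTF_8)
    });
    System.out.println(keys.getKey(1)); // prints name
  }
}
```

The pattern only holds while the cached values are immutable and the decode has no side effects; a cache of mutable objects would need real synchronization.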

It all looks good to me.

alamb · 2026-04-15T17:18:08Z

I started the workflows

steveloughran

Reviewing the code again to see if it's possible to get back some of the performance lost with the move to volatile.

We're only reading and caching data, so there are no real write conflicts. Is the use of volatile everywhere over-cautious? It's forcing memory reads everywhere.

And I don't know how common cross-thread reading will actually be in production systems; in Spark each worker is its own thread, after all.

Maybe the goal should just be all reviewers being confident that if there are dual writers, the output will always be consistent.
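
One way to reason about dual writers without volatile is to cache an immutable holder in a plain field. The sketch below is an assumption-laden illustration, not this PR's implementation: LazyHeader, Header, and the two-byte layout in parse are made up for the example and are not the Variant encoding. It relies on the Java memory model's guarantee that final fields of a fully constructed object are visible to any thread that sees a reference to it.

```java
/** Hypothetical sketch: a parsed header cached in a plain field via an immutable holder. */
final class LazyHeader {
  /** Immutable snapshot; the final fields are what make unsynchronized publication safe. */
  static final class Header {
    final int offsetSize;
    final int numElements;

    Header(int offsetSize, int numElements) {
      this.offsetSize = offsetSize;
      this.numElements = numElements;
    }
  }

  private final byte[] value; // assumed never modified after construction
  private Header header;      // plain field: no volatile, no lock

  LazyHeader(byte[] value) {
    this.value = value;
  }

  Header header() {
    Header h = header; // single read of the shared field
    if (h == null) {
      h = parse(value);
      header = h; // two racing writers store equal headers, so either one is fine
    }
    return h;
  }

  private static Header parse(byte[] value) {
    // Made-up layout for illustration only: byte 0 is the offset size, byte 1 the element count.
    return new Header(value[0] & 0xFF, value[1] & 0xFF);
  }

  public static void main(String[] args) {
    LazyHeader lazy = new LazyHeader(new byte[] {2, 10});
    System.out.println(lazy.header().numElements); // prints 10
  }
}
```

A reader either sees null and re-parses, or sees a complete Header; it never sees a half-initialized one. Caching the same values as several separate non-final fields would not give that guarantee.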

steveloughran · 2026-04-21T09:57:16Z

I would propose a javadoc on concurrency here.

Until now we've had a mutable builder; the caching makes the Variant itself mutable. But even in a race condition, the only change which ever takes place is a decode and an update of the dictionary, so if the dictionary is safe then the worst outcome is that one of the lazy evals gets lost.

Let's
- make sure that the thread doing the lazy eval retains the values it needs for the duration of get()
- look at the dictionary impl.

Java HashMap is not thread safe; the very old Hashtable is, but it may have its own penalty when used.

That Rust variant puts a lot of effort into memory efficiency too. Maybe we should make sure that these changes don't completely explode memory consumption. I know, it's a tradeoff (speed, space, code complexity). And queries like speed. And I think the focus should be "single thread speed and no inconsistency on multithreaded use", as single-threaded workers are what the query engines use. After all, they shouldn't expect the input streams to be thread-safe, should they? Two threads doing parallel reads of a stream is already making some big assumptions about the underlying layers (*)

(*) Hadoop input streams are thread safe precisely because code makes those assumptions, FWIW. Going thread unsafe broke HBase.

steveloughran · 2026-04-23T13:54:45Z

@Fokko could you approve the test runs here? thanks

nssalian · 2026-05-05T21:27:22Z

@alamb @emkornfield PTAL

steveloughran · 2026-05-14T09:23:40Z

@nssalian now the variant benchmark is in, can you merge the parquet master branch in to this PR so I can run the benchmarks over it and compare to master itself? thanks

nssalian · 2026-05-14T19:35:21Z

Thanks @steveloughran. I'll follow up with the changes.

steveloughran

this is good, does a lot of the hardening.

+1 (non binding) from me

Fokko · 2026-05-19T10:15:17Z

Thanks @nssalian for working on this, and thanks @steveloughran for the review 👍

steveloughran · 2026-05-19T10:44:21Z

nice!

steveloughran · 2026-05-19T13:50:09Z

Rebased #3562 onto this; resilience tests happy

nssalian marked this pull request as ready for review April 15, 2026 15:00

steveloughran reviewed Apr 15, 2026

Fix thread-safety in Variant lazy caches and add comments

6f540f4

Co-authored-by: Steve Loughran <stevel@cloudera.com>

nssalian force-pushed the variant-read-changes branch from 6c6db2e to 6f540f4 Compare April 15, 2026 18:05

nssalian requested a review from steveloughran April 15, 2026 18:08

nssalian changed the title ~~WIP: Optimizing Variant read path with lazy caching~~ Optimizing Variant read path with lazy caching Apr 15, 2026

steveloughran mentioned this pull request Apr 15, 2026

GH-3471: Fix ByteBuffer handling in VariantUtil and VariantBuilder #3472

Merged

steveloughran reviewed Apr 17, 2026

Comment thread parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java Outdated

steveloughran reviewed Apr 17, 2026

Comment thread parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java Outdated

Remove unnecessary volatile fields and fix PR comments

fcbef75

nssalian mentioned this pull request Apr 23, 2026

GH-3451. Add a JMH benchmark for variants #3452

Merged

nssalian added 2 commits April 27, 2026 10:19

Merge remote-tracking branch 'apache/master' into variant-read-changes

69a2c1d

Add readUnsignedLittleEndian for bulk ByteBuffer reads and concurrenc…

c6cc9ed

…y javadoc

steveloughran reviewed Apr 27, 2026

Comment thread parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java Outdated

Comment thread parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

Comment thread parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

steveloughran mentioned this pull request Apr 30, 2026

Core, Spark: Performant queries over (shredded) Variant data apache/iceberg#16172

Open

3 tasks

This was referenced May 14, 2026

Harden variant decoding #3561

Open

GH-3561 Harden variants against malformed metadata. #3562

Open

steveloughran reviewed May 14, 2026

Comment thread parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

Comment thread parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java Outdated

nssalian added 3 commits May 14, 2026 13:04

Merge remote-tracking branch 'apache/master' into variant-read-changes

1805dfc

PR comments

1a9e455

Merge remote-tracking branch 'apache/master' into variant-read-changes

8454687

steveloughran approved these changes May 15, 2026

Fokko approved these changes May 19, 2026

Fokko merged commit 7be05b4 into apache:master May 19, 2026
5 checks passed

nssalian deleted the variant-read-changes branch May 19, 2026 18:55
