Optimizing Variant read path with lazy caching#3481
Optimizing Variant read path with lazy caching#3481nssalian wants to merge 1 commit intoapache:masterfrom
Conversation
steveloughran
left a comment
There was a problem hiding this comment.
code looks good; made some minor changes.
This should make a very big difference when selectively retrieving multiple fields within a single variant, or within a variant and nested children.
I do worry about concurrency now. The existing Variant didn't have issues here precisely because it recalculated everything.
We have to be confident that even if concurrent access triggers a duplicate cache operation, there's no harm in this. Otherwise cache access will have to be synchronized.
It all looks good to me.
|
I started the workflows |
Co-authored-by: Steve Loughran <stevel@cloudera.com>
6c6db2e to
6f540f4
Compare
| final ByteBuffer metadata; | ||
|
|
||
| /** | ||
| * Pre-computed metadata dictionary size |
There was a problem hiding this comment.
nit. add a "." at the end as some java versions' javadocs really blow up if you don't.
steveloughran
left a comment
There was a problem hiding this comment.
reviewing the code again to see if it's possible to get back some of those performance numbers lost with the move to volatile.
We're only reading and caching data, so there's no real write conflicts -is the use of volatile everywhere being over-cautious? it's forcing memory reads everywhere.
And I don't know how common cross-thread reading will actually be in production systems; in spark each worker is its own thread, after all.
Maybe the goal should just be all reviewers being confident that if there are dual writers, the output will always be consistent.
| cache = new String[dictSize]; | ||
| metadataCache = cache; | ||
| } | ||
| if (cache[id] == null) { |
There was a problem hiding this comment.
set cache to be metadataCache here, so if there is any race condition the same cache array is updated. If two threads are writing to the same [id], well, one lookup is wasted. The joint cache is still (probably) updated with each value
Rationale for this change
Profiling in 3452 identified
Variant.getFieldAtIndex()and metadata string lookups as hotspots during variant reads. Every call togetFieldByKey,getFieldAtIndex, andgetElementAtIndexre-parses headers and re-allocates objects that could be cached.What changes are included in this PR?
Adds lazy caching to
Variant.javafor metadata strings, object headers, and array headers. Field lookups ingetFieldByKeynow defer value construction until a match is found, and child Variants share the parent's metadata cache. Also removes two unused static helper methods.Includes @steveloughran's string converter optimization from #3452:
VariantBuilder.appendAsString(Binary)and its use inVariantConverters.Are these changes tested?
Ran the benchmarks from 3452 locally
Before:
After:
Most recent benchmarks after the thread safety changes
Are there any user-facing changes?
No.