Add ArrayTermInSetQuery to sandbox/ along with a JMH benchmark#16051
Open
GovindBalaji-S-Glean wants to merge 2 commits into
Adds an alternative implementation of TermInSetQuery in the sandbox module that stores terms as a sorted BytesRef[] over a packed byte[] instead of PrefixCodedTerms. Trades some RAM for cheaper per-segment iteration and a vectorized equals/hashCode fast path. Includes a JMH benchmark in benchmark-jmh comparing both queries across numTerms (30/300/3k/30k), numSegments (5/20/50), indexContent (QUERY_ONLY/SPARSE/RANDOM_50K), and inputShape (UNSORTED_LIST/SORTED_SET). See dev@ thread: https://lists.apache.org/thread/ct04woc11hh9vclhscz8pkdozv6xoy6k
Description
PR for Lucene dev@ thread: https://lists.apache.org/thread/ct04woc11hh9vclhscz8pkdozv6xoy6k

Adds an alternative `TermInSetQuery` implementation in the sandbox module that stores terms as a sorted `BytesRef[]` wrapping a packed `byte[]`, rather than `PrefixCodedTerms`. Trades some RAM for cheaper per-segment iteration and a vectorized `equals`/`hashCode` fast path on the cache hot path. Includes a JMH benchmark in `benchmark-jmh`.

Build / test
- `./gradlew :lucene:sandbox:test --tests "*ArrayTermInSet*"` → 17 tests pass on JDK 25.
- `./gradlew :lucene:benchmark-jmh:assemble` → green.
- `./gradlew tidy` → no formatting changes outside the 4 files in this PR.
- `./gradlew :lucene:sandbox:check :lucene:benchmark-jmh:check` → green on JDK 25 (Temurin 25.0.3).

Benchmark
Full matrix on a 32-core / 125 GB GCP n2 VM, JDK 25 (Temurin 25.0.3), JMH 1.37, single fork, 3 warmup × 5 measurement × 5 s.
Benchmark parameters
- `numTerms`: 30, 300, 3000, 30000
- `numSegments`: 5, 20, 50 (`construct+iterate*` only)
- `indexContent`: `QUERY_ONLY` · `SPARSE` · `RANDOM_50K` (`construct+iterate*` only)
- `inputShape`: `UNSORTED_LIST` · `SORTED_SET` (`construct*` and `construct+iterate*`)

indexContent
- `QUERY_ONLY`: every query term is indexed in every segment. All seeks hit.
- `SPARSE`: only 2 deterministic terms (first + middle of the query set) are indexed per segment, so most seeks miss. This is the shape that hurts `TermInSetQuery` the most, because `PrefixCodedTerms` decodes per term regardless of hit/miss.
- `RANDOM_50K`: 50 000 random terms per segment, independent of the query. Zero hits, but a large dictionary to navigate. Stresses the seek path on a deep terms tree.

inputShape
- `UNSORTED_LIST`: `Arrays.asList(shuffled)`. Both queries radix-sort internally, so this measures sort + storage cost.
- `SORTED_SET`: `TreeSet<BytesRef>` (natural-order comparator). Both queries hit their skip-sort fast path, so this isolates storage-shape cost from sort cost.

Benchmark methods
- `construct{TermInSet,ArrayTermInSet}Query`
- `constructAndIterate{TermInSet,ArrayTermInSet}Query`
- `equals{TermInSetQuery, ArrayTermInSetQuery, FlatPacked, PackedPlusLengths}`: `LRUQueryCache` hot path.

Numbers are µs/op average.
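The `SORTED_SET` input shape depends on a skip-sort fast path. As a rough illustration of the idea (not the code in this PR), the sketch below skips the sort when the input is a `SortedSet` with a null comparator, i.e. natural ordering. `Term` is a hypothetical stand-in for `BytesRef`, `sortsPerformed` exists only to make the behaviour observable, and `Arrays.sort` stands in for the radix sort the real queries use. Duplicate handling is omitted.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collection;
import java.util.SortedSet;

/** Hypothetical stand-in for Lucene's BytesRef: bytes compared in unsigned order. */
final class Term implements Comparable<Term> {
  final byte[] bytes;

  Term(String s) {
    this.bytes = s.getBytes(StandardCharsets.UTF_8);
  }

  @Override
  public int compareTo(Term other) {
    return Arrays.compareUnsigned(bytes, other.bytes);
  }

  @Override
  public String toString() {
    return new String(bytes, StandardCharsets.UTF_8);
  }
}

final class SkipSortSketch {
  /** Counts how often the sort actually ran; for illustration only. */
  static int sortsPerformed = 0;

  /** Returns the terms in sorted order, sorting only when the input does not already guarantee it. */
  static Term[] toSortedArray(Collection<Term> terms) {
    Term[] out = terms.toArray(new Term[0]);
    // A SortedSet with a null comparator iterates in natural order,
    // which is the order we need, so the sort can be skipped.
    boolean alreadySorted =
        terms instanceof SortedSet && ((SortedSet<?>) terms).comparator() == null;
    if (!alreadySorted) {
      Arrays.sort(out); // assumption: comparison sort here, radix sort in the real queries
      sortsPerformed++;
    }
    return out;
  }
}
```

With this shape, an `Arrays.asList(...)` input pays for one sort, while a `TreeSet<Term>` built with the default constructor pays only for the copy into the array.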
`construct` only: query ctor cost
The skip-sort fast path nets a clean ~2× across all sizes; on `UNSORTED_LIST` the gap collapses at 30k terms because the radix sort dominates and both queries pay it.

`construct + per-segment iterate`: the path real queries take, slice at `numSegments=20`
Full benchmark output: https://gist.github.com/GovindBalaji-S-Glean/fe4af91bead4dcd2390fa4da381b09e7
Pattern:
- `QUERY_ONLY`, `RANDOM_50K`: Array wins consistently, 1.1–1.4×. Iteration cost dominates; storage shape matters less.
- `SPARSE` (2 indexed terms vs a 30/300/3k/30k-term query): Array wins 1.5–4.5×, and the gap widens with `numTerms × numSegments`. This is exactly the case where TIS's `PrefixCodedTerms` decode runs once per query term per segment, while Array just walks a sorted `BytesRef[]`.

`equals` (cache-hit equality on equal queries)
The vectorized `Arrays.equals` fast path nets a clean 2.58× / 1.31× win at 300 / 3 000 terms; at 30 000 terms both are statistically indistinguishable (~13 µs ± a few µs), as the byte compare has saturated memory bandwidth.

Trade-off
Array keeps the sorted `byte[]` + offsets index live for the lifetime of the query, whereas TIS's `PrefixCodedTerms` compresses common prefixes. For small/medium term counts this is negligible; for 30k-term sets the per-query memory difference is on the order of a few hundred KB.
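To make the storage shape concrete, here is a minimal sketch of a packed `byte[]` plus offsets index, with `equals`/`hashCode` delegating to whole-array `Arrays.equals` and `Arrays.hashCode`. The class name `PackedTermsSketch`, its methods, and the `int[]` offsets layout are assumptions for illustration; the PR's actual class wraps the packed bytes in a `BytesRef[]` and may differ in detail.

```java
import java.util.Arrays;

/** Hypothetical packed term storage: one byte[] plus an offsets index. */
final class PackedTermsSketch {
  private final byte[] packed; // all term bytes, concatenated in the given (sorted) order
  private final int[] offsets; // term i occupies packed[offsets[i] .. offsets[i + 1])

  PackedTermsSketch(byte[][] sortedTerms) {
    offsets = new int[sortedTerms.length + 1];
    int total = 0;
    for (int i = 0; i < sortedTerms.length; i++) {
      offsets[i] = total;
      total += sortedTerms[i].length;
    }
    offsets[sortedTerms.length] = total;
    packed = new byte[total];
    for (int i = 0; i < sortedTerms.length; i++) {
      System.arraycopy(sortedTerms[i], 0, packed, offsets[i], sortedTerms[i].length);
    }
  }

  int size() {
    return offsets.length - 1;
  }

  /** Copies term i out of the packed array; there is no decode step. */
  byte[] term(int i) {
    return Arrays.copyOfRange(packed, offsets[i], offsets[i + 1]);
  }

  /** Bytes held by the two arrays, ignoring object headers. */
  long payloadBytes() {
    return packed.length + (long) Integer.BYTES * offsets.length;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof PackedTermsSketch)) return false;
    PackedTermsSketch other = (PackedTermsSketch) o;
    // Two bulk array comparisons instead of a term-by-term walk.
    return Arrays.equals(offsets, other.offsets) && Arrays.equals(packed, other.packed);
  }

  @Override
  public int hashCode() {
    return 31 * Arrays.hashCode(offsets) + Arrays.hashCode(packed);
  }
}
```

Nothing is prefix-compressed in this layout, so every term's bytes stay resident in full, which is the RAM side of the trade-off; in exchange, equality reduces to bulk array comparisons that the JDK can typically vectorize.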