Skip to content

Add ArrayTermInSetQuery to sandbox/ along with a JMH benchmark#16051

Open
GovindBalaji-S-Glean wants to merge 2 commits into
apache:mainfrom
GovindBalaji-S-Glean:array-tis-sandbox
Open

Add ArrayTermInSetQuery to sandbox/ along with a JMH benchmark#16051
GovindBalaji-S-Glean wants to merge 2 commits into
apache:mainfrom
GovindBalaji-S-Glean:array-tis-sandbox

Conversation

@GovindBalaji-S-Glean
Copy link
Copy Markdown

Description

PR for Lucene dev@ thread: https://lists.apache.org/thread/ct04woc11hh9vclhscz8pkdozv6xoy6k

Adds an alternative TermInSetQuery implementation in the sandbox module that stores terms as a sorted BytesRef[] wrapping a packed byte[], rather than PrefixCodedTerms. Trades some RAM for cheaper per-segment iteration and a vectorized equals/hashCode fast path on the cache hot path. Includes a JMH benchmark in benchmark-jmh.

Build / test

  • ./gradlew :lucene:sandbox:test --tests "*ArrayTermInSet*" → 17 tests pass on JDK 25.
  • ./gradlew :lucene:benchmark-jmh:assemble → green.
  • ./gradlew tidy → no formatting changes outside the 4 files in this PR.
  • ./gradlew :lucene:sandbox:check :lucene:benchmark-jmh:check → green on JDK 25 (Temurin 25.0.3).

Benchmark

Full matrix on a 32-core / 125 GB GCP n2 VM, JDK 25 (Temurin 25.0.3), JMH 1.37, single fork, 3 warmup × 5 measurement × 5 s.

Benchmark parameters

Param Values What it varies Used by
numTerms 30, 300, 3000, 30000 Number of terms in the query — from a tiny ACL filter to a large enumerated set. all benchmarks
numSegments 5, 20, 50 Number of segments in the index. Each segment pays the per-term seek cost, so this scales the iteration phase. construct+iterate* only
indexContent QUERY_ONLY  ·  SPARSE  ·  RANDOM_50K What lives in each segment's terms dictionary (see below). construct+iterate* only
inputShape UNSORTED_LIST  ·  SORTED_SET How the query terms are passed to the constructor (see below). construct* and construct+iterate*

indexContent

  • QUERY_ONLY — every query term is indexed in every segment. All seeks hit.
  • SPARSE — only 2 deterministic terms (first + middle of the query set) per segment, so most seeks miss. The shape that hurts TermInSetQuery the most because PrefixCodedTerms decodes per term regardless of hit/miss.
  • RANDOM_50K — 50 000 random terms per segment, independent of the query. Zero hits, but a large dictionary to navigate. Stresses the seek path on a deep terms tree.

inputShape

  • UNSORTED_LISTArrays.asList(shuffled). Both queries radix-sort internally, so this measures sort + storage cost.
  • SORTED_SETTreeSet<BytesRef> (natural-order comparator). Both queries hit their skip-sort fast path, so this isolates storage-shape cost from sort cost.

Benchmark methods

Method Measures
construct{TermInSet,ArrayTermInSet}Query Constructor cost only.
constructAndIterate{TermInSet,ArrayTermInSet}Query End-to-end query execution: ctor + per-segment terms iteration.
equals{TermInSetQuery, ArrayTermInSetQuery, FlatPacked, PackedPlusLengths} Cache-hit equality on equal queries — the LRUQueryCache hot path.

Numbers are µs/op average.

construct only — query ctor cost
numTerms shape Array µs TIS µs TIS / Array
30 SORTED_SET 0.86 2.28 2.6×
30 UNSORTED_LIST 2.87 4.06 1.4×
300 SORTED_SET 11.1 22.0 2.0×
300 UNSORTED_LIST 32.8 43.5 1.3×
3 000 SORTED_SET 108 215 2.0×
3 000 UNSORTED_LIST 361 477 1.3×
30 000 SORTED_SET 1 106 2 182 2.0×
30 000 UNSORTED_LIST 5 400 5 952 1.1×

Skip-sort fast path nets a clean ~2× across all sizes; on UNSORTED_LIST the gap collapses at 30k terms because the radix sort dominates and both queries pay it.

construct + per-segment iterate — the path real queries take, slice at numSegments=20

Full benchmark output: https://gist.github.com/GovindBalaji-S-Glean/fe4af91bead4dcd2390fa4da381b09e7

numTerms indexContent shape Array µs TIS µs TIS / Array
300 QUERY_ONLY SORTED_SET 902 1 049 1.16×
300 RANDOM_50K SORTED_SET 2 293 2 687 1.17×
300 SPARSE SORTED_SET 108 193 1.79×
3 000 QUERY_ONLY SORTED_SET 10 071 11 229 1.11×
3 000 SPARSE SORTED_SET 356 1 212 3.41×
30 000 QUERY_ONLY SORTED_SET 87 202 118 585 1.36×
30 000 RANDOM_50K SORTED_SET 60 005 76 420 1.27×
30 000 SPARSE SORTED_SET 2 819 11 593 4.11×

Pattern:

  • Dense indexes (QUERY_ONLY, RANDOM_50K): Array wins consistently, 1.1–1.4×. Iteration cost dominates; storage shape matters less.
  • Sparse indexes (SPARSE — 2 indexed terms vs a 30/300/3k/30k-term query): Array wins 1.5–4.5× and the gap widens with numTerms × numSegments. This is exactly the case where TIS's PrefixCodedTerms decode runs once per query term per segment while Array just walks a sorted BytesRef[].
equals (cache-hit equality on equal queries)
numTerms Array µs TIS µs TIS / Array
300 0.091 ± 0.001 0.235 ± 0.001 2.58×
3 000 1.030 ± 0.005 1.354 ± 0.008 1.31×
30 000 13.92 ± 1.97 12.86 ± 2.29 noise (intervals overlap)

Vectorized Arrays.equals fast path nets a clean 2.58× / 1.31× win at 300 / 3 000 terms; at 30 000 terms both are statistically indistinguishable (~13 µs ± a few µs) — byte compare has saturated memory bandwidth.

Trade-off

Array keeps the sorted byte[] + offsets index live for the lifetime of the query, vs TIS's PrefixCodedTerms which compresses common prefixes. For small/medium term counts this is negligible; for 30k term sets the per-query memory difference is on the order of a few hundred KB.

Adds an alternative implementation of TermInSetQuery in the sandbox
module that stores terms as a sorted BytesRef[] over a packed byte[]
instead of PrefixCodedTerms. Trades some RAM for cheaper per-segment
iteration and a vectorized equals/hashCode fast path.

Includes a JMH benchmark in benchmark-jmh comparing both queries
across numTerms (30/300/3k/30k), numSegments (5/20/50), indexContent
(QUERY_ONLY/SPARSE/RANDOM_50K), and inputShape (UNSORTED_LIST/SORTED_SET).

See dev@ thread:
https://lists.apache.org/thread/ct04woc11hh9vclhscz8pkdozv6xoy6k
@GovindBalaji-S-Glean GovindBalaji-S-Glean marked this pull request as ready for review May 12, 2026 09:51
@github-actions github-actions Bot added this to the 11.0.0 milestone May 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant