Add ArrayTermInSetQuery to sandbox/ along with a JMH benchmark by GovindBalaji-S-Glean · Pull Request #16051 · apache/lucene

GovindBalaji-S-Glean · 2026-05-12T09:49:41Z

Description

PR for Lucene dev@ thread: https://lists.apache.org/thread/ct04woc11hh9vclhscz8pkdozv6xoy6k

Adds an alternative TermInSetQuery implementation in the sandbox module that stores terms as a sorted BytesRef[] wrapping a packed byte[], rather than PrefixCodedTerms. Trades some RAM for cheaper per-segment iteration and a vectorized equals/hashCode fast path on the cache hot path. Includes a JMH benchmark in benchmark-jmh.

Build / test

./gradlew :lucene:sandbox:test --tests "*ArrayTermInSet*" → 17 tests pass on JDK 25.
./gradlew :lucene:benchmark-jmh:assemble → green.
./gradlew tidy → no formatting changes outside the 4 files in this PR.
./gradlew :lucene:sandbox:check :lucene:benchmark-jmh:check → green on JDK 25 (Temurin 25.0.3).

Benchmark

Full matrix on a 32-core / 125 GB GCP n2 VM, JDK 25 (Temurin 25.0.3), JMH 1.37, single fork, 3 warmup × 5 measurement × 5 s.

Benchmark parameters

Param	Values	What it varies	Used by
`numTerms`	`30`, `300`, `3000`, `30000`	Number of terms in the query — from a tiny ACL filter to a large enumerated set.	all benchmarks
`numSegments`	`5`, `20`, `50`	Number of segments in the index. Each segment pays the per-term seek cost, so this scales the iteration phase.	`construct+iterate*` only
`indexContent`	`QUERY_ONLY` · `SPARSE` · `RANDOM_50K`	What lives in each segment's terms dictionary (see below).	`construct+iterate*` only
`inputShape`	`UNSORTED_LIST` · `SORTED_SET`	How the query terms are passed to the constructor (see below).	`construct` and `construct+iterate`

`indexContent`

QUERY_ONLY — every query term is indexed in every segment. All seeks hit.
SPARSE — only 2 deterministic terms (first + middle of the query set) per segment, so most seeks miss. The shape that hurts TermInSetQuery the most because PrefixCodedTerms decodes per term regardless of hit/miss.
RANDOM_50K — 50 000 random terms per segment, independent of the query. Zero hits, but a large dictionary to navigate. Stresses the seek path on a deep terms tree.

`inputShape`

UNSORTED_LIST — Arrays.asList(shuffled). Both queries radix-sort internally, so this measures sort + storage cost.
SORTED_SET — TreeSet<BytesRef> (natural-order comparator). Both queries hit their skip-sort fast path, so this isolates storage-shape cost from sort cost.

Benchmark methods

Method	Measures
`construct{TermInSet,ArrayTermInSet}Query`	Constructor cost only.
`constructAndIterate{TermInSet,ArrayTermInSet}Query`	End-to-end query execution: ctor + per-segment terms iteration.
`equals{TermInSetQuery, ArrayTermInSetQuery, FlatPacked, PackedPlusLengths}`	Cache-hit equality on equal queries — the `LRUQueryCache` hot path.

Numbers are µs/op average.

`construct` only — query ctor cost

numTerms	shape	Array µs	TIS µs	TIS / Array
30	SORTED_SET	0.86	2.28	2.6×
30	UNSORTED_LIST	2.87	4.06	1.4×
300	SORTED_SET	11.1	22.0	2.0×
300	UNSORTED_LIST	32.8	43.5	1.3×
3 000	SORTED_SET	108	215	2.0×
3 000	UNSORTED_LIST	361	477	1.3×
30 000	SORTED_SET	1 106	2 182	2.0×
30 000	UNSORTED_LIST	5 400	5 952	1.1×

Skip-sort fast path nets a clean ~2× across all sizes; on UNSORTED_LIST the gap collapses at 30k terms because the radix sort dominates and both queries pay it.

`construct + per-segment iterate` — the path real queries take, slice at `numSegments=20`

Full benchmark output: https://gist.github.com/GovindBalaji-S-Glean/fe4af91bead4dcd2390fa4da381b09e7

numTerms	indexContent	shape	Array µs	TIS µs	TIS / Array
300	QUERY_ONLY	SORTED_SET	902	1 049	1.16×
300	RANDOM_50K	SORTED_SET	2 293	2 687	1.17×
300	SPARSE	SORTED_SET	108	193	1.79×
3 000	QUERY_ONLY	SORTED_SET	10 071	11 229	1.11×
3 000	SPARSE	SORTED_SET	356	1 212	3.41×
30 000	QUERY_ONLY	SORTED_SET	87 202	118 585	1.36×
30 000	RANDOM_50K	SORTED_SET	60 005	76 420	1.27×
30 000	SPARSE	SORTED_SET	2 819	11 593	4.11×

Pattern:

Dense indexes (QUERY_ONLY, RANDOM_50K): Array wins consistently, 1.1–1.4×. Iteration cost dominates; storage shape matters less.
Sparse indexes (SPARSE — 2 indexed terms vs a 30/300/3k/30k-term query): Array wins 1.5–4.5× and the gap widens with numTerms × numSegments. This is exactly the case where TIS's PrefixCodedTerms decode runs once per query term per segment while Array just walks a sorted BytesRef[].

`equals` (cache-hit equality on equal queries)

numTerms	Array µs	TIS µs	TIS / Array
300	0.091 ± 0.001	0.235 ± 0.001	2.58×
3 000	1.030 ± 0.005	1.354 ± 0.008	1.31×
30 000	13.92 ± 1.97	12.86 ± 2.29	noise (intervals overlap)

Vectorized Arrays.equals fast path nets a clean 2.58× / 1.31× win at 300 / 3 000 terms; at 30 000 terms both are statistically indistinguishable (~13 µs ± a few µs) — byte compare has saturated memory bandwidth.

Trade-off

Array keeps the sorted byte[] + offsets index live for the lifetime of the query, vs TIS's PrefixCodedTerms which compresses common prefixes. For small/medium term counts this is negligible; for 30k term sets the per-query memory difference is on the order of a few hundred KB.

Adds an alternative implementation of TermInSetQuery in the sandbox module that stores terms as a sorted BytesRef[] over a packed byte[] instead of PrefixCodedTerms. Trades some RAM for cheaper per-segment iteration and a vectorized equals/hashCode fast path. Includes a JMH benchmark in benchmark-jmh comparing both queries across numTerms (30/300/3k/30k), numSegments (5/20/50), indexContent (QUERY_ONLY/SPARSE/RANDOM_50K), and inputShape (UNSORTED_LIST/SORTED_SET). See dev@ thread: https://lists.apache.org/thread/ct04woc11hh9vclhscz8pkdozv6xoy6k

GovindBalaji-S-Glean added 2 commits May 12, 2026 14:49

Update CHANGES.txt

8220ea2

GovindBalaji-S-Glean marked this pull request as ready for review May 12, 2026 09:51

github-actions Bot added the module:sandbox label May 12, 2026

github-actions Bot added this to the 11.0.0 milestone May 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ArrayTermInSetQuery to sandbox/ along with a JMH benchmark#16051

Add ArrayTermInSetQuery to sandbox/ along with a JMH benchmark#16051
GovindBalaji-S-Glean wants to merge 2 commits into
apache:mainfrom
GovindBalaji-S-Glean:array-tis-sandbox

GovindBalaji-S-Glean commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

GovindBalaji-S-Glean commented May 12, 2026

Description

Build / test

Benchmark

Benchmark parameters

indexContent

inputShape

Benchmark methods

construct only — query ctor cost

construct + per-segment iterate — the path real queries take, slice at numSegments=20

equals (cache-hit equality on equal queries)

Trade-off

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`indexContent`

`inputShape`

`construct` only — query ctor cost

`construct + per-segment iterate` — the path real queries take, slice at `numSegments=20`

`equals` (cache-hit equality on equal queries)