
Add RawSortedIndexBasedFilterOperator for binary search on raw sorted columns#18079

Open
xiangfu0 wants to merge 2 commits into apache:master from xiangfu0:claude/determined-moser

Conversation

xiangfu0 (Contributor) commented Apr 2, 2026

Summary

  • Add RawSortedIndexBasedFilterOperator that uses binary search O(log N) on raw sorted forward index columns instead of full scan O(N)
  • For chunk-compressed readers, implements two-level binary search: coarse search at chunk boundaries minimizes decompressions, fine search within cached chunk is free
  • Add MultiChunkReaderContext with an LRU cache of decompressed chunks to avoid repeated decompression during binary search; chunk cache is exposed via ForwardIndexReader.createCachedContext(int numSlots) SPI
  • Supports EQ, NEQ, IN, NOT_IN, RANGE predicates with all data types (INT, LONG, FLOAT, DOUBLE, STRING, BYTES, BIG_DECIMAL)
  • Exposes getNumDocsPerChunk() and createCachedContext() on ForwardIndexReader interface for chunk-aware optimization
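The two-level search in the bullets above can be sketched in a few lines. This is a simplified illustration only, not the PR's actual implementation: `readValue` stands in for the `ForwardIndexReader` value reads, and all names here are hypothetical.

```java
import java.util.function.IntToLongFunction;

// Simplified sketch of the two-level lower-bound search described above.
// readValue stands in for ForwardIndexReader value reads; names are illustrative.
public class TwoLevelSearch {

  // Plain lower bound: smallest docId whose value is >= target.
  static int lowerBound(IntToLongFunction readValue, int numDocs, long target) {
    int lo = 0;
    int hi = numDocs;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (readValue.applyAsLong(mid) < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  // Two-level variant: a coarse search over chunk-head docIds narrows the target
  // down to a two-chunk window, so only O(log numChunks) chunks are ever touched;
  // the fine search then runs entirely inside that small window.
  static int chunkAwareLowerBound(IntToLongFunction readValue, int numDocs, int numDocsPerChunk, long target) {
    int numChunks = (numDocs + numDocsPerChunk - 1) / numDocsPerChunk;
    int loChunk = 0;
    int hiChunk = numChunks;
    while (loChunk < hiChunk) {
      int mid = (loChunk + hiChunk) >>> 1;
      if (readValue.applyAsLong(mid * numDocsPerChunk) < target) {
        loChunk = mid + 1;
      } else {
        hiChunk = mid;
      }
    }
    // The answer lies either in the chunk before the first chunk whose head is
    // >= target, or in that chunk itself; a window of two chunks covers both.
    int startChunk = Math.max(loChunk - 1, 0);
    int lo = startChunk * numDocsPerChunk;
    int hi = Math.min(lo + 2 * numDocsPerChunk, numDocs);
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (readValue.applyAsLong(mid) < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  public static void main(String[] args) {
    long[] data = {1, 1, 2, 3, 5, 5, 5, 8, 9, 10};
    // Both searches agree on the first docId with value >= 5.
    System.out.println(lowerBound(i -> data[i], data.length, 5));              // 4
    System.out.println(chunkAwareLowerBound(i -> data[i], data.length, 4, 5)); // 4
  }
}
```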

Background

Currently SortedIndexBasedFilterOperator only works for dictionary-encoded sorted columns. Raw sorted columns fall through to ScanBasedFilterOperator which linearly scans every document. This PR adds an efficient binary search path for raw sorted columns, matching the optimization that dictionary-encoded sorted columns already enjoy.

Changes

  • ForwardIndexReader.java — Added getNumDocsPerChunk() and createCachedContext(int numSlots) default methods
  • BaseChunkForwardIndexReader.java — Override both methods; adds decompressChunkInto for direct-into-slot decompression
  • MultiChunkReaderContext.java — New LRU chunk cache context (up to N decompressed chunks); handles DELTA/DELTADELTA codecs correctly
  • RawSortedIndexBasedFilterOperator.java — New filter operator with two-level binary search + multi-chunk cache; caches computeMatchingRanges() result for reuse by canOptimizeCount() / canProduceBitmaps()
  • FilterOperatorUtils.java — Route sorted raw SV columns to the new operator
  • RawSortedIndexBasedFilterOperatorTest.java — 21 unit tests covering all predicate types and edge cases
  • MultiChunkReaderContextTest.java — 8 unit tests covering cache hit/miss, LRU eviction, replaceSlot, close cleanup, and integration for all compression types
  • BenchmarkRawSortedIndexFilter.java — JMH microbenchmark comparing binary search vs linear scan
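The LRU slot behavior attributed to MultiChunkReaderContext above can be modeled with a few lines of plain Java. The class below is illustrative only: the real context manages off-heap buffers and codec-specific decompression, while here a long[] stands in for a decompressed chunk.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Illustrative model of an N-slot LRU cache of decompressed chunks, in the
// spirit of MultiChunkReaderContext (not the actual Pinot class).
public class ChunkLruCache {
  private final int _numSlots;
  private final LinkedHashMap<Integer, long[]> _cache;

  public ChunkLruCache(int numSlots) {
    _numSlots = numSlots;
    // accessOrder=true makes iteration order the LRU order, so the eldest
    // entry is the least recently used one.
    _cache = new LinkedHashMap<>(numSlots, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Integer, long[]> eldest) {
        return size() > _numSlots;
      }
    };
  }

  /** Returns the decompressed chunk, invoking decompress only on a cache miss. */
  public long[] getChunk(int chunkId, IntFunction<long[]> decompress) {
    long[] chunk = _cache.get(chunkId);
    if (chunk == null) {
      chunk = decompress.apply(chunkId);
      _cache.put(chunkId, chunk); // may evict the least recently used chunk
    }
    return chunk;
  }

  public boolean contains(int chunkId) {
    return _cache.containsKey(chunkId);
  }

  public int size() {
    return _cache.size();
  }
}
```

During a binary search, probes near the eventual match revisit the same chunks, which is why this kind of cache helps EQ far more than RANGE (whose two bound searches touch mostly disjoint chunks, per the benchmark below).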

Benchmark

JMH microbenchmark on a sorted raw INT forward index (1K docs/chunk, ~10x value repetition).

  • EQ: matches median value; RANGE: ~1% selectivity window around median
  • Machine: Apple M-series, JDK 17, 1 fork, 3 warmup + 5 measurement × 1s, AverageTime
Benchmark                                       compression    numDocs    Score    Error  Units
-----------------------------------------------------------------------------------------------
binarySearchEq                                 PASS_THROUGH  1,000,000    5.6 ±  2.2  us/op
binarySearchEq                                 PASS_THROUGH  5,000,000    5.7 ±  3.3  us/op
binarySearchEq                                 PASS_THROUGH 10,000,000    5.5 ±  3.2  us/op
linearScanEq   (baseline)                      PASS_THROUGH  1,000,000   22.7 ±  5.3  us/op
linearScanEq   (baseline)                      PASS_THROUGH  5,000,000   22.8 ±  7.1  us/op
linearScanEq   (baseline)                      PASS_THROUGH 10,000,000   24.3 ± 10.7  us/op

binarySearchEq                                          LZ4  1,000,000   13.4 ±  1.1  us/op
binarySearchEq                                          LZ4  5,000,000   14.7 ±  1.4  us/op
binarySearchEq                                          LZ4 10,000,000   15.3 ±  0.8  us/op
linearScanEq   (baseline)                               LZ4  1,000,000   23.7 ±  4.9  us/op
linearScanEq   (baseline)                               LZ4  5,000,000   24.3 ±  6.3  us/op
linearScanEq   (baseline)                               LZ4 10,000,000   25.0 ± 11.3  us/op

binarySearchEq   (with chunk cache)               ZSTANDARD  1,000,000   27.4 ±  0.9  us/op
binarySearchEq   (with chunk cache)               ZSTANDARD  5,000,000   31.7 ±  1.8  us/op
binarySearchEq   (with chunk cache)               ZSTANDARD 10,000,000   34.2 ±  2.2  us/op
linearScanEq   (baseline)                         ZSTANDARD  1,000,000   22.7 ±  2.9  us/op
linearScanEq   (baseline)                         ZSTANDARD  5,000,000   25.7 ± 20.0  us/op
linearScanEq   (baseline)                         ZSTANDARD 10,000,000   24.5 ±  9.6  us/op

binarySearchRange                              PASS_THROUGH  1,000,000    5.6 ±  2.7  us/op
binarySearchRange                              PASS_THROUGH  5,000,000    5.6 ±  3.7  us/op
binarySearchRange                              PASS_THROUGH 10,000,000    5.7 ±  2.2  us/op
linearScanRange  (baseline)                    PASS_THROUGH  1,000,000   23.4 ±  5.2  us/op
linearScanRange  (baseline)                    PASS_THROUGH  5,000,000   24.2 ±  6.5  us/op
linearScanRange  (baseline)                    PASS_THROUGH 10,000,000   23.5 ±  5.9  us/op

binarySearchRange                                       LZ4  1,000,000   20.7 ±  0.6  us/op
binarySearchRange                                       LZ4  5,000,000   23.2 ±  0.6  us/op
binarySearchRange                                       LZ4 10,000,000   25.8 ±  1.6  us/op
linearScanRange  (baseline)                             LZ4  1,000,000   23.3 ±  5.2  us/op
linearScanRange  (baseline)                             LZ4  5,000,000   22.9 ±  3.5  us/op
linearScanRange  (baseline)                             LZ4 10,000,000   25.5 ± 14.4  us/op

binarySearchRange  (with chunk cache)             ZSTANDARD  1,000,000   47.8 ±  1.3  us/op  ⚠️
binarySearchRange  (with chunk cache)             ZSTANDARD  5,000,000   56.4 ±  1.5  us/op  ⚠️
binarySearchRange  (with chunk cache)             ZSTANDARD 10,000,000   62.4 ±  1.7  us/op  ⚠️
linearScanRange  (baseline)                       ZSTANDARD  1,000,000   23.9 ±  7.8  us/op
linearScanRange  (baseline)                       ZSTANDARD  5,000,000   25.2 ± 16.5  us/op
linearScanRange  (baseline)                       ZSTANDARD 10,000,000   23.7 ±  6.7  us/op

Key takeaways:

  • Uncompressed (PASS_THROUGH): binary search is ~4× faster (~5–6 µs vs ~23 µs). Cost is flat across 1M–10M docs.
  • LZ4: EQ is ~1.7× faster (~13–15 µs vs ~24 µs); RANGE is comparable (~21–26 µs vs ~23 µs) — the two binary searches (lower + upper bound) probe mostly disjoint chunk sets, limiting cache reuse.
  • ZSTANDARD/EQ: chunk cache reduces cost from ~42 µs (without cache) to ~27–34 µs (vs ~24 µs linear), a 35% improvement. EQ's lower+upper bound searches overlap on chunks near the match point, giving the cache meaningful hits.
  • ZSTANDARD/RANGE: binary search (~48–62 µs) is slower than linear scan (~24 µs). RANGE's lower and upper bound searches probe opposite ends of the index with almost no chunk overlap, so the cache cannot help. ZSTANDARD's per-decompression cost × ~26 distinct chunks searched exceeds a linear scan's first batch.
  • Linear scan time is nearly independent of numDocs because ScanBasedFilterOperator uses batched iteration — only the first batch (~1K docs) is measured here. Real query latency at scale is proportionally worse for linear scan.
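A back-of-envelope check of the decompression counts behind these takeaways, assuming 1K docs/chunk and at most one chunk decompression per coarse binary search probe (the helper below is illustrative, not part of the PR):

```java
// Rough upper bound on chunk decompressions for the benchmark configurations
// above (1K docs/chunk). Illustrative helper, not part of the PR.
public class ProbeCount {

  // A binary search over numChunks chunk heads probes at most
  // ceil(log2(numChunks)) + 1 of them.
  static int coarseProbes(int numDocs, int docsPerChunk) {
    int numChunks = (numDocs + docsPerChunk - 1) / docsPerChunk;
    return 32 - Integer.numberOfLeadingZeros(numChunks - 1) + 1;
  }

  public static void main(String[] args) {
    for (int numDocs : new int[]{1_000_000, 5_000_000, 10_000_000}) {
      int perBound = coarseProbes(numDocs, 1000);
      // A RANGE predicate runs two mostly-disjoint searches (lower + upper bound).
      System.out.println(numDocs + " docs: ~" + perBound + " probes per bound, <= " + 2 * perBound
          + " distinct chunks for RANGE");
    }
  }
}
```

At 10M docs this gives 15 probes per bound and at most 30 distinct chunks across both bounds, an upper bound consistent with the ~26 distinct chunks quoted for ZSTANDARD/RANGE.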

Test plan

  • 21 new unit tests (EQ, NEQ, IN, NOT_IN, RANGE with inclusive/exclusive/unbounded, chunk-aware search, all data types, edge cases)
  • 8 new unit tests for MultiChunkReaderContext (cache hit/miss, LRU eviction, replaceSlot, close, all compression types)
  • 96 existing FilterOperatorUtils tests pass
  • Checkstyle clean
  • Spotless formatted

🤖 Generated with Claude Code

codecov-commenter commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 74.67249% with 58 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.50%. Comparing base (ade9f6c) to head (b7d1b32).

Files with missing lines                                 Patch %   Lines
...ator/filter/RawSortedIndexBasedFilterOperator.java     76.60%   40 Missing and 11 partials ⚠️
...inot/core/operator/filter/FilterOperatorUtils.java     33.33%   0 Missing and 6 partials ⚠️
...t/segment/spi/index/reader/ForwardIndexReader.java      0.00%   1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18079      +/-   ##
============================================
+ Coverage     63.48%   63.50%   +0.01%     
  Complexity     1627     1627              
============================================
  Files          3244     3245       +1     
  Lines        197342   197570     +228     
  Branches      30529    30575      +46     
============================================
+ Hits         125285   125460     +175     
- Misses        62014    62057      +43     
- Partials      10043    10053      +10     
Flag                  Coverage Δ
custom-integration1   100.00% <ø> (ø)
integration           100.00% <ø> (ø)
integration1          100.00% <ø> (ø)
integration2          0.00% <ø> (ø)
java-11               63.47% <74.67%> (+<0.01%) ⬆️
java-21               63.47% <74.67%> (+0.03%) ⬆️
temurin               63.50% <74.67%> (+0.01%) ⬆️
unittests             63.49% <74.67%> (+0.01%) ⬆️
unittests1            55.49% <74.67%> (+0.03%) ⬆️
unittests2            34.93% <0.00%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown.


@xiangfu0 xiangfu0 force-pushed the claude/determined-moser branch from 8817917 to fe87055 on April 2, 2026 08:32
@xiangfu0 xiangfu0 force-pushed the claude/determined-moser branch from fe87055 to f1e6f1d on April 14, 2026 10:07
        && dataSource.getDataSourceMetadata().isSingleValue()
        && queryContext.isIndexUseAllowed(dataSource, FieldConfig.IndexType.SORTED)) {
      return new RawSortedIndexBasedFilterOperator(queryContext, predicateEvaluator, dataSource, numDocs);
    }

this is duplicated with line 140

xiangfu0 and others added 2 commits April 18, 2026 04:31
… columns

For raw (non-dictionary) sorted forward index columns, filter queries previously
fell back to ScanBasedFilterOperator which does a full linear scan O(N). This
change adds a new RawSortedIndexBasedFilterOperator that uses binary search O(log N)
on the forward index to find matching document ID ranges.

Key features:
- Two-level binary search for chunk-compressed readers: coarse search at chunk
  boundaries minimizes decompressions, fine search within cached chunk is free
- Supports EQ, NEQ, IN, NOT_IN, RANGE predicates with all numeric types + STRING
- Exposes getNumDocsPerChunk() on ForwardIndexReader for chunk-aware optimization
- Optimized count and bitmap production from docId ranges

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a column has both a range index and is raw-sorted, the range index
should be preferred over raw sorted binary search. Moved raw sorted
checks after specialized index checks to fix TextMatchTransformFunctionTest.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@xiangfu0 xiangfu0 force-pushed the claude/determined-moser branch from f1e6f1d to b7d1b32 on April 18, 2026 11:32
@xiangfu0 xiangfu0 requested a review from Copilot April 18, 2026 11:38
@xiangfu0 xiangfu0 added performance Related to performance optimization index Related to indexing (general) labels Apr 18, 2026
Copilot AI left a comment

Pull request overview

This PR introduces a new filter operator path for sorted, raw (non-dictionary-encoded) single-value columns, enabling binary-search-based filtering on forward indexes (with chunk-aware optimization when chunk metadata is available). This extends Pinot’s existing “sorted index” optimization beyond dictionary-encoded columns.

Changes:

  • Add RawSortedIndexBasedFilterOperator to compute matching docId ranges via binary search on raw sorted forward indexes (optionally chunk-aware).
  • Extend ForwardIndexReader with getNumDocsPerChunk() and implement it in BaseChunkForwardIndexReader.
  • Update FilterOperatorUtils to route eligible sorted raw SV columns to the new operator and add a new unit test suite.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/reader/ForwardIndexReader.java Adds getNumDocsPerChunk() default method for chunk-aware optimizations.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/forward/BaseChunkForwardIndexReader.java Exposes _numDocsPerChunk via getNumDocsPerChunk().
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/RawSortedIndexBasedFilterOperator.java Implements raw sorted forward-index binary search (including chunk-aware search).
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/FilterOperatorUtils.java Routes sorted raw SV columns to the new operator and prioritizes it similarly to SortedIndexBasedFilterOperator.
pinot-core/src/test/java/org/apache/pinot/core/operator/filter/RawSortedIndexBasedFilterOperatorTest.java Adds unit tests for the new operator (currently focused on INT/LONG/STRING).

Comment on lines +133 to +142
    // Sort values to get ranges in order, then merge adjacent ranges
    String[] sortedValues = getSortedValues(dataType, values);
    List<IntPair> allRanges = new ArrayList<>();
    for (String valueStr : sortedValues) {
      List<IntPair> eqRanges = computeEqRanges(context, dataType, valueStr);
      allRanges.addAll(eqRanges);
    }
    return mergeAdjacentRanges(allRanges);
  }

Copilot AI Apr 18, 2026

computeInRanges() relies on mergeAdjacentRanges() assuming ranges are already in ascending docId order, but getSortedValues() only applies type-correct ordering for INT/LONG/FLOAT/DOUBLE and falls back to lexicographic String sort for other types. For BIG_DECIMAL this can generate out-of-order docId ranges (e.g., values ["2","10"]), which breaks SortedDocIdSet/SortedRangeIntersection expectations and can lead to incorrect AND/OR results. Fix by sorting the collected docId ranges by range.getLeft() (and ideally de-duping/merging overlaps) before constructing SortedDocIdSet, or by implementing type-correct value sorting for BIG_DECIMAL (BigDecimal::compareTo) and BYTES (decoded bytes / ByteArray.compare) as well.

Suggested change (replacing the snippet above):

    // Sort values to improve binary search locality, then normalize the resulting docId ranges because
    // value ordering does not always imply docId-range ordering for all data types.
    String[] sortedValues = getSortedValues(dataType, values);
    List<IntPair> allRanges = new ArrayList<>();
    for (String valueStr : sortedValues) {
      List<IntPair> eqRanges = computeEqRanges(context, dataType, valueStr);
      allRanges.addAll(eqRanges);
    }
    return sortAndMergeRanges(allRanges);
  }

  private List<IntPair> sortAndMergeRanges(List<IntPair> ranges) {
    int numRanges = ranges.size();
    if (numRanges < 2) {
      return ranges;
    }
    ranges.sort((range1, range2) -> {
      int leftCompare = Integer.compare(range1.getLeft(), range2.getLeft());
      return leftCompare != 0 ? leftCompare : Integer.compare(range1.getRight(), range2.getRight());
    });
    List<IntPair> mergedRanges = new ArrayList<>(numRanges);
    IntPair currentRange = ranges.get(0);
    int currentStart = currentRange.getLeft();
    int currentEnd = currentRange.getRight();
    for (int i = 1; i < numRanges; i++) {
      IntPair nextRange = ranges.get(i);
      int nextStart = nextRange.getLeft();
      int nextEnd = nextRange.getRight();
      if (nextStart <= currentEnd + 1) {
        currentEnd = Math.max(currentEnd, nextEnd);
      } else {
        mergedRanges.add(new IntPair(currentStart, currentEnd));
        currentStart = nextStart;
        currentEnd = nextEnd;
      }
    }
    mergedRanges.add(new IntPair(currentStart, currentEnd));
    return mergedRanges;
  }
Comment on lines +409 to +460
  /**
   * Sorts the IN/NOT_IN values according to the column data type ordering, so that binary search results produce
   * ranges in docId order.
   */
  private String[] getSortedValues(DataType dataType, List<String> values) {
    String[] sorted = values.toArray(new String[0]);
    switch (dataType) {
      case INT:
        int[] intVals = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
          intVals[i] = Integer.parseInt(sorted[i]);
        }
        Arrays.sort(intVals);
        for (int i = 0; i < sorted.length; i++) {
          sorted[i] = Integer.toString(intVals[i]);
        }
        break;
      case LONG:
        long[] longVals = new long[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
          longVals[i] = Long.parseLong(sorted[i]);
        }
        Arrays.sort(longVals);
        for (int i = 0; i < sorted.length; i++) {
          sorted[i] = Long.toString(longVals[i]);
        }
        break;
      case FLOAT:
        float[] floatVals = new float[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
          floatVals[i] = Float.parseFloat(sorted[i]);
        }
        Arrays.sort(floatVals);
        for (int i = 0; i < sorted.length; i++) {
          sorted[i] = Float.toString(floatVals[i]);
        }
        break;
      case DOUBLE:
        double[] doubleVals = new double[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
          doubleVals[i] = Double.parseDouble(sorted[i]);
        }
        Arrays.sort(doubleVals);
        for (int i = 0; i < sorted.length; i++) {
          sorted[i] = Double.toString(doubleVals[i]);
        }
        break;
      default:
        // String and others: natural string sort
        Arrays.sort(sorted);
        break;
    }
    return sorted;
  }
Copilot AI Apr 18, 2026

getSortedValues() does not implement type-aware ordering for BIG_DECIMAL or BYTES, but computeInRanges() depends on the values being sorted in the same order as the column to keep docId ranges ordered for SortedDocIdSet. Please add explicit sorting for BIG_DECIMAL (parse to BigDecimal and sort via compareTo) and BYTES (sort by decoded bytes using ByteArray.compare), or remove this method and instead sort the resulting IntPair ranges by docId before merging.
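The ordering mismatch called out here is easy to reproduce. The snippet below (illustrative, not Pinot code) contrasts the current lexicographic fallback with the suggested BigDecimal::compareTo ordering:

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.Comparator;

// Minimal demonstration of the ordering mismatch: the default String sort is
// lexicographic, so it mis-orders numeric BIG_DECIMAL values such as "2" and "10".
public class BigDecimalOrdering {

  static String[] lexicographic(String... values) {
    String[] sorted = values.clone();
    Arrays.sort(sorted); // current getSortedValues() fallback for non-numeric types
    return sorted;
  }

  static String[] numeric(String... values) {
    String[] sorted = values.clone();
    // Type-correct ordering via BigDecimal::compareTo, as suggested.
    Arrays.sort(sorted, Comparator.comparing(BigDecimal::new));
    return sorted;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(lexicographic("2", "10"))); // [10, 2]
    System.out.println(Arrays.toString(numeric("2", "10")));       // [2, 10]
  }
}
```

On a sorted BIG_DECIMAL column the lexicographic order would emit the docId range for "10" before the range for "2", producing out-of-order ranges.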

Comment on lines +414 to +503
  // --- Long data type test ---

  @SuppressWarnings({"unchecked", "rawtypes"})
  private static DataSource createLongDataSource(long[] data) {
    ForwardIndexReader reader = mock(ForwardIndexReader.class);
    when(reader.isDictionaryEncoded()).thenReturn(false);
    when(reader.isSingleValue()).thenReturn(true);
    when(reader.getStoredType()).thenReturn(DataType.LONG);
    when(reader.getNumDocsPerChunk()).thenReturn(0);
    when(reader.createContext()).thenReturn(null);
    for (int i = 0; i < data.length; i++) {
      when(reader.getLong(i, null)).thenReturn(data[i]);
    }

    DataSourceMetadata metadata = mock(DataSourceMetadata.class);
    when(metadata.isSorted()).thenReturn(true);
    when(metadata.isSingleValue()).thenReturn(true);
    when(metadata.getDataType()).thenReturn(DataType.LONG);

    DataSource dataSource = mock(DataSource.class);
    when(dataSource.getForwardIndex()).thenReturn(reader);
    when(dataSource.getDataSourceMetadata()).thenReturn(metadata);
    when(dataSource.getDictionary()).thenReturn(null);
    when(dataSource.getNullValueVector()).thenReturn(null);

    return dataSource;
  }

  @Test
  public void testLongEqPredicate() {
    long[] data = {100L, 200L, 300L, 300L, 400L, 500L};
    DataSource dataSource = createLongDataSource(data);
    QueryContext queryContext = createQueryContext();
    EqPredicate predicate = new EqPredicate(COL_EXPR, "300");
    PredicateEvaluator evaluator =
        EqualsPredicateEvaluatorFactory.newRawValueBasedEvaluator(predicate, DataType.LONG);

    RawSortedIndexBasedFilterOperator operator =
        new RawSortedIndexBasedFilterOperator(queryContext, evaluator, dataSource, data.length);

    int[] matchingDocIds = getMatchingDocIds(operator);
    assertEquals(matchingDocIds, new int[]{2, 3});
  }

  // --- String data type test ---

  @SuppressWarnings({"unchecked", "rawtypes"})
  private static DataSource createStringDataSource(String[] data) {
    ForwardIndexReader reader = mock(ForwardIndexReader.class);
    when(reader.isDictionaryEncoded()).thenReturn(false);
    when(reader.isSingleValue()).thenReturn(true);
    when(reader.getStoredType()).thenReturn(DataType.STRING);
    when(reader.getNumDocsPerChunk()).thenReturn(0);
    when(reader.createContext()).thenReturn(null);
    for (int i = 0; i < data.length; i++) {
      when(reader.getString(i, null)).thenReturn(data[i]);
    }

    DataSourceMetadata metadata = mock(DataSourceMetadata.class);
    when(metadata.isSorted()).thenReturn(true);
    when(metadata.isSingleValue()).thenReturn(true);
    when(metadata.getDataType()).thenReturn(DataType.STRING);

    DataSource dataSource = mock(DataSource.class);
    when(dataSource.getForwardIndex()).thenReturn(reader);
    when(dataSource.getDataSourceMetadata()).thenReturn(metadata);
    when(dataSource.getDictionary()).thenReturn(null);
    when(dataSource.getNullValueVector()).thenReturn(null);

    return dataSource;
  }

  @Test
  public void testStringRangePredicate() {
    String[] data = {"apple", "banana", "cherry", "date", "elderberry", "fig"};
    DataSource dataSource = createStringDataSource(data);
    QueryContext queryContext = createQueryContext();
    RangePredicate predicate =
        new RangePredicate(COL_EXPR, true, "banana", true, "elderberry", DataType.STRING);
    PredicateEvaluator evaluator =
        RangePredicateEvaluatorFactory.newRawValueBasedEvaluator(predicate, DataType.STRING);

    RawSortedIndexBasedFilterOperator operator =
        new RawSortedIndexBasedFilterOperator(queryContext, evaluator, dataSource, data.length);

    int[] matchingDocIds = getMatchingDocIds(operator);
    // banana(1), cherry(2), date(3), elderberry(4)
    assertEquals(matchingDocIds, new int[]{1, 2, 3, 4});
  }
}
Copilot AI Apr 18, 2026

The new operator claims support for multiple raw data types (FLOAT/DOUBLE/BYTES/BIG_DECIMAL), but this test suite only exercises INT/LONG/STRING. Adding focused tests for at least BIG_DECIMAL and BYTES (especially IN/NOT_IN, where range ordering/merging is sensitive) would prevent regressions and would have caught the ordering issue in getSortedValues().

Comment on lines +106 to +108
* not available (e.g., non-chunk-based readers or uncompressed readers). This is used to optimize binary search on
* sorted raw columns by enabling two-level search: coarse search at chunk boundaries followed by fine search within
* the target chunk, minimizing chunk decompressions.
Copilot AI Apr 18, 2026

Javadoc says getNumDocsPerChunk() returns 0 for “uncompressed readers”, but BaseChunkForwardIndexReader can be uncompressed (PASS_THROUGH) and still has meaningful chunk boundaries and returns _numDocsPerChunk. Consider updating the wording to reflect “non-chunk-based readers” (or “readers without chunk metadata”) rather than tying it to compression.

Suggested change (replacing the javadoc above):

 * not available (e.g., non-chunk-based readers or readers without chunk metadata). This is used to optimize binary
 * search on sorted raw columns by enabling two-level search: coarse search at chunk boundaries followed by fine
 * search within the target chunk, minimizing chunk decompressions.
