Add RawSortedIndexBasedFilterOperator for binary search on raw sorted columns by xiangfu0 · Pull Request #18079 · apache/pinot

xiangfu0 · 2026-04-02T04:45:45Z

Summary

Add RawSortedIndexBasedFilterOperator that uses binary search O(log N) on raw sorted forward index columns instead of full scan O(N)
For chunk-compressed readers, implements two-level binary search: coarse search at chunk boundaries minimizes decompressions, fine search within cached chunk is free
Add MultiChunkReaderContext with an LRU cache of decompressed chunks to avoid repeated decompression during binary search; chunk cache is exposed via ForwardIndexReader.createCachedContext(int numSlots) SPI
Supports EQ, NEQ, IN, NOT_IN, RANGE predicates with all data types (INT, LONG, FLOAT, DOUBLE, STRING, BYTES, BIG_DECIMAL)
Exposes getNumDocsPerChunk() and createCachedContext() on ForwardIndexReader interface for chunk-aware optimization

Background

Currently SortedIndexBasedFilterOperator only works for dictionary-encoded sorted columns. Raw sorted columns fall through to ScanBasedFilterOperator which linearly scans every document. This PR adds an efficient binary search path for raw sorted columns, matching the optimization that dictionary-encoded sorted columns already enjoy.

Changes

ForwardIndexReader.java — Added getNumDocsPerChunk() and createCachedContext(int numSlots) default methods
BaseChunkForwardIndexReader.java — Override both methods; adds decompressChunkInto for direct-into-slot decompression
MultiChunkReaderContext.java — New LRU chunk cache context (up to N decompressed chunks); handles DELTA/DELTADELTA codecs correctly
RawSortedIndexBasedFilterOperator.java — New filter operator with two-level binary search + multi-chunk cache; caches computeMatchingRanges() result for reuse by canOptimizeCount() / canProduceBitmaps()
FilterOperatorUtils.java — Route sorted raw SV columns to the new operator
RawSortedIndexBasedFilterOperatorTest.java — 21 unit tests covering all predicate types and edge cases
MultiChunkReaderContextTest.java — 8 unit tests covering cache hit/miss, LRU eviction, replaceSlot, close cleanup, and integration for all compression types
BenchmarkRawSortedIndexFilter.java — JMH microbenchmark comparing binary search vs linear scan

Benchmark

JMH microbenchmark on a sorted raw INT forward index (1K docs/chunk, ~10x value repetition).

EQ: matches median value; RANGE: ~1% selectivity window around median
Machine: Apple M-series, JDK 17, 1 fork, 3 warmup + 5 measurement × 1s, AverageTime

Benchmark                                       compression    numDocs    Score    Error  Units
-----------------------------------------------------------------------------------------------
binarySearchEq                                 PASS_THROUGH  1,000,000    5.6 ±  2.2  us/op
binarySearchEq                                 PASS_THROUGH  5,000,000    5.7 ±  3.3  us/op
binarySearchEq                                 PASS_THROUGH 10,000,000    5.5 ±  3.2  us/op
linearScanEq   (baseline)                      PASS_THROUGH  1,000,000   22.7 ±  5.3  us/op
linearScanEq   (baseline)                      PASS_THROUGH  5,000,000   22.8 ±  7.1  us/op
linearScanEq   (baseline)                      PASS_THROUGH 10,000,000   24.3 ± 10.7  us/op

binarySearchEq                                          LZ4  1,000,000   13.4 ±  1.1  us/op
binarySearchEq                                          LZ4  5,000,000   14.7 ±  1.4  us/op
binarySearchEq                                          LZ4 10,000,000   15.3 ±  0.8  us/op
linearScanEq   (baseline)                               LZ4  1,000,000   23.7 ±  4.9  us/op
linearScanEq   (baseline)                               LZ4  5,000,000   24.3 ±  6.3  us/op
linearScanEq   (baseline)                               LZ4 10,000,000   25.0 ± 11.3  us/op

binarySearchEq   (with chunk cache)               ZSTANDARD  1,000,000   27.4 ±  0.9  us/op
binarySearchEq   (with chunk cache)               ZSTANDARD  5,000,000   31.7 ±  1.8  us/op
binarySearchEq   (with chunk cache)               ZSTANDARD 10,000,000   34.2 ±  2.2  us/op
linearScanEq   (baseline)                         ZSTANDARD  1,000,000   22.7 ±  2.9  us/op
linearScanEq   (baseline)                         ZSTANDARD  5,000,000   25.7 ± 20.0  us/op
linearScanEq   (baseline)                         ZSTANDARD 10,000,000   24.5 ±  9.6  us/op

binarySearchRange                              PASS_THROUGH  1,000,000    5.6 ±  2.7  us/op
binarySearchRange                              PASS_THROUGH  5,000,000    5.6 ±  3.7  us/op
binarySearchRange                              PASS_THROUGH 10,000,000    5.7 ±  2.2  us/op
linearScanRange  (baseline)                    PASS_THROUGH  1,000,000   23.4 ±  5.2  us/op
linearScanRange  (baseline)                    PASS_THROUGH  5,000,000   24.2 ±  6.5  us/op
linearScanRange  (baseline)                    PASS_THROUGH 10,000,000   23.5 ±  5.9  us/op

binarySearchRange                                       LZ4  1,000,000   20.7 ±  0.6  us/op
binarySearchRange                                       LZ4  5,000,000   23.2 ±  0.6  us/op
binarySearchRange                                       LZ4 10,000,000   25.8 ±  1.6  us/op
linearScanRange  (baseline)                             LZ4  1,000,000   23.3 ±  5.2  us/op
linearScanRange  (baseline)                             LZ4  5,000,000   22.9 ±  3.5  us/op
linearScanRange  (baseline)                             LZ4 10,000,000   25.5 ± 14.4  us/op

binarySearchRange  (with chunk cache)             ZSTANDARD  1,000,000   47.8 ±  1.3  us/op  ⚠️
binarySearchRange  (with chunk cache)             ZSTANDARD  5,000,000   56.4 ±  1.5  us/op  ⚠️
binarySearchRange  (with chunk cache)             ZSTANDARD 10,000,000   62.4 ±  1.7  us/op  ⚠️
linearScanRange  (baseline)                       ZSTANDARD  1,000,000   23.9 ±  7.8  us/op
linearScanRange  (baseline)                       ZSTANDARD  5,000,000   25.2 ± 16.5  us/op
linearScanRange  (baseline)                       ZSTANDARD 10,000,000   23.7 ±  6.7  us/op

Key takeaways:

Uncompressed (PASS_THROUGH): binary search is ~4× faster (~5–6 µs vs ~23 µs). Cost is flat across 1M–10M docs.
LZ4: EQ is ~1.7× faster (~13–15 µs vs ~24 µs); RANGE is comparable (~21–26 µs vs ~23 µs) — the two binary searches (lower + upper bound) probe mostly disjoint chunk sets, limiting cache reuse.
ZSTANDARD/EQ: chunk cache reduces cost from ~42 µs (without cache) to ~27–34 µs (vs ~24 µs linear), a 35% improvement. EQ's lower+upper bound searches overlap on chunks near the match point, giving the cache meaningful hits.
ZSTANDARD/RANGE: binary search (~48–62 µs) is slower than linear scan (~24 µs). RANGE's lower and upper bound searches probe opposite ends of the index with almost no chunk overlap, so the cache cannot help. ZSTANDARD's per-decompression cost × ~26 distinct chunks searched exceeds a linear scan's first batch.
Linear scan time is nearly independent of numDocs because ScanBasedFilterOperator uses batched iteration — only the first batch (~1K docs) is measured here. Real query latency at scale is proportionally worse for linear scan.

Test plan

21 new unit tests (EQ, NEQ, IN, NOT_IN, RANGE with inclusive/exclusive/unbounded, chunk-aware search, all data types, edge cases)
8 new unit tests for MultiChunkReaderContext (cache hit/miss, LRU eviction, replaceSlot, close, all compression types)
96 existing FilterOperatorUtils tests pass
Checkstyle clean
Spotless formatted

🤖 Generated with Claude Code

codecov-commenter · 2026-04-02T05:49:05Z

Codecov Report

❌ Patch coverage is 74.67249% with 58 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.50%. Comparing base (ade9f6c) to head (b7d1b32).

Files with missing lines	Patch %	Lines
...ator/filter/RawSortedIndexBasedFilterOperator.java	76.60%	40 Missing and 11 partials ⚠️
...inot/core/operator/filter/FilterOperatorUtils.java	33.33%	0 Missing and 6 partials ⚠️
...t/segment/spi/index/reader/ForwardIndexReader.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18079      +/-   ##
============================================
+ Coverage     63.48%   63.50%   +0.01%     
  Complexity     1627     1627              
============================================
  Files          3244     3245       +1     
  Lines        197342   197570     +228     
  Branches      30529    30575      +46     
============================================
+ Hits         125285   125460     +175     
- Misses        62014    62057      +43     
- Partials      10043    10053      +10

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-11	`63.47% <74.67%> (+<0.01%)`	⬆️
java-21	`63.47% <74.67%> (+0.03%)`	⬆️
temurin	`63.50% <74.67%> (+0.01%)`	⬆️
unittests	`63.49% <74.67%> (+0.01%)`	⬆️
unittests1	`55.49% <74.67%> (+0.03%)`	⬆️
unittests2	`34.93% <0.00%> (-0.04%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

lakechd · 2026-04-16T20:59:49Z

+            && dataSource.getDataSourceMetadata().isSingleValue()
+            && queryContext.isIndexUseAllowed(dataSource, FieldConfig.IndexType.SORTED)) {
+          return new RawSortedIndexBasedFilterOperator(queryContext, predicateEvaluator, dataSource, numDocs);
+        }


this is duplicated with line 140

… columns For raw (non-dictionary) sorted forward index columns, filter queries previously fell back to ScanBasedFilterOperator which does a full linear scan O(N). This change adds a new RawSortedIndexBasedFilterOperator that uses binary search O(log N) on the forward index to find matching document ID ranges. Key features: - Two-level binary search for chunk-compressed readers: coarse search at chunk boundaries minimizes decompressions, fine search within cached chunk is free - Supports EQ, NEQ, IN, NOT_IN, RANGE predicates with all numeric types + STRING - Exposes getNumDocsPerChunk() on ForwardIndexReader for chunk-aware optimization - Optimized count and bitmap production from docId ranges Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When a column has both a range index and is raw-sorted, the range index should be preferred over raw sorted binary search. Moved raw sorted checks after specialized index checks to fix TextMatchTransformFunctionTest. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR introduces a new filter operator path for sorted, raw (non-dictionary-encoded) single-value columns, enabling binary-search-based filtering on forward indexes (with chunk-aware optimization when chunk metadata is available). This extends Pinot’s existing “sorted index” optimization beyond dictionary-encoded columns.

Changes:

Add RawSortedIndexBasedFilterOperator to compute matching docId ranges via binary search on raw sorted forward indexes (optionally chunk-aware).
Extend ForwardIndexReader with getNumDocsPerChunk() and implement it in BaseChunkForwardIndexReader.
Update FilterOperatorUtils to route eligible sorted raw SV columns to the new operator and add a new unit test suite.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/reader/ForwardIndexReader.java`	Adds `getNumDocsPerChunk()` default method for chunk-aware optimizations.
`pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/forward/BaseChunkForwardIndexReader.java`	Exposes `_numDocsPerChunk` via `getNumDocsPerChunk()`.
`pinot-core/src/main/java/org/apache/pinot/core/operator/filter/RawSortedIndexBasedFilterOperator.java`	Implements raw sorted forward-index binary search (including chunk-aware search).
`pinot-core/src/main/java/org/apache/pinot/core/operator/filter/FilterOperatorUtils.java`	Routes sorted raw SV columns to the new operator and prioritizes it similarly to `SortedIndexBasedFilterOperator`.
`pinot-core/src/test/java/org/apache/pinot/core/operator/filter/RawSortedIndexBasedFilterOperatorTest.java`	Adds unit tests for the new operator (currently focused on INT/LONG/STRING).

Copilot · 2026-04-18T11:44:29Z

+    // Sort values to get ranges in order, then merge adjacent ranges
+    String[] sortedValues = getSortedValues(dataType, values);
+    List<IntPair> allRanges = new ArrayList<>();
+    for (String valueStr : sortedValues) {
+      List<IntPair> eqRanges = computeEqRanges(context, dataType, valueStr);
+      allRanges.addAll(eqRanges);
+    }
+    return mergeAdjacentRanges(allRanges);
+  }
+


computeInRanges() relies on mergeAdjacentRanges() assuming ranges are already in ascending docId order, but getSortedValues() only applies type-correct ordering for INT/LONG/FLOAT/DOUBLE and falls back to lexicographic String sort for other types. For BIG_DECIMAL this can generate out-of-order docId ranges (e.g., values ["2","10"]), which breaks SortedDocIdSet/SortedRangeIntersection expectations and can lead to incorrect AND/OR results. Fix by sorting the collected docId ranges by range.getLeft() (and ideally de-duping/merging overlaps) before constructing SortedDocIdSet, or by implementing type-correct value sorting for BIG_DECIMAL (BigDecimal::compareTo) and BYTES (decoded bytes / ByteArray.compare) as well.

Suggested change

// Sort values to get ranges in order, then merge adjacent ranges

String[] sortedValues = getSortedValues(dataType, values);

List<IntPair> allRanges = new ArrayList<>();

for (String valueStr : sortedValues) {

List<IntPair> eqRanges = computeEqRanges(context, dataType, valueStr);

allRanges.addAll(eqRanges);

}

return mergeAdjacentRanges(allRanges);

}

// Sort values to improve binary search locality, then normalize the resulting docId ranges because

// value ordering does not always imply docId-range ordering for all data types.

String[] sortedValues = getSortedValues(dataType, values);

List<IntPair> allRanges = new ArrayList<>();

for (String valueStr : sortedValues) {

List<IntPair> eqRanges = computeEqRanges(context, dataType, valueStr);

allRanges.addAll(eqRanges);

}

return sortAndMergeRanges(allRanges);

}

private List<IntPair> sortAndMergeRanges(List<IntPair> ranges) {

int numRanges = ranges.size();

if (numRanges < 2) {

return ranges;

}

ranges.sort((range1, range2) -> {

int leftCompare = Integer.compare(range1.getLeft(), range2.getLeft());

return leftCompare != 0 ? leftCompare : Integer.compare(range1.getRight(), range2.getRight());

});

List<IntPair> mergedRanges = new ArrayList<>(numRanges);

IntPair currentRange = ranges.get(0);

int currentStart = currentRange.getLeft();

int currentEnd = currentRange.getRight();

for (int i = 1; i < numRanges; i++) {

IntPair nextRange = ranges.get(i);

int nextStart = nextRange.getLeft();

int nextEnd = nextRange.getRight();

if (nextStart <= currentEnd + 1) {

currentEnd = Math.max(currentEnd, nextEnd);

} else {

mergedRanges.add(new IntPair(currentStart, currentEnd));

currentStart = nextStart;

currentEnd = nextEnd;

}

}

mergedRanges.add(new IntPair(currentStart, currentEnd));

return mergedRanges;

}

Copilot · 2026-04-18T11:44:30Z

+  /**
+   * Sorts the IN/NOT_IN values according to the column data type ordering, so that binary search results produce
+   * ranges in docId order.
+   */
+  private String[] getSortedValues(DataType dataType, List<String> values) {
+    String[] sorted = values.toArray(new String[0]);
+    switch (dataType) {
+      case INT:
+        int[] intVals = new int[sorted.length];
+        for (int i = 0; i < sorted.length; i++) {
+          intVals[i] = Integer.parseInt(sorted[i]);
+        }
+        Arrays.sort(intVals);
+        for (int i = 0; i < sorted.length; i++) {
+          sorted[i] = Integer.toString(intVals[i]);
+        }
+        break;
+      case LONG:
+        long[] longVals = new long[sorted.length];
+        for (int i = 0; i < sorted.length; i++) {
+          longVals[i] = Long.parseLong(sorted[i]);
+        }
+        Arrays.sort(longVals);
+        for (int i = 0; i < sorted.length; i++) {
+          sorted[i] = Long.toString(longVals[i]);
+        }
+        break;
+      case FLOAT:
+        float[] floatVals = new float[sorted.length];
+        for (int i = 0; i < sorted.length; i++) {
+          floatVals[i] = Float.parseFloat(sorted[i]);
+        }
+        Arrays.sort(floatVals);
+        for (int i = 0; i < sorted.length; i++) {
+          sorted[i] = Float.toString(floatVals[i]);
+        }
+        break;
+      case DOUBLE:
+        double[] doubleVals = new double[sorted.length];
+        for (int i = 0; i < sorted.length; i++) {
+          doubleVals[i] = Double.parseDouble(sorted[i]);
+        }
+        Arrays.sort(doubleVals);
+        for (int i = 0; i < sorted.length; i++) {
+          sorted[i] = Double.toString(doubleVals[i]);
+        }
+        break;
+      default:
+        // String and others: natural string sort
+        Arrays.sort(sorted);
+        break;
+    }


getSortedValues() does not implement type-aware ordering for BIG_DECIMAL or BYTES, but computeInRanges() depends on the values being sorted in the same order as the column to keep docId ranges ordered for SortedDocIdSet. Please add explicit sorting for BIG_DECIMAL (parse to BigDecimal and sort via compareTo) and BYTES (sort by decoded bytes using ByteArray.compare), or remove this method and instead sort the resulting IntPair ranges by docId before merging.

Copilot · 2026-04-18T11:44:30Z

+  // --- Long data type test ---
+
+  @SuppressWarnings({"unchecked", "rawtypes"})
+  private static DataSource createLongDataSource(long[] data) {
+    ForwardIndexReader reader = mock(ForwardIndexReader.class);
+    when(reader.isDictionaryEncoded()).thenReturn(false);
+    when(reader.isSingleValue()).thenReturn(true);
+    when(reader.getStoredType()).thenReturn(DataType.LONG);
+    when(reader.getNumDocsPerChunk()).thenReturn(0);
+    when(reader.createContext()).thenReturn(null);
+    for (int i = 0; i < data.length; i++) {
+      when(reader.getLong(i, null)).thenReturn(data[i]);
+    }
+
+    DataSourceMetadata metadata = mock(DataSourceMetadata.class);
+    when(metadata.isSorted()).thenReturn(true);
+    when(metadata.isSingleValue()).thenReturn(true);
+    when(metadata.getDataType()).thenReturn(DataType.LONG);
+
+    DataSource dataSource = mock(DataSource.class);
+    when(dataSource.getForwardIndex()).thenReturn(reader);
+    when(dataSource.getDataSourceMetadata()).thenReturn(metadata);
+    when(dataSource.getDictionary()).thenReturn(null);
+    when(dataSource.getNullValueVector()).thenReturn(null);
+
+    return dataSource;
+  }
+
+  @Test
+  public void testLongEqPredicate() {
+    long[] data = {100L, 200L, 300L, 300L, 400L, 500L};
+    DataSource dataSource = createLongDataSource(data);
+    QueryContext queryContext = createQueryContext();
+    EqPredicate predicate = new EqPredicate(COL_EXPR, "300");
+    PredicateEvaluator evaluator =
+        EqualsPredicateEvaluatorFactory.newRawValueBasedEvaluator(predicate, DataType.LONG);
+
+    RawSortedIndexBasedFilterOperator operator =
+        new RawSortedIndexBasedFilterOperator(queryContext, evaluator, dataSource, data.length);
+
+    int[] matchingDocIds = getMatchingDocIds(operator);
+    assertEquals(matchingDocIds, new int[]{2, 3});
+  }
+
+  // --- String data type test ---
+
+  @SuppressWarnings({"unchecked", "rawtypes"})
+  private static DataSource createStringDataSource(String[] data) {
+    ForwardIndexReader reader = mock(ForwardIndexReader.class);
+    when(reader.isDictionaryEncoded()).thenReturn(false);
+    when(reader.isSingleValue()).thenReturn(true);
+    when(reader.getStoredType()).thenReturn(DataType.STRING);
+    when(reader.getNumDocsPerChunk()).thenReturn(0);
+    when(reader.createContext()).thenReturn(null);
+    for (int i = 0; i < data.length; i++) {
+      when(reader.getString(i, null)).thenReturn(data[i]);
+    }
+
+    DataSourceMetadata metadata = mock(DataSourceMetadata.class);
+    when(metadata.isSorted()).thenReturn(true);
+    when(metadata.isSingleValue()).thenReturn(true);
+    when(metadata.getDataType()).thenReturn(DataType.STRING);
+
+    DataSource dataSource = mock(DataSource.class);
+    when(dataSource.getForwardIndex()).thenReturn(reader);
+    when(dataSource.getDataSourceMetadata()).thenReturn(metadata);
+    when(dataSource.getDictionary()).thenReturn(null);
+    when(dataSource.getNullValueVector()).thenReturn(null);
+
+    return dataSource;
+  }
+
+  @Test
+  public void testStringRangePredicate() {
+    String[] data = {"apple", "banana", "cherry", "date", "elderberry", "fig"};
+    DataSource dataSource = createStringDataSource(data);
+    QueryContext queryContext = createQueryContext();
+    RangePredicate predicate =
+        new RangePredicate(COL_EXPR, true, "banana", true, "elderberry", DataType.STRING);
+    PredicateEvaluator evaluator =
+        RangePredicateEvaluatorFactory.newRawValueBasedEvaluator(predicate, DataType.STRING);
+
+    RawSortedIndexBasedFilterOperator operator =
+        new RawSortedIndexBasedFilterOperator(queryContext, evaluator, dataSource, data.length);
+
+    int[] matchingDocIds = getMatchingDocIds(operator);
+    // banana(1), cherry(2), date(3), elderberry(4)
+    assertEquals(matchingDocIds, new int[]{1, 2, 3, 4});
+  }
+}


The new operator claims support for multiple raw data types (FLOAT/DOUBLE/BYTES/BIG_DECIMAL), but this test suite only exercises INT/LONG/STRING. Adding focused tests for at least BIG_DECIMAL and BYTES (especially IN/NOT_IN, where range ordering/merging is sensitive) would prevent regressions and would have caught the ordering issue in getSortedValues().

Copilot · 2026-04-18T11:44:30Z

+   * not available (e.g., non-chunk-based readers or uncompressed readers). This is used to optimize binary search on
+   * sorted raw columns by enabling two-level search: coarse search at chunk boundaries followed by fine search within
+   * the target chunk, minimizing chunk decompressions.


Javadoc says getNumDocsPerChunk() returns 0 for “uncompressed readers”, but BaseChunkForwardIndexReader can be uncompressed (PASS_THROUGH) and still has meaningful chunk boundaries and returns _numDocsPerChunk. Consider updating the wording to reflect “non-chunk-based readers” (or “readers without chunk metadata”) rather than tying it to compression.

Suggested change

* not available (e.g., non-chunk-based readers or uncompressed readers). This is used to optimize binary search on

* sorted raw columns by enabling two-level search: coarse search at chunk boundaries followed by fine search within

* the target chunk, minimizing chunk decompressions.

* not available (e.g., non-chunk-based readers or readers without chunk metadata). This is used to optimize binary

* search on sorted raw columns by enabling two-level search: coarse search at chunk boundaries followed by fine

* search within the target chunk, minimizing chunk decompressions.

xiangfu0 force-pushed the claude/determined-moser branch from 8817917 to fe87055 Compare April 2, 2026 08:32

xiangfu0 force-pushed the claude/determined-moser branch from fe87055 to f1e6f1d Compare April 14, 2026 10:07

lakechd reviewed Apr 16, 2026

View reviewed changes

xiangfu0 and others added 2 commits April 18, 2026 04:31

xiangfu0 force-pushed the claude/determined-moser branch from f1e6f1d to b7d1b32 Compare April 18, 2026 11:32

xiangfu0 requested a review from Copilot April 18, 2026 11:38

xiangfu0 added performance Related to performance optimization index Related to indexing (general) labels Apr 18, 2026

Copilot started reviewing on behalf of xiangfu0 April 18, 2026 11:39 View session

Copilot AI reviewed Apr 18, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add RawSortedIndexBasedFilterOperator for binary search on raw sorted columns#18079

Add RawSortedIndexBasedFilterOperator for binary search on raw sorted columns#18079
xiangfu0 wants to merge 2 commits intoapache:masterfrom
xiangfu0:claude/determined-moser

xiangfu0 commented Apr 2, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Apr 2, 2026 •

edited

Loading

Uh oh!

lakechd Apr 16, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 18, 2026

Uh oh!

Copilot AI Apr 18, 2026

Uh oh!

Copilot AI Apr 18, 2026

Uh oh!

Copilot AI Apr 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

-    // Sort values to get ranges in order, then merge adjacent ranges
-    String[] sortedValues = getSortedValues(dataType, values);
-    List<IntPair> allRanges = new ArrayList<>();
-    for (String valueStr : sortedValues) {
-      List<IntPair> eqRanges = computeEqRanges(context, dataType, valueStr);
-      allRanges.addAll(eqRanges);
-    }
-    return mergeAdjacentRanges(allRanges);
-  }
+    // Sort values to improve binary search locality, then normalize the resulting docId ranges because
+    // value ordering does not always imply docId-range ordering for all data types.
+    String[] sortedValues = getSortedValues(dataType, values);
+    List<IntPair> allRanges = new ArrayList<>();
+    for (String valueStr : sortedValues) {
+      List<IntPair> eqRanges = computeEqRanges(context, dataType, valueStr);
+      allRanges.addAll(eqRanges);
+    }
+    return sortAndMergeRanges(allRanges);
+  }
+  private List<IntPair> sortAndMergeRanges(List<IntPair> ranges) {
+    int numRanges = ranges.size();
+    if (numRanges < 2) {
+      return ranges;
+    }
+    ranges.sort((range1, range2) -> {
+      int leftCompare = Integer.compare(range1.getLeft(), range2.getLeft());
+      return leftCompare != 0 ? leftCompare : Integer.compare(range1.getRight(), range2.getRight());
+    });
+    List<IntPair> mergedRanges = new ArrayList<>(numRanges);
+    IntPair currentRange = ranges.get(0);
+    int currentStart = currentRange.getLeft();
+    int currentEnd = currentRange.getRight();
+    for (int i = 1; i < numRanges; i++) {
+      IntPair nextRange = ranges.get(i);
+      int nextStart = nextRange.getLeft();
+      int nextEnd = nextRange.getRight();
+      if (nextStart <= currentEnd + 1) {
+        currentEnd = Math.max(currentEnd, nextEnd);
+      } else {
+        mergedRanges.add(new IntPair(currentStart, currentEnd));
+        currentStart = nextStart;
+        currentEnd = nextEnd;
+      }
+    }
+    mergedRanges.add(new IntPair(currentStart, currentEnd));
+    return mergedRanges;
+  }

Conversation

xiangfu0 commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Background

Changes

Benchmark

Test plan

Uh oh!

codecov-commenter commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

lakechd Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

xiangfu0 commented Apr 2, 2026 •

edited

Loading

codecov-commenter commented Apr 2, 2026 •

edited

Loading