lucene/CHANGES.txt (7 additions, 0 deletions)
@@ -259,6 +259,13 @@ New Features

Improvements
---------------------
* GITHUB#15948: Improve BayesianScoreQuery and LogOddsFusionQuery with base rate prior,
weighted Logarithmic Opinion Pooling, and auto parameter estimation. Add
BayesianScoreEstimator for estimating sigmoid calibration parameters from corpus
statistics. Add base rate prior to BayesianScoreQuery for log-odds space shifting.
Add per-signal weights and logit normalization to LogOddsFusionQuery.
(Jaepil Jeong)

* GITHUB#15823: Implement method to add all stream elements into a PriorityQueue.
Call PriorityQueue#addAll with mapped stream in DisjunctionMaxBulkScorer's constructor. (Zhou Hui)

BayesianScoreEstimator.java (new file, 228 additions)
@@ -0,0 +1,228 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.lucene.search;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Random;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.StoredFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.util.ArrayUtil;

/**
 * Estimates {@link BayesianScoreQuery} parameters (alpha, beta, base rate) from corpus statistics
 * via pseudo-query sampling.
 *
 * <p>The estimation algorithm:
 *
 * <ol>
 *   <li>Sample N documents randomly from the index
 *   <li>For each document, create a pseudo-query from its first few tokens in the target field
 *   <li>Run each pseudo-query via BM25 and collect the score distribution
 *   <li>Estimate: beta = median(scores), alpha = 1 / std(scores)
 *   <li>Estimate base rate: mean fraction of documents scoring above the 95th percentile
 * </ol>
 *
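 * <p>A typical usage sketch (searcher and field name here are illustrative; how the estimated
 * values feed a {@link BayesianScoreQuery} depends on that query's constructor):
 *
 * <pre class="prettyprint">
 * BayesianScoreEstimator.Parameters params = BayesianScoreEstimator.estimate(searcher, "body");
 * float alpha = params.alpha();
 * float beta = params.beta();
 * float baseRate = params.baseRate();
 * </pre>
 *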
 * @lucene.experimental
 */
public class BayesianScoreEstimator {

Member:

So, I see these params are then used within BayesianScoreQuery.

I wonder: could we have a constructor for BayesianScoreQuery (with those internal parameters nullable) that detects during rewrite whether the parameters are null, and if they are, provides the correct estimation?

Or we could adjust the interface so that BayesianScoreQuery accepts either an estimator or the parameters in its constructor, and if it's an estimator, it handles the estimation during rewrite?

Is the main concern that the estimation should only ever happen once per lifetime of the index? Or only periodically vs. on every query?

Contributor Author (@jaepil), Apr 30, 2026:

Great questions — let me take them in reverse order, since the lifecycle question (3) is the most fundamental and the API choice follows from it.

On lifecycle (3): The estimated parameters are corpus-level statistics. α and β are derived from the BM25 score distribution's center and spread, and the base rate is a global prior. None of them depend on the user query, so the natural lifecycle is per-IndexReader (per-commit), not per-query. Estimation runs ~50 pseudo-queries × top-K collection, which is fine once per reader but prohibitive on every query.

On putting estimation inside rewrite() (1 and 2): I'm a bit hesitant for a few reasons:

  1. rewrite() is generally expected to be cheap and stats-driven, not to perform I/O of this magnitude (reading stored fields, running 50 inner searches, sorting score arrays).
  2. Even with a fixed seed, lazy estimation in rewrite() would need a reader-keyed cache to avoid redoing the work — otherwise every rewrite() call repeats the sampling.
  3. It blurs query identity: equals/hashCode of an unestimated query vs. its rewritten form needs careful handling, especially for the query-cache layer.

What I'd propose instead: keep the explicit Parameters constructor as the primary, deterministic API, and add a convenience factory:

// proposed factory on BayesianScoreQuery
public static Query withAutoCalibration(
    IndexSearcher searcher, String field, Query inner) throws IOException;

Internally this memoizes Parameters keyed by IndexReader.CacheHelper#getKey(), so estimation runs once per reader and is cleaned up automatically when the reader closes. The user gets the "just works" ergonomics without overloading rewrite() with sampling I/O.
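
A minimal sketch of that memoization (hypothetical helper, not in this PR; class and method names are illustrative, and a real version would key by (reader, field) rather than reader alone):

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.BayesianScoreEstimator;
import org.apache.lucene.search.IndexSearcher;

final class CalibrationCache {
  // one entry per live reader; keyed by reader only for brevity
  private static final Map<IndexReader.CacheKey, BayesianScoreEstimator.Parameters> CACHE =
      new ConcurrentHashMap<>();

  static BayesianScoreEstimator.Parameters get(IndexSearcher searcher, String field)
      throws IOException {
    IndexReader.CacheHelper helper = searcher.getIndexReader().getReaderCacheHelper();
    if (helper == null) {
      // no stable cache key (e.g. some wrapped readers): estimate without caching
      return BayesianScoreEstimator.estimate(searcher, field);
    }
    BayesianScoreEstimator.Parameters params = CACHE.get(helper.getKey());
    if (params == null) {
      // benign race: concurrent callers may estimate twice, but the fixed seed
      // makes the result deterministic
      params = BayesianScoreEstimator.estimate(searcher, field);
      CACHE.put(helper.getKey(), params);
      // evict when the reader closes so the map cannot grow unbounded
      helper.addClosedListener(CACHE::remove);
    }
    return params;
  }

  private CalibrationCache() {}
}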

Structurally this follows the same precedent as KnnFloatVectorQuery / KnnByteVectorQuery: some queries inherently need a reader-bound resolution step, and Lucene already accommodates that. The difference here is that we resolve eagerly at construction time rather than lazily in rewrite(), since calibration parameters are reusable across many inner queries against the same reader (whereas a kNN result set is tied to a specific query vector and isn't).

Happy to push this as a follow-up commit if the direction makes sense.

Member:

> The estimated parameters are corpus-level statistics. α and β are derived from the BM25 score distribution's center and spread, and the base rate is a global prior. None of them depend on the user query, so the natural lifecycle is per-IndexReader (per-commit), not per-query. Estimation runs ~50 pseudo-queries × top-K collection, which is fine once per reader but prohibitive on every query.

Ah, gotcha! I'm understanding better now. Thank you.

My concern is: how do we know what a "typical user query" looks like? Doesn't this require knowledge of the query?

Or did y'all's empirical analysis show that just using random docs worked well enough?

Contributor Author (@jaepil):

Great question, and the answer is: calibration doesn't need to model the user query distribution — it only needs the score distribution to be representative of the corpus's BM25 dynamic range.

Here's why: α and β are derived from the BM25 score distribution's spread (alpha = 1/std) and center (beta = median). These are scale statistics. As long as the pseudo-queries exercise the same scoring code path that real user queries will hit (BM25Similarity over the same field's term frequencies and IDF table), the resulting α/β describe the scorer's calibration, which is invariant to which specific terms appear in the query. The base rate is similarly a corpus-level fraction, not query-conditional.

A useful sanity check: sigmoid is monotone, so α and β never change ranking — they only adjust where on the (0,1) curve scores land for downstream Log-OP fusion. Even substantial pseudo-query/real-query distribution mismatch only shifts the calibration curve, which is the same effect as picking a different α/β manually.
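
For concreteness, here's the calibration shape I mean (a sketch assuming the standard logistic form; illustrative, not the literal BayesianScoreQuery code):

// Monotone sigmoid calibration: ranking by the output equals ranking by
// bm25Score; alpha and beta only choose where on the (0, 1) curve a score lands.
static float calibrate(float bm25Score, float alpha, float beta) {
  return (float) (1.0 / (1.0 + Math.exp(-alpha * (bm25Score - beta))));
}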

That said, the "random docs + first N tokens" approach in this PR does have a real weakness on corpora with shared boilerplate prefixes (license headers, structured templates), where pseudo-queries collapse into near-duplicates. I'm thinking about replacing the document-text path with reservoir sampling over the field's indexed vocabulary, which would give uniform random samples of unique terms instead — a more defensible "what does this scorer's distribution look like" probe than "what do the first 5 words of random documents look like."

We did test this calibration approach during the research phase across several corpora and didn't see issues, but I'd like to redo that validation directly against the Lucene implementation as a follow-up PR before this leaves @lucene.experimental status.


  /** Estimated parameters for {@link BayesianScoreQuery}. */
  public record Parameters(float alpha, float beta, float baseRate) {}

  private static final int DEFAULT_N_SAMPLES = 50;
  private static final int DEFAULT_TOKENS_PER_QUERY = 5;
  private static final double PERCENTILE_THRESHOLD = 0.95;
  private static final float BASE_RATE_MIN = 1e-6f;
  private static final float BASE_RATE_MAX = 0.5f;

  private BayesianScoreEstimator() {}

  /**
   * Estimates BayesianScoreQuery parameters from the given index.
   *
   * @param searcher the index searcher to sample from
   * @param field the text field to create pseudo-queries for
   * @param nSamples number of documents to sample (default 50)
   * @param tokensPerQuery number of tokens per pseudo-query (default 5)
   * @param seed random seed for reproducible sampling
   * @return estimated alpha, beta, and base rate
   * @throws IOException if an I/O error occurs reading the index
   */
  public static Parameters estimate(
      IndexSearcher searcher, String field, int nSamples, int tokensPerQuery, long seed)
      throws IOException {
    IndexReader reader = searcher.getIndexReader();
    int maxDoc = reader.maxDoc();
    if (maxDoc == 0) {
      return new Parameters(1.0f, 0.0f, 0.01f);
    }

    nSamples = Math.min(nSamples, maxDoc);
    Random rng = new Random(seed);

    // Sample document IDs
    int[] sampledDocs = sampleDocIds(maxDoc, nSamples, rng);

    // Create pseudo-queries and collect scores
    List<float[]> allScoreArrays = new ArrayList<>();
    List<Float> baseRateFractions = new ArrayList<>();
    StoredFields storedFields = reader.storedFields();

    for (int docId : sampledDocs) {
      String fieldValue = storedFields.document(docId).get(field);
      if (fieldValue == null || fieldValue.isEmpty()) {
        continue;
      }

      // Extract first N tokens as pseudo-query terms
      String[] tokens = tokenize(fieldValue, tokensPerQuery);

Comment on lines +95 to +96

Member:

I wonder if just taking the first N tokens works well for all types of data. For example, legal documents or source code may have a very similar "header" across all docs, which would effectively eliminate any randomness and introduce a pretty significant bias.

Contributor Author (@jaepil), Apr 30, 2026:

You're right — taking the first N tokens biases heavily toward boilerplate prefixes (license headers, legal preambles, structured templates), and on those corpora the pseudo-queries collapse into near-duplicates.

The cleaner fix, I think, is to drop the document-text path entirely and reservoir-sample over the field's indexed vocabulary via MultiTerms.getTerms(reader, field) + TermsEnum. Vocabulary-level sampling is uniform over unique terms, not over occurrences — a boilerplate term that appears in 100% of documents has the same selection probability as a rare content term, so shared-prefix corpora no longer dominate the sample.

If this direction sounds right, I'll prepare a follow-up commit with a regression test for the shared-prefix case.

      if (tokens.length == 0) {
        continue;
      }

      // Build a BooleanQuery from the tokens
      BooleanQuery.Builder builder = new BooleanQuery.Builder();
      for (String token : tokens) {
        builder.add(new TermQuery(new Term(field, token)), BooleanClause.Occur.SHOULD);
      }
      Query pseudoQuery = builder.build();

      // Collect all scores
      float[] scores = collectScores(searcher, pseudoQuery, maxDoc);
      if (scores.length == 0) {
        continue;
      }
      allScoreArrays.add(scores);

      // Base rate: fraction of docs scoring at or above the 95th-percentile score
      float[] sorted = scores.clone();
      Arrays.sort(sorted);
      int pIdx = (int) (sorted.length * PERCENTILE_THRESHOLD);
      pIdx = Math.min(pIdx, sorted.length - 1);
      float threshold = sorted[pIdx];
      int highCount = 0;
      for (float s : scores) {
        if (s >= threshold) {
          highCount++;
        }
      }
      // denominator is maxDoc: non-matching docs count as below the threshold
      baseRateFractions.add((float) highCount / maxDoc);
    }

    if (allScoreArrays.isEmpty()) {
      return new Parameters(1.0f, 0.0f, 0.01f);
    }

    // Flatten all scores for global statistics
    int totalScores = 0;
    for (float[] arr : allScoreArrays) {
      totalScores += arr.length;
    }
    float[] allScores = new float[totalScores];
    int offset = 0;
    for (float[] arr : allScoreArrays) {
      System.arraycopy(arr, 0, allScores, offset, arr.length);
      offset += arr.length;
    }

    // beta = median
    Arrays.sort(allScores);
    float beta = allScores[allScores.length / 2];

    // alpha = 1 / std
    double mean = 0;
    for (float s : allScores) {
      mean += s;
    }
    mean /= allScores.length;
    double variance = 0;
    for (float s : allScores) {
      double diff = s - mean;
      variance += diff * diff;
    }
    variance /= allScores.length;
    double std = Math.sqrt(variance);
    float alpha = std > 0 ? (float) (1.0 / std) : 1.0f;

    // base rate = mean of per-query fractions, clamped
    float baseRate = 0;
    for (float f : baseRateFractions) {
      baseRate += f;
    }
    baseRate /= baseRateFractions.size();
    baseRate = Math.clamp(baseRate, BASE_RATE_MIN, BASE_RATE_MAX);

    return new Parameters(alpha, beta, baseRate);
  }

  /**
   * Estimates parameters with default settings (50 samples, 5 tokens per query, seed 42).
   *
   * @param searcher the index searcher
   * @param field the text field
   * @return estimated parameters
   * @throws IOException if an I/O error occurs
   */
  public static Parameters estimate(IndexSearcher searcher, String field) throws IOException {
    return estimate(searcher, field, DEFAULT_N_SAMPLES, DEFAULT_TOKENS_PER_QUERY, 42);
  }

  private static int[] sampleDocIds(int maxDoc, int nSamples, Random rng) {
    // Fisher-Yates partial shuffle for sampling without replacement
    int[] all = new int[maxDoc];
    for (int i = 0; i < maxDoc; i++) {
      all[i] = i;
    }
    int n = Math.min(nSamples, maxDoc);
    for (int i = 0; i < n; i++) {
      int j = i + rng.nextInt(maxDoc - i);
      int tmp = all[i];
      all[i] = all[j];
      all[j] = tmp;
    }
    return ArrayUtil.copyOfSubArray(all, 0, n);
  }

  private static String[] tokenize(String text, int maxTokens) {
    // Simple whitespace tokenization with lowercasing
    String[] parts = text.toLowerCase(Locale.ROOT).split("\\s+");
    int n = Math.min(parts.length, maxTokens);
    List<String> tokens = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      String token = parts[i].replaceAll("[^a-z0-9]", "");
      if (token.isEmpty() == false) {
        tokens.add(token);
      }
    }
    return tokens.toArray(new String[0]);
  }

Comment on lines +204 to +216

Member:

So, I think we should actually analyze with a provided analyzer, or gather information from term vectors or something. I suspect that for many corpora, whitespace splitting and trimming like this just doesn't reflect reality.

Contributor Author (@jaepil):

Agreed, I should have implemented better code here.

Rather than threading an Analyzer parameter through the public API, I think the cleaner fix is to side-step analysis entirely by sampling from the already-analyzed term dictionary via MultiTerms.getTerms(reader, field) + TermsEnum. Concretely: reservoir-sample nSamples * tokensPerQuery unique terms from the field's vocabulary, partition into pseudo-queries, and feed them directly into new Term(field, bytesRef). The bytes are identical to what's indexed, so:

  • No analyzer parameter needed on the public API.
  • No dependency on stored fields or term vectors.
  • Works correctly for any analyzer chain the user indexed with — Korean, Chinese, custom n-gram, anything.

If this direction sounds right, I'll prepare a follow-up commit.
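
For reference, a rough sketch of the sampling core (hypothetical follow-up code, not part of this PR; pseudo-query partitioning omitted and class name illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

final class TermSampler {
  /** Uniformly samples up to k unique terms from the field's term dictionary. */
  static List<BytesRef> sampleTerms(IndexReader reader, String field, int k, long seed)
      throws IOException {
    List<BytesRef> reservoir = new ArrayList<>(k);
    Terms terms = MultiTerms.getTerms(reader, field);
    if (terms == null) {
      return reservoir; // field has no indexed terms
    }
    Random rng = new Random(seed);
    TermsEnum termsEnum = terms.iterator();
    long seen = 0;
    for (BytesRef term = termsEnum.next(); term != null; term = termsEnum.next()) {
      seen++;
      if (reservoir.size() < k) {
        // TermsEnum reuses its BytesRef, so take a deep copy
        reservoir.add(BytesRef.deepCopyOf(term));
      } else {
        // classic reservoir sampling: keep the i-th term with probability k/i
        long slot = rng.nextLong(seen);
        if (slot < k) {
          reservoir.set((int) slot, BytesRef.deepCopyOf(term));
        }
      }
    }
    return reservoir;
  }

  private TermSampler() {}
}

Each consecutive group of tokensPerQuery sampled terms would then become one SHOULD-clause pseudo-query via new Term(field, bytesRef), mirroring the current loop.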


  private static float[] collectScores(IndexSearcher searcher, Query query, int maxDoc)
      throws IOException {
    int topN = Math.min(maxDoc, 10000);
    TopDocs topDocs = searcher.search(query, topN);
    float[] scores = new float[topDocs.scoreDocs.length];
    for (int i = 0; i < topDocs.scoreDocs.length; i++) {
      scores[i] = topDocs.scoreDocs[i].score;
    }
    return scores;
  }
}