[FEATURE] Add hybrid edismax + KNN reRank query mode (query.type=2)#4654
Open
dkd-dobberkau wants to merge 6 commits into
Introduces three TypoScriptConfiguration helpers required for the upcoming hybrid query.type=2 mode in QueryBuilder:
- isHybridVectorSearchEnabled(): query.type === 2
- getVectorReRankDocs(): candidate-pool size, default 200
- getVectorReRankWeight(): cosine multiplier, default 2.0
Pure vector helpers and the existing query.type=1 path are unchanged.
QueryBuilder gains a third branch: classical edismax produces the
candidate set, Solr's {!rerank} query parser orders the top-N by
bm25 + weight*cosine using a {!knn_text_to_vector} inner query.
Unlike pure vector search (query.type=1) the classical query
remains the recall driver, so qf/pf/mm/highlighting all work as
usual and irrelevant vector neighbours can no longer dominate the
result list.
Tunables (TypoScript):
- search.vectorSearch.reRank.docs (default 200)
- search.vectorSearch.reRank.weight (default 2.0)
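The ordering described in this commit message can be illustrated with a small Python sketch. It is not the extension's PHP code and does not call Solr: the Candidate type, the hybrid_rerank function, and the sample scores are all hypothetical. It assumes the combined score is bm25 + weight * cosine, applied only to the first reRank.docs candidates, with everything beyond that pool keeping its classical order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    bm25: float    # score from the classical (edismax) query
    cosine: float  # similarity score from the KNN inner query


def hybrid_rerank(candidates, rerank_docs=200, weight=2.0):
    """Re-order the first `rerank_docs` candidates by bm25 + weight * cosine.

    `candidates` must already be sorted by descending bm25, as the classical
    query would return them. Documents beyond the pool keep their order.
    """
    pool, tail = candidates[:rerank_docs], candidates[rerank_docs:]
    reranked = sorted(pool, key=lambda c: c.bm25 + weight * c.cosine, reverse=True)
    return reranked + tail


# Hypothetical scores, chosen only to make the arithmetic easy to follow.
hits = [
    Candidate("a", bm25=3.0, cosine=0.10),  # 3.0 + 2.0 * 0.10 = 3.2
    Candidate("b", bm25=2.5, cosine=0.90),  # 2.5 + 2.0 * 0.90 = 4.3
    Candidate("c", bm25=1.0, cosine=0.95),  # outside the pool when rerank_docs=2
]
order = [c.doc_id for c in hybrid_rerank(hits, rerank_docs=2)]
print(order)  # ['b', 'a', 'c']
```

Document "c" has the highest cosine value but stays last because it never enters the re-rank pool, which is the sense in which the classical query remains the recall driver.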
max(1, min($topK, $reRankDocs)) prevents an invalid topK=0 being sent to Solr's KNN parser when either getter is misconfigured. A second unit test exercises the case where the two inputs differ, so the min() logic is no longer silently bypassed by equal values.
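The clamp can be restated in Python to show the cases the two unit tests are meant to cover. The function name clamp_top_k is hypothetical; only the expression max(1, min(topK, reRankDocs)) comes from the commit message.

```python
def clamp_top_k(top_k: int, rerank_docs: int) -> int:
    """Python restatement of max(1, min($topK, $reRankDocs))."""
    return max(1, min(top_k, rerank_docs))


print(clamp_top_k(50, 200))   # 50: topK is the smaller value
print(clamp_top_k(500, 200))  # 200: reRankDocs < topK, so min() picks reRankDocs
print(clamp_top_k(0, 200))    # 1: a misconfigured 0 is raised to 1
print(clamp_top_k(50, -5))    # 1: a negative value is raised to 1 as well
```

With equal inputs, min() returns the same number whichever argument is smaller, which is why a test using identical values cannot detect a missing min().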
…path
Replace call to undefined Query::getRawSearchTerm() with getQuery() in the canBuildSearchQueryForHybridVectorSearch test: the return type of buildSearchQuery() is Query (not SearchQuery), so getRawSearchTerm() is not visible to PHPStan level 6. getQuery() returns the same 'q' parameter value and is defined on the base class.
CS dry-run: 0 files to fix.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Aligns the hybrid vector test with the sibling canBuildSearchQueryForVectorSearch pattern: an explicit instanceof SearchQuery check narrows PHPStan's type, so the spec-mandated getRawSearchTerm() assertion can stay. Removes the redundant double-check of $query->getQuery() introduced when working around the PHPStan error.
Builds on #4065. Adds a hybrid (classical + KNN) sibling to the existing query.type = 1 (pure vector) mode introduced as part of that feature.

Motivation

The existing plugin.tx_solr.search.query.type = 1 performs a pure KNN vector search (q=*:* plus a vector frange filter), discarding classical term-matching signals. For keyword-heavy queries this produces phantom matches from the embedding cluster, while exact-term hits get scored identically to remote semantic neighbours.

In practice, on a TYPO3 demo site (solr-ddev-site with nomic-embed-text embeddings), searching for WordPress in pure-vector mode returns 25 hits (all unrelated TYPO3 pages sharing a "CMS" embedding cluster), while searching for TYPO3, the dominant term on the site, returns only 1. Raising the minimum-similarity threshold does not help; it only shifts the phantom count linearly.
What this PR adds

A third value, query.type = 2 (hybrid re-rank mode):
- Classical edismax produces the candidate set (recall), preserving qf/pf/mm/highlighting/facets/spellcheck.
- {!rerank reRankQuery={!knn_text_to_vector ...}} orders the top-N candidates by bm25 + weight * cosine (precision refinement).

The hybrid branch is strictly additive: a new isHybridVectorSearchEnabled() predicate sits inside the existing else-arm of buildSearchQuery(). The query.type = 0 (classical) and query.type = 1 (pure vector) code paths are byte-identical to base.

Configuration

Two new TypoScript keys under plugin.tx_solr.search.vectorSearch.reRank:
- docs (default 200): candidate-pool size
- weight (default 2.0): cosine multiplier, final = bm25 + weight * cosine

Existing keys (vectorSearch.minimumSimilarity, vectorSearch.topK) stay relevant only for query.type = 1; the docs notes have been updated to make this explicit.
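To show how the two reRank keys could map onto Solr request parameters, here is a hedged Python sketch that assembles an rq/rqq pair. The exact strings the PR's PHP code emits are not shown on this page, so the function name build_rerank_params and the use of $rqq parameter dereferencing are assumptions. The model=llm and f=vector values are the hardcoded strings named under the known gaps, and reRankQuery, reRankDocs, and reRankWeight are the local parameters of Solr's rerank query parser.

```python
def build_rerank_params(search_term, top_k, rerank_docs=200, weight=2.0,
                        model="llm", field="vector"):
    """Assemble hypothetical rq/rqq parameters for a hybrid re-rank request.

    Assumption: rq references the KNN query through $rqq; the PR may build
    the strings differently.
    """
    # Same clamp as in the commit message: never send topK=0 to the KNN parser.
    knn_top_k = max(1, min(top_k, rerank_docs))
    return {
        "rq": f"{{!rerank reRankQuery=$rqq reRankDocs={rerank_docs} reRankWeight={weight}}}",
        "rqq": f"{{!knn_text_to_vector model={model} f={field} topK={knn_top_k}}}{search_term}",
    }


params = build_rerank_params("WordPress", top_k=50)
print(params["rq"])   # {!rerank reRankQuery=$rqq reRankDocs=200 reRankWeight=2.0}
print(params["rqq"])  # {!knn_text_to_vector model=llm f=vector topK=50}WordPress
```

The main query (q, qf, pf, mm) is deliberately absent from the sketch: in hybrid mode it stays whatever the classical edismax path already builds.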
Manual verification (TYPO3 14 / Solr 10)

[Results table: rows TYPO3, WordPress, language; columns type=0, type=1, type=2; cell values not preserved]

type=2 matches type=0 on recall (correct counts) while top-of-list ordering is refined by the cosine signal. The phantom CMS-cluster matches under type=1 disappear because BM25 = 0 documents are not candidates.

Test coverage

- … build path including rq/rqq parameter shape, min() clamp branch with reRankDocs < topK, cross-relation with the existing vector flags)
- … QueryBuilder|Search filter) pass, no regressions

Known gaps / follow-ups

- model=llm f=vector strings in attachHybridVectorReRanking() mirror preparePureVectorSearch(): same hardcoding, kept for symmetry. A future PR could hoist both to a config getter.
- … seeded vector fixture data, scoped out of this PR.
- The 2.0 default weight is a heuristic chosen for the demo site; large production indices may need re-tuning.
- TxSolrSearch.rst: section headings use query.vectorSearch.* while :TS Path: values correctly use plugin.tx_solr.search.vectorSearch.* (no .query. segment in the TypoScript path). The two new sections follow the existing heading style for consistency; cleanup is a separate concern.

Commits
- [FEATURE] Add config getters for hybrid vector re-rank search
- [FEATURE] Hybrid edismax + KNN reRank query path (query.type=2)
- [BUGFIX] Clamp hybrid KNN topK to 1 and cover min() branch with a test
- [TASK] Apply TYPO3 coding standards and fix PHPStan to hybrid vector path
- [BUGFIX] Restore getRawSearchTerm() assertion with instanceof guard
- [DOCS] Document hybrid query.type=2 and reRank settings