[FEATURE] Add hybrid edismax + KNN reRank query mode (query.type=2) #4654

Open
dkd-dobberkau wants to merge 6 commits into TYPO3-Solr:main from dkd-dobberkau:feature/hybrid-vector-rerank

@dkd-dobberkau
Contributor

Builds on #4065. Adds a hybrid (classical + KNN) sibling to the existing
query.type = 1 (pure vector) mode introduced as part of that feature.

Motivation

The existing plugin.tx_solr.search.query.type = 1 performs a pure KNN
vector search (q=*:* plus a vector frange filter), discarding classical
term-matching signals. For keyword-heavy queries this produces phantom
matches from the embedding cluster while exact-term hits get scored
identically to remote semantic neighbours.

In practice, on a TYPO3 demo site (solr-ddev-site with nomic-embed-text
embeddings), searching for WordPress in pure-vector mode returns 25 hits
(all unrelated TYPO3 pages sharing a "CMS" embedding cluster), while
searching for TYPO3, the dominant term on the site, returns only 1 hit.
Raising the minimum-similarity threshold does not help; it only shifts the
phantom count linearly.

What this PR adds

A third value, query.type = 2, which enables a hybrid re-rank mode:

  • Classical edismax produces the candidate set (recall driver),
    preserving qf/pf/mm/highlighting/facets/spellcheck.
  • Solr's {!rerank reRankQuery={!knn_text_to_vector ...}} orders
    the top-N candidates by bm25 + weight * cosine (precision refinement).
  • Score-mix is tunable via two new TypoScript keys.
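
The request shape described in the list above can be sketched as follows. This is a minimal Python illustration, not the PR's PHP code: the function name build_hybrid_params and its arguments are hypothetical, and passing the inner query through $rqq dereferencing is an assumption based on the "rq/rqq parameter shape" mentioned under test coverage. The model=llm and f=vector values and the max(1, min(topK, reRankDocs)) clamp are taken from this PR's description.

```python
def build_hybrid_params(
    user_query: str,
    top_k: int,
    rerank_docs: int = 200,      # reRank.docs default from this PR
    rerank_weight: float = 2.0,  # reRank.weight default from this PR
    model: str = "llm",          # hardcoded in the PR (see "Known gaps")
    field: str = "vector",       # hardcoded in the PR (see "Known gaps")
) -> dict:
    """Build Solr request parameters for an edismax + KNN re-rank query (sketch)."""
    # Clamp as described in commit 3: never send topK=0, and never request
    # more neighbours than there are re-ranked candidates.
    knn_top_k = max(1, min(top_k, rerank_docs))
    return {
        # The classical query stays the main query and drives recall.
        "q": user_query,
        "defType": "edismax",
        # Solr's rerank parser re-scores the top reRankDocs results.
        "rq": (
            f"{{!rerank reRankQuery=$rqq reRankDocs={rerank_docs} "
            f"reRankWeight={rerank_weight}}}"
        ),
        # The inner query embeds the search text and scores by vector similarity.
        "rqq": (
            f"{{!knn_text_to_vector model={model} f={field} "
            f"topK={knn_top_k}}}{user_query}"
        ),
    }


params = build_hybrid_params("TYPO3", top_k=1000)
for name, value in params.items():
    print(f"{name}={value}")
```

Because q and defType are left untouched, everything that depends on the main query (qf, pf, mm, highlighting, facets, spellcheck) keeps working as in classical mode; only the ordering of the top candidates changes.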

The hybrid branch is strictly additive: a new
isHybridVectorSearchEnabled() predicate sits inside the existing
else-arm of buildSearchQuery(). query.type = 0 (classical) and
query.type = 1 (pure vector) code paths are byte-identical to base.

Configuration

Two new TypoScript keys under plugin.tx_solr.search.vectorSearch.reRank:

Key     Type     Default  Effect
docs    Integer  200      Candidate pool size for re-ranking
weight  Float    2.0      Cosine multiplier: final = bm25 + weight * cosine

Existing keys (vectorSearch.minimumSimilarity, vectorSearch.topK)
stay relevant only for query.type = 1; the docs notes have been updated
to make this explicit.

Manual verification (TYPO3 14 / Solr 10)

Term       type=0  type=1  type=2
TYPO3      31      1       31
WordPress  0       25      0
language   6       varies  6

type=2 matches type=0 on recall (identical hit counts) while top-of-list
ordering is refined by the cosine signal. The phantom CMS-cluster matches
seen under type=1 disappear because documents with a BM25 score of 0 never
enter the candidate set.
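
To illustrate why recall is unchanged while ordering shifts, here is a simplified Python model of the re-rank step. It is a sketch under stated assumptions, not Solr's implementation: the Hit type, the hybrid_rerank function and the toy scores are hypothetical, and it assumes every candidate receives a cosine score (in Solr, only candidates that also match the inner KNN query get the added term).

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    bm25: float    # classical edismax score; 0.0 means no lexical match
    cosine: float  # similarity between the document and the query embedding


def hybrid_rerank(hits, rerank_docs=200, weight=2.0):
    """Return doc ids ordered by a simplified hybrid re-rank."""
    # Only lexical matches are candidates, so recall equals classical mode.
    candidates = sorted(
        (h for h in hits if h.bm25 > 0), key=lambda h: h.bm25, reverse=True
    )
    # Only the top rerank_docs candidates are re-scored; the rest keep
    # their classical order behind them.
    head, tail = candidates[:rerank_docs], candidates[rerank_docs:]
    rescored = sorted(
        head, key=lambda h: h.bm25 + weight * h.cosine, reverse=True
    )
    return [h.doc_id for h in rescored + tail]


hits = [
    Hit("exact-a", bm25=5.0, cosine=0.25),  # final: 5.0 + 2.0 * 0.25 = 5.5
    Hit("exact-b", bm25=4.5, cosine=0.75),  # final: 4.5 + 2.0 * 0.75 = 6.0
    Hit("phantom", bm25=0.0, cosine=0.9),   # no lexical match: never a candidate
]
print(hybrid_rerank(hits))  # ['exact-b', 'exact-a']
```

The "phantom" document has the highest cosine similarity but is dropped before re-ranking, which mirrors the WordPress row in the table: 25 hits under type=1, 0 under type=2.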

Test coverage

  • 961 unit tests pass (8 new: getter defaults + configured values, hybrid
    build path including rq/rqq parameter shape, min() clamp branch
    with reRankDocs < topK, cross-relation with the existing vector flags)
  • 84 integration tests (QueryBuilder|Search filter) pass, no regressions
  • PHPStan level 6: 0 errors
  • TYPO3 Coding Standards: clean

Known gaps / follow-ups

  • The hardcoded model=llm f=vector strings in
    attachHybridVectorReRanking() mirror preparePureVectorSearch() — same
    hardcoding, kept for symmetry. A future PR could hoist both to a config
    getter.
  • No integration test exercises hybrid against a live Solr — would require
    seeded vector fixture data, scoped out of this PR.
  • The 2.0 default weight is a heuristic chosen for the demo site; large
    production indices may need re-tuning.
  • Pre-existing doc inconsistency in TxSolrSearch.rst: section headings
    use query.vectorSearch.* while :TS Path: values correctly use
    plugin.tx_solr.search.vectorSearch.* (no .query. segment in the
    TypoScript path). The two new sections follow the existing heading style
    for consistency; cleanup is a separate concern.

Commits

  1. [FEATURE] Add config getters for hybrid vector re-rank search
  2. [FEATURE] Hybrid edismax + KNN reRank query path (query.type=2)
  3. [BUGFIX] Clamp hybrid KNN topK to 1 and cover min() branch with a test
  4. [TASK] Apply TYPO3 coding standards and fix PHPStan to hybrid vector path
  5. [BUGFIX] Restore getRawSearchTerm() assertion with instanceof guard
  6. [DOCS] Document hybrid query.type=2 and reRank settings

dkd-dobberkau and others added 6 commits May 11, 2026 17:18
Introduces three TypoScriptConfiguration helpers required for the
upcoming hybrid query.type=2 mode in QueryBuilder:
- isHybridVectorSearchEnabled(): query.type === 2
- getVectorReRankDocs(): candidate-pool size, default 200
- getVectorReRankWeight(): cosine multiplier, default 2.0

Pure vector helpers and the existing query.type=1 path are
unchanged.

QueryBuilder gains a third branch: classical edismax produces the
candidate set, Solr's {!rerank} query parser orders the top-N by
bm25 + weight*cosine using a {!knn_text_to_vector} inner query.

Unlike pure vector search (query.type=1) the classical query
remains the recall driver, so qf/pf/mm/highlighting all work as
usual and irrelevant vector neighbours can no longer dominate the
result list.

Tunables (TypoScript):
- search.vectorSearch.reRank.docs   (default 200)
- search.vectorSearch.reRank.weight (default 2.0)

max(1, min($topK, $reRankDocs)) prevents an invalid topK=0
being sent to Solr's KNN parser when either getter is
misconfigured. A second unit test exercises the case where the
two inputs differ, so the min() logic is no longer silently
bypassed by equal values.

…path

Replace call to undefined Query::getRawSearchTerm() with getQuery() in
canBuildSearchQueryForHybridVectorSearch test — the return type of
buildSearchQuery() is Query (not SearchQuery), so getRawSearchTerm()
is not visible to PHPStan level 6. getQuery() returns the same 'q'
parameter value and is defined on the base class.

CS dry-run: 0 files to fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Aligns the hybrid vector test with the sibling canBuildSearchQueryForVectorSearch
pattern: an explicit instanceof SearchQuery check narrows PHPStan's type, so
the spec-mandated getRawSearchTerm() assertion can stay. Removes the
redundant double-check of $query->getQuery() introduced when working
around the PHPStan error.