
Commit 9fac9a9

groksrc and claude committed
feat(core): add config-gated entity-aware ranking boost for hybrid search
Proper nouns in a query carry no extra weight against generic semantic similarity, so documents about a different entity on the same topic can outrank the document that actually names the queried entity (#951 cross-conversation confusion in the LoCoMo benchmark).

Add an optional, lexical-only re-scoring pass to hybrid fusion:

- Extract candidate entity terms from the query (capitalized / proper-noun tokens that are not common stopwords; trailing possessives stripped).
- Count how many distinct query entity terms appear in each fused candidate's entity name (title) or a relation row's linked entity names.
- Multiply matching candidates' fused scores by 1 + weight * min(matches, max_terms), promoting entity-matching docs.

The boost runs over the full fused candidate set before the limit/offset cut, so a matching doc below the cutoff can be promoted into the returned window. It adds no model inference (index/lexical lookups only), so per-query latency overhead is trivial, and only affects hybrid retrieval.

Behind three config flags, DEFAULT OFF pending LoCoMo benchmark validation: search_entity_boost_enabled, search_entity_boost_weight, search_entity_boost_max_terms. Documented in docs/semantic-search.md.

Tests: unit coverage for entity-term extraction and the boost math; a hybrid-pipeline test showing reordering when enabled and unchanged ordering when disabled; and a service-level integration test over a real DB with a deterministic stub embedding provider proving an entity-matching doc outranks a higher-similarity non-matching doc only when enabled.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
1 parent d46c688 commit 9fac9a9

8 files changed

Lines changed: 764 additions & 0 deletions

docs/semantic-search.md

Lines changed: 38 additions & 0 deletions
@@ -107,6 +107,44 @@ All settings are fields on `BasicMemoryConfig` and can be set via environment va
 | `semantic_embedding_document_input_type` | `BASIC_MEMORY_SEMANTIC_EMBEDDING_DOCUMENT_INPUT_TYPE` | Auto for known LiteLLM models | Optional LiteLLM `input_type` for indexed document/passages. |
 | `semantic_embedding_query_input_type` | `BASIC_MEMORY_SEMANTIC_EMBEDDING_QUERY_INPUT_TYPE` | Auto for known LiteLLM models | Optional LiteLLM `input_type` for search queries. |
 | `semantic_vector_k` | `BASIC_MEMORY_SEMANTIC_VECTOR_K` | `100` | Candidate count for vector nearest-neighbour retrieval. Higher values improve recall at the cost of latency. |
+| `search_entity_boost_enabled` | `BASIC_MEMORY_SEARCH_ENTITY_BOOST_ENABLED` | `false` | Enable the entity-aware ranking boost in hybrid search (see below). Default off pending benchmark validation. |
+| `search_entity_boost_weight` | `BASIC_MEMORY_SEARCH_ENTITY_BOOST_WEIGHT` | `0.15` | Per-matched-term multiplier strength for the entity boost. A candidate matching N query entity terms is scaled by `1 + weight * min(N, max_terms)`. |
+| `search_entity_boost_max_terms` | `BASIC_MEMORY_SEARCH_ENTITY_BOOST_MAX_TERMS` | `3` | Maximum number of distinct matched entity terms that contribute to the boost, bounding the multiplier. |
+
+## Entity-Aware Ranking Boost
+
+Hybrid search fuses keyword (FTS) and vector similarity, but proper nouns in a query
+carry no special weight against generic semantic similarity. As a result, a document
+about a *different* entity on the same topic can outrank the document that actually
+names the queried entity, e.g. "What are Joanna's hobbies?" surfacing a generic
+hobbies note ahead of Joanna's note (see
+[#951](https://github.com/basicmachines-co/basic-memory/issues/951)).
+
+When `search_entity_boost_enabled=true`, hybrid retrieval performs a final,
+lexical-only re-scoring pass:
+
+1. It extracts candidate entity terms from the query: capitalized / proper-noun
+   tokens that are not common stopwords (e.g. `Joanna`, `Anthony`, `NASA`).
+2. For each fused candidate, it counts how many distinct query entity terms appear in
+   the candidate's entity name (its title) or in a relation row's linked entity names.
+3. Matching candidates have their fused score multiplied by
+   `1 + weight * min(matches, max_terms)`, so an entity-matching document can be
+   promoted above a higher-similarity non-matching one.
+
+The boost adds **no model inference**: it is pure index/lexical lookup, so per-query
+latency overhead is trivial. It only affects `hybrid` retrieval; `text` and `vector`
+modes are unchanged. Non-matching candidates keep their original scores, so ordering
+among them is preserved.
+
+```bash
+export BASIC_MEMORY_SEARCH_ENTITY_BOOST_ENABLED=true
+# Optional tuning:
+export BASIC_MEMORY_SEARCH_ENTITY_BOOST_WEIGHT=0.15
+export BASIC_MEMORY_SEARCH_ENTITY_BOOST_MAX_TERMS=3
+```
+
+> **Default off.** This setting is disabled by default pending LoCoMo benchmark
+> validation. Enable it to experiment with entity-heavy corpora.
 
 ## Embedding Providers
 
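Reviewer note: the multiplier documented in this hunk is easier to reason about with concrete numbers. The sketch below only restates the formula `1 + weight * min(matches, max_terms)` using the documented defaults (`0.15` and `3`); the function name `entity_boost_multiplier` is hypothetical and does not appear in the commit.

```python
def entity_boost_multiplier(matches: int, weight: float = 0.15, max_terms: int = 3) -> float:
    """Score multiplier for a candidate with `matches` distinct entity-term hits.

    Hypothetical helper: it restates the documented formula and is not the
    commit's own function.
    """
    if matches <= 0 or weight <= 0:
        # Non-matching candidates keep their original fused score.
        return 1.0
    return 1.0 + weight * min(matches, max_terms)


if __name__ == "__main__":
    for n in range(6):
        # Rounded only for display; the cap makes the value flat from n = 3 onward.
        print(n, round(entity_boost_multiplier(n), 2))
```

With the defaults the multiplier therefore never exceeds roughly 1.45. For a single matched term, a matching candidate overtakes a non-matching one only when its fused score is already within about 13% of the other's (at least 1/1.15 of it), which suggests the default weight acts as a tie-breaker among close candidates, not as an override.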

src/basic_memory/config.py

Lines changed: 32 additions & 0 deletions
@@ -346,6 +346,38 @@ def __init__(self, **data: Any) -> None: ...
         "Valid values: text, vector, hybrid. "
         "When unset, defaults to 'hybrid' if semantic search is enabled, otherwise 'text'.",
     )
+    # Entity-aware ranking boost (hybrid retrieval).
+    # Trigger: proper nouns in a query (e.g. "Joanna") carry no extra weight against
+    # generic semantic similarity, so documents from the wrong conversation can outrank
+    # the gold document during hybrid fusion (#951).
+    # Why: entities are first-class in Basic Memory, so a candidate whose title or linked
+    # relation names contain a query proper noun is a stronger answer than a same-topic
+    # document about a different entity.
+    # Outcome: when enabled, hybrid fusion multiplies a candidate's fused score by a small
+    # bonus for each distinct query entity term it matches lexically (no model inference).
+    # Default OFF pending LoCoMo benchmark validation by the maintainer.
+    search_entity_boost_enabled: bool = Field(
+        default=False,
+        description="Enable entity-aware ranking boost in hybrid search. When enabled, "
+        "hybrid candidates whose title or linked relation names contain a proper-noun "
+        "term from the query are boosted in the final ranking. Lexical-only; adds no "
+        "model inference. Default off pending benchmark validation.",
+    )
+    search_entity_boost_weight: float = Field(
+        default=0.15,
+        description="Per-matched-term multiplier strength for the entity-aware ranking "
+        "boost. A candidate matching N distinct query entity terms has its fused score "
+        "multiplied by (1 + weight * N), capped at search_entity_boost_max_terms terms. "
+        "Only applies when search_entity_boost_enabled is true.",
+        ge=0.0,
+    )
+    search_entity_boost_max_terms: int = Field(
+        default=3,
+        description="Maximum number of distinct matched entity terms that contribute to "
+        "the entity-aware ranking boost, bounding the multiplier so a single candidate "
+        "cannot run away with the ranking.",
+        gt=0,
+    )
 
     # Database connection pool configuration (Postgres only)
     db_pool_size: int = Field(

src/basic_memory/repository/postgres_search_repository.py

Lines changed: 3 additions & 0 deletions
@@ -67,6 +67,9 @@ def __init__(
         self._semantic_postgres_prepare_concurrency = (
             self._app_config.semantic_postgres_prepare_concurrency
         )
+        self._entity_boost_enabled = self._app_config.search_entity_boost_enabled
+        self._entity_boost_weight = self._app_config.search_entity_boost_weight
+        self._entity_boost_max_terms = self._app_config.search_entity_boost_max_terms
         self._embedding_provider = embedding_provider
         self._vector_dimensions = 384
         self._vector_tables_initialized = False

src/basic_memory/repository/search_repository_base.py

Lines changed: 173 additions & 0 deletions
@@ -45,6 +45,64 @@
 # the vector/hybrid retrieval path must key rows by (type, id) to avoid collisions.
 type SearchIndexKey = tuple[str, int]
 
+# --- Entity-aware ranking boost (#951) ---
+
+# Match word tokens (allowing internal apostrophes/hyphens) so we can inspect
+# their capitalization to detect proper-noun-like query terms.
+_ENTITY_TERM_TOKEN_PATTERN = re.compile(r"[A-Za-z][A-Za-z'\-]*")
+
+# Common capitalized sentence-starters and interrogatives that look like proper
+# nouns but are not entity references. Kept lowercase for case-insensitive checks.
+# Intentionally small: a candidate term only boosts a row when it actually matches
+# that row's title/relation names, so a stray non-entity term simply does nothing.
+_ENTITY_TERM_STOPWORDS = frozenset(
+    {
+        "a",
+        "an",
+        "and",
+        "are",
+        "as",
+        "at",
+        "be",
+        "but",
+        "by",
+        "do",
+        "does",
+        "for",
+        "from",
+        "has",
+        "have",
+        "how",
+        "i",
+        "in",
+        "is",
+        "it",
+        "of",
+        "on",
+        "or",
+        "the",
+        "their",
+        "they",
+        "this",
+        "to",
+        "was",
+        "we",
+        "were",
+        "what",
+        "when",
+        "where",
+        "which",
+        "who",
+        "whom",
+        "whose",
+        "why",
+        "will",
+        "with",
+        "you",
+        "your",
+    }
+)
+
 
 @dataclass
 class VectorSyncBatchResult:
@@ -166,6 +224,13 @@ class SearchRepositoryBase(ABC):
     _vector_dimensions: int
     _vector_tables_initialized: bool
 
+    # Entity-aware ranking boost (#951). Defaults keep the feature off for any
+    # subclass or test double that does not explicitly configure it. Concrete
+    # backends overwrite these from BasicMemoryConfig in their __init__.
+    _entity_boost_enabled: bool = False
+    _entity_boost_weight: float = 0.0
+    _entity_boost_max_terms: int = 1
+
     def __init__(self, session_maker: async_sessionmaker[AsyncSession], project_id: int):
         """Initialize with session maker and project_id filter.
 
@@ -2147,6 +2212,105 @@ async def _fetch_search_index_rows_by_ids(
     # Shared semantic search: hybrid score-based fusion
     # ------------------------------------------------------------------
 
+    # --- Entity-aware ranking boost (#951) ---
+
+    @staticmethod
+    def _extract_query_entity_terms(search_text: Optional[str]) -> set[str]:
+        """Extract candidate entity (proper-noun) terms from a query string.
+
+        Heuristic, lexical only (no model inference): a token is a candidate entity
+        term when it is title-cased or all-caps and is not a common stopword. The
+        result is lowercased so downstream matching is case-insensitive.
+
+        Examples:
+            "What are Joanna's hobbies?" -> {"joanna"}
+            "Who is Anthony?" -> {"anthony"}
+            "Deborah and Jolene" -> {"deborah", "jolene"}
+            "what is the weather" -> set() (no proper nouns)
+        """
+        if not search_text:
+            return set()
+
+        terms: set[str] = set()
+        for match in _ENTITY_TERM_TOKEN_PATTERN.finditer(search_text):
+            token = match.group(0)
+            # Trigger: token begins with an uppercase letter (Title-Case or ALL-CAPS).
+            # Why: proper nouns and named entities are conventionally capitalized; this
+            # is the cheapest reliable signal without a NER model.
+            # Outcome: lowercase, non-capitalized words are ignored as generic terms.
+            if not token[0].isupper():
+                continue
+            normalized = token.lower()
+            # Strip a trailing possessive so "Joanna's" matches the entity "Joanna".
+            if normalized.endswith("'s"):
+                normalized = normalized[:-2]
+            if normalized in _ENTITY_TERM_STOPWORDS:
+                continue
+            # Single characters (e.g. a stray "I") carry no entity signal.
+            if len(normalized) < 2:
+                continue
+            terms.add(normalized)
+        return terms
+
+    @staticmethod
+    def _row_entity_match_count(row: SearchIndexRow, entity_terms: set[str]) -> int:
+        """Count distinct query entity terms that a candidate row references.
+
+        Matches against the row's own entity name (title) and the names embedded in
+        a relation row's title (``"From -> To"``). These are the fields where Basic
+        Memory's first-class entity names surface, so a match here is strong evidence
+        the candidate is about the queried entity rather than a same-topic document.
+        """
+        if not entity_terms:
+            return 0
+
+        haystack_parts = [row.title or ""]
+        # Relation rows encode linked entity names in their title ("From -> To");
+        # the relation_type itself is not an entity name, so it is excluded.
+        haystack = " ".join(part for part in haystack_parts if part)
+        if not haystack:
+            return 0
+
+        haystack_tokens: set[str] = set()
+        for match in _ENTITY_TERM_TOKEN_PATTERN.finditer(haystack):
+            token = match.group(0).lower()
+            # Mirror the query-side possessive stripping so a doc titled
+            # "Joanna's Hobbies" matches the query entity term "joanna".
+            if token.endswith("'s"):
+                token = token[:-2]
+            haystack_tokens.add(token)
+        return len(entity_terms & haystack_tokens)
+
+    def _apply_entity_boost(
+        self,
+        fused_scores: dict[SearchIndexKey, float],
+        rows_by_key: dict[SearchIndexKey, SearchIndexRow],
+        entity_terms: set[str],
+    ) -> dict[SearchIndexKey, float]:
+        """Multiply fused scores by a per-matched-term bonus for entity-matching rows.
+
+        Trigger: entity boosting is enabled and the query contains proper-noun terms.
+        Why: a candidate whose entity/relation names contain a queried proper noun is a
+        stronger answer than a generic same-topic document (#951 cross-conversation
+        confusion).
+        Outcome: ``score * (1 + weight * min(matches, max_terms))``. Rows that match no
+        query entity term are returned unchanged, so relative order among non-matching
+        rows is preserved.
+        """
+        if not self._entity_boost_enabled or not entity_terms or self._entity_boost_weight <= 0:
+            return fused_scores
+
+        boosted: dict[SearchIndexKey, float] = {}
+        for row_key, score in fused_scores.items():
+            row = rows_by_key.get(row_key)
+            matches = self._row_entity_match_count(row, entity_terms) if row is not None else 0
+            if matches <= 0:
+                boosted[row_key] = score
+                continue
+            capped_matches = min(matches, self._entity_boost_max_terms)
+            boosted[row_key] = score * (1.0 + self._entity_boost_weight * capped_matches)
+        return boosted
+
     async def _search_hybrid(
         self,
         *,
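Reviewer note: the heuristics in this hunk have a few edge cases that are easier to see in isolation. The block below is a standalone restatement of the tokenizer, possessive stripping and title matching, not the commit's code: the names `extract_entity_terms` and `title_match_count` are hypothetical, and the stopword set is deliberately cut down for illustration.

```python
import re

# Same token pattern as the hunk above: a letter followed by letters, apostrophes or hyphens.
TOKEN = re.compile(r"[A-Za-z][A-Za-z'\-]*")

# Assumption: reduced stopword set for illustration only; the commit's list is longer.
STOPWORDS = frozenset({"what", "who", "are", "is", "the", "and"})


def _normalize(token: str) -> str:
    """Lowercase a token and strip a trailing possessive ('s)."""
    token = token.lower()
    return token[:-2] if token.endswith("'s") else token


def extract_entity_terms(query: str) -> set[str]:
    """Collect capitalized, non-stopword tokens as lowercase candidate entity terms."""
    terms: set[str] = set()
    for match in TOKEN.finditer(query or ""):
        token = match.group(0)
        if not token[0].isupper():
            continue  # lowercase words are treated as generic terms
        normalized = _normalize(token)
        if normalized in STOPWORDS or len(normalized) < 2:
            continue
        terms.add(normalized)
    return terms


def title_match_count(title: str, terms: set[str]) -> int:
    """Count distinct entity terms that appear as whole tokens in a title."""
    tokens = {_normalize(m.group(0)) for m in TOKEN.finditer(title or "")}
    return len(terms & tokens)


if __name__ == "__main__":
    print(extract_entity_terms("What are Joanna's hobbies?"))
    print(extract_entity_terms("Tell me about Joanna"))
    print(extract_entity_terms("what does joanna like"))
    print(title_match_count("Deborah -> Jolene", {"deborah", "jolene"}))
    print(title_match_count("Jean-Luc Notes", {"jean"}))
```

Three behaviours follow from the pattern and the capitalization check. A sentence-initial word that is not a stopword (here `Tell`) is extracted as a candidate term, which is harmless unless some title contains that word. A query typed entirely in lowercase yields no terms, so the boost never fires. A hyphenated name is a single token, so the term `jean` does not match a title containing `Jean-Luc`.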
@@ -2250,6 +2414,15 @@ async def _search_hybrid(
             f = fts_scores.get(row_key, 0.0)
             fused_scores[row_key] = max(v, f) + FUSION_BONUS * min(v, f)
 
+        # Entity-aware ranking boost (#951): runs over the full fused candidate set
+        # before the limit/offset cut, so a boosted entity-matching candidate can be
+        # promoted into the returned window. No-op when the feature is disabled or the
+        # query contains no proper-noun terms, preserving the existing ordering.
+        entity_terms = (
+            self._extract_query_entity_terms(query_text) if self._entity_boost_enabled else set()
+        )
+        fused_scores = self._apply_entity_boost(fused_scores, rows_by_key, entity_terms)
+
         ranked = sorted(fused_scores.items(), key=lambda item: item[1], reverse=True)
         output: list[SearchIndexRow] = []
         for row_key, fused_score in ranked[offset : offset + limit]:
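Reviewer note: the placement of the boost before the `ranked[offset : offset + limit]` slice is what lets a matching candidate enter the returned window. The toy sketch below illustrates that ordering effect only. The `rerank` function, the document keys and the fused scores are all invented for illustration; match counts are passed in directly where the real code derives them from row titles.

```python
def rerank(
    fused: dict[str, float],
    match_counts: dict[str, int],
    *,
    enabled: bool,
    weight: float = 0.15,
    max_terms: int = 3,
    limit: int = 2,
    offset: int = 0,
) -> list[str]:
    """Boost matching candidates, then sort and apply the limit/offset cut.

    Hypothetical stand-in for the tail of the hybrid search path.
    """
    scores = dict(fused)  # copy so the caller's scores are left untouched
    if enabled and weight > 0:
        for key, matches in match_counts.items():
            if matches > 0 and key in scores:
                scores[key] *= 1.0 + weight * min(matches, max_terms)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [key for key, _ in ranked[offset : offset + limit]]


# Invented scores: the entity-matching note sits just below a limit=2 cutoff.
FUSED = {"generic-hobbies": 0.80, "weekend-ideas": 0.78, "joanna-profile": 0.72}
MATCHES = {"joanna-profile": 1}

if __name__ == "__main__":
    print(rerank(FUSED, MATCHES, enabled=False))
    print(rerank(FUSED, MATCHES, enabled=True))
```

With the boost disabled the window holds the two generic notes. With it enabled, the matching note's score becomes 0.72 × 1.15 = 0.828 and it moves to the top. Had the boost been applied after the slice, the matching note would never have been considered.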

src/basic_memory/repository/sqlite_search_repository.py

Lines changed: 3 additions & 0 deletions
@@ -55,6 +55,9 @@ def __init__(
         self._semantic_embedding_sync_batch_size = (
             self._app_config.semantic_embedding_sync_batch_size
         )
+        self._entity_boost_enabled = self._app_config.search_entity_boost_enabled
+        self._entity_boost_weight = self._app_config.search_entity_boost_weight
+        self._entity_boost_max_terms = self._app_config.search_entity_boost_max_terms
         self._embedding_provider = embedding_provider
         self._sqlite_vec_load_lock = asyncio.Lock()
         self._sqlite_prepare_write_lock = asyncio.Lock()
