Commit bc0ae4f

fix(verif): extend query-entity resolution to natural-language tokens (engineer follow-up)
Engineer flagged a residual gap in ddb5b58/024ea1a: extract_query_entities() only matches CamelCase / path / backtick patterns. Natural-language queries ("what was discussed about authentication") resolved zero entity_ids and fell through to the token-Jaccard fallback, which could mute the dendritic ablation delta on most LongMemEval-S queries.

Fix: two-stage resolution in _resolve_query_entity_ids():
- Stage 1: extract_query_entities() (high-precision, unchanged)
- Stage 2: extract_keywords() from shared/text.py (stopword-aware) → for each token of length >= 4, try get_entity_by_name(token). Skips tokens already resolved by stage 1.

Cost per recall: typically <=30 indexed lookups against the entities.name index (sub-ms). The keyword extractor's stopword filter prevents swamping the entity index with function-word junk.

This matters for dendritic ablation evidence: if natural-language queries fall to token-Jaccard in BOTH active and ablated paths, the delta is zero by construction, independent of whether the mechanism contributes. After this fix, queries that name real entities exercise the entity-set Jaccard on the active path and degrade to token-Jaccard only when ablated.
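The "zero by construction" argument can be illustrated with a toy scorer. This is only a sketch under assumptions: `score`, `ablation_delta`, the `dendritic_active` flag, and the sample token and id sets are hypothetical and are not taken from recall_pipeline.py. It assumes the active path uses entity-set Jaccard when the query resolved at least one entity_id and otherwise falls back to token-Jaccard, as the message describes.

```python
def jaccard(a: set, b: set) -> float:
    """Plain Jaccard similarity; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(q_tokens, q_entity_ids, c_tokens, c_entity_ids, *, dendritic_active):
    # Hypothetical scorer: entity-set Jaccard only when the mechanism is
    # active AND the query resolved at least one entity_id.
    if dendritic_active and q_entity_ids:
        return jaccard(q_entity_ids, c_entity_ids)
    # Otherwise both paths share the same token-Jaccard proxy.
    return jaccard(q_tokens, c_tokens)


def ablation_delta(q_tokens, q_entity_ids, c_tokens, c_entity_ids):
    """Active-path score minus ablated-path score for one candidate."""
    active = score(q_tokens, q_entity_ids, c_tokens, c_entity_ids,
                   dendritic_active=True)
    ablated = score(q_tokens, q_entity_ids, c_tokens, c_entity_ids,
                    dendritic_active=False)
    return active - ablated


# Illustrative data (assumed): one query, one candidate memory.
q_tokens = {"what", "discussed", "authentication"}
c_tokens = {"authentication", "token", "refresh", "flow"}
c_entity_ids = {7, 12}

# Before the fix: the natural-language query resolves no entity_ids.
delta_before = ablation_delta(q_tokens, set(), c_tokens, c_entity_ids)

# After the fix: "authentication" resolves to a (hypothetical) entity_id 7.
delta_after = ablation_delta(q_tokens, {7}, c_tokens, c_entity_ids)
```

With an empty entity-id set, both calls to `score` take the same token-Jaccard branch, so `delta_before` is exactly zero regardless of the candidate. Once the query resolves an id, the two paths can diverge and the ablation has something to measure.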
1 parent 024ea1a commit bc0ae4f

1 file changed

Lines changed: 44 additions & 6 deletions

mcp_server/core/recall_pipeline.py

@@ -311,22 +311,60 @@ def _candidate_tags(c: dict[str, Any]) -> set[str]:
 
 
 def _resolve_query_entity_ids(query: str, store: Any) -> set[int]:
-    """Resolve query entities (CamelCase / paths / backticks) to entity_ids.
-
-    Constant-cost per recall (typical query has 0-5 such entities), so
-    this stays a small loop of ``get_entity_by_name`` calls, not a
-    per-candidate scan. Returns the empty set if nothing resolves —
-    callers must then fall back to the token-Jaccard proxy.
+    """Resolve query entities to entity_ids.
+
+    Two-stage resolution so dendritic Jaccard fires on natural-language
+    queries (not just CamelCase / path / backtick patterns):
+
+    1. ``extract_query_entities()`` — high-precision extraction of
+       CamelCase, slash paths, backtick code spans (Poirazi et al.
+       compatible: these are typically the entity-anchors a user writes).
+    2. **Token fallback** — for every alphabetic token of length ≥ 4
+       not already extracted, attempt ``get_entity_by_name(token)``.
+       This catches natural-language anchors like "authentication" or
+       "deployment" which the regex-extractor misses but which often
+       canonicalize to a real entity_id in the production graph.
+
+    Constant cost per recall (typical query: ≤30 candidate tokens
+    after stopword filter; each is one indexed lookup against the
+    ``entities`` table's name index — sub-millisecond). Returns the
+    empty set if nothing resolves; callers fall back to the
+    token-Jaccard proxy.
     """
     from mcp_server.core.query_decomposition import extract_query_entities
+    from mcp_server.shared.text import extract_keywords
 
     if not hasattr(store, "get_entity_by_name"):
         return set()
     ids: set[int] = set()
+    seen_names: set[str] = set()
+
+    # Stage 1: high-precision extractor (CamelCase / paths / backticks).
     for name in extract_query_entities(query):
+        if not name or name in seen_names:
+            continue
+        seen_names.add(name)
         row = store.get_entity_by_name(name)
         if row and row.get("id") is not None:
             ids.add(int(row["id"]))
+
+    # Stage 2: natural-language token fallback. Use the existing
+    # stopword-aware keyword extractor (`shared/text.py::extract_keywords`)
+    # which already filters function words; only try tokens of length ≥ 4
+    # to avoid swamping the entity index with junk lookups.
+    try:
+        for token in extract_keywords(query):
+            if len(token) < 4 or token in seen_names:
+                continue
+            seen_names.add(token)
+            row = store.get_entity_by_name(token)
+            if row and row.get("id") is not None:
+                ids.add(int(row["id"]))
+    except Exception:  # noqa: BLE001
+        # Fallback path is non-load-bearing; if the keyword extractor
+        # ever fails on a malformed query we still have the stage-1 ids.
+        pass
+
     return ids
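The two-stage flow in this hunk can be tried outside the repository with stand-ins. The sketch below is not the project's code: `extract_query_entities_stub`, `extract_keywords_stub`, `STOPWORDS`, and `InMemoryStore` are hypothetical replacements for `extract_query_entities`, `extract_keywords`, and the real store, whose actual behaviour is not shown in this commit. Only the control flow (stage 1, then a length >= 4 token fallback with de-duplication) follows the diff; the defensive `try`/`except` around stage 2 is left out for brevity.

```python
import re
from typing import Any, Optional

# Assumed stopword list; the real filter lives in shared/text.py.
STOPWORDS = {"what", "was", "about", "the", "and", "with", "that", "this", "from"}


def extract_query_entities_stub(query: str) -> list:
    """Stand-in for stage 1: CamelCase words and backtick spans only."""
    camel = re.findall(r"\b(?:[A-Z][a-z0-9]+){2,}\b", query)
    ticks = re.findall(r"`([^`]+)`", query)
    return camel + ticks


def extract_keywords_stub(query: str) -> list:
    """Stand-in for stage 2: lowercase alphabetic tokens minus stopwords."""
    return [t for t in re.findall(r"[a-z]+", query.lower()) if t not in STOPWORDS]


class InMemoryStore:
    """Hypothetical store exposing only get_entity_by_name()."""

    def __init__(self, names_to_ids: dict) -> None:
        self._rows = {n: {"id": i, "name": n} for n, i in names_to_ids.items()}

    def get_entity_by_name(self, name: str) -> Optional[dict]:
        return self._rows.get(name)


def resolve_query_entity_ids(query: str, store: Any) -> set:
    ids = set()
    seen = set()
    # Stage 1: high-precision patterns.
    for name in extract_query_entities_stub(query):
        if not name or name in seen:
            continue
        seen.add(name)
        row = store.get_entity_by_name(name)
        if row and row.get("id") is not None:
            ids.add(int(row["id"]))
    # Stage 2: keyword fallback, skipping short or already-tried tokens.
    for token in extract_keywords_stub(query):
        if len(token) < 4 or token in seen:
            continue
        seen.add(token)
        row = store.get_entity_by_name(token)
        if row and row.get("id") is not None:
            ids.add(int(row["id"]))
    return ids


store = InMemoryStore({"authentication": 7, "RecallPipeline": 3})
natural = resolve_query_entity_ids("what was discussed about authentication", store)
mixed = resolve_query_entity_ids("how does RecallPipeline handle authentication", store)
```

In this toy setup the purely natural-language query resolves an id only through stage 2, while the mixed query picks up one id from each stage. Note that the stub keyword extractor lowercases tokens, so `recallpipeline` is tried again in stage 2 as a separate (missing) name; whether the real `extract_keywords` normalizes case the same way is not visible in this diff.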