fix(verif): extend query-entity resolution to natural-language tokens (engineer follow-up)

cdeust · cdeust · commit bc0ae4f9eaaa · 2026-05-02T11:56:10.000+02:00
Engineer flagged a residual gap in ddb5b58/024ea1a: extract_query_entities() only matches CamelCase / path / backtick patterns. Natural-language queries ("what was discussed about authentication") resolved zero entity_ids and fell through to the token-Jaccard fallback, which could mute the dendritic ablation delta on most LongMemEval-S queries.

Fix: two-stage resolution in _resolve_query_entity_ids():
- Stage 1: extract_query_entities() (high-precision, unchanged)
- Stage 2: extract_keywords() from shared/text.py (stopword-aware) → for each token of length >= 4, try get_entity_by_name(token). Skips names already tried in stage 1.

Cost per recall: typically <=30 indexed lookups against the entities.name index (sub-ms). The keyword extractor's stopword filter prevents swamping the entity index with function-word junk.

This matters for dendritic ablation evidence: if natural-language queries fall to token-Jaccard in BOTH active and ablated paths, the delta is zero by construction, independent of whether the mechanism contributes. After this fix, queries that name real entities exercise the entity-set Jaccard on the active path and degrade to token-Jaccard only when ablated.
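
The "zero by construction" argument can be illustrated with a minimal sketch. Everything in it is hypothetical and for illustration only: `jaccard`, `score`, and `ablation_delta` are stand-ins, not the actual scoring code in recall_pipeline.py, and the token sets and entity ids are made up. It assumes the active path uses entity-set Jaccard only when the query resolved at least one entity_id, and token-Jaccard otherwise.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(q_tokens: set, q_entity_ids: set, c_tokens: set, c_entity_ids: set,
          *, ablated: bool) -> float:
    """Hypothetical scorer (assumption, not the real pipeline).

    Active path: entity-set Jaccard if the query resolved any entity_id.
    Ablated path, or nothing resolved: token-Jaccard fallback.
    """
    if not ablated and q_entity_ids:
        return jaccard(q_entity_ids, c_entity_ids)
    return jaccard(q_tokens, c_tokens)


def ablation_delta(q_tokens: set, q_entity_ids: set,
                   c_tokens: set, c_entity_ids: set) -> float:
    """Active score minus ablated score for one query/candidate pair."""
    active = score(q_tokens, q_entity_ids, c_tokens, c_entity_ids, ablated=False)
    ablated = score(q_tokens, q_entity_ids, c_tokens, c_entity_ids, ablated=True)
    return active - ablated


# Made-up query and candidate; entity id 7 stands for "authentication".
q_tokens = {"what", "discussed", "authentication"}
c_tokens = {"authentication", "token", "refresh"}
c_entity_ids = {7, 9}

# Before the fix: nothing resolves, so both paths use token-Jaccard.
delta_before = ablation_delta(q_tokens, set(), c_tokens, c_entity_ids)

# After the fix: stage 2 resolves "authentication", so the paths diverge.
delta_after = ablation_delta(q_tokens, {7}, c_tokens, c_entity_ids)
```

With an empty resolved set, both paths compute the same token-Jaccard value, so `delta_before` is exactly 0.0 regardless of the candidate. Once stage 2 resolves an entity, the active path scores on entity sets and the delta can be non-zero.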
diff --git a/mcp_server/core/recall_pipeline.py b/mcp_server/core/recall_pipeline.py
@@ -311,22 +311,60 @@ def _candidate_tags(c: dict[str, Any]) -> set[str]:
 
 
 def _resolve_query_entity_ids(query: str, store: Any) -> set[int]:
-    """Resolve query entities (CamelCase / paths / backticks) to entity_ids.
-
-    Constant-cost per recall (typical query has 0-5 such entities), so
-    this stays a small loop of ``get_entity_by_name`` calls, not a
-    per-candidate scan. Returns the empty set if nothing resolves —
-    callers must then fall back to the token-Jaccard proxy.
+    """Resolve query entities to entity_ids.
+
+    Two-stage resolution so dendritic Jaccard fires on natural-language
+    queries (not just CamelCase / path / backtick patterns):
+
+    1. ``extract_query_entities()`` — high-precision extraction of
+       CamelCase, slash paths, backtick code spans (Poirazi et al.
+       compatible: these are typically the entity-anchors a user writes).
+    2. **Token fallback**: for every ``extract_keywords()`` token of
+       length ≥ 4 not already tried in stage 1, attempt
+       ``get_entity_by_name(token)``. This catches natural-language
+       anchors like "authentication" or "deployment", which the regex
+       extractor misses but which often name a real production entity.
+
+    Constant cost per recall (typical query: ≤30 candidate tokens
+    after stopword filter; each is one indexed lookup against the
+    ``entities`` table's name index — sub-millisecond). Returns the
+    empty set if nothing resolves; callers fall back to the
+    token-Jaccard proxy.
     """
     from mcp_server.core.query_decomposition import extract_query_entities
+    from mcp_server.shared.text import extract_keywords
 
     if not hasattr(store, "get_entity_by_name"):
         return set()
     ids: set[int] = set()
+    seen_names: set[str] = set()
+
+    # Stage 1: high-precision extractor (CamelCase / paths / backticks).
     for name in extract_query_entities(query):
+        if not name or name in seen_names:
+            continue
+        seen_names.add(name)
         row = store.get_entity_by_name(name)
         if row and row.get("id") is not None:
             ids.add(int(row["id"]))
+
+    # Stage 2: natural-language token fallback. Use the existing
+    # stopword-aware keyword extractor (`shared/text.py::extract_keywords`)
+    # which already filters function words; only try tokens of length ≥ 4
+    # to avoid swamping the entity index with junk lookups.
+    try:
+        for token in extract_keywords(query):
+            if len(token) < 4 or token in seen_names:
+                continue
+            seen_names.add(token)
+            row = store.get_entity_by_name(token)
+            if row and row.get("id") is not None:
+                ids.add(int(row["id"]))
+    except Exception:  # noqa: BLE001
+        # Fallback path is non-load-bearing; if the keyword extractor
+        # ever fails on a malformed query we still have the stage-1 ids.
+        pass
+
     return ids