Commit 942b746

rolandpg and claude authored
docs: graph retriever silence — root cause + remediation paths (closes #42) (#105)
Static code trace identifies why Vigil's 2026-04-24 live test showed kg=0ms on 148 of 150 recalls. The graph retriever path is correct; the gap is upstream:

    memory_manager.recall() calls extractor.extract_all(query, use_llm=False)   ← regex only

The query-time entity extractor is regex-only (intrusion_set, cve, actor short-list, MITRE TID, IOCs). Natural-language queries like "recent Citrix exploitation" or "who used what tool last week?" match none of those patterns, so query_entities = {}. graph_retriever.retrieve_note_ids() then short-circuits at line 36 (`if not query_entities: return []`) and the BFS never seeds, so kg returns in microseconds.

The 2 of 150 events that DID fire graph almost certainly contained regex-matchable tokens (test queries like "What malware did APT28 use?", which matches the apt28 intrusion_set pattern).

Three fix options ranked in the doc:

A. LLM-NER on the query: preferred semantic fix; small change, gated by intent so cost is bounded; depends on RFC-009 Phase 1 (LLM hardening) landing first.
B. Seed BFS from top-K vector neighbors' entities: no LLM hop, no new failure modes; produces graph signal even on sparse queries.
C. Telemetry tweak: log a 'no graph signal: query entities empty' marker so the silence is diagnosable instead of looking identical to a 0-result graph walk.

Recommended: combine B (fix) + C (diagnostic). A is a v2.5.x experiment after Phase 1 LLM hardening.

The doc explicitly notes graph-silence is ORTHOGONAL to the 21-22% zero-hit recall rate (task #40); those are different failure modes and should not be conflated.

Closes the investigation phase of task #42; the implementation choice goes into a follow-up RFC update or v2.5.x scope.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 3ebb5ba commit 942b746

1 file changed

Lines changed: 117 additions & 0 deletions

@@ -0,0 +1,117 @@
---
title: "Graph retriever silence: why `kg=0ms` on 98.7% of recalls"
status: ROOT CAUSE IDENTIFIED via static code trace; field verification pending next traffic run
author: "claude-code"
date: "2026-04-25"
scope: "Investigate task #42: Vigil's 2026-04-24 live test showed `kg=0ms` on 148 of 150 recalls across factual / exploratory / relational / temporal intents"
data_sources:
  - "Vigil DEBUG telemetry 2026-04-24 23:44Z+: 150 recalls, 2 with kg>0"
  - "src/zettelforge/memory_manager.py:580-624 (recall path)"
  - "src/zettelforge/graph_retriever.py:31-93 (retrieve_note_ids + _bfs_collect)"
  - "src/zettelforge/entity_indexer.py:69-130 (EntityExtractor regex set)"
---

# Graph retriever silence: root cause

## TL;DR

The graph retriever fires **only when the query string itself contains a regex-matchable CTI token** (CVE, APTNN, MITRE TID, IP/domain/hash). For natural-language queries that don't (e.g., *"recent Citrix exploitation"*, *"what tools did the actor use?"*, *"timeline of last week's incidents"*), the query-time extractor returns an empty entity dict, the BFS in `_bfs_collect` is never seeded, and `retrieve_note_ids` returns `[]` in microseconds, which is exactly the `kg=0ms` signal Vigil's telemetry recorded.

This is **not a bug in the graph retrieval path**. The graph code works correctly when given entities. The gap is upstream: the query-time entity extraction is regex-only, while *recall queries* are mostly written in natural language. The 2 of 150 calls that fired graph most likely contained literal regex-matchable strings, for example the recall test that asked *"What malware did APT28 use?"* (matches the `apt28` regex pattern in `EntityExtractor.REGEX_PATTERNS["intrusion_set"]`).

## Code trace

### Step 1: extract entities from the query (regex only)

`memory_manager.py` (the recall path) does:

```python
query_entities = self.indexer.extractor.extract_all(query, use_llm=False)
resolved = {etype: [resolver.resolve(etype, e) for e in elist]
            for etype, elist in query_entities.items()}
```

`extract_all(query, use_llm=False)` is the **regex-only** path of `EntityExtractor`. The full regex set (entity_indexer.py:32-69) covers:

- `cve`: `CVE-YYYY-NNNNN`
- `intrusion_set`: `apt`/`unc`/`ta`/`fin`/`temp` + digits
- `actor`: small handful of named groups (`lazarus`, `sandworm`, `volt typhoon`)
- `tool`: small handful (`cobalt strike`, `mimikatz`, `bloodhound`, etc.)
- `campaign`: `operation \w+`
- `attack_pattern`: `T\d{4}(?:\.\d{3})?`
- IOCs: `ipv4`, `domain`, `url`, `md5`, `sha1`, `sha256`, `email`

A natural-language query like *"recent Citrix exploitation"* matches **none** of these. `extract_all` returns `{}`, so both `query_entities` and `resolved` are `{}`.

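To make the miss concrete, here is a minimal standalone sketch of regex-only extraction. The patterns are simplified stand-ins for a few of the categories listed above, not the actual contents of `EntityExtractor.REGEX_PATTERNS`, and `extract_regex_only` is a hypothetical helper written only for this illustration.

```python
import re
from typing import Dict, List

# Simplified stand-ins for a few of the categories listed above.
# These are illustrative assumptions, NOT the real EntityExtractor patterns.
ILLUSTRATIVE_PATTERNS = {
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    "intrusion_set": re.compile(r"\b(?:apt|unc|ta|fin|temp)\d+\b", re.IGNORECASE),
    "attack_pattern": re.compile(r"\bT\d{4}(?:\.\d{3})?\b"),
    "tool": re.compile(r"\b(?:cobalt strike|mimikatz|bloodhound)\b", re.IGNORECASE),
}


def extract_regex_only(query: str) -> Dict[str, List[str]]:
    """Return {entity_type: [lowercased matches]}, omitting types with no match."""
    found: Dict[str, List[str]] = {}
    for etype, pattern in ILLUSTRATIVE_PATTERNS.items():
        matches = [m.group(0).lower() for m in pattern.finditer(query)]
        if matches:
            found[etype] = matches
    return found


print(extract_regex_only("What malware did APT28 use?"))  # → {'intrusion_set': ['apt28']}
print(extract_regex_only("recent Citrix exploitation"))   # → {}
```

The second call is the case that matters here: the query is meaningful to a human analyst, but nothing in it is shaped like a CTI token, so the extractor has nothing to hand to the graph retriever.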
### Step 2: graph retriever short-circuits on empty input

`graph_retriever.py:31-37`:

```python
def retrieve_note_ids(self, query_entities, max_depth=2) -> List[ScoredResult]:
    if not query_entities:
        return []
    ...
```

When `query_entities` is `{}`, `not query_entities` is `True` and we return `[]` immediately. No `kg.get_node` calls, no BFS, no walk-through. `_graph_latency_ms` measured around the call therefore lands at ~0ms because the function returns in microseconds.

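The latency effect can be reproduced in isolation. The sketch below assumes the latency is measured by wrapping the call in a timer; `retrieve_note_ids_stub` and `timed_ms` are hypothetical names, and the stub mirrors only the guard clause shown above, with a short sleep standing in for real BFS work.

```python
import time
from typing import Dict, List


def retrieve_note_ids_stub(query_entities: Dict[str, List[str]], max_depth: int = 2) -> list:
    """Hypothetical stand-in that mirrors only the guard clause shown above."""
    if not query_entities:
        return []
    # The real method would seed a BFS here; this stub simulates work instead.
    time.sleep(0.005)
    return [f"note-for-{e}" for ents in query_entities.values() for e in ents]


def timed_ms(fn, *args):
    """Run fn(*args) and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


empty_result, empty_ms = timed_ms(retrieve_note_ids_stub, {})
seeded_result, seeded_ms = timed_ms(retrieve_note_ids_stub, {"intrusion_set": ["apt28"]})
print(f"empty:  {len(empty_result)} results in {empty_ms:.3f} ms")
print(f"seeded: {len(seeded_result)} results in {seeded_ms:.3f} ms")
```

With an empty dict the elapsed time is dominated by the function-call overhead, which rounds to `0ms` at millisecond resolution.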
### Step 3: BlendedRetriever has nothing to merge

The blender at `memory_manager.py:626-633` falls back to vector results only. `recall()` returns vector-shaped output with no graph contribution. From the telemetry's perspective, `kg=0ms` is *correct* (the graph genuinely did no work), but it is not the behavior we want: the goal is to **make the graph fire even on queries that the regex misses**.

## Why the 2 successful events fired

In the same 150-recall window, exactly 2 events showed `kg > 0`. The most likely explanation: those queries contained regex-matchable tokens. Test-suite traffic includes literal queries like *"What malware did APT28 use?"*: `apt28` matches the `intrusion_set` regex, the BFS seeds, and the graph walks the actual entity → note edges that `_index_in_lance` planted.

## What this means for `intent != factual`

Telemetry showed `kg=0ms` even on `relational` and `temporal` intents. We had assumed the intent classifier might gate the graph path; it doesn't. `recall()` calls the graph retriever unconditionally (`memory_manager.py:621-623`); the only gate is whether the regex extractor produces seeds. So an `intent=relational` query like *"how is APT28 connected to PlugX?"* fires graph (matches `apt28` and `plugx` regexes), but *"who's connected to last week's incident?"* doesn't (no regex tokens). The intent label is unrelated to the firing decision.

## Three possible fixes (ranked)

### A. **Run LLM-NER on the query** (preferred semantic fix, smallest surface change)

`extract_all` already supports `use_llm=True`; the recall path explicitly passes `use_llm=False` for performance. Vigil's instrumentation now reports `vector_latency_ms` p95 ~500ms; an LLM-NER call adds ~200-500ms on a warm Ollama. That's tolerable for the `relational`/`temporal`/`exploratory` intents, which benefit most from graph signal.

**Implementation sketch**:
```python
# memory_manager.py:580
use_llm_for_query_entities = intent in ("relational", "temporal", "exploratory")
query_entities = self.indexer.extractor.extract_all(
    query, use_llm=use_llm_for_query_entities
)
```

Cost: only the queries that benefit pay the LLM hop. Factual queries with explicit CTI tokens still take the regex fast path. Risk: doubles the LLM-call surface area, so RFC-009 Phase 1 (LLM hardening) should land first.

### B. **Look up entities from the corpus rather than the query**

If the query itself is sparse, take the top-K vector neighbors first, then seed the BFS from *their* entities. This has the same shape as the existing `causal` boost at `memory_manager.py:640-668`. It keeps the graph signal honest (anchored to retrieved notes) without an LLM hop.

**Implementation sketch**: after the vector retrieval, take the top 3 notes' `semantic.entities` field and feed them as `query_entities` to `retrieve_note_ids`. Cheap, no LLM dependency, and produces graph signal even for vague queries.

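One way the fallback could look is sketched below. It assumes each vector hit is a dict shaped like `{"semantic": {"entities": {type: [names]}}}`; the real note object may expose `semantic.entities` differently, and `seed_entities_from_neighbors` is a hypothetical helper rather than existing code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def seed_entities_from_neighbors(
    query_entities: Dict[str, List[str]],
    vector_hits: Iterable[dict],
    top_k: int = 3,
) -> Dict[str, List[str]]:
    """Fall back to entities from the top-k vector hits when the query has none.

    Assumes each hit looks like {"semantic": {"entities": {type: [names]}}}.
    """
    if query_entities:
        return query_entities  # regex already produced seeds; keep them as-is

    merged: Dict[str, List[str]] = defaultdict(list)
    for hit in list(vector_hits)[:top_k]:
        entities = hit.get("semantic", {}).get("entities", {})
        for etype, names in entities.items():
            for name in names:
                if name not in merged[etype]:  # de-duplicate, keep first-seen order
                    merged[etype].append(name)
    return dict(merged)


hits = [
    {"id": "n1", "semantic": {"entities": {"tool": ["mimikatz"], "intrusion_set": ["apt28"]}}},
    {"id": "n2", "semantic": {"entities": {"tool": ["mimikatz", "bloodhound"]}}},
    {"id": "n3", "semantic": {"entities": {}}},
    {"id": "n4", "semantic": {"entities": {"cve": ["cve-2023-4966"]}}},  # beyond top_k
]
seeds = seed_entities_from_neighbors({}, hits)
print(seeds)  # → {'tool': ['mimikatz', 'bloodhound'], 'intrusion_set': ['apt28']}
```

Returning the query's own entities untouched when they exist keeps the regex fast path unchanged, so the fallback only alters behavior for the queries that are currently silent.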
### C. **Tighten the blender contract: accept `query_entities=None`**

A no-op fallback that explicitly logs *"no graph signal: query entities empty"* so future operators see the gap in the telemetry stream rather than a silent 0. Doesn't fix the silence but makes it diagnosable. Free; can ship today as a one-line change to the OCSF event.

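A sketch of what a distinguishable marker could look like follows. The field names (`graph_status`, `graph_result_count`, `graph_latency_ms`), the status values, and the logger name are all assumptions made for illustration; they are not taken from the OCSF schema or from the existing telemetry.

```python
import json
import logging

logger = logging.getLogger("recall.telemetry")  # hypothetical logger name


def graph_telemetry_fields(query_entities, graph_results, graph_latency_ms):
    """Build graph-related telemetry fields with an explicit silence reason."""
    if not query_entities:
        status = "skipped_no_query_entities"  # graph never ran: nothing to seed from
    elif not graph_results:
        status = "ran_zero_results"  # BFS ran but found no notes
    else:
        status = "ran_with_results"
    return {
        "graph_status": status,
        "graph_result_count": len(graph_results),
        "graph_latency_ms": round(graph_latency_ms, 3),
    }


fields = graph_telemetry_fields({}, [], 0.002)
logger.debug("recall graph telemetry: %s", json.dumps(fields))
print(fields["graph_status"])  # → skipped_no_query_entities
```

The point of the three-way status is that "never ran" and "ran and found nothing" stop collapsing into the same `kg=0ms` reading.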
## Recommended path

Combine **B** and **C**. B fixes the symptom for most natural-language queries without any LLM dependency or new failure modes; C makes the remaining genuine "no graph signal" cases visible in telemetry instead of leaving them indistinguishable from a 0-result graph walk.

A is a v2.5.x experiment after RFC-009 Phase 1's LLM hardening lands; it is premature today because it doubles LLM-call surface against an unhardened provider.

## What this DOES NOT explain

- The 21-22% **zero-hit recall rate** that survived the Qwen → Nemotron LLM swap. Graph-silence is a *no-blend-signal* issue (graph contributes nothing to the score); zero-hit recall is a *vector-finds-nothing* issue. They're orthogonal failure modes. Task #40 still needs its own investigation, likely a vector-similarity-threshold or corpus-thinness question, not graph.

## Next step (verification before code change)

Pull a small sample (~20) of recalls from the post-cleanup test where `kg > 0` and confirm every one of them contains a regex-matchable token in the query string. If any of them don't, the regex-only-seeding explanation above needs revising. The verification needs the per-recall `query` field, which DEBUG-level telemetry now captures, so it is a straightforward jq pull.

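If jq is not at hand, the same check could be scripted roughly as follows. This sketch assumes the telemetry is available as JSON lines with `query` and `kg_ms` fields; both field names are guesses, and `CTI_TOKEN` is a coarse stand-in for the real regex set rather than a copy of it.

```python
import json
import re

# Coarse stand-in for "any regex-matchable CTI token"; an assumption, not the real set.
CTI_TOKEN = re.compile(
    r"\bCVE-\d{4}-\d{4,}\b|\b(?:apt|unc|ta|fin|temp)\d+\b|\bT\d{4}(?:\.\d{3})?\b",
    re.IGNORECASE,
)


def unexplained_graph_hits(jsonl_lines):
    """Return queries where the graph fired but no CTI token is visible."""
    suspects = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # "kg_ms" and "query" are assumed field names for this sketch.
        query = event.get("query", "")
        if event.get("kg_ms", 0) > 0 and not CTI_TOKEN.search(query):
            suspects.append(query)
    return suspects


sample = [
    '{"query": "What malware did APT28 use?", "kg_ms": 12.4}',
    '{"query": "recent Citrix exploitation", "kg_ms": 0}',
    '{"query": "who is connected to the incident?", "kg_ms": 3.1}',
]
print(unexplained_graph_hits(sample))  # → ['who is connected to the incident?']
```

An empty result on the real sample would support the explanation; any entry in the list is a recall where the graph fired without an obvious seed and deserves a closer look.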
## Cross-references

- Task #42 in this session's task list (now closeable as "investigation complete; implementation tracked separately")
- Task #40 (zero-hit recall): orthogonal; do not bundle
- RFC-009 Phase 1 (LLM hardening): soft prerequisite for fix A
- `docs/superpowers/research/2026-04-25-v2.4.2-live-test-observations.md` (when it lands) for raw telemetry lineage
