Architectural Decision Record for Epic #10186. Evaluates shared caching substrates to eliminate
GraphServicesingleton-cache divergence across concurrent MCP server processes (Claude Desktop / Claude Code / Antigravity).
| Attribute | Value |
|---|---|
| Status | Proposed — 2026-04-22 |
| Author | Claude Opus 4.7 (Claude Code) |
| Resolves | #10189 |
| Unblocks | #10190 |
| Parent Epic | #10186 |
| Informs | #10184 (symptom this substrate class causes), #10185 (within-process band-aid already merged) |
Neo.ai.services.memory-core.GraphService is declared singleton: true
and holds the in-memory projection of the graph (db.nodes Map, db.edges Store,
vicinityLoadedNodes Set). Every MCP server process has its own singleton;
processes share only the SQLite-WAL backing file at
.neo-ai-data/sqlite/memory-core-graph.sqlite.
The empirical diagnostic from PR #10197 (diagnoseMcpConcurrency.mjs) confirms up
to 4 concurrent Node processes touching the same SQLite file under typical
multi-harness operation (2 × Claude Desktop + 2 × Antigravity with Twin Language
Server doubling). Under #10186's worst-case estimate, concurrent-process count
reaches ~20 when all harnesses × worktrees are active.
SQLite WAL handles storage concurrency correctly. The observed failure (#10184:
bindAgentIdentity null at boot despite a populated SQLite row) is a cache-layer
divergence — Process B's in-memory cache doesn't see what Process A wrote to
storage. PR #10185's retry-with-vicinity-delete closed the boot-time symptom but
does not address the underlying class of bug.
This ADR evaluates substrate options for the shared-cache-coherence layer that would close the bug class systemically.
The codebase already implements a cross-process cache-coherence primitive. This ADR's solution space is reshaped by this discovery; rejecting the "invent new substrate" framing is the single most important architectural decision.
ai/graph/storage/SQLite.mjs lines 95–110 declare:
CREATE TABLE IF NOT EXISTS GraphLog (
log_id INTEGER PRIMARY KEY AUTOINCREMENT,
entity_id TEXT,
entity_type TEXT,
-- ...
);
-- Triggers on Nodes and Edges tables:
CREATE TRIGGER node_insert AFTER INSERT ON Nodes
BEGIN INSERT INTO GraphLog(entity_id, entity_type) VALUES (NEW.id, 'nodes'); END;
-- (equivalent update / delete triggers for nodes and edges)Every mutation to Nodes or Edges — regardless of which process authored it —
appends a durable record to GraphLog. Because GraphLog lives in the shared
SQLite file, all processes observe the same sequence.
ai/graph/Database.mjs line 77 implements delta-log replay:
syncCache() {
if (!this.storage || !this.lastSyncId) return; // ← bug (see §2.4)
let delta = this.storage.getDeltaLog(this.lastSyncId);
this.lastSyncId = delta.lastLogId;
if (delta.invalidNodes.length > 0) {
this.nodes.remove(delta.invalidNodes);
delta.invalidNodes.forEach(id => {
this.vicinityLoadedNodes.delete(id);
this.lastAccessMap.delete(id);
});
}
if (delta.invalidEdges.length > 0) {
this.edges.remove(delta.invalidEdges);
}
}syncCache() is called at the start of every getAdjacentNodes(nodeId) call
(Database.mjs:260). Its contract: "pull any mutations since our last sync,
invalidate the corresponding cache entries so the subsequent lazy-load picks up
the fresh state."
Database.mjs:43 declares lastSyncId: 0 as the initial state. storage.load()
(SQLite.mjs:335–343) advances it to the current MAX(log_id) at boot so the
process starts tracking only mutations after its load completes. Local writes
bump the watermark via acknowledgeLocalMutations() (Database.mjs:117) so a
process doesn't re-process its own writes as external changes.
Reading syncCache()'s guard: if (!this.storage || !this.lastSyncId) return;.
The !this.lastSyncId clause early-exits when lastSyncId === 0 — which happens
when the DB has never recorded a mutation at the moment storage.load() ran.
Concrete race (matches #10184 empirically):
- Claude Desktop boots Process B at T₀.
GraphLogis empty:MAX(log_id) = 0. storage.load()setslastSyncId = 0.- At T₁, another process (or a one-shot seed script like
seedAgentIdentities.mjs) writes@neo-opus-4-7toNodes. Trigger fires:GraphLog.log_id = 1. - At T₂, Process B calls
bindAgentIdentity('neo-opus-4-7'):getAdjacentNodes('@neo-opus-4-7')triggerssyncCache()!this.lastSyncIdis truthy (0 is falsy) → early return, no delta replayvicinityLoadedNodes.has('@neo-opus-4-7')is false →loadNodeVicinitySyncruns- Depending on SQLite WAL visibility at T₂, the query may or may not see the T₁ write
- If the query sees empty vicinity:
vicinityLoadedNodes.add(id)marks loaded (this is the #10181 mark-without-load bug) - If it sees the node: we're fine
- After T₂, Process B's cache is permanently stuck on whatever outcome T₂ produced,
because the next
syncCache()still short-circuits (lastSyncId still 0 until Process B itself writes something).
Two interlocking bugs produce the observed symptom:
- Bug A (syncCache early-return): prevents the delta-replay substrate from running on a fresh-boot process until it makes its first local write
- Bug B (mark-without-load,
Database.mjs:281): compounds Bug A — once a zero-result lazy-load happens, the vicinity cache marks the id "loaded" and subsequentgetAdjacentNodescalls skip re-reading SQLite until something explicitly invalidates the entry (which Bug A prevents)
The substrate exists and is well-designed. Two surgical defects in its
implementation produce the #10184 failure mode.
- Correctness: eliminate cross-process cache divergence deterministically
- Minimize churn: reuse existing substrate when available — avoid importing standard-backend-framework patterns (Redis pub-sub, ZooKeeper consensus, CRDT eventual-consistency) that are training-data attractors for this problem class but wrong-substrate for a local single-machine MCP swarm (per epic-review stage 5 extension opportunity)
- Testability: any chosen approach must be empirically verifiable — specifically,
the A→seed / B→read divergence test from
#10186's Verification section - Service-boundary discipline: the fix belongs on the cache layer (
Database.mjs), not scattered across call-sites likebindAgentIdentity(which PR #10185 had to patch as a band-aid for lack of a substrate fix) - Performance: changes must not regress the hot path (O(1) in-memory reads for cached nodes; O(log n) SQLite reads for lazy-load). Any substrate that adds network round-trips or process-boundary RPC on every cache read is regressive
Approach: Fix Bugs A and B identified in §2.4. Do not introduce new substrate.
Two surgical changes in ai/graph/Database.mjs:
-
Drop the
!this.lastSyncIdearly-return clause insyncCache()(currently line 78). Rationale:getDeltaLog(0)is a legitimate query meaning "give me all mutations" — exactly what a process with an unadvanced watermark needs. The existing guard is overly conservative and creates the Bug A race. Change becomes:syncCache() { if (!this.storage) return; // lastSyncId === 0 is legitimate — fresh boot wanting full catch-up let delta = this.storage.getDeltaLog(this.lastSyncId); this.lastSyncId = delta.lastLogId; // ... }
-
Don't mark
vicinityLoadedNodeson empty-result lazy load (Database.mjs:281, the #10181 root). Rationale: a zero-result query means "the data isn't there yet," not "I've loaded nothing and will never check again." Marking loaded on empty conflates the two. Change becomes:if (vicinity.nodes.length > 0) { me.nodes.add(vicinity.nodes); } if (vicinity.edges.length > 0) { me.edges.add(vicinity.edges); } // Only mark vicinity-loaded when we actually loaded something if (vicinity.nodes.length > 0 || vicinity.edges.length > 0) { me.vicinityLoadedNodes.add(nodeId); }
Pros:
- Zero new substrate; reuses what exists
GraphLogtriggers are already firing in production; changes are to the consumer- Surgical: ~10-line diff across two methods
- Changes are hypothesis-testable via unit tests (mock storage with pre-seeded
deltas; assert syncCache replays them even when
lastSyncId === 0) - No performance regression — drops a truthy check, adds a length check
- Honors service-boundary discipline (cache-layer fix, not call-site patching)
Cons:
- Removing the
!lastSyncIdcheck meansgetDeltaLog(0)may scan the fullGraphLogtable on first call for a long-lived DB. Mitigation:GraphLogis append-only and grows linearly with mutations; a periodic compaction task (separate ticket) can cap growth. For current scale (tens of thousands of mutations), the full-scan is sub-millisecond. - Doesn't address the
vicinityLoadedNodessemantic limitation: the Set doesn't track when a node was marked loaded, so an entry from T-distant-past stays authoritative. Delta-replay invalidates entries on mutations, so this is correct in steady state — but a node that has never been mutated and never existed will be re-queried every call (no "confirmed-absent" caching). For identity binding this is the correct behavior (boot-time seeding may happen any moment); for other call sites it's a trade-off worth documenting.
Approach: Remove db.nodes/db.edges as cache Maps; every getNode()
queries SQLite directly.
Pros:
- Maximum simplicity: zero divergence possible, no invalidation semantics to reason about
- Leverages SQLite WAL's built-in concurrency correctness
Cons:
- Every tool dispatch reads SQLite: O(N) queries per session vs current O(1) cached reads
- Requires empirical measurement to validate latency budget (not available in this ADR's timebox)
- Refactor touches every consumer of
db.nodes/db.edges, not just the race sites - Removes a legitimate optimization without empirical evidence it's unnecessary
Approach: Store cache-invalidation sequence in SQLite; each process subscribes to a sequence table and invalidates local entries on advance.
Verdict: Effectively Option A. This is precisely what GraphLog + syncCache
already implement. The only delta from Option A is whether we call the existing
primitive "adequate" or "needs extending." Given Option A's two-bug hypothesis
produces the observed symptom, Option C collapses into Option A at the
implementation level.
Approach: New long-running daemon process owns the cache; MCP servers talk to it via Unix socket / named pipe / SharedArrayBuffer.
Rejection rationale:
- Introduces a new process lifecycle (startup, crashes, upgrades) — adds operational complexity without clear gain
- Creates a single-point-of-failure the current design doesn't have (daemon crash → all MCP servers lose their cache, cold restart required)
- IPC round-trips regress the hot path from O(1) memory access to syscall latency
- Wrong-substrate: the problem is "stale cache," not "no source of truth" — SQLite is already the SoT, just needs coherent consumers
- Training-data attractor: standard-backend-framework solution (e.g., Redis sidecar) transplanted into a domain that doesn't need it
Approach: Use an existing Chroma collection to store cache state accessible cross-process.
Rejection rationale:
- Chroma is a vector DB — using it as plain KV is a category error (epic-review stage 5 flagged this preemptively)
- Every cache read requires an embedding lookup (even if the "key" is just an identity string, Chroma's query path is vector-similarity, not exact-match)
- Network hop to Chroma adds latency no faster than SQLite direct read
- Adds a cross-service dependency (Chroma must be healthy for identity cache to work) where none currently exists at that layer
Choose Option A: Harden the existing GraphLog + syncCache() primitive.
The underlying substrate is already correct. Two surgical defects in its
implementation produce the #10184 symptom. Fix the defects in place rather
than layering new substrate on top.
The ADR recommends #10190 be re-scoped from "Implement chosen shared
caching substrate" to "Harden existing syncCache() + vicinityLoadedNodes
correctness". The title's "implement substrate" framing is obsolete under this
decision — no new substrate will be built. The scope becomes:
- Drop the
!this.lastSyncIdearly-return guard inDatabase.mjs:78 - Only mark
vicinityLoadedNodes.add(nodeId)when the vicinity query returned non-empty results inDatabase.mjs:281 - Add Playwright regression tests:
- Fresh-boot race (Bug A): mock a DB with pre-existing
GraphLogentries andlastSyncId = 0; assert subsequentsyncCache()call replays them - Mark-without-load (Bug B): upsert a node, clear
db.nodes, keepvicinityLoadedNodesmarked; assert that after the fix, subsequentgetNodeon an empty-vicinity returns null without sticking the cache - End-to-end divergence: two Database instances over the same SQLite
file; Process A writes, Process B reads via
getNode, assert Process B's next read sees Process A's write
- Fresh-boot race (Bug A): mock a DB with pre-existing
- Remove PR #10185's
bindAgentIdentityretry-loop band-aid once the substrate-layer fix is empirically verified — the retry is hypothesis-1 specific and becomes redundant - Preserve PR #10182's
callToolself-heal block as defense-in-depth until the end-to-end divergence test passes cleanly across at least one live restart cycle with all three harnesses active.- Update (2026-04-23): This condition was empirically verified and the
#10182block was surgically removed via PR #10227. The architectural rationale shifted from "defense-in-depth" to "surface, don't obscure" — defensive code here masked bugs, whereas explicit failure aligns with Neo.mjs's loud-failure discipline. The#10228test-pollution fix further mitigates the remaining failure mode that would have made self-heal useful.
- Update (2026-04-23): This condition was empirically verified and the
After #10190 lands:
- Run
diagnoseMcpConcurrency.mjspre-restart to confirm multi-process state - Full restart of Claude Desktop + Claude Code + Antigravity
list_messagesvia memory-core returns populated mailbox (closes#10184ACs 5/6)- Manual A→seed / B→read verification via two MCP client sessions
- If any step fails: the substrate fix is incomplete and the band-aids in PR #10185 + #10182 remain operative
- Closes
#10184's underlying bug class, not just the symptom - Unblocks live A2A handshakes (
#10139Mailbox epic dependency chain) - Eliminates the need for the PR #10185 retry band-aid once verified
- Preserves all existing consumer contracts — no API surface changes
- Honors "reuse existing substrate" as a first-class architectural principle for this codebase
- Provides a regression-test suite that doubles as documentation of the coherence contract
getDeltaLog(0)full-scan on first call: manageable at current scale (sub-millisecond on tens of thousands ofGraphLogrows). Mitigation: periodic compaction (separate future ticket) or boundgetDeltaLogwith a recency window once the table grows large.- Hidden assumption on SQLite WAL visibility timing:
loadNodeVicinitySyncstill depends on WAL checkpointing making writes visible to readers. If WAL visibility is itself racy, Option A's fixes may be necessary-but-insufficient. Mitigation: the divergence test in#10190will surface this empirically. vicinityLoadedNodesdoesn't track confirmed-absent: after Bug B fix, a never-existed node will be re-queried on everygetNodecall. For identity binding this is correct (boot-time seeding is timing-sensitive); for high-frequency callers it's a micro-regression. Accept as trade-off; measure if it surfaces.
| Option | Rejection reason |
|---|---|
| B — Drop cache entirely | Unmeasured latency risk; over-refactor for a two-bug fix |
| C — SQLite-backed shared cache | Collapses into Option A (already the existing primitive) |
| D — IPC daemon | New SPOF; wrong-substrate training-data attractor |
| E — Chroma KV | Category error (vector DB as plain KV); epic-review §5 flagged |
ai/graph/Database.mjs—syncCache(),lastSyncId,vicinityLoadedNodesai/graph/storage/SQLite.mjs—GraphLogschema + triggers,getDeltaLog(),load()ai/services/memory-core/GraphService.mjs—initAsync(),getNode()
- Parent Epic: #10186 (MCP concurrency audit + single-writer enforcement)
- Sibling subs: #10187 (diagnoseMcpConcurrency — merged), #10188 (boot-time sibling-PID logging — in progress), #10190 (implementation — blocked on this ADR)
- Symptom this ADR addresses: #10184 (identity-binding null at boot)
- Band-aid PRs preserved until verified: #10185 (retry loop), #10182 (self-heal)
- Test harness dependency: #10183 (MCP server test harness — useful but not blocking)
- Epic #10186 five-stage review by Claude Opus 4.7: #10186 (comment anchor)
- Flagged training-data attractors for this problem class (Redis pub-sub, ZooKeeper, CRDTs) preempt the Option D/E rejections above
query_raw_memories(query="cross-harness MCP singleton cache divergence GraphLog syncCache")query_raw_memories(query="vicinityLoadedNodes mark-without-load lastSyncId early-return")query_summaries(query="MCP concurrency ADR shared caching substrate")- Commit-range anchor: PR #10197 (diagnoseMcpConcurrency merged) → this ADR
Known contributing sessions (partial, restart-fragmented):
ae546a40-2133-482f-85a6-779fdf6757b2(authoring session; cross-harness diagnostic + #10187 implementation + #10189 ADR)fe1cd4d0-e667-45cf-bd02-7fb79a13a64f(PR #10185 review cycles — origin of the band-aid this ADR contextualizes)