Updated to reflect debug CLI in Phase 1, context budget separation, concurrent query model, path normalization, InvalidationPlanner phasing, golden set sizing, and the following fixes from executive review: file move ordering (P0-1), BLOB(16) type consistency (P0-2), disambiguator in UNIQUE index with declaration-header hash (P0-3), uniform 500-node transaction bound (P0-4), multi-span content_hash definition (P0-5), edge invalidation reframed on node_id (P1-6), PRAGMA split for writer-only settings (P1-7), non-null project_id with synthetic root (P1-9), symbol-exact lookup chain clarification (P1-10), and grounded cancellation semantics (P1-12). Consistency review (v5.2): symbol_disambiguator sole scheme with no file-path fallback, multi-span chunk_hash ordering by span_hash bytes (not span_id), node_identity_map DDL with BLOB(16) and disambiguator fields, multi-span-aware file deletion logic, skip-if-unchanged clarified for move handling, symbol channel overload handling, license correction (Apache 2.0), TS adapter alignment, and project detection collision logging. Further fixes (v5.3): C# tree-sitter fallback includes signature for overload safety, TS identity scopes non-exported symbols by file_id, file deletion test split for single-span vs multi-span, transaction bound exception clarified for edge replace, symbol_disambiguator NOT NULL DEFAULT '', primary-span partial unique index, §6.2 phase labeling. 
Consistency check fixes (v5.3.1): TS export_scope classification (package/module/file) replacing underspecified "exported vs non-exported" split; Plan §5 intro and §6.2 corrected for eager-embedding reality; node_spans.file_id FK action specified (ON DELETE CASCADE); file deletion operation order made unambiguous; C# Roslyn key reconciliation mechanism specified (in-place mutation); file node journaling explicitly required; retrieval channel renamed "qualified-name" (was "symbol-exact"); atomic edge replace scaling mitigation note added; file node eager embedding input specified; project_id assignment sequencing clarified. Lazy-summary removal (v5.3.2): Removed all references to lazy summary generation, ensure_summary, summary_prompt_version_used, quality_flag, quality guards, and SummaryConfig — feature was dropped from codebase. Summary invalidation reframed as content-hash invariant. Phase 3 test assertions rewritten for EmbeddingProvider/vec_nodes reality.
- Stable logical identity: `symbol_key` MUST be stable across moves when the symbol identity is unchanged (C#: all symbols; TS: package-scoped exports only; see `export_scope` below).
  - C#: derived from Roslyn symbol identity (kind + fully-qualified name + signature). Phase 1 fallback (tree-sitter only): qualified_name + kind + parameter count + parameter types-as-written + generic arity (overload-safe).
  - TS: Each symbol is classified by `export_scope` (package | module | file):
    - `package`: reachable from a stable package entrypoint/barrel export surface. `symbol_key` = `package_id` + package-level export path + kind + normalized signature. Move-stable; file path MUST NOT appear in identity. `export default` is disambiguated by its unique package-level export path.
    - `module`: has the `export` keyword but is NOT reachable from a package entrypoint. `symbol_key` = `package_id` + `file_id` + local export name + kind + normalized signature. NOT move-stable (includes `file_id`).
    - `file` (non-exported): no export keyword. Same derivation as `module` scope (includes `file_id`). NOT move-stable.
    - Phase 1: tree-sitter cannot trace the export graph; all exported symbols are classified as `export_scope = module` (conservative, includes `file_id`). Move-stability for TS is NOT achieved in Phase 1. Phase 2a: TS Language Service resolves re-exports; symbols reachable from package entrypoints are upgraded to `export_scope = package` via in-place `symbol_key` mutation on the existing `node_id`.
  - Collision handling via `symbol_disambiguator` (a declaration-header hash: it hashes the declaration line shape, not the body). File path MUST NOT be used as a secondary disambiguator for package-scoped exports; this is the sole disambiguation scheme to preserve move-stability.
- Identity reconciliation (Phase 2a): When semantic enrichment produces a different `symbol_key` for the same symbol (C# Roslyn-derived keys replacing tree-sitter fallback keys, or a TS `export_scope` upgrade from module to package), the system MUST mutate `symbol_key` in-place on the existing `node_id` to preserve cached data. Match by file location + symbol kind + name. If a UNIQUE conflict occurs, fall back to delete+create with a `node_identity_map` entry.
- (Phase 2b) Move/rename preservation: When a move/rename is detected with high confidence, the system prefers reusing `node_id`. In Phase 1, renames are treated as hard-delete (with journal entry) + create.
- Collisions: If `symbol_key` collides, the system MUST disambiguate without breaking move-stability, and MUST NOT "flip-flop" identities across runs. The full identity key is `(language, project_id, symbol_key, symbol_disambiguator)`, enforced by a UNIQUE index.
- File move preservation (Phase 1): Within a `ChangeBatch`, creates/modifications MUST be processed before deletes. The pipeline upserts by `(language, project_id, symbol_key, symbol_disambiguator)`, updating `node_spans` to the new `file_id`/`file_path`, then processes deletes. This preserves `node_id` and cached data for pure file moves without requiring rename detection.
- (Phase 2b) Edge identity preservation: When rename detection is added, edges MUST NOT be remapped by "guessing." Either reuse `node_id` or use an explicit `node_identity_map`.
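The creates-before-deletes rule for file moves can be illustrated with a minimal sketch. It assumes a simplified schema in which `file_path` sits directly on the node row (the spec tracks location through `node_spans`), a text `project_id`, and hypothetical helper names `upsert_node` and `apply_change_batch`; it also relies on SQLite's `ON CONFLICT ... DO UPDATE` upsert syntax (SQLite 3.24 or later). It is an illustration in Python, not the engine's implementation.

```python
import sqlite3
import uuid

# Hypothetical, simplified schema: file_path lives on the node row here,
# whereas the spec tracks file location through node_spans.
SCHEMA = """
CREATE TABLE nodes (
    node_id              BLOB(16) PRIMARY KEY,
    language             TEXT NOT NULL,
    project_id           TEXT NOT NULL,
    symbol_key           TEXT NOT NULL,
    symbol_disambiguator TEXT NOT NULL DEFAULT '',
    file_path            TEXT NOT NULL
);
CREATE UNIQUE INDEX idx_nodes_identity
    ON nodes(language, project_id, symbol_key, symbol_disambiguator);
"""

def upsert_node(conn, language, project_id, symbol_key, disambiguator, file_path):
    """Insert a node, or repoint the existing one; returns its node_id."""
    identity = (language, project_id, symbol_key, disambiguator)
    conn.execute(
        """
        INSERT INTO nodes (node_id, language, project_id, symbol_key,
                           symbol_disambiguator, file_path)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT (language, project_id, symbol_key, symbol_disambiguator)
        DO UPDATE SET file_path = excluded.file_path
        """,
        (uuid.uuid4().bytes, *identity, file_path),
    )
    row = conn.execute(
        """
        SELECT node_id FROM nodes
        WHERE language = ? AND project_id = ? AND symbol_key = ?
          AND symbol_disambiguator = ?
        """,
        identity,
    ).fetchone()
    return row[0]

def apply_change_batch(conn, creates, deleted_paths):
    """Process creates/modifications first, then deletes."""
    with conn:  # commits on success, rolls back on error
        ids = [upsert_node(conn, *create) for create in creates]
        for path in deleted_paths:
            conn.execute("DELETE FROM nodes WHERE file_path = ?", (path,))
    return ids

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

symbol = ("csharp", "repo-root", "M:Shop.Cart.Add(Item)", "")
[before] = apply_change_batch(conn, [(*symbol, "src/Old/Cart.cs")], [])
# Pure file move: the symbol appears at the new path and the old file is deleted.
[after] = apply_change_batch(conn, [(*symbol, "src/New/Cart.cs")], ["src/Old/Cart.cs"])

print(before == after)  # the node_id survives the move
```

Because the upsert has already repointed the row to the new path, the later delete for the old path matches nothing. Running the delete first would remove the row, and the create would then mint a fresh `node_id`.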
- Two kinds of invalidation exist and both are required:
  - Source-driven invalidation (file content / `chunk_hash` changes).
  - Semantic-context invalidation (project/package config changes). Phase 1 detects and logs; Phase 2a acts on them.
- InvalidationPlanner: Centralized component. Phase 1: source-driven invalidation. Phase 2a: adds semantic-context invalidation. Phase 2b: adds rename detection integration. Must be unit-testable against Section 6.6 decision matrix.
- Semantic-context triggers (Phase 2a): Changes to `*.sln`, `*.csproj`, `Directory.Build.props/targets`, `tsconfig*.json`, `package.json`, lockfiles MUST trigger semantic scope recompute.
- Atomic semantic edge replace (Phase 2a): Prior semantic edges MUST be deleted/replaced atomically (single write transaction) to prevent mixed-stale graphs.
- Content-hash invariant: For multi-span nodes, `nodes.chunk_hash` stores a composite content_hash: SHA-256 of the concatenation of all `node_spans.span_hash` values, ordered by `span_hash` bytes ascending (not by `span_id`, which is insertion-order and unstable across reparse). `chunk_hash` is purely content-derived.
- Embedding invalidation: Embeddings MUST be recomputed when the embedding model id or dimensionality changes, or when the text basis changes.
- Parse-status truthfulness: If semantic enrichment is unavailable/failed, `parse_status` MUST reflect it.
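The composite hash from the content-hash invariant can be sketched as follows. The function name `composite_chunk_hash` and the sample span contents are illustrative assumptions; only the rule itself (SHA-256 over span hashes sorted by their bytes) comes from the text above.

```python
import hashlib

def composite_chunk_hash(span_hashes: list[bytes]) -> bytes:
    """SHA-256 over all span hashes, concatenated in ascending byte order."""
    digest = hashlib.sha256()
    for span_hash in sorted(span_hashes):  # bytes compare lexicographically
        digest.update(span_hash)
    return digest.digest()

# Two spans of a hypothetical partial class, hashed individually.
span_a = hashlib.sha256(b"partial class Cart { void Add() {} }").digest()
span_b = hashlib.sha256(b"partial class Cart { void Remove() {} }").digest()

# The result does not depend on the order in which spans were inserted.
same = composite_chunk_hash([span_a, span_b]) == composite_chunk_hash([span_b, span_a])
print(same)
```

Sorting by the hash bytes rather than by `span_id` is what makes the result independent of reparse order: the same set of span contents always yields the same `chunk_hash`.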
- No implicit network calls (highest priority):
  - All MCP tools MUST be purely local. MUST NOT trigger network calls.
  - `get_source_spans`, `get_symbol`, `read_file`: always local SQLite or filesystem reads.
- Offline correctness: Graph traversal, retrieval, source access, and cached embeddings all work offline. The engine makes no network calls.
- File system sandboxing: All file system tools resolve paths relative to the repo root. Paths escaping the root are rejected. Symlinks outside the root are not followed.
- No graph mutation via MCP: External clients cannot directly write to the graph. Only `index_files` triggers graph writes through the ingest pipeline.
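One way to picture the file system sandboxing rule is the sketch below. The names `resolve_in_repo` and `PathEscapeError` are hypothetical; the approach (resolve the requested path, symlinks included, then check it is still under the resolved root) is an assumption about how such a check could be written, not a description of the engine. It needs Python 3.9 or later for `Path.is_relative_to`.

```python
import tempfile
from pathlib import Path

class PathEscapeError(ValueError):
    """Raised when a requested path resolves outside the repo root."""

def resolve_in_repo(repo_root: Path, requested: str) -> Path:
    root = repo_root.resolve()
    # resolve() collapses ".." segments and follows symlinks, so a link that
    # points outside the root ends up outside the root and is rejected below.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):
        raise PathEscapeError("path_escape")
    return candidate

with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    inside = resolve_in_repo(repo, "src/main.rs")  # accepted
    try:
        resolve_in_repo(repo, "../outside.txt")
        escaped = False
    except PathEscapeError:
        escaped = True  # rejected: resolves above the root

print(inside.name, escaped)
```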
- Single-writer discipline: All writes through dedicated writer thread; readers use read-only pooled connections.
- Foreign keys enforced: Every connection (reader and writer) MUST set `PRAGMA foreign_keys = ON`, `PRAGMA mmap_size`, and `PRAGMA busy_timeout`. WAL mode (`journal_mode`), `synchronous`, and `wal_autocheckpoint` are set by the writer thread (or DB creation path) only; read-only connections inherit WAL mode automatically.
- Multi-span correctness: Every multi-span node has at least one `node_spans` row with `is_primary = true`.
- Hard-delete with deletion journal: Node removal is a hard DELETE with `ON DELETE CASCADE` for edges. Before each delete, a snapshot is written to `deletion_log` for Phase 2b rename matching. This applies to all node types including file nodes: file nodes are journaled so Phase 2b file-level rename correlation can query the journal. All read paths are clean; there is no `is_deleted` filter anywhere.
- File deletion operation order (transactional): When a file is deleted, the following steps execute within a single write transaction, in this exact order: (1) collect affected `node_id`s from `node_spans`; (2) delete `node_spans` rows for that `file_id`; (3) for each affected node: if zero spans remain → journal + hard-delete; if spans remain → reassign primary + update convenience fields; (4) journal + delete the file node last. The file node MUST NOT be deleted first because `nodes.file_id ON DELETE SET NULL` would null-out convenience fields before the writer can process affected nodes.
- `node_spans.file_id` FK action: `ON DELETE CASCADE`. When a file node is deleted, associated `node_spans` rows are automatically removed. The explicit deletion algorithm (above) processes spans before the file node deletion, but the CASCADE acts as a safety net.
- Journal sweep: Deletion log entries older than the retention period (default 1 hour) are swept during idle time.
- Embedding storage is simple: One row per node in sqlite-vec. Overwritten in-place. WAL ensures reader consistency. No versioning.
- FTS/vector sync: For any node, FTS and vector representations MUST be consistent with the node's current state.
- Migrations are monotonic: Newer on-disk schema refuses older binaries; forward migrations are transactional.
- UUID storage: `node_id`, `file_id`, and edge endpoints are stored as `BLOB(16)`. All DDL examples (including `node_spans`) MUST use `BLOB(16)` for UUID columns and `BLOB(32)` for hash columns.
- Path normalization: All stored paths are repo-relative POSIX. `normalize_path` is called at every system boundary, never mid-pipeline.
- `project_id` is non-null: Nodes are assigned to a synthetic "repo-root" project per language until project detection completes. This ensures the UNIQUE index on `(language, project_id, symbol_key, symbol_disambiguator)` cannot be bypassed by NULL values (SQLite treats each NULL as distinct in UNIQUE constraints).
- `project_id` assignment sequencing: Project detection (scanning for `.csproj`, `package.json` workspaces) MUST complete before symbol node indexing begins, so that nodes receive their correct `project_id` from the start. Late reassignment of `project_id` after initial indexing requires re-indexing affected symbols (since `project_id` participates in the UNIQUE constraint) and may cause transient constraint violations.
- `symbol_disambiguator` is non-null: Defined as `TEXT NOT NULL DEFAULT ''` (empty string when no disambiguation is needed). This prevents SQLite's NULL-distinct-in-UNIQUE behavior from silently allowing duplicate identity keys.
- Primary-span uniqueness enforced: Exactly one `is_primary = true` per node in `node_spans`, enforced by a partial unique index: `CREATE UNIQUE INDEX idx_node_spans_one_primary ON node_spans(node_id) WHERE is_primary = 1`.
- Bounded write transactions: All node upsert write transactions are bounded to at most 500 nodes per transaction. This applies uniformly, including burst-recovery and large refactoring batches. The sole exception is atomic semantic edge replace (one transaction per affected project/package scope), which is bounded per scope rather than by a fixed node count.
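The file deletion operation order described in this list can be sketched against a toy schema. Several details here are assumptions made for illustration: file nodes are stored in the same `nodes` table, `deletion_log` has only three columns, the "convenience field" is just `nodes.file_id`, the new primary is chosen as the surviving span with the lowest `span_id`, and the function names `delete_file` and `journal_and_delete` are hypothetical.

```python
import sqlite3
import time
import uuid

SCHEMA = """
CREATE TABLE nodes (
    node_id BLOB(16) PRIMARY KEY,
    kind    TEXT NOT NULL,
    file_id BLOB(16) REFERENCES nodes(node_id) ON DELETE SET NULL
);
CREATE TABLE node_spans (
    span_id    INTEGER PRIMARY KEY,
    node_id    BLOB(16) NOT NULL REFERENCES nodes(node_id) ON DELETE CASCADE,
    file_id    BLOB(16) NOT NULL REFERENCES nodes(node_id) ON DELETE CASCADE,
    is_primary INTEGER NOT NULL DEFAULT 0
);
CREATE UNIQUE INDEX idx_node_spans_one_primary
    ON node_spans(node_id) WHERE is_primary = 1;
CREATE TABLE deletion_log (
    node_id    BLOB(16) NOT NULL,
    kind       TEXT NOT NULL,
    deleted_at REAL NOT NULL
);
"""

def journal_and_delete(conn, node_id):
    """Snapshot the node into deletion_log, then hard-delete it."""
    conn.execute(
        "INSERT INTO deletion_log (node_id, kind, deleted_at) "
        "SELECT node_id, kind, ? FROM nodes WHERE node_id = ?",
        (time.time(), node_id),
    )
    conn.execute("DELETE FROM nodes WHERE node_id = ?", (node_id,))

def delete_file(conn, file_id):
    conn.execute("BEGIN IMMEDIATE")  # single write transaction
    try:
        # (1) collect affected nodes
        affected = [row[0] for row in conn.execute(
            "SELECT DISTINCT node_id FROM node_spans WHERE file_id = ?", (file_id,))]
        # (2) delete the spans that lived in this file
        conn.execute("DELETE FROM node_spans WHERE file_id = ?", (file_id,))
        # (3) per node: journal + delete, or reassign the primary span
        for node_id in affected:
            remaining = conn.execute(
                "SELECT span_id, file_id, is_primary FROM node_spans "
                "WHERE node_id = ? ORDER BY span_id", (node_id,)).fetchall()
            if not remaining:
                journal_and_delete(conn, node_id)
            elif not any(is_primary for _, _, is_primary in remaining):
                span_id, new_file_id, _ = remaining[0]  # assumed tie-break
                conn.execute("UPDATE node_spans SET is_primary = 1 WHERE span_id = ?", (span_id,))
                conn.execute("UPDATE nodes SET file_id = ? WHERE node_id = ?", (new_file_id, node_id))
        # (4) the file node goes last
        journal_and_delete(conn, file_id)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)

file_a, file_b, single, partial = (uuid.uuid4().bytes for _ in range(4))
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    (file_a, "file", None), (file_b, "file", None),
    (single, "method", file_a), (partial, "class", file_a)])
conn.executemany("INSERT INTO node_spans (node_id, file_id, is_primary) VALUES (?, ?, ?)", [
    (single, file_a, 1), (partial, file_a, 1), (partial, file_b, 0)])

delete_file(conn, file_a)
survivors = conn.execute("SELECT node_id, file_id FROM nodes WHERE kind != 'file'").fetchall()
journaled = conn.execute("SELECT COUNT(*) FROM deletion_log").fetchone()[0]
```

In this run the single-span method and the file node are journaled and removed, while the multi-span class survives with its remaining span promoted to primary and its `file_id` pointing at the other file.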
- Context budget separation: Retrieval pipeline has `retrieval.max_output_tokens` (hard cap). MCP clients pass the desired limit when invoking retrieval tools. Retrieval does not know about conversation history or system prompts.
- Cold-start awareness: Behavioral queries will have low vector recall until eager embeddings have been computed for the full codebase. Phase 4b eval should include behavioral queries to establish baseline.
- Safe mode is real (Phase 2a):
  - `indexing.safe_mode = true`: MUST NOT evaluate MSBuild, MUST NOT execute TS plugins. C# runs as `syntactic_only`.
  - Any relaxation MUST be explicit and user-approved.
- NuGet restore separately gated: `indexing.allow_nuget_restore` (default false) gates restore even when safe mode is off.
- No unsupported sandbox promises: Primary mitigation is safe mode, not filesystem sandboxing.
- Bounded queues: Write queue and child-process request queues bounded with deterministic backpressure.
- Cancellation correctness: Superseded semantic requests are cancellable via `CancellationToken` (tokio_util), propagated to rayon CPU tasks (periodic token checks) and child process RPC calls (request_id cancellation messages). The writer thread MUST check the token before COMMIT; if cancelled, the transaction is rolled back. MUST NOT commit partial results.
- No long-lived write transactions: Bounded batch size (max 500 nodes/tx) for node upsert transactions to avoid WAL growth. Applies uniformly to all node upsert batch types. Exception: atomic semantic edge replace (Phase 2a) may exceed this bound because it operates on edges within a single project/package scope and must be atomic to prevent mixed-staleness. Edge replace transactions are bounded per project/package scope, not by a fixed node count.
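The two writer rules above (at most 500 nodes per upsert transaction, and a cancellation check before COMMIT) can be sketched together. This is a Python illustration only: `threading.Event` stands in for the `CancellationToken`, the one-column `nodes` table and the names `batched` and `write_nodes` are hypothetical, and the sketch takes no position on how the engine treats batches that were committed before a cancellation arrived.

```python
import sqlite3
import threading
from itertools import islice

MAX_NODES_PER_TX = 500  # bound stated in the text above

def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk

def write_nodes(conn, symbol_keys, cancel):
    """Write nodes in bounded transactions; returns the number committed."""
    committed = 0
    for chunk in batched(symbol_keys, MAX_NODES_PER_TX):
        conn.execute("BEGIN IMMEDIATE")
        conn.executemany(
            "INSERT INTO nodes (symbol_key) VALUES (?)",
            [(key,) for key in chunk],
        )
        # Check the token immediately before COMMIT; a cancelled
        # transaction is rolled back as a whole.
        if cancel.is_set():
            conn.execute("ROLLBACK")
            break
        conn.execute("COMMIT")
        committed += len(chunk)
    return committed

conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
conn.execute("CREATE TABLE nodes (symbol_key TEXT NOT NULL)")

cancel = threading.Event()
written = write_nodes(conn, [f"sym{i}" for i in range(1200)], cancel)  # 500 + 500 + 200

cancel.set()
written_after_cancel = write_nodes(conn, ["late"], cancel)  # rolled back
row_count = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
```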
- No constant boilerplate: Embedding uses only discriminative content.
- File node embedding input: File nodes use `file_path` + `file_name` + `language` + import/export summary (top-N imported/exported symbol names) + file header doc comment. Must NOT collapse all file nodes into the same vector space region.
- Normalization: All embeddings use the same model and consistent normalization.
- `reference_count`: Phase 1 = approximate (tree-sitter edge count). Phase 2a = precise (Roslyn/TS findReferences).
- `is_public_api` (TS): Phase 1 = direct exports only (export keyword). Phase 2a = includes re-exports via barrel file resolution.
- TS file move in Phase 1: all TS exported symbols are classified `export_scope = module` (tree-sitter cannot trace the export graph), so `symbol_key` includes `file_id` and is NOT move-stable. A pure file move is treated as delete+create for TS symbols in Phase 1 (cache loss, not correctness failure). Verify that two identically-named exported helpers in different files produce distinct identity keys (because `file_id` differs). Full identity key `(language, project_id, symbol_key, symbol_disambiguator)` is unique across files. Non-exported file-scoped symbols also include `file_id` in identity.
- File move preserves `node_id` and cached embeddings (ChangeBatch processes creates before deletes, upserts by identity key, updates `node_spans` to new file).
- `get_source()` and `get_node()` never trigger network calls.
- File deletion hard-deletes all single-span nodes whose only span was in that file; deletion journal entries written for each. Multi-span nodes whose other spans survive are NOT deleted (see multi-span deletion test below). The file node itself is journaled and then hard-deleted last. Verify the operation order: collect affected nodes → delete spans → process nodes (journal+delete or reassign primary) → journal+delete file node. Verify the entire sequence executes within a single write transaction.
- Hard-deleted nodes leave no orphaned edges, vec_nodes, or fts_nodes rows.
- Journal sweep removes entries older than retention period.
- Rename produces new node_id; old node hard-deleted with journal entry.
- Multi-span correctness: partial class maintains one primary span. Content_hash (`chunk_hash`) is SHA-256 of span hashes ordered by `span_hash` bytes ascending (not `span_id`). `chunk_hash` is purely content-derived.
- File deletion with multi-span nodes: deleting a file that contains one span of a multi-span node removes only that span's `node_spans` row, reassigns primary span if needed, and does NOT delete the node. Deleting a file whose nodes have no remaining spans hard-deletes those nodes.
- Semantic-context file changes detected and logged (no action until Phase 2a).
- InvalidationPlanner: unit test against decision matrix for all (edge_type, change_type) pairs. Decision matrix includes `node_id` lifecycle rows (preserved → no action; replaced → CASCADE + re-establish), not `symbol_key`-based triggers.
- Path normalization: Windows backslash paths → repo-relative POSIX at boundary.
- Debug CLI: all commands produce valid JSON on sample project.
- Backend integration: health check authenticates and completes a round-trip.
- `project_id` is non-null: nodes without a detected project are assigned to a synthetic "repo-root" project per language. Project detection MUST complete before symbol node indexing begins. Collisions from multi-project repos with identically-named symbols MUST be detected and logged with an actionable warning.
- `node_spans` DDL uses `BLOB(16)` for UUID columns and `BLOB(32)` for hash columns, consistent with all other schema definitions.
- `node_spans.file_id` FK uses `ON DELETE CASCADE`.
- Write transactions bounded to ≤500 nodes/tx, including burst-recovery batches.
- Cancellation: writer thread rolls back uncommitted transaction on cancellation token; no partial results persisted.
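The path normalization check in the list above (Windows backslash paths becoming repo-relative POSIX at the boundary) could be exercised against a function shaped like the sketch below. The name `normalize_path` appears in the text, but its signature and behavior here are assumptions: this version treats every backslash as a separator, strips the repo root prefix, and rejects paths that fall outside the root.

```python
import posixpath

def normalize_path(raw: str, repo_root: str) -> str:
    """Return `raw` as a repo-relative POSIX path (assumed signature)."""
    # Treat backslashes as separators, then collapse "." and ".." segments.
    path = posixpath.normpath(raw.replace("\\", "/"))
    root = posixpath.normpath(repo_root.replace("\\", "/"))
    if path == root:
        return "."
    if path.startswith(root + "/"):
        return path[len(root) + 1:]
    # Anything still absolute (POSIX root or a Windows drive letter), or
    # climbing above the starting directory, lies outside the repo.
    has_drive = len(path) > 1 and path[1] == ":"
    if posixpath.isabs(path) or has_drive or path == ".." or path.startswith("../"):
        raise ValueError(f"path outside repo root: {raw!r}")
    return path  # already repo-relative

from_windows = normalize_path(r"C:\repo\src\ui\App.tsx", r"C:\repo")
from_relative = normalize_path(r"src\ui\..\App.tsx", "/home/dev/repo")
print(from_windows, from_relative)
```

A test in the spirit of the checklist item would feed such Windows-style inputs at the boundary and assert that only forward-slash, repo-relative strings reach storage.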
- Context change invalidation: changing `tsconfig.json` or `.csproj` triggers semantic recompute even with zero source edits.
- Atomic replace: no observable mixed old/new edge state during semantic recompute.
- Safe mode enforcement: MSBuildWorkspace and TS plugins blocked when safe mode is on.
- NuGet gating: with safe mode off and `allow_nuget_restore = false`, evaluation proceeds but restore is blocked.
- `reference_count` upgraded to precise count from Roslyn/TS.
- `is_public_api` for TS includes barrel file re-exports.
- Edge invalidation reframe: when a target node's `node_id` is replaced (delete+create), ON DELETE CASCADE removes stale edges. Callers re-establish edges on next semantic pass. No ad-hoc remapping by `symbol_key`.
- Identity reconciliation (C#): When Roslyn produces a different `symbol_key` than the tree-sitter fallback for the same symbol, verify that `symbol_key` is mutated in-place on the existing `node_id` (not delete+create). Verify cached embeddings are preserved. Verify that if the new key conflicts with an existing node, the system falls back to delete+create with a `node_identity_map` entry.
- TS export_scope upgrade: After semantic enrichment resolves re-exports, verify that symbols reachable from package entrypoints are upgraded from `export_scope = module` to `export_scope = package`. Verify `symbol_key` is mutated in-place (file_id removed, package-level export path added) on the existing `node_id`. Verify that a subsequent pure file move preserves `symbol_key` for package-scoped exports. Verify that module-scoped exports (not reachable from entrypoints) retain `file_id` in identity.
- Chunk fingerprint: identical bodies / different names → high similarity (≥0.95).
- Detected rename reuses `node_id`; edges automatically preserved.
- Uncommitted rename (delete + create in debounce window) correctly correlated via deletion journal.
- When node_id cannot be preserved, `node_identity_map` entry created with `BLOB(16)` UUID columns and both old/new `symbol_disambiguator` fields.
- `EmbeddingProvider` trait produces deterministic embeddings; `HashEmbeddingProvider` used in tests.
- `vec_nodes` table created by `ensure_vec_nodes_table`; embeddings stored as BLOB keyed by `node_id BLOB(16)`.
- `load_sqlite_vec()` registers sqlite-vec as a global auto-extension before any connection is opened.
- Eval harness includes behavioral queries with expected lower baseline (cold-start gap).
- Reference codebases: ≥1 C#, ≥1 TS/React, each >10K nodes.
- All MCP tools return valid JSON responses.
- File system tools reject paths escaping the repo root with `path_escape` error.
- `index_files` routes through the ingest pipeline and updates the graph.
- `get_status` returns accurate counts for indexed files, symbols, and embeddings.