
feat+fix: C# support, investigation-grade trace, semantic search, execution flows, channels, cross-repo intelligence #162

Open
Koolerx wants to merge 3 commits into DeusData:main from Koolerx:fix/csharp-and-trace-improvements

Conversation

@Koolerx

@Koolerx Koolerx commented Mar 27, 2026

Summary

40 commits adding major features and fixing critical bugs across the MCP handler, extraction layer, pipeline, store, and Cypher engine. Developed while stress-testing against large enterprise codebases and running real investigation scenarios.

Highlights: C# blast radius analysis (0 → 16 callers), execution flow detection (1 → 300 flows), hybrid BM25+vector semantic search, camelCase token splitting, process deduplication, route deduplication, cross-repo search/trace/impact analysis across 54 repos.


Phase 1: Core Fixes and Features (Commits 1-29)

Bug Fixes (Commits 1-6)

  1. trace_call_path Class → Method resolution — BFS resolves through DEFINES_METHOD edges for Class/Interface nodes
  2. detect_changes use-after-free — Switched to strcpy variants for stack buffer reuse
  3. Route path validation — Blocklist filter for vendored/minified JS false positives
  4. C# inheritance via base_list — INHERITS edges: 210 → 1,588 (7.5x)
  5. Crash on 0-edge nodes + fuzzy name fallback — Heap-allocated traversal array, substring fallback
  6. Class has_method node ID — Fixed DEFINES_METHOD BFS to use Class node ID

New Features (Commits 7-17)

  7. get_architecture returns full analysis
  8. Louvain clustering with semantic labels
  9. Hapi.js route extraction (0 → 1,665 routes)
  10. BM25 full-text search via SQLite FTS5
  11. Execution flow detection (BFS + Louvain, 300 flows)
  12. Socket.IO + EventEmitter channel detection
  13. get_impact blast radius with risk assessment
  14-17. Cypher JSON properties, investigation-grade trace, C# delegate/event resolution, C# channel constants

Gap Closure (Commits 18-29)

  18. get_impact Class over Constructor resolution for C#
  19. Entry point detection for C#/Java class methods (1 → 280 flows)
  20. Channel dedup + count(DISTINCT) + SQL injection fix
  21. Cypher NOT EXISTS subquery (dead-code detection in <1s)
  22. Cross-repo channel query + has_property in trace
  23. C# property extraction with HAS_PROPERTY edges (19K+ properties)
  24. C/C++ CALLS edge attribution to enclosing function scope
  25. C++ entry point heuristics (WinMain, DllMain, etc.)
  26. HANDLES + HTTP_CALLS in process detection BFS
  27. Route→Function resolution + relaxed process detection (4 → 61 flows)
  28. Resolve relative import paths (153 → 11,770 IMPORTS, 77x)
  29. CommonJS require() import extraction

Phase 2: Search Quality (Commits 30-33)

30. Process participation in search_graph results

Each BM25 search result now includes the execution flows it participates in.

31. JS/TS constant resolution for Socket.IO channel detection

Resolves const EVENT = "foo" references in socket.emit(EVENT) patterns. Channels detected: 6 → 17 per repo.

32. Expose BM25 query and sort_by params in search_graph schema

The FTS5 BM25 search path existed but query was not declared in the tool's inputSchema. AI agents couldn't discover or use it. Now exposed with full documentation.

33. Pure BM25 relevance ranking + camelCase token splitting

  • Removed fan_in popularity boost from BM25 ranking — popular-but-irrelevant functions no longer outrank relevant matches
  • Added cbm_camel_split() SQLite function (updateCloudClient → "update Cloud Client"), enabling word-level BM25 matching
  • Switched FTS5 to contentless mode (content='') — required for camelCase split tokens to match correctly at query time

Phase 3: Process and Route Quality (Commit 34, 38)

34. Deduplicate entry points + [module] prefix on process labels

  • Entry point dedup: route resolution added the same function N times when N routes pointed to the same handler file. 61 → 17 processes.
  • Module labels: funcA → funcZ becomes [controllers] funcA → funcZ, instantly navigable among 50+ flows.

38. Route node deduplication — eliminate ghost nodes

Three sources of 3x route duplication: the express extractor, the hapi extractor (both matching the same patterns), and module-level extraction with empty QN. Fixed with (method, path) dedup. 1,665 → 555 routes, 0 ghosts.


Phase 4: Semantic Vector Search (Commits 35-37)

35. Hybrid BM25+vector semantic search via external embedding API

Full semantic search architecture:

  • Embeddings table in SQLite with cbm_cosine_sim() custom function
  • HTTP embedding client via Mongoose to any OpenAI-compatible /v1/embeddings endpoint
  • RRF merge (k=60) combining BM25 keyword results with vector cosine similarity
  • generate_embeddings MCP tool for manual trigger
  • semantic_results field in search_graph output for vector-only matches

Configuration: CBM_EMBEDDING_URL, CBM_EMBEDDING_MODEL, CBM_EMBEDDING_DIMS env vars. No new dependencies — uses vendored Mongoose (HTTP) and yyjson (JSON).

36. Fix use-after-free in semantic result strings

yyjson_mut_obj_add_str (borrows pointer) → yyjson_mut_obj_add_strcpy (copies string).

37. Auto-generate embeddings during full indexing

When CBM_EMBEDDING_URL is configured, the pipeline auto-generates embeddings after process and channel detection. Zero-friction: repos indexed while the embedding server is running get embeddings automatically.


Phase 5: Cross-Repo Intelligence (Commits 39-40)

39. Unified cross-repository index

New _cross_repo.db built by scanning all per-project databases:

  • 134K node stubs (Function/Method/Class/Interface/Route from all repos)
  • 526 channel references across 13 projects
  • 134K embedding vectors copied for cross-repo semantic search
  • BM25 FTS5 index with camelCase splitting across all repos
  • 12 cross-repo channel matches automatically detected (emit in A, listen in B)
  • Build time: ~2 seconds for 54 repos

New MCP tools: build_cross_repo_index, trace_cross_repo. Auto-rebuilds after every index_repository.

40. Cross-repo search, flow tracing with call chains, and impact analysis

Cross-repo search (search_graph with project="*"):
Hybrid BM25+vector search across all 54 repos in a single call. Returns results with both short project name and full project_id for follow-up queries.

Enhanced trace_cross_repo with call chains:
When a channel filter is provided, traces depth-2 upstream callers of the emitter and depth-2 downstream callees of the listener. Handles Class→Method resolution and (file-level) listener fallback via channels table lookup.

Cross-repo impact analysis (get_impact with cross_repo=true):
After per-repo BFS, checks if d=1 impacted symbols emit channels to other repos. For each affected channel, opens consumer project DB, traces downstream from listener, returns cross_repo_impacts array.


Testing

All 40 commits compile clean with -Wall -Wextra -Werror. 2,586 existing tests pass. Stress-tested against:

  • Large C# monolith (~128K nodes) — class hierarchy, delegates, 19K properties, 278 flows, 152 channels
  • Node.js/TS Hapi.js monorepo (~146K nodes) — 11,770 IMPORTS, 300 flows, 212 channels, 555 routes
  • React/TS monorepo (~9K nodes) — 300 flows, 344 IMPORTS
  • Java/Vert.x service (~15K nodes) — 300 flows from Java entry points
  • C++ library (~570 nodes) — Function→Function CALLS attribution
  • 54 repos with 134K embeddings — cross-repo semantic search, channel tracing, impact analysis

New Files

  • src/pipeline/embedding.c / embedding.h — Semantic embedding generation + RRF merge
  • src/store/cross_repo.c / cross_repo.h — Cross-repo index, search, channel matching, trace helper

Configuration (new env vars)

| Variable | Default | Purpose |
| --- | --- | --- |
| CBM_EMBEDDING_URL | (none) | OpenAI-compatible /v1/embeddings endpoint |
| CBM_EMBEDDING_MODEL | nomic-embed-text | Embedding model name |
| CBM_EMBEDDING_DIMS | 768 | Vector dimensions |

@Koolerx Koolerx force-pushed the fix/csharp-and-trace-improvements branch from 33a7d1d to 58fff9e Compare March 27, 2026 18:03
@Koolerx Koolerx changed the title fix: C# support improvements and MCP handler bug fixes fix+feat: C# support, MCP bug fixes, Hapi routes, Louvain clustering Mar 27, 2026
@Koolerx Koolerx changed the title fix+feat: C# support, MCP bug fixes, Hapi routes, Louvain clustering feat+fix: C# support, investigation-grade trace output, BM25 search, execution flows, channel detection Mar 27, 2026
@DeusData DeusData added enhancement New feature or request parsing/quality Graph extraction bugs, false positives, missing edges language-request Request for new language support labels Mar 29, 2026
@Koolerx Koolerx changed the title feat+fix: C# support, investigation-grade trace output, BM25 search, execution flows, channel detection feat+fix: C# support, investigation-grade trace, BM25 search, execution flows, channels, IMPORTS resolution Mar 29, 2026
@Koolerx Koolerx force-pushed the fix/csharp-and-trace-improvements branch from f7f94be to 70bc64f Compare March 30, 2026 17:42
@Koolerx Koolerx changed the title feat+fix: C# support, investigation-grade trace, BM25 search, execution flows, channels, IMPORTS resolution feat+fix: C# support, investigation-grade trace, semantic search, execution flows, channels, cross-repo intelligence Mar 30, 2026
@Koolerx
Author

Koolerx commented Mar 30, 2026

@DeusData, many improvements have been made to the project. I know there are over 40 commits, but please go ahead and give it a test run.

@DeusData
Owner

Hey @Koolerx, thx! Will check. Currently mainly involved in clearing technical debt. Will come back to this asap. Likely will analyze your changes and extract what makes sense in a separate commit where I will list you as co-author, so that you will be listed as contributor. Hope that's fine for u :)

@Koolerx
Author

Koolerx commented Mar 30, 2026

> Hey @Koolerx, thx! Will check. Currently mainly involved in clearing technical debt. Will come back to this asap. Likely will analyze your changes and extract what makes sense in a separate commit where I will list you as co-author, so that you will be listed as contributor. Hope that's fine for u :)

all sounds well to me

@DeusData
Owner

DeusData commented Mar 30, 2026

But in general: why the embeddings? Can you give a bit more reasoning for this, @Koolerx? What does this add to a coding agent? I have already investigated using embeddings but was not convinced. Coding agents can already efficiently query GraphDBs themselves.

@Koolerx
Author

Koolerx commented Mar 30, 2026

> But in general: why the embeddings? Can you give a bit more reasoning for this, @Koolerx? What does this add to a coding agent? I have already investigated using embeddings but was not convinced. Coding agents can already efficiently query GraphDBs themselves.

the graph handles most queries really well. Like, 85% of the time BM25 plus the knowledge graph is all you need. The embeddings close a specific gap in the remaining cases, and it's worth understanding exactly what that gap is before dismissing it.

The gap is vocabulary mismatch. When the user's search terms don't appear anywhere in symbol names, qualified names, or file paths, BM25 returns nothing useful. The graph can only find things when tokens overlap; that's just how keyword search works.

We ran this head-to-head. Query: "start video recording" against a media services repo. BM25 found startSession, startRecordingSession, and startCaptureNodeSession: good results, with keyword matches on "start", "recording", and "session".

But the vector layer surfaced updateVideoConductorSession, videoTrimmingFailureNotification, videoRemuxNotification, and RecordingStorage: conceptually related symbols with zero keyword overlap. The BM25 results are the functions you'd change. The vector results are the functions that would break if you changed them. That's the blast radius that keyword search can't surface on its own.

Now, for a coding agent specifically: the agent doesn't know what it doesn't know. If search_graph returns 29 results, the agent assumes that's the full picture and moves on. The vector layer surfaces what the agent would have missed entirely. We tested "error handling": BM25 found 29 functions with "error" in the name. But the functions that actually handle errors (constructors in error classes, catch blocks in controllers) don't have "error" or "handling" in their names. Vector search found 20 additional symbols the agent never would have seen.

Where it doesn't matter (and I want to be honest about this): if you know the exact symbol name, BM25 is sufficient. If you're tracing callers and callees, graph BFS is the right tool, period. If the codebase uses consistent naming conventions, BM25 plus the camelCase token splitting we added covers it. Embeddings don't help with any of those cases.

The implementation cost is pretty minimal: about 200 lines of C, a cosine similarity function plus an HTTP client to any OpenAI-compatible endpoint. Zero new dependencies; we used Mongoose (already vendored) for HTTP and yyjson for JSON. It's fully opt-in via the CBM_EMBEDDING_URL env var; when not set, everything works exactly as before, pure BM25. A brute-force cosine scan over 134K vectors takes under 10ms, so no ANN index is needed at this scale.

The honest limitation is that quality depends on the embedding model; a bad model gives bad results. And it requires a running embedding server (Ollama, llamafile, whatever), so it's not self-contained like the graph. That's why we made it opt-in rather than default. It's a power-user feature for people who need discovery across vocabulary boundaries, not a replacement for the graph.

If CBM is used purely for graph traversal (trace callers, impact analysis, process flows), embeddings add nothing. If it's used for discovery, "find code related to X" where X is described in natural language, that's where embeddings close the gap that BM25 structurally cannot.

@DeusData
Owner

DeusData commented Apr 3, 2026

Hey @Koolerx, thanks for the detailed argument on embeddings. I've been investigating this exact question in depth — whether vector/embedding search adds value for coding agents specifically (vs. human developers). Here's where I landed, and what I built instead.

The research

I went through ~15 papers, benchmarks, and industry analyses on embeddings vs. structured approaches for coding agents:

  • "Why Grep Beat Embeddings in Our SWE-Bench Agent" (Jason Liu / Augment, 2025) — A top SWE-bench team found grep+find was sufficient. Improving embedding models didn't improve end-to-end agent performance because agents are persistent and will find what they need through iteration.
  • CodeScout (OpenHands, 2026) — An RL-trained 1.7B model with just ripgrep+sed matches or beats 32B models with specialized graph tools. No embeddings needed.
  • Augment Context Engine — The most successful tool in this space explicitly chose knowledge graphs OVER embeddings. Their 80% quality improvement for Claude Code came from structural awareness, not vector similarity.
  • Google DeepMind LIMIT benchmark (arXiv 2508.21038, 2025) — Proved single-vector embeddings have fundamental mathematical ceilings. Complex queries create combination spaces that exceed any practical embedding dimension.
  • Greptile's own research — Published evidence that code and natural language occupy different semantic spaces. Similarity score for a query against actual code: 0.728. Against a NL description of that code: 0.815. Embeddings need preprocessing (code→NL translation) to work for code.
  • CodeRAG-Bench (NAACL 2025) — GPT-4 showed no improvement with retrieval on open-domain code. DeepSeekCoder actually regressed with retrieved context.

The pattern: every piece of agent-relevant evidence says embeddings don't help or actively hurt. The pro-embedding evidence is either human-facing (Copilot suggestions, Augment's search UI) or measuring retrieval quality only, not end-to-end agent performance.

The core insight

Your argument is about vocabulary mismatch — "start video recording" not matching updateVideoConductorSession. That's a real gap for human search. But coding agents don't search like humans. The LLM driving the agent IS already a semantic engine. It knows "video recording" relates to "capture", "session", "conductor". It generates search_graph(name_pattern=".*video.*record.*") and then search_graph(name_pattern=".*capture.*session.*"). The LLM closes the vocabulary gap better than any embedding model, because it can reason about context.

What I built instead: SIMILAR_TO edges

I just pushed a feature that solves the part of the similarity problem that actually matters for coding agents — near-clone detection via precomputed graph edges:

Commits:

  • 452f5a7 — Add MinHash fingerprinting and SIMILAR_TO edges for near-clone detection
  • 09ce20a — Improve MinHash quality: leaf-only tokens, structural weighting, unique trigram gate

How it works:

  • During AST extraction, compute K=64 MinHash signatures from normalized AST node-type trigrams (leaf-only, language-agnostic)
  • Structural weighting: skip all-normalized trigrams (noise), weight structural trigrams by specificity via repetition-based weighted MinHash — achieves IDF-like effect without corpus statistics
  • LSH (b=32, r=2) for O(n) candidate generation, then exact Jaccard ≥ 0.95 for edge emission
  • Emits SIMILAR_TO edges as first-class graph edges — same as CALLS, IMPORTS, IMPLEMENTS

Why this is better than vector search for agents:

  • Zero latency — it's a stored graph edge. No embedding model, no ANN query, no runtime inference.
  • No external dependency — pure integer arithmetic using vendored xxHash. No Ollama, no API server, no model file.
  • Automatic discovery — the agent sees near-clones during normal trace_call_path traversal without explicitly asking.
  • Precise — tight threshold means every edge is a genuine near-clone (~98% precision on Linux kernel spot-check).
  • Language-agnostic — leaf-only token counting + structural trigram weighting works across all 64 supported languages.

Linux kernel benchmark: 395K functions fingerprinted, 58K SIMILAR_TO edges, 7.7s similarity pass, 1:41 total pipeline. Spot-checked edges are genuine cross-subsystem near-clones (Atheros ath10k/11k/12k driver generations, AMD GPU display mode versions, SMB v1/v2 protocol handlers).

(Python Project) benchmark: 68 SIMILAR_TO edges, 100% precision — all genuine cross-service duplicates (clean_data_errors copied across 4 endpoint files, log_info across 4 services).

Your blast radius argument

You said: "The vector results are the functions that would break if you changed them. That's the blast radius."

That's exactly what SIMILAR_TO edges deliver. If an agent fixes ValidateUser(), the graph now shows a SIMILAR_TO edge to ValidateOrder() — which has the same bug pattern. The agent fixes both. No embedding model needed — the structural fingerprint caught the near-clone.

For the broader blast radius (callers, callees, transitive dependencies), trace_call_path already covers this structurally.

What I'd like you to check

Can you take a look at the SIMILAR_TO implementation (src/simhash/minhash.{h,c}, src/pipeline/pass_similarity.c) and let me know if you think this covers the use cases you were targeting with embeddings? Specifically:

  1. Does the near-clone detection via SIMILAR_TO edges address the "blast radius" gap you identified?
  2. Are there concrete coding agent workflows where you believe embeddings would still add value beyond what MinHash + structural graph provides?

I'm genuinely interested in your perspective — you clearly stress-tested against real codebases. If there's a gap SIMILAR_TO doesn't cover that embeddings would, I want to understand it.

@Koolerx
Author

Koolerx commented Apr 4, 2026

Hey @DeusData, great work on the SIMILAR_TO implementation — I pulled your latest main, built it, and ran a head-to-head comparison against our branch across 5 repos of different sizes and languages (a large JS/TS monolith ~146K nodes, a medium Node.js/Hapi service ~3K nodes, a small Electron/TS app ~700 nodes, a Hapi.js backend ~2.5K nodes, and a mixed JS/Python monorepo ~15K nodes). Here's what I found.

The comparison

I indexed the same repos with both binaries and compared edge counts, search quality, and feature coverage.

Edge counts (5 repos, side by side)

| Feature | Your main | Our PR | Delta |
| --- | --- | --- | --- |
| CALLS | 50,956 | 52,686 | +1,730 |
| IMPORTS | 154 | 12,537 | +12,383 |
| HANDLES (route→handler) | 15 | 1,193 | +1,178 |
| SIMILAR_TO | 3,721 | 0 | -3,721 |
| WRITES | 671 | 1,018 | +347 |
| FTS5 indexed | 0 | 168,248 | all nodes |
| Processes | 0 | 693 | execution flows |
| Channels | 0 | 204 | Socket.IO/EventEmitter |
Search quality

I ran search_graph(query="...") on both binaries. Your main returns all nodes unranked for every query — the query parameter falls through without FTS5 search. Our branch returns ranked, filtered results with camelCase token splitting:

| Repo size | Query | Your main | Our PR |
| --- | --- | --- | --- |
| 146K nodes | "update settings" | 145,814 (all nodes returned) | 1,467 ranked (relevant update/settings functions first) |
| 146K nodes | "authentication login" | 145,814 (all nodes returned) | 230 ranked (login handlers and auth schemes first) |
| 3K nodes | "device management" | 2,877 (all nodes returned) | 230 ranked (device CRUD functions first) |
| 700 nodes | "audio stream" | 713 (all nodes returned) | 28 ranked (stream start/stop/hook functions first) |

The issue is that search_graph has FTS5 BM25 search code internally, but the query parameter wasn't declared in the tool's inputSchema — so AI agents never send it. Our commit 32 exposed it, commit 33 added camelCase token splitting (updateDeviceConfig → searchable as "update", "device", "config"), and switched to contentless FTS5 for correct matching.

SIMILAR_TO analysis

Your 3,411 SIMILAR_TO edges on the large monolith are genuinely useful. The cross-file clones (2,377 of them) are the most interesting — same UI component patterns duplicated across different entity pages, same pagination logic copied across list views, same column renderers adapted for different data types.

These are real near-clones that neither keyword search nor embeddings would find — they're structurally similar but use completely different names. The MinHash approach is the right tool for this.

Features in our PR that your main doesn't have

  1. IMPORTS edges (12,537) — relative path resolution for JS/TS require() and import (commits 28-29). Creates module dependency edges that your main misses entirely.

  2. HANDLES edges (1,193) — route-to-handler resolution. GET /api/items → itemsController.list(). Currently 0 in your main for most repos.

  3. Execution flows (693) — BFS from entry points through Louvain communities. Each process has a [module] label prefix for navigability. Deduplicated (was producing 3-8x duplicates from route resolution).

  4. Channels (204) — Socket.IO emit/listen + EventEmitter pattern detection with constant resolution (const EVENT = "foo"; socket.emit(EVENT) → channel "foo").

  5. Cross-repo intelligence — unified _cross_repo.db with search (project="*"), channel flow tracing, and cross-repo impact analysis. Builds in ~2s for 50+ repos.

  6. Incremental FTS5 — the incremental pipeline was wiping FTS5 to 0 rows after every reindex (btree dump bypasses triggers). Fixed to rebuild FTS5 after merge.

  7. C# Interface INHERITS — Interface nodes weren't registered in the symbol registry. class Foo : IBar produced 0 INHERITS→Interface edges. Fixed in both parallel and sequential paths.

On embeddings specifically

I hear your argument about agents being semantic engines themselves, and I think you're partially right. Here's the honest breakdown after running both approaches:

Where SIMILAR_TO wins over embeddings:

  • Zero dependencies, zero latency — it's a stored edge
  • Finds structural clones that embeddings miss entirely (same AST pattern, different names)
  • Language-agnostic without preprocessing
  • The "fix this bug, find the same bug elsewhere" use case is perfectly served

Where embeddings add value that SIMILAR_TO doesn't cover:

  • Conceptual queries like "start video processing" find functions related to media pipeline management that have completely different AST structures and different names. SIMILAR_TO wouldn't match these.
  • But you're right that a coding agent CAN get there iteratively with multiple search_graph(name_pattern=".*video.*process.*") calls. The LLM closes the gap in 2-3 iterations.

My honest assessment: embeddings are a power-user feature for human-in-the-loop discovery, not essential for agent workflows. That's why we made it fully opt-in via CBM_EMBEDDING_URL. When not set, everything works as pure BM25 + graph. No external dependency, no model, no server.

What I'd suggest

The features that would add the most value to your main branch (independent of embeddings):

  1. FTS5 BM25 search — expose the query parameter in search_graph schema + camelCase splitting. This alone turns search from "return all nodes" to "return ranked relevant results." (~30 lines changed)

  2. IMPORTS edge resolution for JS/TS relative paths. Goes from 154 → 12,537 edges. Major improvement for module dependency analysis.

  3. Incremental FTS5 rebuild — 5 lines in pipeline_incremental.c. Without this, any incremental reindex wipes search to 0 results.

  4. Interface registration in the registry. 2 lines (add "Interface" to the label filter). Fixes all C#/Java INHERITS→Interface edges.

These are all small, targeted fixes that don't introduce embeddings or external dependencies. Happy to split them into separate PRs if that's easier to review.

The cross-repo, channels, processes, and embeddings features are bigger additions that you can evaluate separately. And SIMILAR_TO is complementary to everything we built — the two approaches together would be stronger than either alone.

@DeusData
Owner

DeusData commented Apr 4, 2026

Hey @Koolerx, thanks for evaluating. I am already working on a performant embedding analysis, adding on top of the vector search itself a "SEMANTIC_SIMILAR" edge, which should also cover cases the SIMILAR_TO edges miss atm (while also staying zero-dependency and not adding much latency to indexing). All the other things you have mentioned make sense. I just haven't had the time to embed them but will do so once done with my current TODO. I am also thinking about your cross-repo approach. Already had something similar in mind but need to think more about it. Thanks for pushing this my friend 🙏

Your Name added 2 commits April 4, 2026 18:13
…gistration, embeddings, cross-repo infrastructure

Rebased our PR DeusData#162 features onto upstream's latest main (commit 1d30971)
which includes MinHash SIMILAR_TO edges, CBM_CACHE_DIR, and major refactoring.

Ported features (building clean on upstream's refactored codebase):

1. FTS5 BM25 search infrastructure:
   - Contentless FTS5 virtual table (nodes_fts) with camelCase token splitting
   - cbm_camel_split() SQLite function: updateCloudClient → 'update Cloud Client'
   - FTS5 backfill in both full pipeline and incremental pipeline
   - Incremental reindex now preserves FTS5 (was wiping to 0 rows)

2. Interface registration in symbol registry:
   - Added 'Interface' to label filter in process_def() (pass_definitions.c)
   - Added 'Interface' to label filter in register_and_link_def() (pass_parallel.c)
   - Fixes: C# class Foo : IBar now creates INHERITS → Interface edges

3. C# base_list extraction:
   - Added 'base_list' to fallback base_types[] in extract_base_classes()

4. Embeddings infrastructure (opt-in via CBM_EMBEDDING_URL):
   - embeddings table in SQLite schema
   - cbm_cosine_sim() SQLite function for vector search
   - embedding.c/h: HTTP client, text generation, RRF merge, pipeline integration
   - Auto-generates embeddings during indexing when configured

5. Cross-repo infrastructure:
   - cross_repo.c/h: unified _cross_repo.db builder, cross-repo search,
     channel matching, trace helper

Not yet ported (follow-up commits):
- MCP tool changes (search_graph query param, generate_embeddings tool,
  cross-repo tools, get_impact tool)
- Process detection (cbm_store_detect_processes)
- Channel detection (cbm_store_detect_channels)
- C# delegate event subscription (extract_calls.c)
- WRITES expansion (extract_semantic.c)

All upstream features preserved: MinHash SIMILAR_TO, pass_similarity,
CBM_CACHE_DIR, TS_FIELD() macro, extracted helpers.
…able FTS5

Completes the rebase by adding the MCP handler layer:

1. Enable FTS5 in SQLite compile flags (-DSQLITE_ENABLE_FTS5).
   Without this, CREATE VIRTUAL TABLE USING fts5(...) silently creates
   a stub that fails on any query with 'no such module: fts5'.

2. Expose 'query' and 'sort_by' params in search_graph inputSchema.
   AI agents can now send natural language queries for BM25 ranked search
   instead of regex patterns only.

3. BM25 search path in handle_search_graph.
   When 'query' is provided, uses FTS5 MATCH with label-type structural
   boosting (Function/Method +10, Route +8, Class +5). Falls back to
   regex path when FTS5 is unavailable.

4. FTS5 backfill with contentless delete-all syntax.
   Contentless FTS5 tables (content='') require
   INSERT INTO table(table) VALUES('delete-all') instead of DELETE FROM.
   Falls back to plain names if cbm_camel_split is unavailable.

5. generate_embeddings MCP tool — manual trigger for embedding generation.

6. build_cross_repo_index MCP tool — builds unified _cross_repo.db.

7. trace_cross_repo MCP tool — cross-repo channel flow tracing.

8. Tool dispatch entries for all 3 new tools.

Tested: 'audio stream' on 713-node repo returns 28 ranked results
(useMicStream, startStream, stopStream) instead of 713 unranked.
@Koolerx Koolerx force-pushed the fix/csharp-and-trace-improvements branch from fa19d0c to 21c4537 Compare April 5, 2026 02:38
…action

The fallback base_types[] approach (find_base_from_children) includes the
':' separator in the extracted text for C# base_list nodes, producing
names like ': IExamService' instead of 'IExamService'. The registry lookup
fails because no node has a colon-prefixed name.

Fix: add explicit C# base_list handler that iterates named children of the
base_list node, extracting identifier/generic_name/qualified_name text
directly. Strips generic type args (List<int> → List).

Tested: 0 → 5 INHERITS→Interface edges on C# repo.
@DeusData
Owner

DeusData commented Apr 5, 2026

@Koolerx I implemented now my own version of vector search + introducing semantic edges into this. Also got basically all your things working excluding for now "cross repo" + execution flow parts. These I want to think more off. You will be credited as co author to these changes :)
