feat+fix: C# support, investigation-grade trace, semantic search, execution flows, channels, cross-repo intelligence#162
Conversation
@DeusData, many improvements have been made to the project. I know there are over 40 commits, but please go ahead and give it a test run.
Hey @Koolerx, thanks! Will check. Currently mainly focused on clearing technical debt. Will come back to this ASAP. Likely I'll analyze your changes and extract what makes sense into a separate commit where I list you as co-author, so that you will be listed as a contributor. Hope that's fine for you :)
All sounds good to me.
But in general: why the embeddings? Can you give a bit more reasoning for this, @Koolerx? What do they add to a coding agent? I have already investigated using embeddings but was not convinced. Coding agents can already query graph DBs efficiently themselves.
The graph handles most queries really well: 85% of the time, BM25 plus the knowledge graph is all you need. The embeddings close a specific gap in the remaining cases, and it's worth understanding exactly what that gap is before dismissing it.

The gap is vocabulary mismatch. When the user's search terms don't appear anywhere in symbol names, qualified names, or file paths, BM25 returns nothing useful. The graph can only find things when tokens overlap; that's just how keyword search works.

We ran this head to head. Query: "start video recording" against a media services repo. BM25 found startSession, startRecordingSession, startCaptureNodeSession: good results, keyword matches on "start", "recording", and "session". But the vector layer surfaced updateVideoConductorSession, videoTrimmingFailureNotification, videoRemuxNotification, RecordingStorage: conceptually related symbols with zero keyword overlap. The BM25 results are the functions you'd change. The vector results are the functions that would break if you changed them. That's the blast radius that keyword search can't surface on its own.

For a coding agent specifically, the agent doesn't know what it doesn't know. If search_graph returns 29 results, the agent assumes that's the full picture and moves on. The vector layer surfaces what the agent would have missed entirely. We tested "error handling": BM25 found 29 functions with "error" in the name. But the functions that actually handle errors (constructors in error classes, catch blocks in controllers) don't have "error" or "handling" in their names. Vector search found 20 additional symbols the agent never would have seen.

Where it doesn't matter, and I want to be honest about this: if you know the exact symbol name, BM25 is sufficient. If you're tracing callers and callees, graph BFS is the right tool, period. If the codebase uses consistent naming conventions, BM25 plus the camelCase token splitting we added covers it.
Embeddings don't help with any of those cases.

The implementation cost is pretty minimal: about 200 lines of C, a cosine similarity function plus an HTTP client to any OpenAI-compatible endpoint. Zero new dependencies; we used Mongoose, which was already vendored, for HTTP, and yyjson for JSON. It's fully opt-in via a CBM_EMBEDDING_URL env var. When not set, everything works exactly as before: pure BM25. A brute-force cosine scan over 134K vectors takes under 10ms, so no ANN index is needed at this scale.

The honest limitation is that quality depends on the embedding model. A bad model gives bad results. And it requires a running embedding server (Ollama, llamafile, whatever), so it's not self-contained like the graph. That's why we made it opt-in rather than default. It's a power-user feature for people who need discovery across vocabulary boundaries, not a replacement for the graph.

If CBM is used purely for graph traversal (trace callers, impact analysis, process flows), embeddings add nothing. If it's used for discovery, "find code related to X" where X is described in natural language, that's where embeddings close the gap that BM25 structurally cannot.
Hey @Koolerx, thanks for the detailed argument on embeddings. I've been investigating this exact question in depth: whether vector/embedding search adds value for coding agents specifically (vs. human developers). Here's where I landed, and what I built instead.

The research

I went through ~15 papers, benchmarks, and industry analyses on embeddings vs. structured approaches for coding agents:
The pattern: every piece of agent-relevant evidence says embeddings don't help or actively hurt. The pro-embedding evidence is either human-facing (Copilot suggestions, Augment's search UI) or measuring retrieval quality only, not end-to-end agent performance.

The core insight

Your argument is about vocabulary mismatch: "start video recording" not matching

What I built instead: SIMILAR_TO edges

I just pushed a feature that solves the part of the similarity problem that actually matters for coding agents: near-clone detection via precomputed graph edges. Commits:
How it works:
Why this is better than vector search for agents:
Linux kernel benchmark: 395K functions fingerprinted, 58K SIMILAR_TO edges, 7.7s similarity pass, 1:41 total pipeline. Spot-checked edges are genuine cross-subsystem near-clones (Atheros ath10k/11k/12k driver generations, AMD GPU display mode versions, SMB v1/v2 protocol handlers). (Python Project) benchmark: 68 SIMILAR_TO edges, 100% precision: all genuine cross-service duplicates (

Your blast radius argument

You said: "The vector results are the functions that would break if you changed them. That's the blast radius." That's exactly what SIMILAR_TO edges deliver. If an agent fixes For the broader blast radius (callers, callees, transitive dependencies),

What I'd like you to check

Can you take a look at the SIMILAR_TO implementation (
I'm genuinely interested in your perspective; you clearly stress-tested against real codebases. If there's a gap SIMILAR_TO doesn't cover that embeddings would, I want to understand it.
Hey @DeusData, great work on the SIMILAR_TO implementation. I pulled your latest main, built it, and ran a head-to-head comparison against our branch across 5 repos of different sizes and languages (a large JS/TS monolith ~146K nodes, a medium Node.js/Hapi service ~3K nodes, a small Electron/TS app ~700 nodes, a Hapi.js backend ~2.5K nodes, and a mixed JS/Python monorepo ~15K nodes). Here's what I found.

The comparison

I indexed the same repos with both binaries and compared edge counts, search quality, and feature coverage.

Edge counts (5 repos, side by side)
Search quality

I ran
The issue is that

SIMILAR_TO analysis

Your 3,411 SIMILAR_TO edges on the large monolith are genuinely useful. The cross-file clones (2,377 of them) are the most interesting: same UI component patterns duplicated across different entity pages, same pagination logic copied across list views, same column renderers adapted for different data types. These are real near-clones that neither keyword search nor embeddings would find; they're structurally similar but use completely different names. The MinHash approach is the right tool for this.

Features in our PR that your main doesn't have
On embeddings specifically

I hear your argument about agents being semantic engines themselves, and I think you're partially right. Here's the honest breakdown after running both approaches:

Where SIMILAR_TO wins over embeddings:
Where embeddings add value that SIMILAR_TO doesn't cover:
My honest assessment: embeddings are a power-user feature for human-in-the-loop discovery, not essential for agent workflows. That's why we made it fully opt-in via

What I'd suggest

The features that would add the most value to your main branch (independent of embeddings):
These are all small, targeted fixes that don't introduce embeddings or external dependencies. Happy to split them into separate PRs if that's easier to review. The cross-repo, channels, processes, and embeddings features are bigger additions that you can evaluate separately. And SIMILAR_TO is complementary to everything we built; the two approaches together would be stronger than either alone.
Hey @Koolerx, thanks for evaluating. I am already working on a performant embedding analysis, adding, on top of the vector search itself, a SEMANTIC_SIMILAR edge, which should also cover cases the SIMILAR_TO edges currently miss (while staying zero-dependency and not adding much latency to indexing). All the other things you mentioned make sense; I just haven't had the time to embed them yet, but will do so once I'm done with my current TODO. I am also thinking about your cross-repo approach. I already had something similar in mind but need to think more about it. Thanks for pushing this, my friend 🙏
…gistration, embeddings, cross-repo infrastructure

Rebased our PR DeusData#162 features onto upstream's latest main (commit 1d30971), which includes MinHash SIMILAR_TO edges, CBM_CACHE_DIR, and major refactoring.

Ported features (building clean on upstream's refactored codebase):

1. FTS5 BM25 search infrastructure:
   - Contentless FTS5 virtual table (nodes_fts) with camelCase token splitting
   - cbm_camel_split() SQLite function: updateCloudClient → 'update Cloud Client'
   - FTS5 backfill in both the full pipeline and the incremental pipeline
   - Incremental reindex now preserves FTS5 (was wiping it to 0 rows)
2. Interface registration in the symbol registry:
   - Added 'Interface' to the label filter in process_def() (pass_definitions.c)
   - Added 'Interface' to the label filter in register_and_link_def() (pass_parallel.c)
   - Fixes: C# class Foo : IBar now creates INHERITS → Interface edges
3. C# base_list extraction:
   - Added 'base_list' to the fallback base_types[] in extract_base_classes()
4. Embeddings infrastructure (opt-in via CBM_EMBEDDING_URL):
   - embeddings table in the SQLite schema
   - cbm_cosine_sim() SQLite function for vector search
   - embedding.c/h: HTTP client, text generation, RRF merge, pipeline integration
   - Auto-generates embeddings during indexing when configured
5. Cross-repo infrastructure:
   - cross_repo.c/h: unified _cross_repo.db builder, cross-repo search, channel matching, trace helper

Not yet ported (follow-up commits):
- MCP tool changes (search_graph query param, generate_embeddings tool, cross-repo tools, get_impact tool)
- Process detection (cbm_store_detect_processes)
- Channel detection (cbm_store_detect_channels)
- C# delegate event subscription (extract_calls.c)
- WRITES expansion (extract_semantic.c)

All upstream features preserved: MinHash SIMILAR_TO, pass_similarity, CBM_CACHE_DIR, TS_FIELD() macro, extracted helpers.
…able FTS5
Completes the rebase by adding the MCP handler layer:
1. Enable FTS5 in SQLite compile flags (-DSQLITE_ENABLE_FTS5).
Without this, CREATE VIRTUAL TABLE USING fts5(...) silently creates
a stub that fails on any query with 'no such module: fts5'.
2. Expose 'query' and 'sort_by' params in search_graph inputSchema.
AI agents can now send natural language queries for BM25 ranked search
instead of regex patterns only.
3. BM25 search path in handle_search_graph.
When 'query' is provided, uses FTS5 MATCH with label-type structural
boosting (Function/Method +10, Route +8, Class +5). Falls back to
regex path when FTS5 is unavailable.
4. FTS5 backfill with contentless delete-all syntax.
Contentless FTS5 tables (content='') require
INSERT INTO table(table) VALUES('delete-all') instead of DELETE FROM.
Falls back to plain names if cbm_camel_split is unavailable.
5. generate_embeddings MCP tool — manual trigger for embedding generation.
6. build_cross_repo_index MCP tool — builds unified _cross_repo.db.
7. trace_cross_repo MCP tool — cross-repo channel flow tracing.
8. Tool dispatch entries for all 3 new tools.
Tested: 'audio stream' on 713-node repo returns 28 ranked results
(useMicStream, startStream, stopStream) instead of 713 unranked results.
…action

The fallback base_types[] approach (find_base_from_children) includes the ':' separator in the extracted text for C# base_list nodes, producing names like ': IExamService' instead of 'IExamService'. The registry lookup fails because no node has a colon-prefixed name.

Fix: add an explicit C# base_list handler that iterates the named children of the base_list node, extracting identifier/generic_name/qualified_name text directly. Strips generic type args (List<int> → List).

Tested: 0 → 5 INHERITS → Interface edges on a C# repo.
@Koolerx I have now implemented my own version of vector search, introducing semantic edges along with it. I also got basically all your things working, excluding for now the "cross repo" and execution flow parts; those I want to think more about. You will be credited as co-author on these changes :)
Summary
40 commits adding major features and fixing critical bugs across the MCP handler, extraction layer, pipeline, store, and Cypher engine. Developed while stress-testing against large enterprise codebases and running real investigation scenarios.
Highlights: C# blast radius analysis (0 → 16 callers), execution flow detection (1 → 300 flows), hybrid BM25+vector semantic search, camelCase token splitting, process deduplication, route deduplication, cross-repo search/trace/impact analysis across 54 repos.
Phase 1: Core Fixes and Features (Commits 1-29)
Bug Fixes (Commits 1-6)
- `trace_call_path` Class → Method resolution — BFS resolves through `DEFINES_METHOD` edges for Class/Interface nodes
- `detect_changes` use-after-free — switched to `strcpy` variants for stack buffer reuse
- `base_list` — INHERITS edges: 210 → 1,588 (7.5x)
- `has_method` node ID — fixed DEFINES_METHOD BFS to use the Class node ID

New Features (Commits 7-17)
- `get_architecture` returns full analysis
- `get_impact` blast radius with risk assessment
- 14-17. Cypher JSON properties, investigation-grade trace, C# delegate/event resolution, C# channel constants
Gap Closure (Commits 18-29)
- `get_impact` Class over Constructor resolution for C#
- `count(DISTINCT)` + SQL injection fix
- `NOT EXISTS` subquery (dead-code detection in <1s)
- `has_property` in trace
- `require()` import extraction

Phase 2: Search Quality (Commits 30-33)
30. Process participation in `search_graph` results

Each BM25 search result now includes the execution flows it participates in.
31. JS/TS constant resolution for Socket.IO channel detection
Resolves `const EVENT = "foo"` references in `socket.emit(EVENT)` patterns. Channels detected: 6 → 17 per repo.

32. Expose BM25
`query` and `sort_by` params in `search_graph` schema

The FTS5 BM25 search path existed but `query` was not declared in the tool's `inputSchema`. AI agents couldn't discover or use it. Now exposed with full documentation.

33. Pure BM25 relevance ranking + camelCase token splitting
- `cbm_camel_split()` SQLite function — `updateCloudClient` → `updateCloudClient update Cloud Client`, enabling word-level BM25 matching
- contentless table (`content=''`) — required for camelCase split tokens to match correctly at query time

Phase 3: Process and Route Quality (Commits 34, 38)
34. Deduplicate entry points +
`[module]` prefix on process labels

`funcA → funcZ` → `[controllers] funcA → funcZ` — instantly navigable among 50+ flows.

38. Route node deduplication — eliminate ghost nodes
Three sources of 3x route duplication: express + hapi extractors both matching same patterns, plus module-level extraction with empty QN. Fixed with
`(method, path)` dedup. 1,665 → 555 routes, 0 ghosts.

Phase 4: Semantic Vector Search (Commits 35-37)
35. Hybrid BM25+vector semantic search via external embedding API
Full semantic search architecture:
- `cbm_cosine_sim()` custom function
- `/v1/embeddings` endpoint
- `generate_embeddings` MCP tool for manual trigger
- `semantic_results` field in `search_graph` output for vector-only matches

Configuration:
`CBM_EMBEDDING_URL`, `CBM_EMBEDDING_MODEL`, `CBM_EMBEDDING_DIMS` env vars. No new dependencies — uses vendored Mongoose (HTTP) and yyjson (JSON).

36. Fix use-after-free in semantic result strings
`yyjson_mut_obj_add_str` (borrows pointer) → `yyjson_mut_obj_add_strcpy` (copies string).

37. Auto-generate embeddings during full indexing
When
`CBM_EMBEDDING_URL` is configured, the pipeline auto-generates embeddings after process and channel detection. Zero-friction: repos indexed while the embedding server is running get embeddings automatically.

Phase 5: Cross-Repo Intelligence (Commits 39-40)
39. Unified cross-repository index
New
`_cross_repo.db` built by scanning all per-project databases:

New MCP tools:
`build_cross_repo_index`, `trace_cross_repo`. Auto-rebuilds after every `index_repository`.

40. Cross-repo search, flow tracing with call chains, and impact analysis
Cross-repo search (
`search_graph` with `project="*"`):

Hybrid BM25+vector search across all 54 repos in a single call. Returns results with both the short project name and the full project_id for follow-up queries.
Enhanced
`trace_cross_repo` with call chains:

When a channel filter is provided, traces depth-2 upstream callers of the emitter and depth-2 downstream callees of the listener. Handles Class→Method resolution and
`(file-level)` listener fallback via channels table lookup.

Cross-repo impact analysis (
`get_impact` with `cross_repo=true`):

After the per-repo BFS, checks whether d=1 impacted symbols emit channels to other repos. For each affected channel, opens the consumer project DB, traces downstream from the listener, and returns
`cross_repo_impacts` array.

Testing
All 40 commits compile clean with
`-Wall -Wextra -Werror`. 2,586 existing tests pass. Stress-tested against:

New Files
- `src/pipeline/embedding.c` / `embedding.h` — semantic embedding generation + RRF merge
- `src/store/cross_repo.c` / `cross_repo.h` — cross-repo index, search, channel matching, trace helper

Configuration (new env vars)
- `CBM_EMBEDDING_URL` — `/v1/embeddings` endpoint
- `CBM_EMBEDDING_MODEL` — default `nomic-embed-text`
- `CBM_EMBEDDING_DIMS` — default `768`
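For reference, opting in might look like this in a shell. The URL is an assumption (a local Ollama server exposing its OpenAI-compatible API); the other values mirror the defaults listed above. When these variables are unset, everything stays pure BM25.

```shell
# Opt-in semantic search; all values illustrative.
# URL assumes Ollama's default port and OpenAI-compatible endpoint.
export CBM_EMBEDDING_URL="http://localhost:11434/v1/embeddings"
export CBM_EMBEDDING_MODEL="nomic-embed-text"
export CBM_EMBEDDING_DIMS=768
```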