
feat: engraph v2.0 — hybrid search, smart chunking, vault profiles (#1)

Merged
devwhodevs merged 11 commits into main from feature/v2.0
Mar 24, 2026


@devwhodevs
Owner

Summary

engraph v2.0 upgrades the search engine from pure semantic search to a hybrid system with significantly improved relevance and new vault management capabilities.

What's new

  • Smart chunking — Break-point scoring algorithm replaces heading-only splitting. Finds optimal split points at headings, code fences, blank lines, and thematic breaks. Code fence protection prevents splitting inside code blocks.
  • Docid system — Every indexed file gets a deterministic 6-char hex ID (#abc123) shown in search results for quick reference.
  • FTS5 search lane — SQLite FTS5 full-text search runs alongside semantic search. BM25-ranked keyword matching catches exact terms (ticket IDs, names, dates) that embeddings miss.
  • RRF fusion engine — Reciprocal Rank Fusion merges semantic + FTS5 results with configurable lane weights. --explain flag shows per-lane score breakdown.
  • Vault profiles: engraph init auto-detects vault structure (PARA/Folders/Flat), type (Obsidian/Logseq/Plain), wikilinks, frontmatter, and tags. Writes vault.toml for future configuration.
  • Pluggable model layer: a ModelBackend trait enables future model swapping. engraph models list/info handle model management. Registry ships with known-good models.

New CLI commands

engraph init [path]          # Auto-detect vault and generate vault.toml
engraph configure            # Interactive configuration (placeholder for v2.1)
engraph models list          # List available embedding models
engraph models info <name>   # Show model details
engraph search --explain     # Show per-lane RRF score breakdown

Architecture changes

  • 7 modules → 11 modules (docid, fts, fusion, model, profile added)
  • SQLite schema adds docid column on files, chunks_fts FTS5 virtual table
  • Search pipeline: query → [semantic lane, FTS lane] → RRF fusion → ranked results
  • Automatic schema migration for existing databases

Test plan

  • 91 unit tests passing (cargo test --lib)
  • Clippy clean (cargo clippy -- -D warnings)
  • Formatted (cargo fmt --check)
  • Integration tests compile (cargo test --test integration --no-run)
  • Manual: engraph index ~/vault && engraph search "test query" --explain
  • Manual: engraph init ~/vault on an Obsidian vault
  • Manual: engraph models list
  • Verify existing databases migrate seamlessly (docid column, FTS5 table)

🤖 Generated with Claude Code

devwhodevs and others added 11 commits March 24, 2026 19:22
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Port qmd's scored break-point chunking to Rust. Replaces heading-only
splitting with a scoring system that finds optimal break points near
the token target. Code fence protection prevents splitting inside
code blocks. 15% overlap for context continuity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
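
The break-point scoring described in the commit above can be sketched roughly as follows. This is a minimal illustration, not engraph's actual code: the function names, the base scores, the linear distance penalty, and the use of byte offsets as a stand-in for the token target are all assumptions made for the example.

```rust
// Three backticks, written with escapes so the literal does not close this block.
const FENCE: &str = "\u{60}\u{60}\u{60}";

/// Hypothetical base score for breaking *before* `line`; None means "not a break point".
fn break_score(line: &str) -> Option<f64> {
    let t = line.trim();
    let is_heading = t.starts_with('#') && t.trim_start_matches('#').starts_with(' ');
    if is_heading {
        Some(100.0) // heading
    } else if t.starts_with(FENCE) {
        Some(80.0) // opening code fence (closing fences are rejected by the caller)
    } else if t == "---" || t == "***" {
        Some(60.0) // thematic break
    } else if t.is_empty() {
        Some(20.0) // blank line
    } else {
        None
    }
}

/// Returns the byte offset of the best-scoring line start within `window` bytes
/// of `target`, never choosing a position inside a fenced code block.
fn best_break(text: &str, target: usize, window: usize) -> Option<usize> {
    let mut in_fence = false;
    let mut offset = 0usize;
    let mut best: Option<(f64, usize)> = None;
    for line in text.split_inclusive('\n') {
        let dist = offset.abs_diff(target);
        if !in_fence && offset > 0 && dist <= window {
            if let Some(base) = break_score(line) {
                // Assumed penalty: lose up to 70% of the base score at the window edge.
                let score = base * (1.0 - 0.7 * dist as f64 / window.max(1) as f64);
                if best.map_or(true, |(s, _)| score > s) {
                    best = Some((score, offset));
                }
            }
        }
        // Toggle fence state after scoring, so a break before an opening fence is
        // allowed while a break before a closing fence is not.
        if line.trim_start().starts_with(FENCE) {
            in_fence = !in_fence;
        }
        offset += line.len();
    }
    best.map(|(_, off)| off)
}
```

A caller would pass the offset where the current chunk reaches its size target and split at the returned position. When every candidate in the window lies inside a code block, the function returns None and the caller has to widen the window or fall back to another rule.
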
Every indexed file gets a deterministic #docid (SHA-256 of path,
truncated to 6 hex chars). Shown in search results. Supports
direct lookup via 'engraph get #abc123'.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
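
For illustration, a deterministic short ID can be derived from a path as sketched below. The commit says engraph uses SHA-256; to keep the example dependency-free, this sketch swaps in 64-bit FNV-1a, so the IDs it produces will not match engraph's. The function names are hypothetical.

```rust
/// 64-bit FNV-1a. A stand-in hash for this sketch only; engraph is described
/// as using SHA-256 of the path instead.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    h
}

/// Truncates the hash to its top 24 bits and renders them as 6 lowercase hex chars.
fn docid(path: &str) -> String {
    format!("{:06x}", fnv1a64(path.as_bytes()) >> 40)
}
```

Because the ID depends only on the path, re-indexing the same file yields the same ID. Six hex characters cover about 16.7 million values, so any scheme of this size needs some way to handle the occasional collision in large vaults.
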
Add SQLite FTS5 virtual table as second search lane. Populated
during indexing alongside vector embeddings. Supports exact keyword
matches for ticket IDs, names, and dates that semantic search misses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reciprocal Rank Fusion combines ranked results from HNSW semantic
search and FTS5 keyword search. Supports lane weighting and
--explain flag showing per-lane score contributions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
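
A weighted Reciprocal Rank Fusion step with a per-lane breakdown might look like the sketch below. It uses the standard RRF formula, weight / (k + rank) with 1-based ranks. The type and function names are hypothetical, and the value of k is left to the caller because the PR does not state the one engraph uses.

```rust
use std::collections::HashMap;

/// One ranked result list, best match first (e.g. the semantic or the FTS5 lane).
struct Lane<'a> {
    name: &'a str,
    weight: f64,
    ranked: &'a [&'a str],
}

/// A fused result with the contribution of each lane kept for explanation output.
#[derive(Debug, Clone, PartialEq)]
struct Fused {
    doc: String,
    score: f64,
    per_lane: Vec<(String, f64)>,
}

/// Weighted RRF: each lane contributes weight / (k + rank) for every document it returns.
fn rrf_fuse(lanes: &[Lane], k: f64) -> Vec<Fused> {
    let mut acc: HashMap<&str, Vec<(String, f64)>> = HashMap::new();
    for lane in lanes {
        for (i, doc) in lane.ranked.iter().enumerate() {
            let contribution = lane.weight / (k + (i + 1) as f64);
            acc.entry(*doc)
                .or_default()
                .push((lane.name.to_string(), contribution));
        }
    }
    let mut out: Vec<Fused> = acc
        .into_iter()
        .map(|(doc, per_lane)| Fused {
            doc: doc.to_string(),
            score: per_lane.iter().map(|(_, s)| s).sum(),
            per_lane,
        })
        .collect();
    // Highest fused score first; ties are broken by document name for stable output.
    out.sort_by(|a, b| b.score.total_cmp(&a.score).then_with(|| a.doc.cmp(&b.doc)));
    out
}
```

With lanes `["a", "b", "c"]` and `["c", "a"]` at equal weight and k = 60, the fused order is a, c, b. Doubling the weight of the second lane moves c ahead of a, which is the kind of shift the per-lane breakdown makes visible.
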
Auto-detects vault structure (PARA, folders, flat), wikilinks,
frontmatter, tags. Generates vault.toml with detected settings.
Interactive configure command for guided customization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
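
As a rough idea of what vault detection could involve, the sketch below classifies a vault from a list of relative file paths. The heuristics are assumptions chosen for the example (an `.obsidian/` directory marks Obsidian, a `logseq/` directory marks Logseq, and at least three PARA-style top-level folder names mark PARA). The PR does not describe engraph's actual rules, and all names here are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum VaultType {
    Obsidian,
    Logseq,
    Plain,
}

#[derive(Debug, PartialEq)]
enum Structure {
    Para,
    Folders,
    Flat,
}

/// Assumed heuristic: look for each tool's configuration directory at the vault root.
fn detect_type(paths: &[&str]) -> VaultType {
    if paths.iter().any(|p| p.starts_with(".obsidian/")) {
        VaultType::Obsidian
    } else if paths.iter().any(|p| p.starts_with("logseq/")) {
        VaultType::Logseq
    } else {
        VaultType::Plain
    }
}

/// Assumed heuristic: no visible top-level folders means Flat; three or more
/// PARA-style names means Para; anything else is plain Folders.
fn detect_structure(paths: &[&str]) -> Structure {
    let mut dirs: Vec<String> = paths
        .iter()
        .copied()
        .filter_map(|p| p.split_once('/').map(|(first, _)| first.to_lowercase()))
        .filter(|d| !d.starts_with('.')) // ignore hidden config directories
        .collect();
    dirs.sort();
    dirs.dedup();
    if dirs.is_empty() {
        return Structure::Flat;
    }
    let para_hits = ["projects", "areas", "resources", "archive"]
        .iter()
        .filter(|name| dirs.iter().any(|d| d.contains(**name)))
        .count();
    if para_hits >= 3 {
        Structure::Para
    } else {
        Structure::Folders
    }
}
```

Matching on substrings lets numbered folders such as `1-Projects` count as PARA, at the cost of occasional false positives. A real detector would likely walk the directory tree and also sample file contents for wikilinks, frontmatter, and tags.
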
Extract embedding into a ModelBackend trait. Existing ONNX embedder
implements the trait. Users can configure models in vault.toml and
manage them via 'engraph models list/info'. Registry ships with
known-good models. Prepare for future GGUF adapter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
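
To show the shape such an abstraction can take, here is a hypothetical version of a backend trait with a toy implementation. Only the trait name ModelBackend comes from the PR; the method names, signatures, error type, and the byte-histogram "model" are assumptions for illustration and do not reflect engraph's ONNX embedder.

```rust
/// Hypothetical shape of the trait; the real method set is not shown in the PR.
trait ModelBackend {
    fn name(&self) -> &str;
    fn dimensions(&self) -> usize;
    fn embed(&self, texts: &[&str]) -> Result<Vec<Vec<f32>>, String>;
}

/// Toy backend: a normalized byte histogram. It stands in for a real model so the
/// example runs without any model files.
struct ToyBackend {
    dims: usize,
}

impl ModelBackend for ToyBackend {
    fn name(&self) -> &str {
        "toy-byte-histogram"
    }

    fn dimensions(&self) -> usize {
        self.dims
    }

    fn embed(&self, texts: &[&str]) -> Result<Vec<Vec<f32>>, String> {
        if self.dims == 0 {
            return Err("dimensions must be non-zero".into());
        }
        Ok(texts
            .iter()
            .map(|t| {
                let mut v = vec![0.0f32; self.dims];
                for b in t.bytes() {
                    v[b as usize % self.dims] += 1.0;
                }
                // L2-normalize so cosine similarity reduces to a dot product.
                let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
                if norm > 0.0 {
                    for x in &mut v {
                        *x /= norm;
                    }
                }
                v
            })
            .collect())
    }
}

/// Callers depend only on the trait, so another backend can be swapped in later.
fn embed_query(backend: &dyn ModelBackend, query: &str) -> Result<Vec<f32>, String> {
    let mut out = backend.embed(&[query])?;
    out.pop()
        .ok_or_else(|| format!("{} returned no embedding", backend.name()))
}
```

Because `embed_query` takes `&dyn ModelBackend`, the indexing and search code would not need to change if a different adapter were added behind the same trait.
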
Version bump and integration for engraph v2.0:
- Smart chunking with break-point scoring (replaces heading-only splitting)
- 6-char docid system for quick file reference
- FTS5 full-text search lane (BM25 keyword matching)
- RRF fusion engine merging semantic + FTS5 results
- Vault profile auto-detection (PARA/Folders/Flat, Obsidian/Logseq)
- Pluggable ModelBackend trait for future model swapping
- All code formatted, clippy clean, 91 tests passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Inline string literal into format string to avoid print_literal
warning on newer clippy versions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rder

Two bugs found during real vault testing:

1. Smart chunker panicked on multi-byte UTF-8 chars (em dash, etc.)
   when byte offsets from break-point scoring landed inside multi-byte
   sequences. Fixed by snapping all byte offsets to valid char
   boundaries before slicing.

2. Schema migration failed on existing v0.1 databases: the SCHEMA
   constant tried to CREATE INDEX on docid column before migration
   added it. Moved index creation into the migration path so it
   runs after the column exists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
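
The first fix above, snapping byte offsets to valid char boundaries before slicing, can be illustrated with a small helper built on the standard `str::is_char_boundary`. The helper name is hypothetical, and rounding down (rather than up) is an assumption, since the commit does not say which direction engraph snaps.

```rust
/// Moves `idx` back to the nearest UTF-8 char boundary at or before it.
/// Offsets past the end of the string are clamped to `s.len()`.
fn snap_to_char_boundary(s: &str, mut idx: usize) -> usize {
    if idx >= s.len() {
        return s.len();
    }
    // Offset 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    idx
}

/// Slices `s[..end]` without panicking, even if `end` lands inside a multi-byte char.
fn safe_prefix(s: &str, end: usize) -> &str {
    &s[..snap_to_char_boundary(s, end)]
}
```

For the string `"a\u{2014}b"`, the em dash occupies bytes 1 through 3, so slicing at byte 2 directly would panic, while `safe_prefix` returns `"a"`.
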
@devwhodevs devwhodevs merged commit 58c7ca9 into main Mar 24, 2026
3 checks passed