1 change: 1 addition & 0 deletions .gitignore
@@ -2,3 +2,4 @@
*.swp
*.swo
.DS_Store
.worktrees/
36 changes: 22 additions & 14 deletions CLAUDE.md
@@ -4,28 +4,35 @@ Local semantic search CLI for Obsidian vaults. Rust, MIT licensed.

## Architecture

Single binary with 7 modules behind a lib crate:

- `config.rs` — loads `~/.engraph/config.toml`, merges CLI args, provides `data_dir()`
- `chunker.rs` — splits markdown by `##` headings, strips YAML frontmatter, extracts tags. `split_oversized_chunks()` handles token-aware sub-splitting with overlap
- `embedder.rs` — downloads and runs `all-MiniLM-L6-v2` ONNX model (384-dim). SHA256-verified on download. Uses `ort` for inference, `tokenizers` for tokenization
- `store.rs` — SQLite persistence. Tables: `meta`, `files`, `chunks` (with vector BLOBs), `tombstones`. Handles incremental diffing via content hashes
Single binary with 11 modules behind a lib crate:

- `config.rs` — loads `~/.engraph/config.toml` and `vault.toml`, merges CLI args, provides `data_dir()`
- `chunker.rs` — smart chunking with break-point scoring algorithm. Finds optimal split points considering headings, code fences, blank lines, and thematic breaks. `split_oversized_chunks()` handles token-aware secondary splitting with overlap
- `docid.rs` — deterministic 6-char hex IDs for files (SHA-256 of path, truncated). Shown in search results for quick reference
- `embedder.rs` — downloads and runs `all-MiniLM-L6-v2` ONNX model (384-dim). SHA256-verified on download. Uses `ort` for inference, `tokenizers` for tokenization. Implements `ModelBackend` trait
- `model.rs` — pluggable `ModelBackend` trait, model registry, and `parse_model_spec()`. Enables future model swapping without changing consumer code
- `fts.rs` — FTS5 full-text search support. Re-exports `FtsResult` from store. BM25-ranked keyword search
- `fusion.rs` — Reciprocal Rank Fusion (RRF) engine. Merges semantic + FTS5 results. Supports lane weighting and `--explain` output
- `profile.rs` — vault profile detection. Auto-detects PARA/Folders/Flat structure, vault type (Obsidian/Logseq/Plain), wikilinks, frontmatter, tags. Writes/loads `vault.toml`
- `store.rs` — SQLite persistence. Tables: `meta`, `files` (with docid), `chunks` (with vector BLOBs), `chunks_fts` (FTS5 virtual table), `tombstones`. Handles incremental diffing via content hashes
- `hnsw.rs` — thin wrapper around `hnsw_rs`. **Important:** `hnsw_rs` does not support inserting after `load_hnsw()`. The index is rebuilt from vectors stored in SQLite on every index run
- `indexer.rs` — orchestrates vault walking (via `ignore` crate for `.gitignore` support), diffing, chunking, embedding (Rayon for parallel chunking, serial embedding since `Embedder` is not `Send`), and serial writes to store + HNSW
- `search.rs` — embeds query, searches HNSW with tombstone filtering, formats results (human + JSON). Also handles `status` formatting
- `indexer.rs` — orchestrates vault walking (via `ignore` crate for `.gitignore` support), diffing, chunking, embedding (Rayon for parallel chunking, serial embedding since `Embedder` is not `Send`), and serial writes to store + HNSW + FTS5
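
The fusion step described for `fusion.rs` can be sketched as weighted Reciprocal Rank Fusion. This is a minimal illustration and not engraph's actual code: the `Lane` type, the `rrf_fuse` function, and the constant `k` (60 is a common default for RRF) are all assumptions introduced here.

```rust
use std::collections::HashMap;

/// One ranked result list ("lane"), best hit first, with a lane weight.
/// Hypothetical type for illustration; not engraph's actual API.
pub struct Lane<'a> {
    pub weight: f64,
    pub ranked_ids: &'a [&'a str],
}

/// Weighted Reciprocal Rank Fusion:
/// score(d) = sum over lanes of weight / (k + rank), with rank starting at 1.
pub fn rrf_fuse(lanes: &[Lane<'_>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for lane in lanes {
        for (i, id) in lane.ranked_ids.iter().enumerate() {
            let rank = (i + 1) as f64;
            *scores.entry(*id).or_insert(0.0) += lane.weight / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, score)| (id.to_string(), score))
        .collect();
    // Highest score first; ties broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

A document that appears in both lanes accumulates a contribution from each, so it tends to outrank documents found by only one lane. Raising a lane's `weight` shifts the fused order toward that lane.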

`main.rs` is a thin clap CLI that wires the modules together.
`main.rs` is a thin clap CLI that wires the modules together. Subcommands: `index`, `search` (with `--explain`), `status`, `clear`, `init`, `configure`, `models`.

## Key patterns

- **Incremental indexing:** `diff_vault()` compares file content hashes in SQLite against disk. Changed files have their old chunks deleted (cascade), then are re-embedded as new
- **Hybrid search:** Queries run through two lanes — semantic (HNSW embeddings) and keyword (FTS5 BM25). Results are fused via Reciprocal Rank Fusion (RRF) with configurable lane weights
- **Smart chunking:** Break-point scoring algorithm assigns scores to potential split points (headings 50-100, code fences 80, thematic breaks 60, blank lines 20). Chunks split at the highest-scored break point near the token target. Code fence protection prevents splitting inside code blocks
- **Incremental indexing:** `diff_vault()` compares file content hashes in SQLite against disk. Changed files have their old chunks deleted (cascade), then are re-embedded as new. FTS5 entries are cleaned up alongside vector entries
- **HNSW rebuild on every run:** Vectors are stored as BLOBs in the `chunks` table. After SQLite is updated, the full HNSW index is rebuilt from `store.get_all_vectors()`. This is necessary because `hnsw_rs` doesn't support append-after-load
- **Vector IDs:** Assigned sequentially, stored in both SQLite and HNSW. `next_vector_id` is derived from `MAX(vector_id)` in SQLite
- **Tombstones:** Exist in the schema but are largely unused now that we rebuild HNSW each run. Kept for future use if switching to a vector store that supports deletion
- **Docids:** Each file gets a deterministic 6-char hex ID (SHA-256 of relative path). Displayed in search results for quick reference
- **Vault profiles:** `engraph init` auto-detects vault structure and writes `vault.toml`
- **Pluggable models:** `ModelBackend` trait enables future model swapping. Current implementation uses ONNX all-MiniLM-L6-v2
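
The smart-chunking pattern above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation in `chunker.rs`: the function names are hypothetical, the heading scores are assumed to run from 100 for `#` down to 50 for `######`, and distance to the target is measured in lines rather than tokens.

```rust
/// Score a non-fence line as a potential break point (hypothetical helper).
/// Returns None when the line is not a candidate.
fn break_score(line: &str) -> Option<u32> {
    let t = line.trim();
    if t.is_empty() {
        return Some(20); // blank line
    }
    if t == "---" || t == "***" || t == "___" {
        return Some(60); // thematic break
    }
    let hashes = t.chars().take_while(|&c| c == '#').count();
    if (1..=6).contains(&hashes) && t[hashes..].starts_with(' ') {
        // Assumed mapping: h1 = 100, h2 = 90, ..., h6 = 50.
        return Some(100 - 10 * (hashes as u32 - 1));
    }
    None
}

/// Pick the line index at which the next chunk should start: the
/// highest-scored break point within `window` lines of `target`,
/// never inside a fenced code block. Ties go to the closest candidate.
pub fn best_split(lines: &[&str], target: usize, window: usize) -> Option<usize> {
    let mut in_fence = false;
    let mut best: Option<(u32, usize, usize)> = None; // (score, distance, index)
    for (i, line) in lines.iter().enumerate() {
        let is_fence = line.trim_start().starts_with("```");
        let candidate = if is_fence {
            // Splitting before an opening fence is fine; splitting before a
            // closing fence would cut the code block in two.
            let opening = !in_fence;
            in_fence = !in_fence;
            if opening { Some(80) } else { None }
        } else if in_fence {
            None // code fence protection
        } else {
            break_score(line)
        };
        let distance = i.abs_diff(target);
        if i == 0 || distance > window {
            continue;
        }
        if let Some(score) = candidate {
            let better = match best {
                None => true,
                Some((s, d, _)) => score > s || (score == s && distance < d),
            };
            if better {
                best = Some((score, distance, i));
            }
        }
    }
    best.map(|(_, _, i)| i)
}
```

Note that a line such as `## not a heading` inside a fenced block is never scored, because the fence state is tracked before any scoring happens.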

## Data directory

`~/.engraph/` — hardcoded via `Config::data_dir()` (uses `dirs::home_dir()`). Contains `engraph.db` (SQLite), `hnsw/` (index files), `models/` (ONNX model + tokenizer).
`~/.engraph/` — hardcoded via `Config::data_dir()` (uses `dirs::home_dir()`). Contains `engraph.db` (SQLite with FTS5), `hnsw/` (index files), `models/` (ONNX model + tokenizer), `vault.toml` (vault profile), `config.toml` (user config).

Single vault only. Re-indexing a different vault path triggers a confirmation prompt.

@@ -35,10 +42,11 @@ Single vault only. Re-indexing a different vault path triggers a confirmation pr
- `hnsw_rs` (0.3) — pure Rust HNSW. `Box::leak` used in `load()` to satisfy `'static` lifetime on the loaded index. Read-only after load
- `tokenizers` (0.22) — HuggingFace tokenizer. Needs `fancy-regex` feature
- `ignore` (0.4) — vault walking with automatic `.gitignore` support
- `rusqlite` (0.32) — bundled SQLite with FTS5 support

## Testing

- Unit tests in each module (`cargo test --lib`) — 44 tests, no network required
- Unit tests in each module (`cargo test --lib`) — 91 tests, no network required
- 1 ignored smoke test (`test_embed_smoke`) — downloads ONNX model, verifies embedding
- Integration tests (`cargo test --test integration -- --ignored`) — 8 tests, require model download. Use `tempfile` for isolated data dirs

2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "engraph"
version = "0.1.0"
version = "0.2.0"
edition = "2024"
description = "Local semantic search for Obsidian vaults"
license = "MIT"