Status: draft 2026-05-03 — awaiting answers to Q1–Q10. Once the open questions are resolved, each per-sub-phase section gets refined and implementation kicks off at sub-phase 8a.
TL;DR. Add an FTS5-style inverted index + BM25 ranking to SQLRite so users can do keyword search and (more importantly) hybrid retrieval — combining BM25 lexical scores with vector similarity (Phase 7d's HNSW) for RAG. Stay proportional: ~700-900 LOC of new engine code spread over six small sub-phases. Reuses the integration shape Phase 7d laid down for HNSW (CREATE INDEX … USING …, optimizer probe shortcut, persistent cell type, dedicated page tree).
Phase 7's Q1 deferred FTS to Phase 8. Two reasons to revisit it now:
- Hybrid retrieval is the modern RAG standard. Vector-only retrieval (Phase 7d) misses keyword-grounded queries — the user's literal query terms matter when they're rare or technical. Lexical + semantic combined consistently beats either alone. Without BM25 we're shipping half a RAG stack.
- The integration shape is already proven. Phase 7d's HNSW work taught us the pattern:
IndexMethodenum arm →try_X_probeoptimizer hook → dedicated cell-kind tag → standalone algorithm module behind a feature flag. FTS plugs into the same surface; the design risk is low.
What FTS gives users:
-- Build a full-text index on a TEXT column.
CREATE INDEX docs_fts ON docs USING fts(body);
-- Keyword search, ranked by BM25 (top-k by relevance).
SELECT id, title FROM docs
WHERE fts_match(body, 'rust embedded database')
ORDER BY bm25_score(body, 'rust embedded database') DESC
LIMIT 10;
-- Hybrid retrieval — combine lexical + vector scores at any weighting.
-- vec_distance_cosine returns DISTANCE (lower = better), so we invert it.
SELECT id, title FROM docs
WHERE fts_match(body, 'rust embedded database')
ORDER BY
0.5 * bm25_score(body, 'rust embedded database')
+ 0.5 * (1.0 - vec_distance_cosine(embedding, [0.12, -0.04, /* ... */]))
DESC
LIMIT 10;Six chunks. Each ends with a working build, full test suite, and a commit on main. The first three are the load-bearing 7d-shaped trio (algorithm → SQL → persistence); the last three are surfacing + polish.
New src/sql/fts/ module with three free-standing pieces, no SQL integration yet:
tokenizer.rs— splits text into terms. ASCII for the MVP (split on non-alphanumeric, lowercase). ~50 LOC.bm25.rs— given term frequencies + document length + corpus stats, produce a relevance score. Standard BM25 formula withk1=1.5,b=0.75(the SQLite FTS5 defaults, fixed for the MVP). ~80 LOC.posting_list.rs— in-memory inverted index:BTreeMap<term, BTreeMap<rowid, term_freq>>plus per-document length cache. Add / remove / query operations. ~120 LOC.
Tests: tokenizer round-trips, BM25 numeric reproducibility against a hand-computed reference, posting-list correctness on 1000-doc synthetic corpus.
Mirrors the shape of src/sql/hnsw.rs (Phase 7d.1). Pure algorithm, no engine dep, easy to test in isolation.
Wire 8a into the executor. New surfaces:
CREATE INDEX <name> ON <table> USING fts(<col>)— adds anIndexMethod::Ftsarm to the existing dispatcher (src/sql/executor.rslines 383–388 + 482–485).fts_match(col, 'query')scalar — Boolean predicate, true if the row's column matches any of the query terms in the FTS index. Wired ineval_function(src/sql/executor.rs:1176–1216) alongsidevec_distance_*andjson_extract.bm25_score(col, 'query')scalar — returns the per-row BM25 relevance score asValue::Real. Same dispatch.try_fts_probeoptimizer hook — recognizesWHERE fts_match(col, 'q') ORDER BY bm25_score(col, 'q') LIMIT kand serves it from the FTS index instead of full-scanning. Mirrorstry_hnsw_probe(src/sql/executor.rs:757–838).- INSERT / UPDATE / DELETE wiring — incremental updates on INSERT (single-row append, cheap); DELETE / UPDATE mark the index
needs_rebuildand rebuild from current rows on next save (same as HNSW per Q7).
Tests: end-to-end CREATE INDEX → INSERT → fts_match in WHERE → bm25_score in ORDER BY → LIMIT k → expected rows in expected order, on a 100-doc corpus.
Make FTS indexes survive cargo build && cargo run --quiet -- foo.sqlrite reopen.
- New
KIND_FTS_POSTING: u8 = 0x06cell tag insrc/sql/pager/cell.rs:56–74. - New
src/sql/pager/fts_cell.rs— encodes/decodes posting-list cells. One cell per (term, postings) pair; long posting lists chain via overflow cells. Modeled afterhnsw_cell.rs:7–70. - Each FTS index gets its own page tree parallel to secondary indexes + HNSW indexes.
- Open path: load cells directly into the in-memory inverted index — bit-for-bit reproduction, no algorithm runs.
- File-format version bump: on-demand per Q10. Existing v4 databases with no FTS indexes keep working as v4; first FTS-index creation rewrites page 0 to v5.
Tests: round-trip integrity (build index → save → reopen → query) under realistic-sized corpus (1k docs); concurrent reader doesn't see partial state.
The hybrid story from Q8: arithmetic composition over the existing bm25_score + vec_distance_cosine functions. No new SQL function needed — the example at the top of this doc works out of the box once 8b lands.
Worked example in examples/hybrid-retrieval/:
- Small corpus (the SQLRite
docs/directory itself, indexed by sentence) + a tiny embedding model (or pre-baked vectors). - A handful of queries showing pure-BM25, pure-vector, and 50/50 hybrid retrieval.
- A short README walking through what the LLM-era takeaway is — when each shape wins.
Adds the bm25_search tool to sqlrite-mcp (mirroring vector_search from Phase 7h). Per Q9 — surfaces FTS prominently to LLM agents driving SQLRite over MCP.
Schema: { table, column, query, k?, metric? }. Returns rows in descending BM25 order. Uses the optimizer hook from 8b.
docs/supported-sql.md— new section under CREATE INDEX forUSING fts; new entries under "Functions" forfts_matchandbm25_score.docs/architecture.md— addsrc/sql/fts/to the engine module map.docs/file-format.md— document theKIND_FTS_POSTINGcell layout + the v4→v5 bump.docs/sql-engine.md— add the FTS optimizer hook alongside the HNSW one.docs/smoke-test.md— add a CREATE INDEX … USING fts step + a hybrid-retrieval round-trip.- New
docs/fts.md— the canonical reference for FTS / BM25 / hybrid retrieval, mirroringdocs/ask.md's shape.
The same shape Phase 7's plan used. Each Q has a recommendation; the user resolves before coding starts.
SQLite's FTS5 uses WHERE col MATCH 'query'. sqlparser 0.61's SQLite dialect doesn't expose a BinaryOperator::Match variant for us to dispatch on, so making MATCH work in SQLRite means either custom post-parse rewriting or patching sqlparser.
| Option | Pros | Cons |
|---|---|---|
A. Function call fts_match(col, 'q') |
Zero parser changes. Composes with everything else in WHERE. Just another scalar fn in the existing dispatcher. | Doesn't look like SQLite's MATCH. |
| B. Custom pre-parse rewrite | Looks like SQLite. | Have to write a hand-rolled scanner over user input — fragile, and fights against sqlparser's role. |
| C. Patch sqlparser | Looks like SQLite. | Forks sqlparser or waits on upstream PR; either is too big a tax for this. |
Recommendation: A. Function-call shape ships clean and doesn't compromise the rest of the engine. Documented limitation: "SQLRite's FTS uses fts_match(col, 'q') instead of SQLite's col MATCH 'q' because our SQL parser doesn't expose the MATCH operator." If sqlparser adds the variant later, we add Option B as syntactic sugar in a follow-up.
SQLite's FTS5: CREATE VIRTUAL TABLE docs USING fts5(title, body); — one virtual table indexes multiple columns of the underlying data, and MATCH 'q' searches all of them.
| Option | Pros | Cons |
|---|---|---|
| A. One index per column | Mirrors HNSW (one index per vector column). Simpler optimizer hook. Composes via OR: WHERE fts_match(title, 'q') OR fts_match(body, 'q'). |
Two indexes' worth of disk + memory for the common "search title or body" case. Composing scores across columns is on the user. |
| B. Multi-column index | Matches SQLite. Saves disk for the common case. | Bigger persistence change; per-column field-weighting (FTS5's bm25(docs_fts, 10.0, 1.0)) is a whole sub-feature. |
Recommendation: A. Start single-column, add multi-column as a follow-up if real users ask. Documented in docs/fts.md.
- ASCII (split on
[^A-Za-z0-9]+, lowercase): ~50 LOC, no deps. Misses CJK, mishandles accented Latin (café≠cafe). - Unicode (use the
unicode-segmentationcrate's word-boundary splitter): correct everywhere. Adds a small dep (~30 KB compiled).
Recommendation: ASCII for MVP. Most RAG corpora are English / ASCII-Latin enough that this isn't blocking. Document the limitation prominently. Phase 8.1 can add Unicode tokenization with a unicode = ["dep:unicode-segmentation"] cargo feature.
- No stemming (default): "running" and "run" are different terms. RAG queries typically rely on exact lexical matches anyway — stemming actively hurts technical-term retrieval ("python" the language vs "pythons" the snakes is a real concern in tech docs).
- Snowball-style stemming: ~200 LOC + dep, English-only without more work.
Recommendation: no stemming. Modern RAG approaches use raw tokens; semantic matching goes through the vector half.
- No stop list: keeps every token, BM25's IDF naturally downweights common terms ("the" gets near-zero weight in any non-trivial corpus).
- English stop list: smaller index, faster queries, locks out non-English text.
Recommendation: no stop list. BM25's math is the right tool; stop lists are a pre-BM25 hack.
The optimizer hook needs to handle a WHERE that combines an FTS predicate with other conditions.
| Option | Pros | Cons |
|---|---|---|
| A. FTS pre-filter, scalar-eval the rest | Simple. Mirrors how HNSW + WHERE works today. | Misses index-on-status combinations (would do an FTS scan + sequential filter even if status is indexed). |
| B. Multi-index intersection | Faster on selective filters. | Complex optimizer + more failure modes. |
Recommendation: A. Same approach as Phase 7d's HNSW + WHERE composition. The "selective non-FTS filter" case is a tractable Phase 9 optimization.
Phase 7d's HNSW marks the index needs_rebuild on DELETE / UPDATE and rebuilds from current rows at next save.
| Option | Pros | Cons |
|---|---|---|
A. Mark needs_rebuild, rebuild at save |
Same code shape as HNSW — small reuse. Simple, correct. | Slow on big tables; DELETEs in transaction → rebuild at COMMIT could be a noticeable pause. |
| B. Incremental remove + tombstone list | Fast updates. | More state to track + more bug surface; tombstone compaction is a real follow-up. |
Recommendation: A. Match HNSW's shape so the engine has one rebuild story across both index types. Document the cost; B is a Phase 8.1 if anyone hits the perf wall.
- Arithmetic (recommended):
ORDER BY 0.5 * bm25_score(col, 'q') + 0.5 * (1.0 - vec_distance_cosine(vec, [...])) DESC LIMIT k— works the moment 8b lands; no new function needed. - Typed function:
hybrid_score(0.5, bm25_score(...), 0.5, vec_distance_cosine(...))— semantic clarity, but it's just sugar over arithmetic.
Recommendation: arithmetic. Document the pattern in docs/fts.md with the 1.0 - vec_distance_cosine inversion gotcha called out. The first-class typed function adds API weight without earning its keep — composability via raw arithmetic is more flexible (different aggregations, three-way fusion if a user later wires in a different score, etc.).
Phase 7h's spec listed bm25_search as a future tool gated behind FTS. Now that FTS is shipping:
- Yes: ~50 LOC of glue (mirrors
vector_search). Surfaces FTS prominently to LLM agents driving SQLRite over MCP. Tool description teaches the LLM about lexical retrieval as an option alongsidevector_search+query. - No: LLM uses
querytool withWHERE fts_match(col, 'q'). Works fine.
Recommendation: yes. Symmetric with vector_search; LLM clients benefit from one less affordance to fish for.
The disk format is currently v4 (bumped in 7a for VECTOR). Adding KIND_FTS_POSTING requires another bump.
| Option | Pros | Cons |
|---|---|---|
| A. On-demand bump (only when first FTS index is created) | Existing v4 databases without FTS keep working unmodified. Zero migration friction for current users. | Two valid in-the-wild versions (v4 and v5) to support. |
| B. Always-bump on next release | Single version to maintain. | Every existing user's database file needs to be opened-and-resaved by sqlrite-engine 0.1.26+; old engine versions can't open them. |
Recommendation: A. On-demand. Existing v0.1.x databases continue to open against v0.1.26+ without surprise; the bump only kicks in when the user actually creates an FTS index. Documented in docs/file-format.md.
8a (algorithms) — standalone, no deps; foundational
└── 8b (SQL surface) — needs 8a
└── 8c (persistence) — needs 8b
├── 8d (hybrid docs + example) — parallel after 8c
├── 8e (MCP bm25_search tool) — parallel after 8c
└── 8f (docs + smoke test) — last; documents what shipped
Sub-phases land as their own PR + release-PR + release wave (continuing the lockstep cadence). The 8d / 8e / 8f trio likely fold into one PR each since they're small.
Suggested release cadence:
| Release | Lands |
|---|---|
| v0.2.0 | 8a + 8b + 8c (the load-bearing FTS work — major-version bump because the file format changed and users see a new SQL surface) |
| v0.2.1 | 8d (hybrid example + docs) |
| v0.2.2 | 8e (MCP tool) |
| v0.2.3 | 8f (final docs sweep) |
Argument for the 0.1.x → 0.2.x bump: the file format changed (v4 → v5) and we're adding a substantial new SQL surface. The 0.1.x cycle covered everything from "modernize the codebase" through Phase 7's AI-era extensions; v0.2.0 is the right place to mark the FTS arrival.
Alternatively, fold all six sub-phases into one v0.1.26 release if the work runs small. We'll see how 8a-8c size up before deciding.
| Sub-phase | LOC (engine) | LOC (tests) | LOC (docs) |
|---|---|---|---|
| 8a — algorithms | ~250 | ~200 | — |
| 8b — SQL integration | ~250 | ~300 | — |
| 8c — persistence | ~250 | ~200 | — |
| 8d — hybrid example | ~50 | — | ~150 |
| 8e — MCP tool | ~50 | ~100 | — |
| 8f — docs sweep | — | — | ~600 |
| Total | ~850 | ~800 | ~750 |
About 2.4 kLOC overall, ~850 of which is engine code. Comparable to Phase 7d (HNSW) which clocked in around 1.8 kLOC across its three sub-phases.
- Multi-column FTS (Q2) — single-column for the MVP.
- Unicode tokenization (Q3) — ASCII for the MVP, follow-up cargo feature.
- Stemming + stop words (Q4 + Q5) — not on the roadmap.
- Configurable BM25 parameters —
k1andbare fixed at SQLite-FTS5 defaults (1.5 / 0.75). Per-column field-weighting deferred. - Phrase queries (
MATCH '"exact phrase"') — single-token matching only for the MVP. Phrase queries need positional postings, doubling the index size. Phase 8.1. - Query operators (
AND/OR/NOTinside the query string) — for the MVP,fts_match(col, 'a b c')matches rows containingaORbORc(any-term). Boolean query syntax deferred to Phase 8.1. - Highlight / snippet generation (
snippet(col, 'q')) — not in the MVP. Easy to add later if users want it. - Multi-index intersection optimizer (Q6) — FTS pre-filter + scalar WHERE for the MVP.
- Incremental DELETE / UPDATE (Q7) — rebuild-on-save for the MVP.
- Persistence is the load-bearing risk. Phase 7d.3 estimated 300 LOC for HNSW persistence, shipped at ~600. FTS posting lists could similarly blow the estimate — long posting lists need overflow chaining. Watch for it; budget +50% on 8c if signs of growth appear early.
- Tokenizer edge cases. Numeric tokens, hyphenated words, URL-like text — even ASCII tokenization has surprising corners. Test against a realistic corpus early, not just synthetic.
- Index update during transaction. A
BEGIN; INSERT ... INSERT ... COMMIT;block currently does N incremental FTS updates inside the in-memory snapshot, then the auto-save fires once at COMMIT. The persistence code (8c) needs to write a single batched update rather than N per-row writes. Mirrors how HNSW persistence handles transactions. - MCP
bm25_searchand thesqlrite-mcp--no-default-featuresbuild. The current--no-default-featuresMCP build (six tools, noask) should also losebm25_search— wait, that doesn't quite work becausebm25_searchdoesn't depend on the LLM. Resolution:bm25_searchis gated behind a newftscargo feature on the engine, default-on. The MCP server's tool list already follows the engine's cargo features forask; same shape forfts.
Concrete file:line touch points for anyone reading the eventual PRs:
| Component | File | Phase 8 change |
|---|---|---|
IndexMethod enum |
src/sql/executor.rs:482-485 |
add Fts variant |
IndexType::Custom dispatch |
src/sql/executor.rs:383-388 |
add "fts" arm |
try_*_probe optimizer hook |
src/sql/executor.rs:757-838 (HNSW reference) |
new try_fts_probe |
| Scalar-fn dispatcher | src/sql/executor.rs:1176-1216 |
add fts_match + bm25_score arms |
| Cell kind tag table | src/sql/pager/cell.rs:56-74 |
add KIND_FTS_POSTING: u8 = 0x06 |
| Cell encoding template | src/sql/pager/hnsw_cell.rs:7-70 |
model new fts_cell.rs after this |
| Table schema (per-table indexes) | src/sql/db/table.rs |
add fts_indexes: Vec<FtsIndex> (mirrors hnsw_indexes) |
| New module | src/sql/fts/ |
tokenizer + bm25 + posting_list + (later) cell |
| MCP tool | sqlrite-mcp/src/tools/bm25_search.rs (NEW) |
mirrors vector_search.rs |
Engine fts feature |
Cargo.toml [features] |
add fts (default-on); gate everything above |
docs/phase-7-plan.md— the template this plan follows; specifically the §7d (HNSW) sub-phases that 8a-8c mirror most closely.docs/architecture.md— workspace + engine module map.docs/sql-engine.md—process_command+ executor architecture; FTS optimizer hook will appear here once 8b lands.docs/file-format.md— page format reference;KIND_FTS_POSTINGdocumented here once 8c lands.docs/mcp.md— MCP server tool reference;bm25_searchdocumented here once 8e lands.docs/ask.md— natural-language → SQL feature; once FTS ships, the LLM's prompt gets to teach the model aboutfts_match+bm25_score(small follow-up tosqlrite-ask's system rules).