diff --git a/README.md b/README.md
Lines changed: 11 additions & 10 deletions
diff --git a/docs/ADR/ADR-0003-tiktoken-extraction-mode.md b/docs/ADR/ADR-0003-tiktoken-extraction-mode.md
Lines changed: 4 additions & 3 deletions
diff --git a/docs/ADR/ADR-0005-hybrid-graph-search.md b/docs/ADR/ADR-0005-hybrid-graph-search.md
Lines changed: 12 additions & 0 deletions
@@ -779,7 +779,7 @@ River sensors use cached forecasts to protect orchards from frost.
 }
 ```
 
-Tiktoken mode uses `Microsoft.ML.Tokenizers` to encode section/paragraph text into token IDs, builds normalized sparse vectors, and calculates Euclidean distance. The default weighting is `SubwordTfIdf`, fitted over the current build corpus and reused for query vectors. `TermFrequency` uses raw token counts, and `Binary` uses token presence/absence.
+Tiktoken mode uses `Microsoft.ML.Tokenizers` to encode section/paragraph text into token IDs, builds normalized sparse vectors, and calculates token-distance ranking from cached squared magnitudes and dot products. The default weighting is `SubwordTfIdf`, fitted over the current build corpus and reused for query vectors. `TermFrequency` uses raw token counts, and `Binary` uses token presence/absence.
 
 `SearchByTokenDistanceAsync` keeps exact token-distance behavior by default. Pass `TokenDistanceSearchOptions` with `EnableFuzzyQueryCorrection = true` when user queries may contain typos. The correction step checks words that are absent from the indexed corpus vocabulary, finds close corpus terms with the bounded edit-distance matcher, appends the best corrections to the query, and only then runs Tiktoken vector search. This improves recall for misspelled words in the query or corpus text while leaving the Tiktoken vector space as the ranking signal.
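
The opt-in correction flow described above can be sketched in a few lines. This is an illustrative Python outline, not the library's C# implementation: `bounded_edit_distance` and `correct_query` are hypothetical names, the limit of 2 edits is an assumed default, and a plain two-row dynamic program stands in for the bounded matcher.

```python
def bounded_edit_distance(a: str, b: str, limit: int):
    """Return the Levenshtein distance if it is <= limit, otherwise None."""
    if abs(len(a) - len(b)) > limit:
        return None  # the length gap alone already exceeds the budget
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        if min(cur) > limit:
            return None  # row minima never decrease, so stop early
        prev = cur
    return prev[-1] if prev[-1] <= limit else None


def correct_query(query: str, vocabulary: set, max_distance: int = 2) -> str:
    """Append the closest corpus term for every query word missing from the vocabulary."""
    words = query.lower().split()
    additions = []
    for word in words:
        if word in vocabulary:
            continue  # exact words are left untouched
        best = None
        for term in sorted(vocabulary):  # sorted for deterministic tie-breaking
            distance = bounded_edit_distance(word, term, max_distance)
            if distance is not None and (best is None or distance < best[0]):
                best = (distance, term)
        if best is not None:
            additions.append(best[1])
    # The original words stay in place; corrections are only appended.
    return " ".join(words + additions)


vocabulary = set("river sensors use cached forecasts to protect orchards from frost".split())
print(correct_query("rivr sensrs frost", vocabulary))
```

The expanded query string is what would then be encoded and ranked, so the vector space remains the only ranking signal.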
 
@@ -1272,6 +1272,7 @@ Markdown links, wikilinks, and arrow assertions are not implicitly converted int
 - `dotNetRDF` builds the RDF graph, runs local SPARQL, and serializes Turtle/JSON-LD.
 - Schema-aware search compiles caller profiles into local or federated SPARQL and keeps generated queries/evidence visible to callers.
 - Ranked search can use graph-native ranking, in-memory BM25, optional fuzzy BM25 token matching, optional semantic ranking, or hybrid reciprocal-rank fusion.
+- Exact BM25 counts selected query terms with span-based lookup and pooled per-query statistics; fuzzy BM25 stays opt-in because it must enumerate typo candidates.
 - Cited answers use `IChatClient` plus ranked graph retrieval and return source citations without storing conversation history.
 - Chunk evaluation and source-change planning are deterministic local helpers, not hosted indexing services.
 - `dotNetRdf.Shacl` validates built graphs with default or caller-supplied SHACL shapes.
@@ -1285,7 +1286,7 @@ Markdown links, wikilinks, and arrow assertions are not implicitly converted int
 
 ## Algorithm References
 
-- Optional fuzzy lexical matching is shared by BM25 typo-tolerant ranking and Tiktoken fuzzy query correction. It uses bounded edit distance with portable SIMD common-affix trimming, stack-backed bit-vector masks for short residual tokens, and a pooled bounded banded dynamic-programming fallback for longer residual tokens. It is not a naive full-matrix Levenshtein implementation and does not use platform-specific SIMD intrinsics.
+- Optional fuzzy lexical matching is shared by BM25 typo-tolerant ranking and Tiktoken fuzzy query correction. It uses bounded edit distance with common-affix trimming, stack-backed bit-vector masks for short residual tokens, and a pooled bounded banded dynamic-programming fallback for longer residual tokens. It is not a naive full-matrix Levenshtein implementation and does not use platform-specific SIMD intrinsics.
 - The bit-vector path is guided by Gene Myers, "A fast bit-vector algorithm for approximate string matching based on dynamic programming", Journal of the ACM, 1999, DOI: <https://doi.org/10.1145/316542.316550>.
 - The bounded-threshold behavior is guided by Esko Ukkonen, "Algorithms for approximate string matching", Information and Control, 1985, DOI: <https://doi.org/10.1016/S0019-9958(85)80046-2>.
 - Thanks to `biegehydra/MyersBitParallelDotnet` for inspiring the practical direction we took for fast short-token typo matching.
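
For readers unfamiliar with the bit-vector approach referenced above, the following Python sketch combines common-affix trimming with a Myers-style bit-parallel distance. It is an assumption-laden illustration: the function names are hypothetical, Python's unbounded integers replace the stack-backed masks, and the banded fallback for long residual tokens is omitted.

```python
def trim_common_affixes(a: str, b: str):
    """Drop the shared prefix and suffix; they never contribute to the distance."""
    start = 0
    shortest = min(len(a), len(b))
    while start < shortest and a[start] == b[start]:
        start += 1
    end_a, end_b = len(a), len(b)
    while end_a > start and end_b > start and a[end_a - 1] == b[end_b - 1]:
        end_a -= 1
        end_b -= 1
    return a[start:end_a], b[start:end_b]


def myers_distance(pattern: str, text: str) -> int:
    """Levenshtein distance using one bit per pattern character."""
    m = len(pattern)
    if m == 0:
        return len(text)
    peq = {}  # per-character match masks
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    pv, mv, score = mask, 0, m  # positive/negative vertical deltas
    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask  # positive horizontal deltas
        mh = pv & xh                   # negative horizontal deltas
        if ph & high:
            score += 1
        if mh & high:
            score -= 1
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score


def edit_distance(a: str, b: str) -> int:
    return myers_distance(*trim_common_affixes(a, b))


def within_distance(a: str, b: str, limit: int) -> bool:
    """Bounded check: reject on length gap before doing any bit work."""
    if abs(len(a) - len(b)) > limit:
        return False
    return edit_distance(a, b) <= limit
```

Trimming first keeps the residual tokens short, which is what makes a single machine word sufficient for most typo checks in a fixed-width implementation.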
@@ -1311,13 +1312,13 @@ Current local headline numbers from the May 3, 2026 BenchmarkDotNet 0.15.8 run o
 | Area | Current local result |
 | --- | --- |
 | Full suite | 118 BenchmarkDotNet cases using the `Default` job |
-| Graph build | `LargeCorpus` builds in 45.457 ms with 57.74 MB allocated |
-| Low-latency search | `ShortDocuments` exact ranked graph search is 1.195 ms / 2.37 MB; BM25 is 1.659 ms / 3.07 MB |
-| Typo-tolerant search | BM25 fuzzy stays opt-in; `ShortDocuments` exact fuzzy search is 1.979 ms / 3.07 MB |
-| RDF query paths | `ShortDocuments` exact schema SPARQL is 41.078 ms / 60.33 MB; local federated schema search is 39.410 ms / 62.31 MB |
-| Tiktoken search | `LongDocuments` exact token-distance search is 298.1 us / 212.24 KB; typo correction is 391.5 us / 216.30 KB |
-| Persistence | `LargeCorpus` Turtle file load is 35.708 ms / 28.10 MB; JSON-LD file load is 90.663 ms / 75.32 MB |
-| Lifecycle | Build/search/save/load/export is 55.35 ms / 54.44 MB |
-| Fuzzy edit distance | Long insertion is 376.58x faster than naive Levenshtein; long no-match is 172.88x faster, both with 0 B allocated |
+| Graph build | `LargeCorpus` builds in 47.851 ms with 57.73 MB allocated |
+| Low-latency search | `ShortDocuments` exact ranked graph search is 1.092 ms / 2.17 MB; BM25 is 1.309 ms / 2.14 MB |
+| Typo-tolerant search | BM25 fuzzy stays opt-in; `ShortDocuments` typo fuzzy search is 1.815 ms / 2.86 MB |
+| RDF query paths | `ShortDocuments` exact schema SPARQL is 49.212 ms / 60.32 MB; local federated schema search is 41.243 ms / 62.3 MB |
+| Tiktoken search | `LongDocuments` exact token-distance search is 159.8 us / 107.27 KB; typo correction is 225.7 us / 110.68 KB |
+| Persistence | `LargeCorpus` Turtle file load is 34.787 ms / 28.10 MB; JSON-LD file load is 98.267 ms / 75.32 MB |
+| Lifecycle | Build/search/save/load/export is 45.44 ms / 53.51 MB |
+| Fuzzy edit distance | Long insertion is 368.69x faster than naive Levenshtein; long no-match is 176.19x faster, both with 0 B allocated |
 
 These numbers are local diagnostics, not a cross-machine performance contract.
@@ -28,7 +28,7 @@ Use explicit extraction modes in `MarkdownKnowledgePipeline`:
 - `ChatClient`: require an `IChatClient` and use structured chat extraction only.
 - `Tiktoken`: build an experimental token-distance graph from Tiktoken token IDs.
 
-The Tiktoken mode uses `Microsoft.ML.Tokenizers` and `Microsoft.ML.Tokenizers.Data.O200kBase`. It segments Markdown through heading or loose document sections and paragraph/line blocks, encodes each segment with Tiktoken, fits a corpus-local sparse vector space, calculates Euclidean distance, creates segment entities, links the source document to each segment with `schema:mentions`, and links near segments with `kb:relatedTo`.
+The Tiktoken mode uses `Microsoft.ML.Tokenizers` and `Microsoft.ML.Tokenizers.Data.O200kBase`. It segments Markdown through heading or loose document sections and paragraph/line blocks, encodes each segment with Tiktoken, fits a corpus-local sparse vector space, calculates normalized token-distance ranking from cached squared magnitudes and dot products, creates segment entities, links the source document to each segment with `schema:mentions`, and links near segments with `kb:relatedTo`.
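
The distance step rests on the identity ‖a − b‖² = ‖a‖² + ‖b‖² − 2·(a·b), so each vector's squared magnitude can be computed once and reused for every comparison. The Python sketch below illustrates that idea only; `SparseVector`, `from_counts`, and `squared_distance` are hypothetical names, and the integer keys stand in for Tiktoken token IDs.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SparseVector:
    weights: dict             # token ID -> weight
    squared_magnitude: float  # cached once at construction time


def from_counts(counts: dict) -> SparseVector:
    """L2-normalize raw weights and cache the squared magnitude."""
    norm = math.sqrt(sum(w * w for w in counts.values()))
    if norm == 0.0:
        return SparseVector({}, 0.0)
    weights = {token: w / norm for token, w in counts.items()}
    return SparseVector(weights, sum(w * w for w in weights.values()))


def squared_distance(a: SparseVector, b: SparseVector) -> float:
    """Squared Euclidean distance from cached magnitudes and one sparse dot product."""
    small, large = (a, b) if len(a.weights) <= len(b.weights) else (b, a)
    dot = sum(w * large.weights.get(token, 0.0) for token, w in small.weights.items())
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, a.squared_magnitude + b.squared_magnitude - 2.0 * dot)


def rank(query: SparseVector, segments: dict) -> list:
    """Order segment IDs from nearest to farthest."""
    return sorted(segments, key=lambda seg_id: squared_distance(query, segments[seg_id]))
```

Ranking by squared distance gives the same order as ranking by distance, so the square root can be skipped unless the actual distance value has to be reported.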
 
 `SearchByTokenDistanceAsync` remains exact by default. Callers can pass `TokenDistanceSearchOptions` with `EnableFuzzyQueryCorrection = true` to expand absent query words with close corpus vocabulary terms before Tiktoken query encoding. This uses bounded word-level edit distance as a query-normalization step; it does not compute edit distance over Tiktoken IDs.
 
@@ -65,7 +65,7 @@ flowchart LR
     Token --> Hints["Explicit front matter entity hints"]
     Token --> Structure["schema:hasPart structure"]
     Chat --> Facts["Knowledge facts"]
-    Weighting --> Segments["Segment nodes and related edges"]
+    Weighting --> Segments["Segment nodes and bounded related edges"]
     Topics --> Segments
     Hints --> HintFacts["Hint entities and mentions"]
     Structure --> Segments
@@ -90,6 +90,7 @@ flowchart LR
 - Subword TF-IDF downweights corpus-common tokens without manually curated language rules.
 - Tiktoken mode now produces named topic vertices and typed `schema:hasPart` / `schema:about` edges, not only segment similarity edges.
 - Tiktoken mode preserves explicit front matter entity hints without reintroducing Markdown link, wikilink, or arrow scanner heuristics.
+- Segment building avoids paragraph/line split arrays, related-segment selection keeps bounded nearest neighbors, and vector distance avoids recomputing magnitudes per comparison.
 - Fuzzy query correction improves typo-heavy same-language token-distance search without changing the default exact behavior.
 - Raw term frequency and binary weighting remain testable baselines.
 - The core library still avoids concrete LLM and embedding providers.
@@ -112,7 +113,7 @@ Testing methodology:
 - Chat mode builds graph facts only from `IChatClient` output and does not use Markdown link heuristics.
 - Tiktoken mode builds graph nodes/edges and supports `SearchByTokenDistanceAsync`.
 - Fuzzy query correction tests cover query-side typos, corpus-side misspellings, distractor-biased exact tokens, invalid options, opt-in behavior, and long-vocabulary performance.
-- Focused vector tests verify L2 normalization, binary count suppression, TF-IDF common-token downweighting, and Euclidean distance behavior.
+- Focused vector tests verify L2 normalization, binary count suppression, TF-IDF common-token downweighting, and cached Euclidean distance behavior.
 - English, Ukrainian, French, and German same-language sources with 10 same-language queries each must hit at least 8 top matches.
 - Cross-language translated-topic checks must stay low because no embedding or translation model is present.
 - `TermFrequency`, `Binary`, and `SubwordTfIdf` must each remain selectable and pass the English flow baseline.
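
The three weighting modes that the vector tests exercise can be compared with a small Python sketch. Everything here is illustrative: the function names, the mode strings, and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` are assumptions, not the library's exact definitions, and tokens unseen during fitting are simply skipped.

```python
import math
from collections import Counter


def fit_idf(corpus: list) -> dict:
    """Fit inverse document frequency over a corpus of token-ID lists."""
    total = len(corpus)
    document_frequency = Counter()
    for tokens in corpus:
        document_frequency.update(set(tokens))
    # Smoothed IDF (an assumed formula): common tokens get weights close to 1.
    return {token: math.log((1 + total) / (1 + df)) + 1.0
            for token, df in document_frequency.items()}


def weigh(tokens: list, mode: str, idf: dict = None) -> dict:
    """Build an L2-normalized sparse vector for one segment or query."""
    counts = Counter(tokens)
    if mode == "binary":
        raw = {token: 1.0 for token in counts}            # presence only
    elif mode == "term_frequency":
        raw = {token: float(n) for token, n in counts.items()}
    elif mode == "subword_tfidf":
        raw = {token: n * idf[token] for token, n in counts.items() if token in idf}
    else:
        raise ValueError(f"unknown weighting mode: {mode}")
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {token: w / norm for token, w in raw.items()} if norm else {}


corpus = [[1, 2, 2], [1, 3], [1, 4]]   # token 1 appears in every document
idf = fit_idf(corpus)
print(weigh([1, 2], "subword_tfidf", idf))  # token 1 is downweighted relative to token 2
```

Reusing the fitted `idf` table for query vectors keeps query and segment weights in the same space, which is the property the ADR relies on.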
 
@@ -20,20 +20,27 @@ Neither path solves cross-language mismatch between graph content and user queri
 Add an optional semantic ranked-search boundary that:
 
 - builds an in-memory semantic index from graph-native candidate text
+- supports in-memory BM25 lexical ranking over the same candidate boundary
 - uses `Microsoft.Extensions.AI.IEmbeddingGenerator<string, Embedding<float>>`
 - keeps graph results canonical
 - uses semantic results only as fallback or merge inputs
+- supports opt-in reciprocal-rank fusion when callers want rank-fused graph and semantic evidence
 - excludes `schema:keywords` from canonical ranking
 
+Exact BM25 stays provider-neutral and in-memory. It counts selected query terms with span-based lookup and pooled term statistics. Optional fuzzy BM25 uses bounded edit distance for typo tolerance, remains opt-in, and builds full candidate term dictionaries only when typo enumeration is requested.
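
As a rough illustration of the exact path, the Python sketch below scores documents with the standard Okapi BM25 formula while counting only the selected query terms. It is a minimal outline under stated assumptions: `bm25_rank` is a hypothetical name, `k1 = 1.2` and `b = 0.75` are conventional defaults rather than confirmed library settings, and the span-based lookup and pooling are .NET-level optimizations not reproduced here.

```python
import math
from collections import Counter


def bm25_rank(query_terms: list, documents: dict, k1: float = 1.2, b: float = 0.75) -> list:
    """Return (doc_id, score) pairs for documents matching at least one query term."""
    total = len(documents)
    average_length = sum(len(tokens) for tokens in documents.values()) / total if total else 0.0
    if average_length == 0.0:
        return []
    selected = set(query_terms)
    # Count only the selected query terms instead of building full term dictionaries.
    term_counts = {doc_id: Counter(t for t in tokens if t in selected)
                   for doc_id, tokens in documents.items()}
    document_frequency = Counter()
    for counts in term_counts.values():
        document_frequency.update(counts.keys())
    scores = {}
    for doc_id, tokens in documents.items():
        length_norm = 1.0 - b + b * len(tokens) / average_length
        score = 0.0
        for term, frequency in term_counts[doc_id].items():
            df = document_frequency[term]
            idf = math.log(1.0 + (total - df + 0.5) / (df + 0.5))
            score += idf * frequency * (k1 + 1.0) / (frequency + k1 * length_norm)
        if score > 0.0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


documents = {
    "a": ["frost", "orchard", "frost"],
    "b": ["river", "sensor"],
    "c": ["frost", "river", "forecast", "cache"],
}
print([doc_id for doc_id, _ in bm25_rank(["frost"], documents)])
```

A fuzzy variant would need the full vocabulary of each candidate to look for near-miss terms, which is why the ADR keeps it behind an explicit option.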
+
 ## Boundaries
 
 ```mermaid
 flowchart LR
     Graph["KnowledgeGraph"] --> Canonical["Canonical graph ranking"]
+    Graph --> Bm25["In-memory BM25 ranking"]
     Graph --> SemanticIndex["Optional semantic index"]
     Embedder["IEmbeddingGenerator"] --> SemanticIndex
     Canonical --> Hybrid["Hybrid merge"]
+    Bm25 --> Results["Ranked results"]
     SemanticIndex --> Hybrid
+    Hybrid --> Results
     Hybrid --> Gateway["Gateway or host app"]
 ```
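
The opt-in reciprocal-rank fusion mentioned in this ADR can be sketched as follows. The Python below is illustrative only: `reciprocal_rank_fusion` is a hypothetical name, and the constant `k = 60` is the value commonly used in the literature, not a confirmed library default.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked ID lists; each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda item: (-scores[item], item))


graph_ranking = ["A", "B", "C"]      # canonical graph order
semantic_ranking = ["B", "D", "A"]   # optional semantic order
print(reciprocal_rank_fusion([graph_ranking, semantic_ranking]))
# prints ['B', 'A', 'D', 'C']
```

Because only ranks enter the formula, graph scores and embedding similarities never need to be put on a common scale before merging.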
 
@@ -43,15 +50,20 @@ Positive:
 
 - cross-language queries have a provider-neutral recovery path
 - graph-first explainability is preserved
+- BM25 gives a local lexical ranking path without embeddings, Lucene, or a database
+- fuzzy BM25 can recover insertion, deletion, and substitution typos while staying opt-in
+- reciprocal-rank fusion is available without making it the default merge policy
 - the host application keeps ownership of embedding-provider choice
 
 Negative:
 
 - the library now owns a small additional search boundary
+- fuzzy BM25 costs more CPU than exact BM25 because it must enumerate candidate terms
 - semantic tests need a deterministic non-network embedding adapter
 
 ## Rejected Alternatives
 
 - Semantic-only ranking: rejected because graph must remain canonical.
+- Always-on fuzzy BM25: rejected because exact lexical ranking should stay the cheaper default path.
 - Provider-specific embedding package in the core library: rejected by repository rules.
 - External vector database integration in the library: rejected because infra belongs in the host application.