Skip to content

Commit ed2f4d2

Browse files
committed
Optimize search hot paths
1 parent 5f44642 commit ed2f4d2

22 files changed

Lines changed: 1328 additions & 581 deletions

README.md

Lines changed: 11 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -779,7 +779,7 @@ River sensors use cached forecasts to protect orchards from frost.
779779
}
780780
```
781781

782-
Tiktoken mode uses `Microsoft.ML.Tokenizers` to encode section/paragraph text into token IDs, builds normalized sparse vectors, and calculates Euclidean distance. The default weighting is `SubwordTfIdf`, fitted over the current build corpus and reused for query vectors. `TermFrequency` uses raw token counts, and `Binary` uses token presence/absence.
782+
Tiktoken mode uses `Microsoft.ML.Tokenizers` to encode section/paragraph text into token IDs, builds normalized sparse vectors, and calculates token-distance ranking from cached squared magnitudes and dot products. The default weighting is `SubwordTfIdf`, fitted over the current build corpus and reused for query vectors. `TermFrequency` uses raw token counts, and `Binary` uses token presence/absence.
783783

784784
`SearchByTokenDistanceAsync` keeps exact token-distance behavior by default. Pass `TokenDistanceSearchOptions` with `EnableFuzzyQueryCorrection = true` when user queries may contain typos. The correction step checks words that are absent from the indexed corpus vocabulary, finds close corpus terms with the bounded edit-distance matcher, appends the best corrections to the query, and only then runs Tiktoken vector search. This improves recall for misspelled words in the query or corpus text while leaving the Tiktoken vector space as the ranking signal.
785785

@@ -1272,6 +1272,7 @@ Markdown links, wikilinks, and arrow assertions are not implicitly converted int
12721272
- `dotNetRDF` builds the RDF graph, runs local SPARQL, and serializes Turtle/JSON-LD.
12731273
- Schema-aware search compiles caller profiles into local or federated SPARQL and keeps generated queries/evidence visible to callers.
12741274
- Ranked search can use graph-native ranking, in-memory BM25, optional fuzzy BM25 token matching, optional semantic ranking, or hybrid reciprocal-rank fusion.
1275+
- Exact BM25 counts selected query terms with span-based lookup and pooled per-query statistics; fuzzy BM25 stays opt-in because it must enumerate typo candidates.
12751276
- Cited answers use `IChatClient` plus ranked graph retrieval and return source citations without storing conversation history.
12761277
- Chunk evaluation and source-change planning are deterministic local helpers, not hosted indexing services.
12771278
- `dotNetRdf.Shacl` validates built graphs with default or caller-supplied SHACL shapes.
@@ -1285,7 +1286,7 @@ Markdown links, wikilinks, and arrow assertions are not implicitly converted int
12851286

12861287
## Algorithm References
12871288

1288-
- Optional fuzzy lexical matching is shared by BM25 typo-tolerant ranking and Tiktoken fuzzy query correction. It uses bounded edit distance with portable SIMD common-affix trimming, stack-backed bit-vector masks for short residual tokens, and a pooled bounded banded dynamic-programming fallback for longer residual tokens. It is not a naive full-matrix Levenshtein implementation and does not use platform-specific SIMD intrinsics.
1289+
- Optional fuzzy lexical matching is shared by BM25 typo-tolerant ranking and Tiktoken fuzzy query correction. It uses bounded edit distance with common-affix trimming, stack-backed bit-vector masks for short residual tokens, and a pooled bounded banded dynamic-programming fallback for longer residual tokens. It is not a naive full-matrix Levenshtein implementation and does not use platform-specific SIMD intrinsics.
12891290
- The bit-vector path is guided by Gene Myers, "A fast bit-vector algorithm for approximate string matching based on dynamic programming", Journal of the ACM, 1999, DOI: <https://doi.org/10.1145/316542.316550>.
12901291
- The bounded-threshold behavior is guided by Esko Ukkonen, "Algorithms for approximate string matching", Information and Control, 1985, DOI: <https://doi.org/10.1016/S0019-9958(85)80046-2>.
12911292
- Thanks to `biegehydra/MyersBitParallelDotnet` for inspiring the practical direction we took for fast short-token typo matching.
@@ -1311,13 +1312,13 @@ Current local headline numbers from the May 3, 2026 BenchmarkDotNet 0.15.8 run o
13111312
| Area | Current local result |
13121313
| --- | --- |
13131314
| Full suite | 118 BenchmarkDotNet cases using the `Default` job |
1314-
| Graph build | `LargeCorpus` builds in 45.457 ms with 57.74 MB allocated |
1315-
| Low-latency search | `ShortDocuments` exact ranked graph search is 1.195 ms / 2.37 MB; BM25 is 1.659 ms / 3.07 MB |
1316-
| Typo-tolerant search | BM25 fuzzy stays opt-in; `ShortDocuments` exact fuzzy search is 1.979 ms / 3.07 MB |
1317-
| RDF query paths | `ShortDocuments` exact schema SPARQL is 41.078 ms / 60.33 MB; local federated schema search is 39.410 ms / 62.31 MB |
1318-
| Tiktoken search | `LongDocuments` exact token-distance search is 298.1 us / 212.24 KB; typo correction is 391.5 us / 216.30 KB |
1319-
| Persistence | `LargeCorpus` Turtle file load is 35.708 ms / 28.10 MB; JSON-LD file load is 90.663 ms / 75.32 MB |
1320-
| Lifecycle | Build/search/save/load/export is 55.35 ms / 54.44 MB |
1321-
| Fuzzy edit distance | Long insertion is 376.58x faster than naive Levenshtein; long no-match is 172.88x faster, both with 0 B allocated |
1315+
| Graph build | `LargeCorpus` builds in 47.851 ms with 57.73 MB allocated |
1316+
| Low-latency search | `ShortDocuments` exact ranked graph search is 1.092 ms / 2.17 MB; BM25 is 1.309 ms / 2.14 MB |
1317+
| Typo-tolerant search | BM25 fuzzy stays opt-in; `ShortDocuments` typo fuzzy search is 1.815 ms / 2.86 MB |
1318+
| RDF query paths | `ShortDocuments` exact schema SPARQL is 49.212 ms / 60.32 MB; local federated schema search is 41.243 ms / 62.3 MB |
1319+
| Tiktoken search | `LongDocuments` exact token-distance search is 159.8 us / 107.27 KB; typo correction is 225.7 us / 110.68 KB |
1320+
| Persistence | `LargeCorpus` Turtle file load is 34.787 ms / 28.10 MB; JSON-LD file load is 98.267 ms / 75.32 MB |
1321+
| Lifecycle | Build/search/save/load/export is 45.44 ms / 53.51 MB |
1322+
| Fuzzy edit distance | Long insertion is 368.69x faster than naive Levenshtein; long no-match is 176.19x faster, both with 0 B allocated |
13221323

13231324
These numbers are local diagnostics, not a cross-machine performance contract.

docs/ADR/ADR-0003-tiktoken-extraction-mode.md

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -28,7 +28,7 @@ Use explicit extraction modes in `MarkdownKnowledgePipeline`:
2828
- `ChatClient`: require an `IChatClient` and use structured chat extraction only.
2929
- `Tiktoken`: build an experimental token-distance graph from Tiktoken token IDs.
3030

31-
The Tiktoken mode uses `Microsoft.ML.Tokenizers` and `Microsoft.ML.Tokenizers.Data.O200kBase`. It segments Markdown through heading or loose document sections and paragraph/line blocks, encodes each segment with Tiktoken, fits a corpus-local sparse vector space, calculates Euclidean distance, creates segment entities, links the source document to each segment with `schema:mentions`, and links near segments with `kb:relatedTo`.
31+
The Tiktoken mode uses `Microsoft.ML.Tokenizers` and `Microsoft.ML.Tokenizers.Data.O200kBase`. It segments Markdown through heading or loose document sections and paragraph/line blocks, encodes each segment with Tiktoken, fits a corpus-local sparse vector space, calculates normalized token-distance ranking from cached squared magnitudes and dot products, creates segment entities, links the source document to each segment with `schema:mentions`, and links near segments with `kb:relatedTo`.
3232

3333
`SearchByTokenDistanceAsync` remains exact by default. Callers can pass `TokenDistanceSearchOptions` with `EnableFuzzyQueryCorrection = true` to expand absent query words with close corpus vocabulary terms before Tiktoken query encoding. This uses bounded word-level edit distance as a query-normalization step; it does not compute edit distance over Tiktoken IDs.
3434

@@ -65,7 +65,7 @@ flowchart LR
6565
Token --> Hints["Explicit front matter entity hints"]
6666
Token --> Structure["schema:hasPart structure"]
6767
Chat --> Facts["Knowledge facts"]
68-
Weighting --> Segments["Segment nodes and related edges"]
68+
Weighting --> Segments["Segment nodes and bounded related edges"]
6969
Topics --> Segments
7070
Hints --> HintFacts["Hint entities and mentions"]
7171
Structure --> Segments
@@ -90,6 +90,7 @@ flowchart LR
9090
- Subword TF-IDF downweights corpus-common tokens without manually curated language rules.
9191
- Tiktoken mode now produces named topic vertices and typed `schema:hasPart` / `schema:about` edges, not only segment similarity edges.
9292
- Tiktoken mode preserves explicit front matter entity hints without reintroducing Markdown link, wikilink, or arrow scanner heuristics.
93+
- Segment building avoids paragraph/line split arrays, related-segment selection keeps bounded nearest neighbors, and vector distance avoids recomputing magnitudes per comparison.
9394
- Fuzzy query correction improves typo-heavy same-language token-distance search without changing the default exact behavior.
9495
- Raw term frequency and binary weighting remain testable baselines.
9596
- The core library still avoids concrete LLM and embedding providers.
@@ -112,7 +113,7 @@ Testing methodology:
112113
- Chat mode builds graph facts only from `IChatClient` output and does not use Markdown link heuristics.
113114
- Tiktoken mode builds graph nodes/edges and supports `SearchByTokenDistanceAsync`.
114115
- Fuzzy query correction tests cover query-side typos, corpus-side misspellings, distractor-biased exact tokens, invalid options, opt-in behavior, and long-vocabulary performance.
115-
- Focused vector tests verify L2 normalization, binary count suppression, TF-IDF common-token downweighting, and Euclidean distance behavior.
116+
- Focused vector tests verify L2 normalization, binary count suppression, TF-IDF common-token downweighting, and cached Euclidean distance behavior.
116117
- English, Ukrainian, French, and German same-language sources with 10 same-language queries each must hit at least 8 top matches.
117118
- Cross-language translated-topic checks must stay low because no embedding or translation model is present.
118119
- `TermFrequency`, `Binary`, and `SubwordTfIdf` must each remain selectable and pass the English flow baseline.

docs/ADR/ADR-0005-hybrid-graph-search.md

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,20 +20,27 @@ Neither path solves cross-language mismatch between graph content and user queri
2020
Add an optional semantic ranked-search boundary that:
2121

2222
- builds an in-memory semantic index from graph-native candidate text
23+
- supports in-memory BM25 lexical ranking over the same candidate boundary
2324
- uses `Microsoft.Extensions.AI.IEmbeddingGenerator<string, Embedding<float>>`
2425
- keeps graph results canonical
2526
- uses semantic results only as fallback or merge inputs
27+
- supports opt-in reciprocal-rank fusion when callers want rank-fused graph and semantic evidence
2628
- excludes `schema:keywords` from canonical ranking
2729

30+
Exact BM25 stays provider-neutral and in-memory. It counts selected query terms with span-based lookup and pooled term statistics. Optional fuzzy BM25 uses bounded edit distance for typo tolerance, remains opt-in, and builds full candidate term dictionaries only when typo enumeration is requested.
31+
2832
## Boundaries
2933

3034
```mermaid
3135
flowchart LR
3236
Graph["KnowledgeGraph"] --> Canonical["Canonical graph ranking"]
37+
Graph --> Bm25["In-memory BM25 ranking"]
3338
Graph --> SemanticIndex["Optional semantic index"]
3439
Embedder["IEmbeddingGenerator"] --> SemanticIndex
3540
Canonical --> Hybrid["Hybrid merge"]
41+
Bm25 --> Results["Ranked results"]
3642
SemanticIndex --> Hybrid
43+
Hybrid --> Results
3744
Hybrid --> Gateway["Gateway or host app"]
3845
```
3946

@@ -43,15 +50,20 @@ Positive:
4350

4451
- cross-language queries have a provider-neutral recovery path
4552
- graph-first explainability is preserved
53+
- BM25 gives a local lexical ranking path without embeddings, Lucene, or a database
54+
- fuzzy BM25 can recover insertion, deletion, and substitution typos while staying opt-in
55+
- reciprocal-rank fusion is available without making it the default merge policy
4656
- the host application keeps ownership of embedding-provider choice
4757

4858
Negative:
4959

5060
- the library now owns a small additional search boundary
61+
- fuzzy BM25 costs more CPU than exact BM25 because it must enumerate candidate terms
5162
- semantic tests need a deterministic non-network embedding adapter
5263

5364
## Rejected Alternatives
5465

5566
- Semantic-only ranking: rejected because graph must remain canonical.
67+
- Always-on fuzzy BM25: rejected because exact lexical ranking should stay the cheaper default path.
5668
- Provider-specific embedding package in the core library: rejected by repository rules.
5769
- External vector database integration in the library: rejected because infra belongs in the host application.

0 commit comments

Comments
 (0)