You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: README.md
+11-10Lines changed: 11 additions & 10 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -779,7 +779,7 @@ River sensors use cached forecasts to protect orchards from frost.
779
779
}
780
780
```
781
781
782
-
Tiktoken mode uses `Microsoft.ML.Tokenizers` to encode section/paragraph text into token IDs, builds normalized sparse vectors, and calculates Euclidean distance. The default weighting is `SubwordTfIdf`, fitted over the current build corpus and reused for query vectors. `TermFrequency` uses raw token counts, and `Binary` uses token presence/absence.
782
+
Tiktoken mode uses `Microsoft.ML.Tokenizers` to encode section/paragraph text into token IDs, builds normalized sparse vectors, and calculates token-distance ranking from cached squared magnitudes and dot products. The default weighting is `SubwordTfIdf`, fitted over the current build corpus and reused for query vectors. `TermFrequency` uses raw token counts, and `Binary` uses token presence/absence.
783
783
784
784
`SearchByTokenDistanceAsync` keeps exact token-distance behavior by default. Pass `TokenDistanceSearchOptions` with `EnableFuzzyQueryCorrection = true` when user queries may contain typos. The correction step checks words that are absent from the indexed corpus vocabulary, finds close corpus terms with the bounded edit-distance matcher, appends the best corrections to the query, and only then runs Tiktoken vector search. This improves recall for misspelled words in the query or corpus text while leaving the Tiktoken vector space as the ranking signal.
785
785
@@ -1272,6 +1272,7 @@ Markdown links, wikilinks, and arrow assertions are not implicitly converted int
1272
1272
- `dotNetRDF`builds the RDF graph, runs local SPARQL, and serializes Turtle/JSON-LD.
1273
1273
- Schema-aware search compiles caller profiles into local or federated SPARQL and keeps generated queries/evidence visible to callers.
1274
1274
- Ranked search can use graph-native ranking, in-memory BM25, optional fuzzy BM25 token matching, optional semantic ranking, or hybrid reciprocal-rank fusion.
1275
+
- Exact BM25 counts selected query terms with span-based lookup and pooled per-query statistics; fuzzy BM25 stays opt-in because it must enumerate typo candidates.
1275
1276
- Cited answers use `IChatClient` plus ranked graph retrieval and return source citations without storing conversation history.
1276
1277
- Chunk evaluation and source-change planning are deterministic local helpers, not hosted indexing services.
1277
1278
- `dotNetRdf.Shacl`validates built graphs with default or caller-supplied SHACL shapes.
@@ -1285,7 +1286,7 @@ Markdown links, wikilinks, and arrow assertions are not implicitly converted int
1285
1286
1286
1287
## Algorithm References
1287
1288
1288
-
- Optional fuzzy lexical matching is shared by BM25 typo-tolerant ranking and Tiktoken fuzzy query correction. It uses bounded edit distance with portable SIMD common-affix trimming, stack-backed bit-vector masks for short residual tokens, and a pooled bounded banded dynamic-programming fallback for longer residual tokens. It is not a naive full-matrix Levenshtein implementation and does not use platform-specific SIMD intrinsics.
1289
+
- Optional fuzzy lexical matching is shared by BM25 typo-tolerant ranking and Tiktoken fuzzy query correction. It uses bounded edit distance with common-affix trimming, stack-backed bit-vector masks for short residual tokens, and a pooled bounded banded dynamic-programming fallback for longer residual tokens. It is not a naive full-matrix Levenshtein implementation and does not use platform-specific SIMD intrinsics.
1289
1290
- The bit-vector path is guided by Gene Myers, "A fast bit-vector algorithm for approximate string matching based on dynamic programming", Journal of the ACM, 1999, DOI: <https://doi.org/10.1145/316542.316550>.
1290
1291
- The bounded-threshold behavior is guided by Esko Ukkonen, "Algorithms for approximate string matching", Information and Control, 1985, DOI: <https://doi.org/10.1016/S0019-9958(85)80046-2>.
1291
1292
- Thanks to `biegehydra/MyersBitParallelDotnet` for inspiring the practical direction we took for fast short-token typo matching.
@@ -1311,13 +1312,13 @@ Current local headline numbers from the May 3, 2026 BenchmarkDotNet 0.15.8 run o
1311
1312
| Area | Current local result |
1312
1313
| --- | --- |
1313
1314
| Full suite | 118 BenchmarkDotNet cases using the `Default` job |
1314
-
| Graph build | `LargeCorpus` builds in 45.457 ms with 57.74 MB allocated |
1315
-
| Low-latency search | `ShortDocuments` exact ranked graph search is 1.195 ms / 2.37 MB; BM25 is 1.659 ms / 3.07 MB |
1316
-
| Typo-tolerant search | BM25 fuzzy stays opt-in; `ShortDocuments` exact fuzzy search is 1.979 ms / 3.07 MB |
1317
-
| RDF query paths | `ShortDocuments` exact schema SPARQL is 41.078 ms / 60.33 MB; local federated schema search is 39.410 ms / 62.31 MB |
1318
-
| Tiktoken search | `LongDocuments` exact token-distance search is 298.1 us / 212.24 KB; typo correction is 391.5 us / 216.30 KB |
1319
-
| Persistence | `LargeCorpus` Turtle file load is 35.708 ms / 28.10 MB; JSON-LD file load is 90.663 ms / 75.32 MB |
1320
-
| Lifecycle | Build/search/save/load/export is 55.35 ms / 54.44 MB |
1321
-
| Fuzzy edit distance | Long insertion is 376.58x faster than naive Levenshtein; long no-match is 172.88x faster, both with 0 B allocated |
1315
+
| Graph build | `LargeCorpus` builds in 47.851 ms with 57.73 MB allocated |
1316
+
| Low-latency search | `ShortDocuments` exact ranked graph search is 1.092 ms / 2.17 MB; BM25 is 1.309 ms / 2.14 MB |
1317
+
| Typo-tolerant search | BM25 fuzzy stays opt-in; `ShortDocuments` typo fuzzy search is 1.815 ms / 2.86 MB |
1318
+
| RDF query paths | `ShortDocuments` exact schema SPARQL is 49.212 ms / 60.32 MB; local federated schema search is 41.243 ms / 62.3 MB |
1319
+
| Tiktoken search | `LongDocuments` exact token-distance search is 159.8 us / 107.27 KB; typo correction is 225.7 us / 110.68 KB |
1320
+
| Persistence | `LargeCorpus` Turtle file load is 34.787 ms / 28.10 MB; JSON-LD file load is 98.267 ms / 75.32 MB |
1321
+
| Lifecycle | Build/search/save/load/export is 45.44 ms / 53.51 MB |
1322
+
| Fuzzy edit distance | Long insertion is 368.69x faster than naive Levenshtein; long no-match is 176.19x faster, both with 0 B allocated |
1322
1323
1323
1324
These numbers are local diagnostics, not a cross-machine performance contract.
Copy file name to clipboardExpand all lines: docs/ADR/ADR-0003-tiktoken-extraction-mode.md
+4-3Lines changed: 4 additions & 3 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -28,7 +28,7 @@ Use explicit extraction modes in `MarkdownKnowledgePipeline`:
28
28
-`ChatClient`: require an `IChatClient` and use structured chat extraction only.
29
29
-`Tiktoken`: build an experimental token-distance graph from Tiktoken token IDs.
30
30
31
-
The Tiktoken mode uses `Microsoft.ML.Tokenizers` and `Microsoft.ML.Tokenizers.Data.O200kBase`. It segments Markdown through heading or loose document sections and paragraph/line blocks, encodes each segment with Tiktoken, fits a corpus-local sparse vector space, calculates Euclidean distance, creates segment entities, links the source document to each segment with `schema:mentions`, and links near segments with `kb:relatedTo`.
31
+
The Tiktoken mode uses `Microsoft.ML.Tokenizers` and `Microsoft.ML.Tokenizers.Data.O200kBase`. It segments Markdown through heading or loose document sections and paragraph/line blocks, encodes each segment with Tiktoken, fits a corpus-local sparse vector space, calculates normalized token-distance ranking from cached squared magnitudes and dot products, creates segment entities, links the source document to each segment with `schema:mentions`, and links near segments with `kb:relatedTo`.
32
32
33
33
`SearchByTokenDistanceAsync` remains exact by default. Callers can pass `TokenDistanceSearchOptions` with `EnableFuzzyQueryCorrection = true` to expand absent query words with close corpus vocabulary terms before Tiktoken query encoding. This uses bounded word-level edit distance as a query-normalization step; it does not compute edit distance over Tiktoken IDs.
34
34
@@ -65,7 +65,7 @@ flowchart LR
65
65
Token --> Hints["Explicit front matter entity hints"]
66
66
Token --> Structure["schema:hasPart structure"]
67
67
Chat --> Facts["Knowledge facts"]
68
-
Weighting --> Segments["Segment nodes and related edges"]
68
+
Weighting --> Segments["Segment nodes and bounded related edges"]
69
69
Topics --> Segments
70
70
Hints --> HintFacts["Hint entities and mentions"]
71
71
Structure --> Segments
@@ -90,6 +90,7 @@ flowchart LR
90
90
- Subword TF-IDF downweights corpus-common tokens without manually curated language rules.
91
91
- Tiktoken mode now produces named topic vertices and typed `schema:hasPart` / `schema:about` edges, not only segment similarity edges.
92
92
- Tiktoken mode preserves explicit front matter entity hints without reintroducing Markdown link, wikilink, or arrow scanner heuristics.
93
+
- Segment building avoids paragraph/line split arrays, related-segment selection keeps bounded nearest neighbors, and vector distance avoids recomputing magnitudes per comparison.
93
94
- Fuzzy query correction improves typo-heavy same-language token-distance search without changing the default exact behavior.
94
95
- Raw term frequency and binary weighting remain testable baselines.
95
96
- The core library still avoids concrete LLM and embedding providers.
@@ -112,7 +113,7 @@ Testing methodology:
112
113
- Chat mode builds graph facts only from `IChatClient` output and does not use Markdown link heuristics.
113
114
- Tiktoken mode builds graph nodes/edges and supports `SearchByTokenDistanceAsync`.
- uses semantic results only as fallback or merge inputs
27
+
- supports opt-in reciprocal-rank fusion when callers want rank-fused graph and semantic evidence
26
28
- excludes `schema:keywords` from canonical ranking
27
29
30
+
Exact BM25 stays provider-neutral and in-memory. It counts selected query terms with span-based lookup and pooled term statistics. Optional fuzzy BM25 uses bounded edit distance for typo tolerance, remains opt-in, and builds full candidate term dictionaries only when typo enumeration is requested.
0 commit comments