README.md
+25 -21 (25 additions & 21 deletions)
@@ -1285,7 +1285,7 @@ Markdown links, wikilinks, and arrow assertions are not implicitly converted int

 ## Algorithm References

-- Optional fuzzy lexical matching is shared by BM25 typo-tolerant ranking and Tiktoken fuzzy query correction. It uses bounded edit distance with portable SIMD common-affix trimming, a single-word bit-vector dynamic-programming path for short residual tokens, and a bounded banded dynamic-programming fallback for longer residual tokens. It is not a naive full-matrix Levenshtein implementation and does not use platform-specific SIMD intrinsics.
+- Optional fuzzy lexical matching is shared by BM25 typo-tolerant ranking and Tiktoken fuzzy query correction. It uses bounded edit distance with portable SIMD common-affix trimming, stack-backed bit-vector masks for short residual tokens, and a pooled bounded banded dynamic-programming fallback for longer residual tokens. It is not a naive full-matrix Levenshtein implementation and does not use platform-specific SIMD intrinsics.
 - The bit-vector path is guided by Gene Myers, "A fast bit-vector algorithm for approximate string matching based on dynamic programming", Journal of the ACM, 1999, DOI: <https://doi.org/10.1145/316542.316550>.
 - The bounded-threshold behavior is guided by Esko Ukkonen, "Algorithms for approximate string matching", Information and Control, 1985, DOI: <https://doi.org/10.1016/S0019-9958(85)80046-2>.
 - Thanks to `biegehydra/MyersBitParallelDotnet` for inspiring the practical direction we took for fast short-token typo matching.
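For readers unfamiliar with the short-token path described in this hunk, here is a minimal Python sketch of common-affix trimming followed by a Myers-style bit-vector edit distance. It is an illustration only, not the repository's C# code: the function names are hypothetical, the trimming is scalar rather than SIMD, and the 64-bit word limit of a native implementation is only simulated by masking Python's unbounded integers.

```python
def trim_common_affixes(a: str, b: str) -> tuple[str, str]:
    """Strip the shared prefix and suffix; only the residual tokens need DP."""
    start = 0
    limit = min(len(a), len(b))
    while start < limit and a[start] == b[start]:
        start += 1
    end_a, end_b = len(a), len(b)
    while end_a > start and end_b > start and a[end_a - 1] == b[end_b - 1]:
        end_a -= 1
        end_b -= 1
    return a[start:end_a], b[start:end_b]


def myers_distance(pattern: str, text: str) -> int:
    """Levenshtein distance via bit-parallel column updates (Myers 1999 style)."""
    m = len(pattern)
    if m == 0:
        return len(text)
    mask = (1 << m) - 1          # keeps every vector m bits wide
    high = 1 << (m - 1)          # bit for the last pattern row
    # One bit mask per character: bit i is set when pattern[i] == ch.
    peq: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    vp, vn = mask, 0             # vertical +1 / -1 deltas of the current column
    dist = m
    for ch in text:
        x = peq.get(ch, 0)
        d0 = ((((x & vp) + vp) ^ vp) | x | vn) & mask
        hp = (vn | ~(d0 | vp)) & mask
        hn = d0 & vp
        if hp & high:
            dist += 1
        elif hn & high:
            dist -= 1
        hp = ((hp << 1) | 1) & mask
        hn = (hn << 1) & mask
        vp = (hn | ~(d0 | hp)) & mask
        vn = hp & d0
    return dist


def within_distance(a: str, b: str, max_distance: int) -> bool:
    """Bounded check: trim, reject on length gap, then run the bit-vector pass."""
    a, b = trim_common_affixes(a, b)
    if abs(len(a) - len(b)) > max_distance:
        return False
    return myers_distance(a, b) <= max_distance
```

Trimming first matters because typo-style differences usually leave a very short residual, which is what makes a single-word bit-vector pass sufficient for most tokens.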
@@ -1363,28 +1363,28 @@ Graph search exact-query mean time:
 Allocation and GC columns come directly from BenchmarkDotNet diagnosers. Treat the ratios and relative pressure inside the same run as the useful signal; ShortRun is a fast diagnostic pass, not a release-grade SLA measurement.
-| Short deletion | 6.778 ns | 91.780 ns | 13.54x | 0 B | 112 B |
-| Short substitution | 31.011 ns | 82.948 ns | 2.67x | 216 B | 112 B |
-| Long insertion | 21.980 ns | 7,990.146 ns | 363.53x | 0 B | 640 B |
-| Long no-match | 70.283 ns | 8,990.700 ns | 127.92x | 328 B | 672 B |
+| Short deletion | 6.793 ns | 94.900 ns | 13.97x | 0 B | 112 B |
+| Short substitution | 32.973 ns | 83.927 ns | 2.55x | 0 B | 112 B |
+| Long insertion | 22.062 ns | 8,261.735 ns | 374.48x | 0 B | 640 B |
+| Long no-match | 53.873 ns | 9,292.649 ns | 172.49x | 0 B | 672 B |
+
+This run reflects the allocation-focused search hot-path pass: BM25 now uses the shared allocation-aware tokenizer, direct scoring loops, and bounded top-N match retention; fuzzy edit distance uses stack-backed bit-vector masks for short residual tokens and pooled rows for the long-token fallback; and Tiktoken search keeps only bounded top-N candidates while TF-IDF weighting updates dictionary values without temporary key arrays.

 These numbers are local measurements, not a cross-machine performance contract. The README keeps compact slices only; [Performance Benchmarks](docs/Features/PerformanceBenchmarks.md) and the full Markdown, CSV, and JSON BenchmarkDotNet reports remain the source for detailed diagnostics.
docs/Features/HybridGraphSearch.md
+1 -1 (1 addition & 1 deletion)
@@ -40,7 +40,7 @@ flowchart LR
 - `schema:keywords` are excluded from canonical ranking.
 - BM25 mode does not require an embedding provider, semantic index, Lucene index, or database.
 - Build-result BM25 can find body-only terms that are absent from title, summary, and front matter.
-- Fuzzy BM25 token matching is opt-in through `KnowledgeGraphRankedSearchOptions.EnableFuzzyTokenMatching`, `MaxFuzzyEditDistance`, and `MinimumFuzzyTokenLength`. It handles insertion, deletion, and substitution typos with portable SIMD common-affix trimming, a single-word bit-vector path for short residual tokens, and a bounded banded dynamic-programming fallback for longer residual tokens. It does not use platform-specific SIMD intrinsics.
+- Fuzzy BM25 token matching is opt-in through `KnowledgeGraphRankedSearchOptions.EnableFuzzyTokenMatching`, `MaxFuzzyEditDistance`, and `MinimumFuzzyTokenLength`. It handles insertion, deletion, and substitution typos with portable SIMD common-affix trimming, stack-backed bit-vector masks for short residual tokens, and a pooled bounded banded dynamic-programming fallback for longer residual tokens. It does not use platform-specific SIMD intrinsics.
 - A hit present in both graph and semantic ranking is marked as merged and keeps its graph-first position.
 - Semantic-only hits never outrank canonical graph hits in hybrid mode.
 - `KnowledgeGraphHybridFusionStrategy.ReciprocalRank` is opt-in. It applies reciprocal rank fusion across graph and semantic result lists while preserving the canonical and semantic component scores for diagnostics.
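The reciprocal rank fusion named in the last bullet can be illustrated with a generic Python sketch. This shows the textbook formula, where each list contributes `1 / (k + rank)` per document, and is not the library's implementation. The function name and the constant `k = 60` are assumptions; the diff does not state which constant the library uses.

```python
def reciprocal_rank_fusion(
    graph_ids: list[str], semantic_ids: list[str], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse two ranked id lists; higher fused score ranks first."""
    scores: dict[str, float] = {}
    for ranking in (graph_ids, semantic_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the documents it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, then by id so ties are deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A document that appears in both lists accumulates two contributions, so it tends to rise above documents found by only one ranker. That differs from the default graph-first ordering described in the bullets above, which is presumably why the strategy is opt-in.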
| Local federated |51.551 ms | 62.32 MB | 26.27x |8500.0000|2000.0000|333.3333| 552 |314.5000|

 Allocation, GC, work-item, and lock-contention columns come directly from BenchmarkDotNet diagnosers. Treat ratios and relative pressure inside the same run as the useful signal; ShortRun is a fast diagnostic pass, not a release-grade SLA measurement.

@@ -152,13 +152,13 @@ Tiktoken token-distance search over the semantic profiles:
|`LongDocuments`| NoMatch |254.3 us |257.7 us |212.19 KB|213.41 KB|
+|`TokenizedMultilingual`| Exact |219.4 us |220.5 us |139.18 KB|140.13 KB|
+|`TokenizedMultilingual`| Typo |246.2 us |267.8 us |139.59 KB|142.02 KB|
+|`TokenizedMultilingual`| NoMatch |200.3 us |184.3 us |138.91 KB|140.06 KB|

-Interpretation: ranked graph, BM25, BM25 fuzzy, and focused search are the low-latency retrieval paths. BM25 fuzzy deliberately spends more time and allocation on typo-heavy queries and should stay opt-in. Schema-aware SPARQL and local federation are explainable RDF query paths, but dotNetRDF query-plan execution keeps them materially heavier for repeated low-latency calls. JSON-LD load is the highest persistence cost in the current local run; Turtle load and snapshot/serialization are cheaper. Use ranked graph or BM25 search when the caller needs low-latency retrieval, and use schema/federation when caller-visible evidence and graph-shape constraints matter more than raw latency.
+Interpretation: ranked graph, BM25, BM25 fuzzy, focused search, and Tiktoken token-distance search are the low-latency retrieval paths. The current BM25 implementation keeps exact and fuzzy allocation close by sharing the same tokenizer, dictionary shape, bounded top-N match retention, stack-backed short-token edit-distance masks, and pooled long-token fallback rows. Tiktoken search keeps bounded top-N candidates and updates TF-IDF dictionary values without temporary key arrays. Fuzzy BM25 still costs more CPU on typo-heavy queries and should stay opt-in. Schema-aware SPARQL and local federation are explainable RDF query paths, but dotNetRDF query-plan execution keeps them materially heavier for repeated low-latency calls. JSON-LD load is the highest persistence cost in the current local run; Turtle load and snapshot/serialization are cheaper. Use ranked graph or BM25 search when the caller needs low-latency retrieval, and use schema/federation when caller-visible evidence and graph-shape constraints matter more than raw latency.

-The fuzzy edit-distance suite measured the bounded bit-vector/banded path faster than the naive Levenshtein baseline in every measured scenario, including 363.53x faster for the long-insertion case and 127.92x faster for the long no-match case.
+The fuzzy edit-distance suite measured the bounded bit-vector/banded path with zero allocated bytes and faster than the naive Levenshtein baseline in every measured scenario, including 374.48x faster for the long-insertion case and 172.49x faster for the long no-match case.
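The "bounded top-N" retention mentioned in the interpretation paragraph is commonly done with a fixed-size min-heap, so memory stays proportional to N rather than to the number of scored candidates. The Python sketch below shows that general technique under that assumption. The helper name `top_n` is hypothetical, and the repository's C# code may use a different data structure.

```python
import heapq


def top_n(scored: list[tuple[float, str]], n: int) -> list[tuple[float, str]]:
    """Keep only the n best (score, id) pairs while scanning candidates once."""
    if n <= 0:
        return []
    heap: list[tuple[float, str]] = []   # min-heap: heap[0] is the weakest kept hit
    for item in scored:
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # The new candidate beats the weakest retained one; swap it in.
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)    # best first
```

For example, `top_n([(0.1, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d")], 2)` keeps only the two strongest candidates and never holds more than two entries at a time.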