Markdown-LD Knowledge Bank is a .NET 10 library for turning Markdown knowledge-base files into an in-memory RDF graph that can be searched, queried with read-only SPARQL, exported as RDF, and rendered as a diagram.
The package is a C# library implementation of the Markdown-LD knowledge graph workflow. The runtime is local and in-memory: no localhost server, no Azure Functions host, no database server, and no hosted graph service are required.
Use it when you want plain Markdown notes to become a queryable knowledge graph without making your application depend on a specific model provider, graph server, or hosted indexing service.
**Extraction is explicit.** The configured extraction mode determines which facts are produced:
- `Auto` uses `IChatClient` when one is supplied, otherwise extracts no facts and reports a diagnostic.
- `None` builds document metadata only.
- `ChatClient` builds facts only from structured `Microsoft.Extensions.AI.IChatClient` output.
- `Tiktoken` builds a local corpus graph from Tiktoken token IDs, section/segment structure, explicit front matter entity hints, and local keyphrase topics using `Microsoft.ML.Tokenizers`.
Tiktoken mode is deterministic and network-free. It uses lexical token-distance search rather than semantic embedding search. Its default local weighting is subword TF-IDF; raw term frequency and binary presence are also available. It creates `schema:DefinedTerm` topic nodes, explicit front matter hint entities, and `schema:hasPart` / `schema:about` / `schema:mentions` edges.
**Graph outputs:**
## Optional AI Extraction
AI extraction builds graph facts from entities and assertions returned by an injected `Microsoft.Extensions.AI.IChatClient`. The package stays provider-neutral: it does not reference OpenAI, Azure OpenAI, Anthropic, or any other model-specific SDK. If no chat client is provided, `Auto` mode extracts no facts and reports a diagnostic; choose `Tiktoken` mode explicitly for local token-distance extraction.
```csharp
using ManagedCode.MarkdownLd.Kb.Pipeline;
```
The built-in chat extractor requests structured output through `GetResponseAsync<T>()`, normalizes the returned entity/assertion payload, and then builds the same in-memory RDF graph used by search and SPARQL. Tests use one local non-network `IChatClient` implementation so the full extraction-to-graph flow is covered without a live model.
## Local Tiktoken Extraction
```csharp
using ManagedCode.MarkdownLd.Kb.Pipeline;

internal static class TiktokenGraphDemo
{
    private const string Markdown = """
        The observatory stores telescope images in a cold archive near the mountain lab.
        River sensors use cached forecasts to protect orchards from frost.
        """;
}
```
Tiktoken mode uses `Microsoft.ML.Tokenizers` to encode section/paragraph text into token IDs, builds normalized sparse vectors, and calculates Euclidean distance. The default weighting is `SubwordTfIdf`, fitted over the current build corpus and reused for query vectors. `TermFrequency` uses raw token counts, and `Binary` uses token presence/absence.
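The weighting and distance step above can be sketched in a few lines. This Python fragment is an illustration of the technique only, not the library's API: the function names and the tiny token-ID corpus are made up, but the math (subword TF-IDF weights, L2 normalization, Euclidean distance over sparse vectors) mirrors what the paragraph describes.

```python
import math
from collections import Counter

def tfidf_vector(token_ids, doc_freq, num_docs):
    """Build a normalized subword TF-IDF vector from a list of token IDs."""
    counts = Counter(token_ids)
    vec = {
        tok: tf * math.log((1 + num_docs) / (1 + doc_freq.get(tok, 0)))
        for tok, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {tok: w / norm for tok, w in vec.items()}

def euclidean_distance(a, b):
    """Euclidean distance between two sparse vectors (dicts of token -> weight)."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

# Toy corpus of token-ID sequences, as a tokenizer such as Tiktoken would emit.
corpus = [[10, 11, 12, 11], [10, 13, 14], [15, 16, 10]]
num_docs = len(corpus)
doc_freq = Counter(tok for doc in corpus for tok in set(doc))

# Fit TF-IDF over the build corpus, then reuse the same weights for the query.
vectors = [tfidf_vector(doc, doc_freq, num_docs) for doc in corpus]
query = tfidf_vector([10, 11], doc_freq, num_docs)

# The segment sharing the most distinctive tokens with the query ranks closest.
best = min(range(num_docs), key=lambda i: euclidean_distance(query, vectors[i]))
```

Note how token `10`, which appears in every document, gets an IDF weight of zero: corpus-common tokens are downweighted automatically, which is exactly why subword TF-IDF is the default over raw term frequency.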
Tiktoken mode also builds a corpus graph:
- heading or loose document sections and paragraph/line segments become `schema:CreativeWork` nodes
- local Unicode word n-gram keyphrases become `schema:DefinedTerm` topic nodes
- explicit front matter `entity_hints` / `entityHints` become graph entities with stable hash IDs and preserved `sameAs` links
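A sketch of the resulting corpus graph in Turtle, using the library's default `urn:managedcode:markdown-ld-kb:/` base; the exact IRI layout and literal values are assumptions for illustration, not the library's verbatim output:

```turtle
@prefix schema: <https://schema.org/> .

# A heading section and one of its paragraph segments.
<urn:managedcode:markdown-ld-kb:/doc/observatory#section-1>
    a schema:CreativeWork ;
    schema:hasPart <urn:managedcode:markdown-ld-kb:/doc/observatory#section-1-segment-1> ;
    schema:about <urn:managedcode:markdown-ld-kb:/term/cold-archive> .

<urn:managedcode:markdown-ld-kb:/doc/observatory#section-1-segment-1>
    a schema:CreativeWork .

# A local keyphrase topic node.
<urn:managedcode:markdown-ld-kb:/term/cold-archive>
    a schema:DefinedTerm ;
    schema:name "cold archive" .
```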
The local lexical design follows [Multilingual Search with Subword TF-IDF](https://arxiv.org/abs/2209.14281): use subword tokenization plus TF-IDF instead of manually curated tokenization, stop words, or stemming rules. It is designed for same-language lexical retrieval. Cross-language semantic retrieval requires a translation or embedding layer owned by the host application.
The current test corpus validates top-1 token-distance retrieval across English, Ukrainian, French, and German. Same-language queries hit the expected segment at `10/10` for each language in the test corpus. Sampled cross-language aligned hits stay low at `3/40`, which matches the lexical design.
## Query The Graph
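Queries run as read-only SPARQL over the schema.org predicates built by extraction. A minimal illustrative query (the prefix and predicate choice follow the vocabulary described above; it is a sketch, not a fixed API contract):

```sparql
PREFIX schema: <https://schema.org/>

# Find every document segment together with the topics it is about.
SELECT ?segment ?topic WHERE {
    ?segment a schema:CreativeWork ;
             schema:about ?topic .
}
```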
| `SparqlQueryResult` | Query result with `Variables` and `Rows` of `SparqlRow`. |
| `KnowledgeSourceDocumentConverter` | Converts files and directories into pipeline-ready source documents. |
| `ChatClientKnowledgeFactExtractor` | AI extraction adapter behind `IChatClient`. |
| `TiktokenKnowledgeGraphOptions` | Options for explicit Tiktoken token-distance extraction. |
| `TokenVectorWeighting` | Local token weighting mode: `SubwordTfIdf`, `TermFrequency`, or `Binary`. |
| `TokenDistanceSearchResult` | Search result returned by `SearchByTokenDistanceAsync`. |
## Markdown Conventions
Recognized front matter keys:
| `tags` / `keywords` | `schema:keywords` | list |
| `about` | `schema:about` | list |
| `canonicalUrl` / `canonical_url` | low-level Markdown parser document identity; use `KnowledgeDocumentConversionOptions.CanonicalUri` for pipeline identity | string (URL) |
| `entity_hints` / `entityHints` | explicit graph entities in `Tiktoken` mode; parsed as front matter metadata otherwise | list of `{label, type, sameAs}` |
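For example, a front matter block exercising these keys might look like this (all values, including the `Technology` type label, are illustrative):

```yaml
---
tags: [rdf, sparql]
about: [knowledge-graphs]
canonical_url: https://example.org/notes/rdf
entity_hints:
  - label: RDF
    type: Technology
    sameAs: https://www.w3.org/RDF/
---
```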
Predicate normalization for explicit chat/token facts:
- prefixed predicates such as `schema:mentions`, `kb:relatedTo`, `prov:wasDerivedFrom`, and `rdf:type` are preserved
- absolute predicate URIs are preserved when valid
Markdown links, wikilinks, and arrow assertions are not implicitly converted into graph facts. Use `IChatClient` extraction or explicit `Tiktoken` mode when you want body content to produce graph nodes and edges.
## Architecture Choices
- `Markdig` parses Markdown structure.
- `YamlDotNet` parses front matter.
- `dotNetRDF` builds the RDF graph, runs local SPARQL, and serializes Turtle/JSON-LD.
- `Microsoft.Extensions.AI.IChatClient` is the only AI boundary in the core pipeline.
- `Microsoft.ML.Tokenizers` powers the explicit Tiktoken token-distance mode.
- Subword TF-IDF is the default local token weighting because it downweights corpus-common tokens without adding language-specific preprocessing or model runtime dependencies.
- Local topic graph construction uses Unicode word n-gram keyphrases and RDF `schema:DefinedTerm`, `schema:hasPart`, and `schema:about` edges.
- Embeddings are not required for the current graph/search flow; Tiktoken mode uses token IDs, not embedding vectors.
- Microsoft Agent Framework is treated as host-level orchestration, not a core package dependency.
See [docs/Architecture.md](docs/Architecture.md), [ADR-0001](docs/ADR/ADR-0001-rdf-sparql-library.md), [ADR-0002](docs/ADR/ADR-0002-llm-extraction-ichatclient.md), and [ADR-0003](docs/ADR/ADR-0003-tiktoken-extraction-mode.md).
## Inspiration And Attribution
This project is inspired by Luis Quintanilla's Markdown-LD / AI Memex work:
- [W3C SPARQL Federated Query](https://github.com/w3c/sparql-federated-query) - SPARQL federation reference material
- [dotNetRDF](https://github.com/dotnetrdf/dotnetrdf) - RDF/SPARQL engine used by this C# implementation
The upstream reference repository is kept as a read-only submodule under `external/lqdev-markdown-ld-kb`.
## Development
```shell
dotnet format MarkdownLd.Kb.slnx --verify-no-changes
```
Coverage is collected through `Microsoft.Testing.Extensions.CodeCoverage`. Cobertura is the XML output format used for line and branch reporting; the test project does not reference Coverlet.
The verified coverage baseline for the Tiktoken graph extraction work was 95.39% line coverage and 83.28% branch coverage. The branch number was low because several parser, converter, search, and Tiktoken entity-hint boundary paths were exercised only on their common paths.
## Scope
In scope:
- Add meaningful tests for public or flow-level behavior.
- Use coverage XML to target uncovered branch paths.
- Update README with verified coverage numbers and coverage-tool wording.
Out of scope:
- Adding Coverlet.
- Changing the production coverage collector.
- Refactoring defensive branches only to improve coverage.
## Options
### Tiktoken Entity-Hint Branches
Cover scalar, numeric, `value`, `same_as`, blank, null, and empty-map entity hint front matter shapes through the real Tiktoken graph flow.
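The boundary shapes listed above correspond to front matter entries like the following (values are illustrative; the shapes are what matters):

```yaml
entity_hints:
  - RDF                  # scalar shape
  - 42                   # numeric shape
  - value: SPARQL        # `value` key shape
  - label: dotNetRDF     # `same_as` snake_case key shape
    same_as: https://github.com/dotnetrdf/dotnetrdf
  - ""                   # blank shape
  - ~                    # null shape
  - {}                   # empty-map shape
```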
### Parser And Converter Boundary Branches
Cover empty front matter, null metadata values, BOM-only source, blank YAML keys, default converter paths, media type override trimming, unsupported directory entries, and no-extension files.
### Query Search Merge Branches
Cover duplicate SPARQL result rows produced by repeated optional values, proving the search service keeps one caller-visible article result.
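The merge behavior under test can be sketched generically: when a repeated optional value fans one article out into several rows, the search layer keeps a single caller-visible result per article. This Python fragment is an illustrative sketch of that invariant, not the library's implementation; the `merge_rows` function and the row data are made up.

```python
def merge_rows(rows):
    """Collapse SPARQL-style result rows so each article IRI appears once.

    Later duplicates only fill in values the first-seen row was missing,
    as repeated OPTIONAL bindings would produce.
    """
    merged = {}
    for row in rows:
        article = row["article"]
        if article not in merged:
            merged[article] = dict(row)
        else:
            for key, value in row.items():
                merged[article].setdefault(key, value)
    return list(merged.values())

# Two rows for the same article, produced by a repeated optional value.
rows = [
    {"article": "urn:a1", "title": "Graphs"},
    {"article": "urn:a1", "title": "Graphs", "tag": "rdf"},
    {"article": "urn:a2", "title": "Notes"},
]
results = merge_rows(rows)  # one caller-visible result per article
```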
## Recommendation
Use all three categories because they are caller-visible boundary conditions and map directly to uncovered coverage report lines. Avoid direct private-method testing or production refactors unless a test exposes a real behavior bug.
## Result
The implemented tests raised coverage from 95.39% line / 83.28% branch to 96.30% line / 85.23% branch, with 77 tests passing.
Raise branch coverage above the 83.28% baseline with meaningful boundary tests for the current Tiktoken graph extraction, Markdown parsing, converter, and search behavior. Keep coverage collection on `Microsoft.Testing.Extensions.CodeCoverage`; Cobertura remains only the report output format.
- Coverlet packages are not referenced by the solution.
## Already Failing Tests
- [x] None known.
## Ordered Steps
- [x] Add Tiktoken entity-hint boundary tests for scalar, numeric, `value`, `same_as`, blank, null, and empty-map shapes.
- [x] Add parser boundary tests for empty front matter, null metadata values, BOM-only source, and blank YAML keys.
- [x] Add converter boundary tests for default content conversion, media type override trimming, missing directories, no-extension files, and unsupported files when skipping is disabled.
- [x] Add search-service merge test for repeated optional SPARQL rows.
- [x] Run build, test, coverage, format, and diff checks.
- [x] Update README with verified test and coverage numbers.