Commit 4e641de
feat: add tiktoken graph extraction
1 parent 1a50bcb commit 4e641de

File tree: 53 files changed, +2732 −2437 lines changed

Directory.Build.props

Lines changed: 2 additions & 2 deletions

```diff
@@ -25,8 +25,8 @@
     <PackageReadmeFile>README.md</PackageReadmeFile>
     <EnablePackageValidation>true</EnablePackageValidation>
     <Product>Markdown-LD Knowledge Bank</Product>
-    <Version>0.0.1</Version>
-    <PackageVersion>0.0.1</PackageVersion>
+    <Version>0.1.0</Version>
+    <PackageVersion>0.1.0</PackageVersion>
   </PropertyGroup>

   <PropertyGroup Condition="'$(GITHUB_ACTIONS)' == 'true'">
```

Directory.Packages.props

Lines changed: 3 additions & 0 deletions

```diff
@@ -9,6 +9,9 @@
     <PackageVersion Include="Markdig" Version="1.1.2" />
     <PackageVersion Include="Microsoft.Extensions.AI" Version="10.4.1" />
     <PackageVersion Include="Microsoft.Extensions.AI.Abstractions" Version="10.4.1" />
+    <PackageVersion Include="Microsoft.Bcl.Memory" Version="10.0.5" />
+    <PackageVersion Include="Microsoft.ML.Tokenizers" Version="2.0.0" />
+    <PackageVersion Include="Microsoft.ML.Tokenizers.Data.O200kBase" Version="2.0.0" />
     <PackageVersion Include="Microsoft.NET.Test.Sdk" Version="18.4.0" />
     <PackageVersion Include="Microsoft.Testing.Extensions.CodeCoverage" Version="18.6.2" />
     <PackageVersion Include="Shouldly" Version="4.3.0" />
```

README.md

Lines changed: 81 additions & 45 deletions

````diff
@@ -11,7 +11,7 @@

 Markdown-LD Knowledge Bank is a .NET 10 library for turning Markdown knowledge-base files into an in-memory RDF graph that can be searched, queried with read-only SPARQL, exported as RDF, and rendered as a diagram.

-It ports the core idea from [lqdev/markdown-ld-kb](https://github.com/lqdev/markdown-ld-kb) into a C# library package. The runtime is local and in-memory: no localhost server, no Azure Functions host, no database server, and no hosted graph service are required.
+The package is a C# library implementation of the Markdown-LD knowledge graph workflow. The runtime is local and in-memory: no localhost server, no Azure Functions host, no database server, and no hosted graph service are required.

 Use it when you want plain Markdown notes to become a queryable knowledge graph without making your application depend on a specific model provider, graph server, or hosted indexing service.

````
````diff
@@ -21,10 +21,13 @@ Use it when you want plain Markdown notes to become a queryable knowledge graph
 flowchart LR
     Source["Markdown / MDX / text\nJSON / YAML / CSV"] --> Converter["KnowledgeSourceDocumentConverter"]
     Converter --> Parser["MarkdownDocumentParser\n→ MarkdownDocument"]
-    Parser --> Det["DeterministicKnowledgeFactExtractor\n→ entities, assertions"]
-    Parser --> Chat["ChatClientKnowledgeFactExtractor\n(optional IChatClient)"]
-    Det --> Merge["KnowledgeFactMerger\n→ merged KnowledgeExtractionResult"]
+    Parser --> Mode["Extraction mode\nAuto / None / ChatClient / Tiktoken"]
+    Mode --> None["None\nmetadata only"]
+    Mode --> Chat["ChatClientKnowledgeFactExtractor\nIChatClient"]
+    Mode --> Token["Tiktoken token-distance extractor\nMicrosoft.ML.Tokenizers"]
+    None --> Merge["KnowledgeFactMerger\n→ merged KnowledgeExtractionResult"]
     Chat --> Merge
+    Token --> Merge
     Merge --> Builder["KnowledgeGraphBuilder\n→ dotNetRDF in-memory graph"]
     Builder --> Search["SearchAsync"]
     Builder --> Sparql["ExecuteSelectAsync\nExecuteAskAsync"]
````
````diff
@@ -33,15 +36,14 @@ flowchart LR
     Builder --> Export["SerializeTurtle\nSerializeJsonLd"]
 ```

-**Deterministic extraction** produces facts without any network call:
+Extraction is explicit:

-- article identity, title, summary, dates, tags, authors, and topics from YAML front matter
-- heading sections and document identity from Markdown structure
-- Markdown links such as `[SPARQL](https://www.w3.org/TR/sparql11-query/)`
-- optional wikilinks such as `[[RDF]]`
-- optional assertion arrows such as `article --mentions--> RDF`
+- `Auto` uses `IChatClient` when one is supplied, otherwise extracts no facts and reports a diagnostic.
+- `None` builds document metadata only.
+- `ChatClient` builds facts only from structured `Microsoft.Extensions.AI.IChatClient` output.
+- `Tiktoken` builds a local corpus graph from Tiktoken token IDs, section/segment structure, explicit front matter entity hints, and local keyphrase topics using `Microsoft.ML.Tokenizers`.

-**Optional AI extraction** enriches the graph with LLM-produced entities and assertions through `Microsoft.Extensions.AI.IChatClient`. No provider-specific SDK is required in the core library.
+Tiktoken mode is deterministic and network-free. It uses lexical token-distance search rather than semantic embedding search. Its default local weighting is subword TF-IDF; raw term frequency and binary presence are also available. It creates `schema:DefinedTerm` topic nodes, explicit front matter hint entities, and `schema:hasPart` / `schema:about` / `schema:mentions` edges.

 **Graph outputs:**

````
````diff
@@ -75,9 +77,7 @@ using ManagedCode.MarkdownLd.Kb.Pipeline;

 internal static class MinimalGraphDemo
 {
-    private const string SearchTerm = "rdf";
-    private const string NameKey = "name";
-    private const string RdfLabel = "RDF";
+    private const string SearchTerm = "RDF SPARQL Markdown graph";

     private const string ArticleMarkdown = """
         ---
````
````diff
@@ -92,33 +92,18 @@ author:
         # Zero Cost Knowledge Graph

         Markdown-LD Knowledge Bank links [RDF](https://www.w3.org/RDF/) and [SPARQL](https://www.w3.org/TR/sparql11-query/).
-        """;
-
-    private const string SelectFactsQuery = """
-        PREFIX schema: <https://schema.org/>
-        SELECT ?article ?entity ?name WHERE {
-            ?article a schema:Article ;
-                schema:name "Zero Cost Knowledge Graph" ;
-                schema:keywords "markdown" ;
-                schema:mentions ?entity .
-            ?entity schema:name ?name ;
-                schema:sameAs <https://www.w3.org/RDF/> .
-        }
         """;

     public static async Task RunAsync()
     {
-        var pipeline = new MarkdownKnowledgePipeline();
+        var pipeline = new MarkdownKnowledgePipeline(
+            extractionMode: MarkdownKnowledgeExtractionMode.Tiktoken);

         var result = await pipeline.BuildFromMarkdownAsync(ArticleMarkdown);

-        var graphRows = await result.Graph.ExecuteSelectAsync(SelectFactsQuery);
-        var search = await result.Graph.SearchAsync(SearchTerm);
+        var search = await result.Graph.SearchByTokenDistanceAsync(SearchTerm);

-        Console.WriteLine(graphRows.Rows.Count);
-        Console.WriteLine(search.Rows.Any(row =>
-            row.Values.TryGetValue(NameKey, out var name) &&
-            name == RdfLabel));
+        Console.WriteLine(search[0].Text);
     }
 }
 ```
````
````diff
@@ -161,7 +146,7 @@ The library uses `urn:managedcode:markdown-ld-kb:/` as an internal default base

 ## Optional AI Extraction

-Optional AI extraction enriches the deterministic Markdown graph with entities and assertions returned by an injected `Microsoft.Extensions.AI.IChatClient`. The package stays provider-neutral: it does not reference OpenAI, Azure OpenAI, Anthropic, or any other model-specific SDK. If no chat client is provided, the pipeline still runs fully locally and builds the graph from Markdown/front matter/link extraction only.
+AI extraction builds graph facts from entities and assertions returned by an injected `Microsoft.Extensions.AI.IChatClient`. The package stays provider-neutral: it does not reference OpenAI, Azure OpenAI, Anthropic, or any other model-specific SDK. If no chat client is provided, `Auto` mode extracts no facts and reports a diagnostic; choose `Tiktoken` mode explicitly for local token-distance extraction.

 ```csharp
 using ManagedCode.MarkdownLd.Kb.Pipeline;
````
````diff
@@ -204,7 +189,48 @@ ASK WHERE {
 }
 ```

-The built-in chat extractor requests structured output through `GetResponseAsync<T>()`, normalizes the returned entity/assertion payload, merges it with deterministic facts, and then builds the same in-memory RDF graph used by search and SPARQL. Tests use one local non-network `IChatClient` implementation so the full extraction-to-graph flow is covered without a live model.
+The built-in chat extractor requests structured output through `GetResponseAsync<T>()`, normalizes the returned entity/assertion payload, and then builds the same in-memory RDF graph used by search and SPARQL. Tests use one local non-network `IChatClient` implementation so the full extraction-to-graph flow is covered without a live model.
+
+## Local Tiktoken Extraction
+
+```csharp
+using ManagedCode.MarkdownLd.Kb.Pipeline;
+
+internal static class TiktokenGraphDemo
+{
+    private const string Markdown = """
+        The observatory stores telescope images in a cold archive near the mountain lab.
+        River sensors use cached forecasts to protect orchards from frost.
+        """;
+
+    public static async Task RunAsync()
+    {
+        var pipeline = new MarkdownKnowledgePipeline(
+            extractionMode: MarkdownKnowledgeExtractionMode.Tiktoken);
+
+        var result = await pipeline.BuildFromMarkdownAsync(Markdown);
+        var matches = await result.Graph.SearchByTokenDistanceAsync("telescope image archive");
+
+        Console.WriteLine(matches[0].Text);
+    }
+}
+```
+
+Tiktoken mode uses `Microsoft.ML.Tokenizers` to encode section/paragraph text into token IDs, builds normalized sparse vectors, and calculates Euclidean distance. The default weighting is `SubwordTfIdf`, fitted over the current build corpus and reused for query vectors. `TermFrequency` uses raw token counts, and `Binary` uses token presence/absence.
+
+Tiktoken mode also builds a corpus graph:
+
+- heading or loose document sections and paragraph/line segments become `schema:CreativeWork` nodes
+- local Unicode word n-gram keyphrases become `schema:DefinedTerm` topic nodes
+- explicit front matter `entity_hints` / `entityHints` become graph entities with stable hash IDs and preserved `sameAs` links
+- containment uses `schema:hasPart`
+- segment/topic membership uses `schema:about`
+- document/entity-hint membership uses `schema:mentions`
+- segment similarity uses `kb:relatedTo`
+
+The local lexical design follows [Multilingual Search with Subword TF-IDF](https://arxiv.org/abs/2209.14281): use subword tokenization plus TF-IDF instead of manually curated tokenization, stop words, or stemming rules. It is designed for same-language lexical retrieval. Cross-language semantic retrieval requires a translation or embedding layer owned by the host application.
+
+The current test corpus validates top-1 token-distance retrieval across English, Ukrainian, French, and German. Same-language queries hit the expected segment at `10/10` for each language in the test corpus. Sampled cross-language aligned hits stay low at `3/40`, which matches the lexical design.

 ## Query The Graph

````
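The `SubwordTfIdf` weighting and Euclidean token-distance ranking described above are plain vector math. A minimal Python sketch follows, with a toy lowercase/whitespace tokenizer standing in for the subword token IDs that `Microsoft.ML.Tokenizers` would produce; all function names here are illustrative, not the library's API:

```python
import math
from collections import Counter

def tokenize(text):
    # Toy stand-in for a Tiktoken encoder: real subword token IDs come from
    # Microsoft.ML.Tokenizers; lowercase whitespace words are enough to show the math.
    return text.lower().split()

def fit_idf(corpus):
    # idf(t) = ln(N / df(t)), fitted over the build corpus and reused for queries.
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    return {t: math.log(len(corpus) / df[t]) for t in df}

def tfidf_vector(tokens, idf):
    # Normalized sparse subword TF-IDF vector (unit Euclidean length).
    weights = {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}

def euclidean(a, b):
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b)))

segments = [
    "The observatory stores telescope images in a cold archive near the mountain lab.",
    "River sensors use cached forecasts to protect orchards from frost.",
]
corpus = [tokenize(s) for s in segments]
idf = fit_idf(corpus)
vectors = [tfidf_vector(tokens, idf) for tokens in corpus]

query = tfidf_vector(tokenize("telescope image archive"), idf)
best = min(range(len(segments)), key=lambda i: euclidean(query, vectors[i]))
print(segments[best])  # the telescope/archive segment ranks closest
```

In these terms, `TermFrequency` corresponds to skipping the idf multiplication and `Binary` to replacing counts with presence/absence before normalization.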
````diff
@@ -321,6 +347,9 @@ var rows = await shared.Graph.SearchAsync("rdf");
 | `SparqlQueryResult` | Query result with `Variables` and `Rows` of `SparqlRow`. |
 | `KnowledgeSourceDocumentConverter` | Converts files and directories into pipeline-ready source documents. |
 | `ChatClientKnowledgeFactExtractor` | AI extraction adapter behind `IChatClient`. |
+| `TiktokenKnowledgeGraphOptions` | Options for explicit Tiktoken token-distance extraction. |
+| `TokenVectorWeighting` | Local token weighting mode: `SubwordTfIdf`, `TermFrequency`, or `Binary`. |
+| `TokenDistanceSearchResult` | Search result returned by `SearchByTokenDistanceAsync`. |

 ## Markdown Conventions

````
````diff
@@ -354,9 +383,9 @@ Recognized front matter keys:
 | `tags` / `keywords` | `schema:keywords` | list |
 | `about` | `schema:about` | list |
 | `canonicalUrl` / `canonical_url` | low-level Markdown parser document identity; use `KnowledgeDocumentConversionOptions.CanonicalUri` for pipeline identity | string (URL) |
-| `entity_hints` / `entityHints` | entity hints | list of `{label, type, sameAs}` |
+| `entity_hints` / `entityHints` | explicit graph entities in `Tiktoken` mode; parsed as front matter metadata otherwise | list of `{label, type, sameAs}` |

-Optional advanced predicate forms:
+Predicate normalization for explicit chat/token facts:

 - `mentions` becomes `schema:mentions`
 - `about` becomes `schema:about`
````
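The normalization rules listed in these hunks can be sketched as a small lookup-then-preserve function. This is an illustration only: it implements just the rules quoted in the diff (bare `mentions`/`about`, the preserved `schema:`/`kb:`/`prov:`/`rdf:` prefixes, and valid absolute URIs), and the real library defines more cases:

```python
from urllib.parse import urlparse

# Prefixes the README says are preserved as-is.
KNOWN_PREFIXES = {"schema", "kb", "prov", "rdf"}
# Bare predicate names the README says are normalized to schema.org terms.
BARE_TO_SCHEMA = {"mentions": "schema:mentions", "about": "schema:about"}

def normalize_predicate(raw):
    """Sketch of predicate normalization; returns None for predicates this
    sketch does not recognize (the library handles more forms than shown)."""
    raw = raw.strip()
    if raw in BARE_TO_SCHEMA:
        return BARE_TO_SCHEMA[raw]
    prefix, _, local = raw.partition(":")
    if local and prefix in KNOWN_PREFIXES:
        return raw  # prefixed predicates such as kb:relatedTo are preserved
    parsed = urlparse(raw)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return raw  # valid absolute predicate URIs are preserved
    return None

print(normalize_predicate("mentions"))      # schema:mentions
print(normalize_predicate("kb:relatedTo"))  # kb:relatedTo
print(normalize_predicate("https://schema.org/citation"))
```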
````diff
@@ -367,16 +396,21 @@ Optional advanced predicate forms:
 - prefixed predicates such as `schema:mentions`, `kb:relatedTo`, `prov:wasDerivedFrom`, and `rdf:type` are preserved
 - absolute predicate URIs are preserved when valid

+Markdown links, wikilinks, and arrow assertions are not implicitly converted into graph facts. Use `IChatClient` extraction or explicit `Tiktoken` mode when you want body content to produce graph nodes and edges.
+
 ## Architecture Choices

 - `Markdig` parses Markdown structure.
 - `YamlDotNet` parses front matter.
 - `dotNetRDF` builds the RDF graph, runs local SPARQL, and serializes Turtle/JSON-LD.
 - `Microsoft.Extensions.AI.IChatClient` is the only AI boundary in the core pipeline.
-- Embeddings are not required for the current graph/search flow.
-- Microsoft Agent Framework is treated as host-level orchestration for future workflows, not a core package dependency.
+- `Microsoft.ML.Tokenizers` powers the explicit Tiktoken token-distance mode.
+- Subword TF-IDF is the default local token weighting because it downweights corpus-common tokens without adding language-specific preprocessing or model runtime dependencies.
+- Local topic graph construction uses Unicode word n-gram keyphrases and RDF `schema:DefinedTerm`, `schema:hasPart`, and `schema:about` edges.
+- Embeddings are not required for the current graph/search flow; Tiktoken mode uses token IDs, not embedding vectors.
+- Microsoft Agent Framework is treated as host-level orchestration, not a core package dependency.

-See [docs/Architecture.md](docs/Architecture.md), [ADR-0001](docs/ADR/ADR-0001-rdf-sparql-library.md), and [ADR-0002](docs/ADR/ADR-0002-llm-extraction-ichatclient.md).
+See [docs/Architecture.md](docs/Architecture.md), [ADR-0001](docs/ADR/ADR-0001-rdf-sparql-library.md), [ADR-0002](docs/ADR/ADR-0002-llm-extraction-ichatclient.md), and [ADR-0003](docs/ADR/ADR-0003-tiktoken-extraction-mode.md).

 ## Inspiration And Attribution

````
````diff
@@ -388,7 +422,7 @@ This project is inspired by Luis Quintanilla's Markdown-LD / AI Memex work:
 - [W3C SPARQL Federated Query](https://github.com/w3c/sparql-federated-query) - SPARQL federation reference material
 - [dotNetRDF](https://github.com/dotnetrdf/dotnetrdf) - RDF/SPARQL engine used by this C# implementation

-The original repository is kept as a read-only submodule under `external/lqdev-markdown-ld-kb`. This package ports the technology and API direction into a reusable .NET library instead of copying the Python repository layout.
+The upstream reference repository is kept as a read-only submodule under `external/lqdev-markdown-ld-kb`.

 ## Development

````
````diff
@@ -400,10 +434,12 @@ dotnet format MarkdownLd.Kb.slnx --verify-no-changes
 dotnet test --solution MarkdownLd.Kb.slnx --configuration Release -- --coverage --coverage-output-format cobertura --coverage-output "$PWD/TestResults/TUnitCoverage/coverage.cobertura.xml" --coverage-settings "$PWD/CodeCoverage.runsettings"
 ```

-Current verification baseline:
+Coverage is collected through `Microsoft.Testing.Extensions.CodeCoverage`. Cobertura is the XML output format used for line and branch reporting; the test project does not reference Coverlet.
+
+Current verification:

-- tests: 70 passed, 0 failed
-- line coverage: 95.93%
-- branch coverage: 84.55%
+- tests: 77 passed, 0 failed
+- line coverage: 96.30%
+- branch coverage: 85.23%
 - target framework: .NET 10
 - package version: 0.0.1
````
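The line and branch percentages quoted in this hunk live as fractional `line-rate` / `branch-rate` attributes on the root `<coverage>` element of the Cobertura XML. A small Python sketch shows where they come from, using an inline sample document instead of the real `TestResults/TUnitCoverage/coverage.cobertura.xml` path:

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the coverage.cobertura.xml produced by the test run;
# Cobertura stores both rates as fractions on the root <coverage> element.
sample = '<coverage line-rate="0.9630" branch-rate="0.8523"></coverage>'

root = ET.fromstring(sample)
line_pct = float(root.get("line-rate")) * 100
branch_pct = float(root.get("branch-rate")) * 100
print(f"line coverage: {line_pct:.2f}%")    # line coverage: 96.30%
print(f"branch coverage: {branch_pct:.2f}%")  # branch coverage: 85.23%
```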
branch-coverage-improvement.brainstorm.md

Lines changed: 41 additions & 0 deletions

````diff
@@ -0,0 +1,41 @@
+# Branch Coverage Improvement Brainstorm
+
+## Problem
+
+The verified coverage baseline for the Tiktoken graph extraction work was 95.39% line coverage and 83.28% branch coverage. The branch number was low because several parser, converter, search, and Tiktoken entity-hint boundary paths were exercised only on their common paths.
+
+## Scope
+
+In scope:
+
+- Add meaningful tests for public or flow-level behavior.
+- Use coverage XML to target uncovered branch paths.
+- Update README with verified coverage numbers and coverage-tool wording.
+
+Out of scope:
+
+- Adding Coverlet.
+- Changing the production coverage collector.
+- Refactoring defensive branches only to improve coverage.
+
+## Options
+
+### Tiktoken Entity-Hint Branches
+
+Cover scalar, numeric, `value`, `same_as`, blank, null, and empty-map entity hint front matter shapes through the real Tiktoken graph flow.
+
+### Parser And Converter Boundary Branches
+
+Cover empty front matter, null metadata values, BOM-only source, blank YAML keys, default converter paths, media type override trimming, unsupported directory entries, and no-extension files.
+
+### Query Search Merge Branches
+
+Cover duplicate SPARQL result rows produced by repeated optional values, proving the search service keeps one caller-visible article result.
+
+## Recommendation
+
+Use all three categories because they are caller-visible boundary conditions and map directly to uncovered coverage report lines. Avoid direct private-method testing or production refactors unless a test exposes a real behavior bug.
+
+## Result
+
+The implemented tests raised coverage from 95.39% line / 83.28% branch to 96.30% line / 85.23% branch, with 77 tests passing.
````
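The duplicate-row condition described under Query Search Merge Branches can be sketched as a first-wins merge keyed on the article binding. The row shape and key name here are hypothetical stand-ins for the library's actual `SparqlRow` types:

```python
def merge_rows(rows, key="article"):
    # Repeated OPTIONAL values can yield several SPARQL rows for one article;
    # keep the first row seen per article so callers get one result each.
    seen = set()
    merged = []
    for row in rows:
        k = row.get(key)
        if k in seen:
            continue
        seen.add(k)
        merged.append(row)
    return merged

rows = [
    {"article": "urn:a1", "name": "Zero Cost Knowledge Graph"},
    {"article": "urn:a1", "name": "Zero Cost Knowledge Graph"},  # duplicate
]
print(len(merge_rows(rows)))  # 1
```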
Lines changed: 47 additions & 0 deletions

````diff
@@ -0,0 +1,47 @@
+# Branch Coverage Improvement Plan
+
+Chosen brainstorm: `branch-coverage-improvement.brainstorm.md`
+
+## Goal And Scope
+
+Raise branch coverage above the 83.28% baseline with meaningful boundary tests for the current Tiktoken graph extraction, Markdown parsing, converter, and search behavior. Keep coverage collection on `Microsoft.Testing.Extensions.CodeCoverage`; Cobertura remains only the report output format.
+
+## Baseline
+
+- Full test suite baseline: 73 passed, 0 failed.
+- Coverage baseline: line 95.39%, branch 83.28%.
+- Coverage collector: `Microsoft.Testing.Extensions.CodeCoverage`.
+- Coverlet packages are not referenced by the solution.
+
+## Already Failing Tests
+
+- [x] None known.
+
+## Ordered Steps
+
+- [x] Add Tiktoken entity-hint boundary tests for scalar, numeric, `value`, `same_as`, blank, null, and empty-map shapes.
+- [x] Add parser boundary tests for empty front matter, null metadata values, BOM-only source, and blank YAML keys.
+- [x] Add converter boundary tests for default content conversion, media type override trimming, missing directories, no-extension files, and unsupported files when skipping is disabled.
+- [x] Add search-service merge test for repeated optional SPARQL rows.
+- [x] Run build, test, coverage, format, and diff checks.
+- [x] Update README with verified test and coverage numbers.
+
+## Testing Methodology
+
+The added tests exercise public flows:
+
+```mermaid
+flowchart LR
+    Markdown["Boundary Markdown/front matter"] --> Parser["Markdown parser"]
+    Parser --> Pipeline["Tiktoken graph pipeline"]
+    Pipeline --> Graph["In-memory RDF graph"]
+    Graph --> Query["SPARQL/search assertions"]
+    Converter["Source converter"] --> Pipeline
+```
+
+## Result
+
+- Final tests: 77 passed, 0 failed.
+- Final line coverage: 96.30%.
+- Final branch coverage: 85.23%.
+- Branch coverage increase: 83.28% to 85.23%.
````

docs/ADR/ADR-0001-rdf-sparql-library.md

Lines changed: 9 additions & 3 deletions

````diff
@@ -55,8 +55,13 @@ Key points:
 ```mermaid
 flowchart LR
     Markdown["Markdown documents"] --> Parser["Markdig + YamlDotNet parser"]
-    Parser --> Extractor["Deterministic + IChatClient extractors"]
-    Extractor --> Builder["Graph builder"]
+    Parser --> Extractor["Explicit extraction modes"]
+    Extractor --> Chat["IChatClient extractor"]
+    Extractor --> Token["Tiktoken extractor"]
+    Extractor --> None["No extractor"]
+    Chat --> Builder["Graph builder"]
+    Token --> Builder
+    None --> Builder
     Builder --> DotNetRdf["dotNetRDF graph"]
     DotNetRdf --> Sparql["Local SPARQL execution"]
     DotNetRdf --> Turtle["Turtle writer"]
@@ -146,7 +151,8 @@ Mitigations:
 - Serialize the graph and parse/inspect the output.
 - Negative flows:
   - Reject mutating SPARQL operations.
-  - Ignore malformed deterministic assertion syntax.
+  - Default no-extractor mode.
+  - Explicit Tiktoken token-distance mode.
 - Edge flows:
   - Empty Markdown input.
   - Duplicate entity mentions and assertions.
````

docs/ADR/ADR-0002-llm-extraction-ichatclient.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -86,7 +86,7 @@ flowchart LR
 Mitigations:

 - Depend on abstractions only.
-- Keep deterministic extraction available for Markdown-native cues and non-network tests.
+- Keep non-network tests around the `IChatClient` boundary and the explicit Tiktoken token-distance mode.
 - Document provider/runtime assumptions in future app-level docs.

 ## Verification
````
