The production source tree now follows feature-oriented slices instead of a mostly flat technical grouping:

- `src/MarkdownLd.Kb/Documents`
  `Models`, `Parsing`, and `Chunking`
- `src/MarkdownLd.Kb/Extraction`
  `Chat`, `Cache`, and `Processing`
- `src/MarkdownLd.Kb/Pipeline`
  orchestration-only files such as `MarkdownKnowledgePipeline`
- `src/MarkdownLd.Kb/Graph`
  `Build` and `Runtime`
- `src/MarkdownLd.Kb/Tokenization`
  local Tiktoken graph extraction
- `src/MarkdownLd.Kb/Query`
  `Search`, `Sparql`, and `NaturalLanguage`
- `src/MarkdownLd.Kb/Rdf`
  low-level RDF helpers and serialization

This layout mirrors [docs/Architecture.md](docs/Architecture.md) and keeps orchestration separate from parsing, extraction, graph runtime, and query capabilities.

## Minimal Example

```csharp
@@ -209,6 +230,8 @@ Entities with the same `schema:sameAs` target are merged before assertions are e

AI extraction builds graph facts from entities and assertions returned by an injected `Microsoft.Extensions.AI.IChatClient`. The package stays provider-neutral: it does not reference OpenAI, Azure OpenAI, Anthropic, or any other model-specific SDK. If no chat client is provided, `Auto` mode extracts no facts and reports a diagnostic; choose `Tiktoken` mode explicitly for local token-distance extraction.

Chat extraction is chunk-based. The pipeline parses Markdown into deterministic chunks, sends each chunk through the structured extractor in order, and merges the resulting facts into one canonical graph. Optional cache reuse can be enabled through `MarkdownKnowledgePipelineOptions.ExtractionCache`.

```csharp
using ManagedCode.MarkdownLd.Kb.Pipeline;
using Microsoft.Extensions.AI;
@@ -250,7 +273,7 @@ ASK WHERE {
}
```
The built-in chat extractor requests structured output through `GetResponseAsync<T>()`, normalizes the returned entity/assertion payload, and then builds the same in-memory RDF graph used by search and SPARQL. Tests use one local non-network `IChatClient` implementation so the full extraction-to-graph flow is covered without a live model. When cache reuse is enabled, the cache key includes document identity, chunk fingerprints, chunker profile, prompt version, and model identity so stale reuse stays explicit and controllable.
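
The key composition described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's cache implementation: `ExtractionCacheKeyParts` and `ExtractionCacheKey.Compute` are hypothetical names, and hashing the components with SHA-256 is just one reasonable way to combine them.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical type for illustration; not part of the MarkdownLd.Kb API.
public sealed record ExtractionCacheKeyParts(
    string DocumentId,
    IReadOnlyList<string> ChunkFingerprints,
    string ChunkerProfile,
    string PromptVersion,
    string ModelId);

// Hypothetical helper: combines all key components into one stable hex string.
public static class ExtractionCacheKey
{
    // Hashes every component, so changing any single one yields a different key.
    public static string Compute(ExtractionCacheKeyParts parts)
    {
        var builder = new StringBuilder();
        Append(builder, parts.DocumentId);
        Append(builder, parts.ChunkerProfile);
        Append(builder, parts.PromptVersion);
        Append(builder, parts.ModelId);
        foreach (var fingerprint in parts.ChunkFingerprints)
        {
            Append(builder, fingerprint);
        }

        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(builder.ToString()));
        return Convert.ToHexString(hash);
    }

    // Length-prefixes each field so ("ab", "c") and ("a", "bc") cannot collide.
    private static void Append(StringBuilder builder, string value)
    {
        builder.Append(value.Length).Append(':').Append(value).Append('|');
    }
}
```

With a key built this way, bumping the prompt version or switching models produces a cache miss instead of silently reusing facts extracted under different conditions.
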

## Local Tiktoken Extraction

@@ -495,9 +518,39 @@ Recognized front matter keys:
| `author` | `schema:author` | string or list |
| `tags` / `keywords` | `schema:keywords` | list |
| `about` | `schema:about` | list |
| `entryType` / `entry_type` | compatibility metadata plus optional additional `schema.org` article subtype typing | string or list |
| `sourceProject` / `source_project` | `kb:sourceProject` | string or list |
| `canonicalUrl` / `canonical_url` | low-level Markdown parser document identity; use `KnowledgeDocumentConversionOptions.CanonicalUri` for pipeline identity | string (URL) |
| `entity_hints` / `entityHints` | explicit graph entities in `Tiktoken` mode; parsed as front matter metadata otherwise | list of `{label, type, sameAs}` |

Generic RDF front matter mapping is also supported for richer document metadata beyond article-only defaults:

| `rdf_types` / `rdfTypes` | additional RDF types for the document node | string or list |
| `rdf_properties` / `rdfProperties` | arbitrary predicate/value mappings for the document node | object |

Example:

```yaml
rdf_prefixes:
  dcterms: http://purl.org/dc/terms/
  skos: http://www.w3.org/2004/02/skos/core#
rdf_types:
  - schema:HowTo
  - skos:ConceptScheme
rdf_properties:
  schema:isPartOf:
    id: https://example.com/projects/ai-memex
  dcterms:issued:
    value: 2026-04-21
    datatype: xsd:date
  skos:prefLabel: Flexible Graph Spec
```
Scalar values become literals by default. Object values may use `id` to emit a URI node or `value` plus optional `datatype` to emit a typed literal. Unknown prefixes fail explicitly instead of being silently guessed.
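
The mapping rules above can be sketched in a few lines. This is a hedged illustration, not the package's parser: `RdfValue`, `UriNode`, `Literal`, and `FrontMatterRdfMapping` are hypothetical names, the caller is assumed to supply every known prefix (including built-in ones such as `schema` and `xsd`), and passing absolute `http(s)` IRIs through unchanged is an assumption of this sketch.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical value model for illustration; not the package's actual types.
public abstract record RdfValue;
public sealed record UriNode(string Uri) : RdfValue;
public sealed record Literal(string Value, string? DatatypeUri = null) : RdfValue;

public static class FrontMatterRdfMapping
{
    // Expands "prefix:local" with the known prefixes and fails on unknown ones.
    public static string Expand(string name, IReadOnlyDictionary<string, string> prefixes)
    {
        if (name.StartsWith("http://", StringComparison.Ordinal) ||
            name.StartsWith("https://", StringComparison.Ordinal))
        {
            return name; // Assumption: absolute IRIs are used as-is.
        }

        var separator = name.IndexOf(':');
        if (separator <= 0)
        {
            throw new FormatException($"'{name}' is not a prefixed name.");
        }

        var prefix = name[..separator];
        if (!prefixes.TryGetValue(prefix, out var ns))
        {
            // Unknown prefixes fail explicitly instead of being guessed.
            throw new KeyNotFoundException($"Unknown prefix '{prefix}' in '{name}'.");
        }

        return ns + name[(separator + 1)..];
    }

    // Scalars become plain literals; objects use `id` or `value` (+ `datatype`).
    public static RdfValue ToValue(object value, IReadOnlyDictionary<string, string> prefixes)
    {
        if (value is IReadOnlyDictionary<string, string> map)
        {
            if (map.TryGetValue("id", out var id))
            {
                return new UriNode(Expand(id, prefixes));
            }

            if (map.TryGetValue("value", out var text))
            {
                var datatype = map.TryGetValue("datatype", out var dt)
                    ? Expand(dt, prefixes)
                    : null;
                return new Literal(text, datatype);
            }

            throw new FormatException("Object values need either 'id' or 'value'.");
        }

        return new Literal(Convert.ToString(value, CultureInfo.InvariantCulture) ?? string.Empty);
    }
}
```

Under these assumptions, the `dcterms:issued` entry from the YAML example would resolve to a literal typed with the expanded `xsd:date` IRI, while `skos:prefLabel` would resolve to a plain literal.
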

Predicate normalization for explicit chat/token facts:

- `mentions` becomes `schema:mentions`
@@ -518,6 +571,7 @@ Markdown links, wikilinks, and arrow assertions are not implicitly converted int
- `dotNetRDF` builds the RDF graph, runs local SPARQL, and serializes Turtle/JSON-LD.
- `dotNetRdf.Shacl` validates built graphs with default or caller-supplied SHACL shapes.
- `Microsoft.Extensions.AI.IChatClient` is the only AI boundary in the core pipeline.
- The production source tree is organized by feature slices: Documents, Extraction, Pipeline, Graph, Tokenization, Query, and Rdf.
- `Microsoft.ML.Tokenizers` powers the explicit Tiktoken token-distance mode.
- Subword TF-IDF is the default local token weighting because it downweights corpus-common tokens without adding language-specific preprocessing or model runtime dependencies.
- Local topic graph construction uses Unicode word n-gram keyphrases and RDF `schema:DefinedTerm`, `schema:hasPart`, and `schema:about` edges.
@@ -552,8 +606,8 @@ Coverage is collected through `Microsoft.Testing.Extensions.CodeCoverage`. Cober