Commit fbedc86

Add chunked extraction and reorganize source tree
1 parent 0bf6a0d commit fbedc86

86 files changed

Lines changed: 2694 additions & 829 deletions

Directory.Build.props

Lines changed: 2 additions & 2 deletions
@@ -25,8 +25,8 @@
 <PackageReadmeFile>README.md</PackageReadmeFile>
 <EnablePackageValidation>true</EnablePackageValidation>
 <Product>Markdown-LD Knowledge Bank</Product>
-<Version>0.1.3</Version>
-<PackageVersion>0.1.3</PackageVersion>
+<Version>0.1.5</Version>
+<PackageVersion>0.1.5</PackageVersion>
 </PropertyGroup>
 
 <PropertyGroup Condition="'$(GITHUB_ACTIONS)' == 'true'">

README.md

Lines changed: 58 additions & 4 deletions
@@ -74,6 +74,27 @@ For local repository development:
 dotnet add reference ./src/MarkdownLd.Kb/MarkdownLd.Kb.csproj
 ```
 
+## Project Structure
+
+The production source tree now follows feature-oriented slices instead of a mostly flat technical grouping:
+
+- `src/MarkdownLd.Kb/Documents`
+`Models`, `Parsing`, and `Chunking`
+- `src/MarkdownLd.Kb/Extraction`
+`Chat`, `Cache`, and `Processing`
+- `src/MarkdownLd.Kb/Pipeline`
+orchestration-only files such as `MarkdownKnowledgePipeline`
+- `src/MarkdownLd.Kb/Graph`
+`Build` and `Runtime`
+- `src/MarkdownLd.Kb/Tokenization`
+local Tiktoken graph extraction
+- `src/MarkdownLd.Kb/Query`
+`Search`, `Sparql`, and `NaturalLanguage`
+- `src/MarkdownLd.Kb/Rdf`
+low-level RDF helpers and serialization
+
+This layout mirrors [docs/Architecture.md](docs/Architecture.md) and keeps orchestration separate from parsing, extraction, graph runtime, and query capabilities.
+
 ## Minimal Example
 
 ```csharp
@@ -209,6 +230,8 @@ Entities with the same `schema:sameAs` target are merged before assertions are e
 
 AI extraction builds graph facts from entities and assertions returned by an injected `Microsoft.Extensions.AI.IChatClient`. The package stays provider-neutral: it does not reference OpenAI, Azure OpenAI, Anthropic, or any other model-specific SDK. If no chat client is provided, `Auto` mode extracts no facts and reports a diagnostic; choose `Tiktoken` mode explicitly for local token-distance extraction.
 
+Chat extraction is chunk-based. The pipeline parses Markdown into deterministic chunks, sends each chunk through the structured extractor in order, and merges the resulting facts into one canonical graph. Optional cache reuse can be enabled through `MarkdownKnowledgePipelineOptions.ExtractionCache`.
+
 ```csharp
 using ManagedCode.MarkdownLd.Kb.Pipeline;
 using Microsoft.Extensions.AI;
@@ -250,7 +273,7 @@ ASK WHERE {
 }
 ```
 
-The built-in chat extractor requests structured output through `GetResponseAsync<T>()`, normalizes the returned entity/assertion payload, and then builds the same in-memory RDF graph used by search and SPARQL. Tests use one local non-network `IChatClient` implementation so the full extraction-to-graph flow is covered without a live model.
+The built-in chat extractor requests structured output through `GetResponseAsync<T>()`, normalizes the returned entity/assertion payload, and then builds the same in-memory RDF graph used by search and SPARQL. Tests use one local non-network `IChatClient` implementation so the full extraction-to-graph flow is covered without a live model. When cache reuse is enabled, the cache key includes document identity, chunk fingerprints, chunker profile, prompt version, and model identity so stale reuse stays explicit and controllable.
 
 ## Local Tiktoken Extraction
 
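Note on the README hunk above: the cache key it describes (document identity, chunk fingerprints, chunker profile, prompt version, model identity) could be composed roughly as follows. This is a minimal sketch, not the package's implementation. `ExtractionCacheKeyParts`, `ExtractionCacheKeySketch`, the SHA-256 hashing, and the newline separator are all assumptions made for illustration; only the five components come from the README text.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical record holding the five components the README lists for the cache key.
public sealed record ExtractionCacheKeyParts(
    string DocumentId,
    IReadOnlyList<string> ChunkFingerprints,
    string ChunkerProfile,
    string PromptVersion,
    string ModelId);

public static class ExtractionCacheKeySketch
{
    // Assumed fingerprint scheme: uppercase hex SHA-256 of the chunk's UTF-8 text.
    public static string Fingerprint(string chunkText) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(chunkText)));

    // Joins every component in a fixed order and hashes the result, so a change to
    // any single component produces a different key. Assumes no component contains
    // a newline, which is used here as the separator.
    public static string Compose(ExtractionCacheKeyParts parts)
    {
        var builder = new StringBuilder();
        builder.Append(parts.DocumentId).Append('\n');
        foreach (var fingerprint in parts.ChunkFingerprints)
        {
            builder.Append(fingerprint).Append('\n');
        }
        builder.Append(parts.ChunkerProfile).Append('\n');
        builder.Append(parts.PromptVersion).Append('\n');
        builder.Append(parts.ModelId);
        return Fingerprint(builder.ToString());
    }
}
```

Under this scheme, bumping the prompt version or switching models changes the key, so previously cached extraction results are never matched for the new configuration.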
@@ -495,9 +518,39 @@ Recognized front matter keys:
 | `author` | `schema:author` | string or list |
 | `tags` / `keywords` | `schema:keywords` | list |
 | `about` | `schema:about` | list |
+| `entryType` / `entry_type` | compatibility metadata plus optional additional `schema.org` article subtype typing | string or list |
+| `sourceProject` / `source_project` | `kb:sourceProject` | string or list |
 | `canonicalUrl` / `canonical_url` | low-level Markdown parser document identity; use `KnowledgeDocumentConversionOptions.CanonicalUri` for pipeline identity | string (URL) |
 | `entity_hints` / `entityHints` | explicit graph entities in `Tiktoken` mode; parsed as front matter metadata otherwise | list of `{label, type, sameAs}` |
 
+Generic RDF front matter mapping is also supported for richer document metadata beyond article-only defaults:
+
+| Key | Purpose | Type |
+|---|---|---|
+| `rdf_prefixes` / `rdfPrefixes` | additional vocabulary prefixes | object |
+| `rdf_types` / `rdfTypes` | additional RDF types for the document node | string or list |
+| `rdf_properties` / `rdfProperties` | arbitrary predicate/value mappings for the document node | object |
+
+Example:
+
+```yaml
+rdf_prefixes:
+  dcterms: http://purl.org/dc/terms/
+  skos: http://www.w3.org/2004/02/skos/core#
+rdf_types:
+  - schema:HowTo
+  - skos:ConceptScheme
+rdf_properties:
+  schema:isPartOf:
+    id: https://example.com/projects/ai-memex
+  dcterms:issued:
+    value: 2026-04-21
+    datatype: xsd:date
+  skos:prefLabel: Flexible Graph Spec
+```
+
+Scalar values become literals by default. Object values may use `id` to emit a URI node or `value` plus optional `datatype` to emit a typed literal. Unknown prefixes fail explicitly instead of being silently guessed.
+
 Predicate normalization for explicit chat/token facts:
 
 - `mentions` becomes `schema:mentions`
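Note on the README hunk above: the three value rules for `rdf_properties` (scalar to plain literal, `id` to URI node, `value` plus optional `datatype` to typed literal) and the fail-fast prefix handling could look roughly like this. It is a sketch under stated assumptions, not the package's code. `RdfValue`, `UriValue`, `LiteralValue`, and `FrontMatterRdfSketch` are hypothetical names, the sketch takes `id` as an absolute URI without expanding it, and it expects the caller to supply every prefix it uses, including `xsd`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node shapes for illustration only.
public abstract record RdfValue;
public sealed record UriValue(string Uri) : RdfValue;
public sealed record LiteralValue(string Text, string? DatatypeUri) : RdfValue;

public static class FrontMatterRdfSketch
{
    // Expands "prefix:local" against the declared prefixes.
    // An undeclared prefix throws instead of being guessed.
    public static string Expand(string name, IReadOnlyDictionary<string, string> prefixes)
    {
        var separator = name.IndexOf(':');
        if (separator <= 0)
        {
            throw new FormatException($"'{name}' is not a prefixed name.");
        }

        var prefix = name[..separator];
        if (!prefixes.TryGetValue(prefix, out var ns))
        {
            throw new InvalidOperationException($"Unknown prefix '{prefix}'.");
        }

        return ns + name[(separator + 1)..];
    }

    // Maps one rdf_properties value to a node using the three rules from the README.
    public static RdfValue Map(object value, IReadOnlyDictionary<string, string> prefixes)
    {
        if (value is IReadOnlyDictionary<string, string> obj)
        {
            if (obj.TryGetValue("id", out var id))
            {
                return new UriValue(id); // object with `id` -> URI node
            }

            if (obj.TryGetValue("value", out var text))
            {
                // object with `value` -> literal, typed only when `datatype` is present
                string? datatype = obj.TryGetValue("datatype", out var dt) ? Expand(dt, prefixes) : null;
                return new LiteralValue(text, datatype);
            }

            throw new FormatException("Object values need either 'id' or 'value'.");
        }

        // scalar -> plain literal
        return new LiteralValue(Convert.ToString(value) ?? string.Empty, null);
    }
}
```

Throwing on an undeclared prefix mirrors the README's statement that unknown prefixes fail explicitly, which keeps a typo in front matter from silently producing a wrong IRI.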
@@ -518,6 +571,7 @@ Markdown links, wikilinks, and arrow assertions are not implicitly converted int
 - `dotNetRDF` builds the RDF graph, runs local SPARQL, and serializes Turtle/JSON-LD.
 - `dotNetRdf.Shacl` validates built graphs with default or caller-supplied SHACL shapes.
 - `Microsoft.Extensions.AI.IChatClient` is the only AI boundary in the core pipeline.
+- The production source tree is organized by feature slices: Documents, Extraction, Pipeline, Graph, Tokenization, Query, and Rdf.
 - `Microsoft.ML.Tokenizers` powers the explicit Tiktoken token-distance mode.
 - Subword TF-IDF is the default local token weighting because it downweights corpus-common tokens without adding language-specific preprocessing or model runtime dependencies.
 - Local topic graph construction uses Unicode word n-gram keyphrases and RDF `schema:DefinedTerm`, `schema:hasPart`, and `schema:about` edges.
@@ -552,8 +606,8 @@ Coverage is collected through `Microsoft.Testing.Extensions.CodeCoverage`. Cober
 
 Current verification:
 
-- tests: 87 passed, 0 failed
-- line coverage: 96.76%
-- branch coverage: 87.12%
+- tests: 109 passed, 0 failed
+- line coverage: 95.97%
+- branch coverage: 84.01%
 - target framework: .NET 10
 - package version: 0.0.1
