
Commit b79bca8

reranking
1 parent 11ad756 commit b79bca8

13 files changed

Lines changed: 908 additions & 14 deletions

Directory.Build.props

Lines changed: 2 additions & 2 deletions
@@ -25,8 +25,8 @@
     <PackageReadmeFile>README.md</PackageReadmeFile>
     <EnablePackageValidation>true</EnablePackageValidation>
     <Product>Markdown-LD Knowledge Bank</Product>
-    <Version>0.1.1</Version>
-    <PackageVersion>0.1.1</PackageVersion>
+    <Version>0.1.2</Version>
+    <PackageVersion>0.1.2</PackageVersion>
   </PropertyGroup>

   <PropertyGroup Condition="'$(GITHUB_ACTIONS)' == 'true'">
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+# ADR-0005: Hybrid Graph Search Boundary
+
+Date: 2026-04-16
+
+## Status
+
+Accepted
+
+## Context
+
+The library already had two retrieval styles:
+
+- graph-native lexical search over RDF metadata
+- Tiktoken token-distance search for local lexical structure
+
+Neither path solves cross-language mismatch between graph content and user queries. At the same time, the repository rules forbid turning the core library into a hosted vector-search subsystem or tying it to a provider-specific embedding SDK.
+
+## Decision
+
+Add an optional semantic ranked-search boundary that:
+
+- builds an in-memory semantic index from graph-native candidate text
+- uses `Microsoft.Extensions.AI.IEmbeddingGenerator<string, Embedding<float>>`
+- keeps graph results canonical
+- uses semantic results only as fallback or merge inputs
+- excludes `schema:keywords` from canonical ranking
+
+## Boundaries
+
+```mermaid
+flowchart LR
+    Graph["KnowledgeGraph"] --> Canonical["Canonical graph ranking"]
+    Graph --> SemanticIndex["Optional semantic index"]
+    Embedder["IEmbeddingGenerator"] --> SemanticIndex
+    Canonical --> Hybrid["Hybrid merge"]
+    SemanticIndex --> Hybrid
+    Hybrid --> Gateway["Gateway or host app"]
+```
+
+## Consequences
+
+Positive:
+
+- cross-language queries have a provider-neutral recovery path
+- graph-first explainability is preserved
+- the host application keeps ownership of embedding-provider choice
+
+Negative:
+
+- the library now owns a small additional search boundary
+- semantic tests need a deterministic non-network embedding adapter
+
+## Rejected Alternatives
+
+- Semantic-only ranking: rejected because graph must remain canonical.
+- Provider-specific embedding package in the core library: rejected by repository rules.
+- External vector database integration in the library: rejected because infra belongs in the host application.
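The ADR above lists "a deterministic non-network embedding adapter" as a testing consequence. The following is a minimal sketch of what such an adapter could compute, not the repository's actual test double. `HashEmbed` and `Cosine` are hypothetical names, the 64-dimension default is arbitrary, and a real adapter would additionally implement `IEmbeddingGenerator<string, Embedding<float>>`, which is left out here so the example has no package dependency.

```csharp
using System;
using System.Linq;

// Hypothetical helper: hashes lowercase word tokens into a fixed number of buckets.
// The result is deterministic and offline, but NOT semantic; it only gives tests
// stable vectors to rank against.
static float[] HashEmbed(string text, int dimensions = 64)
{
    var vector = new float[dimensions];
    var tokens = text.ToLowerInvariant()
        .Split(new[] { ' ', '\t', '\n', ',', '.', ';', ':' }, StringSplitOptions.RemoveEmptyEntries);

    foreach (var token in tokens)
    {
        // FNV-1a-style hash over UTF-16 code units: stable across processes,
        // unlike string.GetHashCode().
        uint hash = 2166136261;
        unchecked
        {
            foreach (var ch in token)
            {
                hash ^= ch;
                hash *= 16777619;
            }
        }

        vector[hash % (uint)dimensions] += 1f;
    }

    return vector;
}

// Cosine similarity between two vectors of equal length; returns 0 for a zero vector.
static double Cosine(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    return normA == 0 || normB == 0 ? 0 : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

var first = HashEmbed("hybrid graph search");
var second = HashEmbed("hybrid graph search");
var other = HashEmbed("completely unrelated words");

Console.WriteLine(first.SequenceEqual(second)); // True: same input, same vector
Console.WriteLine(Cosine(first, second));
Console.WriteLine(Cosine(first, other));
```

Because a hashing embedder carries no cross-language meaning, it is only suitable for exercising index construction and merge ordering; cross-language recall itself would need a real embedding provider supplied by the host application.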

docs/Architecture.md

Lines changed: 13 additions & 2 deletions
@@ -10,7 +10,7 @@ The upstream reference repository is kept as a read-only submodule at `external/

 The core runtime has no localhost, HTTP server, background service, database server, or hosted API dependency. Callers pass files, directories, or in-memory document content into the library, and the library returns in-memory graph/search/query results.

-The graph/search model does not require semantic embeddings. The AI boundary in the core pipeline is `Microsoft.Extensions.AI.IChatClient` for entity/assertion extraction. The library also exposes an explicit experimental Tiktoken mode that creates lexical sparse vectors from `Microsoft.ML.Tokenizers` token IDs and builds a local corpus graph. Its default weighting is corpus-fitted subword TF-IDF, with raw term frequency and binary presence kept as experimental baselines. Tiktoken mode also creates section/segment structure, local TF-IDF keyphrase topics, and explicit front matter entity hint nodes, but it is not a semantic embedding model. Capability graph rules add deterministic caller-authored entities and edges for groups, related nodes, and next-step nodes so applications can build workflow/capability graphs without relying on a flat document-topic graph. If semantic vector search is added later, it should be a separate optional adapter over `Microsoft.Extensions.AI.IEmbeddingGenerator<,>` or an equivalent small port, with the concrete provider owned by the host app.
+The graph/search model does not require semantic embeddings. The AI boundary in the core pipeline is `Microsoft.Extensions.AI.IChatClient` for entity/assertion extraction. The library also exposes an explicit experimental Tiktoken mode that creates lexical sparse vectors from `Microsoft.ML.Tokenizers` token IDs and builds a local corpus graph. Its default weighting is corpus-fitted subword TF-IDF, with raw term frequency and binary presence kept as experimental baselines. Tiktoken mode also creates section/segment structure, local TF-IDF keyphrase topics, and explicit front matter entity hint nodes, but it is not a semantic embedding model. Capability graph rules add deterministic caller-authored entities and edges for groups, related nodes, and next-step nodes so applications can build workflow/capability graphs without relying on a flat document-topic graph. Cross-language retrieval can now use an optional semantic ranked-search adapter over `Microsoft.Extensions.AI.IEmbeddingGenerator<,>` that builds an in-memory semantic index from graph-native labels, descriptions, and related labels. The graph remains canonical; semantic hits are fallback or merge inputs rather than the source of truth.

 ## System Boundaries

@@ -31,10 +31,12 @@ flowchart LR
     Builder --> Graph["In-memory knowledge graph"]
     Graph --> Sparql["In-memory SPARQL executor API"]
     Graph --> Search["In-memory graph search API"]
+    Graph --> Ranked["Graph / semantic / hybrid ranked search API"]
     Graph --> Focused["Focused graph search API"]
     Graph --> Shacl["SHACL validation API"]
     Graph --> Serializers["Turtle and JSON-LD serializers"]
     Graph --> Merge["Thread-safe graph merge API"]
+    Embedder["Optional IEmbeddingGenerator semantic index"] --> Ranked
     IChatClient["Microsoft.Extensions.AI IChatClient"] --> ChatExtractor
     Tokenizer["Microsoft.ML.Tokenizers Tiktoken"] --> TokenExtractor
     AgentFramework["Future Microsoft Agent Framework orchestration"] -. "wraps IChatClient" .-> IChatClient
@@ -53,6 +55,7 @@ sequenceDiagram
     participant Graph as KnowledgeGraphBuilder
     participant BuiltGraph as KnowledgeGraph
     participant Query as InMemorySparqlExecutor
+    participant Embedder as Optional IEmbeddingGenerator

     Caller->>Pipeline: BuildAsync(documents, options)
     Pipeline->>Parser: Parse Markdown and front matter
@@ -72,6 +75,10 @@ sequenceDiagram
     Pipeline->>Graph: Add facts as RDF triples
     Graph-->>Pipeline: In-memory KnowledgeGraph
     Pipeline-->>Caller: MarkdownKnowledgeBuildResult
+    Caller->>BuiltGraph: BuildSemanticIndexAsync(embedder)
+    BuiltGraph->>Embedder: Generate graph-node embeddings
+    Embedder-->>BuiltGraph: Semantic index
+    Caller->>BuiltGraph: SearchRankedAsync(query, Graph | Semantic | Hybrid)
     Caller->>BuiltGraph: ValidateShacl(optional shapes)
     BuiltGraph-->>Caller: SHACL validation report
     Caller->>Query: ExecuteSelect(graph, sparql)
@@ -90,7 +97,7 @@ flowchart TB
         Rules["Capability rules: graph_entities, graph_edges, graph_groups, graph_related, graph_next_steps"]
         Rdf["RDF: graph construction, namespaces, serialization"]
         Shacl["SHACL: default shapes, validation reports, assertion metadata"]
-        Query["Query: SPARQL and graph search"]
+        Query["Query: SPARQL, ranked search, and graph search"]
     end

     subgraph Tests["tests/MarkdownLd.Kb.Tests"]
@@ -148,6 +155,7 @@ flowchart LR
 - SHACL validation depends on `dotNetRdf.Shacl` and runs against the in-memory graph through `VDS.RDF.Shacl.ShapesGraph`.
 - LLM extraction depends on `Microsoft.Extensions.AI.Abstractions` and accepts `IChatClient`.
 - Tiktoken extraction depends on `Microsoft.ML.Tokenizers` and the O200k data package. It uses tokenizer IDs and Unicode word n-gram keyphrase candidates only, and does not add an embedding provider. The default vector weighting is subword TF-IDF fitted over the current build corpus.
+- Optional semantic ranked search depends only on `Microsoft.Extensions.AI.IEmbeddingGenerator<string, Embedding<float>>` and keeps the concrete embedding provider in the host application.
 - Embeddings are not required for the core graph build/query flow.
 - Public API should prefer repository types over raw dependency types when feasible.
 - AI adapters depend on the core extraction port. The core library must not depend on concrete provider packages or agent orchestration packages in the first slice.
@@ -162,6 +170,7 @@ Required first-slice scenarios:
 - Empty Markdown input produces an empty graph without throwing.
 - Explicit Tiktoken mode builds section/segment/topic/entity-hint nodes plus `schema:hasPart`, `schema:about`, `schema:mentions`, and token-distance `kb:relatedTo` edges without network access.
 - Capability graph rules build `kb:memberOf`, `kb:relatedTo`, and `kb:nextStep` workflow edges from Markdown front matter or caller options, and focused search returns primary, related, and next-step result groups.
+- Ranked search supports `Graph`, `Semantic`, and `Hybrid` modes. Hybrid mode keeps canonical graph hits first, then appends semantic-only fallback hits when graph-native recall is insufficient.
 - SHACL validation uses default Markdown-LD Knowledge Bank shapes or caller-supplied shapes, and assertion confidence/provenance metadata is represented as RDF statements so validation remains RDF-native.
 - English, Ukrainian, French, and German queries over same-language token graphs produce a higher hit rate than cross-language translated-topic queries.
 - Term frequency, binary presence, and subword TF-IDF token weighting modes are covered by focused and flow tests.
@@ -170,6 +179,7 @@ Required first-slice scenarios:
 - `IChatClient` extractor accepts structured extraction output without depending on a provider-specific SDK.
 - Default no-chat mode emits no extracted facts and reports a diagnostic telling callers to connect `IChatClient` or choose Tiktoken mode.
 - No-match search returns an empty result instead of an error.
+- Semantic-only or hybrid search without a semantic index fails explicitly.
 - Turtle and JSON-LD serialization produce parseable output where dependency support is available.

 Coverage requirement: 95%+ line coverage for changed production code.
@@ -192,4 +202,5 @@ Coverage requirement: 95%+ line coverage for changed production code.
 - LLM extraction dependency decision: `docs/ADR/ADR-0002-llm-extraction-ichatclient.md`
 - Capability graph rules decision: `docs/ADR/ADR-0004-capability-graph-rules.md`
 - Capability graph rules feature: `docs/Features/CapabilityGraphRules.md`
+- Hybrid ranked search feature: `docs/Features/HybridGraphSearch.md`
 - SHACL validation feature: `docs/Features/GraphShaclValidation.md`

docs/Features/CapabilityGraphRules.md

Lines changed: 2 additions & 1 deletion
@@ -39,7 +39,8 @@ Rule values can be strings or maps. Strings become node labels. Maps can use `id
 `SearchFocusedAsync` returns:

 - primary matches from token-distance search when the graph was built in Tiktoken mode
-- primary matches from graph metadata search when no token index is present
+- primary matches from hybrid ranked search when a semantic index is supplied
+- primary matches from graph metadata search when no token index or semantic index is present
 - related matches from direct `kb:relatedTo` edges and shared `kb:memberOf` groups
 - next-step matches from direct `kb:nextStep` edges
 - a bounded focused graph snapshot containing selected matches plus explanatory group nodes

docs/Features/HybridGraphSearch.md

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+# Hybrid Graph Search
+
+## Purpose
+
+Hybrid graph search solves the case where the graph language and the user query language differ.
+
+The graph remains canonical. Semantic vectors are optional fallback or merge inputs built from graph-native text, not an alternative source of truth.
+
+## Flow
+
+```mermaid
+flowchart LR
+    Query["User query"] --> Mode["Graph / Semantic / Hybrid mode"]
+    Graph["KnowledgeGraph snapshot"] --> Canonical["Canonical graph candidates\nlabel + description + related labels\nno keywords"]
+    Canonical --> GraphRank["Graph ranking"]
+    Graph --> SemanticIndex["BuildSemanticIndexAsync"]
+    Embedder["IEmbeddingGenerator"] --> SemanticIndex
+    Query --> SemanticRank["Semantic ranking"]
+    SemanticIndex --> SemanticRank
+    GraphRank --> Merge["Hybrid merge"]
+    SemanticRank --> Merge
+    Merge --> Results["Ranked results with provenance"]
+```
+
+## Modes
+
+- `Graph`: rank only graph-native matches from `schema:name`, `schema:description`, and graph-related labels such as `schema:mentions` and `schema:about`.
+- `Semantic`: rank only semantic matches from the optional semantic index.
+- `Hybrid`: keep graph-ranked results first and append semantic-only fallback matches only when graph recall is insufficient.
+
+## Behavior
+
+- `schema:keywords` are excluded from canonical ranking.
+- A hit present in both graph and semantic ranking is marked as merged and keeps its graph-first position.
+- Semantic-only hits never outrank canonical graph hits in hybrid mode.
+- `SearchFocusedAsync` can use hybrid primary matching when `KnowledgeGraphFocusedSearchOptions.SemanticIndex` is supplied.
+- Calling `Semantic` or `Hybrid` mode without a semantic index fails explicitly.
+
+## Intended Library Boundary
+
+- The library owns the graph-native candidate extraction and merge policy.
+- The host application owns the concrete embedding provider and decides whether to call `Graph`, `Semantic`, or `Hybrid`.
+- The library does not own a vector database, gateway endpoint, or hosted ranking infrastructure.
+
+## Verification
+
+- `dotnet test --solution MarkdownLd.Kb.slnx --configuration Release -- --treenode-filter "/*/*/HybridGraphSearchFlowTests/*" --no-progress`
+- `dotnet test --solution MarkdownLd.Kb.slnx --configuration Release`
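The merge rules in the feature document above (graph hits first, overlapping hits marked as merged, semantic-only hits appended as fallback) can be sketched as follows. This is an illustration under stated assumptions rather than the library's implementation: `MergeHybrid`, the tuple shapes, and the `"Graph"`/`"Merged"`/`"Semantic"` labels are hypothetical, and "insufficient graph recall" is assumed to mean that graph ranking filled fewer than `maxResults` slots, which the source does not specify.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical merge helper; not the library's types or API.
static List<(string NodeId, double Score, string Source)> MergeHybrid(
    IReadOnlyList<(string NodeId, double Score)> graphHits,
    IReadOnlyList<(string NodeId, double Score)> semanticHits,
    int maxResults)
{
    var semanticIds = semanticHits.Select(hit => hit.NodeId).ToHashSet();
    var graphIds = graphHits.Select(hit => hit.NodeId).ToHashSet();

    // Graph hits keep their graph-ranked order; an overlap with the semantic
    // ranking only changes the provenance label, never the position.
    var merged = graphHits
        .Select(hit => (hit.NodeId, hit.Score, Source: (semanticIds.Contains(hit.NodeId) ? "Merged" : "Graph")))
        .Take(maxResults)
        .ToList();

    // Assumption: recall is "insufficient" when graph ranking left free slots.
    var remaining = maxResults - merged.Count;
    if (remaining > 0)
    {
        // Semantic-only hits are appended after every graph hit, whatever their score.
        merged.AddRange(semanticHits
            .Where(hit => !graphIds.Contains(hit.NodeId))
            .OrderByDescending(hit => hit.Score)
            .Take(remaining)
            .Select(hit => (hit.NodeId, hit.Score, Source: "Semantic")));
    }

    return merged;
}

var graphRanked = new List<(string NodeId, double Score)> { ("kb:graph-search", 0.9), ("kb:sparql", 0.4) };
var semanticRanked = new List<(string NodeId, double Score)> { ("kb:embeddings", 0.95), ("kb:graph-search", 0.8) };

foreach (var result in MergeHybrid(graphRanked, semanticRanked, maxResults: 3))
{
    Console.WriteLine($"{result.NodeId} {result.Source}");
}
// kb:graph-search Merged
// kb:sparql Graph
// kb:embeddings Semantic
```

In the sample data `kb:embeddings` has the highest raw score yet lands last, which is the "semantic-only hits never outrank canonical graph hits" rule in action.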

src/MarkdownLd.Kb/Pipeline/KnowledgeGraph.FocusedSearch.cs

Lines changed: 43 additions & 9 deletions
@@ -30,6 +30,26 @@ private async Task<IReadOnlyList<KnowledgeGraphFocusedSearchMatch>> ResolvePrima
         IReadOnlyDictionary<string, KnowledgeGraphNode> nodesById,
         CancellationToken cancellationToken)
     {
+        if (options.SemanticIndex is not null)
+        {
+            var rankedMatches = await SearchRankedAsync(
+                    query,
+                    new KnowledgeGraphRankedSearchOptions
+                    {
+                        Mode = KnowledgeGraphSearchMode.Hybrid,
+                        MaxResults = Math.Max(1, options.MaxPrimaryResults),
+                    },
+                    options.SemanticIndex,
+                    cancellationToken)
+                .ConfigureAwait(false);
+
+            return rankedMatches
+                .Where(match => nodesById.ContainsKey(match.NodeId))
+                .Select(CreatePrimaryMatch)
+                .Take(Math.Max(1, options.MaxPrimaryResults))
+                .ToArray();
+        }
+
         if (_tokenIndex is not null)
         {
             var limit = Math.Max(options.MaxPrimaryResults * 4, options.MaxPrimaryResults);
@@ -44,19 +64,33 @@ private async Task<IReadOnlyList<KnowledgeGraphFocusedSearchMatch>> ResolvePrima
                 .ToArray();
         }

-        var rows = await SearchAsync(query, cancellationToken).ConfigureAwait(false);
-        return rows.Rows
-            .Select(static row => row.Values.TryGetValue("subject", out var id) ? id : null)
-            .Where(id => !string.IsNullOrWhiteSpace(id) && nodesById.ContainsKey(id!))
-            .Select(id => new KnowledgeGraphFocusedSearchMatch(
-                id!,
-                nodesById[id!].Label,
-                KnowledgeGraphFocusedSearchRole.Primary,
-                FullConfidence))
+        var rankedGraphMatches = await SearchRankedAsync(
+                query,
+                new KnowledgeGraphRankedSearchOptions
+                {
+                    Mode = KnowledgeGraphSearchMode.Graph,
+                    MaxResults = Math.Max(1, options.MaxPrimaryResults),
+                },
+                cancellationToken: cancellationToken)
+            .ConfigureAwait(false);
+
+        return rankedGraphMatches
+            .Where(match => nodesById.ContainsKey(match.NodeId))
+            .Select(CreatePrimaryMatch)
             .Take(Math.Max(1, options.MaxPrimaryResults))
             .ToArray();
     }

+    private static KnowledgeGraphFocusedSearchMatch CreatePrimaryMatch(
+        KnowledgeGraphRankedSearchMatch match)
+    {
+        return new KnowledgeGraphFocusedSearchMatch(
+            match.NodeId,
+            match.Label,
+            KnowledgeGraphFocusedSearchRole.Primary,
+            match.Score);
+    }
+
     private static KnowledgeGraphFocusedSearchMatch CreatePrimaryMatch(
         KnowledgeGraphNode node,
         double distance)
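The hunks above imply a fixed precedence for primary matching: a supplied semantic index wins, then the token index, then graph ranked search, with the result limit clamped to at least one. The sketch below restates only that control flow. `ChoosePrimaryStrategy` and `ClampLimit` are hypothetical names that do not exist in the repository, and the returned strings are descriptive labels rather than library values.

```csharp
using System;

// Hypothetical restatement of the branch order in ResolvePrimaryMatches:
// the semantic index is checked first, then the token index, then the graph fallback.
static string ChoosePrimaryStrategy(bool hasSemanticIndex, bool hasTokenIndex) =>
    hasSemanticIndex ? "Hybrid ranked search"
    : hasTokenIndex ? "Token-distance search"
    : "Graph ranked search";

// Mirrors Math.Max(1, options.MaxPrimaryResults): a zero or negative limit
// still yields one primary match.
static int ClampLimit(int maxPrimaryResults) => Math.Max(1, maxPrimaryResults);

Console.WriteLine(ChoosePrimaryStrategy(hasSemanticIndex: true, hasTokenIndex: true));   // Hybrid ranked search
Console.WriteLine(ChoosePrimaryStrategy(hasSemanticIndex: false, hasTokenIndex: true));  // Token-distance search
Console.WriteLine(ChoosePrimaryStrategy(hasSemanticIndex: false, hasTokenIndex: false)); // Graph ranked search
Console.WriteLine(ClampLimit(0));                                                        // 1
```

One consequence worth noting is that a graph built in Tiktoken mode still takes the hybrid path whenever the caller passes a semantic index, because that check runs before the token-index check.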
