| status | Proposed |
|---|---|
| date | 2026-01-03 |
| deciders | |
The knowledge graph system excels at concept-level search - finding concepts semantically similar to a query. However, users often need to find the original documents that contain relevant information, not just the extracted concepts.
Current state:
- `POST /query/sources/search` searches source embeddings, returns chunks
- `:DocumentMeta` nodes track documents with `garage_key` linking to original files
- `(:DocumentMeta)-[:HAS_SOURCE]->(:Source)` links documents to their chunks
- `(:Concept)-[:APPEARS]->(:Source)` links concepts to source chunks
Use cases:
- "Find documents about recursive algorithms" → ranked list of original documents
- "Which papers discuss this topic?" → documents with related concepts shown
- Load multiple relevant documents into a graph view for comparison
Metadata endpoints return document references and linked concepts:
- Lightweight, fast responses
- Used for discovery and navigation
Content endpoints return actual document data:
- Full document file(s) from Garage
- Chunks (source nodes the document was split into)
- Heavier payloads, fetched on demand
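
The metadata/content split implies a two-step client flow: search first, then fetch content only for the documents of interest. A minimal Python sketch of that flow, assuming the caller supplies `post` and `get` callables that perform the HTTP requests and return parsed JSON. Both callables and the `top_n` parameter are hypothetical helpers for illustration; the paths and field names follow the endpoint examples in this ADR.

```python
from typing import Any, Callable

JsonDict = dict[str, Any]


def search_then_fetch(
    post: Callable[[str, JsonDict], JsonDict],  # hypothetical: (path, JSON body) -> parsed JSON
    get: Callable[[str], JsonDict],             # hypothetical: path -> parsed JSON
    query: str,
    top_n: int = 3,
) -> list[JsonDict]:
    """Search metadata first, then fetch content only for the top_n hits."""
    # Step 1: lightweight metadata search.
    results = post("/query/documents/search", {"query": query, "limit": 20})
    # Step 2: heavier content fetch, on demand, for the documents of interest.
    return [
        get(f"/documents/{doc['document_id']}/content")
        for doc in results["documents"][:top_n]
    ]
```

Injecting the transport keeps the sketch independent of any particular HTTP library; a real client would also need to decide how `document_id` values are encoded in the URL path.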
`POST /query/documents/search` - Semantic search, returns metadata

```json
// Request
{
  "query": "recursive depth patterns",
  "min_similarity": 0.7,  // score threshold (water level)
  "limit": 20,            // max results (default: 20, max: 100)
  "ontology": "optional-filter"
}

// Response
{
  "documents": [
    {
      "document_id": "sha256:abc123...",
      "filename": "algorithms.md",
      "ontology": "CS Research",
      "content_type": "document",
      "best_similarity": 0.92,
      "source_count": 5,
      "resources": [
        {"type": "document", "garage_key": "docs/abc123.md"}
      ],
      "concept_ids": ["c-123", "c-456", "c-789"]
    }
  ],
  "returned": 20,
  "total_matches": 42
}
```

Limiting behavior:
- `min_similarity` filters by score threshold first
- `limit` caps results after threshold filter
- Both combine: "top N documents above threshold"
- No pagination in v1 (cursor-based pagination if needed later)
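
The limiting behavior above can be sketched in Python as follows. This is an illustration, not the service's implementation: `ScoredDocument` and `apply_limits` are hypothetical names, the threshold is assumed to be inclusive, out-of-range `limit` values are assumed to be clamped rather than rejected, and `total_matches` is assumed to count documents above the threshold before the cap is applied.

```python
from dataclasses import dataclass

DEFAULT_LIMIT = 20   # from the request example: default 20
MAX_LIMIT = 100      # from the request example: max 100


@dataclass
class ScoredDocument:
    document_id: str
    best_similarity: float


def apply_limits(docs, min_similarity, limit=DEFAULT_LIMIT):
    """Return (page, total_matches): threshold filter first, then the cap."""
    # Assumption: out-of-range limits are clamped into [1, MAX_LIMIT].
    limit = max(1, min(limit, MAX_LIMIT))
    # Step 1: keep only documents at or above the threshold (assumed inclusive).
    above = [d for d in docs if d.best_similarity >= min_similarity]
    # Step 2: order by score, then cap at `limit`.
    above.sort(key=lambda d: d.best_similarity, reverse=True)
    return above[:limit], len(above)
```

Under these assumptions, the returned page length corresponds to `returned` in the response and the second value to `total_matches`.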
`GET /documents/{document_id}/content` - Fetch actual document

```json
// Response for text document
{
  "document_id": "sha256:abc123...",
  "content_type": "document",
  "content": {
    "document": "# Full markdown content here...",
    "encoding": "utf-8"
  },
  "chunks": [
    {
      "source_id": "sha256:abc123_chunk0",
      "paragraph": 0,
      "full_text": "First chunk content..."
    }
  ]
}

// Response for image (paired resources)
{
  "document_id": "sha256:img456...",
  "content_type": "image",
  "content": {
    "image": "base64-encoded-jpg...",
    "prose": "The diagram shows a recursive tree traversal...",
    "encoding": "base64"
  },
  "chunks": [...]
}
```

Content definition:
- Documents: Single file (markdown, text, json, html)
- Images: Image file + prose description (always paired)
- Chunks: Source nodes created during ingestion (part of content)
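
A client consuming the content endpoint has to branch on `content_type`. The following Python sketch shows one way to do that, based only on the two response shapes shown above. `decode_content` is a hypothetical helper, and it assumes that the `encoding` field describes the primary resource only, so the `prose` description of an image is treated as plain text.

```python
import base64
from typing import Any


def decode_content(response: dict[str, Any]) -> dict[str, Any]:
    """Turn a content response into ready-to-use text or bytes."""
    content = response["content"]
    if response["content_type"] == "image":
        # Images are always paired: binary image data plus a prose description.
        return {
            "image_bytes": base64.b64decode(content["image"]),
            "prose": content["prose"],  # assumption: prose is plain text, not base64
        }
    # Text documents (markdown, text, json, html) arrive as a single string.
    return {"text": content["document"]}
```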
All document endpoints support optional ontology filtering:
Search with ontology filter:
```json
POST /query/documents/search
{
  "query": "recursive patterns",
  "ontology": "CS Research"  // optional - scope to single ontology
}
```

Browse ontology documents:

```
GET /ontology/{name}/documents?limit=50
```
Returns same structure as search (without similarity scores).
For complete ontology cloning/export, use the existing backup system (ADR-015):
```bash
kg admin backup --type ontology --ontology "CS Research"
```

Future enhancement (ADR-015 extension):
- Add `--include-garage` flag to include original Garage documents
- Enables full ontology clone with raw source files
Documents (single resource):
- Markdown (`.md`)
- Plain text (`.txt`)
- JSON (`.json`)
- HTML (`.html`)
Images (two resources - always paired):
- JPEG, PNG, GIF, WebP, BMP
- Prose description file (LLM-generated text describing the image)
Document ranking:
- Score = max chunk similarity (best match wins)
- Tie-breaker: count of matching chunks (more matches = more relevant)
Why max, not average:
- A document with one highly relevant section is more valuable than one with many mediocre matches
- Prevents dilution from unrelated sections in long documents
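
The ranking rule can be sketched as a chunk-to-document aggregation. This Python example is illustrative only: `rank_documents` is a hypothetical name, and the input is assumed to be the list of matching chunks as `(document_id, similarity)` pairs, already filtered by the similarity threshold.

```python
def rank_documents(chunk_hits):
    """Aggregate matching chunks into ranked (document_id, best, count) tuples."""
    best = {}   # document_id -> highest chunk similarity seen
    count = {}  # document_id -> number of matching chunks
    for doc_id, similarity in chunk_hits:
        if doc_id not in best or similarity > best[doc_id]:
            best[doc_id] = similarity
        count[doc_id] = count.get(doc_id, 0) + 1
    # Primary key: max chunk similarity. Tie-breaker: matching chunk count.
    ranked = sorted(best, key=lambda d: (best[d], count[d]), reverse=True)
    return [(d, best[d], count[d]) for d in ranked]
```

With max aggregation, a document with a single chunk at 0.92 outranks a long document with three chunks at 0.6, which an average or a sum would not guarantee.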
For each document, concept_ids are derived by:
- Find all `:Source` nodes linked via `HAS_SOURCE`
- Find all `:Concept` nodes linked via `APPEARS` relationship
- Return unique concept IDs (no ranking in metadata response)
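
The derivation steps above amount to a two-hop traversal followed by deduplication. A minimal in-memory Python sketch, assuming the `HAS_SOURCE` and `APPEARS` relationships are available as plain pairs; in the real system this would be a graph query, and `concept_ids_for_document` is a hypothetical name. The result is sorted only to make the output deterministic, not to express relevance.

```python
def concept_ids_for_document(document_id, has_source, appears):
    """Derive unique concept IDs for one document.

    has_source: iterable of (document_id, source_id) pairs, one per HAS_SOURCE edge.
    appears:    iterable of (concept_id, source_id) pairs, one per APPEARS edge.
    """
    # Step 1: all :Source nodes linked to the document via HAS_SOURCE.
    source_ids = {s for d, s in has_source if d == document_id}
    # Step 2: all :Concept nodes linked to those sources via APPEARS.
    # Step 3: deduplicate with a set; sorting is for stable output only.
    return sorted({c for c, s in appears if s in source_ids})
```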
Phase 1: API Endpoints
- `POST /query/documents/search` - metadata search (with ontology filter)
- `GET /documents/{id}/content` - content retrieval
- `GET /ontology/{name}/documents` - list documents in ontology
- Reuse existing `source_embeddings` search infrastructure
Phase 2: CLI
- `kg document search "query"` - search and list metadata
- `kg document search "query" --ontology "Name"` - scoped search
- `kg document show <id>` - fetch content
- `kg document list --ontology <name>` - browse by ontology
- Table output with `--json` flag for structured output
Phase 3: MCP Tool
- `search` tool with `type: "documents"` parameter
- `document` tool for content retrieval
- Returns structured data for Claude analysis
Phase 4: Web Explorer
- Document search panel with ontology filter
- Load multiple documents into graph view
- Show document→concept relationships visually
- Users can find original source documents, not just concepts
- Enables document-centric workflows (compare papers, find sources)
- Leverages existing source embedding infrastructure
- Metadata/content split keeps responses lightweight
- Ontology filtering enables scoped exploration
- Defers to backup system for full exports (no duplication)
- Additional query complexity (chunk→document aggregation)
- Two-step flow for content (search metadata, then fetch)
- Document retrieval requires Garage access
- Builds on existing `:DocumentMeta` infrastructure (ADR-051)
- Complements concept search (different discovery patterns)
- Backup extends naturally for full ontology cloning (ADR-015)
- Rejected: Would require new indexing infrastructure
- Semantic search via embeddings handles synonyms better
- Can add later as complementary feature
- Rejected: Would require generating document-level embeddings
- Current chunk embeddings provide finer granularity
- Aggregation from chunks gives similar results with existing data
- Rejected: Users need context (matching chunks, concepts)
- Single query should provide actionable results
- ADR-015: Backup/restore streaming (ontology-scoped export, extend for Garage content)
- ADR-051: Document deduplication and `:DocumentMeta` nodes
- ADR-057: Image storage with prose descriptions (image + prose file pairing)
- ADR-068: Source text embeddings (chunk-level search)
- ADR-081: Source document lifecycle and Garage storage