| status | Accepted |
|---|---|
| date | 2025-11-27 |
| deciders | |
|
Your knowledge graph has embeddings for concepts (the ideas extracted from documents) and embeddings for relationship types (SUPPORTS, CONTRADICTS, etc.), but here's what's missing: embeddings for the source documents themselves—the actual paragraphs and passages that concepts came from. This creates a blind spot. You can search for concepts semantically ("find concepts similar to 'recursive depth tracking'"), but you can't search the original text that way. You're forced to use keyword search, which misses semantic matches.
Think about the difference between these two questions: "Which concepts are related to performance optimization?" versus "Show me the original passages that discuss performance optimization." The first searches concept labels; the second searches the source text. Without source embeddings, you can only answer the first question semantically. For the second, you're stuck with keyword matching—"WHERE full_text LIKE '%performance%'"—which misses passages that discuss optimization without using that exact word.
This matters for RAG (Retrieval-Augmented Generation) workflows. When you want to answer a question using your knowledge graph, you need to retrieve relevant context—not just concept names, but the actual text that provides nuanced detail. Source embeddings enable this: generate an embedding for the user's question, find the most similar source passages, and feed that rich context to an LLM for generation. It's the difference between saying "there's a concept called 'caching strategies'" (shallow) versus showing the actual paragraph explaining different caching approaches (deep).
This ADR implements source text embeddings as a first-class system feature, stored in a separate table with chunk-level granularity and hash-based verification. Each source passage gets split into embedding chunks (around 100-120 words each, stored with character offsets for precise highlighting), and the system tracks which chunks came from which source using content hashes. This enables three new capabilities: semantic search over source passages, hybrid queries that blend concept matches with source matches, and complete RAG workflows that retrieve evidence-rich context for generation. The vision is a "Large Concept Model" where everything in the graph—concepts, sources, relationships, even images—participates in semantic search and retrieval.
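The retrieval step described above (embed the question, then rank source chunks by similarity) can be sketched in a few lines. This is a minimal in-memory illustration, not the system's implementation: `ChunkRecord` and `search_chunks` are hypothetical names, and the toy 3-dimensional vectors stand in for real embeddings.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ChunkRecord:
    """Hypothetical in-memory stand-in for one chunk-level embedding row."""
    source_id: str
    start_offset: int
    end_offset: int
    embedding: List[float]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_chunks(
    query_embedding: List[float], chunks: List[ChunkRecord], limit: int = 10
) -> List[Tuple[float, ChunkRecord]]:
    """Return (similarity, chunk) pairs, best match first."""
    scored = [(cosine_similarity(query_embedding, c.embedding), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# Toy vectors; a real query embedding would come from the embedding model.
chunks = [
    ChunkRecord("doc123_chunk5", 0, 480, [0.9, 0.1, 0.0]),
    ChunkRecord("doc123_chunk5", 480, 950, [0.1, 0.9, 0.0]),
    ChunkRecord("doc456_chunk2", 0, 500, [0.7, 0.7, 0.0]),
]
top = search_chunks([1.0, 0.0, 0.0], chunks, limit=2)
for score, chunk in top:
    print(f"{chunk.source_id}[{chunk.start_offset}:{chunk.end_offset}] {score:.3f}")
```

Each hit carries its character offsets, so a caller can highlight the matched span inside the full source text before passing it to an LLM.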
The system currently generates embeddings for:
- Concepts: Label + description + search terms (text embeddings)
- Relationship Types: Vocabulary embeddings for grounding calculations (ADR-044)
- Images: Visual embeddings (Nomic Vision v1.5, 768-dim, ADR-057)
However, Source nodes (the grounding truth documents) do NOT have embeddings:
# From api/api/workers/ingestion_worker.py:294
text_embedding=None  # Will be generated during concept extraction
Source nodes contain:
- full_text: Raw paragraph/chunk text (potentially 500-1500 words)
- document: Ontology name
- paragraph: Chunk number
- content_type: "document" or "image"
- No embedding field for text similarity search
This creates a critical gap in retrieval capabilities:
- No Direct Source Search: Cannot find similar source passages via embedding similarity
- Lost Context: When a concept matches, we can't easily find related context from neighboring source text
- Incomplete RAG: The system has concept embeddings but not the underlying evidence embeddings
- Search Mode Gap:
- ✅ Text search (full-text indexes on Source.full_text)
- ✅ Concept search (embedding similarity on Concept.embedding)
- ❌ Source passage search (no embedding on Source nodes)
This ADR is a foundational piece toward a Large Concept Model architecture where ALL graph elements participate in semantic search:
Current state (Concept-centric):
Text → Concepts → Embeddings → Graph
Target state (LCM - Everything embedded):
Text → {Concepts, Sources, Edges} → Embeddings → Multi-modal Graph
↓ ↓
Recursive Relationships Constructive Queries
LCM Characteristics:
- Text Search: Traditional full-text indexes
- Text Embeddings: Dense vector search on passages
- RAG: Retrieve and generate from source chunks
- Visual Embeddings: Image similarity search (✅ ADR-057)
- Graph Embeddings: Concept and edge embeddings (✅ ADR-044, ADR-045)
- Source Embeddings: Grounding truth chunk search (❌ This ADR)
- Emergent Edges: Relationships discovered via embedding proximity
- Constructive Queries: Build knowledge paths from multi-modal signals
IMPORTANT: Source embeddings serve a fundamentally different purpose than grounding calculation. This distinction is critical to understanding the architecture:
Source Text → Extraction → Concept
"The recursive depth tracker maintains state..."
↓
[Concept: "Recursive Depth Tracking"]
Purpose: Provenance and evidence retrieval
- Nature: Observational, neutral representation
- Language: "Concept" (intentionally NOT "fact" or "truth")
- What it captures: Ideas stated/observed in source text
- Judgment: None - purely descriptive
- Query use case: "Show me the original text where this concept came from"
Graph Traversal:
(:Concept)-[:EVIDENCED_BY]->(:Instance)-[:FROM_SOURCE]->(:Source)
NOT used for grounding calculation - only for citation and provenance.
Concept ↔ Concept (relationships)
[:SUPPORTS], [:CONTRADICTS], [:ENABLES], etc.
↓
Polarity projection → Grounding strength
Purpose: Truth convergence and validation
- Nature: Interpretive, evaluative assessment
- Method: Semantic projection of concept relationships onto polarity axis
- What it measures: How concepts validate/contradict each other
- Source: Concept-to-concept relationships, NOT source citations
- Algorithm: Polarity Axis Triangulation (ADR-058)
Graph Traversal:
MATCH (c:Concept)<-[r]-(other:Concept)
// Project r onto SUPPORTS ↔ CONTRADICTS axis
Evidence ≠ Validation:
- Just because source text states something doesn't make it grounded
- Concepts from sources are neutral observations of what was written
- Grounding emerges from how concepts relate to each other, not from source citations
Example:
Source A: "The earth is flat"
→ Concept: "Flat Earth Model" (neutral observation)
Source B: "Spherical earth confirmed by gravity"
→ Concept: "Spherical Earth Model" (neutral observation)
Relationship: (Spherical Earth)-[:CONTRADICTS]->(Flat Earth)
→ Grounding: Flat Earth has negative grounding (contradicted)
The source text itself doesn't determine truth - the semantic relationships between concepts do.
This ADR addresses evidence retrieval only. Grounding calculation is handled separately by ADR-044 (Probabilistic Truth Convergence) and ADR-058 (Polarity Axis Triangulation).
We will implement asynchronous source text embedding generation with the following design:
Key Insight: Source nodes remain the canonical source of truth. Embeddings are stored separately with offset tracking and hash verification.
Understanding the Chunking Architecture:
Document (100KB)
↓ Ingestion chunking (smart chunker with overlap)
├─ Source node 1 (500-1500 words) ────→ Embedding chunk(s)
├─ Source node 2 (500-1500 words) ────→ Embedding chunk(s)
├─ Source node 3 (500-1500 words) ────→ Embedding chunk(s)
...
└─ Source node N (500-1500 words) ────→ Embedding chunk(s)
↓
Concepts extracted (references Sources)
Two-level chunking:
- Ingestion chunking (existing): Document → Source nodes (500-1500 words each)
- Embedding chunking (this ADR): Source.full_text → Embedding chunks (~100-120 words each)
Typical scenario:
- Large document → 10 Source nodes (ingestion chunks)
- 100 concepts extracted → reference those 10 Sources
- Each Source → 1-3 embedding chunks (depending on length)
- Total: 10-30 embeddings for entire document
-- Source node (canonical truth)
(:Source {
source_id: "doc123_chunk5",
full_text: "...", -- Canonical text (500-1500 words from ingestion)
content_hash: "sha256..." -- Hash for verification (NULL for existing Sources)
})
-- Separate embeddings table with offsets
CREATE TABLE kg_api.source_embeddings (
embedding_id SERIAL PRIMARY KEY,
source_id TEXT NOT NULL,
-- Chunk tracking
chunk_index INT NOT NULL, -- 0-based chunk number
chunk_strategy TEXT NOT NULL, -- 'sentence', 'paragraph', 'semantic'
-- Offset in Source.full_text (character positions)
start_offset INT NOT NULL,
end_offset INT NOT NULL,
chunk_text TEXT NOT NULL, -- Actual chunk (for verification)
-- Referential integrity (double hash verification)
chunk_hash TEXT NOT NULL, -- SHA256 of chunk_text
source_hash TEXT NOT NULL, -- SHA256 of Source.full_text
-- Embedding data
embedding BYTEA NOT NULL,
embedding_model TEXT NOT NULL,
embedding_dimension INT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(source_id, chunk_index, chunk_strategy)
);
Why separate table?
- One Source can have multiple embedding chunks (granular retrieval)
- Offsets enable precise text highlighting and context extraction
- Hash verification ensures embeddings match current source text
- Stale embeddings detectable when source text changes
- Supports multiple strategies per Source (sentence + paragraph)
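The `embedding` column is `BYTEA`, and the ADR does not spell out the byte layout. One plausible encoding, shown here purely as an assumption, is a flat little-endian array of float16 or float32 values. The helper names `pack_embedding` and `unpack_embedding` are hypothetical; the length check shows how `embedding_dimension` can guard against reading a blob written with a different configuration.

```python
import struct

# struct format characters: "e" is IEEE 754 half precision, "f" is single precision.
_FORMATS = {"float16": "e", "float32": "f"}


def pack_embedding(vector, precision="float16"):
    """Serialize a vector to little-endian bytes suitable for a BYTEA column."""
    fmt = _FORMATS[precision]
    return struct.pack(f"<{len(vector)}{fmt}", *vector)


def unpack_embedding(blob, dimension, precision="float16"):
    """Deserialize bytes back to a list of floats, checking the expected size."""
    fmt = _FORMATS[precision]
    expected = dimension * struct.calcsize(f"<{fmt}")
    if len(blob) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(blob)}")
    return list(struct.unpack(f"<{dimension}{fmt}", blob))


blob = pack_embedding([0.5] * 768)           # 768 float16 values
restored = unpack_embedding(blob, dimension=768)
print(len(blob), restored[:3])
```

With this layout a 768-dimension float16 vector occupies 768 × 2 bytes, which matches the per-chunk figure used in the storage estimate.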
Source text will be chunked using simple, tunable strategies:
# In api/api/workers/source_embedding_worker.py
# Tuning constants (easy to adjust, no complex config needed)
CHUNKING_STRATEGIES = {
    "sentence": {
        "max_chars": 500,        # ~100-120 words
        "splitter": "sentence",  # Use sentence boundaries
    },
    "paragraph": {
        "max_chars": None,       # Use full Source.full_text
        "splitter": None,        # No splitting needed
    },
    "semantic": {
        "max_chars": 1000,       # ~200-250 words
        "splitter": "semantic",  # Use existing SemanticChunk logic
    },
}
# Default strategy (simplest - no chunking)
DEFAULT_STRATEGY = "paragraph"
Key Constraints:
- Source.full_text already bounded (500-1500 words from ingestion chunker)
- No chunk can exceed embedding model context window
- Simple constants in code - easy to tune, no database config needed
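To make the "sentence" strategy concrete, here is a minimal sketch of sentence-boundary chunking with offset tracking. It is an illustration under stated assumptions, not the worker's actual code: `Chunk` and `chunk_by_sentence` are hypothetical names, and the regular expression is a deliberately naive sentence splitter (abbreviations and decimals will be split).

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    index: int  # 0-based chunk number (chunk_index)
    start: int  # start_offset into the source text
    end: int    # end_offset (exclusive)
    text: str   # chunk_text, always equal to source_text[start:end]


# Text up to and including ., ! or ?, or a trailing fragment with no terminator.
_SENTENCE = re.compile(r"[^.!?]*[.!?]+|[^.!?]+$")


def chunk_by_sentence(source_text: str, max_chars: int = 500) -> List[Chunk]:
    """Greedily pack whole sentences into chunks of up to max_chars.

    A single sentence longer than max_chars is kept whole in this sketch.
    """
    spans = []          # (start, end) of finished chunks
    start = end = None  # span of the chunk being built
    for match in _SENTENCE.finditer(source_text):
        s, e = match.span()
        while s < e and source_text[s].isspace():
            s += 1      # drop whitespace between sentences
        if s == e:
            continue
        if start is None:
            start, end = s, e
        elif e - start <= max_chars:
            end = e     # sentence still fits, extend the current chunk
        else:
            spans.append((start, end))
            start, end = s, e
    if start is not None:
        spans.append((start, end))
    return [Chunk(i, s, e, source_text[s:e]) for i, (s, e) in enumerate(spans)]


text = "One two. Three four! Five six? Seven."
chunks = chunk_by_sentence(text, max_chars=20)
for c in chunks:
    print(c.index, c.start, c.end, repr(c.text))
```

Because every chunk is a slice of the original text, `source_text[start:end]` can be re-extracted later and compared against the stored chunk hash. A production version would also need to hard-split oversized sentences to respect the model's context window.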
Configuration (use existing embedding_config):
# NO separate source_embedding_config table!
# Source embeddings use system-wide kg_api.embedding_config:
embedding_config = load_active_embedding_config()
{
"provider": "local" | "openai",
"model_name": "nomic-ai/nomic-embed-text-v1.5",
"embedding_dimensions": 768, # MUST match concept embeddings!
"precision": "float16" | "float32",
...
}
# Why? Source embeddings must be comparable to concept embeddings.
# Using different dimensions would break cosine similarity.
Always Enabled:
- Source embedding generation is a first-class system feature
- No opt-in/opt-out flags
- Runs automatically for all ingested Sources
- Can be regenerated via existing regenerate embeddings worker
Source.content_hash field:
- Migration 068 adds field to Source nodes
- NULL for existing Sources (no backfill in migration)
- Computed on-demand when embeddings generated
- Existing regenerate embeddings worker handles backfill
Rationale:
- Avoid expensive migration (computing hash for all existing Sources)
- Leverage existing worker pattern (cures non-existent embeddings)
- Operators can regenerate at their leisure
- Non-blocking rollout
Backfill process (optional, any time after migration):
# Use existing regenerate embeddings pattern
kg admin regenerate-embeddings --type source --all
# Or per ontology
kg admin regenerate-embeddings --type source --ontology "MyDocs"
Double verification prevents silent corruption:
# At embedding generation
source_text = source['full_text']
source_hash = sha256(source_text)  # Hash of full source
for chunk in chunks:
    chunk_hash = sha256(chunk.text)  # Hash of this chunk
    db.insert_source_embedding(
        source_id=source_id,
        chunk_text=chunk.text,
        chunk_hash=chunk_hash,    # ✓ Verifies chunk integrity
        source_hash=source_hash,  # ✓ Verifies source hasn't changed
        start_offset=chunk.start,
        end_offset=chunk.end,
        embedding=generate_embedding(chunk.text)
    )
# At query time
source_text = source['full_text']
current_source_hash = sha256(source_text)
for embedding in embeddings:
    if embedding.source_hash != current_source_hash:
        # Source text changed - embedding is stale
        flag_for_regeneration(embedding)
        continue  # Offsets refer to the old text, so skip the chunk check
    # Verify chunk extraction
    extracted_chunk = source_text[embedding.start_offset:embedding.end_offset]
    if sha256(extracted_chunk) != embedding.chunk_hash:
        # Corruption detected!
        raise IntegrityError("Chunk hash mismatch")
Benefits:
- Detect when Source.full_text changes (invalidates embeddings)
- Verify chunk extraction matches original
- Enable automatic regeneration triggers
- Prevent serving stale embeddings
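The double-hash check can be exercised end to end with the standard library. The sketch below is illustrative only: `make_record` and `check_record` are hypothetical helpers, and plain dictionaries stand in for rows of the embeddings table. It distinguishes a stale record (the source text changed) from a corrupt one (the offsets no longer reproduce the hashed chunk).

```python
import hashlib


def sha256_hex(text: str) -> str:
    """SHA-256 of the UTF-8 encoded text, as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_record(source_text: str, start: int, end: int) -> dict:
    """Build the hash and offset fields stored alongside one chunk embedding."""
    return {
        "start_offset": start,
        "end_offset": end,
        "chunk_hash": sha256_hex(source_text[start:end]),
        "source_hash": sha256_hex(source_text),
    }


def check_record(record: dict, current_source_text: str) -> str:
    """Return 'ok', 'stale' (source changed) or 'corrupt' (chunk mismatch)."""
    if record["source_hash"] != sha256_hex(current_source_text):
        return "stale"
    extracted = current_source_text[record["start_offset"]:record["end_offset"]]
    if sha256_hex(extracted) != record["chunk_hash"]:
        return "corrupt"
    return "ok"


text = "Alpha beta. Gamma delta."
record = make_record(text, 0, 11)
print(check_record(record, text))        # unchanged source
print(check_record(record, text + "!"))  # edited source
```

Checking the source hash first means an edited source is reported as stale and queued for regeneration, rather than being misreported as corruption when its old offsets no longer line up.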
Leverage existing job system (ADR-014) for embedding generation:
# New job type: "source_embedding"
{
"job_type": "source_embedding",
"status": "pending",
"job_data": {
"ontology": "MyOntology",
"strategy": "paragraph",
"source_ids": ["src_123", "src_456", ...], // Batch of sources
"embedding_provider": "local",
"embedding_model": "nomic-ai/nomic-embed-text-v1.5"
}
}
Worker: api/api/workers/source_embedding_worker.py
from typing import Any, Dict

def run_source_embedding_worker(
    job_data: Dict[str, Any],
    job_id: str,
    job_queue
) -> Dict[str, Any]:
    """
    Generate embeddings for source text chunks.

    Processing:
    1. Fetch Source nodes by source_ids
    2. Apply chunking strategy to full_text
    3. Generate embeddings via EmbeddingWorker (ADR-045)
    4. Store chunk embeddings in kg_api.source_embeddings and update Source.content_hash
    5. Report progress to job queue
    """
At Ingestion Time (always enabled):
# In api/api/workers/ingestion_worker.py
# After creating Source node
# No enable/disable check - always generate embeddings
job_queue.submit_job({
"job_type": "source_embedding",
"job_data": {
"source_ids": [source_id],
"ontology": ontology,
"strategy": "sentence" # Default strategy
}
})
Bulk Regeneration (admin tool):
# Regenerate embeddings for entire ontology
kg admin regenerate-embeddings --ontology "MyOntology" --type source
# Regenerate for entire system (provider change)
kg admin regenerate-embeddings --type source --all
Selective Regeneration (configuration change):
# Change embedding provider, regenerate affected sources
kg admin source-embeddings config --ontology "MyOntology" --strategy paragraph
kg admin source-embeddings generate --ontology "MyOntology" --force
New Search Mode: Source Similarity Search
# API endpoint: POST /queries/sources/search
{
"query_text": "How does recursive depth affect performance?",
"ontology": "SystemDocs",
"limit": 10,
"include_concepts": true // Also return attached concepts
}
# Response
{
"sources": [
{
"source_id": "doc123_chunk5",
"document": "SystemDocs",
"full_text": "...",
"similarity": 0.87,
"concepts": [...] // Concepts extracted from this source
}
]
}
Hybrid Search: Concept + Source
# Find concepts, then return supporting source passages
{
"query_text": "recursive relationships",
"mode": "hybrid", // Search both concepts AND sources
"concept_limit": 5,
"source_limit": 10
}
# Returns both concept matches AND similar source passages
Context Window: Source Neighbors
// Given a matched concept, find neighboring source context
MATCH (c:Concept {concept_id: $concept_id})-[:APPEARS_IN]->(s:Source)
MATCH (neighbor:Source {document: s.document})
WHERE neighbor.paragraph >= s.paragraph - 2
AND neighbor.paragraph <= s.paragraph + 2
RETURN neighbor
ORDER BY neighbor.paragraph
Storage:
- 768-dim float16 embedding = 1.5KB per chunk
- Typical: 1-2 chunks per Source (500-1500 word Sources)
- Avg 1.5 chunks per Source = 2.25KB per Source
- 1M sources = ~2.25GB embedding storage
- Plus ~500 bytes metadata per chunk = ~750MB
- Total: ~3GB for 1M sources (acceptable for PostgreSQL)
Note: Most Sources (500-1500 words) will have 1-2 embedding chunks at 500 char (~100 word) granularity.
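The storage estimate can be reproduced with a short calculation. The constants below are the figures quoted in the list above; the script uses the exact 1,536 bytes per embedding and decimal gigabytes, so its totals differ slightly from the rounded numbers in the text.

```python
DIMENSIONS = 768
BYTES_PER_VALUE = 2            # float16
AVG_CHUNKS_PER_SOURCE = 1.5
METADATA_BYTES_PER_CHUNK = 500
SOURCE_COUNT = 1_000_000

bytes_per_embedding = DIMENSIONS * BYTES_PER_VALUE
chunk_count = SOURCE_COUNT * AVG_CHUNKS_PER_SOURCE
embedding_bytes = chunk_count * bytes_per_embedding
metadata_bytes = chunk_count * METADATA_BYTES_PER_CHUNK
total_bytes = embedding_bytes + metadata_bytes

print(f"per embedding: {bytes_per_embedding} bytes")
print(f"embeddings:    {embedding_bytes / 1e9:.2f} GB")
print(f"metadata:      {metadata_bytes / 1e9:.2f} GB")
print(f"total:         {total_bytes / 1e9:.2f} GB")
```

Changing `AVG_CHUNKS_PER_SOURCE` is the quickest way to see how a finer chunking strategy would scale the table.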
Generation:
- Local embeddings (Nomic): ~5-10ms per chunk (CPU fallback: ~20-50ms)
- Typical: 1-2 chunks per Source = ~10-20ms per Source
- OpenAI API: ~50-100ms per batch (rate limited)
- Async processing prevents ingestion blocking
- Hash calculation: <1ms (negligible)
- content_hash is computed once per Source and cached on the node
Regeneration:
- Leverage existing regenerate embeddings worker
- Worker cures non-existent embeddings (NULL content_hash)
- 1M sources @ 15ms = ~4 hours (local, 1-2 chunks per Source)
- Progress tracking via job system
- Resumable on failure
- Can regenerate entire system or per-ontology
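As a rough planning aid, the regeneration time and a batching schedule can be sketched as follows. Both functions are hypothetical helpers rather than part of the worker; the 15 ms per Source figure is the estimate quoted above, and the (limit, offset) pairs mirror the `--limit` and `--offset` options of the regenerate command.

```python
def estimate_hours(source_count: int, ms_per_source: float = 15.0) -> float:
    """Sequential processing time in hours at a fixed per-Source cost."""
    return source_count * ms_per_source / 1000 / 3600


def plan_batches(source_count: int, batch_size: int):
    """Yield (limit, offset) pairs that cover every Source exactly once."""
    for offset in range(0, source_count, batch_size):
        yield min(batch_size, source_count - offset), offset


print(f"{estimate_hours(1_000_000):.2f} hours for 1M sources")
for limit, offset in plan_batches(2500, 1000):
    print(f"--limit {limit} --offset {offset}")
```

Running in fixed-size batches keeps each job short, which makes a failed run cheap to resume from the last completed offset.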
- Referential Integrity
  - Double hash verification (source + chunk)
  - Detect stale embeddings automatically
  - Prevent serving outdated results
  - Enable automatic regeneration triggers
- Granular Retrieval
  - 1-2 embeddings per Source (typical)
  - Precise offset tracking for highlighting
  - Context-aware search results
  - Chunking overlap from ingestion ensures continuity
- Complete Retrieval Coverage
  - Text search (full-text)
  - Concept search (embeddings)
  - Source search (embeddings) ← NEW
  - Visual search (image embeddings)
- Enhanced RAG
  - Retrieve source passages directly
  - Combine with concept context
  - Build richer prompts for LLM generation
- Context Discovery
  - Find similar passages across documents
  - Identify conceptual overlap via source similarity
  - Build "source graphs" of related passages
- LCM Foundation
  - All graph elements become searchable
  - Enables emergent relationship discovery
  - Supports constructive multi-modal queries
- Provider Flexibility
  - Regenerate embeddings when provider changes
  - A/B test embedding models
  - Mix providers per ontology
- Simple Configuration
  - Uses existing kg_api.embedding_config (system-wide)
  - No separate configuration table
  - Must match concept embedding dimensions
  - Always enabled (first-class feature)
- Leverage Existing Patterns
  - Uses existing regenerate embeddings worker
  - Worker cures NULL content_hash on-demand
  - No expensive migration backfill
  - Operators control regeneration timing
- Storage Overhead
  - +2.25KB per Source (1.5 chunks @ 1.5KB each, typical)
  - Plus ~750MB metadata (1M sources)
  - For 1M sources: ~3GB storage
  - Acceptable for PostgreSQL at scale
- Ingestion Latency
  - Async job adds ~15ms per source (1-2 chunks typical)
  - Hash calculation adds <1ms (cached in Source.content_hash)
  - Mitigated by background processing
  - Total impact negligible
- Schema Complexity
  - Additional table to maintain (source_embeddings)
  - Hash verification logic required
  - Stale embedding detection needed
- API Complexity
  - New search modes to maintain
  - Hybrid search requires careful tuning
  - Offset extraction and highlighting logic
- Migration Cost
  - Migration 068 adds field only (fast, no backfill)
  - Existing Sources have NULL content_hash initially
  - Backfill via regenerate embeddings worker (optional, at leisure)
  - No downtime required (graceful degradation)
- Always-On Feature
  - Source embedding generation runs for all ingestions
  - No opt-in/opt-out (first-class system feature)
  - Simplified architecture (no conditional logic)
- Backward Compatible
  - Migration adds field, NULL for existing Sources
  - Existing Source nodes continue working
  - Regenerate embeddings worker handles backfill
  - Queries gracefully handle NULL content_hash
- Migration 068: Create kg_api.source_embeddings table
- Migration 068: Add Source.content_hash field (NULL for existing)
- Implement hash verification utilities (SHA256)
- Implement sentence chunking with offset tracking (500 chars)
- Implement SourceEmbeddingWorker skeleton
- Query active embedding_config for dimensions
- Add job type "source_embedding" to queue
- Implement full SourceEmbeddingWorker with chunking
- Add hash verification at generation time
- Store embeddings in source_embeddings table
- Update Source.content_hash field when embedding
- Add ingestion-time embedding generation (always enabled)
- Test with small ontology (verify chunks, offsets, hashes)
- Implement /queries/sources/search endpoint
- Add stale embedding detection in queries
- Return matched chunks with offsets for highlighting
- Add context window expansion (neighboring chunks)
- Implement hash verification at query time
Critical Infrastructure: Enables cross-entity semantic queries and global model migrations.
Rationale: The system currently has embeddings in three namespaces:
- Concepts: Concept.embedding (AGE graph nodes)
- Sources: kg_api.source_embeddings table (this ADR)
- Vocabulary: kg_api.vocabulary_embeddings table (ADR-044)
Without unified regeneration:
- ❌ Cannot switch embedding models globally (must manually regenerate 3 systems)
- ❌ Cannot guarantee cross-entity semantic compatibility
- ❌ Cannot execute blended queries (concept + source + relationship)
- ❌ Cannot discover emergent relationships via embedding proximity
Phase 4 Solution: Single interface for ALL graph text embeddings
- Implement regenerate_source_embeddings() function in source_embedding_worker.py
- Fetch sources from AGE (filter by ontology, detect missing embeddings)
- Batch process sources with progress tracking
- Support --only-missing flag (skip sources with valid embeddings)
- Detect and regenerate stale embeddings (hash mismatch)
- Implement regenerate_vocabulary_embeddings() function
- Regenerate embeddings for all relationship types in vocabulary
- Update kg_api.vocabulary_embeddings table
- Support categorical filtering (semantic, structural, epistemic, etc.)
- Add /admin/regenerate-embeddings endpoint (replaces /admin/regenerate-concept-embeddings)
- Support type parameter: concept, source, vocabulary, all
- Support filters: ontology, only_missing, limit, offset
- Return unified progress tracking and statistics
- Update kg admin regenerate-embeddings command
- Add --type <concept|source|vocabulary|all> flag (default: concept)
- Support --ontology <name> (limit to specific namespace)
- Support --only-missing (skip entities with valid embeddings)
- Support --limit <n> and --offset <n> for batching
- Unified progress display for all entity types
- Document semantic query patterns (see "Cross-Entity Query Capabilities" below)
- Add examples for blended search (concept + source + relationship)
- Performance benchmarks for cross-entity queries
- Add MCP tools for unified semantic search
Example Usage:
# Model migration: Regenerate ALL embeddings with new model
kg admin regenerate-embeddings --all
# Selective regeneration
kg admin regenerate-embeddings --type concept --ontology "MyDocs"
kg admin regenerate-embeddings --type source --only-missing
kg admin regenerate-embeddings --type vocabulary
# Batch processing
kg admin regenerate-embeddings --type source --limit 1000 --offset 0
Implementation Date: 2025-11-29
Branch: feature/adr-068-phase5-interfaces
Goal: Provide source text search access across all user interaction methods.
- kg search sources command with full parameter support
- Query, limit, similarity, ontology filtering
- Formatted output with source passages, concepts, and similarity scores
- Integrated with existing search command structure
- Extended search tool with type parameter ('concepts' | 'sources')
- Source search results formatter (formatSourceSearchResults)
- Automatic routing based on search type
- Rich text output for AI consumption (passages, concepts, offsets)
- SourceSearchBlock component (Smart Block category)
- Query input, ontology filter, similarity slider, limit controls
- Execution logic extracting concepts from source passages
- Block compiler integration with comment annotations
- Help content and palette integration
- Amber color scheme (consistent with Smart Blocks)
Files Modified:
- cli/src/mcp/formatters.ts - Added formatSourceSearchResults
- cli/src/mcp-server.ts - Extended search tool with type parameter
- web/src/api/client.ts - Added searchSources method
- web/src/components/blocks/SourceSearchBlock.tsx - New component (142 lines)
- web/src/components/blocks/BlockBuilder.tsx - Execution logic
- web/src/components/blocks/BlockPalette.tsx - Added to Smart Blocks
- web/src/components/blocks/blockHelpContent.ts - Help documentation
- web/src/types/blocks.ts - Type definitions
- web/src/lib/blockCompiler.ts - Block compilation logic
Testing:
- ✅ CLI: kg search sources "data" returns 5 results
- ✅ CLI: kg search query "towers" returns 2 concepts
- ✅ Web UI: Block renders correctly with all controls
- ✅ API: /query/sources/search endpoint working
- ⏳ MCP: Requires restart to test (in progress)
Commits:
- feat(mcp): add source search tool with type parameter (ADR-068 Phase 5) - 0744e837
- feat(web): add SourceSearchBlock for source text search (ADR-068 Phase 5) - 7dfe0b8b
- fix(web): add source search execution logic to BlockBuilder - 345304a0
- fix(web): add sourceSearch case to block compiler - 5b479746
- fix(web): correct source search endpoint path - 34ee7ce4
- Hybrid search (concept + source combined)
- Semantic chunking strategy
- Multiple strategies per Source
- Cross-document source similarity
- Edge embeddings for emergent relationships
The Emergent Power of Unified Semantic Space
Once concepts, sources, and vocabulary (relationship types) share the same semantic space with compatible embeddings, powerful cross-entity query patterns emerge. This is the foundation of the Large Concept Model (LCM) architecture.
Route queries to the most relevant entity type automatically:
# Single query → multiple semantic entry points
query = "recursive depth management"
results = {
    "via_concepts": search_concepts(query),            # Direct concept match
    "via_sources": search_sources(query),              # Evidence passage match
    "via_relationships": search_relationships(query),  # Semantic edge match
}
# System automatically selects best entry point by similarity
# (max() iterates the dict keys, so look up each key's result object to compare)
best_entry = max(results, key=lambda name: results[name].max_similarity)
Use Case: User doesn't know whether their query matches a concept name, a source passage, or a relationship type. The system finds the best match across all three and uses that as the entry point.
Discover relationships not by exact type, but by semantic meaning:
// Traditional: Exact relationship traversal
MATCH (c:Concept {concept_id: $id})-[:SUPPORTS]->(target)
// With embeddings: Semantic relationship discovery
MATCH (c:Concept {concept_id: $id})-[r]->(target:Concept)
WHERE vocabulary_embedding_similarity(type(r), "strengthens, enables, reinforces") > 0.8
RETURN target
ORDER BY vocabulary_embedding_similarity(type(r), $query_embedding) DESC
Use Case: Find all concepts that "support" a given concept, but include relationships with semantically similar meanings (ENABLES, REINFORCES, STRENGTHENS, etc.).
Merge results from multiple entity types for comprehensive coverage:
# Query: "How does probabilistic reasoning work?"
# Strategy A: Find concepts directly
concepts_direct = search_concepts("probabilistic reasoning", limit=10)
# Strategy B: Find source passages → extract their concepts
sources = search_sources("probabilistic reasoning", limit=10)
concepts_from_sources = get_concepts_for_sources(sources)
# Strategy C: Find relationships → traverse to concepts
relationships = search_relationships("probabilistic reasoning", limit=10)
concepts_via_edges = get_concepts_connected_by(relationships)
# BLEND: Merge + deduplicate + rank by combined signals
# Each entry is a (results, weight) pair
blended_results = merge_and_rank([
    (concepts_direct, 1.0),        # Direct matches
    (concepts_from_sources, 0.8),  # Evidence-based
    (concepts_via_edges, 0.6),     # Relationship-based
])
Use Case: Comprehensive search that considers all perspectives: concepts mentioned explicitly, concepts discussed in sources, and concepts connected via semantically relevant relationships.
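The ADR does not define how `merge_and_rank` combines its inputs. One simple possibility, shown here as an assumption rather than the intended algorithm, is a weighted sum: each result list contributes `weight × similarity` for every concept it contains, and concepts found by several strategies accumulate score. The input shape (lists of `(concept_id, similarity)` tuples) is also hypothetical.

```python
from collections import defaultdict


def merge_and_rank(weighted_result_sets):
    """Blend several result lists into one ranking.

    weighted_result_sets: list of (results, weight) pairs, where results is a
    list of (concept_id, similarity) tuples. Returns (concept_id, score) pairs
    sorted by descending blended score.
    """
    scores = defaultdict(float)
    for results, weight in weighted_result_sets:
        best = {}  # deduplicate within one list, keeping the best similarity
        for concept_id, similarity in results:
            best[concept_id] = max(similarity, best.get(concept_id, 0.0))
        for concept_id, similarity in best.items():
            scores[concept_id] += weight * similarity
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


direct = [("c1", 0.9), ("c2", 0.5)]
from_sources = [("c2", 0.8), ("c3", 0.7)]
via_edges = [("c3", 0.5)]
blended = merge_and_rank([(direct, 1.0), (from_sources, 0.8), (via_edges, 0.6)])
for concept_id, score in blended:
    print(concept_id, round(score, 2))
```

A weighted sum rewards agreement between strategies; taking the maximum weighted score instead would favor the single strongest signal. Which behavior is preferable is a tuning decision.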
Rank evidence by semantic relevance to the query, not just presence:
# Query: "grounding strength calculation"
query_embedding = embed("grounding strength calculation")  # Embed once, reuse below
# Step 1: Find best concept match
concept = search_concepts("grounding strength")[0]
# Step 2: Get evidence, but rank by CONTEXT similarity
evidence = get_concept_evidence(concept.id)
for source in evidence:
    # Traditional: "This source mentions this concept" (binary)
    # Enhanced: "This source passage is contextually relevant to the query" (scored)
    source.relevance_score = cosine_similarity(query_embedding, source.embedding)
# Context-aware evidence ranking, most relevant first
ranked_evidence = sorted(evidence, key=lambda s: s.relevance_score, reverse=True)
Use Case: Show the most relevant evidence first: passages that not only mention the concept but discuss it in the context the user cares about.
Extract connected subgraphs based on semantic similarity, not just explicit edges:
# "Show me everything semantically related to 'epistemic status'"
query_emb = embed("epistemic status")
# Find ALL entities semantically close (threshold = 0.7)
semantic_neighborhood = {
"concepts": cosine_search(Concept.embedding, query_emb, threshold=0.7),
"sources": cosine_search(source_embeddings, query_emb, threshold=0.7),
"relationships": cosine_search(vocabulary_embeddings, query_emb, threshold=0.7)
}
# Extract connected subgraph containing these entities
subgraph = extract_connected_subgraph(semantic_neighborhood)
# Visualize: Everything semantically related, regardless of entity type
Use Case: Explore a topic by finding all concepts, sources, and relationships semantically related to it, not just those explicitly linked.
Find implicit relationships via embedding proximity:
// Find concepts that are semantically similar but not explicitly connected
MATCH (c1:Concept), (c2:Concept)
WHERE embedding_similarity(c1, c2) > 0.85
AND NOT (c1)-[]-(c2) // Not explicitly connected
// Find sources that bridge them
MATCH (s:Source)
WHERE source_embedding_similarity(s, c1) > 0.75
AND source_embedding_similarity(s, c2) > 0.75
RETURN c1, c2, s
// Result: "These concepts aren't linked, but this source passage
// discusses both → potential emergent relationship"
Use Case: Discover hidden connections: concepts that should be related based on semantic proximity but haven't been explicitly linked yet.
With visual embeddings (ADR-057), blend text + visual semantics:
# Query: "system architecture"
results = blend_multimodal([
search_concepts("system architecture"),
search_sources("architecture diagrams"),
search_relationships("defines structure"),
search_images(visual_query="architecture diagram") # Visual similarity
])
# Result: Concepts + passages + diagrams, all ranked by semantic relevance
Use Case: Find everything related to a topic (concepts, source passages, AND diagrams/images), all ranked by unified semantic similarity.
Traditional RAG (Retrieval-Augmented Generation):
Documents → Chunks → Embeddings → Vector DB → Retrieve → Generate
Large Concept Model (LCM) with Unified Embeddings:
Documents → {Concepts, Sources, Relationships} → Embeddings → Multi-Entity Graph
↓
Dynamic Routing + Blending + Emergent Discovery
↓
Semantic Subgraphs + Context-Aware Ranking
Key Differences:
- Multi-Entity: Not just document chunks, but concepts + sources + relationships
- Semantic Graph: Explicit edges PLUS embedding-based proximity
- Dynamic Routing: Query finds best entry point automatically
- Blended Results: Combine signals from multiple entity types
- Emergent Discovery: Find implicit relationships via embedding similarity
This is only possible with unified embedding regeneration (Phase 4).
Rejected: Too slow for real-time queries. Source embedding generation would block response.
Rejected: Cannot support multiple chunks per Source. Loses granularity and offset tracking.
Rejected: Loses access to full source context. Cannot retrieve similar passages directly.
Rejected: Full-text search is lexical, not semantic. Misses conceptual similarity.
- Migration adds field to Source nodes
- NULL for existing Sources
- Computed on-demand during embedding generation
- Leverage existing regenerate embeddings worker for backfill
- Use existing kg_api.embedding_config (system-wide)
- Source embeddings MUST match concept embedding dimensions
- No opt-in/opt-out flags
- Balances granularity vs overhead
- Most Sources (500-1500 words) → 1-2 embedding chunks
- Large document: 10 Sources → 10-20 embeddings total
- Chunking overlap from ingestion ensures continuity
- Source embedding generation is first-class feature
- Runs automatically for all ingestions
- Simplified architecture (no conditional logic)
- Existing regenerate embeddings worker handles backfill
- Worker cures NULL content_hash
- Operators control regeneration timing
- ADR-022: Semantic Relationship Taxonomy (Porter stemmer hybrid chunking with overlap)
- ADR-044: Probabilistic Truth Convergence (relationship embeddings for grounding)
- ADR-045: Unified Embedding Generation (EmbeddingWorker architecture)
- ADR-057: Multimodal Image Ingestion (visual embeddings for images)
- ADR-014: Job Approval Workflow (async job processing)
- ADR-039: Local Embedding Service (embedding configuration system)
The term "Large Concept Model" extends the RAG paradigm to full graph embeddings:
Traditional RAG Stack:
- Chunk documents
- Embed chunks
- Store in vector DB
- Retrieve similar chunks
- Generate response
LCM Stack (Proposed):
- Chunk documents → Sources
- Extract concepts → Concepts
- Generate relationships → Edges
- Embed EVERYTHING → Sources, Concepts, Edges
- Multi-modal retrieval → Text, concept, relationship, visual
- Graph-aware generation → Context from graph structure + embeddings
- Emergent synthesis → Discover new relationships via proximity
This ADR implements step 4 for Sources, completing the embedding coverage.
- Dense Passage Retrieval (DPR) - Dual-encoder architecture for passage retrieval
- ColBERT - Late interaction for efficient passage ranking
- REALM - Retrieval-augmented language model pre-training
- Graph Neural Networks - Comprehensive survey of GNN architectures
# Find passages similar to query
client.search_sources(
query="How does grounding strength work?",
ontology="ADRs",
limit=5
)
# Returns:
# - Top 5 most similar source passages
# - Attached concepts for each passage
# - Similarity scores
# Find concepts, then expand to source context
results = client.hybrid_search(
query="epistemic status measurement",
concept_limit=3,
source_limit=10,
expand_context=True # Include neighboring source paragraphs
)
# Returns:
# - 3 most relevant concepts
# - 10 most similar source passages
# - Context window around matched sources
# Given a concept, find surrounding source context
client.get_concept_context(
concept_id="concept-123",
window_size=2 # ±2 paragraphs
)
# Returns:
# - Source paragraph containing concept
# - 2 paragraphs before
# - 2 paragraphs after
# - Enables reading concept in original context
# Find similar passages across multiple documents
client.cross_document_similarity(
source_id="doc1_chunk5",
ontologies=["ADRs", "CodeDocs", "Research"],
limit=10
)
# Returns:
# - Similar passages from other documents
# - Identifies conceptual overlap
# - Builds "source graph" of related passages
Last Updated: 2025-11-27
Next Review: After Phase 1 implementation (1 month)