Superseded by ADR-007
Note: This ADR was never implemented. The core technical decisions (Qdrant, embeddings, hybrid search) remain valid and are incorporated into ADR-007, which adds user-controlled background job management, task queuing, multi-user scheduling, and web UI integration. See ADR-007: Background Vector Sync with User-Controlled Job Management for the implemented architecture.
ADR-001 introduced token-based keyword search with relevance ranking, which improved upon simple substring matching. However, this approach still has fundamental limitations:
- Lexical Matching Only: Requires exact word matches (e.g., "automobile" won't match "car")
- No Semantic Understanding: Cannot understand intent or context (e.g., "how to bake bread" won't match "bread recipe")
- Language Barriers: Poor support for synonyms, related terms, or multilingual content
- No Cross-Content Search: Cannot find related content across different apps (notes, files, calendar)
- Scaling Issues: Performance degrades with large content collections
LLM-powered applications (Claude via MCP) benefit significantly from semantic search capabilities:
- Context Discovery: Find relevant information based on meaning, not just keywords
- Knowledge Retrieval: Retrieve contextually relevant notes/files for task completion
- Cross-Referencing: Connect related information across different content types
- Natural Language Queries: Support conversational search patterns
- Multi-User Environment: OAuth-based with per-user isolation and permissions
- Multi-Tenant: Single deployment serving multiple users with strict data isolation
- Real-Time Search: Sub-second query latency for good UX
- Large Content: Support for documents, PDFs, images with text extraction
- Privacy: No external API calls for sensitive content (optionally self-hosted)
- Hybrid Search: Combine semantic and keyword search for best results
We will implement semantic search using a vector database with the following architecture:
- Vector Database: Qdrant as external sidecar service
- Embedding Strategy: Configurable (OpenAI API / local models / self-hosted)
- Search Pattern: Hybrid search (semantic + keyword fusion)
- Multi-Tenancy: Single collection with user_id filtering
- Authorization: Dual-phase (vector search + Nextcloud API verification)
- Sync Strategy: Background worker with incremental updates (see ADR-002)
┌─────────────────────────────────────────────────────────────┐
│                    User Request (OAuth)                     │
│                  "find notes about baking"                  │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│              MCP Server (Semantic Search Tool)              │
│                                                             │
│  1. Generate query embedding                                │
│  2. Search vector DB (user_id filter)                       │
│  3. Verify permissions via Nextcloud API                    │
│  4. Return ranked results                                   │
└──────────┬──────────────────────────────┬───────────────────┘
           │                              │
           ▼                              ▼
┌──────────────────────┐    ┌─────────────────────────────────┐
│  Embedding Service   │    │  Qdrant Vector Database         │
│  - OpenAI API        │    │                                 │
│  - Local Model       │    │  Collection: nextcloud_content  │
│  - Self-hosted       │    │  - User-filtered vectors        │
└──────────────────────┘    │  - Metadata for auth            │
                            │  - HNSW index                   │
                            └─────────────────────────────────┘
                                             ▲
                                             │
                                             │ Indexing
                                             │
                            ┌────────────────┴────────────────┐
                            │  Background Sync Worker         │
                            │  (see ADR-002 for auth)         │
                            │                                 │
                            │  1. Fetch user content          │
                            │  2. Generate embeddings         │
                            │  3. Upsert to Qdrant            │
                            │  4. Update metadata             │
                            └─────────────────────────────────┘
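The four steps in the MCP Server box can be sketched without any external service. The following is a minimal illustration of the dual-phase idea only: `IndexedDoc`, `dual_phase_search`, and the `can_access` callback are hypothetical names standing in for the Qdrant payload filter (phase 1) and the Nextcloud API check (phase 2), and the scores are made up.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IndexedDoc:
    """Hypothetical stand-in for a vector-DB hit with its payload."""
    content_id: str
    user_id: str
    score: float
    shared_with: list[str] = field(default_factory=list)


def dual_phase_search(
    candidates: list[IndexedDoc],
    current_user: str,
    can_access: Callable[[str, str], bool],
    limit: int = 10,
) -> list[IndexedDoc]:
    # Phase 1: owner-or-shared filter, mirroring the payload filter
    # that would be applied inside the vector database.
    visible = [
        d for d in candidates
        if d.user_id == current_user or current_user in d.shared_with
    ]
    visible.sort(key=lambda d: d.score, reverse=True)

    # Phase 2: authoritative check against the source of truth,
    # because the index may be stale (e.g. a share was revoked).
    verified = []
    for doc in visible:
        if can_access(current_user, doc.content_id):
            verified.append(doc)
        if len(verified) >= limit:
            break
    return verified


index = [
    IndexedDoc("note-1", "alice", 0.91),
    IndexedDoc("note-2", "bob", 0.88, shared_with=["alice"]),
    IndexedDoc("note-3", "bob", 0.95),
    IndexedDoc("note-4", "alice", 0.80),
]
# Assume the share on note-2 was revoked after it was indexed.
revoked = {("alice", "note-2")}
results = dual_phase_search(index, "alice", lambda u, c: (u, c) not in revoked)
print([d.content_id for d in results])
```

In this toy run, `note-3` is removed by the phase-1 filter and `note-2` by the phase-2 check, which is the case the second phase exists to catch.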
After evaluating multiple options, we select Qdrant for the following reasons:
Qdrant Advantages:
- ✅ Native async Python client (qdrant-client)
- ✅ Efficient multi-tenancy via filtered search (no collection-per-user needed)
- ✅ Built-in hybrid search support (dense + sparse vectors)
- ✅ HNSW index with excellent performance
- ✅ Lightweight Docker deployment
- ✅ Persistent storage with snapshots
- ✅ API key authentication
- ✅ Active development and documentation
Comparison with Alternatives:
| Feature | Qdrant | Chroma | Weaviate | pgvector |
|---|---|---|---|---|
| Async Python | ✅ | ✅ | ✅ | |
| Multi-tenant filtering | ✅ | ✅ | ✅ | |
| Hybrid search | ✅ | ❌ | ✅ | |
| Docker deployment | ✅ Easy | ✅ Easy | ✅ Complex | |
| Memory usage | ✅ Low | ✅ Low | | |
| Maturity | ✅ Production | ✅ Production | ✅ Mature |
Decision: Qdrant provides the best balance of features, performance, and ease of deployment.
Support multiple embedding backends with automatic fallback:
import logging
import os

from openai import AsyncOpenAI

logger = logging.getLogger(__name__)


class EmbeddingService:
    """Unified interface for embedding generation"""

    def __init__(self):
        self.provider = self._detect_provider()

    def _detect_provider(self) -> EmbeddingProvider:
        """Auto-detect available embedding provider"""
        # Tier 1: OpenAI API (best quality, requires API key)
        if os.getenv("OPENAI_API_KEY"):
            return OpenAIEmbedding(
                model=os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small"),
                api_key=os.getenv("OPENAI_API_KEY")
            )
        # Tier 2: Self-hosted embedding service (good quality, privacy-preserving)
        if os.getenv("EMBEDDING_SERVICE_URL"):
            return HTTPEmbedding(
                url=os.getenv("EMBEDDING_SERVICE_URL"),
                model=os.getenv("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5")
            )
        # Tier 3: Local model (fallback, CPU-only)
        logger.warning("No cloud/hosted embeddings available, using local model")
        return LocalEmbedding(
            model=os.getenv("LOCAL_EMBEDDING_MODEL", "all-MiniLM-L6-v2")
        )

    async def embed(self, text: str) -> list[float]:
        """Generate embedding vector for text"""
        return await self.provider.embed(text)

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        """Generate embeddings for multiple texts (optimized)"""
        return await self.provider.embed_batch(texts)


class OpenAIEmbedding(EmbeddingProvider):
    """OpenAI embedding API"""

    def __init__(self, model: str, api_key: str):
        self.client = AsyncOpenAI(api_key=api_key)
        self.model = model
        # Model-dependent: text-embedding-3-large returns 3072 dims,
        # text-embedding-3-small returns 1536
        self.dimension = 3072 if "3-large" in model else 1536

    async def embed(self, text: str) -> list[float]:
        response = await self.client.embeddings.create(
            model=self.model,
            input=text
        )
        return response.data[0].embedding

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        # OpenAI supports batch up to 2048 inputs
        response = await self.client.embeddings.create(
            model=self.model,
            input=texts
        )
        return [item.embedding for item in response.data]
Costs: text-embedding-3-small: $0.02 per 1M tokens (~4M characters)
- 10,000 notes × 500 words avg = ~$0.10 to index
- Searches are extremely cheap (~$0.00002 per query)
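The indexing figure above can be reproduced with a short calculation. This is a rough sketch only: the ratio of about 1.33 tokens per English word is an assumed rule of thumb, not a number from this ADR, and `estimate_indexing_cost` is a hypothetical helper.

```python
def estimate_indexing_cost(
    num_docs: int,
    avg_words: int,
    price_per_million_tokens: float = 0.02,  # text-embedding-3-small price quoted above
    tokens_per_word: float = 1.33,           # assumption: rough English average
) -> float:
    """Return the estimated one-time embedding cost in USD."""
    total_tokens = num_docs * avg_words * tokens_per_word
    return total_tokens / 1_000_000 * price_per_million_tokens


cost = estimate_indexing_cost(num_docs=10_000, avg_words=500)
print(f"${cost:.2f}")  # prints $0.13
```

Under that assumption the estimate comes out at roughly $0.13, the same order of magnitude as the ~$0.10 quoted; the exact value depends on the tokenizer and the content.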
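class HTTPEmbedding(EmbeddingProvider):
    """Self-hosted embedding service (Infinity, TEI, Ollama)"""

    def __init__(self, url: str, model: str):
        self.client = httpx.AsyncClient()
        self.url = url
        self.model = model
        self.dimension = 384  # Model-dependent (bge-small: 384, bge-base: 768)

    async def embed(self, text: str) -> list[float]:
        response = await self.client.post(
            f"{self.url}/embeddings",
            json={"input": text, "model": self.model}
        )
        response.raise_for_status()
        return response.json()["data"][0]["embedding"]

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        # Assumes an OpenAI-compatible endpoint that accepts a list as "input"
        response = await self.client.post(
            f"{self.url}/embeddings",
            json={"input": texts, "model": self.model}
        )
        response.raise_for_status()
        return [item["embedding"] for item in response.json()["data"]]
Self-Hosted Options: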
- Infinity: Lightweight, OpenAI-compatible API, GPU support
- Text Embeddings Inference (TEI): HuggingFace official, optimized, Rust-based
- Ollama: Easy setup, multi-model support, CPU/GPU
class LocalEmbedding(EmbeddingProvider):
    """Local embedding using sentence-transformers (CPU fallback)"""

    def __init__(self, model: str):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer(model)
        self.dimension = self.model.get_sentence_embedding_dimension()

    async def embed(self, text: str) -> list[float]:
        # Run in thread pool to avoid blocking the event loop
        loop = asyncio.get_running_loop()
        embedding = await loop.run_in_executor(None, self.model.encode, text)
        return embedding.tolist()

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        # encode() accepts a list and returns one row per input text
        loop = asyncio.get_running_loop()
        embeddings = await loop.run_in_executor(None, self.model.encode, texts)
        return embeddings.tolist()
Recommended Local Models:
- all-MiniLM-L6-v2: 384 dims, fast, good quality
- all-mpnet-base-v2: 768 dims, slower, better quality
- paraphrase-multilingual-MiniLM-L12-v2: Multilingual support
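Whichever model is chosen, results are ranked by cosine similarity between the query vector and the stored vectors, and both must have the same dimension. The sketch below shows that computation in plain Python; the three-dimensional vectors are made-up toy values, not real model output, and `cosine_similarity` is an illustrative helper rather than part of any library used here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


# Toy embeddings: related terms point in similar directions.
car = [0.9, 0.1, 0.3]
automobile = [0.8, 0.2, 0.3]
recipe = [0.1, 0.9, 0.2]

print(round(cosine_similarity(car, automobile), 3))
print(round(cosine_similarity(car, recipe), 3))
```

With these toy values the first pair scores well above a 0.7 cut-off and the second well below it, which is the behaviour a score threshold relies on. Mixing vectors from models with different dimensions raises an error here, which is why switching embedding models requires re-indexing.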
# Qdrant collection configuration
collection_config = {
    "collection_name": "nextcloud_content",
    "vectors_config": {
        "size": 384,  # Embedding dimension (model-dependent)
        "distance": "Cosine"  # Cosine similarity for semantic search
    },
    "optimizers_config": {
        # Qdrant measures this threshold in kilobytes of vector data per
        # segment, not in number of vectors
        "indexing_threshold": 10000
    },
    "hnsw_config": {
        "m": 16,  # Number of edges per node (balance speed/accuracy)
        "ef_construct": 100  # Quality of index construction
    }
}
# Payload schema (metadata)
payload_schema = {
    "user_id": str,  # Required: owner of content
    "content_type": str,  # "note", "file", "calendar_event"
    "content_id": str,  # Source ID (note_id, file_path, event_id)
    "title": str,  # Searchable title
    "excerpt": str,  # First 200 chars for preview
    "category": str,  # Optional: category/folder
    "mime_type": str,  # Optional: file MIME type
    "shared_with": list[str],  # Optional: list of user_ids with access
    "tags": list[str],  # Optional: user tags
    "created_at": int,  # Unix timestamp
    "modified_at": int,  # Unix timestamp
    "indexed_at": int  # Unix timestamp (for sync tracking)
}
# User-specific search with filtering
search_results = await qdrant_client.search(
    collection_name="nextcloud_content",
    query_vector=query_embedding,
    query_filter=models.Filter(
        must=[
            # User owns the content OR it's shared with them
            models.Filter(
                should=[
                    models.FieldCondition(
                        key="user_id",
                        match=models.MatchValue(value=current_user_id)
                    ),
                    models.FieldCondition(
                        key="shared_with",
                        match=models.MatchAny(any=[current_user_id])
                    )
                ]
            ),
            # Optional: filter by content type
            models.FieldCondition(
                key="content_type",
                match=models.MatchValue(value="note")
            )
        ]
    ),
    limit=20,
    score_threshold=0.7  # Only return confident matches
)
Combine semantic and keyword search for best results:
@mcp.tool()
@require_scopes("notes:read")
async def nc_notes_hybrid_search(
    query: str,
    ctx: Context,
    limit: int = 10,
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3
) -> SearchNotesResponse:
    """
    Hybrid search combining semantic understanding with keyword precision.

    Args:
        query: Natural language search query
        limit: Maximum results to return
        semantic_weight: Weight for semantic similarity (0-1)
        keyword_weight: Weight for keyword matching (0-1)
    """
    client = get_client(ctx)
    username = client.username
    # Run searches in parallel
    semantic_results, keyword_results = await asyncio.gather(
        semantic_search(query, username, limit=limit * 2),
        keyword_search(query, username, limit=limit * 2)
    )
    # Fusion: Combine and rerank results
    fused_results = reciprocal_rank_fusion(
        semantic_results,
        keyword_results,
        semantic_weight=semantic_weight,
        keyword_weight=keyword_weight
    )
    # Verify permissions via Nextcloud API (dual-phase authorization)
    verified_results = []
    for result in fused_results[:limit * 2]:  # Get extra for filtering
        try:
            note = await client.notes.get_note(result["note_id"])
        except HTTPStatusError as e:
            if e.response.status_code in (403, 404):
                continue  # User lost access, or the note was deleted after indexing
            raise
        verified_results.append({
            "note": note,
            "score": result["score"],
            "match_type": result["match_type"]  # "semantic", "keyword", or "semantic+keyword"
        })
        if len(verified_results) >= limit:
            break
    return SearchNotesResponse(
        results=verified_results,
        query=query,
        total_found=len(verified_results),
        search_method="hybrid"
    )
def reciprocal_rank_fusion(
    semantic_results: list[dict],
    keyword_results: list[dict],
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3,
    k: int = 60  # RRF constant
) -> list[dict]:
    """
    Reciprocal Rank Fusion for combining search results.

    RRF is more robust than score normalization because it only
    depends on ranks, not absolute scores.
    """
    # Build rank maps (1-based, as in the usual RRF formulation)
    semantic_ranks = {r["note_id"]: i for i, r in enumerate(semantic_results, start=1)}
    keyword_ranks = {r["note_id"]: i for i, r in enumerate(keyword_results, start=1)}
    # Get all unique note IDs
    all_note_ids = set(semantic_ranks) | set(keyword_ranks)
    # Calculate fused scores
    fused = []
    for note_id in all_note_ids:
        # RRF formula: score = sum(weight_i / (k + rank_i))
        semantic_score = 0.0
        keyword_score = 0.0
        match_type = []
        if note_id in semantic_ranks:
            semantic_score = semantic_weight / (k + semantic_ranks[note_id])
            match_type.append("semantic")
        if note_id in keyword_ranks:
            keyword_score = keyword_weight / (k + keyword_ranks[note_id])
            match_type.append("keyword")
        fused.append({
            "note_id": note_id,
            "score": semantic_score + keyword_score,
            "match_type": "+".join(match_type)
        })
    # Sort by fused score
    fused.sort(key=lambda x: x["score"], reverse=True)
    return fused
For large documents (>1000 tokens), implement semantic chunking:
class DocumentChunker:
    """Chunk large documents for optimal embedding"""

    def __init__(self, chunk_size: int = 512, overlap: int = 50):
        if not 0 <= overlap < chunk_size:
            # Otherwise the window would never advance
            raise ValueError("overlap must be non-negative and smaller than chunk_size")
        self.chunk_size = chunk_size  # tokens
        self.overlap = overlap  # overlapping tokens

    def chunk_document(
        self,
        content: str,
        metadata: dict
    ) -> list[tuple[str, dict]]:
        """
        Split document into overlapping chunks with metadata.

        Returns list of (chunk_text, chunk_metadata) tuples.
        """
        # Tokenize (approximate with words for simplicity)
        tokens = content.split()
        if len(tokens) <= self.chunk_size:
            # Document fits in single chunk
            return [(content, metadata)]
        chunks = []
        start = 0
        while start < len(tokens):
            end = min(start + self.chunk_size, len(tokens))
            chunk_text = " ".join(tokens[start:end])
            # Add chunk metadata
            chunk_metadata = {
                **metadata,
                "chunk_index": len(chunks),
                "chunk_start": start,
                "chunk_end": end,
                "is_chunk": True
            }
            chunks.append((chunk_text, chunk_metadata))
            if end == len(tokens):
                break  # Last chunk emitted; avoid a trailing overlap-only chunk
            # Move to next chunk with overlap
            start = end - self.overlap
        return chunks
# Usage in sync worker
import uuid


async def index_document(doc: Document, user_id: str):
    """Index a document with chunking"""
    chunker = DocumentChunker(chunk_size=512, overlap=50)
    chunks = chunker.chunk_document(
        content=doc.content,
        metadata={
            "user_id": user_id,
            "content_type": "file",
            "content_id": doc.path,
            "title": doc.title,
            "mime_type": doc.mime_type
        }
    )
    # Generate embeddings in batch
    chunk_texts = [chunk[0] for chunk in chunks]
    embeddings = await embedding_service.embed_batch(chunk_texts)
    # Upsert all chunks
    points = []
    for (chunk_text, chunk_metadata), embedding in zip(chunks, embeddings):
        # Deterministic ID so that re-indexing overwrites the same point
        # instead of adding a duplicate (a random uuid4 would duplicate)
        chunk_index = chunk_metadata.get("chunk_index", 0)
        point_id = uuid.uuid5(uuid.NAMESPACE_URL, f"{user_id}:{doc.path}:{chunk_index}")
        points.append(
            models.PointStruct(
                id=str(point_id),
                vector=embedding,
                payload={
                    **chunk_metadata,
                    "excerpt": chunk_text[:200]  # Preview
                }
            )
        )
    await qdrant_client.upsert(
        collection_name="nextcloud_content",
        points=points
    )
# nextcloud_mcp_server/sync/vector_indexer.py
import time
import uuid


class VectorIndexer:
    """Indexes content into vector database"""

    def __init__(
        self,
        qdrant_client: AsyncQdrantClient,
        embedding_service: EmbeddingService,
        auth_provider: SyncAuthProvider  # From ADR-002
    ):
        self.qdrant = qdrant_client
        self.embeddings = embedding_service
        self.auth = auth_provider

    async def sync_user_notes(self, user_id: str):
        """Sync all notes for a user"""
        # Get authenticated client for user
        client = await self.auth.get_user_client(user_id)
        # Fetch all notes
        notes = await client.notes.list_notes()
        logger.info(f"Syncing {len(notes)} notes for {user_id}")
        # Check which notes need updating (keys match the stored content_id)
        indexed = await self._get_indexed_note_ids(user_id)
        notes_to_update = [
            n for n in notes
            if str(n.id) not in indexed
            or n.modified > indexed[str(n.id)]
        ]
        if not notes_to_update:
            logger.info(f"All notes up-to-date for {user_id}")
            return
        # Generate embeddings in batch
        contents = [f"{n.title}\n\n{n.content}" for n in notes_to_update]
        embeddings = await self.embeddings.embed_batch(contents)
        # Prepare points for upsert
        points = []
        for note, embedding in zip(notes_to_update, embeddings):
            points.append(
                models.PointStruct(
                    # Qdrant point IDs must be unsigned integers or UUIDs, so
                    # derive a stable UUID from the note key
                    id=str(uuid.uuid5(uuid.NAMESPACE_URL, f"note_{note.id}")),
                    vector=embedding,
                    payload={
                        "user_id": user_id,
                        "content_type": "note",
                        "content_id": str(note.id),
                        "note_id": note.id,
                        "title": note.title,
                        "excerpt": note.content[:200],
                        "category": note.category,
                        "created_at": note.created,
                        "modified_at": note.modified,
                        "indexed_at": int(time.time())
                    }
                )
            )
        # Upsert to Qdrant
        await self.qdrant.upsert(
            collection_name="nextcloud_content",
            points=points
        )
        logger.info(f"Indexed {len(points)} notes for {user_id}")

    async def _get_indexed_note_ids(self, user_id: str) -> dict[str, int]:
        """Get map of content_id -> modified_at for indexed notes"""
        note_filter = models.Filter(
            must=[
                models.FieldCondition(
                    key="user_id",
                    match=models.MatchValue(value=user_id)
                ),
                models.FieldCondition(
                    key="content_type",
                    match=models.MatchValue(value="note")
                )
            ]
        )
        indexed: dict[str, int] = {}
        offset = None
        while True:
            # scroll() returns (points, next_page_offset); page until exhausted
            points, offset = await self.qdrant.scroll(
                collection_name="nextcloud_content",
                scroll_filter=note_filter,
                with_payload=["content_id", "modified_at"],
                limit=1000,
                offset=offset
            )
            for point in points:
                indexed[point.payload["content_id"]] = point.payload["modified_at"]
            if offset is None:
                break
        return indexed

    async def delete_note(self, user_id: str, note_id: int):
        """Remove deleted note from index"""
        await self.qdrant.delete(
            collection_name="nextcloud_content",
            points_selector=models.FilterSelector(
                filter=models.Filter(
                    must=[
                        models.FieldCondition(
                            key="user_id",
                            match=models.MatchValue(value=user_id)
                        ),
                        models.FieldCondition(
                            key="note_id",
                            match=models.MatchValue(value=note_id)
                        )
                    ]
                )
            )
        )
# Vector Database
QDRANT_URL=http://qdrant:6333
QDRANT_API_KEY=<secure-api-key>
QDRANT_COLLECTION=nextcloud_content
# Embedding Strategy (choose one)
# Option 1: OpenAI
OPENAI_API_KEY=sk-...
OPENAI_EMBEDDING_MODEL=text-embedding-3-small # or text-embedding-3-large
# Option 2: Self-hosted
EMBEDDING_SERVICE_URL=http://embeddings:7997
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
# Option 3: Local (fallback, no config needed)
# Search Configuration
SEMANTIC_SEARCH_ENABLED=true
HYBRID_SEARCH_DEFAULT_SEMANTIC_WEIGHT=0.7
HYBRID_SEARCH_DEFAULT_KEYWORD_WEIGHT=0.3
SEARCH_SCORE_THRESHOLD=0.7
# Sync Configuration
VECTOR_SYNC_INTERVAL=300 # seconds
VECTOR_SYNC_BATCH_SIZE=100
services:
  # Vector Database
  qdrant:
    image: qdrant/qdrant:latest
    restart: always
    ports:
      - 127.0.0.1:6333:6333  # REST API
      - 127.0.0.1:6334:6334  # gRPC
    volumes:
      - qdrant_storage:/qdrant/storage
    environment:
      - QDRANT__SERVICE__API_KEY=${QDRANT_API_KEY}
      - QDRANT__SERVICE__HTTP_PORT=6333
      - QDRANT__SERVICE__GRPC_PORT=6334
  # Embedding Service (optional - for self-hosted)
  embeddings:
    image: michaelf34/infinity:latest
    restart: always
    ports:
      - 127.0.0.1:7997:7997
    volumes:
      - embedding_models:/app/.cache
    environment:
      - MODEL_ID=BAAI/bge-small-en-v1.5
      - BATCH_SIZE=32
      - ENGINE=torch  # or optimum for better CPU performance
    # Optional: GPU support
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  # MCP Server with vector search
  mcp:
    build: .
    command: ["--transport", "streamable-http"]
    depends_on:
      - app
      - qdrant
      - embeddings  # optional
    environment:
      # ... existing env vars ...
      - SEMANTIC_SEARCH_ENABLED=true
      - QDRANT_URL=http://qdrant:6333
      - QDRANT_API_KEY=${QDRANT_API_KEY}
      # Choose embedding strategy
      - EMBEDDING_SERVICE_URL=http://embeddings:7997
      # OR
      # - OPENAI_API_KEY=${OPENAI_API_KEY}
  # Vector Sync Worker
  mcp-vector-sync:
    build: .
    command: ["python", "-m", "nextcloud_mcp_server.sync.vector_indexer"]
    depends_on:
      - app
      - qdrant
      - embeddings  # optional
    environment:
      # Nextcloud + Auth (from ADR-002)
      - NEXTCLOUD_HOST=http://app:80
      - ENABLE_OFFLINE_ACCESS=true
      - TOKEN_ENCRYPTION_KEY=${TOKEN_ENCRYPTION_KEY}
      # Vector Database
      - QDRANT_URL=http://qdrant:6333
      - QDRANT_API_KEY=${QDRANT_API_KEY}
      # Embeddings
      - EMBEDDING_SERVICE_URL=http://embeddings:7997
    volumes:
      - sync-tokens:/app/data
volumes:
  qdrant_storage:
  embedding_models:
  sync-tokens:
# Batch embedding generation
async def embed_batch_chunked(
    texts: list[str],
    batch_size: int = 100
) -> list[list[float]]:
    """Generate embeddings in chunks to avoid memory issues"""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        batch_embeddings = await embedding_service.embed_batch(batch)
        embeddings.extend(batch_embeddings)
        await asyncio.sleep(0.1)  # Rate limiting
    return embeddings
# Sequential upsert in batches
async def upsert_points_batched(
    points: list[models.PointStruct],
    batch_size: int = 100
):
    """Upsert points in batches"""
    for i in range(0, len(points), batch_size):
        batch = points[i:i + batch_size]
        await qdrant_client.upsert(
            collection_name="nextcloud_content",
            points=batch,
            wait=False  # Don't wait for indexing
        )
# Search with tuned HNSW parameters for better accuracy
search_results = await qdrant_client.search(
    collection_name="nextcloud_content",
    query_vector=query_embedding,
    query_filter=user_filter,
    limit=20,
    with_payload=True,
    with_vectors=False,  # Don't return vectors (saves bandwidth)
    search_params=models.SearchParams(
        hnsw_ef=128,  # Higher = more accurate but slower
        exact=False  # Use HNSW index
    )
)
# Cache embeddings for common queries
import hashlib
import json
from functools import lru_cache


@lru_cache(maxsize=1000)
def cache_key(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()


async def embed_with_cache(text: str) -> list[float]:
    """Generate embedding with caching"""
    key = cache_key(text)
    # Check Redis cache
    cached = await redis.get(f"embedding:{key}")
    if cached is not None:
        return json.loads(cached)
    # Generate embedding
    embedding = await embedding_service.embed(text)
    # Cache for 1 hour
    await redis.setex(
        f"embedding:{key}",
        3600,
        json.dumps(embedding)
    )
    return embedding
# Prometheus metrics
from prometheus_client import Counter, Histogram, Gauge

# Search metrics
semantic_search_count = Counter(
    'semantic_search_total',
    'Total semantic searches',
    ['user_id', 'content_type']
)
semantic_search_latency = Histogram(
    'semantic_search_duration_seconds',
    'Semantic search latency',
    ['phase']  # 'embedding', 'vector_search', 'verification'
)
# Indexing metrics
documents_indexed = Counter(
    'documents_indexed_total',
    'Total documents indexed',
    ['user_id', 'content_type']
)
index_queue_size = Gauge(
    'index_queue_size',
    'Number of documents waiting to be indexed'
)


# Usage
async def semantic_search(query: str, user_id: str):
    semantic_search_count.labels(user_id=user_id, content_type='note').inc()
    with semantic_search_latency.labels(phase='embedding').time():
        embedding = await embed(query)
    with semantic_search_latency.labels(phase='vector_search').time():
        results = await qdrant.search(...)
    with semantic_search_latency.labels(phase='verification').time():
        verified = await verify_access(results)
    return verified
- Semantic Understanding
  - Find content by meaning, not just keywords
  - Support for natural language queries
  - Cross-lingual search potential
  - Better context discovery for LLMs
- User Experience
  - More relevant search results
  - Discover related content across apps
  - Fast sub-second query latency
  - Hybrid search combines best of both worlds
- Architecture
  - External sidecar (doesn't bloat MCP server)
  - Configurable embedding backend (cloud/self-hosted/local)
  - Multi-tenant with strict isolation
  - Scales horizontally (Qdrant cluster)
- Privacy & Security
  - Self-hosted option available
  - Dual-phase authorization enforces permissions
  - Vector DB is cache, not source of truth
  - Per-user audit trail
- Developer Experience
  - Simple async Python API
  - Comprehensive monitoring
  - Clear upgrade path (better embeddings, reranking)
- Complexity
  - Additional infrastructure (Qdrant + embeddings)
  - More monitoring required
  - Embedding generation latency
  - Initial indexing time for large collections
- Cost
  - Storage: ~4KB per document (embedding + metadata)
  - Compute: Embedding generation (API costs or GPU)
  - Memory: Qdrant keeps vectors in RAM for speed
- Operational
  - Index maintenance and updates
  - Embedding model versioning
  - Handling deleted/moved content
  - Cold start indexing for new users
- Search Accuracy
  - Quality depends on embedding model
  - May miss exact keyword matches (mitigated by hybrid search)
  - Cultural/domain-specific terms may not embed well
  - Requires tuning score thresholds
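To make the hybrid-search mitigation concrete, here is a small worked example of weighted Reciprocal Rank Fusion on toy rankings. It is a sketch under stated assumptions: ranks are counted from 1, the weights and `k = 60` are the defaults used elsewhere in this ADR, and the document IDs and the `weighted_rrf` helper are invented for illustration.

```python
def weighted_rrf(
    semantic_ids: list[str],
    keyword_ids: list[str],
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3,
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse two ranked ID lists; score = sum(weight / (k + rank))."""
    scores: dict[str, float] = {}
    for weight, ranking in (
        (semantic_weight, semantic_ids),
        (keyword_weight, keyword_ids),
    ):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy rankings for the query "how to bake bread"
semantic = ["bread-recipe", "sourdough-tips", "oven-manual"]
keyword = ["baking-invoice", "bread-recipe"]

fused = weighted_rrf(semantic, keyword)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

The note found by both searches (`bread-recipe`) receives 0.7/61 + 0.3/62 and ranks first, while the keyword-only hit is still returned, just lower, because its single contribution is 0.3/61.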
| Metric | Target | Notes |
|---|---|---|
| Search latency | <200ms | Embedding + vector search + verification |
| Indexing throughput | >100 docs/sec | With batch embeddings |
| Memory per 10k docs | ~40MB | Qdrant vectors + metadata |
| Disk per 10k docs | ~40MB | Persistent storage |
| Search accuracy | >90% | At score_threshold=0.7 |
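The memory and disk targets follow from the vector dimension plus per-document metadata. The estimate below is a back-of-the-envelope sketch: the 2,500-byte payload size is an assumption chosen to be consistent with the ~4KB-per-document figure in this ADR, HNSW graph overhead is ignored, and `estimate_index_size_mb` is a hypothetical helper.

```python
def estimate_index_size_mb(
    num_docs: int,
    dimension: int = 384,        # e.g. bge-small or all-MiniLM-L6-v2
    bytes_per_float: int = 4,    # float32 vectors
    payload_bytes: int = 2500,   # assumption: title, excerpt, IDs, timestamps
) -> float:
    """Approximate storage for vectors plus payload, in megabytes."""
    per_doc = dimension * bytes_per_float + payload_bytes
    return num_docs * per_doc / 1_000_000


small_model = estimate_index_size_mb(10_000)                  # 384-dim model
openai_model = estimate_index_size_mb(10_000, dimension=1536)  # text-embedding-3-small

print(f"{small_model:.2f} MB")   # prints 40.36 MB
print(f"{openai_model:.2f} MB")  # prints 86.44 MB
```

Under these assumptions the ~40MB target holds for a 384-dimension model, while a 1536-dimension model roughly doubles the footprint for the same number of documents.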
Small Deployment (10 users, 1000 notes each):
- Initial indexing: 10,000 notes × $0.00002 = $0.20 (OpenAI)
- Monthly searches: 1000 queries × $0.00002 = $0.02
- Infrastructure: Qdrant (40MB RAM), Embeddings (optional)
- Total: ~$0.25/month (API) or self-hosted (negligible)
Medium Deployment (100 users, 500 notes each):
- Initial indexing: 50,000 notes × $0.00002 = $1.00
- Monthly searches: 10,000 queries × $0.00002 = $0.20
- Infrastructure: Qdrant (200MB RAM)
- Total: ~$1.20/month or self-hosted
Self-Hosted (any size):
- GPU instance: $0.50/hour ($360/month for 24/7)
- Or CPU-only: negligible cost, slower embeddings
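Combining the GPU price above with the OpenAI price quoted earlier ($0.02 per 1M tokens) gives a rough break-even volume. This is simple arithmetic on the ADR's own figures, assuming a 720-hour month; `breakeven_tokens_per_month` is a hypothetical helper, and the comparison covers cost only, not the privacy reasons for self-hosting.

```python
def breakeven_tokens_per_month(
    gpu_hourly_usd: float = 0.50,
    hours_per_month: int = 720,           # assumption: 30 days x 24 hours
    api_price_per_million: float = 0.02,  # text-embedding-3-small
) -> float:
    """Monthly token volume at which API spend equals an always-on GPU."""
    gpu_monthly_usd = gpu_hourly_usd * hours_per_month
    return gpu_monthly_usd / api_price_per_million * 1_000_000


tokens = breakeven_tokens_per_month()
print(f"{tokens / 1e9:.0f}B tokens/month")  # prints 18B tokens/month
```

On cost alone, a dedicated GPU only pays off at around 18 billion embedded tokens per month, far beyond the small and medium deployments sketched above.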
- Multimodal Search
  - Image embeddings (CLIP)
  - PDF/document layout understanding
  - Audio transcription + embedding
- Advanced Ranking
  - Cross-encoder reranking
  - Learning-to-rank models
  - User feedback signals
- Query Understanding
  - Query expansion
  - Spell correction
  - Entity extraction
- Performance
  - Query result caching
  - Approximate nearest neighbor improvements
  - Quantization for reduced memory
- Features
  - Saved searches
  - Search analytics
  - Recommended content
Approach: Use traditional full-text search engine with vector plugin
Pros:
- Mature ecosystem
- Excellent keyword search
- Rich query DSL
Cons:
- Heavy infrastructure (JVM-based)
- Complex setup and tuning
- Vector search is plugin/add-on (not native)
- Higher resource usage
Decision: Rejected; Qdrant is purpose-built for vectors
Approach: Embedded or client-server vector database
Pros:
- Simple Python API
- Easy to get started
- Good for prototyping
Cons:
- Sync-only Python client (no async)
- Limited multi-tenancy features
- Less mature than Qdrant
- Scaling concerns
Decision: Rejected; async and multi-tenancy are critical
Approach: Full-featured vector database with GraphQL
Pros:
- Very feature-rich
- Built-in vectorization
- Good documentation
Cons:
- More complex architecture
- Higher resource usage
- GraphQL adds complexity
- Overkill for our use case
Decision: Rejected; Qdrant provides better balance
Approach: Add vector search to existing PostgreSQL
Pros:
- Leverages existing PostgreSQL expertise
- Transactional consistency
- Mature database ecosystem
Cons:
- This deployment uses MariaDB (would need PostgreSQL)
- Performance not as optimized as purpose-built vector DB
- Manual hybrid search implementation
- HNSW index limitations
Decision: Rejected; dedicated vector DB is better fit
Approach: Managed cloud vector database
Pros:
- Fully managed
- Excellent performance
- No infrastructure management
Cons:
- Cloud-only (no self-hosting)
- Recurring costs
- Vendor lock-in
- Data leaves premises
Decision: Rejected; self-hosting is important for privacy
- ADR-001: Enhanced Note Search (establishes need for better search)
- ADR-002: Vector Sync Authentication (defines how sync workers authenticate)
- [Future] ADR-004: Content Extraction and Document Processing
- [Future] ADR-005: Cross-App Semantic Search