Skip to content

Latest commit

 

History

History
1118 lines (912 loc) · 35.5 KB

File metadata and controls

1118 lines (912 loc) · 35.5 KB

ADR-003: Vector Database and Semantic Search Architecture

Status

Superseded by ADR-007

Note: This ADR was never implemented. The core technical decisions (Qdrant, embeddings, hybrid search) remain valid and are incorporated into ADR-007, which adds user-controlled background job management, task queuing, multi-user scheduling, and web UI integration. See ADR-007: Background Vector Sync with User-Controlled Job Management for the implemented architecture.

Context

Current State

ADR-001 introduced token-based keyword search with relevance ranking, which improved upon simple substring matching. However, this approach still has fundamental limitations:

  1. Lexical Matching Only: Requires exact word matches (e.g., "automobile" won't match "car")
  2. No Semantic Understanding: Cannot understand intent or context (e.g., "how to bake bread" won't match "bread recipe")
  3. Language Barriers: Poor support for synonyms, related terms, or multilingual content
  4. No Cross-Content Search: Cannot find related content across different apps (notes, files, calendar)
  5. Scaling Issues: Performance degrades with large content collections

User Needs

LLM-powered applications (Claude via MCP) benefit significantly from semantic search capabilities:

  • Context Discovery: Find relevant information based on meaning, not just keywords
  • Knowledge Retrieval: Retrieve contextually relevant notes/files for task completion
  • Cross-Referencing: Connect related information across different content types
  • Natural Language Queries: Support conversational search patterns

Technical Requirements

  1. Multi-User Environment: OAuth-based with per-user isolation and permissions
  2. Multi-Tenant: Single deployment serving multiple users with strict data isolation
  3. Real-Time Search: Sub-second query latency for good UX
  4. Large Content: Support for documents, PDFs, images with text extraction
  5. Privacy: No external API calls for sensitive content (optionally self-hosted)
  6. Hybrid Search: Combine semantic and keyword search for best results

Decision

We will implement semantic search using a vector database with the following architecture:

Core Components

  1. Vector Database: Qdrant as external sidecar service
  2. Embedding Strategy: Configurable (OpenAI API / local models / self-hosted)
  3. Search Pattern: Hybrid search (semantic + keyword fusion)
  4. Multi-Tenancy: Single collection with user_id filtering
  5. Authorization: Dual-phase (vector search + Nextcloud API verification)
  6. Sync Strategy: Background worker with incremental updates (see ADR-002)

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                    User Request (OAuth)                      │
│                    "find notes about baking"                 │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────────┐
│               MCP Server (Semantic Search Tool)             │
│                                                              │
│  1. Generate query embedding                                │
│  2. Search vector DB (user_id filter)                       │
│  3. Verify permissions via Nextcloud API                    │
│  4. Return ranked results                                   │
└──────────┬─────────────────────────────┬────────────────────┘
           │                              │
           ▼                              ▼
┌──────────────────────┐      ┌──────────────────────────────┐
│ Embedding Service    │      │ Qdrant Vector Database        │
│ - OpenAI API         │      │                               │
│ - Local Model        │      │ Collection: nextcloud_content │
│ - Self-hosted        │      │ - User-filtered vectors       │
└──────────────────────┘      │ - Metadata for auth          │
                               │ - HNSW index                  │
                               └───────────────────────────────┘
                                          ▲
                                          │
                                          │ Indexing
                                          │
                               ┌──────────┴────────────────────┐
                               │ Background Sync Worker        │
                               │ (see ADR-002 for auth)        │
                               │                               │
                               │ 1. Fetch user content         │
                               │ 2. Generate embeddings        │
                               │ 3. Upsert to Qdrant          │
                               │ 4. Update metadata            │
                               └───────────────────────────────┘

Implementation Details

1. Vector Database Selection: Qdrant

After evaluating multiple options, we select Qdrant for the following reasons:

Qdrant Advantages:

  • ✅ Native async Python client (qdrant-client)
  • ✅ Efficient multi-tenancy via filtered search (no collection-per-user needed)
  • ✅ Built-in hybrid search support (dense + sparse vectors)
  • ✅ HNSW index with excellent performance
  • ✅ Lightweight Docker deployment
  • ✅ Persistent storage with snapshots
  • ✅ API key authentication
  • ✅ Active development and documentation

Comparison with Alternatives:

Feature Qdrant Chroma Weaviate pgvector
Async Python ⚠️ Sync
Multi-tenant filtering ⚠️ Limited
Hybrid search ⚠️ Manual
Docker deployment ✅ Easy ✅ Easy ✅ Complex ⚠️ Postgres
Memory usage ✅ Low ⚠️ Medium ⚠️ High ✅ Low
Maturity ✅ Production ⚠️ Young ✅ Production ✅ Mature

Decision: Qdrant provides the best balance of features, performance, and ease of deployment.

2. Embedding Strategy: Tiered Approach

Support multiple embedding backends with automatic fallback:

class EmbeddingService:
    """Unified interface for embedding generation"""

    def __init__(self):
        self.provider = self._detect_provider()

    def _detect_provider(self) -> EmbeddingProvider:
        """Auto-detect available embedding provider"""

        # Tier 1: OpenAI API (best quality, requires API key)
        if os.getenv("OPENAI_API_KEY"):
            return OpenAIEmbedding(
                model=os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small"),
                api_key=os.getenv("OPENAI_API_KEY")
            )

        # Tier 2: Self-hosted embedding service (good quality, privacy-preserving)
        if os.getenv("EMBEDDING_SERVICE_URL"):
            return HTTPEmbedding(
                url=os.getenv("EMBEDDING_SERVICE_URL"),
                model=os.getenv("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5")
            )

        # Tier 3: Local model (fallback, CPU-only)
        logger.warning("No cloud/hosted embeddings available, using local model")
        return LocalEmbedding(
            model=os.getenv("LOCAL_EMBEDDING_MODEL", "all-MiniLM-L6-v2")
        )

    async def embed(self, text: str) -> list[float]:
        """Generate embedding vector for text"""
        return await self.provider.embed(text)

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        """Generate embeddings for multiple texts (optimized)"""
        return await self.provider.embed_batch(texts)

2.1 OpenAI Embeddings (Tier 1)

class OpenAIEmbedding(EmbeddingProvider):
    """OpenAI embedding API"""

    def __init__(self, model: str, api_key: str):
        self.client = AsyncOpenAI(api_key=api_key)
        self.model = model
        self.dimension = 1536 if "3-small" in model else 1536  # Model-dependent

    async def embed(self, text: str) -> list[float]:
        response = await self.client.embeddings.create(
            model=self.model,
            input=text
        )
        return response.data[0].embedding

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        # OpenAI supports batch up to 2048 inputs
        response = await self.client.embeddings.create(
            model=self.model,
            input=texts
        )
        return [item.embedding for item in response.data]

Costs: text-embedding-3-small: $0.02 per 1M tokens (~4M characters)

  • 10,000 notes × 500 words avg = ~$0.10 to index
  • Searches are extremely cheap (~$0.00002 per query)

2.2 Self-Hosted Embeddings (Tier 2)

class HTTPEmbedding(EmbeddingProvider):
    """Self-hosted embedding service (Infinity, TEI, Ollama)"""

    def __init__(self, url: str, model: str):
        self.client = httpx.AsyncClient()
        self.url = url
        self.model = model
        self.dimension = 384  # Model-dependent (bge-small: 384, bge-base: 768)

    async def embed(self, text: str) -> list[float]:
        response = await self.client.post(
            f"{self.url}/embeddings",
            json={"input": text, "model": self.model}
        )
        response.raise_for_status()
        return response.json()["data"][0]["embedding"]

Self-Hosted Options:

  • Infinity: Lightweight, OpenAI-compatible API, GPU support
  • Text Embeddings Inference (TEI): HuggingFace official, optimized, Rust-based
  • Ollama: Easy setup, multi-model support, CPU/GPU

2.3 Local Embeddings (Tier 3)

class LocalEmbedding(EmbeddingProvider):
    """Local embedding using sentence-transformers (CPU fallback)"""

    def __init__(self, model: str):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer(model)
        self.dimension = self.model.get_sentence_embedding_dimension()

    async def embed(self, text: str) -> list[float]:
        # Run in thread pool to avoid blocking
        loop = asyncio.get_event_loop()
        embedding = await loop.run_in_executor(
            None,
            self.model.encode,
            text
        )
        return embedding.tolist()

Recommended Local Models:

  • all-MiniLM-L6-v2: 384 dims, fast, good quality
  • all-mpnet-base-v2: 768 dims, slower, better quality
  • paraphrase-multilingual-MiniLM-L12-v2: Multilingual support

3. Vector Database Schema

# Qdrant collection configuration
collection_config = {
    "collection_name": "nextcloud_content",
    "vectors_config": {
        "size": 384,  # Embedding dimension (model-dependent)
        "distance": "Cosine"  # Cosine similarity for semantic search
    },
    "optimizers_config": {
        "indexing_threshold": 10000  # Start indexing after 10k vectors
    },
    "hnsw_config": {
        "m": 16,  # Number of edges per node (balance speed/accuracy)
        "ef_construct": 100  # Quality of index construction
    }
}

# Payload schema (metadata)
payload_schema = {
    "user_id": str,           # Required: owner of content
    "content_type": str,      # "note", "file", "calendar_event"
    "content_id": str,        # Source ID (note_id, file_path, event_id)
    "title": str,             # Searchable title
    "excerpt": str,           # First 200 chars for preview
    "category": str,          # Optional: category/folder
    "mime_type": str,         # Optional: file MIME type
    "shared_with": list[str], # Optional: list of user_ids with access
    "tags": list[str],        # Optional: user tags
    "created_at": int,        # Unix timestamp
    "modified_at": int,       # Unix timestamp
    "indexed_at": int         # Unix timestamp (for sync tracking)
}

3.1 Multi-Tenancy via Filtering

# User-specific search with filtering
search_results = await qdrant_client.search(
    collection_name="nextcloud_content",
    query_vector=query_embedding,
    query_filter=models.Filter(
        must=[
            # User owns the content OR it's shared with them
            models.Filter(
                should=[
                    models.FieldCondition(
                        key="user_id",
                        match=models.MatchValue(value=current_user_id)
                    ),
                    models.FieldCondition(
                        key="shared_with",
                        match=models.MatchAny(any=[current_user_id])
                    )
                ]
            ),
            # Optional: filter by content type
            models.FieldCondition(
                key="content_type",
                match=models.MatchValue(value="note")
            )
        ]
    ),
    limit=20,
    score_threshold=0.7  # Only return confident matches
)

4. Hybrid Search Implementation

Combine semantic and keyword search for best results:

@mcp.tool()
@require_scopes("notes:read")
async def nc_notes_hybrid_search(
    query: str,
    ctx: Context,
    limit: int = 10,
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3
) -> SearchNotesResponse:
    """
    Hybrid search combining semantic understanding with keyword precision.

    Args:
        query: Natural language search query
        limit: Maximum results to return
        semantic_weight: Weight for semantic similarity (0-1)
        keyword_weight: Weight for keyword matching (0-1)
    """

    client = get_client(ctx)
    username = client.username

    # Run searches in parallel
    semantic_task = asyncio.create_task(
        semantic_search(query, username, limit=limit * 2)
    )
    keyword_task = asyncio.create_task(
        keyword_search(query, username, limit=limit * 2)
    )

    semantic_results, keyword_results = await asyncio.gather(
        semantic_task, keyword_task
    )

    # Fusion: Combine and rerank results
    fused_results = reciprocal_rank_fusion(
        semantic_results,
        keyword_results,
        semantic_weight=semantic_weight,
        keyword_weight=keyword_weight
    )

    # Verify permissions via Nextcloud API (dual-phase authorization)
    verified_results = []
    for result in fused_results[:limit * 2]:  # Get extra for filtering
        try:
            note = await client.notes.get_note(result["note_id"])
            verified_results.append({
                "note": note,
                "score": result["score"],
                "match_type": result["match_type"]  # "semantic", "keyword", "both"
            })
            if len(verified_results) >= limit:
                break
        except HTTPStatusError as e:
            if e.response.status_code == 403:
                continue  # User lost access
            raise

    return SearchNotesResponse(
        results=verified_results,
        query=query,
        total_found=len(verified_results),
        search_method="hybrid"
    )

def reciprocal_rank_fusion(
    semantic_results: list[dict],
    keyword_results: list[dict],
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3,
    k: int = 60  # RRF constant
) -> list[dict]:
    """
    Reciprocal Rank Fusion for combining search results.

    RRF is more robust than score normalization because it only
    depends on ranks, not absolute scores.
    """

    # Build rank maps
    semantic_ranks = {r["note_id"]: i for i, r in enumerate(semantic_results)}
    keyword_ranks = {r["note_id"]: i for i, r in enumerate(keyword_results)}

    # Get all unique note IDs
    all_note_ids = set(semantic_ranks.keys()) | set(keyword_ranks.keys())

    # Calculate fused scores
    fused = []
    for note_id in all_note_ids:
        # RRF formula: score = sum(weight_i / (k + rank_i))
        semantic_score = 0
        keyword_score = 0
        match_type = []

        if note_id in semantic_ranks:
            semantic_score = semantic_weight / (k + semantic_ranks[note_id])
            match_type.append("semantic")

        if note_id in keyword_ranks:
            keyword_score = keyword_weight / (k + keyword_ranks[note_id])
            match_type.append("keyword")

        fused.append({
            "note_id": note_id,
            "score": semantic_score + keyword_score,
            "match_type": "+".join(match_type)
        })

    # Sort by fused score
    fused.sort(key=lambda x: x["score"], reverse=True)
    return fused

5. Document Chunking Strategy

For large documents (>1000 tokens), implement semantic chunking:

class DocumentChunker:
    """Chunk large documents for optimal embedding"""

    def __init__(self, chunk_size: int = 512, overlap: int = 50):
        self.chunk_size = chunk_size  # tokens
        self.overlap = overlap  # overlapping tokens

    def chunk_document(
        self,
        content: str,
        metadata: dict
    ) -> list[tuple[str, dict]]:
        """
        Split document into overlapping chunks with metadata.

        Returns list of (chunk_text, chunk_metadata) tuples.
        """

        # Tokenize (approximate with words for simplicity)
        tokens = content.split()

        if len(tokens) <= self.chunk_size:
            # Document fits in single chunk
            return [(content, metadata)]

        chunks = []
        start = 0

        while start < len(tokens):
            end = start + self.chunk_size
            chunk_tokens = tokens[start:end]
            chunk_text = " ".join(chunk_tokens)

            # Add chunk metadata
            chunk_metadata = {
                **metadata,
                "chunk_index": len(chunks),
                "chunk_start": start,
                "chunk_end": end,
                "is_chunk": True
            }

            chunks.append((chunk_text, chunk_metadata))

            # Move to next chunk with overlap
            start = end - self.overlap

        return chunks

# Usage in sync worker
async def index_document(doc: Document, user_id: str):
    """Index a document with chunking"""

    chunker = DocumentChunker(chunk_size=512, overlap=50)
    chunks = chunker.chunk_document(
        content=doc.content,
        metadata={
            "user_id": user_id,
            "content_type": "file",
            "content_id": doc.path,
            "title": doc.title,
            "mime_type": doc.mime_type
        }
    )

    # Generate embeddings in batch
    chunk_texts = [chunk[0] for chunk in chunks]
    embeddings = await embedding_service.embed_batch(chunk_texts)

    # Upsert all chunks
    points = []
    for (chunk_text, chunk_metadata), embedding in zip(chunks, embeddings):
        points.append(
            models.PointStruct(
                id=str(uuid.uuid4()),
                vector=embedding,
                payload={
                    **chunk_metadata,
                    "excerpt": chunk_text[:200]  # Preview
                }
            )
        )

    await qdrant_client.upsert(
        collection_name="nextcloud_content",
        points=points
    )

6. Background Sync Worker

# nextcloud_mcp_server/sync/vector_indexer.py
class VectorIndexer:
    """Indexes content into vector database"""

    def __init__(
        self,
        qdrant_client: AsyncQdrantClient,
        embedding_service: EmbeddingService,
        auth_provider: SyncAuthProvider  # From ADR-002
    ):
        self.qdrant = qdrant_client
        self.embeddings = embedding_service
        self.auth = auth_provider

    async def sync_user_notes(self, user_id: str):
        """Sync all notes for a user"""

        # Get authenticated client for user
        client = await self.auth.get_user_client(user_id)

        # Fetch all notes
        notes = await client.notes.list_notes()
        logger.info(f"Syncing {len(notes)} notes for {user_id}")

        # Check which notes need updating
        existing_ids = await self._get_indexed_note_ids(user_id)
        notes_to_update = [
            n for n in notes
            if f"note_{n.id}" not in existing_ids
            or n.modified > existing_ids[f"note_{n.id}"]
        ]

        if not notes_to_update:
            logger.info(f"All notes up-to-date for {user_id}")
            return

        # Generate embeddings in batch
        contents = [f"{n.title}\n\n{n.content}" for n in notes_to_update]
        embeddings = await self.embeddings.embed_batch(contents)

        # Prepare points for upsert
        points = []
        for note, embedding in zip(notes_to_update, embeddings):
            points.append(
                models.PointStruct(
                    id=f"note_{note.id}",
                    vector=embedding,
                    payload={
                        "user_id": user_id,
                        "content_type": "note",
                        "content_id": str(note.id),
                        "note_id": note.id,
                        "title": note.title,
                        "excerpt": note.content[:200],
                        "category": note.category,
                        "created_at": note.created,
                        "modified_at": note.modified,
                        "indexed_at": int(time.time())
                    }
                )
            )

        # Upsert to Qdrant
        await self.qdrant.upsert(
            collection_name="nextcloud_content",
            points=points
        )

        logger.info(f"Indexed {len(points)} notes for {user_id}")

    async def _get_indexed_note_ids(self, user_id: str) -> dict[str, int]:
        """Get map of note_id -> modified_at for indexed notes"""

        # Query Qdrant for existing notes
        scroll_result = await self.qdrant.scroll(
            collection_name="nextcloud_content",
            scroll_filter=models.Filter(
                must=[
                    models.FieldCondition(
                        key="user_id",
                        match=models.MatchValue(value=user_id)
                    ),
                    models.FieldCondition(
                        key="content_type",
                        match=models.MatchValue(value="note")
                    )
                ]
            ),
            with_payload=["content_id", "modified_at"],
            limit=10000
        )

        return {
            point.payload["content_id"]: point.payload["modified_at"]
            for point, _ in scroll_result
        }

    async def delete_note(self, user_id: str, note_id: int):
        """Remove deleted note from index"""

        await self.qdrant.delete(
            collection_name="nextcloud_content",
            points_selector=models.FilterSelector(
                filter=models.Filter(
                    must=[
                        models.FieldCondition(
                            key="user_id",
                            match=models.MatchValue(value=user_id)
                        ),
                        models.FieldCondition(
                            key="note_id",
                            match=models.MatchValue(value=note_id)
                        )
                    ]
                )
            )
        )

7. Configuration

7.1 Environment Variables

# Vector Database
QDRANT_URL=http://qdrant:6333
QDRANT_API_KEY=<secure-api-key>
QDRANT_COLLECTION=nextcloud_content

# Embedding Strategy (choose one)
# Option 1: OpenAI
OPENAI_API_KEY=sk-...
OPENAI_EMBEDDING_MODEL=text-embedding-3-small  # or text-embedding-3-large

# Option 2: Self-hosted
EMBEDDING_SERVICE_URL=http://embeddings:7997
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5

# Option 3: Local (fallback, no config needed)

# Search Configuration
SEMANTIC_SEARCH_ENABLED=true
HYBRID_SEARCH_DEFAULT_SEMANTIC_WEIGHT=0.7
HYBRID_SEARCH_DEFAULT_KEYWORD_WEIGHT=0.3
SEARCH_SCORE_THRESHOLD=0.7

# Sync Configuration
VECTOR_SYNC_INTERVAL=300  # seconds
VECTOR_SYNC_BATCH_SIZE=100

7.2 Docker Compose

services:
  # Vector Database
  qdrant:
    image: qdrant/qdrant:latest
    restart: always
    ports:
      - 127.0.0.1:6333:6333  # REST API
      - 127.0.0.1:6334:6334  # gRPC
    volumes:
      - qdrant_storage:/qdrant/storage
    environment:
      - QDRANT__SERVICE__API_KEY=${QDRANT_API_KEY}
      - QDRANT__SERVICE__HTTP_PORT=6333
      - QDRANT__SERVICE__GRPC_PORT=6334

  # Embedding Service (optional - for self-hosted)
  embeddings:
    image: michaelf34/infinity:latest
    restart: always
    ports:
      - 127.0.0.1:7997:7997
    volumes:
      - embedding_models:/app/.cache
    environment:
      - MODEL_ID=BAAI/bge-small-en-v1.5
      - BATCH_SIZE=32
      - ENGINE=torch  # or optimum for better CPU performance
    # Optional: GPU support
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # MCP Server with vector search
  mcp:
    build: .
    command: ["--transport", "streamable-http"]
    depends_on:
      - app
      - qdrant
      - embeddings  # optional
    environment:
      # ... existing env vars ...
      - SEMANTIC_SEARCH_ENABLED=true
      - QDRANT_URL=http://qdrant:6333
      - QDRANT_API_KEY=${QDRANT_API_KEY}
      # Choose embedding strategy
      - EMBEDDING_SERVICE_URL=http://embeddings:7997
      # OR
      # - OPENAI_API_KEY=${OPENAI_API_KEY}

  # Vector Sync Worker
  mcp-vector-sync:
    build: .
    command: ["python", "-m", "nextcloud_mcp_server.sync.vector_indexer"]
    depends_on:
      - app
      - qdrant
      - embeddings  # optional
    environment:
      # Nextcloud + Auth (from ADR-002)
      - NEXTCLOUD_HOST=http://app:80
      - ENABLE_OFFLINE_ACCESS=true
      - TOKEN_ENCRYPTION_KEY=${TOKEN_ENCRYPTION_KEY}
      # Vector Database
      - QDRANT_URL=http://qdrant:6333
      - QDRANT_API_KEY=${QDRANT_API_KEY}
      # Embeddings
      - EMBEDDING_SERVICE_URL=http://embeddings:7997
    volumes:
      - sync-tokens:/app/data

volumes:
  qdrant_storage:
  embedding_models:
  sync-tokens:

8. Performance Optimization

8.1 Indexing Performance

# Batch embedding generation
async def embed_batch_chunked(
    texts: list[str],
    batch_size: int = 100
) -> list[list[float]]:
    """Generate embeddings in chunks to avoid memory issues"""

    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        batch_embeddings = await embedding_service.embed_batch(batch)
        embeddings.extend(batch_embeddings)
        await asyncio.sleep(0.1)  # Rate limiting

    return embeddings

# Parallel upsert with batching
async def upsert_points_batched(
    points: list[models.PointStruct],
    batch_size: int = 100
):
    """Upsert points in batches"""

    for i in range(0, len(points), batch_size):
        batch = points[i:i + batch_size]
        await qdrant_client.upsert(
            collection_name="nextcloud_content",
            points=batch,
            wait=False  # Don't wait for indexing
        )

8.2 Search Performance

# Search with prefetch for better accuracy
search_results = await qdrant_client.search(
    collection_name="nextcloud_content",
    query_vector=query_embedding,
    query_filter=user_filter,
    limit=20,
    with_payload=True,
    with_vectors=False,  # Don't return vectors (saves bandwidth)
    search_params=models.SearchParams(
        hnsw_ef=128,  # Higher = more accurate but slower
        exact=False   # Use HNSW index
    )
)

8.3 Caching

# Cache embeddings for common queries
from functools import lru_cache

@lru_cache(maxsize=1000)
def cache_key(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

async def embed_with_cache(text: str) -> list[float]:
    """Generate embedding with caching"""

    key = cache_key(text)

    # Check Redis cache
    cached = await redis.get(f"embedding:{key}")
    if cached:
        return json.loads(cached)

    # Generate embedding
    embedding = await embedding_service.embed(text)

    # Cache for 1 hour
    await redis.setex(
        f"embedding:{key}",
        3600,
        json.dumps(embedding)
    )

    return embedding

9. Monitoring and Metrics

# Prometheus metrics
from prometheus_client import Counter, Histogram, Gauge

# Search metrics
semantic_search_count = Counter(
    'semantic_search_total',
    'Total semantic searches',
    ['user_id', 'content_type']
)

semantic_search_latency = Histogram(
    'semantic_search_duration_seconds',
    'Semantic search latency',
    ['phase']  # 'embedding', 'vector_search', 'verification'
)

# Indexing metrics
documents_indexed = Counter(
    'documents_indexed_total',
    'Total documents indexed',
    ['user_id', 'content_type']
)

index_queue_size = Gauge(
    'index_queue_size',
    'Number of documents waiting to be indexed'
)

# Usage
async def semantic_search(query: str, user_id: str):
    semantic_search_count.labels(user_id=user_id, content_type='note').inc()

    with semantic_search_latency.labels(phase='embedding').time():
        embedding = await embed(query)

    with semantic_search_latency.labels(phase='vector_search').time():
        results = await qdrant.search(...)

    with semantic_search_latency.labels(phase='verification').time():
        verified = await verify_access(results)

    return verified

Consequences

Benefits

  1. Semantic Understanding

    • Find content by meaning, not just keywords
    • Support for natural language queries
    • Cross-lingual search potential
    • Better context discovery for LLMs
  2. User Experience

    • More relevant search results
    • Discover related content across apps
    • Fast sub-second query latency
    • Hybrid search combines best of both worlds
  3. Architecture

    • External sidecar (doesn't bloat MCP server)
    • Configurable embedding backend (cloud/self-hosted/local)
    • Multi-tenant with strict isolation
    • Scales horizontally (Qdrant cluster)
  4. Privacy & Security

    • Self-hosted option available
    • Dual-phase authorization enforces permissions
    • Vector DB is cache, not source of truth
    • Per-user audit trail
  5. Developer Experience

    • Simple async Python API
    • Comprehensive monitoring
    • Clear upgrade path (better embeddings, reranking)

Limitations

  1. Complexity

    • Additional infrastructure (Qdrant + embeddings)
    • More monitoring required
    • Embedding generation latency
    • Initial indexing time for large collections
  2. Cost

    • Storage: ~4KB per document (embedding + metadata)
    • Compute: Embedding generation (API costs or GPU)
    • Memory: Qdrant keeps vectors in RAM for speed
  3. Operational

    • Index maintenance and updates
    • Embedding model versioning
    • Handling deleted/moved content
    • Cold start indexing for new users
  4. Search Accuracy

    • Quality depends on embedding model
    • May miss exact keyword matches (mitigated by hybrid search)
    • Cultural/domain-specific terms may not embed well
    • Requires tuning score thresholds

Performance Characteristics

Metric Target Notes
Search latency <200ms Embedding + vector search + verification
Indexing throughput >100 docs/sec With batch embeddings
Memory per 10k docs ~40MB Qdrant vectors + metadata
Disk per 10k docs ~40MB Persistent storage
Search accuracy >90% At score_threshold=0.7

Cost Estimates

Small Deployment (10 users, 1000 notes each):

  • Initial indexing: 10,000 notes × $0.00002 = $0.20 (OpenAI)
  • Monthly searches: 1000 queries × $0.00002 = $0.02
  • Infrastructure: Qdrant (40MB RAM), Embeddings (optional)
  • Total: ~$0.25/month (API) or self-hosted (negligible)

Medium Deployment (100 users, 500 notes each):

  • Initial indexing: 50,000 notes × $0.00002 = $1.00
  • Monthly searches: 10,000 queries × $0.00002 = $0.20
  • Infrastructure: Qdrant (200MB RAM)
  • Total: ~$1.20/month or self-hosted

Self-Hosted (any size):

  • GPU instance: $0.50/hour ($360/month for 24/7)
  • Or CPU-only: negligible cost, slower embeddings

Future Enhancements

  1. Multimodal Search

    • Image embeddings (CLIP)
    • PDF/document layout understanding
    • Audio transcription + embedding
  2. Advanced Ranking

    • Cross-encoder reranking
    • Learning-to-rank models
    • User feedback signals
  3. Query Understanding

    • Query expansion
    • Spell correction
    • Entity extraction
  4. Performance

    • Query result caching
    • Approximate nearest neighbor improvements
    • Quantization for reduced memory
  5. Features

    • Saved searches
    • Search analytics
    • Recommended content

Alternatives Considered

Alternative 1: Elasticsearch/OpenSearch

Approach: Use traditional full-text search engine with vector plugin

Pros:

  • Mature ecosystem
  • Excellent keyword search
  • Rich query DSL

Cons:

  • Heavy infrastructure (JVM-based)
  • Complex setup and tuning
  • Vector search is plugin/add-on (not native)
  • Higher resource usage

Decision: Rejected; Qdrant is purpose-built for vectors

Alternative 2: ChromaDB

Approach: Embedded or client-server vector database

Pros:

  • Simple Python API
  • Easy to get started
  • Good for prototyping

Cons:

  • Sync-only Python client (no async)
  • Limited multi-tenancy features
  • Less mature than Qdrant
  • Scaling concerns

Decision: Rejected; async and multi-tenancy are critical

Alternative 3: Weaviate

Approach: Full-featured vector database with GraphQL

Pros:

  • Very feature-rich
  • Built-in vectorization
  • Good documentation

Cons:

  • More complex architecture
  • Higher resource usage
  • GraphQL adds complexity
  • Overkill for our use case

Decision: Rejected; Qdrant provides better balance

Alternative 4: pgvector (PostgreSQL Extension)

Approach: Add vector search to existing PostgreSQL

Pros:

  • Leverages existing PostgreSQL expertise
  • Transactional consistency
  • Mature database ecosystem

Cons:

  • This deployment uses MariaDB (would need PostgreSQL)
  • Performance not as optimized as purpose-built vector DB
  • Manual hybrid search implementation
  • HNSW index limitations

Decision: Rejected; dedicated vector DB is better fit

Alternative 5: Pinecone / Vertex AI Vector Search

Approach: Managed cloud vector database

Pros:

  • Fully managed
  • Excellent performance
  • No infrastructure management

Cons:

  • Cloud-only (no self-hosting)
  • Recurring costs
  • Vendor lock-in
  • Data leaves premises

Decision: Rejected; self-hosting is important for privacy

Related Decisions

  • ADR-001: Enhanced Note Search (establishes need for better search)
  • ADR-002: Vector Sync Authentication (defines how sync workers authenticate)
  • [Future] ADR-004: Content Extraction and Document Processing
  • [Future] ADR-005: Cross-App Semantic Search

References