Commit 8cc7b6a

feat: hybrid search, boost ranking for keyword matching
1 parent 6f7fb05 commit 8cc7b6a

7 files changed

Lines changed: 1288 additions & 81 deletions

AGENTS.md

Lines changed: 61 additions & 68 deletions
@@ -9,73 +9,63 @@ Code-RAG is a CLI tool that makes codebases searchable using semantic search. It
 ## Architecture Overview
 
 ```
-┌─────────────────┐
-│    CLI / MCP    │  Entry points (command line + AI assistant integration)
-└────────┬────────┘
-
-┌────────▼────────┐
-│ File Processor  │  Discovers files → Chunks code → Yields metadata
-└────────┬────────┘
-
-┌────────▼────────┐
-│    Embedding    │  Converts text chunks → vectors (pluggable models)
-└────────┬────────┘
-
-┌────────▼────────┐
-│    Database     │  Stores vectors + metadata (ChromaDB or Qdrant)
-└─────────────────┘
+┌─────────────┐
+│  CLI / MCP  │  Entry points
+└──────┬──────┘
+
+┌──────▼──────┐
+│     API     │  Orchestration layer (CodeRAGAPI)
+└──────┬──────┘
+
+┌────────────┼────────────┬────────────┐
+│ │ │ │
+┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐
+│Process│ │Search │ │Manage │ │ Embed │
+└───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘
+│ │ │ │
+┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐
+│Chunker│ │Rerank │ │Index │ │Storage│
+└───────┘ └───────┘ └───────┘ └───────┘
 ```
 
-**Key Design**: Plugin architecture. Database and embedding implementations are swappable via configuration.
+**Key Design**: Orchestrated Plugin architecture. The `CodeRAGAPI` centralizes logic, while specialized components handle chunking, indexing, search analysis, and storage.
 
 ## Components
 
-### 1. File Processor
-- **What**: Finds source files, reads them, chunks them
-- **How**: Respects `.gitignore`, uses syntax-aware chunking when possible (falls back to line-based)
-- **Output**: Text chunks with metadata (file path, chunk position)
-
-### 2. Database Layer
-- **Interface**: Abstract base class for vector storage
-- **Implementations**: ChromaDB (default, embedded) or Qdrant (networked)
-- **Operations**: Store embeddings, similarity search, state tracking
-
-### 3. Embedding Layer
-- **Interface**: Abstract base class for text → vector conversion
-- **Models**:
-  - `all-MiniLM-L6-v2` (default, general purpose)
-  - `CodeRankEmbed` (code-optimized)
-  - OpenAI's `text-embedding-3-small`
-- **Pattern**: Stateless, reusable across all operations
-
-### 4. Configuration
-Environment variables control defaults:
-- `CODE_RAG_DATABASE_TYPE`: "chroma" or "qdrant"
-- `CODE_RAG_EMBEDDING_MODEL`: Model name
-- `CODE_RAG_CHUNK_SIZE`: Characters per chunk
-- See `src/code_rag/config/config.py` for full list
-
-### 5. Entry Points
-- **CLI** (`code-rag-cli`): Interactive query session
-- **MCP Server** (`code-rag` or `code-rag-mcp`): Exposes search to AI assistants (Claude, etc.)
-- **Embedding Server** (`code-rag-server`): Shared model server for multiple MCP instances
-
-### 6. Shared Embedding Server
-When running multiple MCP instances (e.g., multiple VS Code windows), each would normally load its own transformer model (~300MB+ RAM each). The **shared embedding server** solves this:
-
-- **Auto-spawns** on first client request if not running
-- **Auto-terminates** when no clients remain (after idle timeout)
-- Uses **heartbeat** mechanism for client lifecycle tracking
-- **Lock file** prevents duplicate server instances
-
-Configuration (via environment):
-- `CODE_RAG_SHARED_SERVER=true` (enabled by default)
-- `CODE_RAG_SHARED_SERVER_PORT=8199`
-
-Files:
-- `src/code_rag/embedding_server.py` - FastAPI server
-- `src/code_rag/embeddings/http_embedding.py` - HTTP client for embedding
-- `src/code_rag/reranker/http_reranker.py` - HTTP client for reranking
+### 1. API Layer (`src/code_rag/api.py`)
+- **What**: The central hub for all Code-RAG operations.
+- **How**: Integrates embedding, database, reranking, and indexing logic. Used by both CLI and MCP.
+- **Features**: Session tracking, auto-generated collection names, and unified indexing flow.
+
+### 2. File Processor & Chunker
+- **What**: Discovers source files and breaks them into logical chunks.
+- **How**: Uses `SyntaxChunker` (tree-sitter based) for code-aware splitting, falling back to line-based.
+- **Output**: Text chunks with rich metadata (file path, line numbers, symbol names).
+
+### 3. Metadata Index (`src/code_rag/index/metadata_index.py`)
+- **What**: Tracks state of indexed files for incremental updates.
+- **How**: Stores `mtime`, `size`, and `sha256` hashes.
+- **Benefit**: Only re-indexes modified files, significantly speeding up subsequent runs.
+
+### 4. Hybrid Search & Query Analyzer
+- **What**: Improves search relevance by combining vector search with exact identifier matching.
+- **How**: `QueryAnalyzer` detects code identifiers (CamelCase, snake_case) in queries and boosts results containing those identifiers.
+
+### 5. Semantic Reranker (`src/code_rag/reranker/`)
+- **What**: Refines search results using Cross-Encoder models.
+- **How**: Re-scores top-K candidates from vector search for higher precision.
+- **Models**: Defaults to `mixedbread-ai/mxbai-rerank-xsmall-v1` or `cross-encoder/ms-marco-MiniLM-L-6-v2`.
+
+### 6. Embedding & Database Layer
+- **Embeddings**: Swappable backends (SentenceTransformers, OpenAI, or Shared HTTP).
+- **Databases**: ChromaDB (default) or Qdrant.
+- **Features**: Automatic dimension mismatch handling and model idle timeouts.
+
+### 7. Shared Embedding & Reranking Server
+- **What**: FastAPI-based server that hosts both embedding and reranker models.
+- **Why**: Prevents multiple MCP instances from each loading ~500MB+ of models into RAM.
+- **Management**: Auto-spawns on first request, auto-terminates after idle timeout, uses heartbeats.
+- **Files**: `src/code_rag/embedding_server.py`, `http_embedding.py`, `http_reranker.py`.
 
 ## Quick Start
 
@@ -99,16 +89,19 @@ code-rag-cli --reindex
 ## Common Tasks
 
 **Add a new embedding model?**
-→ Extend `EmbeddingInterface`, add to initialization in CLI
+→ Extend `EmbeddingInterface`, add to `CodeRAGAPI._create_embedding_model`
 
-**Add a new database backend?**
-→ Extend `DatabaseInterface`, add to initialization in CLI
+**Add a new reranker?**
+→ Extend `RerankerInterface`, update `CodeRAGAPI.__init__`
 
-**Change chunk size or ignore patterns?**
-→ Modify configuration or file processor settings
+**Adjust reindexing behavior?**
+→ Modify `CODE_RAG_REINDEX_DEBOUNCE_MINUTES` or `CODE_RAG_VERIFY_CHANGES_WITH_HASH` in config.
+
+**Change identifier boosting?**
+→ Update `QueryAnalyzer.get_boost_score` in `src/code_rag/search/query_analyzer.py`
 
 **Add support for new languages?**
-→ Extend `SyntaxChunker.LANGUAGE_PACKAGES` with tree-sitter binding
+→ Extend `SyntaxChunker.LANGUAGE_PACKAGES` with tree-sitter bindings.
 
 **Add new MCP tools?**
 → Update `list_tools()` and `call_tool()` in `src/code_rag/mcp_server.py`
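
Review note on the Metadata Index entry above: the new AGENTS.md text says incremental updates rely on stored `mtime`, `size`, and `sha256` values, but `metadata_index.py` itself is not part of the visible diff. The following is a minimal sketch of how such a check could work, not the project's implementation. `FileState`, `snapshot`, `has_changed`, and the `verify_with_hash` flag are hypothetical names, and mapping that flag to `CODE_RAG_VERIFY_CHANGES_WITH_HASH` is an assumption.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileState:
    """Hypothetical record of a file's state at index time."""

    mtime: float
    size: int
    sha256: str


def file_sha256(path: str) -> str:
    """Hash the file contents in fixed-size blocks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def snapshot(path: str) -> FileState:
    """Capture the three values the AGENTS.md text says are stored."""
    stat = os.stat(path)
    return FileState(stat.st_mtime, stat.st_size, file_sha256(path))


def has_changed(
    path: str, previous: Optional[FileState], verify_with_hash: bool = True
) -> bool:
    """Decide whether a file needs re-indexing (assumed policy, see lead-in)."""
    if previous is None:
        return True  # never indexed before
    stat = os.stat(path)
    if stat.st_size != previous.size:
        return True  # a size change always means new content
    if stat.st_mtime == previous.mtime:
        return False  # cheap path: same size and same mtime
    # Same size, different mtime: either a touch or a same-length edit.
    if verify_with_hash:
        return file_sha256(path) != previous.sha256
    return True
```

With hash verification on, a file that was only touched (new `mtime`, identical bytes) is skipped. With it off, any `mtime` change triggers a re-index, which is cheaper per check but can redo work.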

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "code-rag-mcp"
-version = "0.1.2"
+version = "0.2.0"
 description = "MCP server for efficient code search"
 readme = "README.md"
 requires-python = ">=3.10, <3.14"

src/code_rag/api.py

Lines changed: 43 additions & 12 deletions
@@ -21,6 +21,7 @@
 from .processor.file_processor import FileProcessor
 from .reranker.cross_encoder_reranker import CrossEncoderReranker
 from .reranker.reranker_interface import RerankerInterface
+from .search.query_analyzer import QueryAnalyzer
 
 
 def generate_collection_name(codebase_path: str) -> str:
@@ -607,20 +608,23 @@ def search(
                     query, documents, top_k=n_results
                 )
 
-                # Reorder results based on reranking
-                reranked_docs = []
-                reranked_metadata = []
-                reranked_scores = []
+                # Only update results if reranker returned valid results
+                # If reranker returns empty list (e.g., server down), fall back to original
+                if reranked_indices:
+                    # Reorder results based on reranking
+                    reranked_docs = []
+                    reranked_metadata = []
+                    reranked_scores = []
 
-                for orig_idx, rerank_score in reranked_indices:
-                    reranked_docs.append(results["documents"][0][orig_idx])
-                    reranked_metadata.append(results["metadatas"][0][orig_idx])
-                    reranked_scores.append(rerank_score)
+                    for orig_idx, rerank_score in reranked_indices:
+                        reranked_docs.append(results["documents"][0][orig_idx])
+                        reranked_metadata.append(results["metadatas"][0][orig_idx])
+                        reranked_scores.append(rerank_score)
 
-                # Update results with reranked data
-                results["documents"][0] = reranked_docs
-                results["metadatas"][0] = reranked_metadata
-                results["distances"][0] = reranked_scores
+                    # Update results with reranked data
+                    results["documents"][0] = reranked_docs
+                    results["metadatas"][0] = reranked_metadata
+                    results["distances"][0] = reranked_scores
 
             except Exception:
                 # Fall back to original results if reranking fails
@@ -653,6 +657,33 @@ def search(
             }
             formatted_results.append(result)
 
+        # Apply identifier-based boosting (Phase 1: Hybrid Search)
+        # Analyze query for code identifiers and boost exact matches
+        query_analyzer = QueryAnalyzer(query)
+        if query_analyzer.has_identifiers() and formatted_results:
+            for result in formatted_results:
+                # Calculate boost multiplier based on identifier matches
+                boost = query_analyzer.get_boost_score(result["content"])
+
+                # Store original similarity for transparency
+                result["original_similarity"] = result["similarity"]
+
+                # Apply boost to similarity score
+                result["similarity"] = result["similarity"] * boost
+
+                # Track if this result was boosted
+                result["boosted"] = boost > 1.0
+
+            # Re-sort by boosted similarity (descending)
+            formatted_results.sort(key=lambda x: x["similarity"], reverse=True)
+
+            # Limit to requested number of results after boosting
+            formatted_results = formatted_results[:n_results]
+        else:
+            # No identifiers detected - mark all results as not boosted for consistency
+            for result in formatted_results:
+                result["boosted"] = False
+
         # Expand context if requested
         if expand_context and formatted_results:
             formatted_results = self._expand_context(formatted_results)
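
Review note on the reranker change in `search()`: the new `if reranked_indices:` guard keeps the original vector-search ordering when the reranker returns an empty list. The sketch below isolates that behavior so it can be checked on its own. `apply_rerank` is a hypothetical helper, not a function in `api.py`. The shape of `results` (nested lists under `documents`, `metadatas`, `distances`) and the `(original_index, score)` pairs follow the diff.

```python
from typing import Any, Dict, List, Sequence, Tuple


def apply_rerank(
    results: Dict[str, List[List[Any]]],
    reranked_indices: Sequence[Tuple[int, float]],
) -> Dict[str, List[List[Any]]]:
    """Reorder the first query's results, or leave them untouched.

    An empty `reranked_indices` (for example, reranking server down) is
    treated as "no opinion", so the original ordering survives.
    """
    if not reranked_indices:
        return results

    documents = results["documents"][0]
    metadatas = results["metadatas"][0]

    results["documents"][0] = [documents[i] for i, _ in reranked_indices]
    results["metadatas"][0] = [metadatas[i] for i, _ in reranked_indices]
    # As in the diff, the rerank scores replace the stored distances.
    results["distances"][0] = [score for _, score in reranked_indices]
    return results


if __name__ == "__main__":
    sample = {
        "documents": [["def a(): ...", "def b(): ..."]],
        "metadatas": [[{"file": "a.py"}, {"file": "b.py"}]],
        "distances": [[0.30, 0.45]],
    }
    print(apply_rerank(sample, [(1, 0.92), (0, 0.10)])["documents"][0])
```

Without the guard, an empty reranker response would overwrite all three lists with empty lists and the search would return nothing. After a successful rerank, `distances` holds rerank scores rather than vector distances, so downstream code that converts distances to similarities may need to account for that.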

src/code_rag/search/__init__.py

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+"""Search utilities for code-rag."""
+
+from .query_analyzer import QueryAnalyzer
+
+__all__ = ["QueryAnalyzer"]
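
Review note on `QueryAnalyzer`: the package exports it and `api.py` calls `has_identifiers()` and `get_boost_score()`, but `query_analyzer.py` is not among the files shown here. To make the boosting flow concrete, here is a stand-in sketch under stated assumptions. The regular expressions, the `per_match_boost` and `max_boost` parameters, and the `boost_results` helper are all hypothetical. Only the two method names and the result keys (`content`, `similarity`, `original_similarity`, `boosted`) come from the diff.

```python
import re
from typing import Any, Dict, List

# Hypothetical detection rules; the real ones live in query_analyzer.py.
_CAMEL_CASE = re.compile(r"\b[A-Za-z][a-z0-9]+(?:[A-Z][a-z0-9]*)+\b")
_SNAKE_CASE = re.compile(r"\b[a-z][a-z0-9]*(?:_[a-z0-9]+)+\b")


class QueryAnalyzer:
    """Stand-in analyzer: finds identifiers and scores exact matches."""

    def __init__(
        self, query: str, per_match_boost: float = 0.25, max_boost: float = 2.0
    ) -> None:
        found = _CAMEL_CASE.findall(query) + _SNAKE_CASE.findall(query)
        # dict.fromkeys drops duplicates while preserving first-seen order.
        self.identifiers: List[str] = list(dict.fromkeys(found))
        self.per_match_boost = per_match_boost
        self.max_boost = max_boost

    def has_identifiers(self) -> bool:
        return bool(self.identifiers)

    def get_boost_score(self, content: str) -> float:
        """Return a multiplier >= 1.0; 1.0 means no identifier matched."""
        matches = sum(
            1
            for ident in self.identifiers
            if re.search(rf"(?<!\w){re.escape(ident)}(?!\w)", content)
        )
        return min(1.0 + self.per_match_boost * matches, self.max_boost)


def boost_results(
    query: str, results: List[Dict[str, Any]], n_results: int
) -> List[Dict[str, Any]]:
    """Mirror the boosting branch added to search() in this commit."""
    analyzer = QueryAnalyzer(query)
    if not (analyzer.has_identifiers() and results):
        for result in results:
            result["boosted"] = False
        return results

    for result in results:
        boost = analyzer.get_boost_score(result["content"])
        result["original_similarity"] = result["similarity"]
        result["similarity"] = result["similarity"] * boost
        result["boosted"] = boost > 1.0

    results.sort(key=lambda item: item["similarity"], reverse=True)
    return results[:n_results]
```

Because the boost is a multiplier, a boosted `similarity` can exceed the range of the original score. It is best read as a ranking key, with `original_similarity` kept for display. Whole-identifier matching (the lookarounds in `get_boost_score`) is one possible design choice. A substring match would also boost `QueryAnalyzerTest` for a query about `QueryAnalyzer`.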
