Commit be89c84

feat: implement Phases 33-37 — Search Dominance strategy
Phase 33 — Search output & filtering:
- search: --scores, --snippet-length, --no-snippet, --exclude, --no-ignore
- grep: --jsonl, --exclude, --no-ignore, -L (files-without-match)
- formatter: JSONL streaming with optional score prefix

Phase 34 — Chunk-level incremental indexing:
- index: --add <file>, --inspect <file> single-file operations
- Model-consistency guard (manifest vs config embedding model)
- Ctrl+C signal handler for partial-save safety

Phase 35 — Tantivy full-text search engine (Rust):
- TantivyIndex PyO3 class: add_chunks, search, remove_file, clear, num_docs
- cfg-gated behind tantivy-backend Cargo feature
- Python bridge with use_tantivy() feature detection

Phase 36 — MCP Server v2:
- Cursor-based pagination for semantic/keyword/hybrid search
- codexa --serve shorthand flag
- codexa mcp --claude-config for Claude Desktop auto-configuration
- claude_config.py helper module

Phase 37 — Grep parity & distribution:
- --hybrid and --sem search shorthands
- .codexaignore auto-creation on first index
- codexa.spec PyInstaller single-binary config

Docs & README:
- PyPI version and download badges on README
- Updated feature table with all new capabilities
- docs/index.md updated to v0.5.0 with new feature descriptions

All 2596 tests passing.
1 parent 70ada27 commit be89c84

16 files changed

Lines changed: 695 additions & 25 deletions

README.md

Lines changed: 6 additions & 4 deletions
@@ -5,6 +5,8 @@
 
 <p align="center">
 <a href="https://github.com/M9nx/CodexA/actions"><img src="https://github.com/M9nx/CodexA/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
+<a href="https://pypi.org/project/codexa/"><img src="https://img.shields.io/pypi/v/codexa?color=blue&label=PyPI" alt="PyPI"></a>
+<a href="https://pepy.tech/project/codexa"><img src="https://pepy.tech/badge/codexa" alt="Downloads"></a>
 <img src="https://img.shields.io/badge/python-3.11%2B-blue" alt="Python 3.11+">
 <img src="https://img.shields.io/badge/version-0.5.0-green" alt="Version">
 <img src="https://img.shields.io/badge/tests-2596-brightgreen" alt="Tests">
@@ -24,13 +26,13 @@ structured tool protocol that any AI agent can call over HTTP or CLI.
 
 | Area | What you get |
 |------|-------------|
-| **Code Indexing** | Scan repos, extract functions/classes, generate vector embeddings (sentence-transformers + FAISS), ONNX runtime option, parallel indexing, `--watch` live re-indexing, `.codexaignore` support |
-| **Rust Search Engine** | Native `codexa-core` Rust crate via PyO3 — HNSW approximate nearest-neighbour search, BM25 keyword index, tree-sitter AST chunker (10 languages), memory-mapped vector persistence, parallel file scanner, optional ONNX embedding inference |
-| **Multi-Mode Search** | Semantic, keyword (BM25), regex, hybrid (RRF), and raw filesystem grep (ripgrep backend) with full `-A/-B/-C/-w/-v/-c` flags |
+| **Code Indexing** | Scan repos, extract functions/classes, generate vector embeddings (sentence-transformers + FAISS), ONNX runtime option, parallel indexing, `--watch` live re-indexing, `.codexaignore` support, `--add`/`--inspect` per-file control, model-consistency guard, Ctrl+C partial-save |
+| **Rust Search Engine** | Native `codexa-core` Rust crate via PyO3 — HNSW approximate nearest-neighbour search, BM25 keyword index, tree-sitter AST chunker (10 languages), memory-mapped vector persistence, parallel file scanner, optional ONNX embedding inference, optional Tantivy full-text search |
+| **Multi-Mode Search** | Semantic, keyword (BM25), regex, hybrid (RRF), and raw filesystem grep (ripgrep backend) with full `-A/-B/-C/-w/-v/-c/-l/-L/--exclude/--no-ignore` flags, `--hybrid`/`--sem` shorthands, `--scores`, `--snippet-length`, `--no-snippet`, JSONL streaming |
 | **RAG Pipeline** | 4-stage Retrieval-Augmented Generation — Retrieve → Deduplicate → Re-rank → Assemble with token budget, cross-encoder re-ranking, source citations |
 | **Code Context** | Rich context windows — imports, dependencies, AST-based call graphs, surrounding code |
 | **Repository Analysis** | Language breakdown (`codexa languages`), module summaries, component detection |
-| **AI Agent Protocol** | 13 built-in tools exposed via HTTP bridge, MCP server (13 tools), MCP-over-SSE (`--mcp`), or CLI for any AI agent to invoke |
+| **AI Agent Protocol** | 13 built-in tools exposed via HTTP bridge, MCP server (13 tools with pagination/cursors), MCP-over-SSE (`--mcp`), `codexa --serve` shorthand, Claude Desktop auto-config (`--claude-config`), or CLI for any AI agent to invoke |
 | **Quality & Metrics** | Complexity analysis, maintainability scoring, quality gates for CI |
 | **Multi-Repo Workspaces** | Link multiple repos under one workspace for cross-repo search & refactoring |
 | **Interactive TUI** | Terminal REPL with mode switching for interactive exploration |
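
Note on "pagination/cursors": the MCP server code is not part of the files shown here, so the following is only a minimal sketch of how cursor-based pagination over a list of search hits could work. The function names, the `next_cursor` key, and the base64-encoded offset are all assumptions made for illustration, not the project's actual wire format.

```python
import base64
import json
from typing import Optional, Sequence


def encode_cursor(offset: int) -> str:
    """Pack an offset into an opaque, URL-safe cursor string (hypothetical format)."""
    payload = json.dumps({"offset": offset}).encode("utf-8")
    return base64.urlsafe_b64encode(payload).decode("ascii")


def decode_cursor(cursor: Optional[str]) -> int:
    """Return the offset stored in a cursor; a missing cursor means 'start at 0'."""
    if not cursor:
        return 0
    payload = base64.urlsafe_b64decode(cursor.encode("ascii"))
    return int(json.loads(payload)["offset"])


def paginate(results: Sequence[dict], cursor: Optional[str] = None, page_size: int = 10) -> dict:
    """Return one page of results plus the cursor for the next page (None when exhausted)."""
    start = decode_cursor(cursor)
    page = list(results[start:start + page_size])
    next_offset = start + len(page)
    next_cursor = encode_cursor(next_offset) if next_offset < len(results) else None
    return {"items": page, "next_cursor": next_cursor}
```

A client would call `paginate(hits)` first and keep passing the returned `next_cursor` back until it is `None`. Keeping the cursor opaque lets the server change its internal encoding later without breaking callers.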

codexa-core/Cargo.toml

Lines changed: 2 additions & 0 deletions
@@ -34,6 +34,7 @@ tree-sitter-cpp = "0.23.4"
 tree-sitter-ruby = "0.23.1"
 ort = { version = "2.0.0-rc.12", features = ["download-binaries"], optional = true }
 ndarray = { version = "0.17.2", optional = true }
+tantivy = { version = "0.22", optional = true }
 
 [profile.release]
 opt-level = 3
@@ -43,3 +44,4 @@ codegen-units = 1
 [features]
 default = []
 onnx = ["dep:ort", "dep:ndarray"]
+tantivy-backend = ["dep:tantivy"]

codexa-core/src/lib.rs

Lines changed: 6 additions & 0 deletions
@@ -14,6 +14,8 @@ mod embed;
 mod hnsw;
 mod hybrid;
 mod scan;
+#[cfg(feature = "tantivy-backend")]
+mod tantivy_search;
 
 /// The top-level Python module exposed via PyO3.
 #[pymodule]
@@ -45,5 +47,9 @@ fn codexa_core(m: &Bound<'_, PyModule>) -> PyResult<()> {
     #[cfg(feature = "onnx")]
     m.add_class::<embed::OnnxEmbedder>()?;
 
+    // Tantivy full-text search (only when compiled with --features tantivy-backend)
+    #[cfg(feature = "tantivy-backend")]
+    m.add_class::<tantivy_search::TantivyIndex>()?;
+
     Ok(())
 }
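
Because `TantivyIndex` is only registered when the crate is built with the `tantivy-backend` feature, the Python side has to probe for it at runtime. The commit message mentions a `use_tantivy()` helper for this, but its implementation is not shown in this diff. The sketch below is one plausible shape for such a check, assuming the extension is importable as `codexa_core` (the `#[pymodule]` name above); the `module_name` parameter is added here purely so the function can be exercised without the extension installed.

```python
import importlib


def use_tantivy(module_name: str = "codexa_core") -> bool:
    """Return True only if the native module imports and exposes TantivyIndex.

    Hypothetical sketch: the real helper in the repository may differ.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Extension not installed at all: fall back to the pure-Python path.
        return False
    # Extension present but built without the Cargo feature: class is absent.
    return hasattr(module, "TantivyIndex")
```

Callers can then branch once at startup, for example choosing the Tantivy backend when `use_tantivy()` is true and the existing BM25 index otherwise.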

codexa-core/src/tantivy_search.rs

Lines changed: 195 additions & 0 deletions
@@ -0,0 +1,195 @@
+//! Tantivy full-text search backend — optional feature.
+//!
+//! Provides a PyO3-exposed `TantivyIndex` class wrapping a Tantivy
+//! index for BM25-quality full-text search with sub-100ms query latency.
+//! Documents are code chunks (file_path, content, language, line range).
+
+#[cfg(feature = "tantivy-backend")]
+use pyo3::prelude::*;
+
+#[cfg(feature = "tantivy-backend")]
+use tantivy::{
+    collector::TopDocs,
+    doc,
+    query::QueryParser,
+    schema::{Field, Schema, STORED, TEXT},
+    Index, IndexReader, IndexWriter, ReloadPolicy,
+};
+
+#[cfg(feature = "tantivy-backend")]
+use std::path::PathBuf;
+
+/// A Tantivy-backed full-text search index for code chunks.
+///
+/// Wraps Tantivy's inverted index for BM25-quality full-text search.
+/// Created via `TantivyIndex(directory)` — the index is disk-persistent.
+#[cfg(feature = "tantivy-backend")]
+#[pyclass]
+pub struct TantivyIndex {
+    index: Index,
+    reader: IndexReader,
+    f_file_path: Field,
+    f_content: Field,
+    f_language: Field,
+    f_start_line: Field,
+    f_end_line: Field,
+    f_chunk_index: Field,
+    schema: Schema,
+    index_dir: PathBuf,
+}
+
+#[cfg(feature = "tantivy-backend")]
+#[pymethods]
+impl TantivyIndex {
+    /// Create or open a Tantivy index at the given directory.
+    #[new]
+    fn new(directory: String) -> PyResult<Self> {
+        let dir = PathBuf::from(&directory);
+        std::fs::create_dir_all(&dir).map_err(|e| {
+            pyo3::exceptions::PyIOError::new_err(format!("Cannot create index dir: {e}"))
+        })?;
+
+        let mut schema_builder = Schema::builder();
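+        // Indexed as one raw token (STRING) so that `remove_file` can match it with
+        // `delete_term`; a STORED-only field is not in the inverted index and deletes nothing.
+        let f_file_path = schema_builder.add_text_field("file_path", tantivy::schema::STRING | STORED);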
+        let f_content = schema_builder.add_text_field("content", TEXT | STORED);
+        let f_language = schema_builder.add_text_field("language", STORED);
+        let f_start_line = schema_builder.add_text_field("start_line", STORED);
+        let f_end_line = schema_builder.add_text_field("end_line", STORED);
+        let f_chunk_index = schema_builder.add_text_field("chunk_index", STORED);
+        let schema = schema_builder.build();
+
+        let mmap_dir =
+            tantivy::directory::MmapDirectory::open(&dir).map_err(|e| {
+                pyo3::exceptions::PyIOError::new_err(format!("Tantivy dir error: {e}"))
+            })?;
+
+        let index = Index::open_or_create(mmap_dir, schema.clone()).map_err(|e| {
+            pyo3::exceptions::PyRuntimeError::new_err(format!("Tantivy index error: {e}"))
+        })?;
+
+        let reader = index
+            .reader_builder()
+            .reload_policy(ReloadPolicy::OnCommitWithDelay)
+            .try_into()
+            .map_err(|e| {
+                pyo3::exceptions::PyRuntimeError::new_err(format!("Reader error: {e}"))
+            })?;
+
+        Ok(Self {
+            index,
+            reader,
+            f_file_path,
+            f_content,
+            f_language,
+            f_start_line,
+            f_end_line,
+            f_chunk_index,
+            schema,
+            index_dir: dir,
+        })
+    }
+
+    /// Add a batch of code chunks to the index.
+    ///
+    /// Each chunk is a tuple: (file_path, content, language, start_line, end_line, chunk_index)
+    fn add_chunks(&self, chunks: Vec<(String, String, String, usize, usize, usize)>) -> PyResult<u64> {
+        let mut writer: IndexWriter = self.index.writer(50_000_000).map_err(|e| {
+            pyo3::exceptions::PyRuntimeError::new_err(format!("Writer error: {e}"))
+        })?;
+
+        let mut count = 0u64;
+        for (fp, content, lang, sl, el, ci) in chunks {
+            writer.add_document(doc!(
+                self.f_file_path => fp,
+                self.f_content => content,
+                self.f_language => lang,
+                self.f_start_line => sl.to_string(),
+                self.f_end_line => el.to_string(),
+                self.f_chunk_index => ci.to_string(),
+            )).map_err(|e| {
+                pyo3::exceptions::PyRuntimeError::new_err(format!("Add doc error: {e}"))
+            })?;
+            count += 1;
+        }
+
+        writer.commit().map_err(|e| {
+            pyo3::exceptions::PyRuntimeError::new_err(format!("Commit error: {e}"))
+        })?;
+
+        Ok(count)
+    }
+
+    /// Search the index for a query string, returning up to `top_k` results.
+    ///
+    /// Returns a list of (file_path, content, language, start_line, end_line, chunk_index, score).
+    fn search(&self, query: &str, top_k: usize) -> PyResult<Vec<(String, String, String, usize, usize, usize, f32)>> {
+        let searcher = self.reader.searcher();
+        let query_parser = QueryParser::for_index(&self.index, vec![self.f_content]);
+        let parsed = query_parser.parse_query(query).map_err(|e| {
+            pyo3::exceptions::PyValueError::new_err(format!("Query parse error: {e}"))
+        })?;
+
+        let top_docs = searcher.search(&parsed, &TopDocs::with_limit(top_k)).map_err(|e| {
+            pyo3::exceptions::PyRuntimeError::new_err(format!("Search error: {e}"))
+        })?;
+
+        let mut results = Vec::with_capacity(top_docs.len());
+        for (score, doc_address) in top_docs {
+            let doc = searcher.doc::<tantivy::TantivyDocument>(doc_address).map_err(|e| {
+                pyo3::exceptions::PyRuntimeError::new_err(format!("Doc fetch error: {e}"))
+            })?;
+
+            let get_text = |field: Field| -> String {
+                doc.get_first(field)
+                    .and_then(|v| v.as_str())
+                    .unwrap_or("")
+                    .to_string()
+            };
+
+            let file_path = get_text(self.f_file_path);
+            let content = get_text(self.f_content);
+            let language = get_text(self.f_language);
+            let start_line: usize = get_text(self.f_start_line).parse().unwrap_or(0);
+            let end_line: usize = get_text(self.f_end_line).parse().unwrap_or(0);
+            let chunk_index: usize = get_text(self.f_chunk_index).parse().unwrap_or(0);
+
+            results.push((file_path, content, language, start_line, end_line, chunk_index, score));
+        }
+
+        Ok(results)
+    }
+
+    /// Remove all documents for a given file path.
+    fn remove_file(&self, file_path: &str) -> PyResult<u64> {
+        let mut writer: IndexWriter = self.index.writer(50_000_000).map_err(|e| {
+            pyo3::exceptions::PyRuntimeError::new_err(format!("Writer error: {e}"))
+        })?;
+
+        let term = tantivy::Term::from_field_text(self.f_file_path, file_path);
+        writer.delete_term(term);
+        writer.commit().map_err(|e| {
+            pyo3::exceptions::PyRuntimeError::new_err(format!("Commit error: {e}"))
+        })?;
+
+        Ok(0) // Tantivy doesn't easily report deleted count
+    }
+
+    /// Clear the entire index.
+    fn clear(&self) -> PyResult<()> {
+        let mut writer: IndexWriter = self.index.writer(50_000_000).map_err(|e| {
+            pyo3::exceptions::PyRuntimeError::new_err(format!("Writer error: {e}"))
+        })?;
+        writer.delete_all_documents().map_err(|e| {
+            pyo3::exceptions::PyRuntimeError::new_err(format!("Clear error: {e}"))
+        })?;
+        writer.commit().map_err(|e| {
+            pyo3::exceptions::PyRuntimeError::new_err(format!("Commit error: {e}"))
+        })?;
+        Ok(())
+    }
+
+    /// Return the number of documents in the index.
+    fn num_docs(&self) -> u64 {
+        self.reader.searcher().num_docs()
+    }
+}
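
`search` returns plain 7-tuples to Python in the order `(file_path, content, language, start_line, end_line, chunk_index, score)`. Positional tuples are easy to misread on the calling side, so a thin wrapper that names the fields can help. The sketch below is an illustration only: `ChunkHit` and `to_hits` are hypothetical names that do not appear in this commit, and the field order is taken from the doc comment on `search` above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Shape of one row returned by TantivyIndex.search, per its doc comment.
RawHit = Tuple[str, str, str, int, int, int, float]


@dataclass(frozen=True)
class ChunkHit:
    """Named view over one search result tuple (hypothetical helper)."""
    file_path: str
    content: str
    language: str
    start_line: int
    end_line: int
    chunk_index: int
    score: float


def to_hits(raw: Iterable[RawHit]) -> List[ChunkHit]:
    """Convert raw result tuples into ChunkHit objects, preserving order."""
    return [ChunkHit(*row) for row in raw]
```

With the extension built, usage would look roughly like `to_hits(index.search("parse_query", 5))`, where `index` is a `TantivyIndex` instance.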

codexa.spec

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+# -*- mode: python ; coding: utf-8 -*-
+"""PyInstaller spec for building a single-binary CodexA distribution."""
+
+import sys
+from pathlib import Path
+
+block_cipher = None
+
+a = Analysis(
+    ['semantic_code_intelligence/cli/main.py'],
+    pathex=['.'],
+    binaries=[],
+    datas=[
+        ('semantic_code_intelligence', 'semantic_code_intelligence'),
+    ],
+    hiddenimports=[
+        'click',
+        'rich',
+        'pydantic',
+        'semantic_code_intelligence.cli.router',
+        'semantic_code_intelligence.cli.main',
+    ],
+    hookspath=[],
+    hooksconfig={},
+    runtime_hooks=[],
+    excludes=[],
+    win_no_prefer_redirects=False,
+    win_private_assemblies=False,
+    cipher=block_cipher,
+    noarchive=False,
+)
+
+pyz = PYZ(a.pure, a.zipped_data, cipher=block_cipher)
+
+exe = EXE(
+    pyz,
+    a.scripts,
+    a.binaries,
+    a.zipfiles,
+    a.datas,
+    [],
+    name='codexa',
+    debug=False,
+    bootloader_ignore_signals=False,
+    strip=False,
+    upx=True,
+    upx_exclude=[],
+    runtime_tmpdir=None,
+    console=True,
+    disable_windowed_traceback=False,
+    argv_emulation=False,
+    target_arch=None,
+    codesign_identity=None,
+    entitlements_file=None,
+)

docs/index.md

Lines changed: 7 additions & 5 deletions
@@ -17,11 +17,11 @@ features:
   - icon:
       svg: '<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><circle cx="11" cy="11" r="8"/><path d="m21 21-4.3-4.3"/></svg>'
     title: Semantic Search
-    details: Natural-language code search powered by sentence-transformers and FAISS. Find code by meaning, not just keywords — queries are embedded into vectors and matched against your entire codebase.
+    details: Natural-language code search powered by sentence-transformers, FAISS, and optional Tantivy full-text engine. Multi-mode — semantic, BM25, regex, hybrid (RRF), grep. JSONL streaming, --scores, --snippet-length, --no-snippet, --hybrid/--sem shorthands, pagination cursors.
   - icon:
       svg: '<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M12 8V4H8"/><rect width="16" height="12" x="4" y="8" rx="2"/><path d="M2 14h2"/><path d="M20 14h2"/><path d="M15 13v2"/><path d="M9 13v2"/></svg>'
     title: AI Agent Protocol
-    details: 13 structured tools invocable via CLI, HTTP bridge, MCP, or Python API. Designed for autonomous AI coding agents with structured JSON input/output.
+    details: 13 structured tools invocable via CLI, HTTP bridge, or MCP server with cursor-based pagination. codexa --serve shorthand, Claude Desktop auto-config (--claude-config), SSE streaming, and full Cursor/Windsurf compatibility.
   - icon:
       svg: '<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M17 10.5V7c0-1.38-1.12-2.5-2.5-2.5S12 5.62 12 7v3.5"/><path d="M7 10.5V7c0-1.38 1.12-2.5 2.5-2.5"/><path d="m2 19 5-5"/><path d="m7 19 5-5"/><path d="m12 19 5-5"/><path d="m17 19 5-5"/></svg>'
     title: Multi-Language Parsing
@@ -41,11 +41,11 @@ features:
   - icon:
       svg: '<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><polyline points="4 17 10 11 4 5"/><line x1="12" x2="20" y1="19" y2="19"/></svg>'
     title: 39 CLI Commands
-    details: Comprehensive Click-based CLI with --json, --pipe, and --verbose flags. Every command returns structured output suitable for scripting and automation. Includes grep, benchmark, languages, and raw filesystem search.
+    details: Comprehensive Click-based CLI with --json, --pipe, --jsonl, and --verbose flags. Every command returns structured output suitable for scripting and automation. Includes grep with --exclude/--no-ignore/-L, benchmark, languages, and raw filesystem search. Single-binary distribution via PyInstaller.
   - icon:
       svg: '<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><rect width="18" height="18" x="3" y="3" rx="2"/><path d="M3 9h18"/><path d="M9 21V9"/></svg>'
     title: Multiple Interfaces
-    details: CLI, Web UI, REST API, Bridge Server, MCP Server, LSP Server, and interactive TUI — all built on the same tool protocol for consistent behavior everywhere.
+    details: CLI, Web UI, REST API, Bridge Server, MCP Server (with cursor-based pagination), LSP Server, and interactive TUI — all built on the same tool protocol for consistent behavior everywhere. Incremental indexing with --add/--inspect and model-consistency guards.
 ---
 
 ## Quick Start
@@ -64,6 +64,8 @@ codexa doctor
 
 # Search your code
 codexa search "authentication middleware"
+codexa search "auth flow" --hybrid --scores
+codexa grep "TODO|FIXME" --jsonl -L
 
 # AI-powered analysis
 codexa ask "How does the auth flow work?"
@@ -75,7 +77,7 @@ codexa hotspots
 
 | Metric | Value |
 |--------|-------|
-| **Version** | 0.4.3 |
+| **Version** | 0.5.0 |
 | **CLI Commands** | 39 |
 | **AI Agent Tools** | 13 (+ plugin-registered) |
 | **Plugin Hooks** | 22 |
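
Note on `--jsonl` and `--scores`: the formatter changes are not among the files shown, so the exact output format is unknown. As a rough illustration of "JSONL streaming with optional score prefix", the sketch below emits one JSON object per line and, when requested, prefixes each line with a tab-separated score. The function name, the `score` key, and the tab-separated prefix are assumptions, not the project's confirmed format.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl(results: Iterable[dict], with_scores: bool = False) -> Iterator[str]:
    """Yield one JSON line per result, lazily, so output can start before search ends.

    Hypothetical sketch: with_scores=True prepends "<score>\t" to each line.
    """
    for result in results:
        line = json.dumps(result, ensure_ascii=False, sort_keys=True)
        if with_scores:
            # Missing scores fall back to 0.0 rather than raising.
            line = f"{result.get('score', 0.0):.4f}\t{line}"
        yield line
```

Because the function is a generator, a consumer such as `for line in iter_jsonl(hits): print(line)` never needs the whole result set in memory, which is the main reason to prefer JSONL over a single JSON array for large result sets.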
