diff --git a/rust/PORTING.md b/rust/PORTING.md
deleted file mode 100644
index ec89176..0000000
--- a/rust/PORTING.md
+++ /dev/null
@@ -1,120 +0,0 @@
-# Rust port of cocoindex-code
-
-A from-scratch Rust reimplementation of `cocoindex-code` (the `ccc` CLI), built
-on the **CocoIndex Rust SDK** (`cocoindex-io/cocoindex` → `rust/sdk/cocoindex`).
-Feature parity with the Python implementation in `../src/cocoindex_code`, which
-is kept in the repo as the reference spec.
-
-## Build & run
-
-```bash
-cd rust
-cargo build   # builds everything (fastembed/ONNX is always on — it's the only embedder)
-cargo test    # sqlite-vec (vec0) integration test
-
-./target/debug/ccc init
-./target/debug/ccc index
-./target/debug/ccc search "vector similarity" --lang rust --limit 10
-```
-
-The SDK is a **path dependency** assuming `cocoindex` is checked out as a
-sibling (`../../cocoindex`). For distribution this should become a git
-dependency on `cocoindex-io/cocoindex` (the `v1` branch).
-
-## Architecture
-
-Like the Python tool, the CLI is a thin **client** that talks to a background
-**daemon** over a Unix socket; the daemon keeps the embedding model warm and
-caches per-project state. `index`/`search`/`status`/`doctor` are daemon-backed
-and auto-spawn the daemon on first use.
-
-- **IPC**: length-prefixed msgpack frames over `daemon.sock` (Rust-to-Rust; not
-  wire-compatible with the Python daemon's `multiprocessing.connection`).
-- **Embeddings**: **local sentence-transformers (fastembed) only.** Python also
-  offers a `litellm` provider for cloud/multi-provider embeddings; there is no
-  viable in-process Rust equivalent (the official `LiteLLM-Labs/litellm-rust` is
-  a gateway binary, not a library; the community `litellm-rust` crate is alpha
-  and only covers OpenAI-compatible embeddings), so the litellm option is
-  intentionally not exposed. Existing `provider: litellm` configs parse fine and
-  produce a clear error pointing at the local provider.
-  - Default model: `BAAI/bge-small-en-v1.5` (Python's
-    `Snowflake/snowflake-arctic-embed-xs` isn't in fastembed's registry).
-  - Models are limited to fastembed's supported set (resolved by name, then by
-    suffix — so `sentence-transformers/all-MiniLM-L6-v2` works).
-
-## Python → Rust module map
-
-| Python (`src/cocoindex_code`) | Rust (`rust/src`) | Status |
-|---|---|---|
-| `schema.py` / `CodeChunk` | `schema.rs` | ✅ |
-| `settings.py` | `settings.rs` | ✅ (container path-mapping env vars deferred) |
-| `embedder_params.py` + `embedder_defaults.py` | `embedder_params.rs` | ✅ |
-| `litellm_embedder.py` / `shared.create_embedder` | `embedder.rs` | ✅ (local `prompt_name` TODO) |
-| `indexer.py` | `indexer.rs` + `walk.rs` | ✅ (nested `.gitignore`, custom chunkers TODO) |
-| `query.py` | `query.rs` | ✅ |
-| `project.py` | `project.rs` + `daemon.rs` (registry) | ✅ |
-| `protocol.py` | `protocol.rs` | ✅ |
-| `_daemon_paths.py` | `daemon_paths.rs` | ✅ |
-| `daemon.py` | `daemon.rs` | ✅ |
-| `client.py` | `client.rs` | ✅ |
-| `server.py` (MCP) | `mcp.rs` | ✅ (hand-rolled stdio JSON-RPC) |
-| `cli.py` | `main.rs` | ✅ (interactive `init` prompts → flags) |
-
-## CLI commands (parity)
-
-`init`, `index`, `search` (`--lang`/`--path`/`--offset`/`--limit`/`--refresh`),
-`status`, `reset` (`--all`/`-f`), `doctor` (`-v`), `mcp`, `daemon status|restart|stop`,
-and the hidden `run-daemon`.
-
-## Tested (all green)
-
-- `cargo test`: sqlite-vec `vec0` extension loads + KNN returns correct results.
-- **End-to-end (local embeddings)**: `init` → `index` (walk → tree-sitter chunk
-  → embed → vec0 upsert) → `search` with `--lang`/`--path` filters → incremental
-  re-index correctly skips unchanged files.
-- **Daemon-backed lifecycle**: `index` auto-spawns the daemon (loads model once),
-  `daemon status`/`restart`/`stop`, graceful shutdown, PID/socket cleanup.
-- **MCP**: `initialize` / `tools/list` / `tools/call search` over stdio JSON-RPC.
-- **doctor** (global settings, daemon, model checks, project settings, file walk,
-  index status), **reset --all**, and post-reset "not initialized" handling.
-
-## Backward compatibility
-
-- **Settings files** (`global_settings.yml`, project `settings.yml`) written by
-  the Python tool parse unchanged — same keys, `provider` default (`litellm`),
-  `indexing_params`/`query_params` (absent vs empty), `envs`, and the legacy
-  `sbert/` model-name prefix (stripped before loading).
-- **`provider: litellm`** configs do not crash — they load and return a clear
-  "only local embeddings are supported; set `provider: sentence-transformers`"
-  error (surfaced through the daemon).
-- **Index DB**: the `target_sqlite.db` vec0 schema is identical, so `search`
-  works against a Python-built index. The CocoIndex state db (`cocoindex.db`)
-  differs across engine builds, so the first `index` re-runs (safe/incremental).
-- **`.cocoindex_code/` layout**, paths, and the `.gitignore` entry match Python.
-
-## Parity audit (module-by-module) — fixed
-
-A deep Python-vs-Rust audit drove these fixes (all tested): search/status now
-**auto-start load-time indexing and wait** (`ensure_indexing_started`); include/
-exclude use the SDK's `PatternFilePathMatcher` for **exact** pattern parity, with
-a gitignore-aware wrapper; `init` restores the **"already initialized"** message
-and the **parent-marker warning** (`-f` to override); `reset --all` removes the
-`.gitignore` entry and prints the settings hint; `doctor` regained the
-**daemon-env section**, include/exclude pattern values, the `params:` line, the
-traceback hint, and the log line; the client gained **supervised-mode**
-(`COCOINDEX_CODE_DAEMON_SUPERVISED`), handshake-warning dedup, and PID-guarded
-cleanup; settings gained the empty-file check and absolutized project-root walk;
-the MCP tool descriptions match `server.py`.
-
-## Known deltas vs Python (intentional / follow-up)
-
-1. **Embeddings** — local fastembed only; the `litellm` provider is not exposed
-   (no viable in-process Rust litellm). Default model differs (see above).
-2. **Interactive `init`** — flag-driven (`--model`) instead of questionary prompts.
-3. **Custom chunkers** — Python loads `module:callable` chunkers; Rust can't load
-   Python callables (config still parses; built-in splitter used).
-4. **Legacy `cocoindex-code` entrypoint** + env-var migration
-   (`COCOINDEX_CODE_EMBEDDING_MODEL`, …) — not ported (the `ccc` CLI is the
-   entry point).
-5. local-embedding `prompt_name`, container path-mapping env vars, and live
-   index-progress streaming (`IndexProgressUpdate`) — follow-ups.
diff --git a/rust/README.md b/rust/README.md
new file mode 100644
index 0000000..4683d3a
--- /dev/null
+++ b/rust/README.md
@@ -0,0 +1,217 @@
+# cocoindex-code (Rust) — AST-based semantic code search
+
+A lightweight, effective **(AST-based)** semantic code search tool for your
+codebase — the native-Rust build of [`ccc`](https://github.com/cocoindex-io/cocoindex-code).
+Built on [CocoIndex](https://github.com/cocoindex-io/cocoindex), the Rust data
+transformation engine. Use it from the CLI, or wire it into Claude Code, Codex,
+Cursor — any coding agent — via [Skill](#coding-agent-integration) or
+[MCP](#mcp-server).
+
+- Instant token savings — let the agent find code by meaning, not grep.
+- **Local embeddings, zero setup** — runs fully offline, no API key required.
+- **Incremental** — only re-indexes changed files.
+
+## Features
+
+- **Semantic code search** — find relevant code with natural-language queries
+  when grep falls short.
+- **Ultra performant** — a single static binary on top of the Rust
+  [CocoIndex](https://github.com/cocoindex-io/cocoindex) engine; only changed
+  files are re-indexed.
+- **Multi-language** — Python, JavaScript/TypeScript, Rust, Go, Java, C/C++, C#,
+  SQL, Shell, and more (tree-sitter).
+- **Embedded** — a sqlite-vec index file; no database to run.
+- **Local embeddings** — sentence-transformers via [fastembed](https://github.com/Anush008/fastembed-rs)
+  (ONNX), no API key, no Python.
+
+## Install
+
+The Rust build is compiled from source. It depends on the CocoIndex SDK as a
+sibling checkout, so clone both repos side by side:
+
+```bash
+git clone https://github.com/cocoindex-io/cocoindex
+git clone -b rust https://github.com/cocoindex-io/cocoindex-code
+
+cd cocoindex-code/rust
+cargo build --release
+
+# put the binary on your PATH (or use `cargo install --path .`)
+install -m 0755 target/release/ccc ~/.local/bin/ccc
+ccc --help
+```
+
+Embeddings are **local-only** (fastembed/ONNX) — no cloud provider or API key is
+required or supported in this build. The default model is
+[`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5); any
+model in fastembed's registry can be selected (see [Configuration](#configuration)).
+
+## Quick start
+
+```bash
+ccc init                            # initialize project (creates settings)
+ccc index                           # build the index
+ccc search "authentication logic"   # search!
+```
+
+The background daemon starts automatically on first use and keeps the embedding
+model warm.
+
+> **Tip:** `ccc index` auto-initializes if you haven't run `ccc init` yet, so you
+> can skip straight to indexing.
+
+## Coding Agent Integration
+
+### Skill
+
+Install the `ccc` skill so your coding agent automatically uses semantic search
+when it helps:
+
+```bash
+npx skills add cocoindex-io/cocoindex-code
+```
+
+The skill teaches the agent to initialize, index, and search on its own, and to
+keep the index fresh as you work. Ask it to search the codebase — e.g. *"find how
+user sessions are managed"* — or invoke it directly with `/ccc`. Requires the
+`ccc` binary on your `PATH` (see [Install](#install)).
+
+### MCP Server
+
+Alternatively, run `ccc` as an MCP server over stdio:
+
+```bash
+# Claude Code
+claude mcp add cocoindex-code -- ccc mcp
+
+# Codex
+codex mcp add cocoindex-code -- ccc mcp
+```
+
+Once configured, the agent decides when semantic search is helpful — finding code
+by description, exploring unfamiliar code, or locating implementations without
+knowing exact names.
+
+<details>
+<summary>MCP Tool Reference</summary>
+
+Running as an MCP server (`ccc mcp`) exposes one tool:
+
+**`search`** — search the codebase by semantic similarity.
+
+```
+search(
+    query: str,                          # natural-language query or code snippet
+    limit: int = 5,                      # max results (1–100)
+    offset: int = 0,                     # pagination offset
+    refresh_index: bool = True,          # refresh the index before querying
+    languages: list[str] | None = None,  # filter by language, e.g. ["python","rust"]
+    paths: list[str] | None = None,      # filter by path glob, e.g. ["src/utils/*"]
+)
+```
+
+Returns matching chunks with file path, language, code, line numbers, and a
+similarity score.
+</details>
+
+## CLI Reference
+
+| Command | Description |
+|---------|-------------|
+| `ccc init` | Initialize a project — creates settings files, adds `.cocoindex_code/` to `.gitignore` |
+| `ccc index` | Build or update the index (auto-inits if needed) |
+| `ccc search ` | Semantic search across the codebase |
+| `ccc status` | Show index stats (chunk count, file count, language breakdown) |
+| `ccc mcp` | Run as an MCP server in stdio mode |
+| `ccc doctor` | Run diagnostics — settings, daemon, model, file matching, index health (`-v` for detail) |
+| `ccc reset` | Delete index databases. `--all` also removes settings. `-f` skips confirmation. |
+| `ccc daemon status` | Show daemon version, uptime, and loaded projects |
+| `ccc daemon restart` | Restart the background daemon |
+| `ccc daemon stop` | Stop the daemon |
+
+### Search options
+
+```bash
+ccc search database schema                         # basic search
+ccc search --lang python --lang markdown schema    # filter by language
+ccc search --path 'src/utils/*' query handler      # filter by path glob
+ccc search --offset 10 --limit 5 database schema   # pagination
+ccc search --refresh database schema               # update index first, then search
+```
+
+By default `ccc search` scopes results to your current working directory
+(relative to the project root). Use `--path` to override.
+
+## Configuration
+
+Configuration lives in two YAML files, both created by `ccc init`.
+
+### User settings (`~/.cocoindex_code/global_settings.yml`)
+
+Shared across all projects — controls the embedding model.
+
+```yaml
+embedding:
+  provider: sentence-transformers   # local fastembed (the only supported provider)
+  model: BAAI/bge-small-en-v1.5     # any model in fastembed's registry
+
+  # Optional asymmetric-retrieval knobs, applied separately to indexing vs query.
+  # Accepted key: prompt_name (sentence-transformers).
+  # indexing_params:
+  #   prompt_name: passage
+  # query_params:
+  #   prompt_name: query
+```
+
+> Set `COCOINDEX_CODE_DIR` to place `global_settings.yml` somewhere other than
+> `~/.cocoindex_code/`.
+
+Models are resolved against fastembed's registry by name, then by suffix — so
+`sentence-transformers/all-MiniLM-L6-v2` resolves. Cloud / LiteLLM providers are
+not part of this build; a `provider: litellm` config loads but fails with a clear
+message pointing at the local provider.
+
+### Project settings (`/.cocoindex_code/settings.yml`)
+
+Per-project — controls which files are indexed.
+
+```yaml
+include_patterns:
+  - "**/*.py"
+  - "**/*.ts"
+  - "**/*.rs"
+  - "**/*.go"
+  # ... sensible defaults for 28+ file types
+
+exclude_patterns:
+  - "**/.*"            # hidden directories
+  - "**/node_modules"
+  - "**/dist"
+  # ...
+
+language_overrides:
+  - ext: inc           # treat .inc files as PHP
+    lang: php
+```
+
+Include/exclude globs additionally honor nested `.gitignore` files.
+`.cocoindex_code/` is added to `.gitignore` during `init`.
+
+## Supported languages
+
+Tree-sitter–based chunking for Python, JavaScript/TypeScript, Rust, Go, Java,
+C/C++, C#, Ruby, PHP, Swift, Kotlin, Scala, SQL, Shell, Markdown, and more.
+Unrecognized text files are indexed with a generic recursive splitter.
+
+## Differences from the Python build
+
+This native build targets feature parity with the Python `ccc` for day-to-day
+use; two things differ today:
+
+- **Embeddings are local-only** (fastembed). There is no LiteLLM / cloud-provider
+  option, and the default model is `BAAI/bge-small-en-v1.5`.
+- **Custom Python chunkers** (`chunkers:` in project settings) are not supported —
+  the config still parses, but the built-in tree-sitter splitter is used.
+
+Index databases are interchangeable: `ccc search` works against an index built by
+the Python tool, and vice versa.
diff --git a/rust/tests/e2e_advanced.sh b/rust/tests/e2e_advanced.sh
index 3cf648f..da32bff 100755
--- a/rust/tests/e2e_advanced.sh
+++ b/rust/tests/e2e_advanced.sh
@@ -103,7 +103,7 @@ echo "### F. Real Rust codebase (the port's own src) — multi-language"
 P="$ROOT/realrust"; mkdir -p "$P/src"
 cp "$REPO"/rust/src/*.rs "$P/src/"
 cp "$REPO"/rust/Cargo.toml "$P/"
-cp "$REPO"/rust/PORTING.md "$P/"
+cp "$REPO"/rust/README.md "$P/"
 cd "$P"; $BIN init >/dev/null 2>&1; rr=$($BIN index 2>&1 | grep -E "rust:|toml:|markdown:")
 has "indexed rust files" "rust:" "$rr"
 has "indexed toml" "toml:" "$rr"