From 5c0b5dc06049cfe6db2ea7693506e5054b378a63 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?LJ=20=F0=9F=A5=A5=F0=9F=8C=B4?= Date: Sun, 21 Jun 2026 23:29:00 -0700 Subject: [PATCH 1/3] docs(rust): document how the port uses the CocoIndex Rust SDK MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a "How it uses the CocoIndex Rust SDK" section to rust/PORTING.md — a code-grounded walkthrough of the SDK API the port exercises (Environment/App/run, ContextKey DI + change detection, #[cocoindex::function] memoization, walk_dir + mount_each!, the sqlite/vec0 table target + declare_row, and the sqlite-vec from_pool gotcha). Snippets cite live file:line anchors so the doc stays verifiable against the source. Co-Authored-By: Claude Opus 4.8 --- rust/PORTING.md | 90 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 90 insertions(+) diff --git a/rust/PORTING.md b/rust/PORTING.md index ec89176..019005f 100644 --- a/rust/PORTING.md +++ b/rust/PORTING.md @@ -42,6 +42,96 @@ and auto-spawn the daemon on first use. - Models are limited to fastembed's supported set (resolved by name, then by suffix — so `sentence-transformers/all-MiniLM-L6-v2` works). +## How it uses the CocoIndex Rust SDK + +This is the canonical worked example of driving the SDK from Rust. The snippets +below mirror the live source — `rust/src/indexer.rs` is the whole flow in ~230 +lines — so treat the cited `file:line` anchors as the source of truth and update +this section whenever those change. + +**Crate + features** (`Cargo.toml`): one path/git dependency, feature-gated to +exactly what the tool needs. 
+ +```toml +cocoindex = { features = ["text", "sqlite", "fastembed", "fs_live"] } +# text -> RecursiveSplitter + detect_code_language (tree-sitter) +# sqlite -> sqlite-vec (vec0) table target +# fastembed -> local sentence-transformers embeddings +# fs_live -> live directory watching (daemon) +``` + +`use cocoindex::prelude::*;` pulls in `Ctx`, `Error`/`Result`, `FileEntry`, +`IdGenerator`, `walk_dir`, and the `mount_each!` macro. + +**1. Environment → App → run** — the entry point (`indexer.rs:206`). The +`Environment` owns the incremental-state DB and the dependency-injected +resources; `app.run` executes one declarative pass and returns `RunStats`. + +```rust +let app = cocoindex::Environment::builder() + .db_path(coco_db_path) // engine's change-tracking state DB + .provide_key(&DB, db) // inject resources by ContextKey + .provide_key(&EMBEDDER, embedder.clone()) + .build().await? + .app("CocoIndexCode").await?; +let stats: RunStats = app.run(move |ctx| app_main(ctx, /* … */)).await?; +``` + +**2. Context keys = typed DI + change detection** (`indexer.rs:30`). `ContextKey` +values are fetched inside a flow with `ctx.get_key(&KEY)?`. `new_with_state` +attaches a state-id, so changing the underlying resource (e.g. the embedding +model) invalidates everything memoized against it. + +```rust +static EMBEDDER: LazyLock> = + LazyLock::new(|| ContextKey::new_with_state("embedder", |e| e.state_key())); +``` + +**3. Memoized functions** — `#[cocoindex::function]` (`indexer.rs:48`). The +arguments are part of the memo fingerprint: an unchanged `(file, model_tag)` is +skipped on the next run. (We thread the embedder's identity through `model_tag` +precisely so a model swap reprocesses every file.) + +```rust +#[cocoindex::function] +async fn process_file(ctx: &Ctx, file: FileEntry, model_tag: String) + -> Result> { /* chunk + embed */ } +``` + +**4. Sources + fan-out** (`indexer.rs:169`). 
`walk_dir(...).path_matcher(...)` +yields `(key, FileEntry)` items (`file.key()`, `file.content_str()`); +`mount_each!` mounts the memoized fn once per item. + +```rust +let files = walk_dir(root).recursive(true).path_matcher(matcher).items()?; +let rows_by_file = + mount_each!(files, |file| process_file(ctx, file, model_tag.clone())).await?; +``` + +**5. Targets = declarative sync** (`indexer.rs:152`). You *declare* the desired +rows; the engine diffs against the previous run and applies the minimal +insert/update/delete. Rows are plain `Serialize` structs (`schema.rs::CodeChunk`). + +```rust +let table = sqlite::mount_table_target_with_options(&ctx, &DB, TABLE_NAME, schema, opts).await?; +for row in &rows { table.declare_row(&ctx, row)?; } +``` + +Schema is built with `TableSchema::new([(name, ColumnDef::new(ty)), …], [pk])`; +sqlite-vec virtual tables via `Vec0TableDef { partition_key_columns, auxiliary_columns }`. + +**6. sqlite-vec gotcha** (`db.rs`). The SDK's `sqlite::Database::connect` does +*not* load the `vec0` extension. The port registers it as a SQLite +auto-extension once, builds its own pool, and hands it to +`sqlite::Database::from_pool(state_id, pool)` — the supported way to use the SDK +sqlite target with `vec0` virtual tables. + +**7. Building blocks used from the SDK:** `ops::text::{RecursiveSplitter, +RecursiveChunkConfig, detect_code_language}` for chunking/language detection; +`IdGenerator::new()` + `id_gen.next_id(ctx, &code)` for stable chunk ids; +`RunStats` for the run summary; `Error::engine(..)` to wrap foreign errors into +the SDK error type. 
+ ## Python → Rust module map | Python (`src/cocoindex_code`) | Rust (`rust/src`) | Status | From 1af60f8b1ec308c2afc82475ec798a15c035a8bf Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?LJ=20=F0=9F=A5=A5=F0=9F=8C=B4?= Date: Sun, 21 Jun 2026 23:38:38 -0700 Subject: [PATCH 2/3] docs(rust): make it a standalone Rust README (drop Python framing) Rename rust/PORTING.md -> rust/README.md and rewrite as a Rust-only usage doc: drop the Python->Rust module map, the parity audit, the Python backward-compat section, and the "vs Python" deltas. Keep build/run, architecture, the "How it uses the CocoIndex Rust SDK" walkthrough, CLI commands, configuration, testing, and a plain limitations/follow-ups list. Update the e2e fixture that copies the doc as a sample markdown file. Co-Authored-By: Claude Opus 4.8 --- rust/PORTING.md | 210 ------------------------------------- rust/README.md | 159 ++++++++++++++++++++++++++++ rust/tests/e2e_advanced.sh | 2 +- 3 files changed, 160 insertions(+), 211 deletions(-) delete mode 100644 rust/PORTING.md create mode 100644 rust/README.md diff --git a/rust/PORTING.md b/rust/PORTING.md deleted file mode 100644 index 019005f..0000000 --- a/rust/PORTING.md +++ /dev/null @@ -1,210 +0,0 @@ -# Rust port of cocoindex-code - -A from-scratch Rust reimplementation of `cocoindex-code` (the `ccc` CLI), built -on the **CocoIndex Rust SDK** (`cocoindex-io/cocoindex` → `rust/sdk/cocoindex`). -Feature parity with the Python implementation in `../src/cocoindex_code`, which -is kept in the repo as the reference spec. - -## Build & run - -```bash -cd rust -cargo build # builds everything (fastembed/ONNX is always on — it's the only embedder) -cargo test # sqlite-vec (vec0) integration test - -./target/debug/ccc init -./target/debug/ccc index -./target/debug/ccc search "vector similarity" --lang rust --limit 10 -``` - -The SDK is a **path dependency** assuming `cocoindex` is checked out as a -sibling (`../../cocoindex`). 
For distribution this should become a git -dependency on `cocoindex-io/cocoindex` (the `v1` branch). - -## Architecture - -Like the Python tool, the CLI is a thin **client** that talks to a background -**daemon** over a Unix socket; the daemon keeps the embedding model warm and -caches per-project state. `index`/`search`/`status`/`doctor` are daemon-backed -and auto-spawn the daemon on first use. - -- **IPC**: length-prefixed msgpack frames over `daemon.sock` (Rust-to-Rust; not - wire-compatible with the Python daemon's `multiprocessing.connection`). -- **Embeddings**: **local sentence-transformers (fastembed) only.** Python also - offers a `litellm` provider for cloud/multi-provider embeddings; there is no - viable in-process Rust equivalent (the official `LiteLLM-Labs/litellm-rust` is - a gateway binary, not a library; the community `litellm-rust` crate is alpha - and only covers OpenAI-compatible embeddings), so the litellm option is - intentionally not exposed. Existing `provider: litellm` configs parse fine and - produce a clear error pointing at the local provider. - - Default model: `BAAI/bge-small-en-v1.5` (Python's - `Snowflake/snowflake-arctic-embed-xs` isn't in fastembed's registry). - - Models are limited to fastembed's supported set (resolved by name, then by - suffix — so `sentence-transformers/all-MiniLM-L6-v2` works). - -## How it uses the CocoIndex Rust SDK - -This is the canonical worked example of driving the SDK from Rust. The snippets -below mirror the live source — `rust/src/indexer.rs` is the whole flow in ~230 -lines — so treat the cited `file:line` anchors as the source of truth and update -this section whenever those change. - -**Crate + features** (`Cargo.toml`): one path/git dependency, feature-gated to -exactly what the tool needs. 
- -```toml -cocoindex = { features = ["text", "sqlite", "fastembed", "fs_live"] } -# text -> RecursiveSplitter + detect_code_language (tree-sitter) -# sqlite -> sqlite-vec (vec0) table target -# fastembed -> local sentence-transformers embeddings -# fs_live -> live directory watching (daemon) -``` - -`use cocoindex::prelude::*;` pulls in `Ctx`, `Error`/`Result`, `FileEntry`, -`IdGenerator`, `walk_dir`, and the `mount_each!` macro. - -**1. Environment → App → run** — the entry point (`indexer.rs:206`). The -`Environment` owns the incremental-state DB and the dependency-injected -resources; `app.run` executes one declarative pass and returns `RunStats`. - -```rust -let app = cocoindex::Environment::builder() - .db_path(coco_db_path) // engine's change-tracking state DB - .provide_key(&DB, db) // inject resources by ContextKey - .provide_key(&EMBEDDER, embedder.clone()) - .build().await? - .app("CocoIndexCode").await?; -let stats: RunStats = app.run(move |ctx| app_main(ctx, /* … */)).await?; -``` - -**2. Context keys = typed DI + change detection** (`indexer.rs:30`). `ContextKey` -values are fetched inside a flow with `ctx.get_key(&KEY)?`. `new_with_state` -attaches a state-id, so changing the underlying resource (e.g. the embedding -model) invalidates everything memoized against it. - -```rust -static EMBEDDER: LazyLock> = - LazyLock::new(|| ContextKey::new_with_state("embedder", |e| e.state_key())); -``` - -**3. Memoized functions** — `#[cocoindex::function]` (`indexer.rs:48`). The -arguments are part of the memo fingerprint: an unchanged `(file, model_tag)` is -skipped on the next run. (We thread the embedder's identity through `model_tag` -precisely so a model swap reprocesses every file.) - -```rust -#[cocoindex::function] -async fn process_file(ctx: &Ctx, file: FileEntry, model_tag: String) - -> Result> { /* chunk + embed */ } -``` - -**4. Sources + fan-out** (`indexer.rs:169`). 
`walk_dir(...).path_matcher(...)` -yields `(key, FileEntry)` items (`file.key()`, `file.content_str()`); -`mount_each!` mounts the memoized fn once per item. - -```rust -let files = walk_dir(root).recursive(true).path_matcher(matcher).items()?; -let rows_by_file = - mount_each!(files, |file| process_file(ctx, file, model_tag.clone())).await?; -``` - -**5. Targets = declarative sync** (`indexer.rs:152`). You *declare* the desired -rows; the engine diffs against the previous run and applies the minimal -insert/update/delete. Rows are plain `Serialize` structs (`schema.rs::CodeChunk`). - -```rust -let table = sqlite::mount_table_target_with_options(&ctx, &DB, TABLE_NAME, schema, opts).await?; -for row in &rows { table.declare_row(&ctx, row)?; } -``` - -Schema is built with `TableSchema::new([(name, ColumnDef::new(ty)), …], [pk])`; -sqlite-vec virtual tables via `Vec0TableDef { partition_key_columns, auxiliary_columns }`. - -**6. sqlite-vec gotcha** (`db.rs`). The SDK's `sqlite::Database::connect` does -*not* load the `vec0` extension. The port registers it as a SQLite -auto-extension once, builds its own pool, and hands it to -`sqlite::Database::from_pool(state_id, pool)` — the supported way to use the SDK -sqlite target with `vec0` virtual tables. - -**7. Building blocks used from the SDK:** `ops::text::{RecursiveSplitter, -RecursiveChunkConfig, detect_code_language}` for chunking/language detection; -`IdGenerator::new()` + `id_gen.next_id(ctx, &code)` for stable chunk ids; -`RunStats` for the run summary; `Error::engine(..)` to wrap foreign errors into -the SDK error type. 
- -## Python → Rust module map - -| Python (`src/cocoindex_code`) | Rust (`rust/src`) | Status | -|---|---|---| -| `schema.py` / `CodeChunk` | `schema.rs` | ✅ | -| `settings.py` | `settings.rs` | ✅ (container path-mapping env vars deferred) | -| `embedder_params.py` + `embedder_defaults.py` | `embedder_params.rs` | ✅ | -| `litellm_embedder.py` / `shared.create_embedder` | `embedder.rs` | ✅ (local `prompt_name` TODO) | -| `indexer.py` | `indexer.rs` + `walk.rs` | ✅ (nested `.gitignore`, custom chunkers TODO) | -| `query.py` | `query.rs` | ✅ | -| `project.py` | `project.rs` + `daemon.rs` (registry) | ✅ | -| `protocol.py` | `protocol.rs` | ✅ | -| `_daemon_paths.py` | `daemon_paths.rs` | ✅ | -| `daemon.py` | `daemon.rs` | ✅ | -| `client.py` | `client.rs` | ✅ | -| `server.py` (MCP) | `mcp.rs` | ✅ (hand-rolled stdio JSON-RPC) | -| `cli.py` | `main.rs` | ✅ (interactive `init` prompts → flags) | - -## CLI commands (parity) - -`init`, `index`, `search` (`--lang`/`--path`/`--offset`/`--limit`/`--refresh`), -`status`, `reset` (`--all`/`-f`), `doctor` (`-v`), `mcp`, `daemon status|restart|stop`, -and the hidden `run-daemon`. - -## Tested (all green) - -- `cargo test`: sqlite-vec `vec0` extension loads + KNN returns correct results. -- **End-to-end (local embeddings)**: `init` → `index` (walk → tree-sitter chunk - → embed → vec0 upsert) → `search` with `--lang`/`--path` filters → incremental - re-index correctly skips unchanged files. -- **Daemon-backed lifecycle**: `index` auto-spawns the daemon (loads model once), - `daemon status`/`restart`/`stop`, graceful shutdown, PID/socket cleanup. -- **MCP**: `initialize` / `tools/list` / `tools/call search` over stdio JSON-RPC. -- **doctor** (global settings, daemon, model checks, project settings, file walk, - index status), **reset --all**, and post-reset "not initialized" handling. 
- -## Backward compatibility - -- **Settings files** (`global_settings.yml`, project `settings.yml`) written by - the Python tool parse unchanged — same keys, `provider` default (`litellm`), - `indexing_params`/`query_params` (absent vs empty), `envs`, and the legacy - `sbert/` model-name prefix (stripped before loading). -- **`provider: litellm`** configs do not crash — they load and return a clear - "only local embeddings are supported; set `provider: sentence-transformers`" - error (surfaced through the daemon). -- **Index DB**: the `target_sqlite.db` vec0 schema is identical, so `search` - works against a Python-built index. The CocoIndex state db (`cocoindex.db`) - differs across engine builds, so the first `index` re-runs (safe/incremental). -- **`.cocoindex_code/` layout**, paths, and the `.gitignore` entry match Python. - -## Parity audit (module-by-module) — fixed - -A deep Python-vs-Rust audit drove these fixes (all tested): search/status now -**auto-start load-time indexing and wait** (`ensure_indexing_started`); include/ -exclude use the SDK's `PatternFilePathMatcher` for **exact** pattern parity, with -a gitignore-aware wrapper; `init` restores the **"already initialized"** message -and the **parent-marker warning** (`-f` to override); `reset --all` removes the -`.gitignore` entry and prints the settings hint; `doctor` regained the -**daemon-env section**, include/exclude pattern values, the `params:` line, the -traceback hint, and the log line; the client gained **supervised-mode** -(`COCOINDEX_CODE_DAEMON_SUPERVISED`), handshake-warning dedup, and PID-guarded -cleanup; settings gained the empty-file check and absolutized project-root walk; -the MCP tool descriptions match `server.py`. - -## Known deltas vs Python (intentional / follow-up) - -1. **Embeddings** — local fastembed only; the `litellm` provider is not exposed - (no viable in-process Rust litellm). Default model differs (see above). -2. 
**Interactive `init`** — flag-driven (`--model`) instead of questionary prompts. -3. **Custom chunkers** — Python loads `module:callable` chunkers; Rust can't load - Python callables (config still parses; built-in splitter used). -4. **Legacy `cocoindex-code` entrypoint** + env-var migration - (`COCOINDEX_CODE_EMBEDDING_MODEL`, …) — not ported (the `ccc` CLI is the - entry point). -5. local-embedding `prompt_name`, container path-mapping env vars, and live - index-progress streaming (`IndexProgressUpdate`) — follow-ups. diff --git a/rust/README.md b/rust/README.md new file mode 100644 index 0000000..0a6a0d3 --- /dev/null +++ b/rust/README.md @@ -0,0 +1,159 @@ +# ccc — semantic code search (Rust) + +A lightweight, AST-aware semantic code search engine (the `ccc` CLI) built on the +**CocoIndex Rust SDK**. It walks a codebase, chunks each file with tree-sitter, +embeds the chunks locally, and stores them in a sqlite-vec (`vec0`) table for +fast vector search — from the CLI or over MCP. + +## Build & run + +```bash +cd rust +cargo build # fastembed/ONNX is always on — local embeddings are the only backend +cargo test # sqlite-vec (vec0) integration test + +./target/debug/ccc init +./target/debug/ccc index +./target/debug/ccc search "vector similarity" --lang rust --limit 10 +``` + +The SDK is a **path dependency** assuming `cocoindex` is checked out as a sibling +(`../../cocoindex`). For distribution this should become a git dependency on +`cocoindex-io/cocoindex` (the `v1` branch). + +## Architecture + +The CLI is a thin **client** that talks to a background **daemon** over a Unix +socket; the daemon keeps the embedding model warm and caches per-project state. +`index` / `search` / `status` / `doctor` are daemon-backed and auto-spawn the +daemon on first use. + +- **IPC**: length-prefixed msgpack frames over `daemon.sock`. +- **Embeddings**: local sentence-transformers via **fastembed** (ONNX). 
Default + model `BAAI/bge-small-en-v1.5`; any model in fastembed's registry works + (resolved by name, then by suffix, so `sentence-transformers/all-MiniLM-L6-v2` + resolves). +- **Storage**: a sqlite-vec (`vec0`) virtual table, partitioned by `language`. + +## How it uses the CocoIndex Rust SDK + +This is the canonical worked example of driving the SDK from Rust. The snippets +below mirror the live source — `rust/src/indexer.rs` is the whole flow in ~230 +lines — so treat the cited `file:line` anchors as the source of truth and update +this section whenever those change. + +**Crate + features** (`Cargo.toml`): one path/git dependency, feature-gated to +exactly what the tool needs. + +```toml +cocoindex = { features = ["text", "sqlite", "fastembed", "fs_live"] } +# text -> RecursiveSplitter + detect_code_language (tree-sitter) +# sqlite -> sqlite-vec (vec0) table target +# fastembed -> local sentence-transformers embeddings +# fs_live -> live directory watching (daemon) +``` + +`use cocoindex::prelude::*;` pulls in `Ctx`, `Error`/`Result`, `FileEntry`, +`IdGenerator`, `walk_dir`, and the `mount_each!` macro. + +**1. Environment → App → run** — the entry point (`indexer.rs:206`). The +`Environment` owns the incremental-state DB and the dependency-injected +resources; `app.run` executes one declarative pass and returns `RunStats`. + +```rust +let app = cocoindex::Environment::builder() + .db_path(coco_db_path) // engine's change-tracking state DB + .provide_key(&DB, db) // inject resources by ContextKey + .provide_key(&EMBEDDER, embedder.clone()) + .build().await? + .app("CocoIndexCode").await?; +let stats: RunStats = app.run(move |ctx| app_main(ctx, /* … */)).await?; +``` + +**2. Context keys = typed DI + change detection** (`indexer.rs:30`). `ContextKey` +values are fetched inside a flow with `ctx.get_key(&KEY)?`. `new_with_state` +attaches a state-id, so changing the underlying resource (e.g. the embedding +model) invalidates everything memoized against it. 
+ +```rust +static EMBEDDER: LazyLock> = + LazyLock::new(|| ContextKey::new_with_state("embedder", |e| e.state_key())); +``` + +**3. Memoized functions** — `#[cocoindex::function]` (`indexer.rs:48`). The +arguments are part of the memo fingerprint: an unchanged `(file, model_tag)` is +skipped on the next run. (We thread the embedder's identity through `model_tag` +precisely so a model swap reprocesses every file.) + +```rust +#[cocoindex::function] +async fn process_file(ctx: &Ctx, file: FileEntry, model_tag: String) + -> Result> { /* chunk + embed */ } +``` + +**4. Sources + fan-out** (`indexer.rs:169`). `walk_dir(...).path_matcher(...)` +yields `(key, FileEntry)` items (`file.key()`, `file.content_str()`); +`mount_each!` mounts the memoized fn once per item. + +```rust +let files = walk_dir(root).recursive(true).path_matcher(matcher).items()?; +let rows_by_file = + mount_each!(files, |file| process_file(ctx, file, model_tag.clone())).await?; +``` + +**5. Targets = declarative sync** (`indexer.rs:152`). You *declare* the desired +rows; the engine diffs against the previous run and applies the minimal +insert/update/delete. Rows are plain `Serialize` structs (`schema.rs::CodeChunk`). + +```rust +let table = sqlite::mount_table_target_with_options(&ctx, &DB, TABLE_NAME, schema, opts).await?; +for row in &rows { table.declare_row(&ctx, row)?; } +``` + +Schema is built with `TableSchema::new([(name, ColumnDef::new(ty)), …], [pk])`; +sqlite-vec virtual tables via `Vec0TableDef { partition_key_columns, auxiliary_columns }`. + +**6. sqlite-vec gotcha** (`db.rs`). The SDK's `sqlite::Database::connect` does +*not* load the `vec0` extension. The tool registers it as a SQLite +auto-extension once, builds its own pool, and hands it to +`sqlite::Database::from_pool(state_id, pool)` — the supported way to use the SDK +sqlite target with `vec0` virtual tables. + +**7. 
Building blocks used from the SDK:** `ops::text::{RecursiveSplitter, +RecursiveChunkConfig, detect_code_language}` for chunking/language detection; +`IdGenerator::new()` + `id_gen.next_id(ctx, &code)` for stable chunk ids; +`RunStats` for the run summary; `Error::engine(..)` to wrap foreign errors into +the SDK error type. + +## CLI commands + +`init`, `index`, `search` (`--lang` / `--path` / `--offset` / `--limit` / `--refresh`), +`status`, `reset` (`--all` / `-f`), `doctor` (`-v`), `mcp`, +`daemon status|restart|stop`, and the hidden `run-daemon`. + +## Configuration + +Settings live in `~/.cocoindex_code/global_settings.yml` (embedding model, +provider, indexing/query params) and a per-project `.cocoindex_code/settings.yml` +(include/exclude patterns, language overrides). Include/exclude use the SDK's +`PatternFilePathMatcher`, wrapped to also honor nested `.gitignore` files. +`ccc doctor` prints the resolved configuration and where each value came from. + +## Testing + +- `cargo test` — the sqlite-vec `vec0` extension loads and KNN returns correct + results. +- `tests/e2e_cli.sh` / `tests/e2e_advanced.sh` — end-to-end coverage of + `init` → `index` → `search` (with `--lang`/`--path` filters and incremental + re-index), daemon lifecycle (auto-spawn, restart, stop, graceful shutdown), + multi-project serving, model-swap re-index, MCP (`initialize` / `tools/list` / + `tools/call`), `doctor`, and `reset --all`. + +## Limitations / follow-ups + +- **Embeddings**: local fastembed only — no cloud / multi-provider backend yet. +- **`init`** is flag-driven (`--model`) rather than interactive prompts. +- **Custom chunkers**: the built-in tree-sitter recursive splitter is used; + pluggable chunkers are not yet supported. +- **Live index-progress streaming** and container path-mapping env vars are + follow-ups. 
diff --git a/rust/tests/e2e_advanced.sh b/rust/tests/e2e_advanced.sh index 3cf648f..da32bff 100755 --- a/rust/tests/e2e_advanced.sh +++ b/rust/tests/e2e_advanced.sh @@ -103,7 +103,7 @@ echo "### F. Real Rust codebase (the port's own src) — multi-language" P="$ROOT/realrust"; mkdir -p "$P/src" cp "$REPO"/rust/src/*.rs "$P/src/" cp "$REPO"/rust/Cargo.toml "$P/" -cp "$REPO"/rust/PORTING.md "$P/" +cp "$REPO"/rust/README.md "$P/" cd "$P"; $BIN init >/dev/null 2>&1; rr=$($BIN index 2>&1 | grep -E "rust:|toml:|markdown:") has "indexed rust files" "rust:" "$rr" has "indexed toml" "toml:" "$rr" From 20b83310fe5dd59acfc569d454a9baec1a39e0e6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?LJ=20=F0=9F=A5=A5=F0=9F=8C=B4?= Date: Sun, 21 Jun 2026 23:52:20 -0700 Subject: [PATCH 3/3] docs(rust): rewrite README as a user guide (install / CLI / MCP / config) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Drop the SDK-internals walkthrough. The Rust README now mirrors the main cocoindex-code README's user-facing structure — Install (build from source), Quick start, Coding Agent Integration (Skill + MCP), CLI Reference, Search options, MCP tool reference, Configuration (user/project settings), Supported languages, and a short "Differences from the Python build" note (local-only embeddings; no custom Python chunkers). Co-Authored-By: Claude Opus 4.8 --- rust/README.md | 304 +++++++++++++++++++++++++++++-------------------- 1 file changed, 181 insertions(+), 123 deletions(-) diff --git a/rust/README.md b/rust/README.md index 0a6a0d3..4683d3a 100644 --- a/rust/README.md +++ b/rust/README.md @@ -1,159 +1,217 @@ -# ccc — semantic code search (Rust) +# cocoindex-code (Rust) — AST-based semantic code search -A lightweight, AST-aware semantic code search engine (the `ccc` CLI) built on the -**CocoIndex Rust SDK**. 
It walks a codebase, chunks each file with tree-sitter, -embeds the chunks locally, and stores them in a sqlite-vec (`vec0`) table for -fast vector search — from the CLI or over MCP. +A lightweight, effective **(AST-based)** semantic code search tool for your +codebase — the native-Rust build of [`ccc`](https://github.com/cocoindex-io/cocoindex-code). +Built on [CocoIndex](https://github.com/cocoindex-io/cocoindex), the Rust data +transformation engine. Use it from the CLI, or wire it into Claude Code, Codex, +Cursor — any coding agent — via [Skill](#coding-agent-integration) or +[MCP](#mcp-server). -## Build & run +- Instant token savings — let the agent find code by meaning, not grep. +- **Local embeddings, zero setup** — runs fully offline, no API key required. +- **Incremental** — only re-indexes changed files. + +## Features + +- **Semantic code search** — find relevant code with natural-language queries + when grep falls short. +- **Ultra performant** — a single static binary on top of the Rust + [CocoIndex](https://github.com/cocoindex-io/cocoindex) engine; only changed + files are re-indexed. +- **Multi-language** — Python, JavaScript/TypeScript, Rust, Go, Java, C/C++, C#, + SQL, Shell, and more (tree-sitter). +- **Embedded** — a sqlite-vec index file; no database to run. +- **Local embeddings** — sentence-transformers via [fastembed](https://github.com/Anush008/fastembed-rs) + (ONNX), no API key, no Python. + +## Install + +The Rust build is compiled from source. 
It depends on the CocoIndex SDK as a +sibling checkout, so clone both repos side by side: ```bash -cd rust -cargo build # fastembed/ONNX is always on — local embeddings are the only backend -cargo test # sqlite-vec (vec0) integration test +git clone https://github.com/cocoindex-io/cocoindex +git clone -b rust https://github.com/cocoindex-io/cocoindex-code + +cd cocoindex-code/rust +cargo build --release -./target/debug/ccc init -./target/debug/ccc index -./target/debug/ccc search "vector similarity" --lang rust --limit 10 +# put the binary on your PATH (or use `cargo install --path .`) +install -m 0755 target/release/ccc ~/.local/bin/ccc +ccc --help ``` -The SDK is a **path dependency** assuming `cocoindex` is checked out as a sibling -(`../../cocoindex`). For distribution this should become a git dependency on -`cocoindex-io/cocoindex` (the `v1` branch). +Embeddings are **local-only** (fastembed/ONNX) — no cloud provider or API key is +required or supported in this build. The default model is +[`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5); any +model in fastembed's registry can be selected (see [Configuration](#configuration)). -## Architecture +## Quick start -The CLI is a thin **client** that talks to a background **daemon** over a Unix -socket; the daemon keeps the embedding model warm and caches per-project state. -`index` / `search` / `status` / `doctor` are daemon-backed and auto-spawn the -daemon on first use. +```bash +ccc init # initialize project (creates settings) +ccc index # build the index +ccc search "authentication logic" # search! +``` -- **IPC**: length-prefixed msgpack frames over `daemon.sock`. -- **Embeddings**: local sentence-transformers via **fastembed** (ONNX). Default - model `BAAI/bge-small-en-v1.5`; any model in fastembed's registry works - (resolved by name, then by suffix, so `sentence-transformers/all-MiniLM-L6-v2` - resolves). -- **Storage**: a sqlite-vec (`vec0`) virtual table, partitioned by `language`. 
+The background daemon starts automatically on first use and keeps the embedding +model warm. -## How it uses the CocoIndex Rust SDK +> **Tip:** `ccc index` auto-initializes if you haven't run `ccc init` yet, so you +> can skip straight to indexing. -This is the canonical worked example of driving the SDK from Rust. The snippets -below mirror the live source — `rust/src/indexer.rs` is the whole flow in ~230 -lines — so treat the cited `file:line` anchors as the source of truth and update -this section whenever those change. +## Coding Agent Integration -**Crate + features** (`Cargo.toml`): one path/git dependency, feature-gated to -exactly what the tool needs. +### Skill -```toml -cocoindex = { features = ["text", "sqlite", "fastembed", "fs_live"] } -# text -> RecursiveSplitter + detect_code_language (tree-sitter) -# sqlite -> sqlite-vec (vec0) table target -# fastembed -> local sentence-transformers embeddings -# fs_live -> live directory watching (daemon) +Install the `ccc` skill so your coding agent automatically uses semantic search +when it helps: + +```bash +npx skills add cocoindex-io/cocoindex-code ``` -`use cocoindex::prelude::*;` pulls in `Ctx`, `Error`/`Result`, `FileEntry`, -`IdGenerator`, `walk_dir`, and the `mount_each!` macro. - -**1. Environment → App → run** — the entry point (`indexer.rs:206`). The -`Environment` owns the incremental-state DB and the dependency-injected -resources; `app.run` executes one declarative pass and returns `RunStats`. - -```rust -let app = cocoindex::Environment::builder() - .db_path(coco_db_path) // engine's change-tracking state DB - .provide_key(&DB, db) // inject resources by ContextKey - .provide_key(&EMBEDDER, embedder.clone()) - .build().await? - .app("CocoIndexCode").await?; -let stats: RunStats = app.run(move |ctx| app_main(ctx, /* … */)).await?; +The skill teaches the agent to initialize, index, and search on its own, and to +keep the index fresh as you work. Ask it to search the codebase — e.g. 
*"find how +user sessions are managed"* — or invoke it directly with `/ccc`. Requires the +`ccc` binary on your `PATH` (see [Install](#install)). + +### MCP Server + +Alternatively, run `ccc` as an MCP server over stdio: + +```bash +# Claude Code +claude mcp add cocoindex-code -- ccc mcp + +# Codex +codex mcp add cocoindex-code -- ccc mcp ``` -**2. Context keys = typed DI + change detection** (`indexer.rs:30`). `ContextKey` -values are fetched inside a flow with `ctx.get_key(&KEY)?`. `new_with_state` -attaches a state-id, so changing the underlying resource (e.g. the embedding -model) invalidates everything memoized against it. +Once configured, the agent decides when semantic search is helpful — finding code +by description, exploring unfamiliar code, or locating implementations without +knowing exact names. + +
+MCP Tool Reference + +Running as an MCP server (`ccc mcp`) exposes one tool: + +**`search`** — search the codebase by semantic similarity. -```rust -static EMBEDDER: LazyLock> = - LazyLock::new(|| ContextKey::new_with_state("embedder", |e| e.state_key())); ``` +search( + query: str, # natural-language query or code snippet + limit: int = 5, # max results (1–100) + offset: int = 0, # pagination offset + refresh_index: bool = True, # refresh the index before querying + languages: list[str] | None = None, # filter by language, e.g. ["python","rust"] + paths: list[str] | None = None, # filter by path glob, e.g. ["src/utils/*"] +) +``` + +Returns matching chunks with file path, language, code, line numbers, and a +similarity score. +
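The `search` signature above can be mirrored client-side before a call is sent. The following is a minimal sketch, not code from the `ccc` source: the `SearchArgs` type and its `validate` method are hypothetical names, and it assumes an out-of-range `limit` is rejected rather than clamped. Only the field names, defaults, and the 1 to 100 range come from the signature.

```rust
/// Hypothetical mirror of the `search` tool's arguments. Field names and
/// defaults follow the documented signature; the type itself is illustrative.
#[derive(Debug, Clone, PartialEq)]
struct SearchArgs {
    query: String,
    limit: u32,
    offset: u32,
    refresh_index: bool,
    languages: Option<Vec<String>>,
    paths: Option<Vec<String>>,
}

impl SearchArgs {
    /// Build arguments for `query` using the documented defaults.
    fn new(query: &str) -> Self {
        SearchArgs {
            query: query.to_string(),
            limit: 5,
            offset: 0,
            refresh_index: true,
            languages: None,
            paths: None,
        }
    }

    /// Check the documented range for `limit` (1 to 100, inclusive).
    /// Assumption: values outside the range are an error, not clamped.
    fn validate(&self) -> Result<(), String> {
        if !(1..=100).contains(&self.limit) {
            return Err(format!("limit must be in 1..=100, got {}", self.limit));
        }
        Ok(())
    }
}
```

A caller would construct `SearchArgs::new("find how user sessions are managed")`, adjust `limit` or `paths` as needed, and call `validate()` before serializing the arguments into the MCP request.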
+ +## CLI Reference -**3. Memoized functions** — `#[cocoindex::function]` (`indexer.rs:48`). The -arguments are part of the memo fingerprint: an unchanged `(file, model_tag)` is -skipped on the next run. (We thread the embedder's identity through `model_tag` -precisely so a model swap reprocesses every file.) +| Command | Description | +|---------|-------------| +| `ccc init` | Initialize a project — creates settings files, adds `.cocoindex_code/` to `.gitignore` | +| `ccc index` | Build or update the index (auto-inits if needed) | +| `ccc search ` | Semantic search across the codebase | +| `ccc status` | Show index stats (chunk count, file count, language breakdown) | +| `ccc mcp` | Run as an MCP server in stdio mode | +| `ccc doctor` | Run diagnostics — settings, daemon, model, file matching, index health (`-v` for detail) | +| `ccc reset` | Delete index databases. `--all` also removes settings. `-f` skips confirmation. | +| `ccc daemon status` | Show daemon version, uptime, and loaded projects | +| `ccc daemon restart` | Restart the background daemon | +| `ccc daemon stop` | Stop the daemon | -```rust -#[cocoindex::function] -async fn process_file(ctx: &Ctx, file: FileEntry, model_tag: String) - -> Result> { /* chunk + embed */ } +### Search options + +```bash +ccc search database schema # basic search +ccc search --lang python --lang markdown schema # filter by language +ccc search --path 'src/utils/*' query handler # filter by path glob +ccc search --offset 10 --limit 5 database schema # pagination +ccc search --refresh database schema # update index first, then search ``` -**4. Sources + fan-out** (`indexer.rs:169`). `walk_dir(...).path_matcher(...)` -yields `(key, FileEntry)` items (`file.key()`, `file.content_str()`); -`mount_each!` mounts the memoized fn once per item. +By default `ccc search` scopes results to your current working directory +(relative to the project root). Use `--path` to override. 
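The default scoping described above (current working directory, relative to the project root) can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: `default_scope` is a hypothetical helper, and the `dir/*` glob shape is borrowed from the `--path 'src/utils/*'` example rather than confirmed from the source.

```rust
use std::path::Path;

/// Hypothetical helper: derive a default path glob from the current working
/// directory, expressed relative to the project root.
/// Returns `None` when `cwd` is the root itself (no scoping needed) or lies
/// outside the project.
fn default_scope(project_root: &Path, cwd: &Path) -> Option<String> {
    // `strip_prefix` fails when `cwd` is not under `project_root`.
    let rel = cwd.strip_prefix(project_root).ok()?;
    if rel.as_os_str().is_empty() {
        None
    } else {
        // Assumption: the scope is expressed as `<relative dir>/*`.
        Some(format!("{}/*", rel.to_string_lossy()))
    }
}
```

Under these assumptions, running from `src/utils` inside the project yields the same filter as passing `--path 'src/utils/*'` explicitly, while running from the project root applies no path filter.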
+ +## Configuration + +Configuration lives in two YAML files, both created by `ccc init`. + +### User settings (`~/.cocoindex_code/global_settings.yml`) -```rust -let files = walk_dir(root).recursive(true).path_matcher(matcher).items()?; -let rows_by_file = - mount_each!(files, |file| process_file(ctx, file, model_tag.clone())).await?; +Shared across all projects — controls the embedding model. + +```yaml +embedding: + provider: sentence-transformers # local fastembed (the only supported provider) + model: BAAI/bge-small-en-v1.5 # any model in fastembed's registry + + # Optional asymmetric-retrieval knobs, applied separately to indexing vs query. + # Accepted key: prompt_name (sentence-transformers). + # indexing_params: + # prompt_name: passage + # query_params: + # prompt_name: query ``` -**5. Targets = declarative sync** (`indexer.rs:152`). You *declare* the desired -rows; the engine diffs against the previous run and applies the minimal -insert/update/delete. Rows are plain `Serialize` structs (`schema.rs::CodeChunk`). +> Set `COCOINDEX_CODE_DIR` to place `global_settings.yml` somewhere other than +> `~/.cocoindex_code/`. + +Models are resolved against fastembed's registry by name, then by suffix — so +`sentence-transformers/all-MiniLM-L6-v2` resolves. Cloud / LiteLLM providers are +not part of this build; a `provider: litellm` config loads but fails with a clear +message pointing at the local provider. + +### Project settings (`/.cocoindex_code/settings.yml`) -```rust -let table = sqlite::mount_table_target_with_options(&ctx, &DB, TABLE_NAME, schema, opts).await?; -for row in &rows { table.declare_row(&ctx, row)?; } +Per-project — controls which files are indexed. + +```yaml +include_patterns: + - "**/*.py" + - "**/*.ts" + - "**/*.rs" + - "**/*.go" + # ... sensible defaults for 28+ file types + +exclude_patterns: + - "**/.*" # hidden directories + - "**/node_modules" + - "**/dist" + # ... 
+ +language_overrides: + - ext: inc # treat .inc files as PHP + lang: php ``` -Schema is built with `TableSchema::new([(name, ColumnDef::new(ty)), …], [pk])`; -sqlite-vec virtual tables via `Vec0TableDef { partition_key_columns, auxiliary_columns }`. +Include/exclude globs additionally honor nested `.gitignore` files. +`.cocoindex_code/` is added to `.gitignore` during `init`. -**6. sqlite-vec gotcha** (`db.rs`). The SDK's `sqlite::Database::connect` does -*not* load the `vec0` extension. The tool registers it as a SQLite -auto-extension once, builds its own pool, and hands it to -`sqlite::Database::from_pool(state_id, pool)` — the supported way to use the SDK -sqlite target with `vec0` virtual tables. +## Supported languages -**7. Building blocks used from the SDK:** `ops::text::{RecursiveSplitter, -RecursiveChunkConfig, detect_code_language}` for chunking/language detection; -`IdGenerator::new()` + `id_gen.next_id(ctx, &code)` for stable chunk ids; -`RunStats` for the run summary; `Error::engine(..)` to wrap foreign errors into -the SDK error type. +Tree-sitter–based chunking for Python, JavaScript/TypeScript, Rust, Go, Java, +C/C++, C#, Ruby, PHP, Swift, Kotlin, Scala, SQL, Shell, Markdown, and more. +Unrecognized text files are indexed with a generic recursive splitter. -## CLI commands +## Differences from the Python build -`init`, `index`, `search` (`--lang` / `--path` / `--offset` / `--limit` / `--refresh`), -`status`, `reset` (`--all` / `-f`), `doctor` (`-v`), `mcp`, -`daemon status|restart|stop`, and the hidden `run-daemon`. +This native build targets feature parity with the Python `ccc` for day-to-day +use; two things differ today: -## Configuration +- **Embeddings are local-only** (fastembed). There is no LiteLLM / cloud-provider + option, and the default model is `BAAI/bge-small-en-v1.5`. +- **Custom Python chunkers** (`chunkers:` in project settings) are not supported — + the config still parses, but the built-in tree-sitter splitter is used. 
-Settings live in `~/.cocoindex_code/global_settings.yml` (embedding model, -provider, indexing/query params) and a per-project `.cocoindex_code/settings.yml` -(include/exclude patterns, language overrides). Include/exclude use the SDK's -`PatternFilePathMatcher`, wrapped to also honor nested `.gitignore` files. -`ccc doctor` prints the resolved configuration and where each value came from. - -## Testing - -- `cargo test` — the sqlite-vec `vec0` extension loads and KNN returns correct - results. -- `tests/e2e_cli.sh` / `tests/e2e_advanced.sh` — end-to-end coverage of - `init` → `index` → `search` (with `--lang`/`--path` filters and incremental - re-index), daemon lifecycle (auto-spawn, restart, stop, graceful shutdown), - multi-project serving, model-swap re-index, MCP (`initialize` / `tools/list` / - `tools/call`), `doctor`, and `reset --all`. - -## Limitations / follow-ups - -- **Embeddings**: local fastembed only — no cloud / multi-provider backend yet. -- **`init`** is flag-driven (`--model`) rather than interactive prompts. -- **Custom chunkers**: the built-in tree-sitter recursive splitter is used; - pluggable chunkers are not yet supported. -- **Live index-progress streaming** and container path-mapping env vars are - follow-ups. +Index databases are interchangeable: `ccc search` works against an index built by +the Python tool, and vice versa.