From 5c0b5dc06049cfe6db2ea7693506e5054b378a63 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?LJ=20=F0=9F=A5=A5=F0=9F=8C=B4?= <linghua@cocoindex.io>
Date: Sun, 21 Jun 2026 23:29:00 -0700
Subject: [PATCH 1/3] docs(rust): document how the port uses the CocoIndex Rust
 SDK
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add a "How it uses the CocoIndex Rust SDK" section to rust/PORTING.md — a
code-grounded walkthrough of the SDK API the port exercises (Environment/App/run,
ContextKey DI + change detection, #[cocoindex::function] memoization, walk_dir +
mount_each!, the sqlite/vec0 table target + declare_row, and the sqlite-vec
from_pool gotcha). Snippets cite live file:line anchors so the doc stays
verifiable against the source.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 rust/PORTING.md | 90 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)
diff --git a/rust/PORTING.md b/rust/PORTING.md
index ec89176..019005f 100644
--- a/rust/PORTING.md
+++ b/rust/PORTING.md
@@ -42,6 +42,96 @@ and auto-spawn the daemon on first use.
   - Models are limited to fastembed's supported set (resolved by name, then by
     suffix — so `sentence-transformers/all-MiniLM-L6-v2` works).
 
+## How it uses the CocoIndex Rust SDK
+
+This is the canonical worked example of driving the SDK from Rust. The snippets
+below mirror the live source — `rust/src/indexer.rs` is the whole flow in ~230
+lines — so treat the cited `file:line` anchors as the source of truth and update
+this section whenever those change.
+
+**Crate + features** (`Cargo.toml`): one path/git dependency, feature-gated to
+exactly what the tool needs.
+
+```toml
+cocoindex = { features = ["text", "sqlite", "fastembed", "fs_live"] }
+#   text      -> RecursiveSplitter + detect_code_language (tree-sitter)
+#   sqlite    -> sqlite-vec (vec0) table target
+#   fastembed -> local sentence-transformers embeddings
+#   fs_live   -> live directory watching (daemon)
+```
+
+`use cocoindex::prelude::*;` pulls in `Ctx`, `Error`/`Result`, `FileEntry`,
+`IdGenerator`, `walk_dir`, and the `mount_each!` macro.
+
+**1. Environment → App → run** — the entry point (`indexer.rs:206`). The
+`Environment` owns the incremental-state DB and the dependency-injected
+resources; `app.run` executes one declarative pass and returns `RunStats`.
+
+```rust
+let app = cocoindex::Environment::builder()
+    .db_path(coco_db_path)                 // engine's change-tracking state DB
+    .provide_key(&DB, db)                  // inject resources by ContextKey
+    .provide_key(&EMBEDDER, embedder.clone())
+    .build().await?
+    .app("CocoIndexCode").await?;
+let stats: RunStats = app.run(move |ctx| app_main(ctx, /* … */)).await?;
+```
+
+**2. Context keys = typed DI + change detection** (`indexer.rs:30`). `ContextKey`
+values are fetched inside a flow with `ctx.get_key(&KEY)?`. `new_with_state`
+attaches a state-id, so changing the underlying resource (e.g. the embedding
+model) invalidates everything memoized against it.
+
+```rust
+static EMBEDDER: LazyLock<ContextKey<CodeEmbedder>> =
+    LazyLock::new(|| ContextKey::new_with_state("embedder", |e| e.state_key()));
+```
+
+**3. Memoized functions** — `#[cocoindex::function]` (`indexer.rs:48`). The
+arguments are part of the memo fingerprint: an unchanged `(file, model_tag)` is
+skipped on the next run. (We thread the embedder's identity through `model_tag`
+precisely so a model swap reprocesses every file.)
+
+```rust
+#[cocoindex::function]
+async fn process_file(ctx: &Ctx, file: FileEntry, model_tag: String)
+    -> Result<Vec<CodeChunk>> { /* chunk + embed */ }
+```
+
+**4. Sources + fan-out** (`indexer.rs:169`). `walk_dir(...).path_matcher(...)`
+yields `(key, FileEntry)` items (`file.key()`, `file.content_str()`);
+`mount_each!` mounts the memoized fn once per item.
+
+```rust
+let files = walk_dir(root).recursive(true).path_matcher(matcher).items()?;
+let rows_by_file =
+    mount_each!(files, |file| process_file(ctx, file, model_tag.clone())).await?;
+```
+
+**5. Targets = declarative sync** (`indexer.rs:152`). You *declare* the desired
+rows; the engine diffs against the previous run and applies the minimal
+insert/update/delete. Rows are plain `Serialize` structs (`schema.rs::CodeChunk`).
+
+```rust
+let table = sqlite::mount_table_target_with_options(&ctx, &DB, TABLE_NAME, schema, opts).await?;
+for row in &rows { table.declare_row(&ctx, row)?; }
+```
+
+Schema is built with `TableSchema::new([(name, ColumnDef::new(ty)), …], [pk])`;
+sqlite-vec virtual tables via `Vec0TableDef { partition_key_columns, auxiliary_columns }`.
+
+**6. sqlite-vec gotcha** (`db.rs`). The SDK's `sqlite::Database::connect` does
+*not* load the `vec0` extension. The port registers it as a SQLite
+auto-extension once, builds its own pool, and hands it to
+`sqlite::Database::from_pool(state_id, pool)` — the supported way to use the SDK
+sqlite target with `vec0` virtual tables.
+
+**7. Building blocks used from the SDK:** `ops::text::{RecursiveSplitter,
+RecursiveChunkConfig, detect_code_language}` for chunking/language detection;
+`IdGenerator::new()` + `id_gen.next_id(ctx, &code)` for stable chunk ids;
+`RunStats` for the run summary; `Error::engine(..)` to wrap foreign errors into
+the SDK error type.
+
 ## Python → Rust module map
 
 | Python (`src/cocoindex_code`) | Rust (`rust/src`) | Status |

From 1af60f8b1ec308c2afc82475ec798a15c035a8bf Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?LJ=20=F0=9F=A5=A5=F0=9F=8C=B4?= <linghua@cocoindex.io>
Date: Sun, 21 Jun 2026 23:38:38 -0700
Subject: [PATCH 2/3] docs(rust): make it a standalone Rust README (drop Python
 framing)

Rename rust/PORTING.md -> rust/README.md and rewrite it as a Rust-only usage
doc: drop the Python->Rust module map, the parity audit, the Python
backward-compat section, and the "vs Python" deltas. Keep build/run,
architecture, the "How it uses the CocoIndex Rust SDK" walkthrough, and the CLI
commands; add a Configuration section; and recast the test notes and known
deltas as plain Testing and Limitations / follow-ups lists. Update the e2e
fixture that copies the doc as a sample markdown file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 rust/PORTING.md            | 210 -------------------------------------
 rust/README.md             | 159 ++++++++++++++++++++++++++++
 rust/tests/e2e_advanced.sh |   2 +-
 3 files changed, 160 insertions(+), 211 deletions(-)
 delete mode 100644 rust/PORTING.md
 create mode 100644 rust/README.md

diff --git a/rust/PORTING.md b/rust/PORTING.md
deleted file mode 100644
index 019005f..0000000
--- a/rust/PORTING.md
+++ /dev/null
@@ -1,210 +0,0 @@
-# Rust port of cocoindex-code
-
-A from-scratch Rust reimplementation of `cocoindex-code` (the `ccc` CLI), built
-on the **CocoIndex Rust SDK** (`cocoindex-io/cocoindex` → `rust/sdk/cocoindex`).
-Feature parity with the Python implementation in `../src/cocoindex_code`, which
-is kept in the repo as the reference spec.
-
-## Build & run
-
-```bash
-cd rust
-cargo build       # builds everything (fastembed/ONNX is always on — it's the only embedder)
-cargo test        # sqlite-vec (vec0) integration test
-
-./target/debug/ccc init
-./target/debug/ccc index
-./target/debug/ccc search "vector similarity" --lang rust --limit 10
-```
-
-The SDK is a **path dependency** assuming `cocoindex` is checked out as a
-sibling (`../../cocoindex`). For distribution this should become a git
-dependency on `cocoindex-io/cocoindex` (the `v1` branch).
-
-## Architecture
-
-Like the Python tool, the CLI is a thin **client** that talks to a background
-**daemon** over a Unix socket; the daemon keeps the embedding model warm and
-caches per-project state. `index`/`search`/`status`/`doctor` are daemon-backed
-and auto-spawn the daemon on first use.
-
-- **IPC**: length-prefixed msgpack frames over `daemon.sock` (Rust-to-Rust; not
-  wire-compatible with the Python daemon's `multiprocessing.connection`).
-- **Embeddings**: **local sentence-transformers (fastembed) only.** Python also
-  offers a `litellm` provider for cloud/multi-provider embeddings; there is no
-  viable in-process Rust equivalent (the official `LiteLLM-Labs/litellm-rust` is
-  a gateway binary, not a library; the community `litellm-rust` crate is alpha
-  and only covers OpenAI-compatible embeddings), so the litellm option is
-  intentionally not exposed. Existing `provider: litellm` configs parse fine and
-  produce a clear error pointing at the local provider.
-  - Default model: `BAAI/bge-small-en-v1.5` (Python's
-    `Snowflake/snowflake-arctic-embed-xs` isn't in fastembed's registry).
-  - Models are limited to fastembed's supported set (resolved by name, then by
-    suffix — so `sentence-transformers/all-MiniLM-L6-v2` works).
-
-## How it uses the CocoIndex Rust SDK
-
-This is the canonical worked example of driving the SDK from Rust. The snippets
-below mirror the live source — `rust/src/indexer.rs` is the whole flow in ~230
-lines — so treat the cited `file:line` anchors as the source of truth and update
-this section whenever those change.
-
-**Crate + features** (`Cargo.toml`): one path/git dependency, feature-gated to
-exactly what the tool needs.
-
-```toml
-cocoindex = { features = ["text", "sqlite", "fastembed", "fs_live"] }
-#   text      -> RecursiveSplitter + detect_code_language (tree-sitter)
-#   sqlite    -> sqlite-vec (vec0) table target
-#   fastembed -> local sentence-transformers embeddings
-#   fs_live   -> live directory watching (daemon)
-```
-
-`use cocoindex::prelude::*;` pulls in `Ctx`, `Error`/`Result`, `FileEntry`,
-`IdGenerator`, `walk_dir`, and the `mount_each!` macro.
-
-**1. Environment → App → run** — the entry point (`indexer.rs:206`). The
-`Environment` owns the incremental-state DB and the dependency-injected
-resources; `app.run` executes one declarative pass and returns `RunStats`.
-
-```rust
-let app = cocoindex::Environment::builder()
-    .db_path(coco_db_path)                 // engine's change-tracking state DB
-    .provide_key(&DB, db)                  // inject resources by ContextKey
-    .provide_key(&EMBEDDER, embedder.clone())
-    .build().await?
-    .app("CocoIndexCode").await?;
-let stats: RunStats = app.run(move |ctx| app_main(ctx, /* … */)).await?;
-```
-
-**2. Context keys = typed DI + change detection** (`indexer.rs:30`). `ContextKey`
-values are fetched inside a flow with `ctx.get_key(&KEY)?`. `new_with_state`
-attaches a state-id, so changing the underlying resource (e.g. the embedding
-model) invalidates everything memoized against it.
-
-```rust
-static EMBEDDER: LazyLock<ContextKey<CodeEmbedder>> =
-    LazyLock::new(|| ContextKey::new_with_state("embedder", |e| e.state_key()));
-```
-
-**3. Memoized functions** — `#[cocoindex::function]` (`indexer.rs:48`). The
-arguments are part of the memo fingerprint: an unchanged `(file, model_tag)` is
-skipped on the next run. (We thread the embedder's identity through `model_tag`
-precisely so a model swap reprocesses every file.)
-
-```rust
-#[cocoindex::function]
-async fn process_file(ctx: &Ctx, file: FileEntry, model_tag: String)
-    -> Result<Vec<CodeChunk>> { /* chunk + embed */ }
-```
-
-**4. Sources + fan-out** (`indexer.rs:169`). `walk_dir(...).path_matcher(...)`
-yields `(key, FileEntry)` items (`file.key()`, `file.content_str()`);
-`mount_each!` mounts the memoized fn once per item.
-
-```rust
-let files = walk_dir(root).recursive(true).path_matcher(matcher).items()?;
-let rows_by_file =
-    mount_each!(files, |file| process_file(ctx, file, model_tag.clone())).await?;
-```
-
-**5. Targets = declarative sync** (`indexer.rs:152`). You *declare* the desired
-rows; the engine diffs against the previous run and applies the minimal
-insert/update/delete. Rows are plain `Serialize` structs (`schema.rs::CodeChunk`).
-
-```rust
-let table = sqlite::mount_table_target_with_options(&ctx, &DB, TABLE_NAME, schema, opts).await?;
-for row in &rows { table.declare_row(&ctx, row)?; }
-```
-
-Schema is built with `TableSchema::new([(name, ColumnDef::new(ty)), …], [pk])`;
-sqlite-vec virtual tables via `Vec0TableDef { partition_key_columns, auxiliary_columns }`.
-
-**6. sqlite-vec gotcha** (`db.rs`). The SDK's `sqlite::Database::connect` does
-*not* load the `vec0` extension. The port registers it as a SQLite
-auto-extension once, builds its own pool, and hands it to
-`sqlite::Database::from_pool(state_id, pool)` — the supported way to use the SDK
-sqlite target with `vec0` virtual tables.
-
-**7. Building blocks used from the SDK:** `ops::text::{RecursiveSplitter,
-RecursiveChunkConfig, detect_code_language}` for chunking/language detection;
-`IdGenerator::new()` + `id_gen.next_id(ctx, &code)` for stable chunk ids;
-`RunStats` for the run summary; `Error::engine(..)` to wrap foreign errors into
-the SDK error type.
-
-## Python → Rust module map
-
-| Python (`src/cocoindex_code`) | Rust (`rust/src`) | Status |
-|---|---|---|
-| `schema.py` / `CodeChunk` | `schema.rs` | ✅ |
-| `settings.py` | `settings.rs` | ✅ (container path-mapping env vars deferred) |
-| `embedder_params.py` + `embedder_defaults.py` | `embedder_params.rs` | ✅ |
-| `litellm_embedder.py` / `shared.create_embedder` | `embedder.rs` | ✅ (local `prompt_name` TODO) |
-| `indexer.py` | `indexer.rs` + `walk.rs` | ✅ (nested `.gitignore`, custom chunkers TODO) |
-| `query.py` | `query.rs` | ✅ |
-| `project.py` | `project.rs` + `daemon.rs` (registry) | ✅ |
-| `protocol.py` | `protocol.rs` | ✅ |
-| `_daemon_paths.py` | `daemon_paths.rs` | ✅ |
-| `daemon.py` | `daemon.rs` | ✅ |
-| `client.py` | `client.rs` | ✅ |
-| `server.py` (MCP) | `mcp.rs` | ✅ (hand-rolled stdio JSON-RPC) |
-| `cli.py` | `main.rs` | ✅ (interactive `init` prompts → flags) |
-
-## CLI commands (parity)
-
-`init`, `index`, `search` (`--lang`/`--path`/`--offset`/`--limit`/`--refresh`),
-`status`, `reset` (`--all`/`-f`), `doctor` (`-v`), `mcp`, `daemon status|restart|stop`,
-and the hidden `run-daemon`.
-
-## Tested (all green)
-
-- `cargo test`: sqlite-vec `vec0` extension loads + KNN returns correct results.
-- **End-to-end (local embeddings)**: `init` → `index` (walk → tree-sitter chunk
-  → embed → vec0 upsert) → `search` with `--lang`/`--path` filters → incremental
-  re-index correctly skips unchanged files.
-- **Daemon-backed lifecycle**: `index` auto-spawns the daemon (loads model once),
-  `daemon status`/`restart`/`stop`, graceful shutdown, PID/socket cleanup.
-- **MCP**: `initialize` / `tools/list` / `tools/call search` over stdio JSON-RPC.
-- **doctor** (global settings, daemon, model checks, project settings, file walk,
-  index status), **reset --all**, and post-reset "not initialized" handling.
-
-## Backward compatibility
-
-- **Settings files** (`global_settings.yml`, project `settings.yml`) written by
-  the Python tool parse unchanged — same keys, `provider` default (`litellm`),
-  `indexing_params`/`query_params` (absent vs empty), `envs`, and the legacy
-  `sbert/` model-name prefix (stripped before loading).
-- **`provider: litellm`** configs do not crash — they load and return a clear
-  "only local embeddings are supported; set `provider: sentence-transformers`"
-  error (surfaced through the daemon).
-- **Index DB**: the `target_sqlite.db` vec0 schema is identical, so `search`
-  works against a Python-built index. The CocoIndex state db (`cocoindex.db`)
-  differs across engine builds, so the first `index` re-runs (safe/incremental).
-- **`.cocoindex_code/` layout**, paths, and the `.gitignore` entry match Python.
-
-## Parity audit (module-by-module) — fixed
-
-A deep Python-vs-Rust audit drove these fixes (all tested): search/status now
-**auto-start load-time indexing and wait** (`ensure_indexing_started`); include/
-exclude use the SDK's `PatternFilePathMatcher` for **exact** pattern parity, with
-a gitignore-aware wrapper; `init` restores the **"already initialized"** message
-and the **parent-marker warning** (`-f` to override); `reset --all` removes the
-`.gitignore` entry and prints the settings hint; `doctor` regained the
-**daemon-env section**, include/exclude pattern values, the `params:` line, the
-traceback hint, and the log line; the client gained **supervised-mode**
-(`COCOINDEX_CODE_DAEMON_SUPERVISED`), handshake-warning dedup, and PID-guarded
-cleanup; settings gained the empty-file check and absolutized project-root walk;
-the MCP tool descriptions match `server.py`.
-
-## Known deltas vs Python (intentional / follow-up)
-
-1. **Embeddings** — local fastembed only; the `litellm` provider is not exposed
-   (no viable in-process Rust litellm). Default model differs (see above).
-2. **Interactive `init`** — flag-driven (`--model`) instead of questionary prompts.
-3. **Custom chunkers** — Python loads `module:callable` chunkers; Rust can't load
-   Python callables (config still parses; built-in splitter used).
-4. **Legacy `cocoindex-code` entrypoint** + env-var migration
-   (`COCOINDEX_CODE_EMBEDDING_MODEL`, …) — not ported (the `ccc` CLI is the
-   entry point).
-5. local-embedding `prompt_name`, container path-mapping env vars, and live
-   index-progress streaming (`IndexProgressUpdate`) — follow-ups.
diff --git a/rust/README.md b/rust/README.md
new file mode 100644
index 0000000..0a6a0d3
--- /dev/null
+++ b/rust/README.md
@@ -0,0 +1,159 @@
+# ccc — semantic code search (Rust)
+
+A lightweight, AST-aware semantic code search engine (the `ccc` CLI) built on the
+**CocoIndex Rust SDK**. It walks a codebase, chunks each file with tree-sitter,
+embeds the chunks locally, and stores them in a sqlite-vec (`vec0`) table for
+fast vector search — from the CLI or over MCP.
+
+## Build & run
+
+```bash
+cd rust
+cargo build       # fastembed/ONNX is always on — local embeddings are the only backend
+cargo test        # sqlite-vec (vec0) integration test
+
+./target/debug/ccc init
+./target/debug/ccc index
+./target/debug/ccc search "vector similarity" --lang rust --limit 10
+```
+
+The SDK is a **path dependency** assuming `cocoindex` is checked out as a sibling
+(`../../cocoindex`). For distribution this should become a git dependency on
+`cocoindex-io/cocoindex` (the `v1` branch).
+
+## Architecture
+
+The CLI is a thin **client** that talks to a background **daemon** over a Unix
+socket; the daemon keeps the embedding model warm and caches per-project state.
+`index` / `search` / `status` / `doctor` are daemon-backed and auto-spawn the
+daemon on first use.
+
+- **IPC**: length-prefixed msgpack frames over `daemon.sock`.
+- **Embeddings**: local sentence-transformers via **fastembed** (ONNX). Default
+  model `BAAI/bge-small-en-v1.5`; any model in fastembed's registry works
+  (resolved by name, then by suffix, so `sentence-transformers/all-MiniLM-L6-v2`
+  resolves).
+- **Storage**: a sqlite-vec (`vec0`) virtual table, partitioned by `language`.
+
+## How it uses the CocoIndex Rust SDK
+
+This is the canonical worked example of driving the SDK from Rust. The snippets
+below mirror the live source — `rust/src/indexer.rs` is the whole flow in ~230
+lines — so treat the cited `file:line` anchors as the source of truth and update
+this section whenever those change.
+
+**Crate + features** (`Cargo.toml`): one path/git dependency, feature-gated to
+exactly what the tool needs.
+
+```toml
+cocoindex = { features = ["text", "sqlite", "fastembed", "fs_live"] }
+#   text      -> RecursiveSplitter + detect_code_language (tree-sitter)
+#   sqlite    -> sqlite-vec (vec0) table target
+#   fastembed -> local sentence-transformers embeddings
+#   fs_live   -> live directory watching (daemon)
+```
+
+`use cocoindex::prelude::*;` pulls in `Ctx`, `Error`/`Result`, `FileEntry`,
+`IdGenerator`, `walk_dir`, and the `mount_each!` macro.
+
+**1. Environment → App → run** — the entry point (`indexer.rs:206`). The
+`Environment` owns the incremental-state DB and the dependency-injected
+resources; `app.run` executes one declarative pass and returns `RunStats`.
+
+```rust
+let app = cocoindex::Environment::builder()
+    .db_path(coco_db_path)                 // engine's change-tracking state DB
+    .provide_key(&DB, db)                  // inject resources by ContextKey
+    .provide_key(&EMBEDDER, embedder.clone())
+    .build().await?
+    .app("CocoIndexCode").await?;
+let stats: RunStats = app.run(move |ctx| app_main(ctx, /* … */)).await?;
+```
+
+**2. Context keys = typed DI + change detection** (`indexer.rs:30`). `ContextKey`
+values are fetched inside a flow with `ctx.get_key(&KEY)?`. `new_with_state`
+attaches a state-id, so changing the underlying resource (e.g. the embedding
+model) invalidates everything memoized against it.
+
+```rust
+static EMBEDDER: LazyLock<ContextKey<CodeEmbedder>> =
+    LazyLock::new(|| ContextKey::new_with_state("embedder", |e| e.state_key()));
+```
+
+**3. Memoized functions** — `#[cocoindex::function]` (`indexer.rs:48`). The
+arguments are part of the memo fingerprint: an unchanged `(file, model_tag)` is
+skipped on the next run. (We thread the embedder's identity through `model_tag`
+precisely so a model swap reprocesses every file.)
+
+```rust
+#[cocoindex::function]
+async fn process_file(ctx: &Ctx, file: FileEntry, model_tag: String)
+    -> Result<Vec<CodeChunk>> { /* chunk + embed */ }
+```
+
+**4. Sources + fan-out** (`indexer.rs:169`). `walk_dir(...).path_matcher(...)`
+yields `(key, FileEntry)` items (`file.key()`, `file.content_str()`);
+`mount_each!` mounts the memoized fn once per item.
+
+```rust
+let files = walk_dir(root).recursive(true).path_matcher(matcher).items()?;
+let rows_by_file =
+    mount_each!(files, |file| process_file(ctx, file, model_tag.clone())).await?;
+```
+
+**5. Targets = declarative sync** (`indexer.rs:152`). You *declare* the desired
+rows; the engine diffs against the previous run and applies the minimal
+insert/update/delete. Rows are plain `Serialize` structs (`schema.rs::CodeChunk`).
+
+```rust
+let table = sqlite::mount_table_target_with_options(&ctx, &DB, TABLE_NAME, schema, opts).await?;
+for row in &rows { table.declare_row(&ctx, row)?; }
+```
+
+Schema is built with `TableSchema::new([(name, ColumnDef::new(ty)), …], [pk])`;
+sqlite-vec virtual tables via `Vec0TableDef { partition_key_columns, auxiliary_columns }`.
+
+**6. sqlite-vec gotcha** (`db.rs`). The SDK's `sqlite::Database::connect` does
+*not* load the `vec0` extension. The tool registers it as a SQLite
+auto-extension once, builds its own pool, and hands it to
+`sqlite::Database::from_pool(state_id, pool)` — the supported way to use the SDK
+sqlite target with `vec0` virtual tables.
+
+**7. Building blocks used from the SDK:** `ops::text::{RecursiveSplitter,
+RecursiveChunkConfig, detect_code_language}` for chunking/language detection;
+`IdGenerator::new()` + `id_gen.next_id(ctx, &code)` for stable chunk ids;
+`RunStats` for the run summary; `Error::engine(..)` to wrap foreign errors into
+the SDK error type.
+
+## CLI commands
+
+`init`, `index`, `search` (`--lang` / `--path` / `--offset` / `--limit` / `--refresh`),
+`status`, `reset` (`--all` / `-f`), `doctor` (`-v`), `mcp`,
+`daemon status|restart|stop`, and the hidden `run-daemon`.
+
+## Configuration
+
+Settings live in `~/.cocoindex_code/global_settings.yml` (embedding model,
+provider, indexing/query params) and a per-project `.cocoindex_code/settings.yml`
+(include/exclude patterns, language overrides). Include/exclude use the SDK's
+`PatternFilePathMatcher`, wrapped to also honor nested `.gitignore` files.
+`ccc doctor` prints the resolved configuration and where each value came from.
+
+## Testing
+
+- `cargo test` — the sqlite-vec `vec0` extension loads and KNN returns correct
+  results.
+- `tests/e2e_cli.sh` / `tests/e2e_advanced.sh` — end-to-end coverage of
+  `init` → `index` → `search` (with `--lang`/`--path` filters and incremental
+  re-index), daemon lifecycle (auto-spawn, restart, stop, graceful shutdown),
+  multi-project serving, model-swap re-index, MCP (`initialize` / `tools/list` /
+  `tools/call`), `doctor`, and `reset --all`.
+
+## Limitations / follow-ups
+
+- **Embeddings**: local fastembed only — no cloud / multi-provider backend yet.
+- **`init`** is flag-driven (`--model`) rather than interactive prompts.
+- **Custom chunkers**: the built-in tree-sitter recursive splitter is used;
+  pluggable chunkers are not yet supported.
+- **Live index-progress streaming** and container path-mapping env vars are
+  follow-ups.
diff --git a/rust/tests/e2e_advanced.sh b/rust/tests/e2e_advanced.sh
index 3cf648f..da32bff 100755
--- a/rust/tests/e2e_advanced.sh
+++ b/rust/tests/e2e_advanced.sh
@@ -103,7 +103,7 @@ echo "### F. Real Rust codebase (the port's own src) — multi-language"
 P="$ROOT/realrust"; mkdir -p "$P/src"
 cp "$REPO"/rust/src/*.rs "$P/src/"
 cp "$REPO"/rust/Cargo.toml "$P/"
-cp "$REPO"/rust/PORTING.md "$P/"
+cp "$REPO"/rust/README.md "$P/"
 cd "$P"; $BIN init >/dev/null 2>&1; rr=$($BIN index 2>&1 | grep -E "rust:|toml:|markdown:")
 has "indexed rust files"     "rust:"     "$rr"
 has "indexed toml"           "toml:"     "$rr"

From 20b83310fe5dd59acfc569d454a9baec1a39e0e6 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?LJ=20=F0=9F=A5=A5=F0=9F=8C=B4?= <linghua@cocoindex.io>
Date: Sun, 21 Jun 2026 23:52:20 -0700
Subject: [PATCH 3/3] docs(rust): rewrite README as a user guide (install / CLI
 / MCP / config)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Drop the SDK-internals walkthrough. The Rust README now mirrors the main
cocoindex-code README's user-facing structure — Install (build from source),
Quick start, Coding Agent Integration (Skill + MCP), CLI Reference, Search
options, MCP tool reference, Configuration (user/project settings), Supported
languages, and a short "Differences from the Python build" note (local-only
embeddings; no custom Python chunkers).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 rust/README.md | 304 +++++++++++++++++++++++++++++--------------------
 1 file changed, 181 insertions(+), 123 deletions(-)

diff --git a/rust/README.md b/rust/README.md
index 0a6a0d3..4683d3a 100644
--- a/rust/README.md
+++ b/rust/README.md
@@ -1,159 +1,217 @@
-# ccc — semantic code search (Rust)
+# cocoindex-code (Rust) — AST-based semantic code search
 
-A lightweight, AST-aware semantic code search engine (the `ccc` CLI) built on the
-**CocoIndex Rust SDK**. It walks a codebase, chunks each file with tree-sitter,
-embeds the chunks locally, and stores them in a sqlite-vec (`vec0`) table for
-fast vector search — from the CLI or over MCP.
+A lightweight, effective, **AST-based** semantic code search tool for your
+codebase — the native-Rust build of [`ccc`](https://github.com/cocoindex-io/cocoindex-code).
+Built on [CocoIndex](https://github.com/cocoindex-io/cocoindex), the Rust data
+transformation engine. Use it from the CLI, or wire it into Claude Code, Codex,
+Cursor — any coding agent — via [Skill](#coding-agent-integration) or
+[MCP](#mcp-server).
 
-## Build & run
+- Instant token savings — let the agent find code by meaning, not grep.
+- **Local embeddings, zero setup** — runs fully offline, no API key required.
+- **Incremental** — only re-indexes changed files.
+
+## Features
+
+- **Semantic code search** — find relevant code with natural-language queries
+  when grep falls short.
+- **Ultra performant** — a single static binary on top of the Rust
+  [CocoIndex](https://github.com/cocoindex-io/cocoindex) engine; only changed
+  files are re-indexed.
+- **Multi-language** — Python, JavaScript/TypeScript, Rust, Go, Java, C/C++, C#,
+  SQL, Shell, and more (tree-sitter).
+- **Embedded** — a sqlite-vec index file; no database to run.
+- **Local embeddings** — sentence-transformers via [fastembed](https://github.com/Anush008/fastembed-rs)
+  (ONNX), no API key, no Python.
+
+## Install
+
+The Rust build is compiled from source. It depends on the CocoIndex SDK as a
+sibling checkout, so clone both repos side by side:
 
 ```bash
-cd rust
-cargo build       # fastembed/ONNX is always on — local embeddings are the only backend
-cargo test        # sqlite-vec (vec0) integration test
+git clone https://github.com/cocoindex-io/cocoindex
+git clone -b rust https://github.com/cocoindex-io/cocoindex-code
+
+cd cocoindex-code/rust
+cargo build --release
 
-./target/debug/ccc init
-./target/debug/ccc index
-./target/debug/ccc search "vector similarity" --lang rust --limit 10
+# put the binary on your PATH (or use `cargo install --path .`)
+install -m 0755 target/release/ccc ~/.local/bin/ccc
+ccc --help
 ```
 
-The SDK is a **path dependency** assuming `cocoindex` is checked out as a sibling
-(`../../cocoindex`). For distribution this should become a git dependency on
-`cocoindex-io/cocoindex` (the `v1` branch).
+Embeddings are **local-only** (fastembed/ONNX) — no cloud provider or API key is
+required or supported in this build. The default model is
+[`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5); any
+model in fastembed's registry can be selected (see [Configuration](#configuration)).
 
-## Architecture
+## Quick start
 
-The CLI is a thin **client** that talks to a background **daemon** over a Unix
-socket; the daemon keeps the embedding model warm and caches per-project state.
-`index` / `search` / `status` / `doctor` are daemon-backed and auto-spawn the
-daemon on first use.
+```bash
+ccc init                                # initialize project (creates settings)
+ccc index                               # build the index
+ccc search "authentication logic"       # search!
+```
 
-- **IPC**: length-prefixed msgpack frames over `daemon.sock`.
-- **Embeddings**: local sentence-transformers via **fastembed** (ONNX). Default
-  model `BAAI/bge-small-en-v1.5`; any model in fastembed's registry works
-  (resolved by name, then by suffix, so `sentence-transformers/all-MiniLM-L6-v2`
-  resolves).
-- **Storage**: a sqlite-vec (`vec0`) virtual table, partitioned by `language`.
+The background daemon starts automatically on first use and keeps the embedding
+model warm.
 
-## How it uses the CocoIndex Rust SDK
+> **Tip:** `ccc index` auto-initializes if you haven't run `ccc init` yet, so you
+> can skip straight to indexing.
 
-This is the canonical worked example of driving the SDK from Rust. The snippets
-below mirror the live source — `rust/src/indexer.rs` is the whole flow in ~230
-lines — so treat the cited `file:line` anchors as the source of truth and update
-this section whenever those change.
+## Coding Agent Integration
 
-**Crate + features** (`Cargo.toml`): one path/git dependency, feature-gated to
-exactly what the tool needs.
+### Skill
 
-```toml
-cocoindex = { features = ["text", "sqlite", "fastembed", "fs_live"] }
-#   text      -> RecursiveSplitter + detect_code_language (tree-sitter)
-#   sqlite    -> sqlite-vec (vec0) table target
-#   fastembed -> local sentence-transformers embeddings
-#   fs_live   -> live directory watching (daemon)
+Install the `ccc` skill so your coding agent automatically uses semantic search
+when it helps:
+
+```bash
+npx skills add cocoindex-io/cocoindex-code
 ```
 
-`use cocoindex::prelude::*;` pulls in `Ctx`, `Error`/`Result`, `FileEntry`,
-`IdGenerator`, `walk_dir`, and the `mount_each!` macro.
-
-**1. Environment → App → run** — the entry point (`indexer.rs:206`). The
-`Environment` owns the incremental-state DB and the dependency-injected
-resources; `app.run` executes one declarative pass and returns `RunStats`.
-
-```rust
-let app = cocoindex::Environment::builder()
-    .db_path(coco_db_path)                 // engine's change-tracking state DB
-    .provide_key(&DB, db)                  // inject resources by ContextKey
-    .provide_key(&EMBEDDER, embedder.clone())
-    .build().await?
-    .app("CocoIndexCode").await?;
-let stats: RunStats = app.run(move |ctx| app_main(ctx, /* … */)).await?;
+The skill teaches the agent to initialize, index, and search on its own, and to
+keep the index fresh as you work. Ask it to search the codebase (e.g. *"find how
+user sessions are managed"*) or invoke it directly with `/ccc`. The skill needs
+the `ccc` binary on your `PATH` (see [Install](#install)).
+
+### MCP Server
+
+Alternatively, run `ccc` as an MCP server over stdio:
+
+```bash
+# Claude Code
+claude mcp add cocoindex-code -- ccc mcp
+
+# Codex
+codex mcp add cocoindex-code -- ccc mcp
 ```
 
-**2. Context keys = typed DI + change detection** (`indexer.rs:30`). `ContextKey`
-values are fetched inside a flow with `ctx.get_key(&KEY)?`. `new_with_state`
-attaches a state-id, so changing the underlying resource (e.g. the embedding
-model) invalidates everything memoized against it.
+Once configured, the agent decides when semantic search is helpful — finding code
+by description, exploring unfamiliar code, or locating implementations without
+knowing exact names.
+
+<details>
+<summary>MCP Tool Reference</summary>
+
+Running as an MCP server (`ccc mcp`) exposes one tool:
+
+**`search`** — search the codebase by semantic similarity.
 
-```rust
-static EMBEDDER: LazyLock<ContextKey<CodeEmbedder>> =
-    LazyLock::new(|| ContextKey::new_with_state("embedder", |e| e.state_key()));
 ```
+search(
+    query: str,                          # natural-language query or code snippet
+    limit: int = 5,                      # max results (1–100)
+    offset: int = 0,                     # pagination offset
+    refresh_index: bool = True,          # refresh the index before querying
+    languages: list[str] | None = None,  # filter by language, e.g. ["python","rust"]
+    paths: list[str] | None = None,      # filter by path glob, e.g. ["src/utils/*"]
+)
+```
+
+Returns matching chunks with file path, language, code, line numbers, and a
+similarity score.
+</details>
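+
+MCP clients normally build the tool call for you. As a rough sketch, assuming the
+standard MCP JSON-RPC `tools/call` envelope, a `search` invocation could be
+assembled as below. `build_search_request` is a hypothetical helper written for
+illustration and is not part of `ccc`; a real session also performs the MCP
+`initialize` handshake before calling any tool.
+
+```python
+import json
+
+
+def build_search_request(query, request_id=1, **options):
+    """Build a JSON-RPC 2.0 `tools/call` request for the `search` tool.
+
+    Hypothetical helper: only the argument names (query, limit, offset,
+    refresh_index, languages, paths) come from the tool reference.
+    """
+    arguments = {"query": query, **options}
+    return {
+        "jsonrpc": "2.0",
+        "id": request_id,
+        "method": "tools/call",
+        "params": {"name": "search", "arguments": arguments},
+    }
+
+
+request = build_search_request(
+    "find how user sessions are managed",
+    limit=10,
+    languages=["python", "rust"],
+    paths=["src/utils/*"],
+)
+print(json.dumps(request, indent=2))
+```
+
+Arguments left out of the request (here `offset` and `refresh_index`) fall back to
+the defaults listed in the tool reference.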
+
+## CLI Reference
 
-**3. Memoized functions** — `#[cocoindex::function]` (`indexer.rs:48`). The
-arguments are part of the memo fingerprint: an unchanged `(file, model_tag)` is
-skipped on the next run. (We thread the embedder's identity through `model_tag`
-precisely so a model swap reprocesses every file.)
+| Command | Description |
+|---------|-------------|
+| `ccc init` | Initialize a project — creates settings files, adds `.cocoindex_code/` to `.gitignore` |
+| `ccc index` | Build or update the index (auto-inits if needed) |
+| `ccc search <query>` | Semantic search across the codebase |
+| `ccc status` | Show index stats (chunk count, file count, language breakdown) |
+| `ccc mcp` | Run as an MCP server in stdio mode |
+| `ccc doctor` | Run diagnostics — settings, daemon, model, file matching, index health (`-v` for detail) |
+| `ccc reset` | Delete index databases. `--all` also removes settings. `-f` skips confirmation. |
+| `ccc daemon status` | Show daemon version, uptime, and loaded projects |
+| `ccc daemon restart` | Restart the background daemon |
+| `ccc daemon stop` | Stop the daemon |
 
-```rust
-#[cocoindex::function]
-async fn process_file(ctx: &Ctx, file: FileEntry, model_tag: String)
-    -> Result<Vec<CodeChunk>> { /* chunk + embed */ }
+### Search options
+
+```bash
+ccc search database schema                           # basic search
+ccc search --lang python --lang markdown schema      # filter by language
+ccc search --path 'src/utils/*' query handler        # filter by path glob
+ccc search --offset 10 --limit 5 database schema     # pagination
+ccc search --refresh database schema                 # update index first, then search
 ```
 
-**4. Sources + fan-out** (`indexer.rs:169`). `walk_dir(...).path_matcher(...)`
-yields `(key, FileEntry)` items (`file.key()`, `file.content_str()`);
-`mount_each!` mounts the memoized fn once per item.
+By default `ccc search` scopes results to your current working directory
+(relative to the project root). Use `--path` to override.
+
+## Configuration
+
+Configuration lives in two YAML files, both created by `ccc init`.
+
+### User settings (`~/.cocoindex_code/global_settings.yml`)
 
-```rust
-let files = walk_dir(root).recursive(true).path_matcher(matcher).items()?;
-let rows_by_file =
-    mount_each!(files, |file| process_file(ctx, file, model_tag.clone())).await?;
+Shared across all projects — controls the embedding model.
+
+```yaml
+embedding:
+  provider: sentence-transformers          # local fastembed (the only supported provider)
+  model: BAAI/bge-small-en-v1.5            # any model in fastembed's registry
+
+  # Optional asymmetric-retrieval knobs, applied separately to indexing vs query.
+  # Accepted key: prompt_name (sentence-transformers).
+  # indexing_params:
+  #   prompt_name: passage
+  # query_params:
+  #   prompt_name: query
 ```
 
-**5. Targets = declarative sync** (`indexer.rs:152`). You *declare* the desired
-rows; the engine diffs against the previous run and applies the minimal
-insert/update/delete. Rows are plain `Serialize` structs (`schema.rs::CodeChunk`).
+> Set `COCOINDEX_CODE_DIR` to place `global_settings.yml` somewhere other than
+> `~/.cocoindex_code/`.
+
+Models are resolved against fastembed's registry by name, then by suffix — so
+`sentence-transformers/all-MiniLM-L6-v2` resolves. Cloud / LiteLLM providers are
+not part of this build; a `provider: litellm` config loads but fails with a clear
+message pointing at the local provider.
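+
+The name-then-suffix lookup can be pictured with the sketch below. It is an
+illustration under stated assumptions, not the tool's actual code: it assumes
+"suffix" means the final `/`-separated component of the model name, and the
+registry entries are made up for the example.
+
+```python
+def resolve_model(requested, registry):
+    """Return the registry entry matching `requested`, or None if nothing matches."""
+    # 1. Prefer an exact name match.
+    if requested in registry:
+        return requested
+    # 2. Otherwise compare only the final path component (assumed "suffix").
+    suffix = requested.rsplit("/", 1)[-1]
+    for name in registry:
+        if name.rsplit("/", 1)[-1] == suffix:
+            return name
+    return None
+
+
+# Hypothetical registry entries, for illustration only.
+registry = ["BAAI/bge-small-en-v1.5", "example-org/all-MiniLM-L6-v2"]
+
+exact = resolve_model("BAAI/bge-small-en-v1.5", registry)
+by_suffix = resolve_model("sentence-transformers/all-MiniLM-L6-v2", registry)
+missing = resolve_model("acme/unknown-model", registry)
+```
+
+Under these assumptions `exact` is the requested name itself, `by_suffix` is the
+registry entry that shares the `all-MiniLM-L6-v2` suffix, and `missing` is `None`.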
+
+### Project settings (`<project>/.cocoindex_code/settings.yml`)
 
-```rust
-let table = sqlite::mount_table_target_with_options(&ctx, &DB, TABLE_NAME, schema, opts).await?;
-for row in &rows { table.declare_row(&ctx, row)?; }
+Per-project — controls which files are indexed.
+
+```yaml
+include_patterns:
+  - "**/*.py"
+  - "**/*.ts"
+  - "**/*.rs"
+  - "**/*.go"
+  # ... sensible defaults for 28+ file types
+
+exclude_patterns:
+  - "**/.*"               # hidden files and directories
+  - "**/node_modules"
+  - "**/dist"
+  # ...
+
+language_overrides:
+  - ext: inc              # treat .inc files as PHP
+    lang: php
 ```
 
-Schema is built with `TableSchema::new([(name, ColumnDef::new(ty)), …], [pk])`;
-sqlite-vec virtual tables via `Vec0TableDef { partition_key_columns, auxiliary_columns }`.
+On top of the include/exclude globs, file matching also honors nested `.gitignore`
+files. `ccc init` adds `.cocoindex_code/` to the project's `.gitignore`.
 
-**6. sqlite-vec gotcha** (`db.rs`). The SDK's `sqlite::Database::connect` does
-*not* load the `vec0` extension. The tool registers it as a SQLite
-auto-extension once, builds its own pool, and hands it to
-`sqlite::Database::from_pool(state_id, pool)` — the supported way to use the SDK
-sqlite target with `vec0` virtual tables.
+## Supported languages
 
-**7. Building blocks used from the SDK:** `ops::text::{RecursiveSplitter,
-RecursiveChunkConfig, detect_code_language}` for chunking/language detection;
-`IdGenerator::new()` + `id_gen.next_id(ctx, &code)` for stable chunk ids;
-`RunStats` for the run summary; `Error::engine(..)` to wrap foreign errors into
-the SDK error type.
+Tree-sitter–based chunking for Python, JavaScript/TypeScript, Rust, Go, Java,
+C/C++, C#, Ruby, PHP, Swift, Kotlin, Scala, SQL, Shell, Markdown, and more.
+Unrecognized text files are indexed with a generic recursive splitter.
 
-## CLI commands
+## Differences from the Python build
 
-`init`, `index`, `search` (`--lang` / `--path` / `--offset` / `--limit` / `--refresh`),
-`status`, `reset` (`--all` / `-f`), `doctor` (`-v`), `mcp`,
-`daemon status|restart|stop`, and the hidden `run-daemon`.
+This native build targets feature parity with the Python `ccc` for day-to-day
+use; two things differ today:
 
-## Configuration
+- **Embeddings are local-only** (fastembed). There is no LiteLLM / cloud-provider
+  option, and the default model is `BAAI/bge-small-en-v1.5`.
+- **Custom Python chunkers** (`chunkers:` in project settings) are not supported —
+  the config still parses, but the built-in tree-sitter splitter is used.
 
-Settings live in `~/.cocoindex_code/global_settings.yml` (embedding model,
-provider, indexing/query params) and a per-project `.cocoindex_code/settings.yml`
-(include/exclude patterns, language overrides). Include/exclude use the SDK's
-`PatternFilePathMatcher`, wrapped to also honor nested `.gitignore` files.
-`ccc doctor` prints the resolved configuration and where each value came from.
-
-## Testing
-
-- `cargo test` — the sqlite-vec `vec0` extension loads and KNN returns correct
-  results.
-- `tests/e2e_cli.sh` / `tests/e2e_advanced.sh` — end-to-end coverage of
-  `init` → `index` → `search` (with `--lang`/`--path` filters and incremental
-  re-index), daemon lifecycle (auto-spawn, restart, stop, graceful shutdown),
-  multi-project serving, model-swap re-index, MCP (`initialize` / `tools/list` /
-  `tools/call`), `doctor`, and `reset --all`.
-
-## Limitations / follow-ups
-
-- **Embeddings**: local fastembed only — no cloud / multi-provider backend yet.
-- **`init`** is flag-driven (`--model`) rather than interactive prompts.
-- **Custom chunkers**: the built-in tree-sitter recursive splitter is used;
-  pluggable chunkers are not yet supported.
-- **Live index-progress streaming** and container path-mapping env vars are
-  follow-ups.
+Index databases are interchangeable: `ccc search` works against an index built by
+the Python tool, and vice versa.