Commit 1af60f8

badmonster0 and claude committed
docs(rust): make it a standalone Rust README (drop Python framing)
Rename rust/PORTING.md -> rust/README.md and rewrite as a Rust-only usage doc: drop the Python->Rust module map, the parity audit, the Python backward-compat section, and the "vs Python" deltas. Keep build/run, architecture, the "How it uses the CocoIndex Rust SDK" walkthrough, CLI commands, configuration, testing, and a plain limitations/follow-ups list. Update the e2e fixture that copies the doc as a sample markdown file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 5c0b5dc commit 1af60f8

3 files changed

Lines changed: 160 additions & 211 deletions

File tree

rust/PORTING.md

Lines changed: 0 additions & 210 deletions
This file was deleted.

rust/README.md

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
# ccc: semantic code search (Rust)

A lightweight, AST-aware semantic code search engine (the `ccc` CLI) built on the
**CocoIndex Rust SDK**. It walks a codebase, chunks each file with tree-sitter,
embeds the chunks locally, and stores them in a sqlite-vec (`vec0`) table for
fast vector search, from the CLI or over MCP.

## Build & run

```bash
cd rust
cargo build   # fastembed/ONNX is always on; local embeddings are the only backend
cargo test    # sqlite-vec (vec0) integration test

./target/debug/ccc init
./target/debug/ccc index
./target/debug/ccc search "vector similarity" --lang rust --limit 10
```

The SDK is a **path dependency**, which assumes `cocoindex` is checked out as a sibling
(`../../cocoindex`). For distribution this should become a git dependency on
`cocoindex-io/cocoindex` (the `v1` branch).

## Architecture

The CLI is a thin **client** that talks to a background **daemon** over a Unix
socket; the daemon keeps the embedding model warm and caches per-project state.
`index` / `search` / `status` / `doctor` are daemon-backed and auto-spawn the
daemon on first use.

- **IPC**: length-prefixed msgpack frames over `daemon.sock`.
- **Embeddings**: local sentence-transformers via **fastembed** (ONNX). Default
  model `BAAI/bge-small-en-v1.5`; any model in fastembed's registry works
  (resolved by name, then by suffix, so `sentence-transformers/all-MiniLM-L6-v2`
  resolves).
- **Storage**: a sqlite-vec (`vec0`) virtual table, partitioned by `language`.

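
The IPC bullet above mentions length-prefixed frames. As a rough illustration of that framing idea only, here is a std-only sketch. The 4-byte big-endian length prefix, the `MAX_FRAME` cap, and the helper names `write_frame` / `read_frame` are assumptions made for this example and are not taken from the daemon's source; the payload is treated as opaque bytes, so the msgpack encoding is left out.

```rust
use std::io::{self, Read, Write};

/// Hypothetical upper bound on a single frame (16 MiB), to reject corrupt lengths.
const MAX_FRAME: usize = 16 * 1024 * 1024;

/// Write one frame: a 4-byte big-endian length, then the payload bytes.
fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
    }
    w.write_all(&(payload.len() as u32).to_be_bytes())?;
    w.write_all(payload)?;
    w.flush()
}

/// Read one frame. Returns `Ok(None)` when the stream ends cleanly before a
/// new frame starts, and an error when it ends in the middle of a frame.
fn read_frame<R: Read>(r: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    let mut filled = 0;
    while filled < len_buf.len() {
        match r.read(&mut len_buf[filled..])? {
            0 if filled == 0 => return Ok(None), // clean end of stream
            0 => return Err(io::ErrorKind::UnexpectedEof.into()),
            n => filled += n,
        }
    }
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > MAX_FRAME {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    Ok(Some(payload))
}
```

Because each frame carries its own length, a reader can pull complete messages off a stream socket without a delimiter and without knowing anything about the payload format.
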
## How it uses the CocoIndex Rust SDK

This is the canonical worked example of driving the SDK from Rust. The snippets
below mirror the live source (`rust/src/indexer.rs` is the whole flow in ~230
lines), so treat the cited `file:line` anchors as the source of truth and update
this section whenever those change.

**Crate + features** (`Cargo.toml`): one path/git dependency, feature-gated to
exactly what the tool needs.

```toml
# `path` matches the sibling checkout described under "Build & run";
# swap it for a git dependency when distributing.
cocoindex = { path = "../../cocoindex", features = ["text", "sqlite", "fastembed", "fs_live"] }
#   text      -> RecursiveSplitter + detect_code_language (tree-sitter)
#   sqlite    -> sqlite-vec (vec0) table target
#   fastembed -> local sentence-transformers embeddings
#   fs_live   -> live directory watching (daemon)
```

`use cocoindex::prelude::*;` pulls in `Ctx`, `Error`/`Result`, `FileEntry`,
`IdGenerator`, `walk_dir`, and the `mount_each!` macro.

**1. Environment → App → run**: the entry point (`indexer.rs:206`). The
`Environment` owns the incremental-state DB and the dependency-injected
resources; `app.run` executes one declarative pass and returns `RunStats`.

```rust
let app = cocoindex::Environment::builder()
    .db_path(coco_db_path)                     // engine's change-tracking state DB
    .provide_key(&DB, db)                      // inject resources by ContextKey
    .provide_key(&EMBEDDER, embedder.clone())
    .build().await?
    .app("CocoIndexCode").await?;
let stats: RunStats = app.run(move |ctx| app_main(ctx, /* … */)).await?;
```

**2. Context keys = typed DI + change detection** (`indexer.rs:30`). `ContextKey`
values are fetched inside a flow with `ctx.get_key(&KEY)?`. `new_with_state`
attaches a state-id, so changing the underlying resource (e.g. the embedding
model) invalidates everything memoized against it.

```rust
static EMBEDDER: LazyLock<ContextKey<CodeEmbedder>> =
    LazyLock::new(|| ContextKey::new_with_state("embedder", |e| e.state_key()));
```

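
To make the "typed DI + state-id" idea concrete, here is a small std-only sketch of a typed key registry. It is not the SDK's `ContextKey` implementation; `Key`, `Registry`, `Embedder`, and `embedder_state` are hypothetical names invented for this illustration. The point it shows is that the key carries the resource type, so lookups are type-checked, and that a state-id derived from the resource changes when the resource does.

```rust
use std::any::Any;
use std::collections::HashMap;

/// Hypothetical typed key: a name plus a function deriving a state-id.
struct Key<T> {
    name: &'static str,
    state_of: fn(&T) -> String,
}

impl<T: Any> Key<T> {
    fn new(name: &'static str, state_of: fn(&T) -> String) -> Self {
        Key { name, state_of }
    }
}

/// Hypothetical registry holding one type-erased resource per key name.
#[derive(Default)]
struct Registry {
    slots: HashMap<&'static str, Box<dyn Any>>,
}

impl Registry {
    /// Store (or replace) the resource for this key.
    fn provide<T: Any>(&mut self, key: &Key<T>, value: T) {
        self.slots.insert(key.name, Box::new(value));
    }

    /// Fetch the resource; `None` if missing or stored under another type.
    fn get<T: Any>(&self, key: &Key<T>) -> Option<&T> {
        self.slots.get(key.name)?.downcast_ref::<T>()
    }

    /// State-id of the current resource; memoized work can be keyed on it.
    fn state_id<T: Any>(&self, key: &Key<T>) -> Option<String> {
        self.get(key).map(key.state_of)
    }
}

/// Stand-in for an embedder whose identity is its model name.
struct Embedder {
    model: String,
}

fn embedder_state(e: &Embedder) -> String {
    e.model.clone()
}
```

In this sketch, swapping the `Embedder` for one with a different model name changes what `state_id` returns, which is the signal a memoizing engine would need to discard results computed with the old resource.
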
**3. Memoized functions**: `#[cocoindex::function]` (`indexer.rs:48`). The
arguments are part of the memo fingerprint: an unchanged `(file, model_tag)` is
skipped on the next run. (We thread the embedder's identity through `model_tag`
precisely so a model swap reprocesses every file.)

```rust
#[cocoindex::function]
async fn process_file(ctx: &Ctx, file: FileEntry, model_tag: String)
    -> Result<Vec<CodeChunk>> { /* chunk + embed */ }
```

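
The skip-if-unchanged behaviour can be pictured with a toy memo table. This is only a sketch of the general technique, assuming a fingerprint computed from the file content and the model tag; `Memo`, `fingerprint`, and the fake "chunking" are hypothetical and say nothing about how the SDK stores or hashes its state.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Fingerprint of every input that affects the result.
/// Note: `DefaultHasher` is fine in-process, but its algorithm is unspecified,
/// so a fingerprint persisted to disk would need a stable hash instead.
fn fingerprint(content: &str, model_tag: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    (content, model_tag).hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical memo table: path -> (fingerprint, cached chunks).
#[derive(Default)]
struct Memo {
    cache: HashMap<String, (u64, Vec<String>)>,
    /// How many times the expensive step actually ran.
    runs: usize,
}

impl Memo {
    fn process(&mut self, path: &str, content: &str, model_tag: &str) -> Vec<String> {
        let fp = fingerprint(content, model_tag);
        if let Some((cached_fp, chunks)) = self.cache.get(path) {
            if *cached_fp == fp {
                return chunks.clone(); // inputs unchanged: skip the work
            }
        }
        self.runs += 1;
        // Stand-in for "chunk + embed": one tagged chunk per line.
        let chunks: Vec<String> = content
            .lines()
            .map(|line| format!("{model_tag}:{line}"))
            .collect();
        self.cache.insert(path.to_string(), (fp, chunks.clone()));
        chunks
    }
}
```

Because the model tag is part of the fingerprint, calling `process` with the same content but a different tag misses the cache, which mirrors why the README threads the embedder's identity through `model_tag`.
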
**4. Sources + fan-out** (`indexer.rs:169`). `walk_dir(...).path_matcher(...)`
yields `(key, FileEntry)` items (`file.key()`, `file.content_str()`);
`mount_each!` mounts the memoized fn once per item.

```rust
let files = walk_dir(root).recursive(true).path_matcher(matcher).items()?;
let rows_by_file =
    mount_each!(files, |file| process_file(ctx, file, model_tag.clone())).await?;
```

**5. Targets = declarative sync** (`indexer.rs:152`). You *declare* the desired
rows; the engine diffs against the previous run and applies the minimal
insert/update/delete. Rows are plain `Serialize` structs (`schema.rs::CodeChunk`).

```rust
let table = sqlite::mount_table_target_with_options(&ctx, &DB, TABLE_NAME, schema, opts).await?;
for row in &rows { table.declare_row(&ctx, row)?; }
```

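
As a rough picture of what "diff against the previous run" means, the following std-only sketch plans the minimal set of operations between two snapshots of rows keyed by primary key. `Op` and `plan_sync` are hypothetical names for this illustration; the engine's real bookkeeping is not shown in this document and may differ.

```rust
use std::collections::BTreeMap;

/// One planned change to the target table, identified by primary key.
#[derive(Debug, PartialEq)]
enum Op {
    Insert(u64),
    Update(u64),
    Delete(u64),
}

/// Compare the rows declared this run with the rows from the previous run
/// and return only the operations needed to reconcile them.
fn plan_sync(previous: &BTreeMap<u64, String>, desired: &BTreeMap<u64, String>) -> Vec<Op> {
    let mut ops = Vec::new();
    for (id, row) in desired {
        match previous.get(id) {
            None => ops.push(Op::Insert(*id)),
            Some(old) if old != row => ops.push(Op::Update(*id)),
            Some(_) => {} // unchanged: nothing to write
        }
    }
    // Anything that existed before but was not declared this run is removed.
    for id in previous.keys() {
        if !desired.contains_key(id) {
            ops.push(Op::Delete(*id));
        }
    }
    ops
}
```
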
Schema is built with `TableSchema::new([(name, ColumnDef::new(ty)), …], [pk])`;
sqlite-vec virtual tables are declared via `Vec0TableDef { partition_key_columns, auxiliary_columns }`.

**6. sqlite-vec gotcha** (`db.rs`). The SDK's `sqlite::Database::connect` does
*not* load the `vec0` extension. The tool registers it as a SQLite
auto-extension once, builds its own pool, and hands it to
`sqlite::Database::from_pool(state_id, pool)`, which is the supported way to use the SDK
sqlite target with `vec0` virtual tables.

**7. Building blocks used from the SDK:** `ops::text::{RecursiveSplitter,
RecursiveChunkConfig, detect_code_language}` for chunking/language detection;
`IdGenerator::new()` + `id_gen.next_id(ctx, &code)` for stable chunk ids;
`RunStats` for the run summary; `Error::engine(..)` to wrap foreign errors into
the SDK error type.

## CLI commands

`init`, `index`, `search` (`--lang` / `--path` / `--offset` / `--limit` / `--refresh`),
`status`, `reset` (`--all` / `-f`), `doctor` (`-v`), `mcp`,
`daemon status|restart|stop`, and the hidden `run-daemon`.

## Configuration

Settings live in `~/.cocoindex_code/global_settings.yml` (embedding model,
provider, indexing/query params) and a per-project `.cocoindex_code/settings.yml`
(include/exclude patterns, language overrides). Include/exclude use the SDK's
`PatternFilePathMatcher`, wrapped to also honor nested `.gitignore` files.
`ccc doctor` prints the resolved configuration and where each value came from.

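
The idea of a resolved configuration that remembers where each value came from can be sketched as a layered merge. This is an illustration under stated assumptions, not the tool's code: the precedence order (built-in default, then global file, then project file), the `Source` enum, and the `resolve` helper are all hypothetical, and the keys in any usage are made up.

```rust
use std::collections::BTreeMap;

/// Where a resolved value was taken from.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Source {
    Default,
    Global,
    Project,
}

/// Merge three layers of key/value settings. Later layers override earlier
/// ones (an assumption of this sketch), and each surviving value keeps a
/// record of the layer that supplied it.
fn resolve(
    defaults: &[(&str, &str)],
    global: &[(&str, &str)],
    project: &[(&str, &str)],
) -> BTreeMap<String, (String, Source)> {
    let mut resolved = BTreeMap::new();
    let layers = [
        (Source::Default, defaults),
        (Source::Global, global),
        (Source::Project, project),
    ];
    for (source, layer) in layers {
        for (key, value) in layer {
            resolved.insert(key.to_string(), (value.to_string(), source));
        }
    }
    resolved
}
```

Keeping the source next to the value is what makes a diagnostic listing possible: a report can print each key with its value and the layer it was read from.
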
## Testing

- `cargo test`: the sqlite-vec `vec0` extension loads and KNN returns correct
  results.
- `tests/e2e_cli.sh` / `tests/e2e_advanced.sh`: end-to-end coverage of
  `init` → `index` → `search` (with `--lang`/`--path` filters and incremental
  re-index), daemon lifecycle (auto-spawn, restart, stop, graceful shutdown),
  multi-project serving, model-swap re-index, MCP (`initialize` / `tools/list` /
  `tools/call`), `doctor`, and `reset --all`.

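
One way to reason about what "KNN returns correct results" means is to compare against a brute-force reference. The sketch below is such a reference in plain Rust. It assumes squared L2 distance and breaks ties by id; both choices, along with the names `l2_squared` and `knn`, are assumptions of this example and not a description of the repository's test.

```rust
/// Squared Euclidean distance between two vectors of equal length.
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Brute-force k-nearest-neighbours: score every row, sort, keep the first k ids.
fn knn(query: &[f32], rows: &[(u64, Vec<f32>)], k: usize) -> Vec<u64> {
    let mut scored: Vec<(f32, u64)> = rows
        .iter()
        .map(|(id, vector)| (l2_squared(query, vector), *id))
        .collect();
    // Order by distance, then by id so equal distances sort deterministically.
    scored.sort_by(|a, b| a.0.total_cmp(&b.0).then(a.1.cmp(&b.1)));
    scored.into_iter().take(k).map(|(_, id)| id).collect()
}
```

A reference like this is quadratic in the worst case, but on a handful of fixture vectors it gives an unambiguous expected ordering to check an index against.
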
## Limitations / follow-ups

- **Embeddings**: local fastembed only; no cloud / multi-provider backend yet.
- **`init`** is driven by flags (`--model`) rather than interactive prompts.
- **Custom chunkers**: the built-in tree-sitter recursive splitter is used;
  pluggable chunkers are not yet supported.
- **Live index-progress streaming** and container path-mapping env vars are
  follow-ups.

rust/tests/e2e_advanced.sh

Lines changed: 1 addition & 1 deletion
@@ -103,7 +103,7 @@ echo "### F. Real Rust codebase (the port's own src) — multi-language"
 P="$ROOT/realrust"; mkdir -p "$P/src"
 cp "$REPO"/rust/src/*.rs "$P/src/"
 cp "$REPO"/rust/Cargo.toml "$P/"
-cp "$REPO"/rust/PORTING.md "$P/"
+cp "$REPO"/rust/README.md "$P/"
 cd "$P"; $BIN init >/dev/null 2>&1; rr=$($BIN index 2>&1 | grep -E "rust:|toml:|markdown:")
 has "indexed rust files" "rust:" "$rr"
 has "indexed toml" "toml:" "$rr"