|
| 1 | +# ccc — semantic code search (Rust) |
| 2 | + |
| 3 | +A lightweight, AST-aware semantic code search engine (the `ccc` CLI) built on the |
| 4 | +**CocoIndex Rust SDK**. It walks a codebase, chunks each file with tree-sitter, |
| 5 | +embeds the chunks locally, and stores them in a sqlite-vec (`vec0`) table for |
| 6 | +fast vector search — from the CLI or over MCP. |
| 7 | + |
| 8 | +## Build & run |
| 9 | + |
| 10 | +```bash |
| 11 | +cd rust |
| 12 | +cargo build # fastembed/ONNX is always on — local embeddings are the only backend |
| 13 | +cargo test # sqlite-vec (vec0) integration test |
| 14 | + |
| 15 | +./target/debug/ccc init |
| 16 | +./target/debug/ccc index |
| 17 | +./target/debug/ccc search "vector similarity" --lang rust --limit 10 |
| 18 | +``` |
| 19 | + |
| 20 | +The SDK is a **path dependency** that assumes `cocoindex` is checked out as a sibling |
| 21 | +directory (`../../cocoindex`). For distribution, this should become a git dependency on |
| 22 | +`cocoindex-io/cocoindex` (the `v1` branch). |
| 23 | + |
| 24 | +## Architecture |
| 25 | + |
| 26 | +The CLI is a thin **client** that talks to a background **daemon** over a Unix |
| 27 | +socket; the daemon keeps the embedding model warm and caches per-project state. |
| 28 | +`index` / `search` / `status` / `doctor` are daemon-backed and auto-spawn the |
| 29 | +daemon on first use. |
| 30 | + |
| 31 | +- **IPC**: length-prefixed msgpack frames over `daemon.sock`. |
| 32 | +- **Embeddings**: local sentence-transformers via **fastembed** (ONNX). Default |
| 33 | + model `BAAI/bge-small-en-v1.5`; any model in fastembed's registry works |
| 34 | + (resolved by name, then by suffix, so `sentence-transformers/all-MiniLM-L6-v2` |
| 35 | + resolves). |
| 36 | +- **Storage**: a sqlite-vec (`vec0`) virtual table, partitioned by `language`. |
| 37 | + |
| 38 | +## How it uses the CocoIndex Rust SDK |
| 39 | + |
| 40 | +This is the canonical worked example of driving the SDK from Rust. The snippets |
| 41 | +below mirror the live source — `rust/src/indexer.rs` is the whole flow in ~230 |
| 42 | +lines — so treat the cited `file:line` anchors as the source of truth and update |
| 43 | +this section whenever those change. |
| 44 | + |
| 45 | +**Crate + features** (`Cargo.toml`): one path/git dependency, feature-gated to |
| 46 | +exactly what the tool needs. |
| 47 | + |
| 48 | +```toml |
| 49 | +cocoindex = { path = "../../cocoindex", features = ["text", "sqlite", "fastembed", "fs_live"] } |
| 50 | +# text -> RecursiveSplitter + detect_code_language (tree-sitter) |
| 51 | +# sqlite -> sqlite-vec (vec0) table target |
| 52 | +# fastembed -> local sentence-transformers embeddings |
| 53 | +# fs_live -> live directory watching (daemon) |
| 54 | +``` |
| 55 | + |
| 56 | +`use cocoindex::prelude::*;` pulls in `Ctx`, `Error`/`Result`, `FileEntry`, |
| 57 | +`IdGenerator`, `walk_dir`, and the `mount_each!` macro. |
| 58 | + |
| 59 | +**1. Environment → App → run** — the entry point (`indexer.rs:206`). The |
| 60 | +`Environment` owns the incremental-state DB and the dependency-injected |
| 61 | +resources; `app.run` executes one declarative pass and returns `RunStats`. |
| 62 | + |
| 63 | +```rust |
| 64 | +let app = cocoindex::Environment::builder() |
| 65 | + .db_path(coco_db_path) // engine's change-tracking state DB |
| 66 | + .provide_key(&DB, db) // inject resources by ContextKey |
| 67 | + .provide_key(&EMBEDDER, embedder.clone()) |
| 68 | + .build().await? |
| 69 | + .app("CocoIndexCode").await?; |
| 70 | +let stats: RunStats = app.run(move |ctx| app_main(ctx, /* … */)).await?; |
| 71 | +``` |
| 72 | + |
| 73 | +**2. Context keys = typed DI + change detection** (`indexer.rs:30`). `ContextKey` |
| 74 | +values are fetched inside a flow with `ctx.get_key(&KEY)?`. `new_with_state` |
| 75 | +attaches a state-id, so changing the underlying resource (e.g. the embedding |
| 76 | +model) invalidates everything memoized against it. |
| 77 | + |
| 78 | +```rust |
| 79 | +static EMBEDDER: LazyLock<ContextKey<CodeEmbedder>> = |
| 80 | + LazyLock::new(|| ContextKey::new_with_state("embedder", |e| e.state_key())); |
| 81 | +``` |
| 82 | + |
| 83 | +**3. Memoized functions** — `#[cocoindex::function]` (`indexer.rs:48`). The |
| 84 | +arguments are part of the memo fingerprint: an unchanged `(file, model_tag)` is |
| 85 | +skipped on the next run. (We thread the embedder's identity through `model_tag` |
| 86 | +precisely so a model swap reprocesses every file.) |
| 87 | + |
| 88 | +```rust |
| 89 | +#[cocoindex::function] |
| 90 | +async fn process_file(ctx: &Ctx, file: FileEntry, model_tag: String) |
| 91 | + -> Result<Vec<CodeChunk>> { /* chunk + embed */ } |
| 92 | +``` |
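
The idea behind the memo fingerprint can be sketched without the SDK. The code below is a simplified model, not how `#[cocoindex::function]` is implemented: the `Memo` type, the `fingerprint` helper, and the use of `DefaultHasher` are all hypothetical. It only shows why passing `model_tag` as an argument forces reprocessing after a model swap.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Fingerprint of every input that should trigger reprocessing when it changes.
fn fingerprint(content: &str, model_tag: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    model_tag.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical memo table: file path -> fingerprint seen on the previous run.
#[derive(Default)]
struct Memo {
    seen: HashMap<String, u64>,
}

impl Memo {
    /// Returns true when the file must be (re)processed, and records the new fingerprint.
    fn needs_processing(&mut self, path: &str, content: &str, model_tag: &str) -> bool {
        let fp = fingerprint(content, model_tag);
        self.seen.insert(path.to_string(), fp) != Some(fp)
    }
}

fn main() {
    let mut memo = Memo::default();
    // First sighting: must be processed.
    assert!(memo.needs_processing("src/lib.rs", "fn a() {}", "bge-small"));
    // Same content, same model: skipped.
    assert!(!memo.needs_processing("src/lib.rs", "fn a() {}", "bge-small"));
    // Content changed: reprocessed.
    assert!(memo.needs_processing("src/lib.rs", "fn b() {}", "bge-small"));
    // Model changed with identical content: reprocessed as well.
    assert!(memo.needs_processing("src/lib.rs", "fn b() {}", "all-MiniLM"));
}
```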
| 93 | + |
| 94 | +**4. Sources + fan-out** (`indexer.rs:169`). `walk_dir(...).path_matcher(...)` |
| 95 | +yields `(key, FileEntry)` items (`file.key()`, `file.content_str()`); |
| 96 | +`mount_each!` mounts the memoized fn once per item. |
| 97 | + |
| 98 | +```rust |
| 99 | +let files = walk_dir(root).recursive(true).path_matcher(matcher).items()?; |
| 100 | +let rows_by_file = |
| 101 | + mount_each!(files, |file| process_file(ctx, file, model_tag.clone())).await?; |
| 102 | +``` |
| 103 | + |
| 104 | +**5. Targets = declarative sync** (`indexer.rs:152`). You *declare* the desired |
| 105 | +rows; the engine diffs against the previous run and applies the minimal |
| 106 | +insert/update/delete. Rows are plain `Serialize` structs (`schema.rs::CodeChunk`). |
| 107 | + |
| 108 | +```rust |
| 109 | +let table = sqlite::mount_table_target_with_options(&ctx, &DB, TABLE_NAME, schema, opts).await?; |
| 110 | +for row in &rows { table.declare_row(&ctx, row)?; } |
| 111 | +``` |
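
As a rough mental model of what "declare the desired rows and let the engine diff" means, the sketch below compares the previous run's rows with the newly declared rows and derives the minimal change set. It is an assumption-laden illustration: `Change`, `diff_rows`, and the `u64` key / `String` row types are hypothetical stand-ins, and the SDK's actual reconciliation logic is not shown here.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum Change {
    Insert(u64),
    Update(u64),
    Delete(u64),
}

/// Compare the rows declared in this run with the rows stored by the previous
/// run (both keyed by primary key) and return the minimal set of changes.
fn diff_rows(previous: &BTreeMap<u64, String>, desired: &BTreeMap<u64, String>) -> Vec<Change> {
    let mut changes = Vec::new();
    for (id, row) in desired {
        match previous.get(id) {
            None => changes.push(Change::Insert(*id)),
            Some(old) if old != row => changes.push(Change::Update(*id)),
            Some(_) => {} // unchanged row: nothing to write
        }
    }
    // Rows that were stored before but are no longer declared get deleted.
    for id in previous.keys() {
        if !desired.contains_key(id) {
            changes.push(Change::Delete(*id));
        }
    }
    changes
}

fn main() {
    let previous = BTreeMap::from([
        (1, "fn a()".to_string()),
        (2, "fn b()".to_string()),
        (3, "fn c()".to_string()),
    ]);
    let desired = BTreeMap::from([
        (1, "fn a()".to_string()),  // unchanged
        (2, "fn b2()".to_string()), // edited chunk
        (4, "fn d()".to_string()),  // new chunk
    ]);
    let changes = diff_rows(&previous, &desired);
    assert_eq!(
        changes,
        vec![Change::Update(2), Change::Insert(4), Change::Delete(3)]
    );
}
```

The caller never issues deletes explicitly: a chunk that is no longer declared simply disappears from the target on the next run.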
| 112 | + |
| 113 | +The schema is built with `TableSchema::new([(name, ColumnDef::new(ty)), …], [pk])`; |
| 114 | +sqlite-vec virtual tables are described with `Vec0TableDef { partition_key_columns, auxiliary_columns }`. |
| 115 | + |
| 116 | +**6. sqlite-vec gotcha** (`db.rs`). The SDK's `sqlite::Database::connect` does |
| 117 | +*not* load the `vec0` extension. The tool registers it as a SQLite |
| 118 | +auto-extension once, builds its own pool, and hands it to |
| 119 | +`sqlite::Database::from_pool(state_id, pool)` — the supported way to use the SDK |
| 120 | +sqlite target with `vec0` virtual tables. |
| 121 | + |
| 122 | +**7. Building blocks used from the SDK:** `ops::text::{RecursiveSplitter, |
| 123 | +RecursiveChunkConfig, detect_code_language}` for chunking/language detection; |
| 124 | +`IdGenerator::new()` + `id_gen.next_id(ctx, &code)` for stable chunk ids; |
| 125 | +`RunStats` for the run summary; `Error::engine(..)` to wrap foreign errors into |
| 126 | +the SDK error type. |
| 127 | + |
| 128 | +## CLI commands |
| 129 | + |
| 130 | +`init`, `index`, `search` (`--lang` / `--path` / `--offset` / `--limit` / `--refresh`), |
| 131 | +`status`, `reset` (`--all` / `-f`), `doctor` (`-v`), `mcp`, |
| 132 | +`daemon status|restart|stop`, and the hidden `run-daemon`. |
| 133 | + |
| 134 | +## Configuration |
| 135 | + |
| 136 | +Settings live in `~/.cocoindex_code/global_settings.yml` (embedding model, |
| 137 | +provider, indexing/query params) and a per-project `.cocoindex_code/settings.yml` |
| 138 | +(include/exclude patterns, language overrides). Include/exclude use the SDK's |
| 139 | +`PatternFilePathMatcher`, wrapped to also honor nested `.gitignore` files. |
| 140 | +`ccc doctor` prints the resolved configuration and where each value came from. |
| 141 | + |
| 142 | +## Testing |
| 143 | + |
| 144 | +- `cargo test` — the sqlite-vec `vec0` extension loads and KNN returns correct |
| 145 | + results. |
| 146 | +- `tests/e2e_cli.sh` / `tests/e2e_advanced.sh` — end-to-end coverage of |
| 147 | + `init` → `index` → `search` (with `--lang`/`--path` filters and incremental |
| 148 | + re-index), daemon lifecycle (auto-spawn, restart, stop, graceful shutdown), |
| 149 | + multi-project serving, model-swap re-index, MCP (`initialize` / `tools/list` / |
| 150 | + `tools/call`), `doctor`, and `reset --all`. |
| 151 | + |
| 152 | +## Limitations / follow-ups |
| 153 | + |
| 154 | +- **Embeddings**: local fastembed only — no cloud / multi-provider backend yet. |
| 155 | +- **`init`** is flag-driven (`--model`) rather than prompting interactively. |
| 156 | +- **Custom chunkers**: the built-in tree-sitter recursive splitter is used; |
| 157 | + pluggable chunkers are not yet supported. |
| 158 | +- **Live index-progress streaming** and container path-mapping env vars are |
| 159 | + follow-ups. |