Commit 2b75ca8

rudimar.ronsoni authored and committed
Add local daemon Git layered indexing
1 parent e6a077f commit 2b75ca8

41 files changed

Lines changed: 3040 additions & 63 deletions

AGENTS.md

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+This is built on top of [CocoIndex v1](https://cocoindex.io/docs-v1/llms.txt).
+
+
+## Build and Test Commands
+
+This project uses [uv](https://docs.astral.sh/uv/) for project management.
+
+```bash
+uv run mypy . # Type check Python code
+uv run pytest tests/ # Run Python tests
+```
+
+## Code Conventions
+
+### Internal vs External Modules
+
+We distinguish between **internal modules** (under packages with `_` prefix, e.g. `_internal.*` or `connectors.*._source`) and **external modules** (which users can directly import).
+
+**External modules** (user-facing, e.g. `cocoindex/ops/sentence_transformers.py`):
+
+* Be strict about not leaking implementation details
+* Use `__all__` to explicitly list public exports
+* Prefix ALL non-public symbols with `_`, including:
+  * Standard library imports: `import threading as _threading`, `import typing as _typing`
+  * Third-party imports: `import numpy as _np`, `from numpy.typing import NDArray as _NDArray`
+  * Internal package imports: `from cocoindex.resources import schema as _schema`
+  * Exception: `TYPE_CHECKING` imports for type hints don't need prefixing
+
+**Internal modules** (e.g. `cocoindex/_internal/component_ctx.py`):
+
+* Less strict since users shouldn't import these directly
+* Standard library and internal imports don't need underscore prefix
+* Only prefix symbols that are truly private to the module itself (e.g. `_context_var` for a module-private ContextVar)
+
+### General principles (also covered by `/review-changes`)
+
+- **Top-level imports.** Defer to in-function only for a real circular dependency or a heavy import that isn't always needed.
+- **Specific types over `Any`.** When a value enters as a weaker form (`str`, `Any`), convert to the strong type at the earliest point. Don't propagate the weak form.
+- **`NamedTuple`/small dataclass for multi-value returns.** Access fields by name at call sites.
+- **Single source of truth.** When the same value or logic appears in multiple places, consolidate it.
+- **Delete dead code and dead config.** When a change makes something unreachable, the code, the tests, and the knobs all go.
+- **Honest names.** The name describes what the code does today.
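
Two of the principles above ("specific types over `Any`" and "`NamedTuple` for multi-value returns") can be shown together in a short sketch. The `"name:count"` input format and the names `Layer`, `LayerCount`, and `parse_layer_count` are hypothetical and chosen only for illustration; the layer names themselves come from the `base > branch > dirty` stack described in the README changes.

```python
import enum
from typing import NamedTuple


class Layer(enum.Enum):
    """Strong type for a layer name instead of a free-form string."""

    BASE = "base"
    BRANCH = "branch"
    DIRTY = "dirty"


class LayerCount(NamedTuple):
    """Named result instead of a bare `(str, int)` tuple."""

    layer: Layer
    chunk_count: int


def parse_layer_count(raw: str) -> LayerCount:
    """Convert a weak `"name:count"` string to strong types at the boundary."""
    name, sep, count = raw.partition(":")
    if not sep:
        raise ValueError(f"expected 'name:count', got {raw!r}")
    # Layer(...) and int(...) both raise ValueError on bad input, so callers
    # never see an unvalidated string past this point.
    return LayerCount(layer=Layer(name.strip()), chunk_count=int(count))


result = parse_layer_count("branch:42")
# Call sites read fields by name rather than by tuple position.
summary = f"{result.layer.value} has {result.chunk_count} chunks"
```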
+
+### Testing Guidelines
+
+We prefer end-to-end tests on user-facing APIs, over unit tests on smaller internal functions. With this said, there're cases where unit tests are necessary, e.g. for internal logic with various situations and edge cases, in which case it's usually easier to cover various scenarios with unit tests.
+
+When tests fail, fix the underlying issue. Don't skip, ignore, or exclude to get a green result.

README.md

Lines changed: 163 additions & 8 deletions
@@ -61,6 +61,10 @@ Two install styles — they mirror the Docker image variants of the same names:

 Next, set up your [coding agent integration](#coding-agent-integration) — or jump to [Manual CLI Usage](#manual-cli-usage) if you prefer direct control.

+Docs:
+- [Git Layered Indexing](./docs/layered-indexing.md): configure reusable `base > branch > dirty` Git layers for root clones and linked worktrees.
+- [Docker Layered Indexing](./docs/docker-layered-indexing.md): run the layered daemon in Docker with persistent native state.
+
 ## Coding Agent Integration

 ### Skill (Recommended)
@@ -162,6 +166,16 @@ The background daemon starts automatically on first use.

 > **Tip:** `ccc index` auto-initializes if you haven't run `ccc init` yet, so you can skip straight to indexing.

+For Git repositories, you can configure layered indexing once from the root clone:
+
+```bash
+ccc init --base main # share a base layer across linked worktrees
+ccc index # builds base + branch + dirty layers as needed
+ccc overlay status # inspect the current layer stack
+```
+
+Linked worktrees reuse the same daemon-owned base layer and only index branch and dirty deltas. See [Git Layered Indexing](./docs/layered-indexing.md) for the full configuration model.
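
To make the layer model concrete, here is a minimal sketch of how a `base > branch > dirty` stack could resolve to a single view of the tree. It assumes, purely for illustration, that each layer maps file paths to a content fingerprint and that `None` marks a file deleted in that layer. The daemon's real data model is not shown in this excerpt and may differ; `resolve_layers` is a hypothetical name.

```python
from collections import ChainMap
from typing import Mapping, Optional

# Hypothetical layer shape: file path -> content fingerprint, None = deleted.
LayerFiles = Mapping[str, Optional[str]]


def resolve_layers(base: LayerFiles, branch: LayerFiles, dirty: LayerFiles) -> dict[str, str]:
    """Return the effective view: dirty overrides branch, branch overrides base."""
    # ChainMap looks keys up left to right, so the first layer listed wins.
    stacked = ChainMap(dict(dirty), dict(branch), dict(base))
    return {path: fp for path, fp in stacked.items() if fp is not None}


base = {"a.py": "a1", "b.py": "b1"}   # shared by every worktree
branch = {"b.py": "b2"}               # committed changes on the branch
dirty = {"c.py": "c1", "a.py": None}  # uncommitted edits and one deletion

view = resolve_layers(base, branch, dirty)
```

In this sketch only `branch` and `dirty` differ per worktree, which is why a second worktree can reuse `base` unchanged.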
+
 ### CLI Reference

 | Command | Description |
@@ -170,6 +184,8 @@ The background daemon starts automatically on first use.
 | `ccc index` | Build or update the index (auto-inits if needed). Shows streaming progress. |
 | `ccc search <query>` | Semantic search across the codebase |
 | `ccc status` | Show index stats (chunk count, file count, language breakdown) |
+| `ccc overlay status` | Inspect Git layered indexing state for the current worktree |
+| `ccc overlay prune` | Prune expired branch and dirty layers |
 | `ccc mcp` | Run as MCP server in stdio mode |
 | `ccc doctor` | Run diagnostics — checks settings, daemon, model, file matching, and index health |
 | `ccc reset` | Delete index databases. `--all` also removes settings. `-f` skips confirmation. |
@@ -185,6 +201,7 @@ ccc search --lang python --lang markdown schema # filter by language
 ccc search --path 'src/utils/*' query handler # filter by path
 ccc search --offset 10 --limit 5 database schema # pagination
 ccc search --refresh database schema # update index first, then search
+ccc index --base release/1.2 # override Git overlay base ref once
 ```

 By default, `ccc search` scopes results to your current working directory (relative to the project root). Use `--path` to override.
@@ -231,11 +248,12 @@ PUID=$(id -u) PGID=$(id -g) docker compose -f <(curl -L https://raw.githubuserco

 Or grab [`docker/docker-compose.yml`](./docker/docker-compose.yml) and run `docker compose up -d` next to it (works on any shell, including Windows cmd / PowerShell).

-By default your home directory is mounted into the container (set
-`COCOINDEX_HOST_WORKSPACE` to narrow this to a specific code folder). Index
-data and the embedding model cache persist in a Docker volume across
-restarts. Your global settings file at `$HOME/.cocoindex_code/global_settings.yml`
-is visible and editable on the host; edits take effect on your next `ccc` command.
+By default your home directory is mounted into the container. For team setups,
+prefer a narrower mount such as `COCOINDEX_HOST_WORKSPACE=$HOME/src` or one
+repo path. Index data, daemon Git-layer state, and the embedding model cache
+persist in the `cocoindex-data` Docker volume under `/var/cocoindex`. Your
+global settings file at `$HOME/.cocoindex_code/global_settings.yml` is visible
+and editable on the host; edits take effect on your next `ccc` command.

 > **Pick a different image:** set `COCOINDEX_CODE_IMAGE` to override the
 > default. For example, the `:full` variant or GHCR:
@@ -254,6 +272,9 @@ docker run -d --name cocoindex-code \
   --volume "$HOME:/workspace" \
   --volume cocoindex-data:/var/cocoindex \
   -e COCOINDEX_CODE_HOST_PATH_MAPPING="/workspace=$HOME" \
+  -e COCOINDEX_CODE_STATE_DIR=/var/cocoindex/state \
+  -e COCOINDEX_CODE_RUNTIME_DIR=/var/run/cocoindex_code \
+  -e COCOINDEX_CODE_DB_PATH_MAPPING=/workspace=/var/cocoindex/db \
   cocoindex/cocoindex-code:latest
 ```
 </details>
@@ -267,18 +288,35 @@ docker run -d --name cocoindex-code \
   --volume "$HOME:/workspace" \
   --volume cocoindex-data:/var/cocoindex \
   -e COCOINDEX_CODE_HOST_PATH_MAPPING="/workspace=$HOME" \
+  -e COCOINDEX_CODE_STATE_DIR=/var/cocoindex/state \
+  -e COCOINDEX_CODE_RUNTIME_DIR=/var/run/cocoindex_code \
+  -e COCOINDEX_CODE_DB_PATH_MAPPING=/workspace=/var/cocoindex/db \
   cocoindex/cocoindex-code:latest
 ```
 </details>

 ### Shell wrapper for `ccc` commands

-Paste this into `~/.bashrc` / `~/.zshrc` so `ccc` feels native on the host
-and picks up the right project based on your current directory:
+Paste this into `~/.bashrc` / `~/.zshrc` so `ccc` feels native on the host,
+picks up the right project based on your current directory, and uses the right
+TTY mode for interactive commands vs. MCP or piped stdin:

 ```bash
 ccc() {
-  docker exec -it -e COCOINDEX_CODE_HOST_CWD="$PWD" cocoindex-code ccc "$@"
+  local container="${COCOINDEX_CODE_CONTAINER_NAME:-cocoindex-code}"
+  if [ "$(docker inspect -f '{{.State.Running}}' "$container" 2>/dev/null)" != "true" ]; then
+    echo "cocoindex-code container is not running. Start it with: docker compose -f docker/docker-compose.yml up -d" >&2
+    return 1
+  fi
+
+  local flags=(-i)
+  if [ "${1:-}" != "mcp" ] && [ -t 0 ] && [ -t 1 ]; then
+    flags=(-it)
+  fi
+
+  docker exec "${flags[@]}" \
+    -e COCOINDEX_CODE_HOST_CWD="$PWD" \
+    "$container" ccc "$@"
 }
 ```
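
The flag selection in the new wrapper is the part most likely to be ported to other launchers, so here is the same decision as a small Python sketch. `docker_exec_flags` is a hypothetical helper, not something shipped by the project. The rule it mirrors: pass `-it` only when the command is not `mcp` and both stdin and stdout are terminals, and fall back to `-i` otherwise. That fallback matters because `docker exec -t` fails when stdin is not a TTY, and an MCP stdio session needs plain pipes.

```python
from typing import Sequence


def docker_exec_flags(args: Sequence[str], stdin_is_tty: bool, stdout_is_tty: bool) -> list[str]:
    """Mirror the shell wrapper: allocate a TTY only for interactive, non-MCP calls."""
    is_mcp = bool(args) and args[0] == "mcp"
    if not is_mcp and stdin_is_tty and stdout_is_tty:
        return ["-it"]
    # Piped stdin/stdout or `ccc mcp`: keep stdin open but do not allocate a TTY.
    return ["-i"]
```

A caller would typically pass `sys.stdin.isatty()` and `sys.stdout.isatty()` for the two booleans.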

@@ -346,6 +384,86 @@ docker rm -f cocoindex-code
 docker volume rm cocoindex-db cocoindex-model-cache
 ```

+For regular upgrades, keep the volume and recreate the container:
+
+```bash
+docker compose -f docker/docker-compose.yml pull
+docker compose -f docker/docker-compose.yml up -d
+```
+
+Switch between the slim and full images by changing `COCOINDEX_CODE_IMAGE`:
+
+```bash
+COCOINDEX_CODE_IMAGE=cocoindex/cocoindex-code:latest docker compose -f docker/docker-compose.yml up -d
+COCOINDEX_CODE_IMAGE=cocoindex/cocoindex-code:full docker compose -f docker/docker-compose.yml up -d
+```
+
+### Docker debugging
+
+Useful commands:
+
+```bash
+# Logs from the daemon supervisor and daemon process
+docker logs -f cocoindex-code
+
+# Shell inside the container
+docker exec -it cocoindex-code sh
+
+# Daemon readiness/status
+docker exec cocoindex-code ccc daemon status
+docker exec cocoindex-code test -S /var/run/cocoindex_code/daemon.sock
+
+# Restart the container
+docker restart cocoindex-code
+
+# Stop and remove the container, preserving index/state/cache volume
+docker rm -f cocoindex-code
+
+# Reset all Docker-managed index/state/cache data
+docker compose -f docker/docker-compose.yml down -v
+```
+
+Docker paths:
+
+| Data | Default path |
+|---|---|
+| Host workspace mount | `/workspace` |
+| Settings on the mounted workspace | `/workspace/.cocoindex_code/global_settings.yml` |
+| DB/index files | `/var/cocoindex/db` via `COCOINDEX_CODE_DB_PATH_MAPPING=/workspace=/var/cocoindex/db` |
+| Durable daemon Git-layer state | `/var/cocoindex/state` via `COCOINDEX_CODE_STATE_DIR` |
+| Runtime socket, PID, and daemon log | `/var/run/cocoindex_code` via `COCOINDEX_CODE_RUNTIME_DIR` |
+| Embedding/cache data | `/var/cocoindex/cache` |
+
+Local Git worktrees that use the same Docker container should share the Docker
+daemon state in `/var/cocoindex/state`. That lets the local daemon Git-layer
+feature reuse daemon-owned layer metadata and materialized layer sources across
+projects while keeping transient sockets under `/var/run/cocoindex_code`.
+
+For layered indexing in Docker, initialize the base ref from the root clone and
+then use linked worktrees through the same wrapper/container:
+
+```bash
+cd $HOME/src/github/cocoindex-io/cocoindex-code
+ccc init --base main
+ccc index
+
+git worktree add ../cocoindex-code.worktrees/feature-1 -b feature-1 main
+cd ../cocoindex-code.worktrees/feature-1
+ccc index
+ccc overlay status
+```
+
+Mount a workspace parent that contains both the root clone and linked
+worktrees. For example:
+
+```bash
+COCOINDEX_HOST_WORKSPACE=$HOME/src/github/cocoindex-io \
+  docker compose -f docker/docker-compose.yml up -d
+```
+
+See [Docker Layered Indexing](./docs/docker-layered-indexing.md) for the full
+Docker setup and troubleshooting guide.
+
 ### Configuration via environment variables

 Pass configuration to `docker run` / compose with `-e`:
@@ -365,6 +483,25 @@ Pass configuration to `docker run` / compose with `-e`:
 > to everything under it. If that's too broad, bind-mount a narrower
 > directory instead (`COCOINDEX_HOST_WORKSPACE=/path/to/code`).

+Supported Docker environment variables:
+
+| Variable | Purpose |
+|---|---|
+| `COCOINDEX_CODE_IMAGE` | Compose image, e.g. `cocoindex/cocoindex-code:full`. |
+| `COCOINDEX_CODE_CONTAINER_NAME` | Compose container name, default `cocoindex-code`. |
+| `COCOINDEX_HOST_WORKSPACE` | Host directory mounted at `/workspace`, default `${HOME}`. |
+| `COCOINDEX_CODE_HOST_PATH_MAPPING` | Container-to-host path mapping for displayed paths. |
+| `COCOINDEX_CODE_HOST_CWD` | Host current directory forwarded by `docker exec` wrappers. |
+| `COCOINDEX_CODE_STATE_DIR` | Durable daemon state directory, default `/var/cocoindex/state`. |
+| `COCOINDEX_CODE_RUNTIME_DIR` | Runtime socket/PID/log directory, default `/var/run/cocoindex_code`. |
+| `COCOINDEX_CODE_DB_PATH_MAPPING` | DB/index storage remapping, default `/workspace=/var/cocoindex/db`. |
+| `PUID`, `PGID` | Linux UID/GID used to chown Docker-managed paths and write host-owned workspace files. |
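
The `COCOINDEX_CODE_HOST_PATH_MAPPING` value used in this README has the form `container_prefix=host_prefix` (for example `/workspace=$HOME`). The sketch below shows one way such a mapping could be applied to a path before it is displayed. `to_host_path` is a hypothetical helper written for this illustration and is not the project's implementation.

```python
from pathlib import PurePosixPath


def to_host_path(container_path: str, mapping: str) -> str:
    """Rewrite a container path to its host equivalent using 'container=host'."""
    container_prefix, sep, host_prefix = mapping.partition("=")
    if not sep or not container_prefix or not host_prefix:
        raise ValueError(f"expected 'container=host', got {mapping!r}")

    path = PurePosixPath(container_path)
    prefix = PurePosixPath(container_prefix)
    # Paths outside the mapped prefix are returned unchanged.
    if not path.is_relative_to(prefix):
        return container_path
    return str(PurePosixPath(host_prefix) / path.relative_to(prefix))


shown = to_host_path("/workspace/myproject/src/app.py", "/workspace=/Users/you")
```

Comparing whole path components rather than raw string prefixes keeps a path such as `/workspace2/file` from being rewritten by a `/workspace` mapping.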
+
+`COCOINDEX_CODE_STATE_DIR` is where repository/worktree metadata, overlay
+policy, layer manifests, and materialized layer sources are stored. Keep it on
+the persistent Docker volume if you want base layers to survive container
+recreation.
+
 ### Build the image locally

 ```bash
@@ -373,6 +510,7 @@ docker build -t cocoindex-code:local -f docker/Dockerfile .

 ## Features
 - **Semantic Code Search**: Find relevant code using natural language queries when grep doesn't work well, and save tokens immediately.
+- **Git Layered Indexing**: Reuse a shared base index across root clones and linked worktrees, then index only branch and dirty deltas. Configure it with `ccc init --base main`; see [Git Layered Indexing](./docs/layered-indexing.md).
 - **Ultra Performant**: ⚡ Built on top of ultra performant [Rust indexing engine](https://github.com/cocoindex-io/cocoindex). Only re-indexes changed files for fast updates.
 - **Multi-Language Support**: Python, JavaScript/TypeScript, Rust, Go, Java, C/C++, C#, SQL, Shell, and more.
 - **Embedded**: Portable and just works, no database setup required!
@@ -493,6 +631,23 @@ def my_chunker(path: Path, content: str) -> tuple[str | None, list[Chunk]]:

 See [`src/cocoindex_code/chunking.py`](./src/cocoindex_code/chunking.py) for the public types and [`tests/example_toml_chunker.py`](./tests/example_toml_chunker.py) for a complete example.

+### Git Layered Indexing Configuration
+
+For Git repositories, `ccc init --base <ref>` stores a repository-level overlay
+policy in daemon state. The checkout-local `settings.yml` still controls file
+matching and chunking, while daemon state controls the shared base ref used by
+root clones and linked worktrees.
+
+```bash
+ccc init --base main
+ccc index
+ccc overlay status
+```
+
+The daemon stores durable layer metadata under `COCOINDEX_CODE_STATE_DIR` and
+uses stable hash IDs, so moving a repository or linked worktree does not
+invalidate reusable base and branch layers. See [Git Layered Indexing](./docs/layered-indexing.md) for details.
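
One way to get IDs that survive a move is to hash only content-derived inputs and leave the checkout path out. The sketch below assumes, hypothetically, that a layer is identified by a resolved commit SHA plus a fingerprint of the indexing settings. This excerpt does not say which inputs the daemon actually hashes, and `layer_id` is an invented name.

```python
import hashlib


def layer_id(commit_sha: str, settings_fingerprint: str) -> str:
    """Derive a path-independent ID from content-derived inputs only."""
    digest = hashlib.sha256()
    for part in (commit_sha, settings_fingerprint):
        # Length-prefix each part so ("ab", "c") and ("a", "bc") cannot collide.
        encoded = part.encode("utf-8")
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()[:16]


# The checkout location is deliberately not an input, so the same commit and
# settings give the same ID wherever the repository or worktree lives.
id_a = layer_id("e6a077f", "settings-v1")
id_b = layer_id("e6a077f", "settings-v1")
id_c = layer_id("2b75ca8", "settings-v1")
```

Here `id_a` and `id_b` are equal while `id_c` differs, which is the property that lets a moved worktree find its existing layers again.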
+
 ## Embedding Models

 With the `[full]` extra installed, `ccc init` defaults to a local SentenceTransformers model ([Snowflake/snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs)) — no API key required. To use a different model, edit `~/.cocoindex_code/global_settings.yml`.

docker/Dockerfile

Lines changed: 5 additions & 4 deletions
@@ -3,15 +3,15 @@
 #
 # Stable layers (reuse across releases — digest reproducible from the RUN
 # command string + base image, so users keep them in local cache):
-# 1. apt install gosu + create coco user
+# 1. apt install gosu + git + create coco user
 # 2. install uv
 # 3. (full only) `uv pip install sentence-transformers` — ~1 GB of torch +
 # transformers. This is the heavy, slow-changing layer we're optimizing
 # around.
 # 4. (full only) pre-bake the default embedding model under
 # /var/cocoindex/cache/... so the named volume's copy-up populates it
 # on first start without a network fetch.
-# 5. writable-path setup (mkdir /var/cocoindex/db + /var/run/cocoindex_code,
+# 5. writable-path setup (mkdir /var/cocoindex/state + /var/cocoindex/db + /var/run/cocoindex_code,
 # chown to coco) + env vars + entrypoint copy.
 #
 # Per-release layers (invalidate when the source tree changes):
@@ -36,7 +36,7 @@
 FROM python:3.12-slim

 RUN apt-get update \
-    && apt-get install -y --no-install-recommends gosu \
+    && apt-get install -y --no-install-recommends git gosu \
     && rm -rf /var/lib/apt/lists/* \
     && groupadd -g 1000 coco \
     && useradd -u 1000 -g 1000 -m coco
@@ -65,14 +65,15 @@ RUN mkdir -p /var/cocoindex/cache/huggingface /var/cocoindex/cache/sentence-tran
 # entrypoint re-chowns to the host user; under root (Docker Desktop
 # default) coco-ownership is harmless since processes run as root and can
 # write anywhere.
-RUN mkdir -p /var/cocoindex/db /var/run/cocoindex_code \
+RUN mkdir -p /var/cocoindex/state /var/cocoindex/db /var/cocoindex/cache/huggingface /var/cocoindex/cache/sentence-transformers /var/run/cocoindex_code \
     && chown -R coco:coco /var/cocoindex /var/run/cocoindex_code

 WORKDIR /workspace

 # Runtime defaults — see the spec for what each does. All overridable at
 # `docker run -e ...` time.
 ENV COCOINDEX_CODE_DIR=/workspace/.cocoindex_code \
+    COCOINDEX_CODE_STATE_DIR=/var/cocoindex/state \
     COCOINDEX_CODE_RUNTIME_DIR=/var/run/cocoindex_code \
     COCOINDEX_CODE_DB_PATH_MAPPING=/workspace=/var/cocoindex/db \
     COCOINDEX_CODE_DAEMON_SUPERVISED=1

docker/docker-compose.yml

Lines changed: 26 additions & 2 deletions
@@ -13,23 +13,47 @@
 # Override the image via COCOINDEX_CODE_IMAGE — for example:
 # COCOINDEX_CODE_IMAGE=cocoindex/cocoindex-code:full docker compose up -d
 # COCOINDEX_CODE_IMAGE=ghcr.io/cocoindex-io/cocoindex-code:latest docker compose up -d
+#
+# Optional knobs:
+# COCOINDEX_CODE_CONTAINER_NAME=my-ccc
+# COCOINDEX_CODE_STATE_DIR=/var/cocoindex/state
+# COCOINDEX_CODE_RUNTIME_DIR=/var/run/cocoindex_code
+# COCOINDEX_CODE_DB_PATH_MAPPING=/workspace=/var/cocoindex/db
+#
+# For Git layered indexing, mount a workspace parent that contains both the
+# root clone and linked worktrees. Keep COCOINDEX_CODE_STATE_DIR on the
+# persistent cocoindex-data volume so base/branch layers survive container
+# recreation. See docs/docker-layered-indexing.md.

 services:
   cocoindex-code:
     image: ${COCOINDEX_CODE_IMAGE:-cocoindex/cocoindex-code:latest}
-    container_name: cocoindex-code
+    container_name: ${COCOINDEX_CODE_CONTAINER_NAME:-cocoindex-code}
     volumes:
       - ${COCOINDEX_HOST_WORKSPACE:-${HOME}}:/workspace
       - cocoindex-data:/var/cocoindex
     environment:
+      COCOINDEX_CODE_STATE_DIR: ${COCOINDEX_CODE_STATE_DIR:-/var/cocoindex/state}
+      COCOINDEX_CODE_RUNTIME_DIR: ${COCOINDEX_CODE_RUNTIME_DIR:-/var/run/cocoindex_code}
+      COCOINDEX_CODE_DB_PATH_MAPPING: ${COCOINDEX_CODE_DB_PATH_MAPPING:-/workspace=/var/cocoindex/db}
       # Makes CLI and MCP output show your real paths
       # (e.g. `/Users/you/myproject/...`) instead of container paths
       # (e.g. `/workspace/myproject/...`).
-      COCOINDEX_CODE_HOST_PATH_MAPPING: /workspace=${COCOINDEX_HOST_WORKSPACE:-${HOME}}
+      COCOINDEX_CODE_HOST_PATH_MAPPING: ${COCOINDEX_CODE_HOST_PATH_MAPPING:-/workspace=${COCOINDEX_HOST_WORKSPACE:-${HOME}}}
       # Linux only: set these so files written to your workspace are owned by
       # you rather than root. Not needed on macOS / Windows — leave empty.
       PUID: ${PUID:-}
       PGID: ${PGID:-}
+    healthcheck:
+      test:
+        [
+          "CMD-SHELL",
+          "ccc daemon status >/dev/null 2>&1 || test -S /var/run/cocoindex_code/daemon.sock",
+        ]
+      interval: 10s
+      timeout: 5s
+      retries: 12
+      start_period: 10s

 volumes:
   cocoindex-data:
