Commit 2e96b45

georgeh0 and claude authored
feat: add ccc grep structural code search (#200)
* feat: add `ccc grep` structural code search

Add a `ccc grep "PATTERN" [DIR/FILE]` subcommand backed by cocoindex's structural `code_match` API. It compiles the pattern once per language, walks the project (honoring configured include/exclude globs and nested .gitignore, or the enclosing git repo when run outside a cocoindex project), and matches files in parallel on a thread pool, streaming each file's results as soon as it completes.

Supports `--lang` and `--path` filters like `ccc search`, and renders matches with colorized line numbers and paths under a TTY.

Extract the shared source-file walking logic (the include/exclude + nested-gitignore matcher and the os.walk-based file iteration) into a new `file_walk` module, now the single source of truth used by the indexer, the daemon's doctor file-walk, and grep.

Bump cocoindex to >=1.0.13 for the locked symmetric pattern syntax used by `code_match`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(grep): normalize display paths to posix and strip CRLF in output

On Windows the directory walk rendered paths with backslashes (e.g. `sub\b.py`) and CRLF files left a trailing `\r` on every rendered code line, which broke two tests. Normalize all display paths via `Path.as_posix()` (matching the indexer and `ccc search`) and strip the trailing `\r` when splitting source into lines. Add a CRLF rendering regression test that runs on all platforms.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(grep): assert posix display path in path_glob test

Completes the Windows path fix: test_grep_path_glob compared fm.path (now normalized to posix) against a str()-built path, which renders with backslashes on Windows. Use as_posix() so the assertion is platform independent, matching the other display-path assertions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
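The Windows fix in the second commit can be illustrated with a short sketch. The helper names `display_path` and `split_source_lines` are hypothetical and do not come from the commit; only the use of `as_posix()` and the stripping of a trailing `\r` are taken from the message above. `PureWindowsPath` is used so the backslash case can be reproduced on any platform.

```python
from pathlib import PureWindowsPath


def display_path(rel_path: PureWindowsPath) -> str:
    """Render a relative path with forward slashes regardless of platform."""
    return rel_path.as_posix()


def split_source_lines(source: str) -> list[str]:
    """Split on newlines and drop the carriage return that CRLF files leave behind."""
    return [line.rstrip("\r") for line in source.split("\n")]


if __name__ == "__main__":
    print(display_path(PureWindowsPath("sub\\b.py")))
    print(split_source_lines("a = 1\r\nb = 2"))
```

Normalizing at the display layer keeps test assertions platform independent, which is what the third commit relies on.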
1 parent a18a0ee commit 2e96b45

9 files changed

Lines changed: 1115 additions & 143 deletions

README.md

Lines changed: 31 additions & 0 deletions
@@ -180,6 +180,7 @@ The background daemon starts automatically on first use.
 | `ccc init` | Initialize a project — creates settings files, adds `.cocoindex_code/` to `.gitignore` |
 | `ccc index` | Build or update the index (auto-inits if needed). Shows streaming progress. |
 | `ccc search <query>` | Semantic search across the codebase |
+| `ccc grep <pattern> [path]` | Structural code search by example (no index needed) |
 | `ccc status` | Show index stats (chunk count, file count, language breakdown) |
 | `ccc mcp` | Run as MCP server in stdio mode |
 | `ccc doctor` | Run diagnostics — checks settings, daemon, model, file matching, and index health |
@@ -200,6 +201,36 @@ ccc search --refresh database schema # update index first, then

 By default, `ccc search` scopes results to your current working directory (relative to the project root). Use `--path` to override.

+### Structural Search (`ccc grep`)
+
+`ccc grep` finds code by **structure**, not text — you write a by-example pattern
+and it matches the syntax tree (via cocoindex's `code_match`), so formatting,
+whitespace, and intervening tokens don't matter. It runs entirely locally: no
+index, daemon, or embeddings required.
+
+```bash
+ccc grep 'def \NAME(\(ARGS*\)):' # every Python function def under the cwd
+ccc grep 'foo(\(ARGS*\))' src/ # calls to foo(...) anywhere under src/
+ccc grep 'fn \NAME(\(A*\))' --lang rust # restrict to one language
+ccc grep 'class \NAME:' --path 'tests/**' # restrict to a path glob
+ccc grep 'TODO(\(A*\))' path/to/file.py # a single file
+```
+
+Metavariables use the `\` sigil: `\NAME` captures one node, `\(NAME*\)` a run of
+siblings, `\_`/`\*` match anonymously. The pattern is matched per language, so a
+single invocation scans every supported source file (others are skipped). Inside
+an initialized project, `ccc grep` honors the project's include/exclude patterns
+and `.gitignore`; otherwise it scans all supported source files under the path.
+
+Results stream to the terminal file-by-file as each match is found (in completion
+order, since files are matched in parallel) rather than all at once at the end.
+Each matching file shows its matched line range; under a TTY the path is colored,
+line numbers are dimmed, and the unmatched context around a match is dimmed so the
+match stands out.
+
+> **Note:** `ccc grep` relies on cocoindex's structural `code_match` feature.
+> Until it ships in a released cocoindex, run against a local cocoindex build.
+
 ## Docker

 A Docker image is available for teams who want a reproducible, dependency-free
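The streaming behavior that the new README section describes (files matched in parallel, results emitted per file in completion order) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the commit's `grep` module: `run_streaming`, `make_emitter`, and the `match_file` callback are hypothetical names, and the matcher is a stand-in for cocoindex's `code_match`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def make_emitter(write):
    """Return an emit(path, hits) callback that is safe to call from many threads."""
    lock = threading.Lock()
    count = 0

    def emit(path, hits):
        nonlocal count
        block = "\n".join([path, *hits])  # render outside the lock
        with lock:
            if count:
                write("")  # blank line between files
            write(block)
            count += 1

    return emit


def run_streaming(files, match_file, emit, max_workers=4):
    """Match files on a thread pool; each worker emits its own file's hits."""

    def _work(path):
        hits = match_file(path)  # hypothetical stand-in for structural matching
        if hits:
            emit(path, hits)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so a worker exception is re-raised here.
        list(pool.map(_work, files))


if __name__ == "__main__":
    fake_hits = {"a.py": ["1: def f():"], "b.py": [], "c.py": ["3: def g():"]}
    run_streaming(list(fake_hits), lambda p: fake_hits[p], make_emitter(print))
```

Because each worker emits as soon as its file is done, output order depends on completion order rather than on the order files were submitted.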

pyproject.toml

Lines changed: 5 additions & 5 deletions
@@ -23,7 +23,7 @@ classifiers = [

 dependencies = [
     "mcp>=1.0.0",
-    "cocoindex[litellm]>=1.0.6,<1.1.0",
+    "cocoindex[litellm]>=1.0.13,<1.1.0",
     "sqlite-vec>=0.1.0",
     "pydantic>=2.0.0",
     "numpy>=1.24.0",
@@ -39,23 +39,23 @@ dependencies = [
 # `embeddings-local` is the primary feature extra: it pulls in
 # `sentence-transformers` (via cocoindex) so local embeddings work without
 # an API key.
-embeddings-local = ["cocoindex[sentence-transformers]>=1.0.6,<1.1.0"]
+embeddings-local = ["cocoindex[sentence-transformers]>=1.0.13,<1.1.0"]
 # `full` is the umbrella "batteries-included" alias. Today it's just
 # `embeddings-local`, but we expect to bundle more optional niceties under
 # it over time — users who want everything can keep using `[full]` and pick
 # up the additions automatically. The name also matches the Docker
 # `:full` image variant for consistency across install paths. Contents are
 # inlined rather than self-referencing `cocoindex-code[embeddings-local]`
 # to avoid resolver edge cases with older pip.
-full = ["cocoindex[sentence-transformers]>=1.0.6,<1.1.0"]
+full = ["cocoindex[sentence-transformers]>=1.0.13,<1.1.0"]
 dev = [
     "pytest>=7.0.0",
     "pytest-asyncio>=0.21.0",
     "pytest-cov>=4.0.0",
     "ruff>=0.1.0",
     "mypy>=1.0.0",
     "prek>=0.1.0",
-    "cocoindex[sentence-transformers]>=1.0.6,<1.1.0",
+    "cocoindex[sentence-transformers]>=1.0.13,<1.1.0",
 ]

 [project.scripts]
@@ -89,7 +89,7 @@ dev = [
     "mypy>=1.0.0",
     "prek>=0.1.0",
     "types-pyyaml>=6.0.12.20250915",
-    "cocoindex[sentence-transformers]>=1.0.7,<1.1.0",
+    "cocoindex[sentence-transformers]>=1.0.13,<1.1.0",
 ]

 [tool.ruff]

src/cocoindex_code/cli.py

Lines changed: 75 additions & 0 deletions
@@ -5,13 +5,15 @@
 import functools
 import os
 import sys
+import threading
 from collections.abc import Callable
 from pathlib import Path
 from typing import TYPE_CHECKING, TypeVar

 import typer as _typer

 if TYPE_CHECKING:
+    from .grep import FileMatches, GrepWarning
     from .protocol import (
         DoctorCheckResult,
         IndexingProgress,
@@ -657,6 +659,79 @@ def search(
     print_search_results(resp)


+@app.command()
+def grep(
+    pattern: str = _typer.Argument(
+        ...,
+        help=r"By-example structural pattern; use \ for metavariables, "
+        r"e.g. 'def \NAME(\(ARGS*\)):' or 'foo(\(ARGS*\))'.",
+    ),
+    path: str = _typer.Argument(
+        ".", help="File or directory to search (default: current directory)."
+    ),
+    lang: list[str] = _typer.Option(
+        [], "--lang", help="Only match files of these languages (e.g. python, rust, cpp)."
+    ),
+    path_glob: str | None = _typer.Option(
+        None, "--path", help="Only match files whose path matches this glob (globset syntax)."
+    ),
+    no_color: bool = _typer.Option(False, "--no-color", help="Disable colored output."),
+) -> None:
+    r"""Structurally grep code by example (no index or daemon required).
+
+    Compiles the pattern per language and matches every supported source file
+    under PATH in parallel. Inside an initialized project it honors the project's
+    include/exclude and .gitignore rules; otherwise it scans all supported source
+    files.
+    """
+    from . import grep as _grep
+
+    target = Path(path)
+    if not target.exists():
+        _typer.echo(f"Error: path not found: {path}", err=True)
+        raise _typer.Exit(code=1)
+
+    req = _grep.GrepRequest(
+        pattern=pattern,
+        root=target,
+        languages=frozenset(lang_name.lower() for lang_name in lang) or None,
+        path_glob=path_glob,
+    )
+    use_color = not no_color and sys.stdout.isatty() and not os.environ.get("NO_COLOR")
+    grep_run = _grep.Grep(req)
+    matched = 0
+    # `run` calls `_emit` from several worker threads at once; the lock keeps one
+    # file's output (and the `matched` bookkeeping) from interleaving with another's.
+    output_lock = threading.Lock()
+
+    def _emit(item: FileMatches | GrepWarning) -> None:
+        nonlocal matched
+        if isinstance(item, _grep.GrepWarning):
+            with output_lock:
+                _typer.echo(f"warning: {item.message}", err=True)
+            return
+        block = _grep.render_file(item, color=use_color)  # render outside the lock
+        with output_lock:
+            if matched:
+                _typer.echo()  # blank line between files
+            _typer.echo(block)
+            matched += 1
+
+    grep_run.run(_emit)
+
+    # The "unusable everywhere" verdict needs the whole walk, so it's known only
+    # once the run is done — report it before exiting.
+    if grep_run.unusable:
+        langs = ", ".join(grep_run.failed_languages)
+        _typer.echo(
+            f"Error: the pattern did not compile for any of the languages found ({langs}).",
+            err=True,
+        )
+        raise _typer.Exit(code=1)
+    if matched == 0:
+        _typer.echo("No matches found.")
+
+
 @app.command()
 @_catch_daemon_start_error
 def status() -> None:
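The color decision in `grep` combines three signals: the `--no-color` flag, whether stdout is a TTY, and the `NO_COLOR` environment variable. A small sketch of the same condition as a pure function makes the truth table easy to check. The name `should_use_color` and its parameters are hypothetical; the commit computes this inline.

```python
from __future__ import annotations

import os
import sys
from collections.abc import Mapping


def should_use_color(
    no_color_flag: bool,
    is_tty: bool,
    env: Mapping[str, str],
) -> bool:
    """Color only when not disabled by flag, stdout is a TTY, and NO_COLOR is unset or empty."""
    return not no_color_flag and is_tty and not env.get("NO_COLOR")


if __name__ == "__main__":
    # Mirrors the inline expression in the diff, evaluated for the current process.
    print(should_use_color(False, sys.stdout.isatty(), os.environ))
```

Note that an empty `NO_COLOR=""` counts as unset here, because the check tests the value's truthiness rather than the variable's presence.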

src/cocoindex_code/daemon.py

Lines changed: 3 additions & 10 deletions
@@ -372,23 +372,16 @@ async def _check_file_walk(project_root_str: str) -> DoctorCheckResult:
     """Walk project files and report counts + gitignore paths."""
     from pathlib import PurePath

-    from cocoindex.resources.file import PatternFilePathMatcher
-
-    from .indexer import GitignoreAwareMatcher
-    from .settings import load_gitignore_spec, load_project_settings
+    from .file_walk import build_matcher
+    from .settings import load_project_settings

     project_root = Path(project_root_str)
     try:
         ps = load_project_settings(project_root)
     except FileNotFoundError as e:
         return DoctorCheckResult(name="File Walk", ok=False, details=[], errors=[str(e)])

-    gitignore_spec = load_gitignore_spec(project_root)
-    base_matcher = PatternFilePathMatcher(
-        included_patterns=ps.include_patterns,
-        excluded_patterns=ps.exclude_patterns,
-    )
-    matcher = GitignoreAwareMatcher(base_matcher, gitignore_spec, project_root)
+    matcher = build_matcher(project_root, ps.include_patterns, ps.exclude_patterns)

     counts_by_ext: dict[str, int] = {}
     gitignore_dirs: list[str] = []
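With matcher construction moved into `file_walk`, the doctor check and `ccc grep` share one walk over one matcher. The pruning idea behind that walk can be sketched with plain callables standing in for the matcher. This is an assumption-laden illustration: `walk_included`, `dir_ok`, and `file_ok` are hypothetical names, and the real code takes a cocoindex `FilePathMatcher` instead.

```python
import os
from collections.abc import Callable, Iterator
from pathlib import Path, PurePath


def walk_included(
    start: Path,
    base: Path,
    dir_ok: Callable[[PurePath], bool],
    file_ok: Callable[[PurePath], bool],
) -> Iterator[PurePath]:
    """Yield base-relative paths of accepted files, skipping rejected directories."""
    for dirpath_str, dirnames, filenames in os.walk(start):
        rel_dir = PurePath(Path(dirpath_str).relative_to(base))
        if rel_dir != PurePath(".") and not dir_ok(rel_dir):
            dirnames.clear()  # editing the list in place stops os.walk descending
            continue
        dirnames.sort()  # visit subdirectories in a deterministic order
        for fname in sorted(filenames):
            rel_path = rel_dir / fname
            if file_ok(rel_path):
                yield rel_path
```

Pruning at the directory level means an excluded tree such as a dependency folder is never listed at all, which matters more for speed than filtering its files one by one.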

src/cocoindex_code/file_walk.py

Lines changed: 176 additions & 0 deletions
@@ -0,0 +1,176 @@
+"""Shared source-file walking: pattern + .gitignore matching, reused by the
+indexer, the daemon's doctor file-walk, and ``ccc grep``.
+
+The matcher (include/exclude globs + nested ``.gitignore`` awareness) is the
+single source of truth for "which files count as part of the project". The
+indexer feeds it to CocoIndex's incremental file source; the daemon and ``ccc
+grep`` drive a plain :func:`os.walk` over it via :func:`iter_included_files`.
+"""
+
+from __future__ import annotations
+
+import os
+from collections.abc import Iterable, Iterator
+from pathlib import Path, PurePath
+
+from cocoindex.resources.file import FilePathMatcher, PatternFilePathMatcher
+from pathspec import GitIgnoreSpec
+
+from .settings import load_gitignore_spec
+
+
+def _normalize_gitignore_lines(lines: Iterable[str], directory: PurePath) -> list[str]:
+    """Normalize .gitignore lines to root-relative gitignore patterns."""
+    if directory in (PurePath("."), PurePath("")):
+        prefix = ""
+    else:
+        prefix = f"{directory.as_posix().rstrip('/')}/"
+
+    normalized: list[str] = []
+    for raw_line in lines:
+        line = raw_line.rstrip("\n\r")
+        if not line:
+            continue
+        stripped = line.lstrip()
+        if not stripped or stripped.startswith("#"):
+            continue
+        if line.startswith("\\#") or line.startswith("\\!"):
+            line = line[1:]
+        negated = line.startswith("!")
+        if negated:
+            line = line[1:]
+        body = line.strip()
+        if not body:
+            continue
+        anchor = body.startswith("/")
+        if anchor:
+            body = body.lstrip("/")
+            pattern = f"{prefix}{body}" if prefix else body
+        else:
+            contains_slash = "/" in body
+            base = prefix
+            if contains_slash:
+                pattern = f"{base}{body}"
+            else:
+                if base:
+                    pattern = f"{base}**/{body}"
+                else:
+                    pattern = f"**/{body}"
+        if negated:
+            pattern = f"!{pattern}"
+        normalized.append(pattern)
+    return normalized
+
+
+class GitignoreAwareMatcher(FilePathMatcher):
+    """Wraps another matcher and applies .gitignore filtering."""
+
+    def __init__(
+        self,
+        delegate: FilePathMatcher,
+        root_spec: GitIgnoreSpec | None,
+        project_root: Path,
+    ) -> None:
+        self._delegate = delegate
+        self._root = project_root
+        self._spec_cache: dict[PurePath, GitIgnoreSpec | None] = {PurePath("."): root_spec}
+
+    def _spec_for(self, directory: PurePath) -> GitIgnoreSpec | None:
+        if directory in self._spec_cache:
+            return self._spec_cache[directory]
+
+        parent_dir = directory.parent if directory != PurePath(".") else PurePath(".")
+        parent_spec = self._spec_for(parent_dir)
+        spec = parent_spec
+
+        gitignore_path = (self._root / directory) / ".gitignore"
+        if gitignore_path.is_file():
+            try:
+                lines = gitignore_path.read_text().splitlines()
+            except (OSError, UnicodeDecodeError):
+                lines = []
+            normalized = _normalize_gitignore_lines(lines, directory)
+            if normalized:
+                new_spec = GitIgnoreSpec.from_lines(normalized)
+                spec = new_spec if spec is None else spec + new_spec
+
+        self._spec_cache[directory] = spec
+        return spec
+
+    def _is_ignored(self, path: PurePath, is_dir: bool) -> bool:
+        directory = path if is_dir else path.parent
+        if directory == PurePath(""):
+            directory = PurePath(".")
+        spec = self._spec_for(directory)
+        if spec is None:
+            return False
+        match_path = path.as_posix()
+        if is_dir and not match_path.endswith("/"):
+            match_path = f"{match_path}/"
+        return spec.match_file(match_path)
+
+    def is_dir_included(self, path: PurePath) -> bool:
+        if self._is_ignored(path, True):
+            return False
+        return self._delegate.is_dir_included(path)
+
+    def is_file_included(self, path: PurePath) -> bool:
+        if self._is_ignored(path, False):
+            return False
+        return self._delegate.is_file_included(path)
+
+
+def find_git_root(start: Path) -> Path | None:
+    """Walk up from ``start`` to the nearest directory holding a ``.git`` entry — a
+    directory for a normal repo, or a *file* for a submodule or linked worktree.
+    Returns that directory, or ``None`` if ``start`` is not inside a git repo.
+
+    Used to anchor ``.gitignore`` resolution at the real repo root when grepping a
+    subdirectory that isn't inside an initialized cocoindex project."""
+    current = start.resolve()
+    while True:
+        if (current / ".git").exists():
+            return current
+        if current.parent == current:
+            return None
+        current = current.parent
+
+
+def build_matcher(
+    project_root: Path,
+    included_patterns: list[str],
+    excluded_patterns: list[str],
+) -> FilePathMatcher:
+    """Build the project's file matcher: include/exclude globs plus nested
+    ``.gitignore`` awareness anchored at ``project_root``."""
+    base_matcher = PatternFilePathMatcher(
+        included_patterns=included_patterns,
+        excluded_patterns=excluded_patterns,
+    )
+    return GitignoreAwareMatcher(base_matcher, load_gitignore_spec(project_root), project_root)
+
+
+def iter_included_files(
+    start: Path,
+    base: Path,
+    matcher: FilePathMatcher,
+) -> Iterator[tuple[Path, PurePath]]:
+    """Walk ``start`` recursively, yielding ``(absolute_path, path_relative_to_base)``
+    for every file ``matcher`` includes, pruning excluded directories.
+
+    ``base`` anchors the relative paths the matcher sees (the project root, so
+    its patterns line up); ``start`` is where traversal begins and may be a
+    subdirectory of ``base``. Both must be absolute. Traversal is deterministic
+    (directories and files are visited in sorted order).
+    """
+    for dirpath_str, dirnames, filenames in os.walk(start):
+        dirpath = Path(dirpath_str)
+        rel_dir = PurePath(dirpath.relative_to(base))
+        if rel_dir != PurePath(".") and not matcher.is_dir_included(rel_dir):
+            dirnames.clear()
+            continue
+        dirnames.sort()
+        for fname in sorted(filenames):
+            rel_path = rel_dir / fname if rel_dir != PurePath(".") else PurePath(fname)
+            if matcher.is_file_included(rel_path):
+                yield dirpath / fname, rel_path
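To see what `_normalize_gitignore_lines` does to patterns from a nested `.gitignore`, here is a simplified, single-line sketch using only the standard library. It is not the commit's function: `rebase_pattern` is a hypothetical name, and it deliberately omits the escaped `\#` and `\!` handling. The three branches (anchored, contains a slash, bare name) follow the logic in the diff above.

```python
from __future__ import annotations

from pathlib import PurePath


def rebase_pattern(line: str, directory: PurePath) -> str | None:
    """Rewrite one .gitignore line found in `directory` as a root-relative pattern."""
    prefix = "" if directory == PurePath(".") else f"{directory.as_posix()}/"
    body = line.strip()
    if not body or body.startswith("#"):
        return None  # blank lines and comments carry no pattern
    negated = body.startswith("!")
    if negated:
        body = body[1:].strip()
        if not body:
            return None
    if body.startswith("/"):
        pattern = prefix + body.lstrip("/")  # anchored to the .gitignore's directory
    elif "/" in body:
        pattern = prefix + body  # a slash makes it relative to that directory
    else:
        pattern = f"{prefix}**/{body}"  # bare name: match at any depth below
    return f"!{pattern}" if negated else pattern


if __name__ == "__main__":
    for raw in ["*.log", "/dist", "!keep/me.txt", "# comment"]:
        print(repr(raw), "->", rebase_pattern(raw, PurePath("sub")))
```

Rebasing every pattern onto the project root is what lets specs from parent and child directories be combined into a single spec and matched against root-relative paths.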
