Commit 2e479e4

docs: document eval harness and evaluation guardrails
- add README section for npm run eval (multi-codebase, offline smoke)
- capture eval harness + fixture details in capabilities and changelog
- update motivation to call out frozen eval/regression guardrails
1 parent ec1fe76 commit 2e479e4

4 files changed, +70 -25 lines changed

CHANGELOG.md

Lines changed: 13 additions & 0 deletions
@@ -1,5 +1,18 @@
 # Changelog
 
+## [Unreleased]
+
+### Added
+
+- Multi-codebase eval runner: `npm run eval -- <codebaseA> <codebaseB>` with per-codebase reports and combined summary.
+- Shared eval scoring/reporting module (`src/eval/*`) used by both the CLI runner and the test suite.
+- Second frozen eval fixture plus an in-repo controlled TypeScript codebase for fully-offline eval runs.
+- Regression tests covering Tree-sitter Unicode slicing, parser cleanup/reset behavior, and large/generated file skipping.
+
+### Fixed
+
+- Tree-sitter symbol extraction now treats node offsets as UTF-8 byte ranges and evicts cached parsers on failures/timeouts.
+
 ## [1.6.2] - 2026-02-17
 
 Stripped it down for token efficiency, moved CLI code out of the protocol layer, and cleared structural debt.
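
The "Fixed" entry in this hunk says Tree-sitter node offsets are now treated as UTF-8 byte ranges. The sketch below illustrates why that distinction matters; it is not the project's implementation. `ByteRange` and `sliceByBytes` are hypothetical names, and the sketch assumes the parser reports `startIndex`/`endIndex` as byte offsets into the UTF-8 encoding of the source. JavaScript strings are indexed by UTF-16 code units, so a plain `String.prototype.slice` drifts as soon as a multi-byte character appears before the node.

```typescript
// Hypothetical node range; assumed to hold UTF-8 byte offsets.
interface ByteRange {
  startIndex: number;
  endIndex: number;
}

const utf8Encoder = new TextEncoder();
const utf8Decoder = new TextDecoder("utf-8");

// Slice `source` by UTF-8 byte offsets instead of UTF-16 code units.
function sliceByBytes(source: string, range: ByteRange): string {
  const bytes = utf8Encoder.encode(source);
  return utf8Decoder.decode(bytes.subarray(range.startIndex, range.endIndex));
}

// "é" is one UTF-16 code unit but two UTF-8 bytes, so every offset after it
// differs by one between the two indexing schemes.
const sampleSource = "const café = 1;\nfunction greet() {}";
const symbolName = "greet";

// Build the byte range a byte-oriented parser would report for `greet`.
const prefix = sampleSource.slice(0, sampleSource.indexOf(symbolName));
const byteStart = utf8Encoder.encode(prefix).length;
const symbolRange: ByteRange = {
  startIndex: byteStart,
  endIndex: byteStart + utf8Encoder.encode(symbolName).length,
};

const byteAware = sliceByBytes(sampleSource, symbolRange);
const naive = sampleSource.slice(symbolRange.startIndex, symbolRange.endIndex);

console.log(byteAware); // "greet"
console.log(naive); // "reet(" (shifted by the extra byte of "é")
```

Re-encoding the whole source on every call is wasteful for large files; a real extractor would more likely encode once per file and reuse the byte array for every node.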

MOTIVATION.md

Lines changed: 4 additions & 0 deletions
@@ -51,6 +51,10 @@ Correct the agent once. Record the decision. From then on, it surfaces in search
 
 Before an edit, the agent gets a curated "preflight" check from three sources (code, patterns, memories). If evidence is thin or contradictory, the response tells the AI Agent to look for more evidence with a concrete next step. This is the difference between "confident assumption" and "informed decision."
 
+### Guardrails via frozen eval + regressions
+
+When retrieval quality silently degrades (Unicode slicing bugs, large generated files, parser failures), agents still produce confident output — just with worse evidence. Shipping frozen eval fixtures plus regression tests makes these failures measurable and blocks "fix the tests" style metric gaming.
+
 ## Key Design Decisions
 
 1. **Fewer tools, richer responses.** 10 tools instead of 50. One search call that aggregates everything.

README.md

Lines changed: 28 additions & 11 deletions
@@ -165,18 +165,35 @@ Record a decision once. It surfaces automatically in search results and prefligh
 
 ### All Tools
 
-| Tool | What it does |
-| ------------------------------ | -------------------------------------------------------------------------------- |
+| Tool | What it does |
+| ------------------------------ | ----------------------------------------------------------------------------------------- |
 | `search_codebase` | Hybrid search with enrichment + preflight. Pass `intent="edit"` for edit readiness check. |
-| `get_team_patterns` | Pattern frequencies, golden files, conflict detection |
-| `get_component_usage` | "Find Usages" - where a library or component is imported |
-| `remember` | Record a convention, decision, gotcha, or failure |
-| `get_memory` | Query team memory with confidence decay scoring |
-| `get_codebase_metadata` | Project structure, frameworks, dependencies |
-| `get_style_guide` | Style guide rules for the current project |
-| `detect_circular_dependencies` | Import cycles between files |
-| `refresh_index` | Re-index (full or incremental) + extract git memories |
-| `get_indexing_status` | Progress and stats for the current index |
+| `get_team_patterns` | Pattern frequencies, golden files, conflict detection |
+| `get_component_usage` | "Find Usages" - where a library or component is imported |
+| `remember` | Record a convention, decision, gotcha, or failure |
+| `get_memory` | Query team memory with confidence decay scoring |
+| `get_codebase_metadata` | Project structure, frameworks, dependencies |
+| `get_style_guide` | Style guide rules for the current project |
+| `detect_circular_dependencies` | Import cycles between files |
+| `refresh_index` | Re-index (full or incremental) + extract git memories |
+| `get_indexing_status` | Progress and stats for the current index |
+
+## Evaluation Harness (`npm run eval`)
+
+Reproducible evaluation with frozen fixtures so ranking/chunking changes are measured honestly and regressions get caught.
+
+- Two codebases: `npm run eval -- <codebaseA> <codebaseB>`
+- Defaults: fixture A = `tests/fixtures/eval-angular-spotify.json`, fixture B = `tests/fixtures/eval-controlled.json`
+- Offline smoke (no network):
+
+```bash
+npm run eval -- tests/fixtures/codebases/eval-controlled tests/fixtures/codebases/eval-controlled \
+  --fixture-a=tests/fixtures/eval-controlled.json \
+  --fixture-b=tests/fixtures/eval-controlled.json \
+  --skip-reindex --no-rerank
+```
+
+- Flags: `--help`, `--fixture-a`, `--fixture-b`, `--skip-reindex`, `--no-rerank`, `--no-redact`
 
 ## How the Search Works
 

docs/capabilities.md

Lines changed: 25 additions & 14 deletions
@@ -8,23 +8,23 @@ Technical reference for what `codebase-context` ships today. For the user-facing
 
 ### Core Tools
 
-| Tool | Input | Output |
-| --- | --- | --- |
-| `search_codebase` | `query`, optional `intent`, `limit`, `filters`, `includeSnippets` | Ranked results (`file`, `summary`, `score`, `type`, `trend`, `patternWarning`) + `searchQuality` (with `hint` when low confidence) + `preflight` ({ready, reason}). Snippets opt-in. |
-| `get_team_patterns` | optional `category` | Pattern frequencies, trends, golden files, conflicts |
-| `get_component_usage` | `name` (import source) | Files importing the given package/module |
-| `remember` | `type`, `category`, `memory`, `reason` | Persists to `.codebase-context/memory.json` |
-| `get_memory` | optional `category`, `type`, `query`, `limit` | Memories with confidence decay scoring |
+| Tool | Input | Output |
+| --------------------- | ----------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `search_codebase` | `query`, optional `intent`, `limit`, `filters`, `includeSnippets` | Ranked results (`file`, `summary`, `score`, `type`, `trend`, `patternWarning`) + `searchQuality` (with `hint` when low confidence) + `preflight` ({ready, reason}). Snippets opt-in. |
+| `get_team_patterns` | optional `category` | Pattern frequencies, trends, golden files, conflicts |
+| `get_component_usage` | `name` (import source) | Files importing the given package/module |
+| `remember` | `type`, `category`, `memory`, `reason` | Persists to `.codebase-context/memory.json` |
+| `get_memory` | optional `category`, `type`, `query`, `limit` | Memories with confidence decay scoring |
 
 ### Utility Tools
 
-| Tool | Purpose |
-| --- | --- |
-| `get_codebase_metadata` | Framework, dependencies, project stats |
-| `get_style_guide` | Style rules from project documentation |
-| `detect_circular_dependencies` | Import cycles in the file graph |
-| `refresh_index` | Full or incremental re-index + git memory extraction |
-| `get_indexing_status` | Index state, progress, last stats |
+| Tool | Purpose |
+| ------------------------------ | ---------------------------------------------------- |
+| `get_codebase_metadata` | Framework, dependencies, project stats |
+| `get_style_guide` | Style rules from project documentation |
+| `detect_circular_dependencies` | Import cycles in the file graph |
+| `refresh_index` | Full or incremental re-index + git memory extraction |
+| `get_indexing_status` | Index state, progress, last stats |
 
 ## Retrieval Pipeline
 
@@ -90,3 +90,14 @@ Output: `{ ready: boolean, reason?: string }`
 
 - **Angular**: signals, standalone components, control flow syntax, lifecycle hooks, DI patterns, component metadata
 - **Generic**: 30+ languages — TypeScript, JavaScript, Python, Java, Kotlin, C/C++, C#, Go, Rust, PHP, Ruby, Swift, Scala, Shell, config/markup formats
+
+## Evaluation Harness
+
+Reproducible evaluation is shipped as a CLI entrypoint backed by shared scoring/reporting code.
+
+- **Command:** `npm run eval -- <codebaseA> <codebaseB>` (builds first, then runs `scripts/run-eval.mjs`)
+- **Shared implementation:** `src/eval/harness.ts` + `src/eval/types.ts` (tests and CLI use the same scoring)
+- **Frozen fixtures:**
+  - `tests/fixtures/eval-angular-spotify.json` (real-world)
+  - `tests/fixtures/eval-controlled.json` + `tests/fixtures/codebases/eval-controlled/` (offline controlled)
+- **Reported metrics:** Top-1 accuracy, Top-3 recall, spec contamination rate, and a gate pass/fail
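
The reported metrics named in this hunk can be read as simple ratios over the fixture's queries. The sketch below shows one plausible way to compute them; it does not reproduce `src/eval/harness.ts`, which this diff does not show. Every name here (`EvalQueryResult`, `summarizeEval`, `looksLikeSpecFile`) is hypothetical. It assumes that "spec contamination" means a test/spec file appearing in the top three results, and that the gate is a minimum Top-3 recall. The 0.7 threshold is an arbitrary placeholder.

```typescript
// Hypothetical per-query record: the file the fixture expects and the
// files the search returned, best match first.
interface EvalQueryResult {
  expectedFile: string;
  rankedFiles: string[];
}

interface EvalSummary {
  top1Accuracy: number;
  top3Recall: number;
  specContaminationRate: number;
  gatePassed: boolean;
}

// Assumption: spec files are recognised by a .spec/.test filename suffix.
function looksLikeSpecFile(file: string): boolean {
  return /\.(spec|test)\.[jt]sx?$/.test(file);
}

function summarizeEval(results: EvalQueryResult[], minTop3Recall = 0.7): EvalSummary {
  const total = results.length;
  if (total === 0) {
    // No queries means nothing was measured, so the gate fails.
    return { top1Accuracy: 0, top3Recall: 0, specContaminationRate: 0, gatePassed: false };
  }
  let top1Hits = 0;
  let top3Hits = 0;
  let contaminated = 0;
  for (const result of results) {
    const topThree = result.rankedFiles.slice(0, 3);
    if (topThree[0] === result.expectedFile) top1Hits++;
    if (topThree.indexOf(result.expectedFile) !== -1) top3Hits++;
    if (topThree.some(looksLikeSpecFile)) contaminated++;
  }
  const top3Recall = top3Hits / total;
  return {
    top1Accuracy: top1Hits / total,
    top3Recall,
    specContaminationRate: contaminated / total,
    gatePassed: top3Recall >= minTop3Recall,
  };
}

// Made-up sample data, not taken from either frozen fixture.
const sampleResults: EvalQueryResult[] = [
  { expectedFile: "a.ts", rankedFiles: ["a.ts", "b.ts", "c.ts"] },
  { expectedFile: "b.ts", rankedFiles: ["a.spec.ts", "b.ts", "c.ts"] },
  { expectedFile: "c.ts", rankedFiles: ["a.ts", "b.ts", "d.ts"] },
  { expectedFile: "d.ts", rankedFiles: ["d.ts"] },
];

const evalSummary = summarizeEval(sampleResults);
console.log(evalSummary);
// { top1Accuracy: 0.5, top3Recall: 0.75, specContaminationRate: 0.25, gatePassed: true }
```

Keeping the scoring in one pure function is what would let the CLI runner and the test suite share it, which is the property the "Shared implementation" bullet describes.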
