docs: document eval harness and evaluation guardrails

PatrickSys · PatrickSys · commit 2e479e4fe5bd · 2026-02-20T20:15:54.000+01:00
- add README section for npm run eval (multi-codebase, offline smoke)
- capture eval harness + fixture details in capabilities and changelog
- update motivation to call out frozen eval/regression guardrails
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,18 @@
 # Changelog
 
+## [Unreleased]
+
+### Added
+
+- Multi-codebase eval runner: `npm run eval -- <codebaseA> <codebaseB>` with per-codebase reports and combined summary.
+- Shared eval scoring/reporting module (`src/eval/*`) used by both the CLI runner and the test suite.
+- Second frozen eval fixture plus an in-repo controlled TypeScript codebase for fully-offline eval runs.
+- Regression tests covering Tree-sitter Unicode slicing, parser cleanup/reset behavior, and large/generated file skipping.
+
+### Fixed
+
+- Tree-sitter symbol extraction now treats node offsets as UTF-8 byte ranges and evicts cached parsers on failures/timeouts.
+
 ## [1.6.2] - 2026-02-17
 
 Stripped it down for token efficiency, moved CLI code out of the protocol layer, and cleared structural debt.
diff --git a/MOTIVATION.md b/MOTIVATION.md
@@ -51,6 +51,10 @@ Correct the agent once. Record the decision. From then on, it surfaces in search
 
 Before an edit, the agent gets a curated "preflight" check from three sources (code, patterns, memories). If evidence is thin or contradictory, the response tells the AI Agent to look for more evidence with a concrete next step. This is the difference between "confident assumption" and "informed decision."
 
+### Guardrails via frozen eval + regressions
+
+When retrieval quality silently degrades (Unicode slicing bugs, large generated files, parser failures), agents still produce confident output, just with worse evidence. Shipping frozen eval fixtures plus regression tests makes these failures measurable and discourages gaming the metrics by "fixing the tests" instead of the retrieval.
+
 ## Key Design Decisions
 
 1. **Fewer tools, richer responses.** 10 tools instead of 50. One search call that aggregates everything.
diff --git a/README.md b/README.md
@@ -165,18 +165,35 @@ Record a decision once. It surfaces automatically in search results and prefligh
 
 ### All Tools
 
-| Tool                           | What it does                                                                     |
-| ------------------------------ | -------------------------------------------------------------------------------- |
+| Tool                           | What it does                                                                              |
+| ------------------------------ | ----------------------------------------------------------------------------------------- |
 | `search_codebase`              | Hybrid search with enrichment + preflight. Pass `intent="edit"` for edit readiness check. |
-| `get_team_patterns`            | Pattern frequencies, golden files, conflict detection                            |
-| `get_component_usage`          | "Find Usages" - where a library or component is imported                         |
-| `remember`                     | Record a convention, decision, gotcha, or failure                                |
-| `get_memory`                   | Query team memory with confidence decay scoring                                  |
-| `get_codebase_metadata`        | Project structure, frameworks, dependencies                                      |
-| `get_style_guide`              | Style guide rules for the current project                                        |
-| `detect_circular_dependencies` | Import cycles between files                                                      |
-| `refresh_index`                | Re-index (full or incremental) + extract git memories                            |
-| `get_indexing_status`          | Progress and stats for the current index                                         |
+| `get_team_patterns`            | Pattern frequencies, golden files, conflict detection                                     |
+| `get_component_usage`          | "Find Usages" - where a library or component is imported                                  |
+| `remember`                     | Record a convention, decision, gotcha, or failure                                         |
+| `get_memory`                   | Query team memory with confidence decay scoring                                           |
+| `get_codebase_metadata`        | Project structure, frameworks, dependencies                                               |
+| `get_style_guide`              | Style guide rules for the current project                                                 |
+| `detect_circular_dependencies` | Import cycles between files                                                               |
+| `refresh_index`                | Re-index (full or incremental) + extract git memories                                     |
+| `get_indexing_status`          | Progress and stats for the current index                                                  |
+
+## Evaluation Harness (`npm run eval`)
+
+The harness runs a reproducible evaluation against frozen fixtures, so ranking and chunking changes are measured against the same expectations every time and regressions are caught.
+
+- Two codebases: `npm run eval -- <codebaseA> <codebaseB>`
+- Defaults: fixture A = `tests/fixtures/eval-angular-spotify.json`, fixture B = `tests/fixtures/eval-controlled.json`
+- Offline smoke (no network):
+
+```bash
+npm run eval -- tests/fixtures/codebases/eval-controlled tests/fixtures/codebases/eval-controlled \
+  --fixture-a=tests/fixtures/eval-controlled.json \
+  --fixture-b=tests/fixtures/eval-controlled.json \
+  --skip-reindex --no-rerank
+```
+
+- Flags: `--help`, `--fixture-a`, `--fixture-b`, `--skip-reindex`, `--no-rerank`, `--no-redact`
 
 ## How the Search Works
 
diff --git a/docs/capabilities.md b/docs/capabilities.md
@@ -8,23 +8,23 @@ Technical reference for what `codebase-context` ships today. For the user-facing
 
 ### Core Tools
 
-| Tool | Input | Output |
-| --- | --- | --- |
-| `search_codebase` | `query`, optional `intent`, `limit`, `filters`, `includeSnippets` | Ranked results (`file`, `summary`, `score`, `type`, `trend`, `patternWarning`) + `searchQuality` (with `hint` when low confidence) + `preflight` ({ready, reason}). Snippets opt-in. |
-| `get_team_patterns` | optional `category` | Pattern frequencies, trends, golden files, conflicts |
-| `get_component_usage` | `name` (import source) | Files importing the given package/module |
-| `remember` | `type`, `category`, `memory`, `reason` | Persists to `.codebase-context/memory.json` |
-| `get_memory` | optional `category`, `type`, `query`, `limit` | Memories with confidence decay scoring |
+| Tool                  | Input                                                             | Output                                                                                                                                                                               |
+| --------------------- | ----------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `search_codebase`     | `query`, optional `intent`, `limit`, `filters`, `includeSnippets` | Ranked results (`file`, `summary`, `score`, `type`, `trend`, `patternWarning`) + `searchQuality` (with `hint` when low confidence) + `preflight` ({ready, reason}). Snippets opt-in. |
+| `get_team_patterns`   | optional `category`                                               | Pattern frequencies, trends, golden files, conflicts                                                                                                                                 |
+| `get_component_usage` | `name` (import source)                                            | Files importing the given package/module                                                                                                                                             |
+| `remember`            | `type`, `category`, `memory`, `reason`                            | Persists to `.codebase-context/memory.json`                                                                                                                                          |
+| `get_memory`          | optional `category`, `type`, `query`, `limit`                     | Memories with confidence decay scoring                                                                                                                                               |
 
 ### Utility Tools
 
-| Tool | Purpose |
-| --- | --- |
-| `get_codebase_metadata` | Framework, dependencies, project stats |
-| `get_style_guide` | Style rules from project documentation |
-| `detect_circular_dependencies` | Import cycles in the file graph |
-| `refresh_index` | Full or incremental re-index + git memory extraction |
-| `get_indexing_status` | Index state, progress, last stats |
+| Tool                           | Purpose                                              |
+| ------------------------------ | ---------------------------------------------------- |
+| `get_codebase_metadata`        | Framework, dependencies, project stats               |
+| `get_style_guide`              | Style rules from project documentation               |
+| `detect_circular_dependencies` | Import cycles in the file graph                      |
+| `refresh_index`                | Full or incremental re-index + git memory extraction |
+| `get_indexing_status`          | Index state, progress, last stats                    |
 
 ## Retrieval Pipeline
 
@@ -90,3 +90,14 @@ Output: `{ ready: boolean, reason?: string }`
 
 - **Angular**: signals, standalone components, control flow syntax, lifecycle hooks, DI patterns, component metadata
 - **Generic**: 30+ languages — TypeScript, JavaScript, Python, Java, Kotlin, C/C++, C#, Go, Rust, PHP, Ruby, Swift, Scala, Shell, config/markup formats
+
+## Evaluation Harness
+
+Reproducible evaluation is shipped as a CLI entrypoint backed by shared scoring/reporting code.
+
+- **Command:** `npm run eval -- <codebaseA> <codebaseB>` (builds first, then runs `scripts/run-eval.mjs`)
+- **Shared implementation:** `src/eval/harness.ts` + `src/eval/types.ts` (tests and CLI use the same scoring)
+- **Frozen fixtures:**
+  - `tests/fixtures/eval-angular-spotify.json` (real-world)
+  - `tests/fixtures/eval-controlled.json` + `tests/fixtures/codebases/eval-controlled/` (offline controlled)
+- **Reported metrics:** Top-1 accuracy, Top-3 recall, spec contamination rate, and a gate pass/fail
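
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how Top-1 accuracy and Top-3 recall could be computed from ranked search results. It is not the code in `src/eval/harness.ts`: the `EvalQueryResult` and `EvalSummary` shapes and the `hitWithin` and `summarize` helpers are hypothetical names chosen for illustration. Spec contamination rate and the gate are left out because this commit does not define them.

```typescript
// Hypothetical shapes for illustration only; the real types in src/eval/types.ts may differ.
interface EvalQueryResult {
  /** File the frozen fixture expects this query to retrieve. */
  expectedFile: string;
  /** Files returned by search, best match first. */
  rankedFiles: string[];
}

interface EvalSummary {
  total: number;
  top1Accuracy: number;
  top3Recall: number;
}

/** True when the expected file appears within the first k results. */
function hitWithin(result: EvalQueryResult, k: number): boolean {
  return result.rankedFiles.slice(0, k).includes(result.expectedFile);
}

/** Aggregates per-query hits into Top-1 accuracy and Top-3 recall. */
function summarize(results: EvalQueryResult[]): EvalSummary {
  const total = results.length;
  if (total === 0) {
    // Avoid dividing by zero when a fixture has no queries.
    return { total: 0, top1Accuracy: 0, top3Recall: 0 };
  }
  const top1 = results.filter((r) => hitWithin(r, 1)).length;
  const top3 = results.filter((r) => hitWithin(r, 3)).length;
  return { total, top1Accuracy: top1 / total, top3Recall: top3 / total };
}

// Made-up sample: two hits at rank 1, one hit at rank 3, one miss.
const sample: EvalQueryResult[] = [
  { expectedFile: "a.ts", rankedFiles: ["a.ts", "b.ts", "c.ts"] },
  { expectedFile: "b.ts", rankedFiles: ["b.ts", "a.ts"] },
  { expectedFile: "c.ts", rankedFiles: ["a.ts", "b.ts", "c.ts"] },
  { expectedFile: "d.ts", rankedFiles: ["a.ts", "b.ts", "c.ts", "d.ts"] },
];

console.log(summarize(sample));
// { total: 4, top1Accuracy: 0.5, top3Recall: 0.75 }
```

Keeping the aggregation in one pure function is what would let a CLI runner and a test suite share the same scoring, which is the design the capabilities section describes.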