Commit 73b1336

docs(03-03): complete shared eval harness and multi-codebase eval plan
- add execution summary with task commits, decisions, and deviation details
- record self-check status and phase readiness for next phase planning
1 parent 09022ca commit 73b1336

1 file changed: 106 additions, 0 deletions
@@ -0,0 +1,106 @@
---
phase: 03-evaluation-guardrails
plan: 03
subsystem: testing
tags: [eval, harness, cli, vitest, reporting]
requires:
  - phase: 03-evaluation-guardrails
    provides: frozen fixtures and regression guardrails from plans 01-02
provides:
  - Shared eval harness module in src/eval used by tests and CLI
  - Multi-codebase eval runner with per-codebase and combined metrics output
  - npm run eval workflow that builds before importing dist
affects: [phase-05-ast-aligned-chunking, phase-09-search-quality, docs-capabilities]
tech-stack:
  added: []
  patterns: [shared eval scoring utilities, dual-codebase CLI evaluation]
key-files:
  created:
    - src/eval/types.ts
    - src/eval/harness.ts
  modified:
    - tests/eval-harness.test.ts
    - scripts/run-eval.mjs
    - package.json
key-decisions:
  - "Consolidate eval scoring/reporting into src/eval so test and CLI outputs stay aligned."
  - "Treat --skip-reindex as best effort: if index artifacts are missing, auto-build the index to keep eval runnable from clean checkout."
patterns-established:
  - "Eval reports always print both wins and failures with expected vs actual top-3 evidence."
  - "Runner supports fixture-a/fixture-b overrides for offline deterministic verification."
requirements-completed: [EVAL-01]
duration: 5 min
completed: 2026-02-20
---

# Phase 03 Plan 03: Shared Eval Harness + Multi-Codebase CLI Summary

**Shared eval scoring/reporting now lives in `src/eval`, and `npm run eval -- <codebaseA> <codebaseB>` runs per-codebase plus combined reports with honest wins/failures output.**

## Performance

- **Duration:** 5 min
- **Started:** 2026-02-20T18:45:32Z
- **Completed:** 2026-02-20T18:50:41Z
- **Tasks:** 2
- **Files modified:** 5

## Accomplishments
- Added reusable `evaluateFixture`, `summarizeEvaluation`, and `formatEvalReport` in `src/eval/harness.ts` with shared eval types in `src/eval/types.ts`.
- Migrated `tests/eval-harness.test.ts` to consume shared harness logic while preserving frozen-fixture validation for both angular and controlled fixtures.
- Upgraded `scripts/run-eval.mjs` for one or two codebases, `--fixture-a/--fixture-b`, `--help`, combined summary output, and package-version display.
- Added `npm run eval` script that builds first so dist imports work from clean checkout.
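
To make the shared-harness idea concrete, here is a minimal sketch of what a scoring function and a summary function might look like. Only the names `evaluateFixture` and `summarizeEvaluation` come from this summary; the type shapes, the synchronous `SearchFn` callback, and the "any top-3 path contains an expected pattern" hit rule are assumptions made for illustration, and the real code in `src/eval/harness.ts` may differ.

```typescript
// Hypothetical shapes: the real contracts live in src/eval/types.ts and may differ.
interface EvalQuery {
  id: string;
  text: string;
  expectedPatterns: string[]; // substrings an acceptable result path should contain
}

interface EvalResult {
  query: EvalQuery;
  topPaths: string[]; // at most three result paths, in rank order
  hit: boolean;
}

interface EvalSummary {
  total: number;
  hits: number;
  hitRate: number;
}

// Assumed search callback; a real searcher is probably async.
type SearchFn = (queryText: string) => string[];

// Marks a query as a hit when any of its top-3 paths contains an expected pattern.
function evaluateFixture(queries: EvalQuery[], search: SearchFn): EvalResult[] {
  return queries.map((query) => {
    const topPaths = search(query.text).slice(0, 3);
    const hit = topPaths.some((resultPath) =>
      query.expectedPatterns.some((pattern) => resultPath.includes(pattern)),
    );
    return { query, topPaths, hit };
  });
}

// Reduces per-query results to counts and a hit rate (0 for an empty fixture).
function summarizeEvaluation(results: EvalResult[]): EvalSummary {
  const total = results.length;
  const hits = results.filter((result) => result.hit).length;
  return { total, hits, hitRate: total === 0 ? 0 : hits / total };
}
```

Because both the tests and the CLI would call the same two functions, a change to the hit rule shows up in both places at once, which is the alignment this plan is after.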

## Task Commits

Each task was committed atomically:

1. **Task 1: Move eval harness logic into `src/eval/` and reuse from tests** - `5c5319b` (feat)
2. **Task 2: Upgrade runner to multi-codebase CLI and add `npm run eval`** - `b065042` (feat)

**Plan metadata:** pending

## Files Created/Modified
- `src/eval/types.ts` - Shared fixture/query/result/summary type contracts for eval harness and runner.
- `src/eval/harness.ts` - Centralized evaluation scoring and report formatting with wins/failures sections.
- `tests/eval-harness.test.ts` - Harness unit tests updated to consume shared module and enforce frozen fixture invariants.
- `scripts/run-eval.mjs` - Multi-codebase CLI with fixture overrides, combined summary, and clean-checkout-safe behavior.
- `package.json` - Adds `npm run eval` entrypoint (`pnpm run build && node scripts/run-eval.mjs`).
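
The combined summary mentioned for `scripts/run-eval.mjs` could be derived from the per-codebase results in several ways. The sketch below shows one plausible option, pooling raw counts instead of averaging per-codebase rates, so a small fixture does not weigh as much as a large one. The `CodebaseSummary` shape and the `combineSummaries` name are hypothetical and are not taken from the repository.

```typescript
// Hypothetical per-codebase summary; field names are assumptions for this sketch.
interface CodebaseSummary {
  label: string;
  total: number; // number of queries evaluated
  hits: number; // number of queries counted as wins
}

interface CombinedSummary {
  total: number;
  hits: number;
  hitRate: number;
}

// Pools counts across codebases, then computes a single hit rate from the pooled totals.
function combineSummaries(summaries: CodebaseSummary[]): CombinedSummary {
  const total = summaries.reduce((sum, s) => sum + s.total, 0);
  const hits = summaries.reduce((sum, s) => sum + s.hits, 0);
  return { total, hits, hitRate: total === 0 ? 0 : hits / total };
}
```

With pooled counts, a codebase scoring 1 of 2 and another scoring 9 of 18 combine to 10 of 20, whereas averaging the two rates would hide the difference in fixture size.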

## Decisions Made
- Centralized harness logic under `src/eval` to prevent scoring/report drift between tests and CLI runs.
- Kept the runner explicit about failures by printing query id/text, expected patterns, and top-3 actual paths.
- Added skip-reindex fallback to auto-index when artifacts are missing so the eval command remains operable in fresh environments.
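
As an illustration of the "explicit about failures" decision, the following sketch formats a report with separate wins and failures sections, listing the query id and text, the expected patterns, and the top-3 actual paths for each failure. The `ReportEntry` shape, the `formatReportSketch` name, and the exact output layout are assumptions; the real `formatEvalReport` may print something quite different.

```typescript
// Hypothetical flattened result entry; the real result type may be structured differently.
interface ReportEntry {
  id: string;
  text: string;
  expectedPatterns: string[];
  topPaths: string[];
  hit: boolean;
}

// Builds a plain-text report that always includes both a wins and a failures section.
function formatReportSketch(entries: ReportEntry[]): string {
  const wins = entries.filter((entry) => entry.hit);
  const failures = entries.filter((entry) => !entry.hit);
  const lines: string[] = [];

  lines.push(`Wins (${wins.length}/${entries.length})`);
  for (const entry of wins) {
    lines.push(`  PASS ${entry.id}: ${entry.text}`);
  }

  lines.push(`Failures (${failures.length}/${entries.length})`);
  for (const entry of failures) {
    const actual = entry.topPaths.slice(0, 3).join(", ") || "(no results)";
    lines.push(`  FAIL ${entry.id}: ${entry.text}`);
    lines.push(`    expected: ${entry.expectedPatterns.join(", ")}`);
    lines.push(`    actual top-3: ${actual}`);
  }

  return lines.join("\n");
}
```

Printing the failures section even when it is empty keeps the output shape stable, which makes two runs easy to compare by eye or with a diff.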

## Deviations from Plan

### Auto-fixed Issues

**1. [Rule 3 - Blocking] Handled missing index artifacts when `--skip-reindex` is used**
- **Found during:** Task 2 verification (offline two-codebase smoke run)
- **Issue:** `--skip-reindex` failed on clean state with missing `.codebase-context` artifacts, causing index corruption errors before evaluation.
- **Fix:** Added index artifact detection and automatic reindex fallback when skip is requested without an existing index.
- **Files modified:** scripts/run-eval.mjs
- **Verification:** `npm run eval -- tests/fixtures/codebases/eval-controlled tests/fixtures/codebases/eval-controlled --fixture-a=tests/fixtures/eval-controlled.json --fixture-b=tests/fixtures/eval-controlled.json --skip-reindex --no-rerank`
- **Committed in:** b065042
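
The fix described above amounts to treating `--skip-reindex` as a request, not a guarantee. A minimal sketch of that decision follows, assuming the index artifacts sit in a `.codebase-context` entry at the codebase root; the function names and the injectable `exists` parameter are hypothetical, and the actual detection in `scripts/run-eval.mjs` may check more than a single path.

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Assumption: index artifacts live under `.codebase-context` at the codebase root.
// `exists` is injectable so the check can be exercised without touching the disk.
function hasIndexArtifacts(
  codebaseRoot: string,
  exists: (candidatePath: string) => boolean = existsSync,
): boolean {
  return exists(join(codebaseRoot, ".codebase-context"));
}

// Reindex unless a skip was requested AND an existing index is available to reuse.
function shouldReindex(skipRequested: boolean, artifactsPresent: boolean): boolean {
  return !skipRequested || !artifactsPresent;
}

// Combines both checks for one codebase root.
function resolveReindex(
  codebaseRoot: string,
  skipRequested: boolean,
  exists: (candidatePath: string) => boolean = existsSync,
): boolean {
  return shouldReindex(skipRequested, hasIndexArtifacts(codebaseRoot, exists));
}
```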

---

**Total deviations:** 1 auto-fixed (1 blocking)
**Impact on plan:** Deviation kept the eval workflow usable from clean checkout without changing planned scope.

## Issues Encountered
None.

## User Setup Required
None - no external service configuration required.

## Next Phase Readiness
Phase 03 now has frozen fixtures, regression guardrails, and a reusable multi-codebase eval command.
Phase is complete and ready for transition to Phase 04 grammar assets/loader work.

---
*Phase: 03-evaluation-guardrails*
*Completed: 2026-02-20*

## Self-Check: PASSED
