|
| 1 | +--- |
| 2 | +phase: 03-evaluation-guardrails |
| 3 | +plan: 03 |
| 4 | +subsystem: testing |
| 5 | +tags: [eval, harness, cli, vitest, reporting] |
| 6 | +requires: |
| 7 | + - phase: 03-evaluation-guardrails |
| 8 | + provides: frozen fixtures and regression guardrails from plans 01-02 |
| 9 | +provides: |
| 10 | + - Shared eval harness module in src/eval used by tests and CLI |
| 11 | + - Multi-codebase eval runner with per-codebase and combined metrics output |
| 12 | + - npm run eval workflow that builds before importing dist |
| 13 | +affects: [phase-05-ast-aligned-chunking, phase-09-search-quality, docs-capabilities] |
| 14 | +tech-stack: |
| 15 | + added: [] |
| 16 | + patterns: [shared eval scoring utilities, dual-codebase CLI evaluation] |
| 17 | +key-files: |
| 18 | + created: |
| 19 | + - src/eval/types.ts |
| 20 | + - src/eval/harness.ts |
| 21 | + modified: |
| 22 | + - tests/eval-harness.test.ts |
| 23 | + - scripts/run-eval.mjs |
| 24 | + - package.json |
| 25 | +key-decisions: |
| 26 | + - "Consolidate eval scoring/reporting into src/eval so test and CLI outputs stay aligned." |
| 27 | + - "Treat --skip-reindex as best effort: if index artifacts are missing, auto-build the index to keep eval runnable from clean checkout." |
| 28 | +patterns-established: |
| 29 | + - "Eval reports always print both wins and failures with expected vs actual top-3 evidence." |
| 30 | + - "Runner supports fixture-a/fixture-b overrides for offline deterministic verification." |
| 31 | +requirements-completed: [EVAL-01] |
| 32 | +duration: 5 min |
| 33 | +completed: 2026-02-20 |
| 34 | +--- |
| 35 | + |
| 36 | +# Phase 03 Plan 03: Shared Eval Harness + Multi-Codebase CLI Summary |
| 37 | + |
| 38 | +**Shared eval scoring/reporting now lives in `src/eval`, and `npm run eval -- <codebaseA> <codebaseB>` runs per-codebase plus combined reports with honest wins/failures output.** |
| 39 | + |
| 40 | +## Performance |
| 41 | + |
| 42 | +- **Duration:** 5 min |
| 43 | +- **Started:** 2026-02-20T18:45:32Z |
| 44 | +- **Completed:** 2026-02-20T18:50:41Z |
| 45 | +- **Tasks:** 2 |
| 46 | +- **Files modified:** 5 |
| 47 | + |
| 48 | +## Accomplishments |
| 49 | +- Added reusable `evaluateFixture`, `summarizeEvaluation`, and `formatEvalReport` in `src/eval/harness.ts` with shared eval types in `src/eval/types.ts`. |
| 50 | +- Migrated `tests/eval-harness.test.ts` to consume shared harness logic while preserving frozen-fixture validation for both angular and controlled fixtures. |
| 51 | +- Upgraded `scripts/run-eval.mjs` for one or two codebases, `--fixture-a/--fixture-b`, `--help`, combined summary output, and package-version display. |
| 52 | +- Added `npm run eval` script that builds first so dist imports work from clean checkout. |
| 53 | + |
| 54 | +## Task Commits |
| 55 | + |
| 56 | +Each task was committed atomically: |
| 57 | + |
| 58 | +1. **Task 1: Move eval harness logic into `src/eval/` and reuse from tests** - `5c5319b` (feat) |
| 59 | +2. **Task 2: Upgrade runner to multi-codebase CLI and add `npm run eval`** - `b065042` (feat) |
| 60 | + |
| 61 | +**Plan metadata:** pending |
| 62 | + |
| 63 | +## Files Created/Modified |
| 64 | +- `src/eval/types.ts` - Shared fixture/query/result/summary type contracts for eval harness and runner. |
| 65 | +- `src/eval/harness.ts` - Centralized evaluation scoring and report formatting with wins/failures sections. |
| 66 | +- `tests/eval-harness.test.ts` - Harness unit tests updated to consume shared module and enforce frozen fixture invariants. |
| 67 | +- `scripts/run-eval.mjs` - Multi-codebase CLI with fixture overrides, combined summary, and clean-checkout-safe behavior. |
| 68 | +- `package.json` - Adds `npm run eval` entrypoint (`pnpm run build && node scripts/run-eval.mjs`). |
| 69 | + |
| 70 | +## Decisions Made |
| 71 | +- Centralized harness logic under `src/eval` to prevent scoring/report drift between tests and CLI runs. |
| 72 | +- Kept the runner explicit about failures by printing query id/text, expected patterns, and top-3 actual paths. |
| 73 | +- Added skip-reindex fallback to auto-index when artifacts are missing so the eval command remains operable in fresh environments. |
| 74 | + |
| 75 | +## Deviations from Plan |
| 76 | + |
| 77 | +### Auto-fixed Issues |
| 78 | + |
| 79 | +**1. [Rule 3 - Blocking] Handled missing index artifacts when `--skip-reindex` is used** |
| 80 | +- **Found during:** Task 2 verification (offline two-codebase smoke run) |
| 81 | +- **Issue:** `--skip-reindex` failed on a clean checkout because the `.codebase-context` index artifacts were missing, which surfaced as index corruption errors before evaluation could start. |
| 82 | +- **Fix:** Added index artifact detection and automatic reindex fallback when skip is requested without an existing index. |
| 83 | +- **Files modified:** scripts/run-eval.mjs |
| 84 | +- **Verification:** `npm run eval -- tests/fixtures/codebases/eval-controlled tests/fixtures/codebases/eval-controlled --fixture-a=tests/fixtures/eval-controlled.json --fixture-b=tests/fixtures/eval-controlled.json --skip-reindex --no-rerank` |
| 85 | +- **Committed in:** b065042 |
| 86 | + |
| 87 | +--- |
| 88 | + |
| 89 | +**Total deviations:** 1 auto-fixed (1 blocking) |
| 90 | +**Impact on plan:** Deviation kept the eval workflow usable from clean checkout without changing planned scope. |
| 91 | + |
| 92 | +## Issues Encountered |
| 93 | +None. |
| 94 | + |
| 95 | +## User Setup Required |
| 96 | +None - no external service configuration required. |
| 97 | + |
| 98 | +## Next Phase Readiness |
| 99 | +Phase 03 now has frozen fixtures, regression guardrails, and a reusable multi-codebase eval command. |
| 100 | +The phase is complete and ready to transition to the Phase 04 grammar assets/loader work. |
| 101 | + |
| 102 | +--- |
| 103 | +*Phase: 03-evaluation-guardrails* |
| 104 | +*Completed: 2026-02-20* |
| 105 | + |
| 106 | +## Self-Check: PASSED |
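
The summary describes eval reports that print both wins and failures with expected patterns and top-3 actual paths, plus a combined summary across codebases. A minimal sketch of that reporting shape follows. The type and function names (`EvalQuery`, `scoreQuery`, `summarize`, `formatReport`) and the substring-match rule are assumptions for illustration. The actual contracts in `src/eval/types.ts` and the signatures of `evaluateFixture`, `summarizeEvaluation`, and `formatEvalReport` are not shown in this summary and may differ.

```typescript
// Hypothetical shapes; the real contracts live in src/eval/types.ts.
interface EvalQuery {
  id: string;
  text: string;
  expectedPatterns: string[]; // assumed: substrings expected in a top-ranked path
}

interface QueryResult {
  query: EvalQuery;
  topPaths: string[]; // at most three actual result paths, best first
  hit: boolean;
}

interface EvalSummary {
  total: number;
  wins: number;
  failures: number;
  hitRate: number;
}

// Assumed scoring rule: a query wins if any top-3 path contains any expected pattern.
function scoreQuery(query: EvalQuery, rankedPaths: string[]): QueryResult {
  const topPaths = rankedPaths.slice(0, 3);
  const hit = topPaths.some((p) =>
    query.expectedPatterns.some((pat) => p.indexOf(pat) !== -1)
  );
  return { query, topPaths, hit };
}

function summarize(results: QueryResult[]): EvalSummary {
  const total = results.length;
  const wins = results.filter((r) => r.hit).length;
  return { total, wins, failures: total - wins, hitRate: total === 0 ? 0 : wins / total };
}

// Render one section (wins or failures) with expected vs actual evidence.
function formatSection(title: string, results: QueryResult[]): string[] {
  const lines = [`${title}:`];
  results.forEach((r) => {
    lines.push(`  [${r.query.id}] ${r.query.text}`);
    lines.push(`    expected: ${r.query.expectedPatterns.join(", ")}`);
    lines.push(`    actual top-3: ${r.topPaths.join(", ") || "(none)"}`);
  });
  return lines;
}

// Always prints both sections, even when one of them is empty.
function formatReport(label: string, results: QueryResult[]): string {
  const s = summarize(results);
  return [`${label}: ${s.wins}/${s.total} hits`]
    .concat(formatSection("Wins", results.filter((r) => r.hit)))
    .concat(formatSection("Failures", results.filter((r) => !r.hit)))
    .join("\n");
}
```

Under these assumptions, a combined report is the same formatter applied to the concatenated per-codebase results, for example `formatReport("combined", resultsA.concat(resultsB))`. That keeps the per-codebase and combined numbers derived from one scoring path.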