
Commit 1dd422d

feat(agent-eval): live MCP arms, log comparison, and external workflow (#144)
* feat(agent-eval): live MCP arms, log comparison, and external workflow

  Close P1 agent-surface eval work: live mode dispatches handleQuery/query_recipe with a minimal CODEMAP_MCP_TOOLS allowlist, log mode compares exported session transcripts, and workflow_dispatch runs probe or live arms on external fixtures.

* fix(agent-eval): address PR #144 review cycle 1

  Type-safe live MCP arm, workflow path/scenario inputs, doc accuracy (probe vs log modes, probe+live labels), env restore on exit, and tests.

* fix(agent-eval): address PR #144 review cycle 2

  Drop redundant initCodemap in live arm, validate log/workflow paths, correct delivery tracker status, golden-queries wording.

* fix(agent-eval): address PR #144 review cycle 3

  Align plan tracker with in-flight #144, validate log paths in compareLogArms, and clarify PR 9 shipped vs partial status across docs.

* fix(agent-eval): address PR #144 review cycle 4

  Harden live eval allowlist defaults, workflow runs validation, doc cross-refs, CLI smoke tests, and comparison JSON validation.

* fix(agent-eval): address PR #144 review cycle 5

  Validate probes/scenarios paths before read and align live-mode help with ensureLiveEvalMcpToolsEnv (unset or blank).

* fix(agent-eval): address PR #144 review cycle 6

  Surface live MCP handler errors in comparison JSON, validate scenario arm shape in print-comparison-summary, and extend tests.

* docs(agent-eval): publish harness layers and sample results

  Add research note for MCP vs traditional agent findings, eval layers table and pinned fixtures/minimal live numbers in benchmark.md.

* chore: add Cursor and VS Code Codemap MCP configs

  Project-local MCP wiring for dev (bun src/index.ts) with watch and workspace root; other IDE MCP targets from agents init omitted.

* fix(agent-eval): address PR #144 CodeRabbit review (fact-checked)

  Present-tense P1 tracker copy, workflow_dispatch inputs via step env, and log-mode wall-ms validation in print-comparison-summary.

* fix(agent-eval): address PR #144 review cycle 7

  Run golden setup after index, surface live 0-row failures, preserve averaged arm errors, align doc tables with probe/live vs log modes.

* fix(agent-eval): address PR #144 review cycle 8

  Harden setup paths, multi-run averaging, live allowlist errors in JSON, log compare empty MCP-on guard, summary on partial failure, doc fixes.

* fix(agent-eval): address PR #144 review cycle 9

  MCP-off zero-result error parity and probe summary before log compare.
1 parent e475d20 commit 1dd422d

28 files changed

Lines changed: 1744 additions & 196 deletions

.cursor/mcp.json

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+{
+  "mcpServers": {
+    "codemap": {
+      "command": "bun",
+      "args": ["src/index.ts", "mcp", "--watch", "--root", "${workspaceFolder}"]
+    }
+  }
+}

.github/CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ bun install # runs `prepare` → Husky git hooks
 bun run dev # same as `bun src/index.ts` — CLI from source
 bun test
 bun run test:golden # golden SQL vs fixtures/minimal (also runs at end of `bun run check`)
-bun run test:agent-eval # probe A/B harness smoke on fixtures/minimal (also runs at end of `bun run check`)
+bun run test:agent-eval # agent-eval harness smoke (probe + live; also runs at end of `bun run check`)
 bun run test:golden:external # Tier B: local tree via CODEMAP_ROOT / --root (not in CI)
 bun run check # build, then format:check + lint:ci + test + typecheck, then test:golden + test:agent-eval
 bun run clean # remove untracked/ignored build artifacts (keeps `.env`, `.codemap/`)
Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
+# Optional manual agent-eval on an in-repo indexed fixture (default: fixtures/minimal).
+# Clone external trees into the checkout first; pass repo-relative fixture_root + matching scenarios/probes.
+name: Agent eval (external)
+
+on:
+  workflow_dispatch:
+    inputs:
+      fixture_root:
+        description: "Indexed project root — repo-relative path under the checkout (default fixtures/minimal)"
+        required: false
+        default: fixtures/minimal
+      mode:
+        description: "Harness mode — probe (queryRows) or live (MCP handlers)"
+        required: false
+        default: probe
+        type: choice
+        options:
+          - probe
+          - live
+      runs:
+        description: "Repeat count per probe"
+        required: false
+        default: "1"
+      scenarios:
+        description: "Golden scenarios JSON — repo-relative; empty = fixtures/golden/scenarios.json"
+        required: false
+        default: ""
+      probes:
+        description: "Probe definitions JSON — repo-relative; empty = scripts/agent-eval/scenarios.json"
+        required: false
+        default: ""
+
+jobs:
+  agent-eval-external:
+    name: Agent eval (${{ inputs.mode }})
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Setup
+        uses: ./.github/actions/setup
+
+      - name: Resolve paths
+        id: paths
+        env:
+          INPUT_RUNS: ${{ inputs.runs }}
+          INPUT_FIXTURE_ROOT: ${{ inputs.fixture_root }}
+          INPUT_SCENARIOS: ${{ inputs.scenarios }}
+          INPUT_PROBES: ${{ inputs.probes }}
+        run: |
+          set -euo pipefail
+          RUNS="$INPUT_RUNS"
+          if ! [[ "$RUNS" =~ ^[1-9][0-9]*$ ]]; then
+            echo "runs must be a positive integer (got: $RUNS)" >&2
+            exit 1
+          fi
+          FIXTURE="$INPUT_FIXTURE_ROOT"
+          if [[ "$FIXTURE" == *".."* ]]; then
+            echo "fixture_root must not contain .." >&2
+            exit 1
+          fi
+          FIXTURE_ABS="${{ github.workspace }}/$FIXTURE"
+          if [[ ! -d "$FIXTURE_ABS" ]]; then
+            echo "fixture_root not found: $FIXTURE_ABS" >&2
+            exit 1
+          fi
+          echo "fixture=$FIXTURE_ABS" >> "$GITHUB_OUTPUT"
+          SCEN="$INPUT_SCENARIOS"
+          if [[ -n "$SCEN" ]]; then
+            if [[ "$SCEN" == *".."* ]]; then
+              echo "scenarios must not contain .." >&2
+              exit 1
+            fi
+            SCEN_ABS="${{ github.workspace }}/$SCEN"
+            if [[ ! -f "$SCEN_ABS" ]]; then
+              echo "scenarios file not found: $SCEN_ABS" >&2
+              exit 1
+            fi
+            echo "scenarios=$SCEN_ABS" >> "$GITHUB_OUTPUT"
+          fi
+          PROB="$INPUT_PROBES"
+          if [[ -n "$PROB" ]]; then
+            if [[ "$PROB" == *".."* ]]; then
+              echo "probes must not contain .." >&2
+              exit 1
+            fi
+            PROB_ABS="${{ github.workspace }}/$PROB"
+            if [[ ! -f "$PROB_ABS" ]]; then
+              echo "probes file not found: $PROB_ABS" >&2
+              exit 1
+            fi
+            echo "probes=$PROB_ABS" >> "$GITHUB_OUTPUT"
+          fi
+
+      - name: Golden index (fixtures/minimal only)
+        if: inputs.fixture_root == 'fixtures/minimal'
+        run: bun run test:golden
+
+      - name: Run agent-eval harness
+        env:
+          AGENT_EVAL_MODE: ${{ inputs.mode }}
+          AGENT_EVAL_FIXTURE_ROOT: ${{ steps.paths.outputs.fixture }}
+          AGENT_EVAL_RUNS: ${{ inputs.runs }}
+          AGENT_EVAL_PRINT_SUMMARY: "1"
+          AGENT_EVAL_SCENARIOS: ${{ steps.paths.outputs.scenarios }}
+          AGENT_EVAL_PROBES: ${{ steps.paths.outputs.probes }}
+          CODEMAP_MCP_TOOLS: ${{ inputs.mode == 'live' && 'query,query_recipe' || '' }}
+        run: bash scripts/agent-eval/run-arms.sh
+
+      - name: Upload comparison artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: agent-eval-comparison
+          path: .agent-eval/comparison.json
+          if-no-files-found: error
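
Reviewer note: the "Resolve paths" step repeats the same two checks (reject `..`, then confirm the path exists under the workspace) for `fixture_root`, `scenarios`, and `probes`. A minimal sketch of how those checks could be factored into one function follows. `validate_rel_path` is a hypothetical name that does not appear in this commit, and the sketch assumes the error messages should stay close to the ones in the workflow.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the commit): validate a repo-relative path.
#   $1 = label used in error messages (e.g. fixture_root)
#   $2 = workspace root to resolve against
#   $3 = repo-relative path supplied by the user
#   $4 = test(1) flag: -d for a directory, -f for a regular file
# Prints the absolute path on success; returns 1 with a message on stderr otherwise.
validate_rel_path() {
  local label="$1" root="$2" rel="$3" kind="$4"

  # Same substring check as the workflow: any ".." is rejected outright.
  if [[ "$rel" == *".."* ]]; then
    echo "$label must not contain .." >&2
    return 1
  fi

  local abs="$root/$rel"
  if ! test "$kind" "$abs"; then
    echo "$label not found: $abs" >&2
    return 1
  fi

  printf '%s\n' "$abs"
}
```

Under these assumptions, the step could call `validate_rel_path fixture_root "$GITHUB_WORKSPACE" "$INPUT_FIXTURE_ROOT" -d` and append the printed path to `$GITHUB_OUTPUT`, keeping the `-n` guards for the optional `scenarios` and `probes` inputs. Note that the substring check also rejects legitimate names that merely contain two dots (for example `a..b.json`), which matches the workflow's current behavior.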

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
@@ -95,7 +95,7 @@ jobs:
       - name: Golden query regression (fixtures/minimal)
         run: bun run test:golden
 
-      - name: Agent eval probe harness (fixtures/minimal)
+      - name: Agent eval harness (probe + live smoke, fixtures/minimal)
         run: bun run test:agent-eval
 
   build:

.vscode/mcp.json

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+{
+  "servers": {
+    "codemap": {
+      "type": "stdio",
+      "command": "bun",
+      "args": ["src/index.ts", "mcp", "--watch", "--root", "${workspaceFolder}"]
+    }
+  }
+}

README.md

Lines changed: 2 additions & 2 deletions
@@ -281,11 +281,11 @@ Tooling: **Oxfmt**, **Oxlint**, **tsgo** (`@typescript/native-preview`).
 | Command | Purpose |
 | ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `bun run dev` | Run the CLI from source (same as `bun src/index.ts`) |
-| `bun run check` | Build, format check, lint, tests, typecheck, golden queries + agent-eval probe smoke — run before pushing |
+| `bun run check` | Build, format check, lint, tests, typecheck, golden queries + agent-eval harness smoke — run before pushing |
 | `bun run fix` | Apply lint fixes, then format |
 | `bun run test` / `bun run typecheck` | Focused checks |
 | `bun run test:golden` | SQL snapshot regression on `fixtures/minimal` (included in `check`) |
-| `bun run test:agent-eval` | Probe A/B harness smoke on `fixtures/minimal` (included in `check`; [docs/benchmark.md § Agent eval harness](docs/benchmark.md#agent-eval-harness)) |
+| `bun run test:agent-eval` | Agent-eval harness smoke on `fixtures/minimal` — probe + live MCP handlers (included in `check`; [docs/benchmark.md § Agent eval harness](docs/benchmark.md#agent-eval-harness)) |
 | `bun run test:golden:external` | Tier B: local tree via `CODEMAP_*` / `--root` (not in default `check`) |
 | `bun run benchmark:query` | Compare `console.table` vs `--json` stdout size (needs local `.codemap/index.db`; [docs/benchmark.md § Query stdout](docs/benchmark.md#query-stdout-table-vs-json-benchmarkquery)) |
 | `bun run qa:external` | Index + sanity checks + benchmark on `CODEMAP_ROOT` / `CODEMAP_TEST_BENCH` |
