feat(agent-eval): probe A/B harness and log parser (PR 9) by SutuSebastian · Pull Request #139 · stainless-code/codemap

SutuSebastian · 2026-05-25T17:47:18Z

Summary

Add dev-only scripts/agent-eval/ harness (PR 9 tracer bullet): deterministic A/B comparing MCP-on (one query per probe) vs MCP-off (glob → read × N → grep) on three golden-mirrored probes.
bash scripts/agent-eval/run-arms.sh writes local comparison JSON (.agent-eval/comparison.json); no telemetry upload.
Agent log parser for entries / messages / line export formats + sample fixture; methodology documented in docs/benchmark.md.
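
The two arms described above differ mainly in how many tool calls they make. The following is a minimal sketch of the deterministic tool sequences, assuming the tool names `query`, `glob`, `read`, and `grep` as written in the summary; the function names are hypothetical and not taken from the harness.

```typescript
// Hypothetical helpers: tool names mirror the PR summary, function names are assumptions.

/** MCP-on arm: a single query per probe. */
function mcpOnToolSequence(): string[] {
  return ["query"];
}

/** MCP-off arm: glob once, read each matched file, then grep the contents. */
function mcpOffToolSequence(filesRead: number): string[] {
  const reads = Array.from({ length: filesRead }, () => "read");
  return ["glob", ...reads, "grep"];
}

console.log(mcpOnToolSequence().length); // 1
console.log(mcpOffToolSequence(3)); // ["glob", "read", "read", "read", "grep"]
```

Under this reading, the MCP-off call count grows with the number of files read (N + 2), while the MCP-on count stays at one.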

Test plan

bash scripts/agent-eval/run-arms.sh — 3/3 scenarios ok, comparison JSON written
bun test scripts/agent-eval/parse-agent-log.test.ts
Pre-commit hook (format, lint, parser tests)

Follow-ups (not in this PR)

CI workflow_dispatch / nightly job
Live agent A/B with MCP allowlist arms
External public fixture probes (zod, fastify)

Summary by CodeRabbit

New Features
- Added an agent evaluation harness to run deterministic A/B comparisons between two probe modes and produce JSON reports with metrics (tool calls, tokens, wall time, deltas).
Tests
- Added comprehensive tests covering log parsing, token estimation, probe execution, schema validation, and smoke runs against sample fixtures.
Documentation
- Expanded docs and README with harness usage, options, CI integration, and benchmark guidance.
Chores
- Updated CI, npm scripts, gitignore, and TypeScript includes to enable the harness.

changeset-bot · 2026-05-25T17:47:22Z

⚠️ No Changeset found

Latest commit: 5247b44

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai · 2026-05-25T17:47:25Z

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

Adds a deterministic A/B agent-eval harness (MCP-on vs MCP-off) with probes schema, log parsing and token metrics, averaging/reporting CLI, tests, CI/package wiring, docs, and extracts golden-query resolution into a reusable resolver.

Changes

Agent-Eval Harness & Query Refactoring

Layer / File(s)	Summary
Golden query resolution extraction `scripts/query-golden.ts`, `scripts/query-golden/resolve-golden-query.ts`	Extracted golden scenario → {sql, bindValues} resolution into `resolveGoldenQuery`, supporting raw SQL and recipe-based scenarios with param resolution.
Metrics and token estimation `scripts/agent-eval/metrics.ts`, `scripts/agent-eval/probe-tokens.ts`	UTF-8 byte-length helpers and token-estimate helpers; payload sizing for MCP-on (SQL + rows + binds) and MCP-off (bytesRead + results).
Agent log parsing & CLI `scripts/agent-eval/parse-agent-log.ts`, `scripts/agent-eval/print-log-metrics.ts`	Parser for multiple JSON shapes and line logs that extracts tool sequences, byte counts, optional wall time, and estimated tokens; CLI prints parsed metrics.
Traditional probe (MCP-off baseline) `scripts/agent-eval/traditional-probe.ts`	Regex or builtin probes that expand globs, read files, match content, and return results/filesRead/bytesRead/wallMs plus deterministic tool sequence.
Probe schema, scenarios, fixtures `scripts/agent-eval/schema.ts`, `scripts/agent-eval/scenarios.json`, `fixtures/agent-eval/sample-cursor-log.json`	Zod-validated probes schema, three probe definitions, and a sample cursor/log fixture for tests.
A/B probe execution and reporting `scripts/agent-eval/run-probes.ts`, `scripts/agent-eval/run-arms.sh`	CLI runs MCP-on (golden SQL) and MCP-off (traditional) arms, computes per-arm metrics, averages runs, computes deltas and summary totals, writes JSON comparison report, and sets non-zero exit on partial failures.
Test suite for agent-eval `scripts/agent-eval/parse-agent-log.test.ts`	Comprehensive tests for log parsing formats, token/byte sizing, run-probes helpers (golden resolution, averaging, success gating), schema validation, and smoke runs against `fixtures/minimal`.
CI, package scripts, and build config `package.json`, `.github/workflows/ci.yml`, `.gitignore`, `tsconfig.json`, `src/package-scripts.test.ts`, `.github/CONTRIBUTING.md`	Add `test:agent-eval` script and include in `check`; CI test job runs it after golden tests; ignore `.agent-eval/`; include agent-eval TS files in tsconfig; update contributor docs and tests checking scripts.
Docs, plans, and roadmap `docs/benchmark.md`, `docs/plans/*`, `docs/roadmap.md`, `README.md`, `docs/README.md`, `docs/golden-queries.md`	Document harness, local run script, environment overrides, CI gating on `fixtures/minimal`, update plan/roadmap/ownership tables and contributor docs to reference agent-eval harness.
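
The metrics row in the table above mentions UTF-8 byte-length helpers and token estimates derived from payload size. A minimal sketch of that idea follows, assuming a rough 4-bytes-per-token heuristic. The ratio, the helper names, and the sample SQL are illustrative assumptions; the PR page does not show the contents of `metrics.ts` or `probe-tokens.ts`.

```typescript
// Hypothetical helpers: names and the 4-bytes-per-token ratio are assumptions,
// not taken from scripts/agent-eval/metrics.ts.
const BYTES_PER_TOKEN = 4;

/** UTF-8 byte length of a string; multi-byte characters count in full. */
function utf8ByteLength(text: string): number {
  return new TextEncoder().encode(text).length;
}

/** Rounds up so any non-empty payload costs at least one token. */
function estimateTokens(bytes: number): number {
  return Math.ceil(bytes / BYTES_PER_TOKEN);
}

/** MCP-on payload size: SQL text plus serialized bind values and rows. */
function mcpOnPayloadBytes(sql: string, bindValues: unknown[], rows: unknown[]): number {
  return (
    utf8ByteLength(sql) +
    utf8ByteLength(JSON.stringify(bindValues)) +
    utf8ByteLength(JSON.stringify(rows))
  );
}

// Illustrative query only; the table and column names are made up.
const payloadBytes = mcpOnPayloadBytes(
  "SELECT name FROM symbols WHERE kind = ?",
  ["function"],
  [{ name: "main" }],
);
console.log(payloadBytes, estimateTokens(payloadBytes));
```

Measuring bytes rather than `string.length` matters for non-ASCII content, where one UTF-16 code unit can encode to two or more bytes.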

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

stainless-code/codemap#8: Related golden-query tooling and earlier query-resolution work that this PR refactors and reuses.
stainless-code/codemap#71: Prior parametrized recipe and query resolution patterns referenced by the new resolver.

Suggested labels

enhancement, documentation

Poem

🐰 A harness hops from MCP to grep,
Comparing SQL swift with file-search step,
Probes tally tokens, logs sing the tune,
Two arms danced under a testing moon—
Results written down, the rabbit nods: “Well done.”

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 10.53% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(agent-eval): probe A/B harness and log parser (PR 9)' clearly and concisely summarizes the main changes: adding an A/B testing harness and log parser for agent evaluation, which aligns with the substantial additions to scripts/agent-eval/ and related documentation updates.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

scripts/agent-eval/run-probes.ts (1)

26-65: ⚡ Quick win

Add concise docs for exported harness APIs with non-obvious semantics.

The exported APIs encode policy choices (success criteria, averaging/rounding, exit behavior). A short TSDoc note per export would make downstream/test usage safer and clearer.

As per coding guidelines "**/*.{ts,tsx,js,jsx}: All public APIs must have accompanying documentation."

Also applies to: 123-195, 256-303

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/agent-eval/run-probes.ts` around lines 26 - 65, Add short TSDoc
comments to each exported interface and key fields (ArmRunMetrics,
ScenarioComparison, AgentEvalComparison) describing non-obvious semantics: e.g.,
state that ArmRunMetrics.wallMs is elapsed time in milliseconds, estTokens is
estimated token usage and whether it’s rounded/averaged, toolSequence is ordered
tool names called, toolCallCount is total tool invocation count, resultCount is
number of rows returned and success flag semantics; for ScenarioComparison note
scenarioSuccess means both arms returned ≥1 row, delta fields are mcpOn - mcpOff
(sign convention), and for AgentEvalComparison document generatedAt format (ISO
string), mode fixed value "probe", runs is number of repeats, fixtureRoot
meaning, and exactly how summary aggregates (which arm contributed to
mcpOn*/mcpOff* totals and whether values are sums/averages/rounded). Ensure each
TSDoc is concise and placed immediately above the respective exported
interface/field names (ArmRunMetrics, ScenarioComparison, AgentEvalComparison,
and summary fields).
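
A sketch of what the requested TSDoc could look like, using the field names listed in the prompt above. The field types, the delta field names, and the `compareArms` helper are assumptions made for illustration; only the semantics (both arms returning at least one row, deltas as mcpOn - mcpOff) come from the review comment.

```typescript
// Field names follow the review prompt; types, delta names, and compareArms are assumptions.

/** Metrics for one arm (MCP-on or MCP-off) of a single probe. */
interface ArmRunMetrics {
  /** Elapsed wall-clock time in milliseconds. */
  wallMs: number;
  /** Estimated token usage for the arm's payload. */
  estTokens: number;
  /** Tool names in the order they were called. */
  toolSequence: string[];
  /** Total number of tool invocations. */
  toolCallCount: number;
  /** Number of rows returned by the arm. */
  resultCount: number;
}

/** Per-scenario comparison; every delta is mcpOn - mcpOff, so a negative value means MCP-on used less. */
interface ScenarioComparison {
  /** True only when both arms returned at least one row. */
  scenarioSuccess: boolean;
  deltaToolCalls: number;
  deltaEstTokens: number;
  deltaWallMs: number;
}

function compareArms(mcpOn: ArmRunMetrics, mcpOff: ArmRunMetrics): ScenarioComparison {
  return {
    scenarioSuccess: mcpOn.resultCount >= 1 && mcpOff.resultCount >= 1,
    deltaToolCalls: mcpOn.toolCallCount - mcpOff.toolCallCount,
    deltaEstTokens: mcpOn.estTokens - mcpOff.estTokens,
    deltaWallMs: mcpOn.wallMs - mcpOff.wallMs,
  };
}
```

Stating the sign convention once in the interface comment avoids repeating it on every delta field.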

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/agent-eval/parse-agent-log.ts`:
- Around line 195-201: The catch is catching errors from both JSON.parse(raw)
and parseJsonLog(...), masking shape/validation errors as “invalid JSON”; fix it
by only catching JSON.parse failures: first attempt to JSON.parse(raw) inside a
try/catch and on error throw the current `agent log: invalid JSON` with the
parse error message, then call `parseJsonLog(...)` outside that catch so any
errors thrown by `parseJsonLog` (shape/validation) are preserved and propagate
(or can be handled separately); update the block around `trimmed.startsWith("{")
|| trimmed.startsWith("[")` to reflect this separation and keep references to
`raw`, `JSON.parse`, and `parseJsonLog`.

In `@scripts/agent-eval/run-probes.ts`:
- Around line 260-262: Guard the averageSamples implementation against empty
input: at the start of the function check samples.length (n) and if zero throw a
clear Error (e.g. "averageSamples called with empty samples") or return a safe
value, then only access samples[0].prompt and compute avgArm/divisions after
that check so you never dereference samples[0] or divide by n when n is 0;
update the code paths that use samples, n, prompt, and avgArm accordingly.
- Around line 74-83: The CLI parsing accepts the next argv token for --output
and --fixture-root even when that token is another flag; update the parsing in
run-probes.ts (the branches handling a === "--output" and a ===
"--fixture-root", using variables a, argv, i, output, fixtureRoot) to validate
that argv[i+1] exists and does not start with "-" before consuming it; if
argv[i+1] is missing or startsWith("-"), throw a clear Error like "Missing value
for --output" / "Missing value for --fixture-root" instead of silently treating
a flag as the path. Ensure you increment i only after confirming the token is
valid.
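
The first inline comment above asks that only `JSON.parse` failures be reported as invalid JSON, so that shape errors from `parseJsonLog` propagate unchanged. A minimal sketch of that separation follows. The `parseJsonLog` stub, the `ParsedLog` type, and the shape-error message are stand-ins invented for the example; the real parser's signature is not shown on this page.

```typescript
// `parseJsonLog` is a stand-in stub; its real signature and return type are assumptions.
interface ParsedLog {
  toolSequence: string[];
}

function parseJsonLog(value: unknown): ParsedLog {
  if (!Array.isArray(value)) {
    // Shape/validation error: must NOT be reported as invalid JSON.
    throw new Error("agent log: unsupported transcript shape");
  }
  const tools = value
    .filter((entry) => typeof entry?.tool === "string")
    .map((entry) => entry.tool as string);
  return { toolSequence: tools };
}

function parseAgentLog(raw: string): ParsedLog {
  let parsed: unknown;
  try {
    // Only the JSON.parse call sits inside the try block.
    parsed = JSON.parse(raw);
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    throw new Error(`agent log: invalid JSON: ${detail}`);
  }
  // Called outside the catch, so its errors keep their own message.
  return parseJsonLog(parsed);
}
```

With this split, a syntactically valid log in an unexpected shape surfaces the shape error rather than a misleading parse error.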

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 07deb864-a257-4516-a8cf-ec5b5ae19bb0

📥 Commits

Reviewing files that changed from the base of the PR and between 829075c and f70a923.

📒 Files selected for processing (22)

.github/workflows/ci.yml
.gitignore
docs/benchmark.md
docs/plans/agent-eval-harness.md
docs/plans/agent-surface-delivery.md
docs/plans/mcp-tool-allowlist.md
docs/roadmap.md
fixtures/agent-eval/sample-cursor-log.json
package.json
scripts/agent-eval/metrics.ts
scripts/agent-eval/parse-agent-log.test.ts
scripts/agent-eval/parse-agent-log.ts
scripts/agent-eval/print-log-metrics.ts
scripts/agent-eval/probe-tokens.ts
scripts/agent-eval/run-arms.sh
scripts/agent-eval/run-probes.ts
scripts/agent-eval/scenarios.json
scripts/agent-eval/schema.ts
scripts/agent-eval/traditional-probe.ts
scripts/query-golden.ts
scripts/query-golden/resolve-golden-query.ts
tsconfig.json

* docs: post-merge sweep after agent-eval (#139)
  Delete ten shipped agent-surface plans; prune roadmap [x] backlog; refresh README, glossary, templates, and cross-refs to agents.md.
* docs: fix broken table in callback-dispatch-synthesis plan
  Pipe characters in L.1 split the markdown table into extra columns.
* docs: address PR #140 review findings
  Fix plan anchor slugs, MCP-only tool wording, parse-worker env lift, and delivery-tracker maintenance steps for delete+lift lifecycle.

feat(agent-eval): add probe A/B harness and agent log parser (PR 9)

d44f4b5

Ship dev-only scripts/agent-eval tracer bullet so local runs compare MCP-on query vs glob/read/grep discovery on golden-mirrored probes, with optional agent transcript parsing and benchmark.md methodology docs.

SutuSebastian added 4 commits May 25, 2026 20:47

docs(plans): link agent-eval harness to PR #139

2f6bfcf

fix(agent-eval): address PR #139 review findings

ccb3826

Typecheck agent-eval scripts, wire test:agent-eval into check, fix per-arm success and token estimates, harden log parser, extract resolveGoldenQuery, and tighten benchmark/roadmap/allowlist docs.

fix(agent-eval): gate probe harness in CI and fail on partial probes

23937fb

Run test:agent-eval in the Test job, exit non-zero when scenarioSuccess is incomplete, document exit behavior and AGENT_EVAL_RUNS averaging, and sync plan/roadmap wording with the PR CI probe gate.

test(agent-eval): cover partial-failure exit code path

f70a923

Export applyProbeExitCode for unit testing and mark PR 9 merge-ready in the delivery tracker after CI green.
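
The commits above gate CI on probe results by exiting non-zero when `scenarioSuccess` is incomplete. One way to keep that logic unit-testable is to compute the exit code in a pure function and leave the actual process exit to the caller. The sketch below assumes a hypothetical `probeExitCode` helper and `ScenarioResult` shape; the real `applyProbeExitCode` signature is not shown on this page.

```typescript
// Hypothetical sketch; not the harness's applyProbeExitCode.
interface ScenarioResult {
  id: string;
  scenarioSuccess: boolean;
}

/**
 * Returns 0 when every scenario succeeded and 1 on any partial failure.
 * An empty list yields 0 here; whether the harness treats that as a
 * failure is not stated in the PR.
 */
function probeExitCode(scenarios: ScenarioResult[]): 0 | 1 {
  return scenarios.every((s) => s.scenarioSuccess) ? 0 : 1;
}

const results: ScenarioResult[] = [
  { id: "probe-a", scenarioSuccess: true },
  { id: "probe-b", scenarioSuccess: false },
];
console.log(probeExitCode(results)); // 1
```

Returning the code rather than exiting inside the function lets a test assert on the partial-failure path without terminating the test runner.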

coderabbitai Bot reviewed May 25, 2026

Comment thread scripts/agent-eval/parse-agent-log.ts

Comment thread scripts/agent-eval/run-probes.ts Outdated

Comment thread scripts/agent-eval/run-probes.ts

SutuSebastian added 4 commits May 25, 2026 21:44

fix(agent-eval): address CodeRabbit review on parser and CLI args

3a3590c

Separate JSON parse errors from unsupported transcript shapes, reject flag tokens as option values, and guard averageSamples on empty input.

fix(agent-eval): address final PR review findings

8d3258e

Add --scenarios for external fixtures, improve token math (bind values, log payloads), harden parser/CLI/schema edge cases, expand tests, and sync docs/roadmap cross-refs.

fix(agent-eval): close final review nits before merge

aca8066

Add --probes and --skip-index, re-ceil averaged estTokens, parse structured content arrays in logs, and sync README/golden-queries/benchmark docs.

fix(agent-eval): address final subagent review findings

5247b44

Broaden log parser tool/content coverage, align averaged deltas, guard traditional regex, restore CODEMAP_ROOT, exit 1 on partial probes, and reuse index in run-arms.sh when present.
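
Several of the fixes above concern averaging across repeated runs: guarding against empty input and re-applying a ceiling to the averaged `estTokens`. A minimal sketch of both follows, assuming a simplified `ArmSample` shape and a hypothetical `averageArm` helper. The error message is the one suggested in the review; the real `averageSamples` operates on richer sample objects not shown here.

```typescript
// Hypothetical shapes; the harness's averageSamples is not reproduced here.
interface ArmSample {
  wallMs: number;
  estTokens: number;
  toolCallCount: number;
}

function averageArm(samples: ArmSample[]): ArmSample {
  const n = samples.length;
  if (n === 0) {
    // Guard before any division or samples[0] access.
    throw new Error("averageSamples called with empty samples");
  }
  const sum = (pick: (s: ArmSample) => number): number =>
    samples.reduce((acc, s) => acc + pick(s), 0);
  return {
    wallMs: sum((s) => s.wallMs) / n,
    // Re-apply ceil after averaging so the estimate stays a whole token count.
    estTokens: Math.ceil(sum((s) => s.estTokens) / n),
    toolCallCount: sum((s) => s.toolCallCount) / n,
  };
}

const averaged = averageArm([
  { wallMs: 10, estTokens: 3, toolCallCount: 1 },
  { wallMs: 20, estTokens: 4, toolCallCount: 1 },
]);
console.log(averaged); // { wallMs: 15, estTokens: 4, toolCallCount: 1 }
```

Computing deltas from the averaged values, rather than averaging per-run deltas, keeps the reported deltas consistent with the rounded per-arm numbers; this is one plausible reading of "align averaged deltas", not a confirmed description of the change.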

SutuSebastian merged commit a6065ca into main May 26, 2026
10 of 11 checks passed

SutuSebastian deleted the feat/agent-eval-harness branch May 26, 2026 06:56

SutuSebastian mentioned this pull request May 26, 2026

docs: post-merge sweep after agent-eval (#139) #140

Merged

2 tasks

coderabbitai Bot mentioned this pull request May 26, 2026

feat(agent-eval): live MCP arms, log comparison, and external workflow #144

Merged

4 tasks

coderabbitai Bot mentioned this pull request Jun 10, 2026

feat(churn): git churn ingest and churn-complexity-hotspots recipe #179

Merged

5 tasks

SutuSebastian merged 9 commits into main from feat/agent-eval-harness
