feat(agent-eval): probe A/B harness and log parser (PR 9) #139

Merged
SutuSebastian merged 9 commits into main from feat/agent-eval-harness
May 26, 2026

@SutuSebastian SutuSebastian commented May 25, 2026

Contributor

Summary

  • Add dev-only scripts/agent-eval/ harness (PR 9 tracer bullet): a deterministic A/B run comparing MCP-on (one query per probe) against MCP-off (glob → read × N → grep) on three golden-mirrored probes.
  • bash scripts/agent-eval/run-arms.sh writes local comparison JSON (.agent-eval/comparison.json); no telemetry upload.
  • Agent log parser for entries / messages / line export formats + sample fixture; methodology documented in docs/benchmark.md.

Test plan

  • bash scripts/agent-eval/run-arms.sh — 3/3 scenarios ok, comparison JSON written
  • bun test scripts/agent-eval/parse-agent-log.test.ts
  • Pre-commit hook (format, lint, parser tests)

Follow-ups (not in this PR)

  • CI workflow_dispatch / nightly job
  • Live agent A/B with MCP allowlist arms
  • External public fixture probes (zod, fastify)

Summary by CodeRabbit

  • New Features

    • Added an agent evaluation harness to run deterministic A/B comparisons between two probe modes and produce JSON reports with metrics (tool calls, tokens, wall time, deltas).
  • Tests

    • Added comprehensive tests covering log parsing, token estimation, probe execution, schema validation, and smoke runs against sample fixtures.
  • Documentation

    • Expanded docs and README with harness usage, options, CI integration, and benchmark guidance.
  • Chores

    • Updated CI, npm scripts, gitignore, and TypeScript includes to enable the harness.

Review Change Stack

Ship dev-only scripts/agent-eval tracer bullet so local runs compare MCP-on
query vs glob/read/grep discovery on golden-mirrored probes, with optional
agent transcript parsing and benchmark.md methodology docs.

changeset-bot Bot commented May 25, 2026

⚠️ No Changeset found

Latest commit: 5247b44

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


coderabbitai Bot commented May 25, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

Adds a deterministic A/B agent-eval harness (MCP-on vs MCP-off) with probes schema, log parsing and token metrics, averaging/reporting CLI, tests, CI/package wiring, docs, and extracts golden-query resolution into a reusable resolver.

Changes

Agent-Eval Harness & Query Refactoring

| Layer | File(s) | Summary |
| --- | --- | --- |
| Golden query resolution extraction | scripts/query-golden.ts, scripts/query-golden/resolve-golden-query.ts | Extracted golden scenario → {sql, bindValues} resolution into resolveGoldenQuery, supporting raw SQL and recipe-based scenarios with param resolution. |
| Metrics and token estimation | scripts/agent-eval/metrics.ts, scripts/agent-eval/probe-tokens.ts | UTF-8 byte-length helpers and token-estimate helpers; payload sizing for MCP-on (SQL + rows + binds) and MCP-off (bytesRead + results). |
| Agent log parsing & CLI | scripts/agent-eval/parse-agent-log.ts, scripts/agent-eval/print-log-metrics.ts | Parser for multiple JSON shapes and line logs that extracts tool sequences, byte counts, optional wall time, and estimated tokens; CLI prints parsed metrics. |
| Traditional probe (MCP-off baseline) | scripts/agent-eval/traditional-probe.ts | Regex or builtin probes that expand globs, read files, match content, and return results/filesRead/bytesRead/wallMs plus deterministic tool sequence. |
| Probe schema, scenarios, fixtures | scripts/agent-eval/schema.ts, scripts/agent-eval/scenarios.json, fixtures/agent-eval/sample-cursor-log.json | Zod-validated probes schema, three probe definitions, and a sample cursor/log fixture for tests. |
| A/B probe execution and reporting | scripts/agent-eval/run-probes.ts, scripts/agent-eval/run-arms.sh | CLI runs MCP-on (golden SQL) and MCP-off (traditional) arms, computes per-arm metrics, averages runs, computes deltas and summary totals, writes JSON comparison report, and sets non-zero exit on partial failures. |
| Test suite for agent-eval | scripts/agent-eval/parse-agent-log.test.ts | Comprehensive tests for log parsing formats, token/byte sizing, run-probes helpers (golden resolution, averaging, success gating), schema validation, and smoke runs against fixtures/minimal. |
| CI, package scripts, and build config | package.json, .github/workflows/ci.yml, .gitignore, tsconfig.json, src/package-scripts.test.ts, .github/CONTRIBUTING.md | Add test:agent-eval script and include in check; CI test job runs it after golden tests; ignore .agent-eval/; include agent-eval TS files in tsconfig; update contributor docs and tests checking scripts. |
| Docs, plans, and roadmap | docs/benchmark.md, docs/plans/*, docs/roadmap.md, README.md, docs/README.md, docs/golden-queries.md | Document harness, local run script, environment overrides, CI gating on fixtures/minimal, update plan/roadmap/ownership tables and contributor docs to reference agent-eval harness. |
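
The "Metrics and token estimation" row above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the helper names (`utf8ByteLength`, `estimateTokens`, `mcpOnPayloadBytes`, `mcpOffPayloadBytes`) and the 4-bytes-per-token ratio are not taken from scripts/agent-eval/metrics.ts, and the real payload serialization may differ.

```typescript
// Hypothetical helpers. The names and the 4-bytes-per-token ratio are
// assumptions for illustration, not the PR's actual implementation.
const ASSUMED_BYTES_PER_TOKEN = 4;

/** UTF-8 byte length of a string (not its UTF-16 code-unit length). */
function utf8ByteLength(text: string): number {
  return new TextEncoder().encode(text).length;
}

/** Rough token estimate: bytes divided by the assumed ratio, rounded up. */
function estimateTokens(bytes: number): number {
  return Math.ceil(bytes / ASSUMED_BYTES_PER_TOKEN);
}

/** MCP-on payload: SQL text plus JSON-serialized bind values and rows. */
function mcpOnPayloadBytes(
  sql: string,
  bindValues: unknown[],
  rows: unknown[],
): number {
  return (
    utf8ByteLength(sql) +
    utf8ByteLength(JSON.stringify(bindValues)) +
    utf8ByteLength(JSON.stringify(rows))
  );
}

/** MCP-off payload: bytes read from files plus JSON-serialized results. */
function mcpOffPayloadBytes(bytesRead: number, results: unknown[]): number {
  return bytesRead + utf8ByteLength(JSON.stringify(results));
}
```

Measuring bytes rather than `string.length` keeps the estimate stable for non-ASCII content, which matters when file contents and SQL rows are compared on the same scale.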

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

enhancement, documentation

Poem

🐰 A harness hops from MCP to grep,
Comparing SQL swift with file-search step,
Probes tally tokens, logs sing the tune,
Two arms danced under a testing moon—
Results written down, the rabbit nods: “Well done.”

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 10.53% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(agent-eval): probe A/B harness and log parser (PR 9)' clearly and concisely summarizes the main changes: adding an A/B testing harness and log parser for agent evaluation, which aligns with the substantial additions to scripts/agent-eval/ and related documentation updates. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

Typecheck agent-eval scripts, wire test:agent-eval into check, fix per-arm
success and token estimates, harden log parser, extract resolveGoldenQuery,
and tighten benchmark/roadmap/allowlist docs.

Run test:agent-eval in the Test job, exit non-zero when scenarioSuccess
is incomplete, document exit behavior and AGENT_EVAL_RUNS averaging, and
sync plan/roadmap wording with the PR CI probe gate.

Export applyProbeExitCode for unit testing and mark PR 9 merge-ready
in the delivery tracker after CI green.
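
The exit behavior mentioned in these commits (non-zero exit when scenarioSuccess is incomplete) could look roughly like the sketch below. The real `applyProbeExitCode` signature is not shown on this page, so `probeExitCode` and `ScenarioOutcome` are hypothetical names, and treating an empty run as a failure is an assumption.

```typescript
// Hypothetical shape; only the scenarioSuccess field name comes from the PR.
interface ScenarioOutcome {
  id: string;
  scenarioSuccess: boolean;
}

/**
 * Returns 0 when every scenario succeeded and 1 otherwise.
 * Assumption: a run with no scenarios counts as a failure.
 */
function probeExitCode(scenarios: ScenarioOutcome[]): 0 | 1 {
  if (scenarios.length === 0) return 1;
  return scenarios.every((s) => s.scenarioSuccess) ? 0 : 1;
}
```

Returning the code from a pure function, and assigning it to `process.exitCode` at the call site, is one way to keep the gating logic unit-testable without terminating the test process.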

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
scripts/agent-eval/run-probes.ts (1)

26-65: ⚡ Quick win

Add concise docs for exported harness APIs with non-obvious semantics.

The exported APIs encode policy choices (success criteria, averaging/rounding, exit behavior). A short TSDoc note per export would make downstream/test usage safer and clearer.

As per coding guidelines "**/*.{ts,tsx,js,jsx}: All public APIs must have accompanying documentation."

Also applies to: 123-195, 256-303

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/agent-eval/run-probes.ts` around lines 26 - 65, Add short TSDoc
comments to each exported interface and key fields (ArmRunMetrics,
ScenarioComparison, AgentEvalComparison) describing non-obvious semantics: e.g.,
state that ArmRunMetrics.wallMs is elapsed time in milliseconds, estTokens is
estimated token usage and whether it’s rounded/averaged, toolSequence is ordered
tool names called, toolCallCount is total tool invocation count, resultCount is
number of rows returned and success flag semantics; for ScenarioComparison note
scenarioSuccess means both arms returned ≥1 row, delta fields are mcpOn - mcpOff
(sign convention), and for AgentEvalComparison document generatedAt format (ISO
string), mode fixed value "probe", runs is number of repeats, fixtureRoot
meaning, and exactly how summary aggregates (which arm contributed to
mcpOn*/mcpOff* totals and whether values are sums/averages/rounded). Ensure each
TSDoc is concise and placed immediately above the respective exported
interface/field names (ArmRunMetrics, ScenarioComparison, AgentEvalComparison,
and summary fields).
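
As an illustration of what the requested TSDoc might look like, here is a minimal sketch. The interface and field names listed in the comment above are reused, but the delta field names (`deltaEstTokens`, `deltaToolCalls`), the `scenarios` array, the `compareArms` helper, and the reading of `success` as "at least one result" are assumptions, not the actual definitions in run-probes.ts.

```typescript
/** Metrics for one arm (MCP-on or MCP-off) of a single scenario. */
interface ArmRunMetrics {
  /** Elapsed wall-clock time in milliseconds. */
  wallMs: number;
  /** Estimated token usage (an estimate, not a tokenizer count). */
  estTokens: number;
  /** Tool names in the order they were called. */
  toolSequence: string[];
  /** Total number of tool invocations. */
  toolCallCount: number;
  /** Number of rows or matches returned. */
  resultCount: number;
  /** Assumed here to mean the arm returned at least one result. */
  success: boolean;
}

interface ScenarioComparison {
  mcpOn: ArmRunMetrics;
  mcpOff: ArmRunMetrics;
  /** True only when both arms returned at least one row. */
  scenarioSuccess: boolean;
  /** mcpOn minus mcpOff; negative means MCP-on used fewer tokens. */
  deltaEstTokens: number;
  /** mcpOn minus mcpOff; negative means MCP-on made fewer tool calls. */
  deltaToolCalls: number;
}

interface AgentEvalComparison {
  /** ISO 8601 timestamp of when the report was generated. */
  generatedAt: string;
  /** Fixed value for this harness mode. */
  mode: "probe";
  /** Number of repeated runs averaged per scenario. */
  runs: number;
  /** Root directory of the fixture the probes ran against. */
  fixtureRoot: string;
  /** Hypothetical field name for the per-scenario results. */
  scenarios: ScenarioComparison[];
}

/** Builds a comparison using the mcpOn - mcpOff sign convention. */
function compareArms(
  mcpOn: ArmRunMetrics,
  mcpOff: ArmRunMetrics,
): ScenarioComparison {
  return {
    mcpOn,
    mcpOff,
    scenarioSuccess: mcpOn.resultCount >= 1 && mcpOff.resultCount >= 1,
    deltaEstTokens: mcpOn.estTokens - mcpOff.estTokens,
    deltaToolCalls: mcpOn.toolCallCount - mcpOff.toolCallCount,
  };
}
```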
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/agent-eval/parse-agent-log.ts`:
- Around line 195-201: The catch is catching errors from both JSON.parse(raw)
and parseJsonLog(...), masking shape/validation errors as “invalid JSON”; fix it
by only catching JSON.parse failures: first attempt to JSON.parse(raw) inside a
try/catch and on error throw the current `agent log: invalid JSON` with the
parse error message, then call `parseJsonLog(...)` outside that catch so any
errors thrown by `parseJsonLog` (shape/validation) are preserved and propagate
(or can be handled separately); update the block around `trimmed.startsWith("{")
|| trimmed.startsWith("[")` to reflect this separation and keep references to
`raw`, `JSON.parse`, and `parseJsonLog`.
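
The separation described in this comment can be sketched as follows. The `parseJsonLog` body here is a hypothetical stand-in (it only accepts an object with an `entries` array) and the line-format branch is omitted; only the control flow around `JSON.parse` is the point.

```typescript
interface ParsedLog {
  toolSequence: string[];
}

// Hypothetical stand-in for the real shape parser, for illustration only.
function parseJsonLog(value: unknown): ParsedLog {
  const entries = (value as { entries?: unknown } | null)?.entries;
  if (!Array.isArray(entries)) {
    throw new Error("agent log: unsupported transcript shape");
  }
  const toolSequence = entries
    .map((e) => (e as { tool?: unknown } | null)?.tool)
    .filter((t): t is string => typeof t === "string");
  return { toolSequence };
}

function parseAgentLog(raw: string): ParsedLog {
  const trimmed = raw.trim();
  if (trimmed.startsWith("{") || trimmed.startsWith("[")) {
    let parsed: unknown;
    try {
      // Only syntax errors from JSON.parse are caught and relabeled here.
      parsed = JSON.parse(raw);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      throw new Error(`agent log: invalid JSON: ${message}`);
    }
    // Shape and validation errors from parseJsonLog propagate unchanged.
    return parseJsonLog(parsed);
  }
  throw new Error("agent log: line-format parsing omitted in this sketch");
}
```

With this split, a syntactically valid log in an unknown shape reports "unsupported transcript shape" instead of being mislabeled as invalid JSON.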

In `@scripts/agent-eval/run-probes.ts`:
- Around line 260-262: Guard the averageSamples implementation against empty
input: at the start of the function check samples.length (n) and if zero throw a
clear Error (e.g. "averageSamples called with empty samples") or return a safe
value, then only access samples[0].prompt and compute avgArm/divisions after
that check so you never dereference samples[0] or divide by n when n is 0;
update the code paths that use samples, n, prompt, and avgArm accordingly.
- Around line 74-83: The CLI parsing accepts the next argv token for --output
and --fixture-root even when that token is another flag; update the parsing in
run-probes.ts (the branches handling a === "--output" and a ===
"--fixture-root", using variables a, argv, i, output, fixtureRoot) to validate
that argv[i+1] exists and does not start with "-" before consuming it; if
argv[i+1] is missing or startsWith("-"), throw a clear Error like "Missing value
for --output" / "Missing value for --fixture-root" instead of silently treating
a flag as the path. Ensure you increment i only after confirming the token is
valid.
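
The two run-probes.ts findings above can be sketched together. This is a minimal illustration under stated assumptions: `CliOptions`, `parseArgs`, `Sample`, and `averageSamplesSketch` are hypothetical names, and the real `averageSamples` averages per-arm metrics rather than a single `wallMs` value.

```typescript
interface CliOptions {
  output?: string;
  fixtureRoot?: string;
}

/** Parses --output and --fixture-root, rejecting flag tokens as values. */
function parseArgs(argv: string[]): CliOptions {
  const options: CliOptions = {};
  for (let i = 0; i < argv.length; i++) {
    const a = argv[i];
    if (a === "--output" || a === "--fixture-root") {
      const value = argv[i + 1];
      if (value === undefined || value.startsWith("-")) {
        throw new Error(`Missing value for ${a}`);
      }
      if (a === "--output") options.output = value;
      else options.fixtureRoot = value;
      i++; // consume the value only after it has been validated
    }
  }
  return options;
}

// Simplified sample shape; the real samples carry per-arm metrics.
interface Sample {
  prompt: string;
  wallMs: number;
}

/** Averages wallMs across samples, failing fast on empty input. */
function averageSamplesSketch(samples: Sample[]): Sample {
  const first = samples[0];
  if (first === undefined) {
    throw new Error("averageSamples called with empty samples");
  }
  const total = samples.reduce((sum, s) => sum + s.wallMs, 0);
  return { prompt: first.prompt, wallMs: total / samples.length };
}
```

One consequence of rejecting any value that starts with "-" is that paths beginning with a dash cannot be passed; that trade-off follows the wording of the review comment.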

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 07deb864-a257-4516-a8cf-ec5b5ae19bb0

📥 Commits

Reviewing files that changed from the base of the PR and between 829075c and f70a923.

📒 Files selected for processing (22)
  • .github/workflows/ci.yml
  • .gitignore
  • docs/benchmark.md
  • docs/plans/agent-eval-harness.md
  • docs/plans/agent-surface-delivery.md
  • docs/plans/mcp-tool-allowlist.md
  • docs/roadmap.md
  • fixtures/agent-eval/sample-cursor-log.json
  • package.json
  • scripts/agent-eval/metrics.ts
  • scripts/agent-eval/parse-agent-log.test.ts
  • scripts/agent-eval/parse-agent-log.ts
  • scripts/agent-eval/print-log-metrics.ts
  • scripts/agent-eval/probe-tokens.ts
  • scripts/agent-eval/run-arms.sh
  • scripts/agent-eval/run-probes.ts
  • scripts/agent-eval/scenarios.json
  • scripts/agent-eval/schema.ts
  • scripts/agent-eval/traditional-probe.ts
  • scripts/query-golden.ts
  • scripts/query-golden/resolve-golden-query.ts
  • tsconfig.json

Comment thread scripts/agent-eval/parse-agent-log.ts
Comment thread scripts/agent-eval/run-probes.ts Outdated
Comment thread scripts/agent-eval/run-probes.ts
Separate JSON parse errors from unsupported transcript shapes, reject flag tokens as option values, and guard averageSamples on empty input.

Add --scenarios for external fixtures, improve token math (bind values, log payloads),
harden parser/CLI/schema edge cases, expand tests, and sync docs/roadmap cross-refs.

Add --probes and --skip-index, re-ceil averaged estTokens, parse structured
content arrays in logs, and sync README/golden-queries/benchmark docs.

Broaden log parser tool/content coverage, align averaged deltas, guard
traditional regex, restore CODEMAP_ROOT, exit 1 on partial probes, and
reuse index in run-arms.sh when present.
@SutuSebastian SutuSebastian merged commit a6065ca into main May 26, 2026
10 of 11 checks passed
@SutuSebastian SutuSebastian deleted the feat/agent-eval-harness branch May 26, 2026 06:56
SutuSebastian added a commit that referenced this pull request May 26, 2026
* docs: post-merge sweep after agent-eval (#139)

Delete ten shipped agent-surface plans; prune roadmap [x] backlog;
refresh README, glossary, templates, and cross-refs to agents.md.

* docs: fix broken table in callback-dispatch-synthesis plan

Pipe characters in L.1 split the markdown table into extra columns.

* docs: address PR #140 review findings

Fix plan anchor slugs, MCP-only tool wording, parse-worker env lift,
and delivery-tracker maintenance steps for delete+lift lifecycle.