feat(agent-eval): live MCP arms, log comparison, and external workflow #144

Merged: SutuSebastian merged 13 commits into main from feat/agent-eval-live-arms on May 26, 2026

@SutuSebastian SutuSebastian commented May 26, 2026

Contributor

Summary

  • Adds live agent-eval mode: dispatches golden probe tasks through handleQuery / handleQueryRecipe with a minimal CODEMAP_MCP_TOOLS=query,query_recipe allowlist (closer to real MCP than probe-mode queryRows).
  • Adds log comparison: parse exported MCP-on vs MCP-off session transcripts and emit comparison JSON + markdown summary.
  • Adds optional workflow_dispatch workflow (.github/workflows/agent-eval-external.yml) for probe or live runs on external indexed fixtures.
  • Closes P1 on agent-surface track; deletes shipped docs/plans/agent-eval-harness.md (lifted to docs/benchmark.md § Agent eval harness).
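
A minimal sketch of how the live-mode allowlist could be applied and then restored, assuming the behaviour this PR describes: default to `query,query_recipe` when `CODEMAP_MCP_TOOLS` is unset or blank, and put the previous value back on exit. Only the environment variable name and the two tool names come from the PR. The helper names `applyLiveAllowlist` and `isToolAllowed` are hypothetical and are not the harness's actual exports.

```typescript
// Hypothetical sketch: helper names are illustrative, not the harness's real API.
const LIVE_TOOLS = ["query", "query_recipe"] as const;

type Env = Record<string, string | undefined>;

/**
 * Set CODEMAP_MCP_TOOLS to the minimal allowlist when it is unset or blank.
 * Returns a function that restores the previous state of the variable.
 */
function applyLiveAllowlist(env: Env): () => void {
  const key = "CODEMAP_MCP_TOOLS";
  const previous = env[key];
  if (previous === undefined || previous.trim() === "") {
    env[key] = LIVE_TOOLS.join(",");
  }
  return () => {
    if (previous === undefined) delete env[key];
    else env[key] = previous;
  };
}

/** Check whether a tool name appears on the comma-separated allowlist in env. */
function isToolAllowed(env: Env, tool: string): boolean {
  const allowed = (env["CODEMAP_MCP_TOOLS"] ?? "")
    .split(",")
    .map((t) => t.trim())
    .filter((t) => t.length > 0);
  return allowed.includes(tool);
}
```

Passing the environment in as a plain object, rather than touching `process.env` directly, keeps the sketch testable without side effects. A caller would typically invoke the returned restore function in a `finally` block.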

Test plan

  • bun test scripts/agent-eval (36 tests — probe + live smoke)
  • AGENT_EVAL_MODE=live AGENT_EVAL_PRINT_SUMMARY=1 bash scripts/agent-eval/run-arms.sh — 3/3 scenarios ok on fixtures/minimal
  • Log comparison smoke via AGENT_EVAL_LOG_ON / AGENT_EVAL_LOG_OFF sample fixtures
  • CI green on PR

Summary by CodeRabbit

Release Notes

  • New Features

    • Added "live" evaluation mode for agent-eval harness alongside existing "probe" mode
    • New manual GitHub Actions workflow for running agent evaluations against external fixtures
    • Added log comparison capability to analyze differences between evaluation runs with different configurations
  • Tests

    • Expanded test coverage for new evaluation modes and log comparison tools
  • Documentation

    • Updated development guides and benchmarking documentation to reflect expanded agent-eval capabilities

Close P1 agent-surface eval work: live mode dispatches handleQuery/query_recipe
with a minimal CODEMAP_MCP_TOOLS allowlist, log mode compares exported session
transcripts, and workflow_dispatch runs probe or live arms on external fixtures.

changeset-bot Bot commented May 26, 2026

⚠️ No Changeset found

Latest commit: 486c4d1

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai Bot commented May 26, 2026

Warning

Review limit reached

@SutuSebastian, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 31 minutes and 5 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c1ff78bb-6a91-44e1-8233-0f489a32d070

📥 Commits

Reviewing files that changed from the base of the PR and between 8f07953 and 486c4d1.

📒 Files selected for processing (16)
  • .cursor/mcp.json
  • .github/workflows/agent-eval-external.yml
  • .vscode/mcp.json
  • docs/benchmark.md
  • docs/golden-queries.md
  • docs/plans/agent-surface-and-ops.md
  • docs/research/agent-eval-findings-2026-05.md
  • docs/roadmap.md
  • scripts/agent-eval/compare-live-logs.ts
  • scripts/agent-eval/live-mcp-arm.ts
  • scripts/agent-eval/parse-agent-log.test.ts
  • scripts/agent-eval/print-comparison-summary.ts
  • scripts/agent-eval/run-arms.sh
  • scripts/agent-eval/run-probes.ts
  • scripts/query-golden.ts
  • scripts/query-golden/run-setup.ts

Walkthrough

This PR extends the agent-eval benchmark harness from deterministic "probe" mode to add live MCP handler evaluation (live mode) alongside log comparison and dual-mode orchestration. It includes new CLI tools, workflow automation, comprehensive tests, and documentation updates.

Changes

Live MCP evaluation and log comparison harness

| Layer / File(s) | Summary |
| --- | --- |
| Live MCP tool management and payload analysis: scripts/agent-eval/mcp-allowlist.ts, scripts/agent-eval/tool-payload.ts, scripts/agent-eval/metrics.ts | Introduces LIVE_EVAL_MCP_TOOLS constant for query/query_recipe, provides environment initialization and tool validation helpers, and adds payload analysis for row counting and token estimation from tool results. |
| Live MCP arm execution: scripts/agent-eval/live-mcp-arm.ts | Implements runLiveMcpArm to dispatch live handlers (query vs query_recipe), measure wall time and metrics, estimate tokens from payload size, and return ArmRunMetrics with success/error details. |
| Live log comparison and markdown reporting: scripts/agent-eval/compare-live-logs.ts, scripts/agent-eval/print-comparison-summary.ts | Reads MCP-on vs MCP-off logs, computes per-scenario deltas, aggregates totals, and formats Markdown tables for probe/live/log comparison modes with mode-specific summary fields. |
| Dual-mode harness orchestration: scripts/agent-eval/run-probes.ts, scripts/agent-eval/run-arms.sh | Threads AgentEvalMode through CLI parsing, MCP-on execution dispatch (probe via SQL or live via handlers), environment management for CODEMAP_MCP_TOOLS, and conditional live log comparison invocation. |
| Comprehensive test suite and sample fixture: scripts/agent-eval/parse-agent-log.test.ts, fixtures/agent-eval/sample-no-mcp-log.json | Adds test suites for MCP allowlisting, live arm execution, log comparison, markdown rendering, and CLI behavior; includes sample agent log fixture for testing log parsing. |
| External harness workflow and CI updates: .github/workflows/agent-eval-external.yml, .github/workflows/ci.yml, .github/CONTRIBUTING.md | Adds new manual-dispatch Agent eval (external) workflow with configurable fixture, mode, runs, scenarios, and probes; updates CI step naming to reflect probe + live coverage. |
| Documentation and planning updates: docs/benchmark.md, docs/golden-queries.md, docs/agents.md, docs/packaging.md, docs/README.md, docs/roadmap.md, docs/plans/agent-surface-and-ops.md, docs/plans/agent-surface-delivery.md, docs/research/agent-eval-findings-2026-05.md, README.md | Rewrites benchmark harness section to document Probe/Live/Log modes and live-log comparison; updates plan status for agent surface work; adds research note on MCP vs traditional agent evaluation findings and methodology. |
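
The log-comparison row above (per-scenario deltas plus aggregated totals) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in compare-live-logs.ts. The per-arm field names `toolCallCount` and `estTokens` and the summary field names come from this PR. The `ScenarioArms` shape, the `compareArms` function, and the sign convention (MCP-on minus MCP-off, so a negative delta means the MCP-on arm used less) are hypothetical.

```typescript
// Subset of the per-arm metrics named in this PR; the real ArmRunMetrics carries more fields.
interface ArmMetrics {
  toolCallCount: number;
  estTokens: number;
}

// Hypothetical shape pairing the two arms of one scenario.
interface ScenarioArms {
  id: string;
  mcpOn: ArmMetrics;
  mcpOff: ArmMetrics;
}

interface ScenarioDelta {
  id: string;
  toolCallDelta: number; // mcpOn - mcpOff (assumed convention)
  estTokenDelta: number; // mcpOn - mcpOff (assumed convention)
}

interface LogSummary {
  mcpOnTotalToolCalls: number;
  mcpOffTotalToolCalls: number;
  mcpOnTotalEstTokens: number;
  mcpOffTotalEstTokens: number;
}

/** Compute per-scenario deltas and totals across both arms. */
function compareArms(scenarios: ScenarioArms[]): {
  deltas: ScenarioDelta[];
  summary: LogSummary;
} {
  const deltas = scenarios.map((s) => ({
    id: s.id,
    toolCallDelta: s.mcpOn.toolCallCount - s.mcpOff.toolCallCount,
    estTokenDelta: s.mcpOn.estTokens - s.mcpOff.estTokens,
  }));
  const sum = (pick: (s: ScenarioArms) => number): number =>
    scenarios.reduce((total, s) => total + pick(s), 0);
  const summary: LogSummary = {
    mcpOnTotalToolCalls: sum((s) => s.mcpOn.toolCallCount),
    mcpOffTotalToolCalls: sum((s) => s.mcpOff.toolCallCount),
    mcpOnTotalEstTokens: sum((s) => s.mcpOn.estTokens),
    mcpOffTotalEstTokens: sum((s) => s.mcpOff.estTokens),
  };
  return { deltas, summary };
}
```

Wall-time totals are left out of the sketch; where both logs record them, they would aggregate the same way.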

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • stainless-code/codemap#139: Extends the deterministic A/B probe harness and log parsing infrastructure from #139 by adding live MCP handler evaluation and log comparison rather than just the original probe-only flow.

Suggested labels

enhancement

Poem

🐰 A harness grows from probe to live,
MCP arms dance, logs compare and sieve,
Dual modes measure what agents can do,
From query to findings, research runs true!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 10.71% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the main changes: adding live MCP arms, log comparison, and an external workflow for agent-eval testing. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


  • Type-safe live MCP arm, workflow path/scenario inputs, doc accuracy (probe vs log modes, probe+live labels), env restore on exit, and tests.
  • Drop redundant initCodemap in live arm, validate log/workflow paths, correct delivery tracker status, golden-queries wording.
  • Align plan tracker with in-flight #144, validate log paths in compareLogArms, and clarify PR 9 shipped vs partial status across docs.
  • Harden live eval allowlist defaults, workflow runs validation, doc cross-refs, CLI smoke tests, and comparison JSON validation.
  • Validate probes/scenarios paths before read and align live-mode help with ensureLiveEvalMcpToolsEnv (unset or blank).
  • Surface live MCP handler errors in comparison JSON, validate scenario arm shape in print-comparison-summary, and extend tests.
  • Add research note for MCP vs traditional agent findings, eval layers table and pinned fixtures/minimal live numbers in benchmark.md.
  • Project-local MCP wiring for dev (bun src/index.ts) with watch and workspace root; other IDE MCP targets from agents init omitted.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
scripts/agent-eval/live-mcp-arm.ts (1)

63-64: ⚡ Quick win

Add a reason when zero-row runs are marked unsuccessful.

Line 63 can set success to false even when result.ok === true, but Line 64 only attaches error for !result.ok, leaving failed runs without diagnostics.

Proposed patch
-  return {
+  const success = result.ok && rows > 0;
+  const error =
+    !result.ok ? result.error : !success ? "agent-eval live: zero rows returned" : undefined;
+
+  return {
     wallMs,
     toolSequence,
     toolCallCount: toolSequence.length,
     resultCount: rows,
     estTokens: estimateProbeTokens(
       prompt,
       liveMcpPayloadChars(tool, callArgs, result),
     ),
-    success: result.ok && rows > 0,
-    ...(!result.ok ? { error: result.error } : {}),
+    success,
+    ...(error !== undefined ? { error } : {}),
   };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/agent-eval/live-mcp-arm.ts` around lines 63 - 64, The success flag is
computed as success: result.ok && rows > 0 but you only attach diagnostics when
!result.ok, so runs with result.ok===true and rows===0 have no error info;
update the object construction near success to include a diagnostic (e.g., an
error or reason field) whenever success is false — reference the variables
result and rows and the existing spread expression (...(!result.ok ? { error:
result.error } : {})) and change it to also emit a reason when rows === 0 (for
example { reason: 'no rows returned' } or similar) so all unsuccessful runs
carry explanatory diagnostics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/agent-eval-external.yml:
- Around line 38-39: Update the GitHub Actions steps to harden checkout and
artifact upload: for the "Checkout" step (uses: actions/checkout@v4) add
persist-credentials: false to the step inputs and pin the action to the specific
commit SHA instead of the tag; likewise, for the upload step that currently uses
actions/upload-artifact@v4, replace the tag with the action's exact commit SHA
to pin it to a known immutable version. Ensure you edit the steps named
"Checkout" and the upload-artifact step to include the SHA pins and the
persist-credentials: false configuration.
- Around line 46-77: The step "Resolve paths" currently assigns workflow inputs
directly into shell variables (RUNS, FIXTURE, SCEN, PROB) using inline
substitution which lets untrusted workflow_dispatch values be interpreted by the
shell; move these inputs into the step's env: mapping and read them from the
environment (e.g., use env RUNS: ${{ inputs.runs }} and then reference "$RUNS"
inside the script) so that the shell does not perform command substitution
before your validation, and keep the existing validation logic for RUNS,
FIXTURE, SCEN and PROB (and their *_ABS variants) intact.

In `@docs/plans/agent-surface-and-ops.md`:
- Line 29: Update the P1 section copy to use present-tense while PR `#144` is
still open: edit the line containing the parenthetical "(none after `#144` merges
— probe slice already shipped in `#139`)" and change it to present-tense wording
(for example: "(none while `#144` remains open — probe slice shipped in `#139`)") so
the "P1 — Open" heading clearly reflects current status; ensure the change
targets the text under the "P1 — Open" section in agent-surface-and-ops.md
referencing PR `#144`.

In `@docs/roadmap.md`:
- Line 63: Replace the typo phrase "completes PR 9 in `#144`" with "completes P1
in `#144`" so the roadmap uses the correct P1/P2 priority terminology; update the
parenthetical line that currently reads "(none after `#144` merges — probe shipped
in `#139`; live+log completes PR 9 in `#144`)" to use "P1" instead of "PR 9" to keep
priority naming consistent.

In `@scripts/agent-eval/print-comparison-summary.ts`:
- Around line 54-63: The log-mode validation currently checks numeric types for
summary.mcpOnTotalToolCalls, mcpOffTotalToolCalls, mcpOnTotalEstTokens, and
mcpOffTotalEstTokens but omits summary.mcpOnTotalWallMs and
summary.mcpOffTotalWallMs, allowing malformed values to pass; update the mode
=== "log" branch to also verify that if summary.mcpOnTotalWallMs or
summary.mcpOffTotalWallMs are present they are numbers, returning false if not,
and then continue to return row.scenarios.every(isLogScenario); reference the
summary object fields (mcpOnTotalWallMs, mcpOffTotalWallMs) and the existing
isLogScenario check when implementing the additional typeof checks.

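
The log-mode validation discussed in the print-comparison-summary.ts comment can be illustrated with a short sketch. It assumes the four tool-call and token totals must be numbers and that the two wall-time totals are optional but, when present, must also be numbers. The field names are taken from the review comment. The function name `isLogSummary` and the split into required and optional fields are assumptions for illustration, not the file's actual code.

```typescript
// Field names come from the review comment; required vs optional is an assumption.
const REQUIRED_TOTALS = [
  "mcpOnTotalToolCalls",
  "mcpOffTotalToolCalls",
  "mcpOnTotalEstTokens",
  "mcpOffTotalEstTokens",
] as const;

const OPTIONAL_TOTALS = ["mcpOnTotalWallMs", "mcpOffTotalWallMs"] as const;

/** Hypothetical guard: true when a log-mode summary has well-typed totals. */
function isLogSummary(summary: unknown): boolean {
  if (typeof summary !== "object" || summary === null) return false;
  const fields = summary as Record<string, unknown>;
  for (const key of REQUIRED_TOTALS) {
    if (typeof fields[key] !== "number") return false;
  }
  for (const key of OPTIONAL_TOTALS) {
    // Absent is fine; present with a non-number type is rejected.
    if (fields[key] !== undefined && typeof fields[key] !== "number") return false;
  }
  return true;
}
```

A stricter variant could use `Number.isFinite` instead of `typeof`, since `NaN` also has type `number`.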

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 551c16cc-ccfe-416f-a8e6-ec7c0708ba8f

📥 Commits

Reviewing files that changed from the base of the PR and between e475d20 and 8f07953.

📒 Files selected for processing (24)
  • .github/CONTRIBUTING.md
  • .github/workflows/agent-eval-external.yml
  • .github/workflows/ci.yml
  • README.md
  • docs/README.md
  • docs/agents.md
  • docs/benchmark.md
  • docs/golden-queries.md
  • docs/packaging.md
  • docs/plans/agent-eval-harness.md
  • docs/plans/agent-surface-and-ops.md
  • docs/plans/agent-surface-delivery.md
  • docs/research/agent-eval-findings-2026-05.md
  • docs/roadmap.md
  • fixtures/agent-eval/sample-no-mcp-log.json
  • scripts/agent-eval/compare-live-logs.ts
  • scripts/agent-eval/live-mcp-arm.ts
  • scripts/agent-eval/mcp-allowlist.ts
  • scripts/agent-eval/metrics.ts
  • scripts/agent-eval/parse-agent-log.test.ts
  • scripts/agent-eval/print-comparison-summary.ts
  • scripts/agent-eval/run-arms.sh
  • scripts/agent-eval/run-probes.ts
  • scripts/agent-eval/tool-payload.ts
💤 Files with no reviewable changes (1)
  • docs/plans/agent-eval-harness.md

  • Present-tense P1 tracker copy, workflow_dispatch inputs via step env, and log-mode wall-ms validation in print-comparison-summary.
  • Run golden setup after index, surface live 0-row failures, preserve averaged arm errors, align doc tables with probe/live vs log modes.
  • Harden setup paths, multi-run averaging, live allowlist errors in JSON, log compare empty MCP-on guard, summary on partial failure, doc fixes.
  • MCP-off zero-result error parity and probe summary before log compare.
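
Once workflow_dispatch inputs arrive through step env rather than inline substitution, the script can treat them as plain strings and validate them before use. The sketch below shows that idea in TypeScript. The workflow itself performs its checks in shell, so this is an analogy rather than the PR's code. The function names, the cap of 20 runs, and the POSIX-style path assumption are all hypothetical.

```typescript
/**
 * Parse a RUNS-style value read from the environment.
 * Accepts only a plain positive integer; the default cap of 20 is an illustrative assumption.
 */
function parseRuns(raw: string | undefined, max = 20): number {
  const text = (raw ?? "").trim();
  if (!/^[1-9][0-9]*$/.test(text)) {
    throw new Error(`invalid runs value: ${String(raw)}`);
  }
  const runs = Number(text);
  if (runs > max) {
    throw new Error(`runs value ${runs} exceeds cap ${max}`);
  }
  return runs;
}

/**
 * Accept only relative, POSIX-style paths with no ".." segments,
 * so a fixture or scenarios path cannot escape the workspace.
 */
function isSafeRelativePath(raw: string): boolean {
  if (raw.length === 0 || raw.startsWith("/") || raw.includes("\0")) return false;
  return !raw.split("/").includes("..");
}
```

Because the values are never interpolated into a command line, a payload such as `$(rm -rf /)` is just a string that fails the integer check.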

SutuSebastian merged commit 1dd422d into main on May 26, 2026 (1 check passed).
SutuSebastian deleted the feat/agent-eval-live-arms branch on May 26, 2026, 12:25.