feat(agent-eval): live MCP arms, log comparison, and external workflow by SutuSebastian · Pull Request #144 · stainless-code/codemap

SutuSebastian · 2026-05-26T10:39:22Z

Summary

- Adds live agent-eval mode: dispatches golden probe tasks through handleQuery / handleQueryRecipe with a minimal CODEMAP_MCP_TOOLS=query,query_recipe allowlist (closer to real MCP than probe-mode queryRows).
- Adds log comparison: parses exported MCP-on vs MCP-off session transcripts and emits comparison JSON plus a markdown summary.
- Adds an optional workflow_dispatch workflow (.github/workflows/agent-eval-external.yml) for probe or live runs on external indexed fixtures.
- Closes P1 on the agent-surface track; deletes the shipped docs/plans/agent-eval-harness.md (lifted to docs/benchmark.md § Agent eval harness).
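
The allowlist idea in the first item can be illustrated with a minimal sketch. It assumes only what the summary states, namely that CODEMAP_MCP_TOOLS holds a comma-separated list of tool names. The helpers `parseToolAllowlist` and `isToolAllowed` are hypothetical names chosen for illustration and are not functions from this PR; the treatment of an unset or blank value is likewise an assumption.

```typescript
// Hypothetical sketch: parse a comma-separated tool allowlist such as
// CODEMAP_MCP_TOOLS=query,query_recipe. Helper names and the handling of
// blank values are assumptions, not the harness's actual code.
function parseToolAllowlist(raw: string | undefined): string[] | null {
  // In this sketch, unset or blank means "no allowlist configured".
  if (raw === undefined || raw.trim() === "") return null;
  return raw
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
}

function isToolAllowed(allowlist: string[] | null, tool: string): boolean {
  // With no allowlist configured, every tool is permitted (assumed default).
  return allowlist === null || allowlist.indexOf(tool) !== -1;
}
```

The value is passed in as a parameter rather than read from the environment so the sketch stays independent of any particular runtime.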

Test plan

- bun test scripts/agent-eval (36 tests: probe + live smoke)
- AGENT_EVAL_MODE=live AGENT_EVAL_PRINT_SUMMARY=1 bash scripts/agent-eval/run-arms.sh: 3/3 scenarios ok on fixtures/minimal
- Log comparison smoke via AGENT_EVAL_LOG_ON / AGENT_EVAL_LOG_OFF sample fixtures
- CI green on PR

Summary by CodeRabbit

Release Notes

New Features
- Added "live" evaluation mode for agent-eval harness alongside existing "probe" mode
- New manual GitHub Actions workflow for running agent evaluations against external fixtures
- Added log comparison capability to analyze differences between evaluation runs with different configurations
Tests
- Expanded test coverage for new evaluation modes and log comparison tools
Documentation
- Updated development guides and benchmarking documentation to reflect expanded agent-eval capabilities

Close P1 agent-surface eval work: live mode dispatches handleQuery/query_recipe with a minimal CODEMAP_MCP_TOOLS allowlist, log mode compares exported session transcripts, and workflow_dispatch runs probe or live arms on external fixtures.

changeset-bot · 2026-05-26T10:39:29Z

⚠️ No Changeset found

Latest commit: 486c4d1

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai · 2026-05-26T10:39:30Z

Warning

Review limit reached

@SutuSebastian, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 31 minutes and 5 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c1ff78bb-6a91-44e1-8233-0f489a32d070

📥 Commits

Reviewing files that changed from the base of the PR and between 8f07953 and 486c4d1.

📒 Files selected for processing (16)

.cursor/mcp.json
.github/workflows/agent-eval-external.yml
.vscode/mcp.json
docs/benchmark.md
docs/golden-queries.md
docs/plans/agent-surface-and-ops.md
docs/research/agent-eval-findings-2026-05.md
docs/roadmap.md
scripts/agent-eval/compare-live-logs.ts
scripts/agent-eval/live-mcp-arm.ts
scripts/agent-eval/parse-agent-log.test.ts
scripts/agent-eval/print-comparison-summary.ts
scripts/agent-eval/run-arms.sh
scripts/agent-eval/run-probes.ts
scripts/query-golden.ts
scripts/query-golden/run-setup.ts

📝 Walkthrough

This PR extends the agent-eval benchmark harness beyond deterministic "probe" mode by adding live MCP handler evaluation ("live" mode), log comparison, and dual-mode orchestration. It also includes new CLI tools, workflow automation, tests, and documentation updates.

Changes

Live MCP evaluation and log comparison harness

| Layer / File(s) | Summary |
| --- | --- |
| Live MCP tool management and payload analysis `scripts/agent-eval/mcp-allowlist.ts`, `scripts/agent-eval/tool-payload.ts`, `scripts/agent-eval/metrics.ts` | Introduces `LIVE_EVAL_MCP_TOOLS` constant for query/query_recipe, provides environment initialization and tool validation helpers, and adds payload analysis for row counting and token estimation from tool results. |
| Live MCP arm execution `scripts/agent-eval/live-mcp-arm.ts` | Implements `runLiveMcpArm` to dispatch live handlers (query vs query_recipe), measure wall time and metrics, estimate tokens from payload size, and return `ArmRunMetrics` with success/error details. |
| Live log comparison and markdown reporting `scripts/agent-eval/compare-live-logs.ts`, `scripts/agent-eval/print-comparison-summary.ts` | Reads MCP-on vs MCP-off logs, computes per-scenario deltas, aggregates totals, and formats Markdown tables for probe/live/log comparison modes with mode-specific summary fields. |
| Dual-mode harness orchestration `scripts/agent-eval/run-probes.ts`, `scripts/agent-eval/run-arms.sh` | Threads `AgentEvalMode` through CLI parsing, MCP-on execution dispatch (probe via SQL or live via handlers), environment management for `CODEMAP_MCP_TOOLS`, and conditional live log comparison invocation. |
| Comprehensive test suite and sample fixture `scripts/agent-eval/parse-agent-log.test.ts`, `fixtures/agent-eval/sample-no-mcp-log.json` | Adds test suites for MCP allowlisting, live arm execution, log comparison, markdown rendering, and CLI behavior; includes sample agent log fixture for testing log parsing. |
| External harness workflow and CI updates `.github/workflows/agent-eval-external.yml`, `.github/workflows/ci.yml`, `.github/CONTRIBUTING.md` | Adds new manual-dispatch `Agent eval (external)` workflow with configurable fixture, mode, runs, scenarios, and probes; updates CI step naming to reflect probe + live coverage. |
| Documentation and planning updates `docs/benchmark.md`, `docs/golden-queries.md`, `docs/agents.md`, `docs/packaging.md`, `docs/README.md`, `docs/roadmap.md`, `docs/plans/agent-surface-and-ops.md`, `docs/plans/agent-surface-delivery.md`, `docs/research/agent-eval-findings-2026-05.md`, `README.md` | Rewrites benchmark harness section to document Probe/Live/Log modes and live-log comparison; updates plan status for agent surface work; adds research note on MCP vs traditional agent evaluation findings and methodology. |
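
As a rough illustration of the "per-scenario deltas, aggregates totals" step described in the table, the sketch below compares two arms keyed by scenario name. The `ArmTotals` and `ScenarioDelta` shapes and the `compareArms` and `sumDeltas` helpers are hypothetical; the field names echo metrics mentioned in this PR (`toolCallCount`, `estTokens`), but the real types in `compare-live-logs.ts` may differ.

```typescript
// Hypothetical shapes for illustration only.
interface ArmTotals {
  toolCallCount: number;
  estTokens: number;
}

interface ScenarioDelta {
  scenario: string;
  toolCallDelta: number; // MCP-on minus MCP-off; negative means MCP-on used fewer
  estTokenDelta: number;
}

function compareArms(
  mcpOn: Record<string, ArmTotals>,
  mcpOff: Record<string, ArmTotals>,
): ScenarioDelta[] {
  const deltas: ScenarioDelta[] = [];
  for (const scenario of Object.keys(mcpOn)) {
    const on = mcpOn[scenario];
    const off = mcpOff[scenario];
    // Assumption: scenarios present in only one arm are skipped, not failed.
    if (on === undefined || off === undefined) continue;
    deltas.push({
      scenario,
      toolCallDelta: on.toolCallCount - off.toolCallCount,
      estTokenDelta: on.estTokens - off.estTokens,
    });
  }
  return deltas;
}

function sumDeltas(deltas: ScenarioDelta[]): { toolCallDelta: number; estTokenDelta: number } {
  // Aggregate per-scenario deltas into run-level totals.
  return deltas.reduce(
    (acc, d) => ({
      toolCallDelta: acc.toolCallDelta + d.toolCallDelta,
      estTokenDelta: acc.estTokenDelta + d.estTokenDelta,
    }),
    { toolCallDelta: 0, estTokenDelta: 0 },
  );
}
```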

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

stainless-code/codemap#139: Extends the deterministic A/B probe harness and log parsing infrastructure from #139 by adding live MCP handler evaluation and log comparison rather than just the original probe-only flow.

Suggested labels

enhancement

Poem

🐰 A harness grows from probe to live,
MCP arms dance, logs compare and sieve,
Dual modes measure what agents can do,
From query to findings, research runs true!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 10.71% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the main changes: adding live MCP arms, log comparison, and an external workflow for agent-eval testing. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (1)

scripts/agent-eval/live-mcp-arm.ts (1)

63-64: ⚡ Quick win

Add a reason when zero-row runs are marked unsuccessful.

Line 63 can set success to false even when result.ok === true, but Line 64 only attaches error for !result.ok, leaving failed runs without diagnostics.

Proposed patch

-  return {
+  const success = result.ok && rows > 0;
+  const error =
+    !result.ok ? result.error : !success ? "agent-eval live: zero rows returned" : undefined;
+
+  return {
     wallMs,
     toolSequence,
     toolCallCount: toolSequence.length,
     resultCount: rows,
     estTokens: estimateProbeTokens(
       prompt,
       liveMcpPayloadChars(tool, callArgs, result),
     ),
-    success: result.ok && rows > 0,
-    ...(!result.ok ? { error: result.error } : {}),
+    success,
+    ...(error !== undefined ? { error } : {}),
   };
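
The rule in the proposed patch (a run fails either because the handler errored or because it returned zero rows, and both cases should carry a reason) can be restated as a small standalone function. This is a sketch under assumptions: `HandlerResult`, `Outcome`, and `summarizeOutcome` are hypothetical names, and the fallback message for a handler failure without an error string is not from the PR.

```typescript
// Hypothetical result shape modelled on the patch above: `ok` plus an optional error.
interface HandlerResult {
  ok: boolean;
  error?: string;
}

interface Outcome {
  success: boolean;
  error?: string;
}

function summarizeOutcome(result: HandlerResult, rows: number): Outcome {
  if (!result.ok) {
    // Handler failure: pass its error through (fallback text is an assumption).
    const error = result.error !== undefined ? result.error : "unknown handler error";
    return { success: false, error };
  }
  if (rows <= 0) {
    // Handler succeeded but returned nothing: still unsuccessful, now with a reason.
    return { success: false, error: "agent-eval live: zero rows returned" };
  }
  return { success: true };
}
```

Keeping the decision in one function means the `success` flag and the `error` field cannot drift apart, which is the gap the review comment points at.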

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/agent-eval/live-mcp-arm.ts` around lines 63 - 64, The success flag is
computed as success: result.ok && rows > 0 but you only attach diagnostics when
!result.ok, so runs with result.ok===true and rows===0 have no error info;
update the object construction near success to include a diagnostic (e.g., an
error or reason field) whenever success is false — reference the variables
result and rows and the existing spread expression (...(!result.ok ? { error:
result.error } : {})) and change it to also emit a reason when rows === 0 (for
example { reason: 'no rows returned' } or similar) so all unsuccessful runs
carry explanatory diagnostics.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/agent-eval-external.yml:
- Around line 38-39: Update the GitHub Actions steps to harden checkout and
artifact upload: for the "Checkout" step (uses: actions/checkout@v4) add
persist-credentials: false to the step inputs and pin the action to the specific
commit SHA instead of the tag; likewise, for the upload step that currently uses
actions/upload-artifact@v4, replace the tag with the action's exact commit SHA
to pin it to a known immutable version. Ensure you edit the steps named
"Checkout" and the upload-artifact step to include the SHA pins and the
persist-credentials: false configuration.
- Around line 46-77: The step "Resolve paths" currently assigns workflow inputs
directly into shell variables (RUNS, FIXTURE, SCEN, PROB) using inline
substitution which lets untrusted workflow_dispatch values be interpreted by the
shell; move these inputs into the step's env: mapping and read them from the
environment (e.g., use env RUNS: ${{ inputs.runs }} and then reference "$RUNS"
inside the script) so that the shell does not perform command substitution
before your validation, and keep the existing validation logic for RUNS,
FIXTURE, SCEN and PROB (and their *_ABS variants) intact.

In `@docs/plans/agent-surface-and-ops.md`:
- Line 29: Update the P1 section copy to use present-tense while PR `#144` is
still open: edit the line containing the parenthetical "(none after `#144` merges
— probe slice already shipped in `#139`)" and change it to present-tense wording
(for example: "(none while `#144` remains open — probe slice shipped in `#139`)") so
the "P1 — Open" heading clearly reflects current status; ensure the change
targets the text under the "P1 — Open" section in agent-surface-and-ops.md
referencing PR `#144`.

In `@docs/roadmap.md`:
- Line 63: Replace the typo phrase "completes PR 9 in `#144`" with "completes P1
in `#144`" so the roadmap uses the correct P1/P2 priority terminology; update the
parenthetical line that currently reads "(none after `#144` merges — probe shipped
in `#139`; live+log completes PR 9 in `#144`)" to use "P1" instead of "PR 9" to keep
priority naming consistent.

In `@scripts/agent-eval/print-comparison-summary.ts`:
- Around line 54-63: The log-mode validation currently checks numeric types for
summary.mcpOnTotalToolCalls, mcpOffTotalToolCalls, mcpOnTotalEstTokens, and
mcpOffTotalEstTokens but omits summary.mcpOnTotalWallMs and
summary.mcpOffTotalWallMs, allowing malformed values to pass; update the mode
=== "log" branch to also verify that if summary.mcpOnTotalWallMs or
summary.mcpOffTotalWallMs are present they are numbers, returning false if not,
and then continue to return row.scenarios.every(isLogScenario); reference the
summary object fields (mcpOnTotalWallMs, mcpOffTotalWallMs) and the existing
isLogScenario check when implementing the additional typeof checks.
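
The log-mode check described in the last comment might look like the sketch below. The field names come from the review comment; everything else is an assumption, including the helper name `isValidLogSummary`, the choice to treat the tool-call and token totals as required, and the omission of the per-scenario `isLogScenario` check.

```typescript
// Hypothetical sketch of log-mode summary validation. Which fields are required
// versus optional is an assumption; only the field names come from the review.
function isValidLogSummary(summary: Record<string, unknown>): boolean {
  const required = [
    "mcpOnTotalToolCalls",
    "mcpOffTotalToolCalls",
    "mcpOnTotalEstTokens",
    "mcpOffTotalEstTokens",
  ];
  const optional = ["mcpOnTotalWallMs", "mcpOffTotalWallMs"];

  for (const key of required) {
    if (typeof summary[key] !== "number") return false;
  }
  for (const key of optional) {
    const value = summary[key];
    // Wall-time totals may be absent, but if present they must be numbers.
    if (value !== undefined && typeof value !== "number") return false;
  }
  return true;
}
```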

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 551c16cc-ccfe-416f-a8e6-ec7c0708ba8f

📥 Commits

Reviewing files that changed from the base of the PR and between e475d20 and 8f07953.

📒 Files selected for processing (24)

.github/CONTRIBUTING.md
.github/workflows/agent-eval-external.yml
.github/workflows/ci.yml
README.md
docs/README.md
docs/agents.md
docs/benchmark.md
docs/golden-queries.md
docs/packaging.md
docs/plans/agent-eval-harness.md
docs/plans/agent-surface-and-ops.md
docs/plans/agent-surface-delivery.md
docs/research/agent-eval-findings-2026-05.md
docs/roadmap.md
fixtures/agent-eval/sample-no-mcp-log.json
scripts/agent-eval/compare-live-logs.ts
scripts/agent-eval/live-mcp-arm.ts
scripts/agent-eval/mcp-allowlist.ts
scripts/agent-eval/metrics.ts
scripts/agent-eval/parse-agent-log.test.ts
scripts/agent-eval/print-comparison-summary.ts
scripts/agent-eval/run-arms.sh
scripts/agent-eval/run-probes.ts
scripts/agent-eval/tool-payload.ts

💤 Files with no reviewable changes (1)

docs/plans/agent-eval-harness.md

SutuSebastian added 8 commits May 26, 2026 13:42

fix(agent-eval): address PR #144 review cycle 1

b41f8eb

Type-safe live MCP arm, workflow path/scenario inputs, doc accuracy (probe vs log modes, probe+live labels), env restore on exit, and tests.

fix(agent-eval): address PR #144 review cycle 2

5a81a1c

Drop redundant initCodemap in live arm, validate log/workflow paths, correct delivery tracker status, golden-queries wording.

fix(agent-eval): address PR #144 review cycle 3

0084a28

Align plan tracker with in-flight #144, validate log paths in compareLogArms, and clarify PR 9 shipped vs partial status across docs.

fix(agent-eval): address PR #144 review cycle 4

5d15fdb

Harden live eval allowlist defaults, workflow runs validation, doc cross-refs, CLI smoke tests, and comparison JSON validation.

fix(agent-eval): address PR #144 review cycle 5

1c3b87f

Validate probes/scenarios paths before read and align live-mode help with ensureLiveEvalMcpToolsEnv (unset or blank).

fix(agent-eval): address PR #144 review cycle 6

c73df44

Surface live MCP handler errors in comparison JSON, validate scenario arm shape in print-comparison-summary, and extend tests.

docs(agent-eval): publish harness layers and sample results

8f07953

Add research note for MCP vs traditional agent findings, eval layers table and pinned fixtures/minimal live numbers in benchmark.md.

chore: add Cursor and VS Code Codemap MCP configs

c6788e5

Project-local MCP wiring for dev (bun src/index.ts) with watch and workspace root; other IDE MCP targets from agents init omitted.

coderabbitai Bot reviewed May 26, 2026

Comment thread .github/workflows/agent-eval-external.yml

Comment thread .github/workflows/agent-eval-external.yml

Comment thread docs/plans/agent-surface-and-ops.md Outdated

Comment thread docs/roadmap.md

Comment thread scripts/agent-eval/print-comparison-summary.ts

SutuSebastian added 4 commits May 26, 2026 14:39

fix(agent-eval): address PR #144 CodeRabbit review (fact-checked)

5a4693d

Present-tense P1 tracker copy, workflow_dispatch inputs via step env, and log-mode wall-ms validation in print-comparison-summary.

fix(agent-eval): address PR #144 review cycle 7

0607402

Run golden setup after index, surface live 0-row failures, preserve averaged arm errors, align doc tables with probe/live vs log modes.

fix(agent-eval): address PR #144 review cycle 8

58ced63

Harden setup paths, multi-run averaging, live allowlist errors in JSON, log compare empty MCP-on guard, summary on partial failure, doc fixes.

fix(agent-eval): address PR #144 review cycle 9

486c4d1

MCP-off zero-result error parity and probe summary before log compare.

SutuSebastian merged commit 1dd422d into main May 26, 2026
1 check passed

SutuSebastian deleted the feat/agent-eval-live-arms branch May 26, 2026 12:25

SutuSebastian mentioned this pull request May 26, 2026

docs: fact-check sweep — align docs and CLI help with shipped behavior #145

Merged

3 tasks

SutuSebastian merged 13 commits into main from feat/agent-eval-live-arms
