feat(agent-eval): live MCP arms, log comparison, and external workflow#144
Conversation
Close P1 agent-surface eval work: live mode dispatches handleQuery/query_recipe with a minimal CODEMAP_MCP_TOOLS allowlist, log mode compares exported session transcripts, and workflow_dispatch runs probe or live arms on external fixtures.
|
|
Warning Review limit reached
More reviews will be available in 31 minutes and 5 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (16)
📝 WalkthroughWalkthroughThis PR extends the agent-eval benchmark harness from deterministic "probe" mode to add live MCP handler evaluation ( ChangesLive MCP evaluation and log comparison harness
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Type-safe live MCP arm, workflow path/scenario inputs, doc accuracy (probe vs log modes, probe+live labels), env restore on exit, and tests.
Drop redundant initCodemap in live arm, validate log/workflow paths, correct delivery tracker status, golden-queries wording.
Align plan tracker with in-flight #144, validate log paths in compareLogArms, and clarify PR 9 shipped vs partial status across docs.
Harden live eval allowlist defaults, workflow runs validation, doc cross-refs, CLI smoke tests, and comparison JSON validation.
Validate probes/scenarios paths before read and align live-mode help with ensureLiveEvalMcpToolsEnv (unset or blank).
Surface live MCP handler errors in comparison JSON, validate scenario arm shape in print-comparison-summary, and extend tests.
Add research note for MCP vs traditional agent findings, eval layers table and pinned fixtures/minimal live numbers in benchmark.md.
Project-local MCP wiring for dev (bun src/index.ts) with watch and workspace root; other IDE MCP targets from agents init omitted.
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
scripts/agent-eval/live-mcp-arm.ts (1)
63-64: ⚡ Quick winAdd a reason when zero-row runs are marked unsuccessful.
Line 63 can set
successtofalseeven whenresult.ok === true, but Line 64 only attacheserrorfor!result.ok, leaving failed runs without diagnostics.Proposed patch
- return { + const success = result.ok && rows > 0; + const error = + !result.ok ? result.error : !success ? "agent-eval live: zero rows returned" : undefined; + + return { wallMs, toolSequence, toolCallCount: toolSequence.length, resultCount: rows, estTokens: estimateProbeTokens( prompt, liveMcpPayloadChars(tool, callArgs, result), ), - success: result.ok && rows > 0, - ...(!result.ok ? { error: result.error } : {}), + success, + ...(error !== undefined ? { error } : {}), };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/agent-eval/live-mcp-arm.ts` around lines 63 - 64, The success flag is computed as success: result.ok && rows > 0 but you only attach diagnostics when !result.ok, so runs with result.ok===true and rows===0 have no error info; update the object construction near success to include a diagnostic (e.g., an error or reason field) whenever success is false — reference the variables result and rows and the existing spread expression (...(!result.ok ? { error: result.error } : {})) and change it to also emit a reason when rows === 0 (for example { reason: 'no rows returned' } or similar) so all unsuccessful runs carry explanatory diagnostics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/agent-eval-external.yml:
- Around line 38-39: Update the GitHub Actions steps to harden checkout and
artifact upload: for the "Checkout" step (uses: actions/checkout@v4) add
persist-credentials: false to the step inputs and pin the action to the specific
commit SHA instead of the tag; likewise, for the upload step that currently uses
actions/upload-artifact@v4, replace the tag with the action's exact commit SHA
to pin it to a known immutable version. Ensure you edit the steps named
"Checkout" and the upload-artifact step to include the SHA pins and the
persist-credentials: false configuration.
- Around line 46-77: The step "Resolve paths" currently assigns workflow inputs
directly into shell variables (RUNS, FIXTURE, SCEN, PROB) using inline
substitution which lets untrusted workflow_dispatch values be interpreted by the
shell; move these inputs into the step's env: mapping and read them from the
environment (e.g., use env RUNS: ${{ inputs.runs }} and then reference "$RUNS"
inside the script) so that the shell does not perform command substitution
before your validation, and keep the existing validation logic for RUNS,
FIXTURE, SCEN and PROB (and their *_ABS variants) intact.
In `@docs/plans/agent-surface-and-ops.md`:
- Line 29: Update the P1 section copy to use present-tense while PR `#144` is
still open: edit the line containing the parenthetical "(none after `#144` merges
— probe slice already shipped in `#139`)" and change it to present-tense wording
(for example: "(none while `#144` remains open — probe slice shipped in `#139`)") so
the "P1 — Open" heading clearly reflects current status; ensure the change
targets the text under the "P1 — Open" section in agent-surface-and-ops.md
referencing PR `#144`.
In `@docs/roadmap.md`:
- Line 63: Replace the typo phrase "completes PR 9 in `#144`" with "completes P1
in `#144`" so the roadmap uses the correct P1/P2 priority terminology; update the
parenthetical line that currently reads "(none after `#144` merges — probe shipped
in `#139`; live+log completes PR 9 in `#144`)" to use "P1" instead of "PR 9" to keep
priority naming consistent.
In `@scripts/agent-eval/print-comparison-summary.ts`:
- Around line 54-63: The log-mode validation currently checks numeric types for
summary.mcpOnTotalToolCalls, mcpOffTotalToolCalls, mcpOnTotalEstTokens, and
mcpOffTotalEstTokens but omits summary.mcpOnTotalWallMs and
summary.mcpOffTotalWallMs, allowing malformed values to pass; update the mode
=== "log" branch to also verify that if summary.mcpOnTotalWallMs or
summary.mcpOffTotalWallMs are present they are numbers, returning false if not,
and then continue to return row.scenarios.every(isLogScenario); reference the
summary object fields (mcpOnTotalWallMs, mcpOffTotalWallMs) and the existing
isLogScenario check when implementing the additional typeof checks.
---
Nitpick comments:
In `@scripts/agent-eval/live-mcp-arm.ts`:
- Around line 63-64: The success flag is computed as success: result.ok && rows
> 0 but you only attach diagnostics when !result.ok, so runs with
result.ok===true and rows===0 have no error info; update the object construction
near success to include a diagnostic (e.g., an error or reason field) whenever
success is false — reference the variables result and rows and the existing
spread expression (...(!result.ok ? { error: result.error } : {})) and change it
to also emit a reason when rows === 0 (for example { reason: 'no rows returned'
} or similar) so all unsuccessful runs carry explanatory diagnostics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 551c16cc-ccfe-416f-a8e6-ec7c0708ba8f
📒 Files selected for processing (24)
.github/CONTRIBUTING.md.github/workflows/agent-eval-external.yml.github/workflows/ci.ymlREADME.mddocs/README.mddocs/agents.mddocs/benchmark.mddocs/golden-queries.mddocs/packaging.mddocs/plans/agent-eval-harness.mddocs/plans/agent-surface-and-ops.mddocs/plans/agent-surface-delivery.mddocs/research/agent-eval-findings-2026-05.mddocs/roadmap.mdfixtures/agent-eval/sample-no-mcp-log.jsonscripts/agent-eval/compare-live-logs.tsscripts/agent-eval/live-mcp-arm.tsscripts/agent-eval/mcp-allowlist.tsscripts/agent-eval/metrics.tsscripts/agent-eval/parse-agent-log.test.tsscripts/agent-eval/print-comparison-summary.tsscripts/agent-eval/run-arms.shscripts/agent-eval/run-probes.tsscripts/agent-eval/tool-payload.ts
💤 Files with no reviewable changes (1)
- docs/plans/agent-eval-harness.md
Present-tense P1 tracker copy, workflow_dispatch inputs via step env, and log-mode wall-ms validation in print-comparison-summary.
Run golden setup after index, surface live 0-row failures, preserve averaged arm errors, align doc tables with probe/live vs log modes.
Harden setup paths, multi-run averaging, live allowlist errors in JSON, log compare empty MCP-on guard, summary on partial failure, doc fixes.
MCP-off zero-result error parity and probe summary before log compare.
Summary
handleQuery/handleQueryRecipewith a minimalCODEMAP_MCP_TOOLS=query,query_recipeallowlist (closer to real MCP than probe-modequeryRows).workflow_dispatchworkflow (.github/workflows/agent-eval-external.yml) for probe or live runs on external indexed fixtures.docs/plans/agent-eval-harness.md(lifted todocs/benchmark.md§ Agent eval harness).Test plan
bun test scripts/agent-eval(36 tests — probe + live smoke)AGENT_EVAL_MODE=live AGENT_EVAL_PRINT_SUMMARY=1 bash scripts/agent-eval/run-arms.sh— 3/3 scenarios ok onfixtures/minimalAGENT_EVAL_LOG_ON/AGENT_EVAL_LOG_OFFsample fixturesSummary by CodeRabbit
Release Notes
New Features
Tests
Documentation