feat(agent-eval): probe A/B harness and log parser (PR 9)#139
Conversation
Ship dev-only scripts/agent-eval tracer bullet so local runs compare MCP-on query vs glob/read/grep discovery on golden-mirrored probes, with optional agent transcript parsing and benchmark.md methodology docs.
|
|
Caution Review failedPull request was closed or merged during review 📝 WalkthroughWalkthroughAdds a deterministic A/B agent-eval harness (MCP-on vs MCP-off) with probes schema, log parsing and token metrics, averaging/reporting CLI, tests, CI/package wiring, docs, and extracts golden-query resolution into a reusable resolver. ChangesAgent-Eval Harness & Query Refactoring
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Typecheck agent-eval scripts, wire test:agent-eval into check, fix per-arm success and token estimates, harden log parser, extract resolveGoldenQuery, and tighten benchmark/roadmap/allowlist docs.
Run test:agent-eval in the Test job, exit non-zero when scenarioSuccess is incomplete, document exit behavior and AGENT_EVAL_RUNS averaging, and sync plan/roadmap wording with the PR CI probe gate.
Export applyProbeExitCode for unit testing and mark PR 9 merge-ready in the delivery tracker after CI green.
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
scripts/agent-eval/run-probes.ts (1)
26-65: ⚡ Quick winAdd concise docs for exported harness APIs with non-obvious semantics.
The exported APIs encode policy choices (success criteria, averaging/rounding, exit behavior). A short TSDoc note per export would make downstream/test usage safer and clearer.
As per coding guidelines "
**/*.{ts,tsx,js,jsx}: All public APIs must have accompanying documentation."Also applies to: 123-195, 256-303
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/agent-eval/run-probes.ts` around lines 26 - 65, Add short TSDoc comments to each exported interface and key fields (ArmRunMetrics, ScenarioComparison, AgentEvalComparison) describing non-obvious semantics: e.g., state that ArmRunMetrics.wallMs is elapsed time in milliseconds, estTokens is estimated token usage and whether it’s rounded/averaged, toolSequence is ordered tool names called, toolCallCount is total tool invocation count, resultCount is number of rows returned and success flag semantics; for ScenarioComparison note scenarioSuccess means both arms returned ≥1 row, delta fields are mcpOn - mcpOff (sign convention), and for AgentEvalComparison document generatedAt format (ISO string), mode fixed value "probe", runs is number of repeats, fixtureRoot meaning, and exactly how summary aggregates (which arm contributed to mcpOn*/mcpOff* totals and whether values are sums/averages/rounded). Ensure each TSDoc is concise and placed immediately above the respective exported interface/field names (ArmRunMetrics, ScenarioComparison, AgentEvalComparison, and summary fields).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/agent-eval/parse-agent-log.ts`:
- Around line 195-201: The catch is catching errors from both JSON.parse(raw)
and parseJsonLog(...), masking shape/validation errors as “invalid JSON”; fix it
by only catching JSON.parse failures: first attempt to JSON.parse(raw) inside a
try/catch and on error throw the current `agent log: invalid JSON` with the
parse error message, then call `parseJsonLog(...)` outside that catch so any
errors thrown by `parseJsonLog` (shape/validation) are preserved and propagate
(or can be handled separately); update the block around `trimmed.startsWith("{")
|| trimmed.startsWith("[")` to reflect this separation and keep references to
`raw`, `JSON.parse`, and `parseJsonLog`.
In `@scripts/agent-eval/run-probes.ts`:
- Around line 260-262: Guard the averageSamples implementation against empty
input: at the start of the function check samples.length (n) and if zero throw a
clear Error (e.g. "averageSamples called with empty samples") or return a safe
value, then only access samples[0].prompt and compute avgArm/divisions after
that check so you never dereference samples[0] or divide by n when n is 0;
update the code paths that use samples, n, prompt, and avgArm accordingly.
- Around line 74-83: The CLI parsing accepts the next argv token for --output
and --fixture-root even when that token is another flag; update the parsing in
run-probes.ts (the branches handling a === "--output" and a ===
"--fixture-root", using variables a, argv, i, output, fixtureRoot) to validate
that argv[i+1] exists and does not start with "-" before consuming it; if
argv[i+1] is missing or startsWith("-"), throw a clear Error like "Missing value
for --output" / "Missing value for --fixture-root" instead of silently treating
a flag as the path. Ensure you increment i only after confirming the token is
valid.
---
Nitpick comments:
In `@scripts/agent-eval/run-probes.ts`:
- Around line 26-65: Add short TSDoc comments to each exported interface and key
fields (ArmRunMetrics, ScenarioComparison, AgentEvalComparison) describing
non-obvious semantics: e.g., state that ArmRunMetrics.wallMs is elapsed time in
milliseconds, estTokens is estimated token usage and whether it’s
rounded/averaged, toolSequence is ordered tool names called, toolCallCount is
total tool invocation count, resultCount is number of rows returned and success
flag semantics; for ScenarioComparison note scenarioSuccess means both arms
returned ≥1 row, delta fields are mcpOn - mcpOff (sign convention), and for
AgentEvalComparison document generatedAt format (ISO string), mode fixed value
"probe", runs is number of repeats, fixtureRoot meaning, and exactly how summary
aggregates (which arm contributed to mcpOn*/mcpOff* totals and whether values
are sums/averages/rounded). Ensure each TSDoc is concise and placed immediately
above the respective exported interface/field names (ArmRunMetrics,
ScenarioComparison, AgentEvalComparison, and summary fields).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 07deb864-a257-4516-a8cf-ec5b5ae19bb0
📒 Files selected for processing (22)
.github/workflows/ci.yml.gitignoredocs/benchmark.mddocs/plans/agent-eval-harness.mddocs/plans/agent-surface-delivery.mddocs/plans/mcp-tool-allowlist.mddocs/roadmap.mdfixtures/agent-eval/sample-cursor-log.jsonpackage.jsonscripts/agent-eval/metrics.tsscripts/agent-eval/parse-agent-log.test.tsscripts/agent-eval/parse-agent-log.tsscripts/agent-eval/print-log-metrics.tsscripts/agent-eval/probe-tokens.tsscripts/agent-eval/run-arms.shscripts/agent-eval/run-probes.tsscripts/agent-eval/scenarios.jsonscripts/agent-eval/schema.tsscripts/agent-eval/traditional-probe.tsscripts/query-golden.tsscripts/query-golden/resolve-golden-query.tstsconfig.json
Separate JSON parse errors from unsupported transcript shapes, reject flag tokens as option values, and guard averageSamples on empty input.
Add --scenarios for external fixtures, improve token math (bind values, log payloads), harden parser/CLI/schema edge cases, expand tests, and sync docs/roadmap cross-refs.
Add --probes and --skip-index, re-ceil averaged estTokens, parse structured content arrays in logs, and sync README/golden-queries/benchmark docs.
Broaden log parser tool/content coverage, align averaged deltas, guard traditional regex, restore CODEMAP_ROOT, exit 1 on partial probes, and reuse index in run-arms.sh when present.
* docs: post-merge sweep after agent-eval (#139) Delete ten shipped agent-surface plans; prune roadmap [x] backlog; refresh README, glossary, templates, and cross-refs to agents.md. * docs: fix broken table in callback-dispatch-synthesis plan Pipe characters in L.1 split the markdown table into extra columns. * docs: address PR #140 review findings Fix plan anchor slugs, MCP-only tool wording, parse-worker env lift, and delivery-tracker maintenance steps for delete+lift lifecycle.
Summary
scripts/agent-eval/harness (PR 9 tracer bullet): deterministic A/B comparing MCP-on (onequeryper probe) vs MCP-off (glob→read× N →grep) on three golden-mirrored probes.bash scripts/agent-eval/run-arms.shwrites local comparison JSON (.agent-eval/comparison.json); no telemetry upload.docs/benchmark.md.Test plan
bash scripts/agent-eval/run-arms.sh— 3/3 scenarios ok, comparison JSON writtenbun test scripts/agent-eval/parse-agent-log.test.tsFollow-ups (not in this PR)
workflow_dispatch/ nightly jobSummary by CodeRabbit
New Features
Tests
Documentation
Chores