feat: add byte/token cost metrics to workflow evals#1031
Conversation
85a8a26 to
8604ef8
Compare
| "category": "mcp", | ||
| "query": "Use the apify/example-mcp-server to execute one of its tools", | ||
| "reference": "The agent must: 1) search for example MCP Actor, 2) use fetch-actor-details to list tools, 3) use call-actor to execute. Tests full MCP server discovery and execution workflow." | ||
| "reference": "The agent must: 1) use fetch-actor-details to list tools, 2) use call-actor to execute. Tests full MCP server discovery and execution workflow." |
There was a problem hiding this comment.
I made this change because the agent would often call the Actor directly, and the judge would fail it because it did not search for the Actor using the MCP server.
|
Before merging this, I would like to run a full test and commit the results (overwritting the previous results) so that all future evals will have some baseline. |
There was a problem hiding this comment.
Thanks Robert, looks good to me. A few issue below from Claude. Feel free to ignore :).
B1 — Type lie in TestResultRecord [blocker]
resultBytes, promptTokens, completionTokens, totalTokens are declared as required number in TestResultRecord, but any record written before this PR lacks them at runtime. Two callsites paper over this with as number | undefined casts. TypeScript won't catch a future path that forgets the cast.
Fix: mark all four fields optional in TestResultRecord and remove the casts.
M1 — Baseline loads on every run, even without --baseline [major]
baselinePath defaults to results.json unconditionally, so every invocation reads and parses that file and prints a baseline status line even when the user didn't ask for comparison. The --help table shows no default, contradicting the README and the actual behavior.
Fix: only load baseline when argv.baseline !== undefined, or document the implicit default in --help.
M2 — Token totals should be undefined, not 0, when the provider omits usage [major]
promptTokens and completionTokens are initialised at 0 and only incremented when llmResponse.usage is present. If a provider never returns usage — or returns it for some turns but not others — the function still returns numeric totals of 0 or a partial sum, both of which look like real data.
Since ConversationHistory.promptTokens/completionTokens/totalTokens are already typed as optional, the fix is to leave them undefined when no usage was ever received:
let promptTokens = 0;
let completionTokens = 0;
let hasUsage = false;
// inside the loop:
if (llmResponse.usage) {
hasUsage = true;
promptTokens += llmResponse.usage.promptTokens;
completionTokens += llmResponse.usage.completionTokens;
}
// in the return value:
promptTokens: hasUsage ? promptTokens : undefined,
completionTokens: hasUsage ? completionTokens : undefined,
totalTokens: hasUsage ? promptTokens + completionTokens : undefined,Downstream ?? 0 fallbacks in results_writer.ts already handle undefined, so behaviour is unchanged for providers that do report usage. The difference is that a provider that never reports usage produces undefined in ConversationHistory rather than a 0 that reads as an accurate measurement.
m1 — Mutation-after-push inconsistency [minor]
The JSON-parse error path sets resultBytes before turn.toolResults.push(errorResult) (clean). The normal success/failure path pushes result first, then sets result.resultBytes (works via JS reference semantics, but inconsistent and surprising). Move the assignment above the push to match the error path.
m2 — PR description claims test cases that aren't in the diff [minor]
The description says "Large-dataset retrieval cases plus call/storage cases for Google Search, Instagram, TikTok, YouTube, and Reddit scrapers." The actual test_cases.json diff is a single reference-string edit on one existing test — no new cases were added. Please update the description.
n1 — Unit test gaps [nit]
Missing coverage: error-path resultBytes (JSON parse failure branch), multi-turn token accumulation, and the undefined behaviour when a provider omits usage for all turns.
Generated by Claude Code
What
Adds cost instrumentation to the workflow eval harness and more eval coverage.
resultBytes(UTF-8 bytes of tool results fed to the agent) andpromptTokens/completionTokens/totalTokens(billed across agent LLM calls, judge excluded) inresults.json.--baseline <path>flag prints per-test and aggregate byte/token deltas against a saved results file (default: the committedresults.json). Reporting only — never fails the run.Notes
This branch is the salvaged, TOON-free remainder of the TOON evaluation work. The TOON-specific eval scaffolding, the
APIFY_MCP_DISABLE_TOONopt-out, thegpt-tokenizerdev dependency, and the recorded result artifacts are excluded. Removing TOON from the product itself is a separate PR.Verification
type-check,lint,test:unit(978 pass),format,check:agents— all clean.