Skip to content

feat: add byte/token cost metrics to workflow evals#1031

Open
RobertCrupa wants to merge 3 commits into
masterfrom
feat/eval-cost-metrics
Open

feat: add byte/token cost metrics to workflow evals#1031
RobertCrupa wants to merge 3 commits into
masterfrom
feat/eval-cost-metrics

Conversation

@RobertCrupa

Copy link
Copy Markdown
Contributor

What

Adds cost instrumentation to the workflow eval harness and more eval coverage.

  • Byte/token metrics per result. Records resultBytes (UTF-8 bytes of tool results fed to the agent) and promptTokens/completionTokens/totalTokens (billed across agent LLM calls, judge excluded) in results.json.
  • Baseline deltas. New --baseline <path> flag prints per-test and aggregate byte/token deltas against a saved results file (default: the committed results.json). Reporting only — never fails the run.
  • More workflow eval cases. Large-dataset retrieval cases plus call/storage cases for Google Search, Instagram, TikTok, YouTube, and Reddit scrapers.
  • Test fix. A test that gives an exact actor name no longer requires the actor-search tool.

Notes

This branch is the salvaged, TOON-free remainder of the TOON evaluation work. The TOON-specific eval scaffolding, the APIFY_MCP_DISABLE_TOON opt-out, the gpt-tokenizer dev dependency, and the recorded result artifacts are excluded. Removing TOON from the product itself is a separate PR.

Verification

type-check, lint, test:unit (978 pass), format, check:agents — all clean.

"category": "mcp",
"query": "Use the apify/example-mcp-server to execute one of its tools",
"reference": "The agent must: 1) search for example MCP Actor, 2) use fetch-actor-details to list tools, 3) use call-actor to execute. Tests full MCP server discovery and execution workflow."
"reference": "The agent must: 1) use fetch-actor-details to list tools, 2) use call-actor to execute. Tests full MCP server discovery and execution workflow."

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change because the agent would often call the Actor directly, and the judge would fail it because it did not search for the Actor using the MCP server.

@RobertCrupa

Copy link
Copy Markdown
Contributor Author

Before merging this, I would like to run a full test and commit the results (overwritting the previous results) so that all future evals will have some baseline.

@jirispilka jirispilka left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Robert, looks good to me. A few issue below from Claude. Feel free to ignore :).


B1 — Type lie in TestResultRecord [blocker]

resultBytes, promptTokens, completionTokens, totalTokens are declared as required number in TestResultRecord, but any record written before this PR lacks them at runtime. Two callsites paper over this with as number | undefined casts. TypeScript won't catch a future path that forgets the cast.

Fix: mark all four fields optional in TestResultRecord and remove the casts.


M1 — Baseline loads on every run, even without --baseline [major]

baselinePath defaults to results.json unconditionally, so every invocation reads and parses that file and prints a baseline status line even when the user didn't ask for comparison. The --help table shows no default, contradicting the README and the actual behavior.

Fix: only load baseline when argv.baseline !== undefined, or document the implicit default in --help.


M2 — Token totals should be undefined, not 0, when the provider omits usage [major]

promptTokens and completionTokens are initialised at 0 and only incremented when llmResponse.usage is present. If a provider never returns usage — or returns it for some turns but not others — the function still returns numeric totals of 0 or a partial sum, both of which look like real data.

Since ConversationHistory.promptTokens/completionTokens/totalTokens are already typed as optional, the fix is to leave them undefined when no usage was ever received:

let promptTokens = 0;
let completionTokens = 0;
let hasUsage = false;

// inside the loop:
if (llmResponse.usage) {
    hasUsage = true;
    promptTokens += llmResponse.usage.promptTokens;
    completionTokens += llmResponse.usage.completionTokens;
}

// in the return value:
promptTokens: hasUsage ? promptTokens : undefined,
completionTokens: hasUsage ? completionTokens : undefined,
totalTokens: hasUsage ? promptTokens + completionTokens : undefined,

Downstream ?? 0 fallbacks in results_writer.ts already handle undefined, so behaviour is unchanged for providers that do report usage. The difference is that a provider that never reports usage produces undefined in ConversationHistory rather than a 0 that reads as an accurate measurement.


m1 — Mutation-after-push inconsistency [minor]

The JSON-parse error path sets resultBytes before turn.toolResults.push(errorResult) (clean). The normal success/failure path pushes result first, then sets result.resultBytes (works via JS reference semantics, but inconsistent and surprising). Move the assignment above the push to match the error path.


m2 — PR description claims test cases that aren't in the diff [minor]

The description says "Large-dataset retrieval cases plus call/storage cases for Google Search, Instagram, TikTok, YouTube, and Reddit scrapers." The actual test_cases.json diff is a single reference-string edit on one existing test — no new cases were added. Please update the description.


n1 — Unit test gaps [nit]

Missing coverage: error-path resultBytes (JSON parse failure branch), multi-turn token accumulation, and the undefined behaviour when a provider omits usage for all turns.


Generated by Claude Code

@jirispilka jirispilka self-requested a review June 26, 2026 23:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants