feat: add byte/token cost metrics to workflow evals by RobertCrupa · Pull Request #1031 · apify/apify-mcp-server

RobertCrupa · 2026-06-26T11:00:08Z

What

Adds cost instrumentation to the workflow eval harness and more eval coverage.

Byte/token metrics per result. Records resultBytes (UTF-8 bytes of tool results fed to the agent) and promptTokens/completionTokens/totalTokens (billed across agent LLM calls, judge excluded) in results.json.
Baseline deltas. New --baseline <path> flag prints per-test and aggregate byte/token deltas against a saved results file (default: the committed results.json). Reporting only — never fails the run.
More workflow eval cases. Large-dataset retrieval cases plus call/storage cases for Google Search, Instagram, TikTok, YouTube, and Reddit scrapers.
Test fix. A test that gives an exact actor name no longer requires the actor-search tool.

Notes

This branch is the salvaged, TOON-free remainder of the TOON evaluation work. The TOON-specific eval scaffolding, the APIFY_MCP_DISABLE_TOON opt-out, the gpt-tokenizer dev dependency, and the recorded result artifacts are excluded. Removing TOON from the product itself is a separate PR.

Verification

type-check, lint, test:unit (978 pass), format, check:agents — all clean.

… tool

RobertCrupa · 2026-06-26T11:56:11Z

      "category": "mcp",
      "query": "Use the apify/example-mcp-server to execute one of its tools",
-      "reference": "The agent must: 1) search for example MCP Actor, 2) use fetch-actor-details to list tools, 3) use call-actor to execute. Tests full MCP server discovery and execution workflow."
+      "reference": "The agent must: 1) use fetch-actor-details to list tools, 2) use call-actor to execute. Tests full MCP server discovery and execution workflow."


I made this change because the agent would often call the Actor directly, and the judge would fail it because it did not search for the Actor using the MCP server.

RobertCrupa · 2026-06-26T12:16:18Z

Before merging this, I would like to run a full test and commit the results (overwritting the previous results) so that all future evals will have some baseline.

jirispilka

Thanks Robert, looks good to me. A few issue below from Claude. Feel free to ignore :).

B1 — Type lie in `TestResultRecord` [blocker]

resultBytes, promptTokens, completionTokens, totalTokens are declared as required number in TestResultRecord, but any record written before this PR lacks them at runtime. Two callsites paper over this with as number | undefined casts. TypeScript won't catch a future path that forgets the cast.

Fix: mark all four fields optional in TestResultRecord and remove the casts.

M1 — Baseline loads on every run, even without `--baseline` [major]

baselinePath defaults to results.json unconditionally, so every invocation reads and parses that file and prints a baseline status line even when the user didn't ask for comparison. The --help table shows no default, contradicting the README and the actual behavior.

Fix: only load baseline when argv.baseline !== undefined, or document the implicit default in --help.

M2 — Token totals should be `undefined`, not `0`, when the provider omits usage [major]

promptTokens and completionTokens are initialised at 0 and only incremented when llmResponse.usage is present. If a provider never returns usage — or returns it for some turns but not others — the function still returns numeric totals of 0 or a partial sum, both of which look like real data.

Since ConversationHistory.promptTokens/completionTokens/totalTokens are already typed as optional, the fix is to leave them undefined when no usage was ever received:

let promptTokens = 0;
let completionTokens = 0;
let hasUsage = false;

// inside the loop:
if (llmResponse.usage) {
    hasUsage = true;
    promptTokens += llmResponse.usage.promptTokens;
    completionTokens += llmResponse.usage.completionTokens;
}

// in the return value:
promptTokens: hasUsage ? promptTokens : undefined,
completionTokens: hasUsage ? completionTokens : undefined,
totalTokens: hasUsage ? promptTokens + completionTokens : undefined,

Downstream ?? 0 fallbacks in results_writer.ts already handle undefined, so behaviour is unchanged for providers that do report usage. The difference is that a provider that never reports usage produces undefined in ConversationHistory rather than a 0 that reads as an accurate measurement.

m1 — Mutation-after-push inconsistency [minor]

The JSON-parse error path sets resultBytes before turn.toolResults.push(errorResult) (clean). The normal success/failure path pushes result first, then sets result.resultBytes (works via JS reference semantics, but inconsistent and surprising). Move the assignment above the push to match the error path.

m2 — PR description claims test cases that aren't in the diff [minor]

The description says "Large-dataset retrieval cases plus call/storage cases for Google Search, Instagram, TikTok, YouTube, and Reddit scrapers." The actual test_cases.json diff is a single reference-string edit on one existing test — no new cases were added. Please update the description.

n1 — Unit test gaps [nit]

Missing coverage: error-path resultBytes (JSON parse failure branch), multi-turn token accumulation, and the undefined behaviour when a provider omits usage for all turns.

Generated by Claude Code

RobertCrupa added 3 commits June 26, 2026 11:55

fix(test): test giving exact actor name does not require actor search…

737edee

… tool

feat(tests): added resulting bytes as a metric in evals

f953893

feat: add byte/token to eval results.

8604ef8

RobertCrupa mentioned this pull request Jun 26, 2026

feat: add large-dataset and scraper workflow eval cases #1033

Open

RobertCrupa force-pushed the feat/eval-cost-metrics branch from 85a8a26 to 8604ef8 Compare June 26, 2026 11:13

RobertCrupa requested review from MQ37 and jirispilka June 26, 2026 11:25

RobertCrupa commented Jun 26, 2026

View reviewed changes

jirispilka requested changes Jun 26, 2026

View reviewed changes

jirispilka self-requested a review June 26, 2026 23:24

jirispilka approved these changes Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add byte/token cost metrics to workflow evals#1031

feat: add byte/token cost metrics to workflow evals#1031
RobertCrupa wants to merge 3 commits into
masterfrom
feat/eval-cost-metrics

RobertCrupa commented Jun 26, 2026

Uh oh!

RobertCrupa Jun 26, 2026

Uh oh!

RobertCrupa commented Jun 26, 2026

Uh oh!

jirispilka left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

RobertCrupa commented Jun 26, 2026

What

Notes

Verification

Uh oh!

RobertCrupa Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

RobertCrupa commented Jun 26, 2026

Uh oh!

jirispilka left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

B1 — Type lie in TestResultRecord [blocker]

M1 — Baseline loads on every run, even without --baseline [major]

M2 — Token totals should be undefined, not 0, when the provider omits usage [major]

m1 — Mutation-after-push inconsistency [minor]

m2 — PR description claims test cases that aren't in the diff [minor]

n1 — Unit test gaps [nit]

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jirispilka left a comment •

edited

Loading

B1 — Type lie in `TestResultRecord` [blocker]

M1 — Baseline loads on every run, even without `--baseline` [major]

M2 — Token totals should be `undefined`, not `0`, when the provider omits usage [major]