feat(evaluation): final-answer output with trace artifacts by christso · Pull Request #1364 · EntityProcess/agentv

christso · 2026-06-12T05:32:04Z

Closes #1362.

Accepted contract

interface EvaluationCaseResult {
  output: string; // final answer / scored result only
  trace: Trace;   // canonical normalized execution record
}

output is the final answer/scored result only, never a transcript or message array.
trace is AgentV's canonical execution record: normalized messages, tool calls/results, errors, timing, usage/cost, provider/session provenance, replay/eval metrics, and transcript export data.
Provider-native session IDs live under trace.metadata.provider_session_id.
Canonical per-test artifacts are outputs/answer.md and outputs/transcript.jsonl.
outputs/response.md / response_path remain as deprecated, compatibility-only aliases; no canonical outputs/transcript.md is introduced.
Index/wire paths are snake_case: output_path / answer_path → outputs/answer.md, transcript_path → outputs/transcript.jsonl.
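
The path mapping above can be pictured as a small per-test index record. This is a minimal sketch, not AgentV's actual code: `ArtifactIndexEntry` and `buildArtifactIndexEntry` are hypothetical names, and only the snake_case keys and the relative file paths come from the contract.

```typescript
// Hypothetical shape for one per-test index entry. Only the snake_case keys
// and the relative file paths are taken from the contract above.
interface ArtifactIndexEntry {
  output_path: string;     // canonical final answer
  answer_path: string;     // points at the same file as output_path
  transcript_path: string; // trace-derived transcript
  response_path: string;   // deprecated, compatibility-only
}

// Illustrative helper (an assumption, not an AgentV API): joins a test
// directory with the canonical artifact names.
function buildArtifactIndexEntry(testDir: string): ArtifactIndexEntry {
  const base = testDir.replace(/\/+$/, ""); // drop trailing slashes
  return {
    output_path: `${base}/outputs/answer.md`,
    answer_path: `${base}/outputs/answer.md`,
    transcript_path: `${base}/outputs/transcript.jsonl`,
    response_path: `${base}/outputs/response.md`,
  };
}
```

Both `output_path` and `answer_path` resolve to the same file in this sketch, which matches the mapping stated in the contract.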

What changed

Added/renamed the canonical Trace / Trace* model and conversion helpers, keeping deprecated NormalizedTrace* aliases where compatibility is needed.
Updated evaluation results, grader context, and SDK schemas so scored output is a string final answer while messages/trace carry transcript context.
Updated code-grader stdin to include final output/answer plus transcript-aware messages, trace, and trace_summary.
Updated executable prompt-template stdin to include final output/answer plus transcript-aware messages and full trace.
Updated artifact/index writing to emit outputs/answer.md and trace-derived outputs/transcript.jsonl, with response.md kept only as a compatibility alias.
Removed repo-local .ntm tracker artifacts and removed the duplicate pi-sdk-openai target block.
CI stabilization commit 3d9064c4 refreshed stale core/eval/CLI fixtures that still expected output-as-message-array.
Reviewer P1 fix commit 34dc821f prevents definePromptTemplate stdin validation failures by sending context.candidate as output/answer and the transcript under messages/trace.
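
The stdin changes described above might look roughly like the following. This is a sketch under stated assumptions: `Message`, `TraceSummary`, and `buildGraderStdin` are illustrative names, and the fields inside `trace_summary` are invented for the example, since the PR names the key but not its contents.

```typescript
// Assumed minimal shapes; the real Trace model carries more fields.
interface Message {
  role: string;
  content: string;
}

interface Trace {
  messages: Message[];
  metadata?: { provider_session_id?: string };
}

// Hypothetical summary shape; the PR does not list trace_summary's fields.
interface TraceSummary {
  message_count: number;
}

interface GraderStdin {
  output: string; // final answer only
  answer: string; // same final answer
  messages: Message[];
  trace: Trace;
  trace_summary: TraceSummary;
}

// Illustrative builder (not an AgentV API). Rejecting non-string output
// mirrors the bug class the P1 fix addressed: a message array sent as output.
function buildGraderStdin(finalAnswer: unknown, trace: Trace): GraderStdin {
  if (typeof finalAnswer !== "string") {
    throw new TypeError("output must be a final-answer string, not a transcript");
  }
  return {
    output: finalAnswer,
    answer: finalAnswer,
    messages: trace.messages,
    trace,
    trace_summary: { message_count: trace.messages.length },
  };
}
```

Validating at the boundary keeps a transcript-shaped value from reaching a grader or prompt-template script that expects a plain string.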

Red/green evidence

RED: GitHub Actions Test job on run 27396803445 failed with 16 tests that still expected transcript-shaped output or missing trace behavior.
GREEN: Local verification after 3d9064c4 passed the formerly failing core/eval/CLI suites and the full repo test command.
Review RED: Prompt-template executable payload still sent output as a message array, which would fail @agentv/eval PromptTemplateInputSchema validation for definePromptTemplate scripts.
Review GREEN: New regression test runs an executable prompt through the @agentv/eval prompt-template runtime and verifies output/answer are final-answer strings while transcript data is available in messages and trace.messages.
Contract checks covered: code-grader fixtures parse/use final output text; conversation-mode tests assert final answer in output and full transcript in trace.messages; artifact tests assert answer.md/transcript.jsonl canonical outputs.
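
The contract checks listed above could be expressed as a small validator plus a transcript serializer. This is only a sketch: `checkFinalAnswerContract` and `toTranscriptJsonl` are hypothetical helpers, the message shape is assumed, and the real record layout of transcript.jsonl is not specified in this PR.

```typescript
// Assumed minimal shapes for illustration only.
interface TranscriptMessage {
  role: string;
  content: string;
}

interface CaseResultLike {
  output: string;
  trace: { messages: TranscriptMessage[] };
}

// Hypothetical serializer: one JSON object per line, in message order.
function toTranscriptJsonl(result: CaseResultLike): string {
  return result.trace.messages.map((m) => JSON.stringify(m)).join("\n");
}

// Returns a list of contract violations; an empty list means the result
// has a string final answer and an array-shaped transcript.
function checkFinalAnswerContract(result: unknown): string[] {
  const problems: string[] = [];
  const r = result as { output?: unknown; trace?: { messages?: unknown } } | null;
  if (typeof r?.output !== "string") problems.push("output must be a string");
  if (!Array.isArray(r?.trace?.messages)) problems.push("trace.messages must be an array");
  return problems;
}
```

A result whose `output` is still a message array would fail the first check, which is the shape the stale fixtures expected before 3d9064c4.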

Verification

bun --filter @agentv/core test test/evaluation/graders/prompt-resolution.test.ts test/evaluation/orchestrator.test.ts — 104 pass, 0 fail
bun --filter @agentv/eval test test/define-prompt-template.test.ts — 10 pass, 0 fail
bun run typecheck — pass
bun run lint — pass
bun run test — core/eval/phoenix/cli/dashboard all pass (@agentv/core: 1848 pass; @agentv/eval: 70 pass; agentv: 568 pass)
bun run build — pass
bun run validate:examples — 58 valid, 0 invalid
git diff --check — pass

Remaining status

No known test failures or blockers remain. Commit 34dc821fbdb0c95cfa1f98f5bf6dccbdae58192d is pushed to feat/1362; the PR is marked ready for review. The GitHub Actions run for this head is green across Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, and Cloudflare Pages.

cloudflare-workers-and-pages · 2026-06-12T05:32:41Z

Deploying agentv with Cloudflare Pages

Latest commit:	`34dc821`
Status:	✅ Deploy successful!
Preview URL:	https://d315bbf9.agentv.pages.dev
Branch Preview URL:	https://feat-1362.agentv.pages.dev


christso added 4 commits June 12, 2026 07:24

feat(trace): add canonical evaluation trace model

06a69e8

feat(evaluation): score final output with full trace

31e25b0

feat(cli): write answer and transcript artifacts

0822982

chore: remove repo-local ntm artifacts

1fdb9a2

christso added 2 commits June 12, 2026 07:33

chore(targets): remove duplicate pi sdk openai target

e685449

fix(evaluation): stabilize final output trace contract

3d9064c

christso marked this pull request as ready for review June 12, 2026 07:54

fix(evaluation): pass final output to prompt templates

34dc821

christso merged commit 4d0defc into main Jun 12, 2026
8 checks passed

christso deleted the feat/1362 branch June 12, 2026 10:01

christso merged 7 commits into main from feat/1362
