feat(evaluation): final-answer output with trace artifacts #1364

Merged
christso merged 7 commits into main from feat/1362
Jun 12, 2026

@christso christso commented Jun 12, 2026

Collaborator

Closes #1362.

Accepted contract

interface EvaluationCaseResult {
  output: string; // final answer / scored result only
  trace: Trace;   // canonical normalized execution record
}
  • output is the final answer/scored result only, never a transcript or message array.
  • trace is AgentV's canonical execution record: normalized messages, tool calls/results, errors, timing, usage/cost, provider/session provenance, replay/eval metrics, and transcript export data.
  • Provider-native session IDs live under trace.metadata.provider_session_id.
  • Canonical per-test artifacts are outputs/answer.md and outputs/transcript.jsonl.
  • outputs/response.md / response_path remain only as deprecated compatibility aliases; no canonical outputs/transcript.md is introduced.
  • Index/wire paths are snake_case: output_path / answer_path → outputs/answer.md, transcript_path → outputs/transcript.jsonl.
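
To make the contract concrete, here is a minimal sketch of a result value and a runtime guard that rejects transcript-shaped output. Only `output`, `trace`, and `trace.metadata.provider_session_id` come from the description above; the simplified `Trace` fields, the `isFinalAnswerResult` helper, and the sample values are assumptions for illustration, not AgentV's actual type definitions.

```typescript
// Hypothetical, simplified shapes. The real Trace model carries more
// (tool calls/results, errors, timing, usage/cost, provenance, metrics).
interface TraceMessage {
  role: string;
  content: string;
}

interface Trace {
  messages: TraceMessage[];
  metadata: { provider_session_id?: string };
}

interface EvaluationCaseResult {
  output: string; // final answer / scored result only
  trace: Trace;   // canonical normalized execution record
}

// Illustrative guard: accepts only results whose output is a plain string,
// so the legacy output-as-message-array shape is rejected.
function isFinalAnswerResult(value: unknown): value is EvaluationCaseResult {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as { output?: unknown; trace?: unknown };
  return (
    typeof candidate.output === "string" &&
    typeof candidate.trace === "object" &&
    candidate.trace !== null
  );
}

const result: EvaluationCaseResult = {
  output: "42",
  trace: {
    messages: [
      { role: "user", content: "What is 6 * 7?" },
      { role: "assistant", content: "42" },
    ],
    metadata: { provider_session_id: "session-123" }, // made-up ID
  },
};

// Legacy shape: the transcript sits in `output`, which the contract forbids.
const legacy = { output: result.trace.messages, trace: result.trace };
```

Under these assumptions, `isFinalAnswerResult(result)` holds while `isFinalAnswerResult(legacy)` does not, which is the distinction the stale fixtures were tripping over.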

What changed

  • Added/renamed the canonical Trace / Trace* model and conversion helpers, keeping deprecated NormalizedTrace* aliases where compatibility is needed.
  • Updated evaluation results, grader context, and SDK schemas so scored output is a string final answer while messages/trace carry transcript context.
  • Updated code-grader stdin to include final output/answer plus transcript-aware messages, trace, and trace_summary.
  • Updated executable prompt-template stdin to include final output/answer plus transcript-aware messages and full trace.
  • Updated artifact/index writing to emit outputs/answer.md and trace-derived outputs/transcript.jsonl, with response.md kept only as a compatibility alias.
  • Removed repo-local .ntm tracker artifacts and removed the duplicate pi-sdk-openai target block.
  • CI stabilization commit 3d9064c4 refreshed stale core/eval/CLI fixtures that still expected output-as-message-array.
  • Reviewer P1 fix commit 34dc821f prevents definePromptTemplate stdin validation failures by sending context.candidate as output/answer and the transcript under messages/trace.
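
The P1 fix above can be sketched roughly as follows: the final answer from `context.candidate` is sent as both `output` and `answer`, and the transcript travels under `messages` and `trace`. The type names, the `buildPromptTemplateStdin` helper, and the sample data are hypothetical and stand in for the real `@agentv/eval` schema, which this sketch does not reproduce.

```typescript
// Hypothetical shapes for illustration only.
interface Message {
  role: string;
  content: string;
}

interface Trace {
  messages: Message[];
}

interface PromptContext {
  candidate: string; // final answer produced by the target
  trace: Trace;      // full execution record, including the transcript
}

interface PromptTemplateStdin {
  output: string;      // final answer, never a message array
  answer: string;      // same final answer under the alternate key
  messages: Message[]; // transcript-aware context
  trace: Trace;        // full trace
}

// Assumed helper: maps the evaluation context onto the stdin payload.
function buildPromptTemplateStdin(context: PromptContext): PromptTemplateStdin {
  return {
    output: context.candidate,
    answer: context.candidate,
    messages: context.trace.messages,
    trace: context.trace,
  };
}

const payload = buildPromptTemplateStdin({
  candidate: "Paris",
  trace: {
    messages: [
      { role: "user", content: "Capital of France?" },
      { role: "assistant", content: "Paris" },
    ],
  },
});

// The executable script would receive this JSON text on stdin.
const stdinText = JSON.stringify(payload);
```

A code-grader payload would presumably look similar with an added `trace_summary` field, whose shape is not specified here.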

Red/green evidence

  • RED: GitHub Actions Test job on run 27396803445 failed with 16 tests that still expected transcript-shaped output or missing trace behavior.
  • GREEN: Local verification after 3d9064c4 passed the formerly failing core/eval/CLI suites and the full repo test command.
  • Review RED: Prompt-template executable payload still sent output as a message array, which would fail @agentv/eval PromptTemplateInputSchema validation for definePromptTemplate scripts.
  • Review GREEN: New regression test runs an executable prompt through the @agentv/eval prompt-template runtime and verifies output/answer are final-answer strings while transcript data is available in messages and trace.messages.
  • Contract checks covered: code-grader fixtures parse/use final output text; conversation-mode tests assert final answer in output and full transcript in trace.messages; artifact tests assert answer.md/transcript.jsonl canonical outputs.
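
As a rough illustration of what the artifact checks assert, the sketch below derives the two canonical per-test artifacts and the snake_case index paths from a result. The `renderArtifacts` helper and the one-message-per-line layout of transcript.jsonl are assumptions; the actual record format written by AgentV is not described in this PR.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface Message {
  role: string;
  content: string;
}

interface CaseResult {
  output: string;
  trace: { messages: Message[] };
}

const ANSWER_PATH = "outputs/answer.md";
const TRANSCRIPT_PATH = "outputs/transcript.jsonl";

interface RenderedArtifacts {
  files: Record<string, string>; // relative path -> file contents
  index: { output_path: string; answer_path: string; transcript_path: string };
}

// Assumed helper: answer.md holds the final answer only, while
// transcript.jsonl is derived from the trace, one JSON object per line.
function renderArtifacts(result: CaseResult): RenderedArtifacts {
  const transcript = result.trace.messages
    .map((message) => JSON.stringify(message) + "\n")
    .join("");
  return {
    files: {
      [ANSWER_PATH]: result.output,
      [TRANSCRIPT_PATH]: transcript,
    },
    index: {
      output_path: ANSWER_PATH,
      answer_path: ANSWER_PATH,
      transcript_path: TRANSCRIPT_PATH,
    },
  };
}

const artifacts = renderArtifacts({
  output: "Paris",
  trace: {
    messages: [
      { role: "user", content: "Capital of France?" },
      { role: "assistant", content: "Paris" },
    ],
  },
});
```

The sketch returns file contents instead of writing to disk so that it stays self-contained; a real writer would also decide how to emit the deprecated response.md alias.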

Verification

  • bun --filter @agentv/core test test/evaluation/graders/prompt-resolution.test.ts test/evaluation/orchestrator.test.ts — 104 pass, 0 fail
  • bun --filter @agentv/eval test test/define-prompt-template.test.ts — 10 pass, 0 fail
  • bun run typecheck — pass
  • bun run lint — pass
  • bun run test — core/eval/phoenix/cli/dashboard all pass (@agentv/core: 1848 pass; @agentv/eval: 70 pass; agentv: 568 pass)
  • bun run build — pass
  • bun run validate:examples — 58 valid, 0 invalid
  • git diff --check — pass

Remaining status

No known test failures or blockers remain. Commit 34dc821fbdb0c95cfa1f98f5bf6dccbdae58192d is pushed to feat/1362; the PR is marked ready for review. The GitHub Actions run for this head is green across Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, and Cloudflare Pages.

cloudflare-workers-and-pages Bot commented Jun 12, 2026


Deploying agentv with Cloudflare Pages

Latest commit: 34dc821
Status: ✅  Deploy successful!
Preview URL: https://d315bbf9.agentv.pages.dev
Branch Preview URL: https://feat-1362.agentv.pages.dev

@christso christso marked this pull request as ready for review June 12, 2026 07:54
@christso christso merged commit 4d0defc into main Jun 12, 2026
8 checks passed
@christso christso deleted the feat/1362 branch June 12, 2026 10:01

Development

Successfully merging this pull request may close these issues.

Schema: final-answer output with trace-derived transcript artifacts