feat(eval): validate trace (behavioral) expectations (expect.trace)#16

Open
kunalkushwaha wants to merge 1 commit into main from feat/eval-trace-assertions

@kunalkushwaha

Member

What

Implements the previously stubbed expect.trace assertions in the eval runner. Today the runner contains only a literal // TODO: Validate trace expectations comment, and the fully typed TraceExpectation struct goes unused. This change wires it up so eval tests can check how an agent reached its answer, not just the output text.

This is feature B2 (behavioral assertions) from the FEATURES.md roadmap.

How it works

After the content match passes, the runner fetches the run's trace from the EvalServer (GET /traces/{id}) and evaluates the assertions. Tool calls also use the tools_called field from the /invoke response, so tool_calls is still checked even if the trace fetch fails.

tests:
  - name: "Answers about Paris using search, efficiently"
    input: "What's the weather in Paris?"
    expect:
      type: contains
      values: ["Paris"]
      trace:
        tool_calls: ["search"]                  # each listed tool must have been called
        llm_calls: 2                             # exact LLM-call count
        execution_path: ["research", "format"]   # ordered subsequence of span names
        min_steps: 2
        max_steps: 8

| Field | Check |
| --- | --- |
| tool_calls | every listed tool must appear (subset) |
| llm_calls | exact match when > 0 |
| execution_path | listed names appear in order (gaps allowed) |
| min_steps / max_steps | observed step count within bounds |
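
The checks in the table can be sketched roughly as follows. This is an illustrative reconstruction, not the code in internal/eval/trace_validator.go: the struct fields, the lowercase validateTrace name, the failure messages, and the choice to skip min_steps / max_steps when they are 0 are all assumptions made for the example.

```go
package main

import "fmt"

// TraceExpectation mirrors the YAML fields above. The canonical struct lives
// in internal/eval/types.go and may differ; this shape is an assumption.
type TraceExpectation struct {
	ToolCalls     []string
	LLMCalls      int
	ExecutionPath []string
	MinSteps      int
	MaxSteps      int
}

// ObservedTrace is a hypothetical normalized view of what the run did.
type ObservedTrace struct {
	Tools    []string // distinct tool names that were called
	LLMCalls int
	Path     []string // span names in execution order
	Steps    int
}

// isOrderedSubsequence reports whether every name in want appears in got,
// in the same relative order, with arbitrary gaps allowed.
func isOrderedSubsequence(want, got []string) bool {
	i := 0
	for _, name := range got {
		if i < len(want) && name == want[i] {
			i++
		}
	}
	return i == len(want)
}

// validateTrace returns one message per failed check; an empty result means
// the observed trace satisfies the expectation.
func validateTrace(exp TraceExpectation, obs ObservedTrace) []string {
	var failures []string

	called := make(map[string]bool, len(obs.Tools))
	for _, t := range obs.Tools {
		called[t] = true
	}
	for _, tool := range exp.ToolCalls { // subset check
		if !called[tool] {
			failures = append(failures, fmt.Sprintf("tool %q was not called", tool))
		}
	}
	if exp.LLMCalls > 0 && obs.LLMCalls != exp.LLMCalls { // exact match when > 0
		failures = append(failures, fmt.Sprintf("llm_calls: want %d, got %d", exp.LLMCalls, obs.LLMCalls))
	}
	if !isOrderedSubsequence(exp.ExecutionPath, obs.Path) {
		failures = append(failures, fmt.Sprintf("execution_path %v not found in order in %v", exp.ExecutionPath, obs.Path))
	}
	// Assumption: a bound of 0 means "not set" and is skipped.
	if exp.MinSteps > 0 && obs.Steps < exp.MinSteps {
		failures = append(failures, fmt.Sprintf("steps %d below min_steps %d", obs.Steps, exp.MinSteps))
	}
	if exp.MaxSteps > 0 && obs.Steps > exp.MaxSteps {
		failures = append(failures, fmt.Sprintf("steps %d above max_steps %d", obs.Steps, exp.MaxSteps))
	}
	return failures
}

func main() {
	exp := TraceExpectation{
		ToolCalls:     []string{"search"},
		LLMCalls:      2,
		ExecutionPath: []string{"research", "format"},
		MinSteps:      2,
		MaxSteps:      8,
	}
	obs := ObservedTrace{
		Tools:    []string{"search"},
		LLMCalls: 2,
		Path:     []string{"research", "search", "format"},
		Steps:    3,
	}
	fmt.Println("failures:", validateTrace(exp, obs))
}
```

Returning a slice of messages rather than the first error is one possible design; it lets a single test report every violated expectation at once.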

Changes

  • internal/eval/trace_validator.go: ValidateTrace (pure), the buildObservedTrace normalizer, and minimal decode types.
  • internal/eval/http_target.go: FetchTrace(traceID) against GET /traces/{id}.
  • internal/eval/runner.go: wires validation in after the content match (replaces the TODO).
  • docs/EVAL.md: new "Trace (Behavioral) Assertions" section documenting the real expect.trace schema.
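
To illustrate the normalizer idea, here is a minimal sketch of how a trace and the /invoke tools_called list might be merged into one observed view. The Span type, its Kind values ("llm", "tool"), the rule that each span counts as one step, and the buildObserved name are all hypothetical; the real buildObservedTrace and EvalServer trace format may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a hypothetical, simplified trace span used only for this sketch.
type Span struct {
	Name string // span name, e.g. "research"
	Kind string // assumed values: "llm", "tool", or anything else
}

// ObservedTrace is the normalized view that the assertions are checked against.
type ObservedTrace struct {
	Tools    []string // distinct tool names, sorted for determinism
	LLMCalls int
	Path     []string // span names in execution order
	Steps    int
}

// buildObserved merges tool names from the invoke response with whatever the
// trace contains. Passing nil spans models a failed trace fetch: tool names
// from toolsCalled are still available, while the other fields stay zero.
func buildObserved(spans []Span, toolsCalled []string) ObservedTrace {
	seen := make(map[string]bool)
	for _, t := range toolsCalled {
		seen[t] = true
	}

	var obs ObservedTrace
	for _, s := range spans {
		obs.Path = append(obs.Path, s.Name)
		switch s.Kind {
		case "llm":
			obs.LLMCalls++
		case "tool":
			seen[s.Name] = true
		}
	}
	obs.Steps = len(spans) // assumption: one span equals one step

	for t := range seen {
		obs.Tools = append(obs.Tools, t)
	}
	sort.Strings(obs.Tools)
	return obs
}

func main() {
	spans := []Span{
		{Name: "research", Kind: "llm"},
		{Name: "search", Kind: "tool"},
		{Name: "format", Kind: "llm"},
	}
	fmt.Printf("%+v\n", buildObserved(spans, []string{"search"}))

	// Trace fetch failed: only the invoke response is available.
	fmt.Printf("%+v\n", buildObserved(nil, []string{"search", "search"}))
}
```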

Testing

  • go build, go vet, go test ./..., gofmt all green.
  • First unit tests for the eval package: ValidateTrace (11 cases), buildObservedTrace (with/without a trace), isOrderedSubsequence, and an httptest-backed FetchTrace (success + error paths) — so the behavior is verified without a live LLM/EvalServer.
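
For readers unfamiliar with the httptest-backed approach, a rough sketch of fetching a trace from a fake server is shown here. Only the GET /traces/{id} route comes from this PR; the traceDoc JSON shape, the fetchTrace and newFakeEvalServer helpers, and the "run-1" ID are assumptions made for the example and do not reflect the actual HTTPTarget.FetchTrace signature.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
)

// traceDoc is a hypothetical minimal decode type; the real payload may differ.
type traceDoc struct {
	ID    string   `json:"id"`
	Spans []string `json:"spans"`
}

// fetchTrace issues GET {baseURL}/traces/{id} and decodes the JSON body.
// Any non-200 status is reported as an error.
func fetchTrace(client *http.Client, baseURL, traceID string) (*traceDoc, error) {
	resp, err := client.Get(baseURL + "/traces/" + url.PathEscape(traceID))
	if err != nil {
		return nil, fmt.Errorf("fetch trace %q: %w", traceID, err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetch trace %q: unexpected status %d", traceID, resp.StatusCode)
	}
	var doc traceDoc
	if err := json.NewDecoder(resp.Body).Decode(&doc); err != nil {
		return nil, fmt.Errorf("decode trace %q: %w", traceID, err)
	}
	return &doc, nil
}

// newFakeEvalServer serves a single known trace ("run-1") and returns 404
// for everything else, so both success and error paths can be exercised.
func newFakeEvalServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/traces/", func(w http.ResponseWriter, r *http.Request) {
		id := strings.TrimPrefix(r.URL.Path, "/traces/")
		if r.Method != http.MethodGet || id != "run-1" {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(traceDoc{ID: id, Spans: []string{"research", "format"}})
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newFakeEvalServer()
	defer srv.Close()

	doc, err := fetchTrace(srv.Client(), srv.URL, "run-1")
	if err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Println("spans:", doc.Spans)

	_, err = fetchTrace(srv.Client(), srv.URL, "missing")
	fmt.Println("missing trace error:", err)
}
```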

Independent branch off main (alongside #13/#14/#15). The canonical expectation schema lives in internal/eval/types.go.

🤖 Generated with Claude Code

Implements the previously stubbed `expect.trace` assertions in the eval runner,
so tests can check *how* an agent produced an answer, not just the content.

- ValidateTrace: pure validator for tool_calls (subset), llm_calls (exact),
  execution_path (ordered subsequence), and min/max_steps.
- buildObservedTrace: normalizes the EvalServer trace + invoke `tools_called`
  into an ObservedTrace (distinct tools, LLM-call count, path, step count).
- HTTPTarget.FetchTrace: fetches a run's trace from GET /traces/{id}.
- Runner: after the content match, fetches the trace and validates it; tool
  calls fall back to the invoke response when the trace can't be fetched.
- Docs: new "Trace (Behavioral) Assertions" section in docs/EVAL.md.

Adds the eval package's first unit tests (validator, normalizer, and an
httptest-backed FetchTrace), so this is verified without a live LLM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>