Draft Proposal: Agent Session Result Layer by elronbandel · Pull Request #1 · elronbandel/every_eval_ever

elronbandel · 2026-03-17T16:00:07Z

Draft Proposal: Agent Session Result Layer

This PR proposes extending Every Eval Ever with session-level reporting for agentic evaluations. It is a starting point for discussion, not a finished specification.

Full context and motivation: *What Agent Evaluation Teams Don’t Tell You*

The gap

EEE already captures:

aggregate score outcomes
agentic setup (agentic_eval_config, eval_limits, sandbox)
instance-level traces (interaction_type, messages, tool_calls)

What is missing is standardized session-level semantics: how the run ended, which side failed, how much interaction occurred, and what system was actually evaluated. These are part of the evaluation result, not just diagnostics.

Proposed extensions (all optional under `evaluation_results[]`)

session_result

status: success | unsuccessful | unfinished | error | cancelled | limit_reached
is_finished: boolean
finish_accepted: boolean
stop_reason: agent_done | timeout | max_steps | error | cancelled | benchmark_policy
error_attribution: agent | benchmark | external | unknown
error_detail: string
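
To make the field list above concrete, here is a minimal sketch of how a consumer might check a `session_result` block. The enum values are copied from the list above; the function name `validate_session_result` and the example record are hypothetical and not part of the EEE schema.

```python
from typing import Any

# Allowed values copied from the proposed field list.
STATUS = {"success", "unsuccessful", "unfinished", "error", "cancelled", "limit_reached"}
STOP_REASON = {"agent_done", "timeout", "max_steps", "error", "cancelled", "benchmark_policy"}
ERROR_ATTRIBUTION = {"agent", "benchmark", "external", "unknown"}


def validate_session_result(sr: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the block looks well-formed.

    Hypothetical helper: every field is treated as optional, as in the proposal.
    """
    problems = []
    enums = {
        "status": STATUS,
        "stop_reason": STOP_REASON,
        "error_attribution": ERROR_ATTRIBUTION,
    }
    for field, allowed in enums.items():
        if field in sr and sr[field] not in allowed:
            problems.append(f"{field}: unexpected value {sr[field]!r}")
    for field in ("is_finished", "finish_accepted"):
        if field in sr and not isinstance(sr[field], bool):
            problems.append(f"{field}: expected boolean")
    if "error_detail" in sr and not isinstance(sr["error_detail"], str):
        problems.append("error_detail: expected string")
    return problems


# Illustrative record: the run hit the step limit before the agent finished.
example = {
    "status": "limit_reached",
    "is_finished": False,
    "finish_accepted": False,
    "stop_reason": "max_steps",
    "error_attribution": "unknown",
}
print(validate_session_result(example))
```

Keeping `status` and `stop_reason` as separate fields lets a reader distinguish the outcome from the reason the session ended.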

session_accounting

step_count, action_count, invalid_action_count, parallel_action_max: integer
time_to_first_action, wall_clock_seconds, agent_cost, benchmark_cost: number
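
As one possible reading of the integer counters above, the sketch below derives them from a recorded trace. The trace shape (a list of steps, each with an `actions` list whose items carry a `valid` flag) is an assumption made for illustration, as is the interpretation of `parallel_action_max` as the largest number of actions issued within a single step.

```python
def summarize_session(steps):
    """Derive the proposed integer counters from a hypothetical trace.

    Assumed trace shape: [{"actions": [{"valid": bool}, ...]}, ...]
    """
    actions_per_step = [len(step["actions"]) for step in steps]
    return {
        "step_count": len(steps),
        "action_count": sum(actions_per_step),
        "invalid_action_count": sum(
            1 for step in steps for action in step["actions"] if not action["valid"]
        ),
        # Assumption: "parallel" means actions issued within the same step.
        "parallel_action_max": max(actions_per_step, default=0),
    }


trace = [
    {"actions": [{"valid": True}]},
    {"actions": [{"valid": True}, {"valid": False}, {"valid": True}]},
    {"actions": []},
]
print(summarize_session(trace))
```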

agent_system / benchmark_system

White-box description of the evaluated agent (models, tools, subagents, memory) and of the benchmark side (runtime, grader, protocol)

eval_conditions

internet_access, memory_exposure, reset_policy, permissions, repeated_runs, seed

robustness (optional / emerging)

method, num_variants, variance_metric, variance_value
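
A minimal sketch of how a `robustness` block could be filled in, assuming the same task is scored once per variant and the spread is reported as a sample standard deviation. The method label `prompt_paraphrase`, the metric name `stdev`, and the scores are hypothetical; the proposal does not yet fix any of these values.

```python
import statistics


def robustness_block(scores, method):
    """Build a robustness block from per-variant scores.

    Uses the sample standard deviation, which needs at least two scores.
    """
    return {
        "method": method,
        "num_variants": len(scores),
        "variance_metric": "stdev",
        "variance_value": statistics.stdev(scores),
    }


# Illustrative scores from three variants of one task.
block = robustness_block([0.6, 0.7, 0.8], method="prompt_paraphrase")
print(block)
```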

Backward compatibility

No existing fields are removed or changed
All new fields are optional
Existing records remain valid
Model-only evaluations are unchanged
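
The compatibility claims above can be illustrated with a small reader that treats the new block as optional. This is a sketch, not the EEE tooling: the helper name `session_statuses` and both records are hypothetical, and the only schema detail assumed is that `session_result` sits under each entry of `evaluation_results[]`.

```python
def session_statuses(record):
    """Return the session status of each evaluation result, or None when absent."""
    return [
        result.get("session_result", {}).get("status")
        for result in record.get("evaluation_results", [])
    ]


# A record written before the proposal: no session-level fields at all.
old_record = {"evaluation_results": [{"score": 0.42}]}

# A record that adopts the optional session_result block.
new_record = {
    "evaluation_results": [
        {"score": 0.42, "session_result": {"status": "success"}}
    ]
}

print(session_statuses(old_record))
print(session_statuses(new_record))
```

A reader written this way accepts both records, so adopting the new fields does not force existing producers to change.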

Suggested rollout

Add fields as optional in schema
Encourage early adoption of session_result and session_accounting
Keep agent_system/benchmark_system (composition), eval_conditions, and robustness optional while conventions converge

Discussion welcome

Feedback is especially welcome on:

field naming
schema placement/hierarchy
priority order for adoption

This proposal grew out of building the Open General Agent Leaderboard and surveying eight evaluation systems. More detail is in the linked post above.

Start discussion on agent session result layer

6872be4

elronbandel closed this Mar 17, 2026

elronbandel wants to merge 1 commit into main from feature/session-result-layer
