Draft Proposal: Agent Session Result Layer #1

Closed
elronbandel wants to merge 1 commit into main from feature/session-result-layer

@elronbandel commented Mar 17, 2026

Draft Proposal: Agent Session Result Layer

This PR proposes extending Every Eval Ever with session-level reporting for agentic evaluations. It is a starting point for discussion, not a finished specification.

Full context and motivation: *What Agent Evaluation Teams Don’t Tell You*

The gap

EEE already captures:

  • aggregate score outcomes
  • agentic setup (agentic_eval_config, eval_limits, sandbox)
  • instance-level traces (interaction_type, messages, tool_calls)

What is missing is standardized session-level semantics: how the run ended, which side failed, how much interaction occurred, and what system was actually evaluated. These are part of the evaluation result, not just diagnostics.

Proposed extensions (all optional under evaluation_results[])

session_result

  • status: success | unsuccessful | unfinished | error | cancelled | limit_reached
  • is_finished: boolean
  • finish_accepted: boolean
  • stop_reason: agent_done | timeout | max_steps | error | cancelled | benchmark_policy
  • error_attribution: agent | benchmark | external | unknown
  • error_detail: string
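To make the enumerations above concrete, here is a minimal validation sketch. It assumes a `session_result` arrives as a plain dict and that every field may be absent; the function name `validate_session_result` and the wording of the messages are illustrative, not part of the proposal.

```python
# Allowed values copied from the field list above.
ALLOWED_VALUES = {
    "status": {"success", "unsuccessful", "unfinished", "error",
               "cancelled", "limit_reached"},
    "stop_reason": {"agent_done", "timeout", "max_steps", "error",
                    "cancelled", "benchmark_policy"},
    "error_attribution": {"agent", "benchmark", "external", "unknown"},
}
BOOLEAN_FIELDS = ("is_finished", "finish_accepted")


def validate_session_result(session_result: dict) -> list[str]:
    """Return a list of problems; an empty list means the dict looks valid.

    Hypothetical helper: every field is treated as optional, so only
    fields that are present get checked.
    """
    problems = []
    for field, allowed in ALLOWED_VALUES.items():
        if field in session_result and session_result[field] not in allowed:
            problems.append(
                f"{field}: unexpected value {session_result[field]!r}"
            )
    for field in BOOLEAN_FIELDS:
        if field in session_result and not isinstance(session_result[field], bool):
            problems.append(f"{field}: expected a boolean")
    detail = session_result.get("error_detail")
    if detail is not None and not isinstance(detail, str):
        problems.append("error_detail: expected a string")
    return problems


# Example: a run that was cut off by a time limit.
timed_out = {
    "status": "limit_reached",
    "is_finished": False,
    "finish_accepted": False,
    "stop_reason": "timeout",
}
```

Calling `validate_session_result(timed_out)` returns an empty list, while a dict with a `status` outside the enumeration yields one message per offending field.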

session_accounting

  • step_count, action_count, invalid_action_count, parallel_action_max: integer
  • time_to_first_action, wall_clock_seconds, agent_cost, benchmark_cost: number
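As a rough illustration of how the integer counters might be derived, the sketch below summarizes a list of steps. The step shape (a dict with an `actions` list whose items carry a `valid` flag) is an assumption made for this example only; the proposal does not define what counts as a step or an action, and real traces may be structured differently.

```python
def summarize_steps(steps: list[dict]) -> dict:
    """Derive the integer session_accounting counters from a step list.

    Assumed (hypothetical) input shape:
        [{"actions": [{"valid": True}, {"valid": False}]}, ...]
    """
    actions_per_step = [len(step["actions"]) for step in steps]
    invalid = sum(
        1
        for step in steps
        for action in step["actions"]
        if not action["valid"]
    )
    return {
        "step_count": len(steps),
        "action_count": sum(actions_per_step),
        "invalid_action_count": invalid,
        # Largest number of actions issued within a single step.
        "parallel_action_max": max(actions_per_step, default=0),
    }


example_steps = [
    {"actions": [{"valid": True}, {"valid": False}]},  # two actions in parallel
    {"actions": [{"valid": True}]},
    {"actions": []},                                   # a step with no action
]
accounting = summarize_steps(example_steps)
```

The number-valued fields (`time_to_first_action`, `wall_clock_seconds`, `agent_cost`, `benchmark_cost`) are left out here because they depend on timing and billing data that the sketch does not model.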

agent_system / benchmark_system

  • White-box description of the evaluated agent (models, tools, subagents, memory) and benchmark-side runtime/grader/protocol

eval_conditions

  • internet_access, memory_exposure, reset_policy, permissions, repeated_runs, seed
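One possible use of `eval_conditions` is checking whether two results were produced under comparable conditions before putting them side by side. The sketch below is only an illustration: the value types (booleans, strings) are assumed, and excluding `seed` from the comparison is a choice made for this example, not something the proposal prescribes.

```python
# Keys taken from the list above; "seed" is deliberately excluded on the
# assumption that differing seeds alone do not make two runs incomparable.
COMPARED_KEYS = (
    "internet_access",
    "memory_exposure",
    "reset_policy",
    "permissions",
    "repeated_runs",
)


def differing_conditions(a: dict, b: dict) -> list[str]:
    """Return the eval_conditions keys on which two records disagree.

    Hypothetical helper: a key missing from both records counts as equal.
    """
    return [key for key in COMPARED_KEYS if a.get(key) != b.get(key)]


# Illustrative values only; the proposal does not fix the value types.
run_a = {"internet_access": True, "reset_policy": "per_task", "seed": 1}
run_b = {"internet_access": False, "reset_policy": "per_task", "seed": 2}
mismatches = differing_conditions(run_a, run_b)
```

Here `mismatches` contains only `internet_access`, since the two runs share a reset policy and differ otherwise only in the seed.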

robustness (optional / emerging)

  • method, num_variants, variance_metric, variance_value
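A small sketch of how the four robustness fields could be filled in from per-variant scores. It assumes one scalar score per variant and uses the population standard deviation as the variance metric; both the metric and the label `"population_stddev"` are choices made for this example, since the proposal leaves them open.

```python
import statistics


def robustness_summary(variant_scores: list[float], method: str) -> dict:
    """Build a robustness entry from one score per perturbed variant.

    Hypothetical helper; `method` is a free-form label for how the
    variants were produced.
    """
    if len(variant_scores) < 2:
        raise ValueError("need at least two variants to measure spread")
    return {
        "method": method,
        "num_variants": len(variant_scores),
        "variance_metric": "population_stddev",
        "variance_value": statistics.pstdev(variant_scores),
    }


# Illustrative scores, not real results.
summary = robustness_summary([0.0, 1.0], method="prompt_paraphrase")
```

Recording the metric name alongside the value keeps the number interpretable if different teams settle on different spread measures.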

Backward compatibility

  • No existing fields are removed or changed
  • All new fields are optional
  • Existing records remain valid
  • Model-only evaluations are unchanged
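In practice, the compatibility claims above mean a consumer can treat the new blocks as optional lookups. The sketch below shows one way a reader might handle records with and without `session_result`; the `score` key in the legacy record is a placeholder, not an actual EEE field name.

```python
def describe_outcome(evaluation_result: dict) -> str:
    """Summarize how a session ended, tolerating records without the new fields.

    Hypothetical helper operating on one entry of evaluation_results[].
    """
    session = evaluation_result.get("session_result")
    if session is None:
        # Existing or model-only records: nothing session-level to report.
        return "no session-level data"
    status = session.get("status", "unknown")
    stop_reason = session.get("stop_reason", "unknown")
    return f"{status} (stop_reason: {stop_reason})"


# "score" is a placeholder key for whatever the existing record carries.
legacy_record = {"score": 0.42}
extended_record = {
    "score": 0.42,
    "session_result": {"status": "limit_reached", "stop_reason": "max_steps"},
}

legacy_summary = describe_outcome(legacy_record)
extended_summary = describe_outcome(extended_record)
```

Because every lookup falls back to a default, the same reader works on old records, partially populated records, and fully populated ones.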

Suggested rollout

  • Add fields as optional in schema
  • Encourage early adoption of session_result and session_accounting
  • Keep composition (agent_system / benchmark_system), eval_conditions, and robustness optional while conventions converge

Discussion welcome

Feedback is especially welcome on:

  • field naming
  • schema placement/hierarchy
  • priority order for adoption

This proposal grew out of building the Open General Agent Leaderboard and surveying eight evaluation systems. More detail is in the linked post above.
