Skip to content

feat(aggregators): add general aggregator for evaluation reports#230

Open
venkatkrish543re wants to merge 2 commits into
strands-agents:mainfrom
venkatkrish543re:feature/aggregators
Open

feat(aggregators): add general aggregator for evaluation reports#230
venkatkrish543re wants to merge 2 commits into
strands-agents:mainfrom
venkatkrish543re:feature/aggregators

Conversation

@venkatkrish543re
Copy link
Copy Markdown

@venkatkrish543re venkatkrish543re commented May 15, 2026

Description

Adds a general aggregator over EvaluationReport lists. Groups results by (case_key, evaluator_name), computes descriptive stats. The three overridable hooks(_group_key, _filter_entry, _build_result) for project specific subclasses

LLM summary is optional via a configurable system_prompt. JSON serialization and terminal display included.

Related Issues

#114

Documentation PR

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agent-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Adds a general aggregator over EvaluationReport lists. Groups results by (case_key, evaluator_name), filters corrupted trials, computes descriptive stats and optional efficiency rollup, and runs paired statistics (Wilcoxon, paired-t, McNemar) when exactly two conditions are detected with trial_idx pairing.

LLM summary is optional via a configurable system_prompt. JSON serialization and terminal display included.
Comment thread src/strands_evals/aggregators/base.py Outdated
"condition_label": str, # condition this trial belongs to;
# required for paired comparisons
"trial_idx": int, # pairing key across conditions
"corrupted": bool, # excluded from stats when True
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is supposed to be only about general case aggregator. This "corrupt" concept and the trajectory_pointers don't belong to this PR.

Comment thread src/strands_evals/aggregators/base.py Outdated
Comment thread src/strands_evals/aggregators/base.py Outdated
@venkatkrish543re venkatkrish543re marked this pull request as ready for review May 15, 2026 23:19
Comment thread src/strands_evals/aggregators/base.py Outdated
Comment thread src/strands_evals/aggregators/base.py Outdated
Comment thread src/strands_evals/aggregators/base.py Outdated
Comment thread src/strands_evals/aggregators/base.py Outdated
Comment thread src/strands_evals/aggregators/base.py Outdated
@venkatkrish543re venkatkrish543re marked this pull request as draft May 19, 2026 14:32
… rollup, raw_values, trajectory_pointers, and corruption filtering from the base. Exposed _group_key / _filter_entry / _build_result as overridable hooks so project-specific behavior can layer on top without reimplementing core grouping.
@venkatkrish543re venkatkrish543re marked this pull request as ready for review May 19, 2026 19:03

import logging
from collections import defaultdict
from typing import Any, Optional, cast
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

are dropped.
"""
if not entry["case_key"]:
logger.debug("Skipping entry with no case_key or name")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: The logging call lacks structured fields. Per STYLE_GUIDE.md, logs should have context fields first, followed by a human-readable message after a pipe:

logger.debug("field1=<%s>, field2=<%s> | human readable message", field1, field2)

Suggestion:

logger.debug("case_key=<%s>, name=<%s> | skipping entry with no case_key or name", entry.get("case_key"), entry.get("name", ""))

self,
name: str = "Aggregator",
model: Optional[Any] = None,
system_prompt: Optional[str] = None,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: The model parameter uses Optional[Any] which is too permissive. The established pattern in this repo (see evaluators/evaluator.py) uses Model | str | None with from strands.models.model import Model.

Suggestion:

from strands.models.model import Model

def __init__(
    self,
    name: str = "Aggregator",
    model: Model | str | None = None,
    system_prompt: str | None = None,
):

This gives users IDE autocompletion and catches type errors early.

"metadata": metadata,
}

def _group_key(self, entry: dict) -> Optional[tuple]:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: _group_key returns Optional[tuple] with an unparameterized tuple. Since subclasses are expected to override this and the return value is used as a dictionary key, specifying the type more precisely would improve type safety. Also, Optional is imported from typing but other modules in this repo use typing_extensions or the X | None syntax.

Suggestion:

def _group_key(self, entry: dict) -> tuple[str, ...] | None:

)
result = agent(prompt, structured_output_model=_Summary)
return cast(_Summary, result.structured_output).summary
except Exception as e:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: _invoke_summary_model catches all exceptions with a broad except Exception and silently degrades to returning an empty string. While graceful degradation is appropriate, swallowing all exceptions (including KeyboardInterrupt subclass scenarios, credential errors, or configuration bugs) can hide real problems from users.

Suggestion: Consider catching more specific exceptions (e.g., the known Strands SDK exceptions) or at minimum logging at error level rather than warning when the failure is unexpected. Also, the log message should use structured fields:

logger.warning("model=<%s>, error=<%s> | llm summarization failed", self.model, e)


render_display(self)

def to_file(self, path: str) -> None:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: The to_file and from_file methods accept str but not Path. The existing EvaluationReport.to_file has the same signature, but accepting str | Path would be more Pythonic and consistent with common library patterns.

Suggestion:

def to_file(self, path: str | Path) -> None:
@classmethod
def from_file(cls, path: str | Path) -> "AggregationReport":

@github-actions
Copy link
Copy Markdown

Issue: The PR description links to issue #114 (Chaos/Resiliency Evaluation) which appears unrelated to aggregation functionality. This makes it harder to trace the motivation and requirements for this feature.

Suggestion: Please link to the correct issue that tracks the aggregator feature, or create one if it doesn't exist. This helps reviewers understand the acceptance criteria and future readers understand the context.


return grouped

def _make_entry(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: The _group_results method has a return type annotation of dict[tuple, list[dict]], and _build_result takes entries: list[dict]. All internal data flows through untyped dict objects. This makes it harder for subclass authors to understand what keys are available in an entry.

Suggestion: Consider defining a TypedDict for the entry shape (or at minimum, document the expected keys in a docstring). This is particularly important since _make_entry, _group_key, _filter_entry, and _build_result are all designed as extension points:

class _Entry(TypedDict):
    case_key: str
    evaluator_name: str
    score: float
    passed: bool
    reason: str
    metadata: dict

@github-actions
Copy link
Copy Markdown

Issue: This PR introduces a new public API (Aggregator, AggregationReport, AggregationResult) that customers will use directly. Per the API Bar Raising guidelines, the PR description should include:

  • Expected use cases for the feature
  • Example code snippets demonstrating usage
  • Complete API signatures with default parameter values
  • Module exports

The PR also lacks the needs-api-review label, which is required for new public classes/abstractions.

Suggestion:

  1. Add the needs-api-review label
  2. Update the PR description with a usage example:
from strands_evals.aggregators import Aggregator, AggregationReport

agg = Aggregator(name="my-experiment")
report: AggregationReport = agg.aggregate(evaluation_reports)
report.run_display()
report.to_file("results.json")
  1. Document the public API surface (what's exported, what's customizable)

@github-actions
Copy link
Copy Markdown

Review Summary

Assessment: Request Changes

The aggregator concept is well-designed with clean extension points (_group_key, _filter_entry, _build_result). The overall architecture follows the existing project patterns for report types and display. However, there are several issues that need to be addressed before merging.

Review Categories
  • Project structure: Tests are in the wrong directory (tests/aggregators/ vs tests/strands_evals/aggregators/), and the module isn't exported from the top-level __init__.py
  • API bar raising: This introduces a new public API surface but lacks the needs-api-review label and the PR description doesn't include usage examples or API signatures per the bar-raising process
  • Type safety: The model parameter uses Optional[Any] instead of the established Model | str | None pattern, and internal data flows through untyped dict objects making the extension points harder to use correctly
  • Style compliance: Logging calls don't follow the structured field format defined in STYLE_GUIDE.md
  • Linked issue: References [FEATURE] Chaos/Resiliency Evaluation of Agents #114 (chaos testing) which appears unrelated to aggregation

The extensibility design with the three hooks is thoughtful and the test coverage is solid for the current scope.

"""General-purpose aggregator over evaluation reports.

The default ``Aggregator`` groups per-trial results by
``(case_key, evaluator_name)`` and computes descriptive statistics
Copy link
Copy Markdown
Collaborator

@ybdarrenwang ybdarrenwang May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think grouping by evaluator_name makes sense because, if you think of the 2 dimensions of case vs. evaluator in an experiment, since the existing experiment report basically groups by case (summarizing all evaluation results w.r.t. 1 case), it makes sense that this new aggregator summarizes from the other dimension which is evaluator_name. And that's why in _group_results there's a flatten (breaking down reports by case X evaluator) → regrouping (by evaluator) happening.

On the other hand though, despite I do see why you introduced metadata["case_key"] which tries to provide an additional grouping id besides case name, how's the user experience going to be? Your current implementation:

  • prevents grouping different case name if case_key is not assigned (because case_key = case name by default);
  • when use wants to mix match aggregate different subsets of cases, they have to assign new case_key every time.

Can users not simply manage the grouping by case name? That is:

cases = [
    Case(name="foo", ...),
    Case(name="bar", ...),
    Case(name="baz", ...)
]

reports = Experiment(cases=cases, ...).run_evaluations(...)

aggregated_report_1 = Aggregator(reports, cases=["foo", "bar"])
aggregated_report_2 = Aggregator(reports, cases=["bar", "baz"])

# Per-group aggregation
# ------------------------------------------------------------------

def _build_result(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I want to verify: when a subclass overrides _build_result, does it receive the raw list of trial entries as its second argument, or just the pre-computed summary stats (mean, min, max, pass_rate)?

If it gets the raw entries, we're fine. Skill Eval needs the raw per-pair scores to compute a bootstrap 95% CI on the win-rate, and Chaos Testing likely needs the raw failure events to count specific failure modes. Neither of those can be reconstructed from the summary stats.

If _build_result only sees the summary stats, every project subclass would have to re-read the original EvaluationReport list themselves, which defeats the whole purpose of grouping. In that case I'd ask for the second argument to be the raw entries list (with the summary stats available as a third argument or as a method on self).

A code snippet showing what we'd want to write in our Skill Eval subclass:

def _build_result(self, group_key, entries):
    result = super()._build_result(group_key, entries)
    scores = [e["score"] for e in entries]   # raw per-pair scores
    result.bootstrap_ci_low, result.bootstrap_ci_high = bootstrap_winrate(scores)
    return result

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants