feat(aggregators): add general aggregator for evaluation reports by venkatkrish543re · Pull Request #230 · strands-agents/evals

venkatkrish543re · 2026-05-15T19:51:42Z

Description

Adds a general aggregator over EvaluationReport lists. Groups results by (case_key, evaluator_name), computes descriptive stats. The three overridable hooks(_group_key, _filter_entry, _build_result) for project specific subclasses

LLM summary is optional via a configurable system_prompt. JSON serialization and terminal display included.

Related Issues

#114

Documentation PR

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agent-tools, agents-cli

I ran hatch run prepare

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Adds a general aggregator over EvaluationReport lists. Groups results by (case_key, evaluator_name), filters corrupted trials, computes descriptive stats and optional efficiency rollup, and runs paired statistics (Wilcoxon, paired-t, McNemar) when exactly two conditions are detected with trial_idx pairing. LLM summary is optional via a configurable system_prompt. JSON serialization and terminal display included.

ybdarrenwang · 2026-05-15T22:37:13Z

+        "condition_label": str,     # condition this trial belongs to;
+                                    # required for paired comparisons
+        "trial_idx": int,           # pairing key across conditions
+        "corrupted": bool,          # excluded from stats when True


This PR is supposed to be only about general case aggregator. This "corrupt" concept and the trajectory_pointers don't belong to this PR.

… rollup, raw_values, trajectory_pointers, and corruption filtering from the base. Exposed _group_key / _filter_entry / _build_result as overridable hooks so project-specific behavior can layer on top without reimplementing core grouping.

poshinchen · 2026-05-19T19:24:04Z

+
+import logging
+from collections import defaultdict
+from typing import Any, Optional, cast


github-actions · 2026-05-19T19:30:09Z

+        are dropped.
+        """
+        if not entry["case_key"]:
+            logger.debug("Skipping entry with no case_key or name")


Issue: The logging call lacks structured fields. Per STYLE_GUIDE.md, logs should have context fields first, followed by a human-readable message after a pipe:

logger.debug("field1=<%s>, field2=<%s> | human readable message", field1, field2)

Suggestion:

logger.debug("case_key=<%s>, name=<%s> | skipping entry with no case_key or name", entry.get("case_key"), entry.get("name", ""))

github-actions · 2026-05-19T19:30:11Z

+        self,
+        name: str = "Aggregator",
+        model: Optional[Any] = None,
+        system_prompt: Optional[str] = None,


Issue: The model parameter uses Optional[Any] which is too permissive. The established pattern in this repo (see evaluators/evaluator.py) uses Model | str | None with from strands.models.model import Model.

Suggestion:

from strands.models.model import Model def __init__( self, name: str = "Aggregator", model: Model | str | None = None, system_prompt: str | None = None, ):

This gives users IDE autocompletion and catches type errors early.

github-actions · 2026-05-19T19:30:13Z

+            "metadata": metadata,
+        }
+
+    def _group_key(self, entry: dict) -> Optional[tuple]:


Issue: _group_key returns Optional[tuple] with an unparameterized tuple. Since subclasses are expected to override this and the return value is used as a dictionary key, specifying the type more precisely would improve type safety. Also, Optional is imported from typing but other modules in this repo use typing_extensions or the X | None syntax.

Suggestion:

def _group_key(self, entry: dict) -> tuple[str, ...] | None:

github-actions · 2026-05-19T19:30:15Z

+            )
+            result = agent(prompt, structured_output_model=_Summary)
+            return cast(_Summary, result.structured_output).summary
+        except Exception as e:


Issue: _invoke_summary_model catches all exceptions with a broad except Exception and silently degrades to returning an empty string. While graceful degradation is appropriate, swallowing all exceptions (including KeyboardInterrupt subclass scenarios, credential errors, or configuration bugs) can hide real problems from users.

Suggestion: Consider catching more specific exceptions (e.g., the known Strands SDK exceptions) or at minimum logging at error level rather than warning when the failure is unexpected. Also, the log message should use structured fields:

logger.warning("model=<%s>, error=<%s> | llm summarization failed", self.model, e)

github-actions · 2026-05-19T19:30:17Z

+
+        render_display(self)
+
+    def to_file(self, path: str) -> None:


Issue: The to_file and from_file methods accept str but not Path. The existing EvaluationReport.to_file has the same signature, but accepting str | Path would be more Pythonic and consistent with common library patterns.

Suggestion:

def to_file(self, path: str | Path) -> None:

@classmethod def from_file(cls, path: str | Path) -> "AggregationReport":

github-actions · 2026-05-19T19:30:19Z

Issue: The PR description links to issue #114 (Chaos/Resiliency Evaluation) which appears unrelated to aggregation functionality. This makes it harder to trace the motivation and requirements for this feature.

Suggestion: Please link to the correct issue that tracks the aggregator feature, or create one if it doesn't exist. This helps reviewers understand the acceptance criteria and future readers understand the context.

github-actions · 2026-05-19T19:30:20Z

+
+        return grouped
+
+    def _make_entry(


Issue: The _group_results method has a return type annotation of dict[tuple, list[dict]], and _build_result takes entries: list[dict]. All internal data flows through untyped dict objects. This makes it harder for subclass authors to understand what keys are available in an entry.

Suggestion: Consider defining a TypedDict for the entry shape (or at minimum, document the expected keys in a docstring). This is particularly important since _make_entry, _group_key, _filter_entry, and _build_result are all designed as extension points:

class _Entry(TypedDict): case_key: str evaluator_name: str score: float passed: bool reason: str metadata: dict

github-actions · 2026-05-19T19:30:22Z

Issue: This PR introduces a new public API (Aggregator, AggregationReport, AggregationResult) that customers will use directly. Per the API Bar Raising guidelines, the PR description should include:

Expected use cases for the feature
Example code snippets demonstrating usage
Complete API signatures with default parameter values
Module exports

The PR also lacks the needs-api-review label, which is required for new public classes/abstractions.

Suggestion:

Add the needs-api-review label
Update the PR description with a usage example:

from strands_evals.aggregators import Aggregator, AggregationReport

agg = Aggregator(name="my-experiment")
report: AggregationReport = agg.aggregate(evaluation_reports)
report.run_display()
report.to_file("results.json")

Document the public API surface (what's exported, what's customizable)

github-actions · 2026-05-19T19:30:22Z

Review Summary

Assessment: Request Changes

The aggregator concept is well-designed with clean extension points (_group_key, _filter_entry, _build_result). The overall architecture follows the existing project patterns for report types and display. However, there are several issues that need to be addressed before merging.

Review Categories

Project structure: Tests are in the wrong directory (tests/aggregators/ vs tests/strands_evals/aggregators/), and the module isn't exported from the top-level __init__.py
API bar raising: This introduces a new public API surface but lacks the needs-api-review label and the PR description doesn't include usage examples or API signatures per the bar-raising process
Type safety: The model parameter uses Optional[Any] instead of the established Model | str | None pattern, and internal data flows through untyped dict objects making the extension points harder to use correctly
Style compliance: Logging calls don't follow the structured field format defined in STYLE_GUIDE.md
Linked issue: References [FEATURE] Chaos/Resiliency Evaluation of Agents #114 (chaos testing) which appears unrelated to aggregation

The extensibility design with the three hooks is thoughtful and the test coverage is solid for the current scope.

ybdarrenwang · 2026-05-19T21:18:10Z

+"""General-purpose aggregator over evaluation reports.
+
+The default ``Aggregator`` groups per-trial results by
+``(case_key, evaluator_name)`` and computes descriptive statistics


I think grouping by evaluator_name makes sense because, if you think of the 2 dimensions of case vs. evaluator in an experiment, since the existing experiment report basically groups by case (summarizing all evaluation results w.r.t. 1 case), it makes sense that this new aggregator summarizes from the other dimension which is evaluator_name. And that's why in _group_results there's a flatten (breaking down reports by case X evaluator) → regrouping (by evaluator) happening.

On the other hand though, despite I do see why you introduced metadata["case_key"] which tries to provide an additional grouping id besides case name, how's the user experience going to be? Your current implementation:

prevents grouping different case name if case_key is not assigned (because case_key = case name by default);

when use wants to mix match aggregate different subsets of cases, they have to assign new case_key every time.

Can users not simply manage the grouping by case name? That is:

cases = [ Case(name="foo", ...), Case(name="bar", ...), Case(name="baz", ...) ] reports = Experiment(cases=cases, ...).run_evaluations(...) aggregated_report_1 = Aggregator(reports, cases=["foo", "bar"]) aggregated_report_2 = Aggregator(reports, cases=["bar", "baz"])

sangminwoo · 2026-05-21T00:09:38Z

+    # Per-group aggregation
+    # ------------------------------------------------------------------
+
+    def _build_result(


What I want to verify: when a subclass overrides _build_result, does it receive the raw list of trial entries as its second argument, or just the pre-computed summary stats (mean, min, max, pass_rate)?

If it gets the raw entries, we're fine. Skill Eval needs the raw per-pair scores to compute a bootstrap 95% CI on the win-rate, and Chaos Testing likely needs the raw failure events to count specific failure modes. Neither of those can be reconstructed from the summary stats.

If _build_result only sees the summary stats, every project subclass would have to re-read the original EvaluationReport list themselves, which defeats the whole purpose of grouping. In that case I'd ask for the second argument to be the raw entries list (with the summary stats available as a third argument or as a method on self).

A code snippet showing what we'd want to write in our Skill Eval subclass:

def _build_result(self, group_key, entries): result = super()._build_result(group_key, entries) scores = [e["score"] for e in entries] # raw per-pair scores result.bootstrap_ci_low, result.bootstrap_ci_high = bootstrap_winrate(scores) return result

venkatkrish543re requested a deployment to manual-approval May 15, 2026 19:52 — with GitHub Actions Waiting

ybdarrenwang reviewed May 15, 2026

View reviewed changes

Comment thread src/strands_evals/aggregators/base.py Outdated

ybdarrenwang reviewed May 15, 2026

View reviewed changes

Comment thread src/strands_evals/aggregators/base.py Outdated

venkatkrish543re marked this pull request as ready for review May 15, 2026 23:19

venkatkrish543re requested a deployment to manual-approval May 15, 2026 23:19 — with GitHub Actions Waiting

sangminwoo reviewed May 15, 2026

View reviewed changes

venkatkrish543re marked this pull request as draft May 19, 2026 14:32

venkatkrish543re requested a deployment to manual-approval May 19, 2026 16:42 — with GitHub Actions Waiting

venkatkrish543re marked this pull request as ready for review May 19, 2026 19:03

venkatkrish543re temporarily deployed to manual-approval May 19, 2026 19:03 — with GitHub Actions Inactive

poshinchen reviewed May 19, 2026

View reviewed changes

github-actions Bot added the strands-running label May 19, 2026

github-actions Bot reviewed May 19, 2026

View reviewed changes

github-actions Bot removed the strands-running label May 19, 2026

ybdarrenwang reviewed May 19, 2026

View reviewed changes

sangminwoo reviewed May 21, 2026

View reviewed changes

Conversation

venkatkrish543re commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issues

Documentation PR

Type of Change

Testing

Checklist

Uh oh!

ybdarrenwang May 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

poshinchen May 19, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Review Summary

Uh oh!

ybdarrenwang May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sangminwoo May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

venkatkrish543re commented May 15, 2026 •

edited

Loading

ybdarrenwang May 19, 2026 •

edited

Loading