feat(aggregators): add general aggregator for evaluation reports#230
feat(aggregators): add general aggregator for evaluation reports#230venkatkrish543re wants to merge 2 commits into
Conversation
Adds a general aggregator over EvaluationReport lists. Groups results by (case_key, evaluator_name), filters corrupted trials, computes descriptive stats and optional efficiency rollup, and runs paired statistics (Wilcoxon, paired-t, McNemar) when exactly two conditions are detected with trial_idx pairing. LLM summary is optional via a configurable system_prompt. JSON serialization and terminal display included.
| "condition_label": str, # condition this trial belongs to; | ||
| # required for paired comparisons | ||
| "trial_idx": int, # pairing key across conditions | ||
| "corrupted": bool, # excluded from stats when True |
There was a problem hiding this comment.
This PR is supposed to be only about general case aggregator. This "corrupt" concept and the trajectory_pointers don't belong to this PR.
… rollup, raw_values, trajectory_pointers, and corruption filtering from the base. Exposed _group_key / _filter_entry / _build_result as overridable hooks so project-specific behavior can layer on top without reimplementing core grouping.
|
|
||
| import logging | ||
| from collections import defaultdict | ||
| from typing import Any, Optional, cast |
| are dropped. | ||
| """ | ||
| if not entry["case_key"]: | ||
| logger.debug("Skipping entry with no case_key or name") |
There was a problem hiding this comment.
Issue: The logging call lacks structured fields. Per STYLE_GUIDE.md, logs should have context fields first, followed by a human-readable message after a pipe:
logger.debug("field1=<%s>, field2=<%s> | human readable message", field1, field2)Suggestion:
logger.debug("case_key=<%s>, name=<%s> | skipping entry with no case_key or name", entry.get("case_key"), entry.get("name", ""))| self, | ||
| name: str = "Aggregator", | ||
| model: Optional[Any] = None, | ||
| system_prompt: Optional[str] = None, |
There was a problem hiding this comment.
Issue: The model parameter uses Optional[Any] which is too permissive. The established pattern in this repo (see evaluators/evaluator.py) uses Model | str | None with from strands.models.model import Model.
Suggestion:
from strands.models.model import Model
def __init__(
self,
name: str = "Aggregator",
model: Model | str | None = None,
system_prompt: str | None = None,
):This gives users IDE autocompletion and catches type errors early.
| "metadata": metadata, | ||
| } | ||
|
|
||
| def _group_key(self, entry: dict) -> Optional[tuple]: |
There was a problem hiding this comment.
Issue: _group_key returns Optional[tuple] with an unparameterized tuple. Since subclasses are expected to override this and the return value is used as a dictionary key, specifying the type more precisely would improve type safety. Also, Optional is imported from typing but other modules in this repo use typing_extensions or the X | None syntax.
Suggestion:
def _group_key(self, entry: dict) -> tuple[str, ...] | None:| ) | ||
| result = agent(prompt, structured_output_model=_Summary) | ||
| return cast(_Summary, result.structured_output).summary | ||
| except Exception as e: |
There was a problem hiding this comment.
Issue: _invoke_summary_model catches all exceptions with a broad except Exception and silently degrades to returning an empty string. While graceful degradation is appropriate, swallowing all exceptions (including KeyboardInterrupt subclass scenarios, credential errors, or configuration bugs) can hide real problems from users.
Suggestion: Consider catching more specific exceptions (e.g., the known Strands SDK exceptions) or at minimum logging at error level rather than warning when the failure is unexpected. Also, the log message should use structured fields:
logger.warning("model=<%s>, error=<%s> | llm summarization failed", self.model, e)|
|
||
| render_display(self) | ||
|
|
||
| def to_file(self, path: str) -> None: |
There was a problem hiding this comment.
Issue: The to_file and from_file methods accept str but not Path. The existing EvaluationReport.to_file has the same signature, but accepting str | Path would be more Pythonic and consistent with common library patterns.
Suggestion:
def to_file(self, path: str | Path) -> None:@classmethod
def from_file(cls, path: str | Path) -> "AggregationReport":|
Issue: The PR description links to issue #114 (Chaos/Resiliency Evaluation) which appears unrelated to aggregation functionality. This makes it harder to trace the motivation and requirements for this feature. Suggestion: Please link to the correct issue that tracks the aggregator feature, or create one if it doesn't exist. This helps reviewers understand the acceptance criteria and future readers understand the context. |
|
|
||
| return grouped | ||
|
|
||
| def _make_entry( |
There was a problem hiding this comment.
Issue: The _group_results method has a return type annotation of dict[tuple, list[dict]], and _build_result takes entries: list[dict]. All internal data flows through untyped dict objects. This makes it harder for subclass authors to understand what keys are available in an entry.
Suggestion: Consider defining a TypedDict for the entry shape (or at minimum, document the expected keys in a docstring). This is particularly important since _make_entry, _group_key, _filter_entry, and _build_result are all designed as extension points:
class _Entry(TypedDict):
case_key: str
evaluator_name: str
score: float
passed: bool
reason: str
metadata: dict|
Issue: This PR introduces a new public API (
The PR also lacks the Suggestion:
from strands_evals.aggregators import Aggregator, AggregationReport
agg = Aggregator(name="my-experiment")
report: AggregationReport = agg.aggregate(evaluation_reports)
report.run_display()
report.to_file("results.json")
|
Review SummaryAssessment: Request Changes The aggregator concept is well-designed with clean extension points ( Review Categories
The extensibility design with the three hooks is thoughtful and the test coverage is solid for the current scope. |
| """General-purpose aggregator over evaluation reports. | ||
|
|
||
| The default ``Aggregator`` groups per-trial results by | ||
| ``(case_key, evaluator_name)`` and computes descriptive statistics |
There was a problem hiding this comment.
I think grouping by evaluator_name makes sense because, if you think of the 2 dimensions of case vs. evaluator in an experiment, since the existing experiment report basically groups by case (summarizing all evaluation results w.r.t. 1 case), it makes sense that this new aggregator summarizes from the other dimension which is evaluator_name. And that's why in _group_results there's a flatten (breaking down reports by case X evaluator) → regrouping (by evaluator) happening.
On the other hand though, despite I do see why you introduced metadata["case_key"] which tries to provide an additional grouping id besides case name, how's the user experience going to be? Your current implementation:
- prevents grouping different case name if
case_keyis not assigned (because case_key = case name by default); - when use wants to mix match aggregate different subsets of cases, they have to assign new
case_keyevery time.
Can users not simply manage the grouping by case name? That is:
cases = [
Case(name="foo", ...),
Case(name="bar", ...),
Case(name="baz", ...)
]
reports = Experiment(cases=cases, ...).run_evaluations(...)
aggregated_report_1 = Aggregator(reports, cases=["foo", "bar"])
aggregated_report_2 = Aggregator(reports, cases=["bar", "baz"])
| # Per-group aggregation | ||
| # ------------------------------------------------------------------ | ||
|
|
||
| def _build_result( |
There was a problem hiding this comment.
What I want to verify: when a subclass overrides _build_result, does it receive the raw list of trial entries as its second argument, or just the pre-computed summary stats (mean, min, max, pass_rate)?
If it gets the raw entries, we're fine. Skill Eval needs the raw per-pair scores to compute a bootstrap 95% CI on the win-rate, and Chaos Testing likely needs the raw failure events to count specific failure modes. Neither of those can be reconstructed from the summary stats.
If _build_result only sees the summary stats, every project subclass would have to re-read the original EvaluationReport list themselves, which defeats the whole purpose of grouping. In that case I'd ask for the second argument to be the raw entries list (with the summary stats available as a third argument or as a method on self).
A code snippet showing what we'd want to write in our Skill Eval subclass:
def _build_result(self, group_key, entries):
result = super()._build_result(group_key, entries)
scores = [e["score"] for e in entries] # raw per-pair scores
result.bootstrap_ci_low, result.bootstrap_ci_high = bootstrap_winrate(scores)
return result
Description
Adds a general aggregator over EvaluationReport lists. Groups results by (case_key, evaluator_name), computes descriptive stats. The three overridable hooks(_group_key, _filter_entry, _build_result) for project specific subclasses
LLM summary is optional via a configurable system_prompt. JSON serialization and terminal display included.
Related Issues
#114
Documentation PR
Type of Change
New feature
Testing
How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agent-tools, agents-cli
hatch run prepareChecklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.