
Commit 1fa00e1

[Evaluation] Fix unhashable list crash in binary aggregation (#46743)
* [Evaluation] Fix unhashable list crash in binary aggregation

  Wrap `value_counts().to_dict()` in `_aggregation_binary_output` in a try/except for `TypeError`. Columns matching `outputs.*_result` whose values are unhashable (e.g. lists) are now skipped with a warning instead of aborting the entire `evaluate()` call with `EvaluationException: (InternalError) unhashable type: 'list'`. Adds a unit test covering a mixed DataFrame (one valid pass/fail column plus one list-valued column) and a CHANGELOG entry under 1.16.7 (Unreleased).

* [Evaluation] Assert warning is emitted for unhashable result columns

Co-authored-by: Manas Kawale <manaskawale@microsoft.com>
1 parent fb275f9 commit 1fa00e1

3 files changed, 46 additions and 2 deletions


sdk/evaluation/azure-ai-evaluation/CHANGELOG.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -14,6 +14,7 @@

 ### Bugs Fixed

+- Fixed `evaluate()` raising `EvaluationException: (InternalError) unhashable type: 'list'` when an evaluator emitted a list value under a `_result`-suffixed column. Binary aggregation now skips such columns with a warning instead of aborting the entire run.
 - Fixed row classification double-counting in `_calculate_aoai_evaluation_summary` where errored rows were counted separately and could also be counted as passed/failed. Rows are now classified into mutually exclusive buckets with priority: passed > failed > errored > skipped.
 - Fixed row classification where rows with empty or missing results lists were incorrectly counted as "passed" (the condition `passed_count == len(results) - error_count` evaluated `0 == 0` as True).
 - Fixed `_get_metric_result` prefix matching where shorter metric names (e.g., `xpia`) could match before longer, more-specific ones (e.g., `xpia_manipulated_content`). Now sorts by length descending for correct longest-prefix matching.
```
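The longest-prefix matching fix in the last changelog entry can be sketched in isolation. This is a hedged illustration, not the SDK's actual `_get_metric_result` code; the function and variable names here are hypothetical:

```python
# Illustrative sketch of longest-prefix matching (names hypothetical, not the
# SDK's implementation). Sorting candidate names by length descending ensures
# "xpia_manipulated_content" is tried before its shorter prefix "xpia".
metric_names = ["xpia", "xpia_manipulated_content"]

def match_metric(column: str, names: list) -> str:
    # Try the most specific (longest) name first so it wins over shorter prefixes.
    for name in sorted(names, key=len, reverse=True):
        if column.startswith(name):
            return name
    return ""

print(match_metric("xpia_manipulated_content_score", metric_names))
# -> "xpia_manipulated_content" (an unsorted scan could wrongly return "xpia")
```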

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py

Lines changed: 13 additions & 2 deletions
```diff
@@ -276,8 +276,19 @@ def _aggregation_binary_output(df: pd.DataFrame) -> Dict[str, float]:
             )
             continue
         if evaluator_name:
-            # Count the occurrences of each unique value (pass/fail)
-            value_counts = df[col].value_counts().to_dict()
+            try:
+                # Count the occurrences of each unique value (pass/fail)
+                value_counts = df[col].value_counts().to_dict()
+            except TypeError as ex:
+                # Column contains unhashable values (e.g., lists/dicts) and is therefore
+                # not a binary pass/fail result column. Skip it instead of aborting the
+                # entire evaluation aggregation.
+                LOGGER.warning(
+                    "Skipping column '%s' for binary aggregation due to unhashable values: %s",
+                    col,
+                    ex,
+                )
+                continue

             # Calculate the proportion of EVALUATION_PASS_FAIL_MAPPING[True] results
             total_rows = len(df)
```
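The failure mode this patch guards against is easy to reproduce outside the SDK: pandas counts values by hashing them, so a column of Python lists makes `value_counts()` raise `TypeError`. A minimal standalone sketch (column names illustrative) of the crash and the skip behavior:

```python
import pandas as pd

# A list-valued column cannot be hashed, so value_counts() raises
# TypeError: unhashable type: 'list'.
df = pd.DataFrame({
    "outputs.good_eval.metric_result": ["pass", "fail"],
    "outputs.bad_eval.metric_result": [["a"], ["b"]],
})

counts = {}
for col in df.columns:
    try:
        counts[col] = df[col].value_counts().to_dict()
    except TypeError:
        # Unhashable values: skip this column instead of failing, mirroring the fix.
        continue

print(sorted(counts))  # only the hashable pass/fail column survives
```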

sdk/evaluation/azure-ai-evaluation/tests/unittests/test_evaluate.py

Lines changed: 32 additions & 0 deletions
```diff
@@ -740,6 +740,38 @@ def test_general_aggregation(self):
         assert "bad_thing.boolean_with_nan" not in aggregation
         assert "bad_thing.boolean_with_none" not in aggregation

+    def test_binary_aggregation_skips_unhashable_result_columns(self, caplog):
+        """A `_result` column containing list values must not crash binary aggregation."""
+        data = {
+            # Valid binary pass/fail column - should be aggregated.
+            "outputs.good_eval.metric_result": ["pass", "pass", "fail", "pass"],
+            # Malformed column whose values are lists (unhashable) - should be skipped
+            # with a warning instead of raising TypeError: unhashable type: 'list'.
+            "outputs.bad_eval.metric_result": [["a"], ["b"], ["c"], ["d"]],
+        }
+        data_df = pd.DataFrame(data)
+
+        with caplog.at_level(logging.WARNING, logger="azure.ai.evaluation._evaluate._evaluate"):
+            aggregation = _aggregate_metrics(data_df, {})
+
+        assert "good_eval.binary_aggregate" in aggregation
+        assert aggregation["good_eval.binary_aggregate"] == 0.75
+        assert "bad_eval.binary_aggregate" not in aggregation
+
+        # The malformed column must be reported via a warning so silent drops are
+        # caught by this regression test.
+        unhashable_warnings = [
+            record
+            for record in caplog.records
+            if record.levelno == logging.WARNING
+            and "outputs.bad_eval.metric_result" in record.getMessage()
+            and "unhashable" in record.getMessage()
+        ]
+        assert unhashable_warnings, (
+            "Expected a warning mentioning 'outputs.bad_eval.metric_result' and 'unhashable', "
+            f"got: {[r.getMessage() for r in caplog.records]}"
+        )
+
     def test_aggregate_label_defect_metrics_with_nan_in_details(self):
         """Test that NaN/None values in details column are properly ignored during aggregation."""
         data = {
```
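The `0.75` the test expects can be recomputed by hand from the same `value_counts()` call the aggregation wraps. A hedged sketch, assuming `EVALUATION_PASS_FAIL_MAPPING[True]` resolves to the string `"pass"` (an assumption about the SDK constant, not verified here):

```python
import pandas as pd

# Recompute the expected binary aggregate for the test's good column:
# 3 "pass" results out of 4 rows.
col = pd.Series(["pass", "pass", "fail", "pass"])
value_counts = col.value_counts().to_dict()

# Assumed pass label; the SDK reads it from EVALUATION_PASS_FAIL_MAPPING[True].
pass_label = "pass"
binary_aggregate = value_counts.get(pass_label, 0) / len(col)

print(binary_aggregate)  # 3 / 4 = 0.75
```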
