Commit 6b06163

aprilk-ms and Copilot authored
[Evaluation] Fix AOAI evaluation to preserve list values instead of stringifying them (#45574)
* Fix AOAI evaluation to preserve list values instead of stringifying them

  The _convert_value helper in _get_data_source was converting list values to strings via str(), turning [] into '[]'. The AOAI API then rejected these with 'is not of type array' errors. Move list from the stringify branch to the pass-through branch alongside dict, since both are structured JSON types that should be preserved as native objects for proper serialization. Update existing test assertions and add a new test for list/dict value preservation, including empty collections.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Infer array/object schema types for list/dict columns in flat mode

  The flat schema generator in _generate_data_source_config now samples the first row to emit the correct JSON Schema type (array, object, or string) instead of defaulting everything to string. This ensures the schema aligns with the data produced by _convert_value. Add a test for schema type inference and an integration test verifying schema-data alignment for list/dict columns, including empty collections.

* Fix pass_threshold propagation and zero-threshold logging

  - Use 'is not None' instead of a truthiness check in _build_internal_log_attributes so threshold=0 is not silently dropped.
  - Propagate _pass_threshold from evaluator_config into testing_criteria_metadata in _extract_testing_criteria_metadata.
  - Inject pass_threshold into metric results in _process_criteria_metrics when the evaluator (e.g. PythonGrader) does not emit one, without overwriting evaluator-provided thresholds.
  - Add 12 unit tests covering all three changes, including zero-value edge cases.

* Skip None/NaN rows when inferring schema types

  The flat schema generator now scans past None and NaN values to find the first non-null sample for type inference, instead of only checking iloc[0]. This avoids schema-data mismatches when the first row has missing values but later rows contain lists or dicts.

* Address PR review comments

  - Use _is_none_or_nan for the threshold injection check so NaN thresholds are also replaced by pass_threshold from config.
  - Use pd.isna with a guard for list/dict when skipping null sentinels (handles pd.NA, NaT, etc. in addition to None and float NaN).
  - Infer leaf types in the nested schema via a leaf_type_map parameter on _build_schema_tree_from_paths so nested paths with list/dict data get array/object schema types instead of always defaulting to string.
  - Add tests for leaf_type_map, nested schema type inference, pd.NA handling, and NaN threshold injection.

* Apply black formatting to pass CI checks

  Use line-length=120 from eng/black-pyproject.toml config.
1 parent 233f129 commit 6b06163

4 files changed

Lines changed: 416 additions & 13 deletions


sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py

Lines changed: 14 additions & 1 deletion
@@ -1103,7 +1103,7 @@ def _build_internal_log_attributes(
     # Create a copy of the base log attributes
     internal_log_attributes: Dict[str, str] = log_attributes.copy()
     # Add threshold if present
-    if event_data.get("threshold"):
+    if event_data.get("threshold") is not None:
         internal_log_attributes["gen_ai.evaluation.threshold"] = str(event_data["threshold"])
 
     # Add testing criteria details if present
@@ -2030,6 +2030,11 @@ def _extract_testing_criteria_metadata(
             "metrics": metrics,
             "is_inverse": is_inverse,
         }
+        # Propagate pass_threshold from evaluator config so result events can include it
+        if evaluator_config and criteria_name in evaluator_config:
+            pass_threshold = evaluator_config[criteria_name].get("_pass_threshold")
+            if pass_threshold is not None:
+                testing_criteria_metadata[criteria_name]["pass_threshold"] = pass_threshold
 
     return testing_criteria_metadata

@@ -2503,6 +2508,14 @@ def _process_criteria_metrics(
     # Extract metric values
     result_per_metric = _extract_metric_values(criteria_name, criteria_type, metrics, expected_metrics, logger)
 
+    # Inject threshold from evaluator config when not present in raw results
+    # (e.g., PythonGrader/code evaluators don't emit a threshold column)
+    config_threshold = testing_criteria_metadata.get(criteria_name, {}).get("pass_threshold")
+    if config_threshold is not None:
+        for metric_values in result_per_metric.values():
+            if _is_none_or_nan(metric_values.get("threshold")):
+                metric_values["threshold"] = config_threshold
+
     # Convert to result objects
     results = []
     top_sample = {}
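The switch from a truthiness check to `is not None` in the first hunk matters precisely for zero thresholds. A minimal standalone sketch (the `event_data` dict here is invented for illustration):

```python
# Hypothetical event payload with a legitimate threshold of zero.
event_data = {"threshold": 0}

# Old check: 0 is falsy, so the threshold attribute was silently dropped.
old_kept = bool(event_data.get("threshold"))

# New check: only a genuinely missing threshold is skipped.
new_kept = event_data.get("threshold") is not None

print(old_kept, new_kept)  # False True
```

The same falsy-value trap would also bite empty strings or `False`, which is why identity comparison against `None` is the safer idiom for "key present with a value".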

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate_aoai.py

Lines changed: 50 additions & 9 deletions
@@ -590,6 +590,7 @@ def _get_graders_and_column_mappings(
 def _build_schema_tree_from_paths(
     paths: List[str],
     force_leaf_type: str = "string",
+    leaf_type_map: Optional[Dict[str, str]] = None,
 ) -> Dict[str, Any]:
     """
     Build a nested JSON schema (object) from a list of dot-delimited paths.
@@ -629,33 +630,40 @@ def _build_schema_tree_from_paths(
     :param force_leaf_type: The JSON Schema ``type`` value to assign to every leaf node
         produced from the supplied paths. Defaults to ``"string"``.
     :type force_leaf_type: str
+    :param leaf_type_map: Optional mapping from leaf path to JSON Schema type. When
+        provided, overrides ``force_leaf_type`` for any path present in this map.
+    :type leaf_type_map: Optional[Dict[str, str]]
     :return: A JSON Schema fragment describing the hierarchical structure implied by
         the input paths. The returned schema root always has ``type: object`` with
         recursively nested ``properties`` / ``required`` keys.
     :rtype: Dict[str, Any]
     """
-    # Build tree where each node: {"__children__": { segment: node, ... }, "__leaf__": bool }
-    root: Dict[str, Any] = {"__children__": {}, "__leaf__": False}
+    # Build tree where each node: {"__children__": { segment: node, ... }, "__leaf__": bool, "__path__": str }
+    root: Dict[str, Any] = {"__children__": {}, "__leaf__": False, "__path__": ""}
 
     def insert(path: str):
         parts = [p for p in path.split(".") if p]
         node = root
         for i, part in enumerate(parts):
             children = node["__children__"]
             if part not in children:
-                children[part] = {"__children__": {}, "__leaf__": False}
+                children[part] = {"__children__": {}, "__leaf__": False, "__path__": ""}
             node = children[part]
             if i == len(parts) - 1:
                 node["__leaf__"] = True
+                node["__path__"] = path
 
     for p in paths:
         insert(p)
 
+    _leaf_types = leaf_type_map or {}
+
     def to_schema(node: Dict[str, Any]) -> Dict[str, Any]:
         children = node["__children__"]
         if not children:
-            # Leaf node
-            return {"type": force_leaf_type}
+            # Leaf node — use per-leaf type if available, else force_leaf_type
+            leaf_type = _leaf_types.get(node["__path__"], force_leaf_type)
+            return {"type": leaf_type}
         props = {}
         required = []
         for name, child in children.items():
@@ -715,8 +723,24 @@ def _generate_data_source_config(input_data_df: pd.DataFrame, column_mapping: Di
         props = data_source_config["item_schema"]["properties"]
         req = data_source_config["item_schema"]["required"]
         for key in column_mapping.keys():
-            if key in input_data_df and len(input_data_df[key]) > 0 and isinstance(input_data_df[key].iloc[0], list):
+            sample = None
+            if key in input_data_df:
+                for candidate in input_data_df[key]:
+                    # Skip null-like scalar values (None, NaN, pd.NA, NaT, etc.)
+                    if isinstance(candidate, (list, dict)):
+                        sample = candidate
+                        break
+                    try:
+                        if candidate is not None and not pd.isna(candidate):
+                            sample = candidate
+                            break
+                    except (TypeError, ValueError):
+                        sample = candidate
+                        break
+            if isinstance(sample, list):
                 props[key] = {"type": "array"}
+            elif isinstance(sample, dict):
+                props[key] = {"type": "object"}
             else:
                 props[key] = {"type": "string"}
             req.append(key)
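The `try`/`except` guard around `pd.isna` in this sampling loop exists because `pd.isna` is elementwise on list-likes, so negating its result on a list cell raises rather than returning a bool. A quick standalone illustration of the pitfall (pandas required):

```python
import pandas as pd

# On scalars, pd.isna returns a plain bool.
assert pd.isna(None)
assert pd.isna(float("nan"))
assert not pd.isna("hello")

# On a list, pd.isna returns an ndarray of per-element results, so
# `not pd.isna(candidate)` is ambiguous and raises ValueError.
# That is why the hunk checks isinstance(candidate, (list, dict)) first
# and still wraps the pd.isna call defensively.
try:
    truthy = not pd.isna(["tag1", "tag2"])
except ValueError:
    truthy = "ambiguous"
print(truthy)  # ambiguous
```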
@@ -754,7 +778,24 @@ def _generate_data_source_config(input_data_df: pd.DataFrame, column_mapping: Di
         LOGGER.info(f"AOAI: Effective paths after stripping wrapper: {effective_paths}")
 
     LOGGER.info(f"AOAI: Building nested schema from {len(effective_paths)} effective paths...")
-    nested_schema = _build_schema_tree_from_paths(effective_paths, force_leaf_type="string")
+
+    # Infer leaf types from the DataFrame so nested schemas also get array/object types
+    leaf_type_map: Dict[str, str] = {}
+    for ref_path, eff_path in zip(referenced_paths, effective_paths if strip_wrapper else referenced_paths):
+        if ref_path in input_data_df:
+            for candidate in input_data_df[ref_path]:
+                if isinstance(candidate, (list, dict)):
+                    leaf_type_map[eff_path] = "array" if isinstance(candidate, list) else "object"
+                    break
+                try:
+                    if candidate is not None and not pd.isna(candidate):
+                        break
+                except (TypeError, ValueError):
+                    break
+
+    nested_schema = _build_schema_tree_from_paths(
+        effective_paths, force_leaf_type="string", leaf_type_map=leaf_type_map
+    )
 
     LOGGER.info(f"AOAI: Nested schema generated successfully with type '{nested_schema.get('type')}'")
     return {
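To see what the new `leaf_type_map` parameter buys, here is a simplified stand-in for `_build_schema_tree_from_paths` (not the SDK implementation; it skips the `__children__`/`__leaf__` bookkeeping and builds the schema directly):

```python
from typing import Any, Dict, List, Optional


def build_schema(paths: List[str], leaf_type_map: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
    """Toy analogue of _build_schema_tree_from_paths: dot paths in, JSON Schema out."""
    leaf_types = leaf_type_map or {}
    root: Dict[str, Any] = {"type": "object", "properties": {}, "required": []}
    for path in paths:
        parts = [p for p in path.split(".") if p]
        node = root
        for i, part in enumerate(parts):
            if part not in node["required"]:
                node["required"].append(part)
            if i == len(parts) - 1:
                # Leaf: use the inferred type when the map has one, else default to string
                node["properties"][part] = {"type": leaf_types.get(path, "string")}
            else:
                node["properties"].setdefault(part, {"type": "object", "properties": [], "required": []})
                node["properties"][part].setdefault("properties", {})
                node = node["properties"][part]
                node["properties"] = node["properties"] or {}
    return root


schema = build_schema(["context.tags", "context.query"], leaf_type_map={"context.tags": "array"})
print(schema["properties"]["context"]["properties"]["tags"]["type"])   # array
print(schema["properties"]["context"]["properties"]["query"]["type"])  # string
```

With an empty or missing map every leaf falls back to `force_leaf_type`, which is exactly the pre-patch behavior that caused array-typed data to be declared as strings.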
@@ -816,9 +857,9 @@ def _convert_value(val: Any) -> Any:
     if isinstance(val, bool):
         return val
     # Align numerics with legacy text-only JSONL payloads by turning them into strings.
-    if isinstance(val, (int, float, list)):
+    if isinstance(val, (int, float)):
         return str(val)
-    if isinstance(val, (dict)):
+    if isinstance(val, (list, dict)):
         return val
     return str(val)
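Restated outside the diff, the patched branch order behaves as follows (a standalone sketch mirroring the hunk, not the SDK function itself):

```python
def convert_value(val):
    """Mirror of the patched _convert_value branch order."""
    if isinstance(val, bool):
        # bools pass through; checked before int because bool is a subclass of int
        return val
    if isinstance(val, (int, float)):
        # numerics are stringified for legacy text-only JSONL parity
        return str(val)
    if isinstance(val, (list, dict)):
        # structured JSON types are now kept native instead of str()-ified
        return val
    return str(val)


print(convert_value([]))          # [] rather than '[]', which AOAI rejected as "is not of type array"
print(convert_value({"k": "v"}))  # {'k': 'v'}
print(convert_value(42))          # '42'
```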

sdk/evaluation/azure-ai-evaluation/tests/unittests/test_aoai_data_source.py

Lines changed: 178 additions & 3 deletions
@@ -171,6 +171,26 @@ def test_mixed_depth_paths(self):
         assert nested["type"] == "object"
         assert "field" in nested["properties"]
 
+    def test_leaf_type_map_overrides_force_leaf_type(self):
+        """Test that leaf_type_map overrides force_leaf_type for specific paths."""
+        paths = ["query", "tags", "metadata"]
+        leaf_type_map = {"tags": "array", "metadata": "object"}
+        schema = _build_schema_tree_from_paths(paths, force_leaf_type="string", leaf_type_map=leaf_type_map)
+
+        assert schema["properties"]["query"]["type"] == "string"
+        assert schema["properties"]["tags"]["type"] == "array"
+        assert schema["properties"]["metadata"]["type"] == "object"
+
+    def test_leaf_type_map_nested_paths(self):
+        """Test leaf_type_map with nested paths."""
+        paths = ["context.tags", "context.query"]
+        leaf_type_map = {"context.tags": "array"}
+        schema = _build_schema_tree_from_paths(paths, force_leaf_type="string", leaf_type_map=leaf_type_map)
+
+        context = schema["properties"]["context"]
+        assert context["properties"]["tags"]["type"] == "array"
+        assert context["properties"]["query"]["type"] == "string"
+
 
 @pytest.mark.unittest
 class TestGenerateDataSourceConfig:
@@ -297,6 +317,102 @@ def test_single_nested_path(self, flat_test_data):
         # After wrapper stripping, should see context
         assert "context" in schema["properties"]
 
+    def test_flat_schema_infers_list_and_dict_types(self, flat_test_data):
+        """Test that flat schema correctly infers array/object types from data."""
+        flat_test_data["tags"] = [["tag1", "tag2"], ["tag3"], []]
+        flat_test_data["metadata"] = [{"key": "val"}, {"key2": "val2"}, {}]
+        flat_test_data["score"] = [95, 87, 92]
+
+        column_mapping = {
+            "query": "${data.query}",
+            "tags": "${data.tags}",
+            "metadata": "${data.metadata}",
+            "score": "${data.score}",
+        }
+
+        config = _generate_data_source_config(flat_test_data, column_mapping)
+
+        properties = config["item_schema"]["properties"]
+        # Strings should be typed as string
+        assert properties["query"]["type"] == "string"
+        # Lists should be typed as array
+        assert properties["tags"]["type"] == "array"
+        # Dicts should be typed as object
+        assert properties["metadata"]["type"] == "object"
+        # Numerics (converted to str by _convert_value) should be typed as string
+        assert properties["score"]["type"] == "string"
+
+    def test_flat_schema_skips_none_nan_for_type_inference(self):
+        """Test that schema inference skips None/NaN rows to find the real type."""
+        import numpy as np
+
+        df = pd.DataFrame(
+            {
+                "tags": [None, ["tag1", "tag2"], ["tag3"]],
+                "metadata": [np.nan, {"key": "val"}, {}],
+                "query": [None, None, "hello"],
+            }
+        )
+        column_mapping = {
+            "tags": "${data.tags}",
+            "metadata": "${data.metadata}",
+            "query": "${data.query}",
+        }
+
+        config = _generate_data_source_config(df, column_mapping)
+        properties = config["item_schema"]["properties"]
+
+        # Should look past None in row 0 and find list in row 1
+        assert properties["tags"]["type"] == "array"
+        # Should look past NaN in row 0 and find dict in row 1
+        assert properties["metadata"]["type"] == "object"
+        # All None → falls back to string
+        assert properties["query"]["type"] == "string"
+
+    def test_flat_schema_skips_pd_na_for_type_inference(self):
+        """Test that schema inference skips pd.NA sentinel values."""
+        df = pd.DataFrame(
+            {
+                "tags": [pd.NA, ["tag1", "tag2"], ["tag3"]],
+                "query": ["hello", "world", "test"],
+            }
+        )
+        column_mapping = {
+            "tags": "${data.tags}",
+            "query": "${data.query}",
+        }
+
+        config = _generate_data_source_config(df, column_mapping)
+        properties = config["item_schema"]["properties"]
+
+        assert properties["tags"]["type"] == "array"
+        assert properties["query"]["type"] == "string"
+
+    def test_nested_schema_infers_list_and_dict_leaf_types(self):
+        """Test that nested schema infers array/object types for leaf nodes."""
+        df = pd.DataFrame(
+            [
+                {
+                    "item.query": "hello",
+                    "item.tags": ["tag1", "tag2"],
+                    "item.metadata": {"key": "val"},
+                }
+            ]
+        )
+        column_mapping = {
+            "query": "${data.item.query}",
+            "tags": "${data.item.tags}",
+            "metadata": "${data.item.metadata}",
+        }
+
+        config = _generate_data_source_config(df, column_mapping)
+        schema = config["item_schema"]
+
+        # After wrapper stripping, leaves should have inferred types
+        assert schema["properties"]["query"]["type"] == "string"
+        assert schema["properties"]["tags"]["type"] == "array"
+        assert schema["properties"]["metadata"]["type"] == "object"
+
 
 @pytest.mark.unittest
 class TestGetDataSource:
@@ -437,7 +553,7 @@ def test_data_source_with_item_column_and_nested_values(self, nested_item_keywor
         # Ensure we did not accidentally nest another 'item' key inside the wrapper
         assert "item" not in item_payload
         assert item_payload["sample"]["output_text"] == "someoutput"
-        assert item_payload["sample"]["output_items"] == "['item1', 'item2']"
+        assert item_payload["sample"]["output_items"] == ["item1", "item2"]
 
     def test_data_source_with_item_sample_column_and_nested_values(self, nested_item_sample_keyword_data):
         """Ensure rows that already have an 'item' column keep nested dicts intact."""
@@ -464,7 +580,7 @@ def test_data_source_with_item_sample_column_and_nested_values(self, nested_item
         # Ensure we did not accidentally nest another 'item' key inside the wrapper
         assert "item" not in item_payload
         assert item_payload["sample"]["output_text"] == "someoutput"
-        assert item_payload["sample"]["output_items"] == "['item1', 'item2']"
+        assert item_payload["sample"]["output_items"] == ["item1", "item2"]
 
     def test_data_source_with_sample_output_metadata(self, flat_sample_output_data):
         """Ensure flat rows that include dotted sample metadata remain accessible."""
@@ -485,7 +601,7 @@ def test_data_source_with_sample_output_metadata(self, flat_sample_output_data):
         assert row["test"]["test_string"] == "baking cakes is fun!"
         # sample.output_text should follow the row through normalization without being stringified
         assert row["sample.output_text"] == "someoutput"
-        assert row["sample.output_items"] == "['item1', 'item2']"
+        assert row["sample.output_items"] == ["item1", "item2"]
 
     def test_data_source_with_numeric_values(self, flat_test_data):
         """Test data source generation converts numeric values to strings."""
@@ -504,6 +620,35 @@ def test_data_source_with_numeric_values(self, flat_test_data):
         assert isinstance(content[0][WRAPPER_KEY]["score"], str)
         assert isinstance(content[0][WRAPPER_KEY]["confidence"], str)
 
+    def test_data_source_with_list_and_dict_values(self, flat_test_data):
+        """Test data source generation preserves list and dict values as-is."""
+        flat_test_data["tags"] = [["tag1", "tag2"], ["tag3"], []]
+        flat_test_data["metadata"] = [{"key": "val"}, {"key2": "val2"}, {}]
+
+        column_mapping = {
+            "query": "${data.query}",
+            "tags": "${data.tags}",
+            "metadata": "${data.metadata}",
+        }
+
+        data_source = _get_data_source(flat_test_data, column_mapping)
+
+        content = data_source["source"]["content"]
+
+        # Lists should be preserved as lists, not stringified
+        assert content[0][WRAPPER_KEY]["tags"] == ["tag1", "tag2"]
+        assert isinstance(content[0][WRAPPER_KEY]["tags"], list)
+        # Empty lists should also be preserved
+        assert content[2][WRAPPER_KEY]["tags"] == []
+        assert isinstance(content[2][WRAPPER_KEY]["tags"], list)
+
+        # Dicts should be preserved as dicts
+        assert content[0][WRAPPER_KEY]["metadata"] == {"key": "val"}
+        assert isinstance(content[0][WRAPPER_KEY]["metadata"], dict)
+        # Empty dicts should also be preserved
+        assert content[2][WRAPPER_KEY]["metadata"] == {}
+        assert isinstance(content[2][WRAPPER_KEY]["metadata"], dict)
+
     def test_empty_dataframe(self):
         """Test data source generation with empty dataframe."""
         empty_df = pd.DataFrame()
@@ -600,3 +745,33 @@ def test_nested_schema_and_data_alignment(self, nested_test_data):
         assert "query" in item
         assert "context" in item
         assert "company" in item["context"]
+
+    def test_flat_schema_and_data_alignment_with_list_and_dict(self, flat_test_data):
+        """Test that schema types and data values agree for list/dict columns."""
+        flat_test_data["tags"] = [["tag1", "tag2"], ["tag3"], []]
+        flat_test_data["metadata"] = [{"key": "val"}, {"key2": "val2"}, {}]
+
+        column_mapping = {
+            "query": "${data.query}",
+            "tags": "${data.tags}",
+            "metadata": "${data.metadata}",
+        }
+
+        config = _generate_data_source_config(flat_test_data, column_mapping)
+        data_source = _get_data_source(flat_test_data, column_mapping)
+
+        schema_props = config["item_schema"]["properties"]
+        data_item = data_source["source"]["content"][0][WRAPPER_KEY]
+
+        # Schema declares array → data contains a list
+        assert schema_props["tags"]["type"] == "array"
+        assert isinstance(data_item["tags"], list)
+
+        # Schema declares object → data contains a dict
+        assert schema_props["metadata"]["type"] == "object"
+        assert isinstance(data_item["metadata"], dict)
+
+        # Empty collections should also align
+        empty_item = data_source["source"]["content"][2][WRAPPER_KEY]
+        assert isinstance(empty_item["tags"], list)
+        assert isinstance(empty_item["metadata"], dict)
