Commit 063446b

aprilk-ms and Copilot authored
azure-ai-evaluation: forward evaluator properties to App Insights (#46942)
* azure-ai-evaluation: forward evaluator properties to App Insights

Add a generic `gen_ai.evaluation.properties` JSON attribute (inside `internal_properties`) that carries arbitrary evaluator-specific keys from each event's `properties` payload to App Insights.

Previously `_log_events_to_app_insights` only forwarded four red-team keys (`attack_success`, `attack_technique`, `attack_complexity`, `attack_success_threshold`). Structured evaluator outputs such as the rubric evaluator's `dimension_scores` were silently dropped before emission, breaking downstream Kusto/dashboards that rely on the per-dimension scores being queryable.

Behavior changes:
* Keys already covered by dedicated red-team attributes are excluded from the generic JSON blob to avoid duplicate emission.
* Non-dict `properties` values are skipped defensively without raising.
* Payloads larger than 7500 characters are truncated to stay under the App Insights attribute value limit (~8 KiB).

Tests added under `TestLogEventsProperties` exercise the rubric, red-team coexistence, absent/empty/non-dict, and truncation paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address Copilot review feedback

- Hoist the isinstance(properties, dict) guard over the entire properties block so unexpected payload shapes (None, list, str, ...) cannot crash the existing red-team forwarder.
- Replace string-slice truncation with a valid JSON marker ({"truncated": true, "original_size_bytes": N}) so downstream json.loads consumers can always parse gen_ai.evaluation.properties.
- Update CHANGELOG to describe the new truncation marker behavior.
- Add a parametrized test exercising None/int/str/list/tuple property shapes; update the oversize test to assert that the emitted JSON is parseable and contains the truncation marker.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Run black on _evaluate.py to satisfy CI lint

CI 'Run Black' check failed because the dict-comprehension at line 1290 fit on a single 120-char line. No semantic change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent b6ad563 commit 063446b
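
The review feedback in the commit message replaces string-slice truncation with a JSON marker. The following is a minimal sketch of why that matters, not code from the SDK: `MAX_LEN` is a local stand-in for the 7500-character cap, and the oversized payload is made up for illustration.

```python
import json

# Local stand-in for the 7500-character cap described in the commit message.
MAX_LEN = 7500

# A made-up oversized payload, serialized the same way a properties dict would be.
payload = json.dumps({"big": "x" * 20000})

# Naive truncation: cut the serialized string at the cap.
sliced = payload[:MAX_LEN]
try:
    json.loads(sliced)
    sliced_is_valid = True
except json.JSONDecodeError:
    # The slice ends in the middle of a string literal, so parsing fails.
    sliced_is_valid = False

# Marker-based truncation: drop the payload and emit a small, valid JSON object.
marker = json.dumps({"truncated": True, "original_size_bytes": len(payload)})

print("sliced payload parses:", sliced_is_valid)
print("marker parses to:", json.loads(marker))
```

The sliced string cannot be parsed, while the marker always can, at the cost of dropping the payload entirely.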

3 files changed

Lines changed: 247 additions & 4 deletions

sdk/evaluation/azure-ai-evaluation/CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@
 - Added `status` field (`"completed"`, `"error"`, `"skipped"`) on evaluation result items to indicate evaluator execution outcome.
 - Added `skipped` and `errored` counts to `result_counts` and `per_testing_criteria_results` in AOAI evaluation summaries.
 - Added `skipped` to `ResultCount` and `skipped`/`errored` to `PerTestingCriteriaResult` typed contracts.
+- App Insights logging now forwards arbitrary evaluator-specific keys from each event's `properties` payload as a single `gen_ai.evaluation.properties` JSON attribute (carried inside `internal_properties`). Previously only the four red-team keys (`attack_success`, `attack_technique`, `attack_complexity`, `attack_success_threshold`) were forwarded; structured outputs such as rubric `dimension_scores` were silently dropped. Payloads larger than 7500 characters are replaced with a valid JSON marker (`{"truncated": true, "original_size_bytes": <n>}`) so consumers can always `json.loads` the value. Non-dict `properties` payloads are now safely ignored instead of raising in the red-team forwarder.

 ### Breaking Changes
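
The CHANGELOG entry says consumers can always `json.loads` the attribute value. A hedged sketch of what a consumer-side reader might look like follows. `read_evaluation_properties` is a hypothetical helper, not part of the SDK, and it assumes the consumer already holds the `internal_properties` JSON string. Treating an object with exactly the two marker keys as "truncated" is also an assumption: an evaluator could in principle emit those same keys itself.

```python
import json
from typing import Any, Dict, Optional

# Key set of the truncation marker described in the CHANGELOG entry.
_MARKER_KEYS = {"truncated", "original_size_bytes"}


def read_evaluation_properties(internal_properties_json: str) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: return forwarded properties, or None if absent or truncated."""
    internal = json.loads(internal_properties_json)
    raw = internal.get("gen_ai.evaluation.properties")
    if raw is None:
        # The event carried no evaluator-specific properties.
        return None
    props = json.loads(raw)
    # Assumption: an object with exactly the marker keys is the truncation marker.
    if isinstance(props, dict) and set(props) == _MARKER_KEYS and props["truncated"] is True:
        return None
    return props


# Usage with made-up values.
scores = {"dimension_scores": [{"id": "general_quality", "score": 0.5}]}
forwarded = json.dumps({"gen_ai.evaluation.properties": json.dumps(scores)})
dropped = json.dumps(
    {"gen_ai.evaluation.properties": json.dumps({"truncated": True, "original_size_bytes": 20011})}
)
print(read_evaluation_properties(forwarded))
print(read_evaluation_properties(dropped))
```

Note the two levels of decoding: the attribute is a JSON string nested inside the `internal_properties` JSON string.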

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py

Lines changed: 58 additions & 4 deletions
@@ -73,6 +73,24 @@
     "groundedness_pro_label": "groundedness_pro_passing_rate",
 }

+# Property keys that already have dedicated named attributes in
+# ``_log_events_to_app_insights``. They are excluded from the generic
+# ``gen_ai.evaluation.properties`` JSON forwarder so the data is not emitted
+# twice.
+_DEDICATED_EVALUATION_PROPERTY_KEYS = frozenset(
+    (
+        "attack_success",
+        "attack_technique",
+        "attack_complexity",
+        "attack_success_threshold",
+    )
+)
+
+# Maximum serialized length of the ``gen_ai.evaluation.properties`` attribute
+# payload. App Insights caps individual attribute values around 8 KiB; we
+# truncate slightly below that to leave headroom for OTel framing.
+_MAX_EVALUATION_PROPERTIES_JSON_LEN = 7500
+

 class __EvaluatorInfo(TypedDict):
     result: pd.DataFrame
@@ -1246,10 +1264,11 @@ def _log_events_to_app_insights(
                 if error:
                     standard_log_attributes["error.type"] = error

-                # Handle redteam attack properties if present
-                if "properties" in event_data:
-                    properties = event_data["properties"]
-
+                # Handle evaluator-specific structured properties (red-team attack metadata,
+                # rubric dimension scores, etc.). Guard the whole block with an isinstance
+                # check so unexpected payload shapes (None, list, str, ...) cannot raise here.
+                properties = event_data.get("properties")
+                if isinstance(properties, dict):
                     if "attack_success" in properties:
                         internal_log_attributes["gen_ai.redteam.attack.success"] = str(properties["attack_success"])

@@ -1266,6 +1285,41 @@ def _log_events_to_app_insights(
                             properties["attack_success_threshold"]
                         )

+                    # Forward any other evaluator-specific structured properties (e.g. rubric
+                    # ``dimension_scores``) as a single JSON attribute so consumers can query
+                    # them in App Insights. Keys with dedicated named attributes above are
+                    # excluded to avoid duplicate emission.
+                    extra_properties = {
+                        k: v for k, v in properties.items() if k not in _DEDICATED_EVALUATION_PROPERTY_KEYS
+                    }
+                    if extra_properties:
+                        try:
+                            properties_json = json.dumps(extra_properties, default=str)
+                        except (TypeError, ValueError) as ex:
+                            LOGGER.warning(
+                                "Failed to serialize evaluator properties for App Insights: %s",
+                                ex,
+                            )
+                        else:
+                            if len(properties_json) > _MAX_EVALUATION_PROPERTIES_JSON_LEN:
+                                # Slicing the JSON string would produce an unterminated, invalid
+                                # payload that downstream ``json.loads`` consumers cannot parse.
+                                # Emit a small, valid JSON marker instead so consumers can detect
+                                # the drop and reason about it.
+                                LOGGER.warning(
+                                    "Evaluator properties JSON length %d exceeds %d; "
+                                    "dropping payload from App Insights and emitting truncation marker.",
+                                    len(properties_json),
+                                    _MAX_EVALUATION_PROPERTIES_JSON_LEN,
+                                )
+                                properties_json = json.dumps(
+                                    {
+                                        "truncated": True,
+                                        "original_size_bytes": len(properties_json),
+                                    }
+                                )
+                            internal_log_attributes["gen_ai.evaluation.properties"] = properties_json
+
                 # Add data source item attributes if present
                 if response_id:
                     standard_log_attributes["gen_ai.response.id"] = response_id
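
To experiment with the forwarding rules outside the SDK, the logic in this diff can be approximated as a pure function. This is a sketch under stated assumptions rather than the SDK's implementation: `build_properties_attribute`, `DEDICATED_KEYS` and `MAX_JSON_LEN` are hypothetical names that mirror the constants above, and the function returns the attribute value instead of writing it into `internal_log_attributes`.

```python
import json
import logging
from datetime import datetime
from typing import Any, Optional

LOGGER = logging.getLogger(__name__)

# Hypothetical local copies of the constants introduced in this diff.
DEDICATED_KEYS = frozenset(
    ("attack_success", "attack_technique", "attack_complexity", "attack_success_threshold")
)
MAX_JSON_LEN = 7500


def build_properties_attribute(properties: Any) -> Optional[str]:
    """Return the JSON string to emit, or None when nothing should be emitted."""
    # Non-dict payloads (None, list, str, ...) are ignored.
    if not isinstance(properties, dict):
        return None
    # Keys that already have dedicated attributes are not forwarded twice.
    extra = {k: v for k, v in properties.items() if k not in DEDICATED_KEYS}
    if not extra:
        return None
    try:
        # default=str stringifies values the json module cannot encode natively.
        serialized = json.dumps(extra, default=str)
    except (TypeError, ValueError) as ex:
        # default= does not apply to dict keys, so e.g. a tuple key still fails.
        LOGGER.warning("Failed to serialize evaluator properties: %s", ex)
        return None
    if len(serialized) > MAX_JSON_LEN:
        # Replace the oversized payload with a small, valid JSON marker.
        return json.dumps({"truncated": True, "original_size_bytes": len(serialized)})
    return serialized


print(build_properties_attribute({"attack_success": True, "extra_metric": 42}))
print(build_properties_attribute({"when": datetime(2024, 1, 1)}))
print(build_properties_attribute({("a", "b"): 1}))
print(build_properties_attribute({"big": "x" * 20000}))
```

Because `json.dumps` escapes non-ASCII characters by default, the character count stored in `original_size_bytes` equals the UTF-8 byte count of the serialized payload.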

sdk/evaluation/azure-ai-evaluation/tests/unittests/test_evaluate.py

Lines changed: 188 additions & 0 deletions
@@ -2498,6 +2498,194 @@ def test_token_usage_partial_only_prompt(self):
         assert "gen_ai.evaluation.usage.output_tokens" not in internal_props


+@pytest.mark.unittest
+@pytest.mark.skipif(MISSING_OPENTELEMETRY, reason="This test requires the opentelemetry package")
+class TestLogEventsProperties:
+    """Tests for evaluator ``properties`` forwarding in _log_events_to_app_insights."""
+
+    def _make_mock_event_logger(self):
+        emitted = []
+
+        class FakeEventLogger:
+            def emit(self, event):
+                emitted.append(event)
+
+        return FakeEventLogger(), emitted
+
+    def test_rubric_dimension_scores_forwarded_as_json(self):
+        """Rubric ``dimension_scores`` should be forwarded as a JSON attribute."""
+        event_logger, emitted = self._make_mock_event_logger()
+        dimension_scores = [
+            {"id": "resolution_progress", "score": 0.8, "applicable": True, "weight": 10, "reason": "ok"},
+            {"id": "general_quality", "score": 0.5, "applicable": True, "weight": 5, "reason": "fine"},
+        ]
+        events = [
+            {
+                "metric": "custom-rubric",
+                "score": 0.7,
+                "properties": {"dimension_scores": dimension_scores},
+            }
+        ]
+        app_insights_config = {"connection_string": "fake"}
+
+        _log_events_to_app_insights(
+            event_logger=event_logger,
+            events=events,
+            log_attributes={},
+            app_insights_config=app_insights_config,
+        )
+
+        assert len(emitted) == 1
+        attrs = emitted[0].attributes
+        internal_props = json.loads(attrs["internal_properties"])
+        assert "gen_ai.evaluation.properties" in internal_props
+        payload = json.loads(internal_props["gen_ai.evaluation.properties"])
+        assert payload == {"dimension_scores": dimension_scores}
+
+    def test_redteam_keys_excluded_from_generic_properties(self):
+        """Dedicated red-team keys should keep their own attributes and NOT appear in the JSON blob."""
+        event_logger, emitted = self._make_mock_event_logger()
+        events = [
+            {
+                "metric": "redteam",
+                "properties": {
+                    "attack_success": True,
+                    "attack_technique": "jailbreak",
+                    "attack_complexity": "easy",
+                    "attack_success_threshold": 0.5,
+                    "extra_metric": 42,
+                },
+            }
+        ]
+        app_insights_config = {"connection_string": "fake"}
+
+        _log_events_to_app_insights(
+            event_logger=event_logger,
+            events=events,
+            log_attributes={},
+            app_insights_config=app_insights_config,
+        )
+
+        assert len(emitted) == 1
+        internal_props = json.loads(emitted[0].attributes["internal_properties"])
+        # Dedicated red-team attributes stay as before
+        assert internal_props["gen_ai.redteam.attack.success"] == "True"
+        assert internal_props["gen_ai.redteam.attack.technique"] == "jailbreak"
+        assert internal_props["gen_ai.redteam.attack.complexity"] == "easy"
+        assert internal_props["gen_ai.redteam.attack.success_threshold"] == "0.5"
+        # Generic forwarder carries only the non-dedicated keys
+        payload = json.loads(internal_props["gen_ai.evaluation.properties"])
+        assert payload == {"extra_metric": 42}
+
+    def test_no_generic_attribute_when_only_redteam_keys_present(self):
+        """When properties contains only dedicated red-team keys, no generic JSON attribute is emitted."""
+        event_logger, emitted = self._make_mock_event_logger()
+        events = [
+            {
+                "metric": "redteam",
+                "properties": {
+                    "attack_success": True,
+                    "attack_technique": "jailbreak",
+                },
+            }
+        ]
+        app_insights_config = {"connection_string": "fake"}
+
+        _log_events_to_app_insights(
+            event_logger=event_logger,
+            events=events,
+            log_attributes={},
+            app_insights_config=app_insights_config,
+        )
+
+        assert len(emitted) == 1
+        internal_props = json.loads(emitted[0].attributes["internal_properties"])
+        assert "gen_ai.evaluation.properties" not in internal_props
+
+    def test_no_generic_attribute_when_properties_absent(self):
+        """No ``gen_ai.evaluation.properties`` attribute when event has no ``properties`` key."""
+        event_logger, emitted = self._make_mock_event_logger()
+        events = [{"metric": "coherence", "score": 4.5}]
+        app_insights_config = {"connection_string": "fake"}
+
+        _log_events_to_app_insights(
+            event_logger=event_logger,
+            events=events,
+            log_attributes={},
+            app_insights_config=app_insights_config,
+        )
+
+        assert len(emitted) == 1
+        internal_props = json.loads(emitted[0].attributes["internal_properties"])
+        assert "gen_ai.evaluation.properties" not in internal_props
+
+    def test_non_dict_properties_does_not_emit_generic_attribute(self):
+        """Non-dict ``properties`` (e.g. a string) should be skipped without raising."""
+        event_logger, emitted = self._make_mock_event_logger()
+        events = [{"metric": "coherence", "score": 4.5, "properties": "not a dict"}]
+        app_insights_config = {"connection_string": "fake"}
+
+        _log_events_to_app_insights(
+            event_logger=event_logger,
+            events=events,
+            log_attributes={},
+            app_insights_config=app_insights_config,
+        )
+
+        assert len(emitted) == 1
+        internal_props = json.loads(emitted[0].attributes["internal_properties"])
+        assert "gen_ai.evaluation.properties" not in internal_props
+
+    @pytest.mark.parametrize("bad_properties", [None, 42, "not a dict", ["attack_success"], ("attack_success",)])
+    def test_unexpected_properties_shape_does_not_crash(self, bad_properties):
+        """Non-dict ``properties`` shapes must not crash the red-team or generic forwarders."""
+        event_logger, emitted = self._make_mock_event_logger()
+        events = [{"metric": "redteam", "properties": bad_properties}]
+        app_insights_config = {"connection_string": "fake"}
+
+        _log_events_to_app_insights(
+            event_logger=event_logger,
+            events=events,
+            log_attributes={},
+            app_insights_config=app_insights_config,
+        )
+
+        assert len(emitted) == 1
+        internal_props = json.loads(emitted[0].attributes["internal_properties"])
+        # No red-team or generic attributes should be derived from a non-dict payload.
+        assert "gen_ai.evaluation.properties" not in internal_props
+        assert not any(k.startswith("gen_ai.redteam.") for k in internal_props)
+
+    def test_oversized_properties_payload_emits_valid_json_truncation_marker(self):
+        """Oversized properties payloads are replaced with a valid JSON truncation marker."""
+        event_logger, emitted = self._make_mock_event_logger()
+        big_value = "x" * 20000  # well over the 7500 char cap
+        events = [
+            {
+                "metric": "rubric",
+                "properties": {"big": big_value},
+            }
+        ]
+        app_insights_config = {"connection_string": "fake"}
+
+        _log_events_to_app_insights(
+            event_logger=event_logger,
+            events=events,
+            log_attributes={},
+            app_insights_config=app_insights_config,
+        )
+
+        assert len(emitted) == 1
+        internal_props = json.loads(emitted[0].attributes["internal_properties"])
+        assert "gen_ai.evaluation.properties" in internal_props
+        # The emitted value must remain valid, parseable JSON.
+        marker = json.loads(internal_props["gen_ai.evaluation.properties"])
+        assert marker["truncated"] is True
+        assert marker["original_size_bytes"] > 7500
+        # Marker itself must be well under the cap.
+        assert len(internal_props["gen_ai.evaluation.properties"]) <= 7500
+
+
 class TestAdjustForInverseMetric:
     """Tests for _adjust_for_inverse_metric handling of boolean labels."""
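
The parametrized test above feeds `None`, `42`, a string, a list and a tuple through the forwarder. The sketch below illustrates, in isolation, how the pre-change membership-then-subscript pattern reacts to those shapes compared with the guarded version. `old_style_lookup`, `guarded_lookup` and `outcome` are hypothetical stand-ins, not SDK functions. Whether such an error actually propagated out of `_log_events_to_app_insights` depends on surrounding error handling that is not shown in this diff.

```python
def old_style_lookup(properties):
    """Mimic the removed pattern: membership test, then subscript, with no type guard."""
    if "attack_success" in properties:
        return str(properties["attack_success"])
    return None


def guarded_lookup(properties):
    """Mimic the new pattern: only touch the payload when it is a dict."""
    if isinstance(properties, dict) and "attack_success" in properties:
        return str(properties["attack_success"])
    return None


def outcome(fn, value):
    """Run fn(value) and report a TypeError by name instead of raising."""
    try:
        return fn(value)
    except TypeError:
        return "TypeError"


# The same shapes the parametrized test uses.
shapes = [None, 42, "not a dict", ["attack_success"], ("attack_success",)]
old = [outcome(old_style_lookup, s) for s in shapes]
new = [outcome(guarded_lookup, s) for s in shapes]

for shape, before, after in zip(shapes, old, new):
    print(repr(shape), "| unguarded:", before, "| guarded:", after)
```

`None` and `42` fail at the `in` test, the list and tuple pass the `in` test and then fail at the string subscript, and only the plain string slips through the unguarded version without an error.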
