azure-ai-evaluation: forward evaluator properties to App Insights (#46942)

aprilk-ms · Copilot · web-flow · commit 063446bc757e · 2026-05-17T23:19:10.000-07:00
* azure-ai-evaluation: forward evaluator properties to App Insights

Add a generic `gen_ai.evaluation.properties` JSON attribute (inside
`internal_properties`) that carries arbitrary evaluator-specific keys
from each event's `properties` payload to App Insights.

Previously `_log_events_to_app_insights` only forwarded four red-team
keys (`attack_success`, `attack_technique`, `attack_complexity`,
`attack_success_threshold`). Structured evaluator outputs such as the
rubric evaluator's `dimension_scores` were silently dropped before
emission, breaking downstream Kusto/dashboards that rely on the
per-dimension scores being queryable.

Behavior changes:

* Keys already covered by dedicated red-team attributes are excluded
  from the generic JSON blob to avoid duplicate emission.
* Non-dict `properties` values are skipped defensively without
  raising.
* Payloads larger than 7500 characters are truncated to stay under
  the App Insights attribute value limit (~8 KiB).

Tests added under `TestLogEventsProperties` exercising the rubric,
red-team coexistence, absent/empty/non-dict, and truncation paths.
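
For readers skimming the message, the first two behavior changes (key
exclusion and the non-dict guard) can be sketched roughly as below. This
is a standalone illustration, not the code in `_evaluate.py`: the helper
name `extra_properties_json` and the constant `DEDICATED_KEYS` are
hypothetical, and the size cap is deliberately left out.

```python
import json

# Hypothetical stand-in for the module-level frozenset of keys that already
# have dedicated red-team attributes.
DEDICATED_KEYS = frozenset(
    ("attack_success", "attack_technique", "attack_complexity", "attack_success_threshold")
)


def extra_properties_json(properties):
    """Return the JSON string to forward, or None when nothing should be emitted."""
    # Non-dict payloads (None, str, list, ...) are skipped instead of raising.
    if not isinstance(properties, dict):
        return None
    # Drop keys that are already emitted as dedicated named attributes.
    extra = {k: v for k, v in properties.items() if k not in DEDICATED_KEYS}
    if not extra:
        return None
    try:
        # default=str stringifies values the json module cannot encode natively.
        return json.dumps(extra, default=str)
    except (TypeError, ValueError):
        return None


print(extra_properties_json({"attack_success": True, "dimension_scores": [1]}))
print(extra_properties_json("not a dict"))
```

Returning `None` for "nothing to emit" keeps the caller's decision to set
or skip the attribute in one place.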

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address Copilot review feedback

- Hoist isinstance(properties, dict) guard over the entire properties
  block so unexpected payload shapes (None, list, str, ...) cannot crash
  the existing red-team forwarder.
- Replace string-slice truncation with a valid JSON marker
  ({"truncated": true, "original_size_bytes": N}) so downstream
  json.loads consumers can always parse gen_ai.evaluation.properties.
- Update CHANGELOG to describe new truncation marker behavior.
- Add parametrized test exercising None/int/str/list/tuple property
  shapes; update oversize test to assert the emitted JSON is parseable
  and contains the truncation marker.
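
A rough sketch of the marker behavior and of how a consumer might read
the attribute, assuming the double JSON encoding exercised by the tests
(`internal_properties` is a JSON string whose
`gen_ai.evaluation.properties` value is itself a JSON string). The
helper names `cap_payload` and `read_properties` are hypothetical and
not part of the SDK.

```python
import json

MAX_LEN = 7500  # mirrors the cap described in this change


def cap_payload(payload_json, max_len=MAX_LEN):
    """Replace an oversized JSON string with a small, valid JSON marker."""
    if len(payload_json) <= max_len:
        return payload_json
    # Note: despite the key name, the size recorded here is len() of the
    # string, i.e. a character count rather than an encoded byte count.
    return json.dumps({"truncated": True, "original_size_bytes": len(payload_json)})


def read_properties(internal_properties_json):
    """Return (payload, was_truncated) for one event's internal_properties."""
    internal = json.loads(internal_properties_json)
    raw = internal.get("gen_ai.evaluation.properties")
    if raw is None:
        return None, False
    payload = json.loads(raw)
    # Heuristic check: a genuine payload that happens to carry both keys
    # would be indistinguishable from the marker.
    truncated = payload.get("truncated") is True and "original_size_bytes" in payload
    return payload, truncated


big = json.dumps({"big": "x" * 20000})
internal = json.dumps({"gen_ai.evaluation.properties": cap_payload(big)})
payload, truncated = read_properties(internal)
print(truncated, payload["original_size_bytes"] == len(big))
```

Because the marker is ordinary JSON, the consumer path never needs a
special case for unparseable values.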

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Run black on _evaluate.py to satisfy CI lint

CI 'Run Black' check failed because the dict-comprehension at line 1290 fit on a single 120-char line. No semantic change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
diff --git a/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md b/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md
@@ -9,6 +9,7 @@
 - Added `status` field (`"completed"`, `"error"`, `"skipped"`) on evaluation result items to indicate evaluator execution outcome.
 - Added `skipped` and `errored` counts to `result_counts` and `per_testing_criteria_results` in AOAI evaluation summaries.
 - Added `skipped` to `ResultCount` and `skipped`/`errored` to `PerTestingCriteriaResult` typed contracts.
+- App Insights logging now forwards arbitrary evaluator-specific keys from each event's `properties` payload as a single `gen_ai.evaluation.properties` JSON attribute (carried inside `internal_properties`). Previously only the four red-team keys (`attack_success`, `attack_technique`, `attack_complexity`, `attack_success_threshold`) were forwarded; structured outputs such as rubric `dimension_scores` were silently dropped. Payloads larger than 7500 characters are replaced with a valid JSON marker (`{"truncated": true, "original_size_bytes": <n>}`) so consumers can always `json.loads` the value. Non-dict `properties` payloads are now safely ignored instead of raising in the red-team forwarder.
 
 ### Breaking Changes
 
diff --git a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py
@@ -73,6 +73,24 @@
     "groundedness_pro_label": "groundedness_pro_passing_rate",
 }
 
+# Property keys that already have dedicated named attributes in
+# ``_log_events_to_app_insights``. They are excluded from the generic
+# ``gen_ai.evaluation.properties`` JSON forwarder so the data is not emitted
+# twice.
+_DEDICATED_EVALUATION_PROPERTY_KEYS = frozenset(
+    (
+        "attack_success",
+        "attack_technique",
+        "attack_complexity",
+        "attack_success_threshold",
+    )
+)
+
+# Maximum serialized length of the ``gen_ai.evaluation.properties`` attribute
+# payload. App Insights caps individual attribute values around 8 KiB; we
+# truncate slightly below that to leave headroom for OTel framing.
+_MAX_EVALUATION_PROPERTIES_JSON_LEN = 7500
+
 
 class __EvaluatorInfo(TypedDict):
     result: pd.DataFrame
@@ -1246,10 +1264,11 @@ def _log_events_to_app_insights(
                 if error:
                     standard_log_attributes["error.type"] = error
 
-                # Handle redteam attack properties if present
-                if "properties" in event_data:
-                    properties = event_data["properties"]
-
+                # Handle evaluator-specific structured properties (red-team attack metadata,
+                # rubric dimension scores, etc.). Guard the whole block with an isinstance
+                # check so unexpected payload shapes (None, list, str, ...) cannot raise here.
+                properties = event_data.get("properties")
+                if isinstance(properties, dict):
                     if "attack_success" in properties:
                         internal_log_attributes["gen_ai.redteam.attack.success"] = str(properties["attack_success"])
 
@@ -1266,6 +1285,41 @@ def _log_events_to_app_insights(
                             properties["attack_success_threshold"]
                         )
 
+                    # Forward any other evaluator-specific structured properties (e.g. rubric
+                    # ``dimension_scores``) as a single JSON attribute so consumers can query
+                    # them in App Insights. Keys with dedicated named attributes above are
+                    # excluded to avoid duplicate emission.
+                    extra_properties = {
+                        k: v for k, v in properties.items() if k not in _DEDICATED_EVALUATION_PROPERTY_KEYS
+                    }
+                    if extra_properties:
+                        try:
+                            properties_json = json.dumps(extra_properties, default=str)
+                        except (TypeError, ValueError) as ex:
+                            LOGGER.warning(
+                                "Failed to serialize evaluator properties for App Insights: %s",
+                                ex,
+                            )
+                        else:
+                            if len(properties_json) > _MAX_EVALUATION_PROPERTIES_JSON_LEN:
+                                # Slicing the JSON string would produce an unterminated, invalid
+                                # payload that downstream ``json.loads`` consumers cannot parse.
+                                # Emit a small, valid JSON marker instead so consumers can detect
+                                # the drop and reason about it.
+                                LOGGER.warning(
+                                    "Evaluator properties JSON length %d exceeds %d; "
+                                    "dropping payload from App Insights and emitting truncation marker.",
+                                    len(properties_json),
+                                    _MAX_EVALUATION_PROPERTIES_JSON_LEN,
+                                )
+                                properties_json = json.dumps(
+                                    {
+                                        "truncated": True,
+                                        "original_size_bytes": len(properties_json),
+                                    }
+                                )
+                            internal_log_attributes["gen_ai.evaluation.properties"] = properties_json
+
                 # Add data source item attributes if present
                 if response_id:
                     standard_log_attributes["gen_ai.response.id"] = response_id
diff --git a/sdk/evaluation/azure-ai-evaluation/tests/unittests/test_evaluate.py b/sdk/evaluation/azure-ai-evaluation/tests/unittests/test_evaluate.py
@@ -2498,6 +2498,194 @@ def test_token_usage_partial_only_prompt(self):
         assert "gen_ai.evaluation.usage.output_tokens" not in internal_props
 
 
+@pytest.mark.unittest
+@pytest.mark.skipif(MISSING_OPENTELEMETRY, reason="This test requires the opentelemetry package")
+class TestLogEventsProperties:
+    """Tests for evaluator ``properties`` forwarding in _log_events_to_app_insights."""
+
+    def _make_mock_event_logger(self):
+        emitted = []
+
+        class FakeEventLogger:
+            def emit(self, event):
+                emitted.append(event)
+
+        return FakeEventLogger(), emitted
+
+    def test_rubric_dimension_scores_forwarded_as_json(self):
+        """Rubric ``dimension_scores`` should be forwarded as a JSON attribute."""
+        event_logger, emitted = self._make_mock_event_logger()
+        dimension_scores = [
+            {"id": "resolution_progress", "score": 0.8, "applicable": True, "weight": 10, "reason": "ok"},
+            {"id": "general_quality", "score": 0.5, "applicable": True, "weight": 5, "reason": "fine"},
+        ]
+        events = [
+            {
+                "metric": "custom-rubric",
+                "score": 0.7,
+                "properties": {"dimension_scores": dimension_scores},
+            }
+        ]
+        app_insights_config = {"connection_string": "fake"}
+
+        _log_events_to_app_insights(
+            event_logger=event_logger,
+            events=events,
+            log_attributes={},
+            app_insights_config=app_insights_config,
+        )
+
+        assert len(emitted) == 1
+        attrs = emitted[0].attributes
+        internal_props = json.loads(attrs["internal_properties"])
+        assert "gen_ai.evaluation.properties" in internal_props
+        payload = json.loads(internal_props["gen_ai.evaluation.properties"])
+        assert payload == {"dimension_scores": dimension_scores}
+
+    def test_redteam_keys_excluded_from_generic_properties(self):
+        """Dedicated red-team keys should keep their own attributes and NOT appear in the JSON blob."""
+        event_logger, emitted = self._make_mock_event_logger()
+        events = [
+            {
+                "metric": "redteam",
+                "properties": {
+                    "attack_success": True,
+                    "attack_technique": "jailbreak",
+                    "attack_complexity": "easy",
+                    "attack_success_threshold": 0.5,
+                    "extra_metric": 42,
+                },
+            }
+        ]
+        app_insights_config = {"connection_string": "fake"}
+
+        _log_events_to_app_insights(
+            event_logger=event_logger,
+            events=events,
+            log_attributes={},
+            app_insights_config=app_insights_config,
+        )
+
+        assert len(emitted) == 1
+        internal_props = json.loads(emitted[0].attributes["internal_properties"])
+        # Dedicated red-team attributes stay as before
+        assert internal_props["gen_ai.redteam.attack.success"] == "True"
+        assert internal_props["gen_ai.redteam.attack.technique"] == "jailbreak"
+        assert internal_props["gen_ai.redteam.attack.complexity"] == "easy"
+        assert internal_props["gen_ai.redteam.attack.success_threshold"] == "0.5"
+        # Generic forwarder carries only the non-dedicated keys
+        payload = json.loads(internal_props["gen_ai.evaluation.properties"])
+        assert payload == {"extra_metric": 42}
+
+    def test_no_generic_attribute_when_only_redteam_keys_present(self):
+        """When properties contains only dedicated red-team keys, no generic JSON attribute is emitted."""
+        event_logger, emitted = self._make_mock_event_logger()
+        events = [
+            {
+                "metric": "redteam",
+                "properties": {
+                    "attack_success": True,
+                    "attack_technique": "jailbreak",
+                },
+            }
+        ]
+        app_insights_config = {"connection_string": "fake"}
+
+        _log_events_to_app_insights(
+            event_logger=event_logger,
+            events=events,
+            log_attributes={},
+            app_insights_config=app_insights_config,
+        )
+
+        assert len(emitted) == 1
+        internal_props = json.loads(emitted[0].attributes["internal_properties"])
+        assert "gen_ai.evaluation.properties" not in internal_props
+
+    def test_no_generic_attribute_when_properties_absent(self):
+        """No ``gen_ai.evaluation.properties`` attribute when event has no ``properties`` key."""
+        event_logger, emitted = self._make_mock_event_logger()
+        events = [{"metric": "coherence", "score": 4.5}]
+        app_insights_config = {"connection_string": "fake"}
+
+        _log_events_to_app_insights(
+            event_logger=event_logger,
+            events=events,
+            log_attributes={},
+            app_insights_config=app_insights_config,
+        )
+
+        assert len(emitted) == 1
+        internal_props = json.loads(emitted[0].attributes["internal_properties"])
+        assert "gen_ai.evaluation.properties" not in internal_props
+
+    def test_non_dict_properties_does_not_emit_generic_attribute(self):
+        """Non-dict ``properties`` (e.g. a string) should be skipped without raising."""
+        event_logger, emitted = self._make_mock_event_logger()
+        events = [{"metric": "coherence", "score": 4.5, "properties": "not a dict"}]
+        app_insights_config = {"connection_string": "fake"}
+
+        _log_events_to_app_insights(
+            event_logger=event_logger,
+            events=events,
+            log_attributes={},
+            app_insights_config=app_insights_config,
+        )
+
+        assert len(emitted) == 1
+        internal_props = json.loads(emitted[0].attributes["internal_properties"])
+        assert "gen_ai.evaluation.properties" not in internal_props
+
+    @pytest.mark.parametrize("bad_properties", [None, 42, "not a dict", ["attack_success"], ("attack_success",)])
+    def test_unexpected_properties_shape_does_not_crash(self, bad_properties):
+        """Non-dict ``properties`` shapes must not crash the red-team or generic forwarders."""
+        event_logger, emitted = self._make_mock_event_logger()
+        events = [{"metric": "redteam", "properties": bad_properties}]
+        app_insights_config = {"connection_string": "fake"}
+
+        _log_events_to_app_insights(
+            event_logger=event_logger,
+            events=events,
+            log_attributes={},
+            app_insights_config=app_insights_config,
+        )
+
+        assert len(emitted) == 1
+        internal_props = json.loads(emitted[0].attributes["internal_properties"])
+        # No red-team or generic attributes should be derived from a non-dict payload.
+        assert "gen_ai.evaluation.properties" not in internal_props
+        assert not any(k.startswith("gen_ai.redteam.") for k in internal_props)
+
+    def test_oversized_properties_payload_emits_valid_json_truncation_marker(self):
+        """Oversized properties payloads are replaced with a valid JSON truncation marker."""
+        event_logger, emitted = self._make_mock_event_logger()
+        big_value = "x" * 20000  # well over the 7500 char cap
+        events = [
+            {
+                "metric": "rubric",
+                "properties": {"big": big_value},
+            }
+        ]
+        app_insights_config = {"connection_string": "fake"}
+
+        _log_events_to_app_insights(
+            event_logger=event_logger,
+            events=events,
+            log_attributes={},
+            app_insights_config=app_insights_config,
+        )
+
+        assert len(emitted) == 1
+        internal_props = json.loads(emitted[0].attributes["internal_properties"])
+        assert "gen_ai.evaluation.properties" in internal_props
+        # The emitted value must remain valid, parseable JSON.
+        marker = json.loads(internal_props["gen_ai.evaluation.properties"])
+        assert marker["truncated"] is True
+        assert marker["original_size_bytes"] > 7500
+        # Marker itself must be well under the cap.
+        assert len(internal_props["gen_ai.evaluation.properties"]) <= 7500
+
+
 class TestAdjustForInverseMetric:
     """Tests for _adjust_for_inverse_metric handling of boolean labels."""