[Evaluation] Fix App Insights emission silently dropping all events when evaluator definition has no "metrics"#47175
Draft
slister1001 wants to merge 1 commit into
Draft
Conversation
…inition has no `metrics`
Rubric evaluators registered without metric metadata produce an evaluator
definition payload of `{"type": "rubric"}` (RAISvc's `RubricBasedEvaluatorDefinition.Metrics`
defaults to empty and validation does not require it). In
`_build_internal_log_attributes` the helper called
`evaluator_definition.get("metrics").get(metric_name)` which raises
`AttributeError: 'NoneType' object has no attribute 'get'` when `metrics`
is missing or set to None. That exception is caught by the per-event
`try/except` in `_log_events_to_app_insights` and silently swallowed, so
`event_logger.emit()` is never called for any event in the run. The net
effect is that App Insights receives zero `gen_ai.evaluation.result` events
for the entire eval — observed in production for `rubric-manual-260526043804-e45a09`
in the westus2 bug bash project, while other rubric evaluators in the same
workspace that had populated metric metadata continued to emit normally.
Guard the metrics lookup so a single misshapen definition does not abort
emission for the whole run.
Added regression tests for `_build_internal_log_attributes` covering:
missing `metrics` key, `metrics: None`, `metrics: {}`, `metrics: [...]`
(malformed type), and the happy path where metric metadata is preserved.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ea539db to
31160ea
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a silent App Insights emission bug where rubric evaluators registered without metric metadata produced zero
gen_ai.evaluation.resultevents for the entire run, even though the eval itself completed successfully.Root cause
In
_build_internal_log_attributes(_evaluate.py:1138):RAISvc's
RubricBasedEvaluatorDefinition.Metricsdefaults to empty and is not required byValidate(), so a rubric registered without metric metadata is sent as{"type": "rubric"}— nometricskey at all.evaluator_definition.get("metrics")returnsNone, thenNone.get(...)raises:That exception is caught by the per-event
try/exceptat_log_events_to_app_insightsline 1287 and silently swallowed beforeevent_logger.emit()is reached. Because every event in the run hits the same code path with the same evaluator definition, every event is dropped — App Insights receives nothing.Observed in production: rubric eval
rubric-manual-260526043804-e45a09in the westus2 bug bash project had zero AppInsights events while other rubric evals in the same workspace (e.g.,aprilk-repro) that had populated metric metadata emitted normally.Fix
Guard the metrics lookup with
isinstance(metrics_section, dict)so a single misshapen definition does not abort emission for the entire run. The event is still emitted; only the optionalmin_value/max_value/desirable_direction/typeattributes are absent for evaluators that did not register them.Tests
Added
TestBuildInternalLogAttributesEvaluatorDefinitionwith 5 regression cases:test_definition_without_metrics_key_does_not_raise— the exact failing payload ({"type": "rubric"})test_definition_with_metrics_none_does_not_raisetest_definition_with_metrics_empty_dict_does_not_raisetest_definition_with_metrics_list_does_not_raise— defensive against future malformed typestest_definition_with_metric_metadata_still_populates_attributes— happy path unchangedAll 10 tests in
TestBuildInternalLogAttributesThreshold+TestBuildInternalLogAttributesEvaluatorDefinitionpass.Validation
tox run -e black -c ../../../eng/tox/tox.ini --root .— cleanpytest tests/unittests/test_evaluate.py::TestBuildInternalLogAttributesEvaluatorDefinition tests/unittests/test_evaluate.py::TestBuildInternalLogAttributesThreshold -v— 10 passed