Commit b06269f
Standardize Output Schema for Evaluators (#46436)
* Update Tool Call Accuracy to output unified format
* Update tests
* reformatting
* Refactor not applicable result method calls
* Fix test assertions for new unified output format and apply black formatting (#46336)
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/23f40ca5-7114-46ec-89be-a369e38ac971
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
* Rename tool_call_accuracy reasoning output to reason and update skipped properties handling (#46355)
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/89b3b528-f2ac-4284-88fb-c484d4c0cce1
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
* Fix tool call accuracy test for skipped output schema (#46356)
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8ab1c161-c24f-4272-95ff-c8e595089e22
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
* Standardize Output Schema
* Add explicit _KEY_PREFIX/_RESULT_KEY
* add missing evaluators to init
* Align evaluator unit tests with new unified output schema
* Update recordings tag to solve e2e tests
* Run formatting
* Align evaluator unit tests with unified output schema and refresh recordings
* Restore legacy `_result` and bare evaluator-name keys for backward compat
* resolve conflict
* Refresh azure-ai-evaluation test recordings for standardized evaluator output schema
* Update multimodal test assertion for new schema and refresh recordings tag
* Remove unused label assignment in navigation efficiency
Remove assignment of match_result to additional_properties_metrics['label']
* update _return_not_applicable_result
* Return "not_applicable" instead of "pass"
* update evaluators
* Fix error
* Add results back
* undo unrelated change
* undo key_prefix change
* Revert `_evaluate.py` changes from #46436 on `mohessie/standardize_output_schema` (#46835)
* Initial plan
* Revert _evaluate.py changes from PR 46436 by restoring file from main
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8462065c-c6cf-473a-9421-84eaf0a44b5b
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
* update tool_selection prompty
* Fix evaluation unit tests: replace `_KEY_PREFIX` with `_RESULT_KEY` across 7 test files (#46852)
* Initial plan
* Fix evaluation unit test failures: replace _KEY_PREFIX with _RESULT_KEY and align test expectations
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/b75cef24-3217-4d44-a0ad-51d690e90035
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
* reformatting
* Fix rouge KeyError and inject _passed key in base evaluator
Two fixes for failing e2e tests on standardize_output_schema PR:
1. _rouge.py: '*_result' keys were used to index binary_results dict, but _get_binary_result() returns '*_passed' keys. Fixes 6 test_math_evaluator_rouge_score tests that failed with KeyError.
2. _base_eval.py: _real_call post-processing now auto-injects '*_passed' boolean keys (alongside '*_result' and '*_threshold') when only '*_score' is present. Fixes 6 multimodal content-safety tests expecting 103 output columns including new '_passed' fields.
* Fix key errors
* update test records
* Update recordings
* Fix result key assignment in base prompt evaluation
* Change 'reasoning' to 'reason' in evaluation prompt
* Update _document_retrieval.py
* Update task instruction from 'reasoning' to 'reason'
* update records
* Add ndcg_score to document retrieval results
* Align evaluator metric mapping for standardized single-metric outputs (#46900)
* Initial plan
* Align evaluator metric mappings with single-metric output schema
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/d4c1d0ce-2425-4757-a52b-9fcad1734be7
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
---------
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
1 parent d2f0225 commit b06269f
53 files changed
Lines changed: 1180 additions & 930 deletions
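The commit message notes that `_real_call` post-processing in `_base_eval.py` now auto-injects `*_passed` boolean keys (alongside `*_result` and `*_threshold`) when only `*_score` is present. The diff content is not shown here, so the following is only a minimal sketch of that idea, not the SDK's implementation. The function name `inject_binary_keys`, its parameters, the `"fail"` label, and the `higher_is_better` comparison rule are all assumptions made for illustration.

```python
from typing import Any, Dict


def inject_binary_keys(
    result: Dict[str, Any],
    threshold: float,
    higher_is_better: bool = True,
) -> Dict[str, Any]:
    """Hypothetical helper (not the SDK API).

    For every '<metric>_score' key holding a number, add the companion keys
    '<metric>_passed', '<metric>_result' and '<metric>_threshold' unless the
    evaluator already supplied them.
    """
    out = dict(result)  # copy so the caller's dict is left untouched
    for key, value in result.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not key.endswith("_score") or not is_number:
            continue
        prefix = key[: -len("_score")]
        # Assumed rule: compare the score against a single threshold.
        passed = value >= threshold if higher_is_better else value <= threshold
        # setdefault keeps any value the evaluator already produced.
        out.setdefault(f"{prefix}_passed", passed)
        out.setdefault(f"{prefix}_result", "pass" if passed else "fail")
        out.setdefault(f"{prefix}_threshold", threshold)
    return out


if __name__ == "__main__":
    raw = {"rouge_precision_score": 0.8}
    enriched = inject_binary_keys(raw, threshold=0.5)
    # Downstream code can index '*_passed' without a KeyError.
    print(enriched["rouge_precision_passed"])
```

Using `setdefault` means a value the evaluator already set, such as a `"not_applicable"` result, is not overwritten. Whether the real post-processing behaves that way is not confirmed by the commit message.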
File tree
- sdk/evaluation/azure-ai-evaluation
- azure/ai/evaluation
- _evaluators
- _bleu
- _coherence
- _common
- _document_retrieval
- _f1_score
- _fluency
- _gleu
- _groundedness
- _intent_resolution
- _meteor
- _relevance
- _response_completeness
- _retrieval
- _rouge
- _similarity
- _task_adherence
- _task_completion
- _tool_call_accuracy
- _tool_call_success
- _tool_input_accuracy
- _tool_output_utilization
- _tool_selection
- tests
- e2etests
- unittests
- test_evaluators
Lines changed: 2 additions & 12 deletions
Lines changed: 6 additions & 1 deletion
Lines changed: 8 additions & 7 deletions
Lines changed: 35 additions & 27 deletions
Lines changed: 67 additions & 83 deletions