Add multi-turn quality tests for coherence, customer_satisfaction, groundedness, task_completion evaluators by AliMahmoudzadeh · Pull Request #5072 · Azure/azureml-assets

AliMahmoudzadeh · 2026-05-20T20:39:44Z

Summary

Adds 16 multi-turn quality tests across 4 evaluators that validate real LLM evaluation with Azure OpenAI (no mocking). Each test trace was validated across 7 judge models with 100% pass rate.

Tests Added

Evaluator	Tests	Cases
Coherence	4	fail (incoherent), pass (minor tangent), pass (perfect flow), skip (user derails)
Customer Satisfaction	4	fail (dismissive), fail (curt with tool calls), edge (generic advice), pass (full resolution)
Groundedness	5	fail (fabricated meds), fail (contradicts tool), pass (clarifying only), pass (correct but incomplete), pass (fully grounded)
Task Completion	3	pass (flight+hotel with tools), fail (missing charts), fail (ignores refinement)

Key Implementation Details

*
ormalize_messages_for_evaluator()* helper in \common_test_data.py\ converts OpenAI-format tool call messages to Azure AI Evaluation content-block format
Coherence & Groundedness test classes override \�xpected_result_fields\ and _extract_and_print_result\ because their multi-turn _build_result\ omits the _passed\ field
Coherence skip test overrides \�ssert_not_applicable\ since the LLM reason doesn't contain the fixed 'not applicable' string

Files Changed

\common_test_data.py\ added
ormalize_messages_for_evaluator()\ helper
_init_.py\ registered 4 new multi-turn test classes
4 new test files (\ est_*_multi_turn.py)

Test Results

All 16 tests pass locally with both gpt-5-nano (~3.5 min) and gpt-5.4 (~2.5 min).

…oundedness, task_completion evaluators Add 16 multi-turn quality tests (4 coherence, 4 CSAT, 5 groundedness, 3 task_completion) that validate real LLM evaluation with Azure OpenAI (no mocking). Test traces validated across 7 judge models with 100% pass rate. Key implementation details: - normalize_messages_for_evaluator() helper converts OpenAI-format tool call messages to Azure AI Evaluation content-block format - Coherence/groundedness test classes override expected_result_fields (multi-turn _build_result omits _passed) and _extract_and_print_result (derives passed from label) - Coherence skip test overrides assert_not_applicable (LLM reason doesn't contain fixed 'not applicable' string) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-05-20T20:42:05Z

Test Results for assets-test

0 tests 0 ✅ 0s ⏱️
0 suites 0 💤
0 files 0 ❌

Results for commit 8b509dc.

♻️ This comment has been updated with latest results.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

AliMahmoudzadeh requested review from a team as code owners May 20, 2026 20:39

AliMahmoudzadeh temporarily deployed to Testing May 20, 2026 20:40 — with GitHub Actions Inactive

AliMahmoudzadeh temporarily deployed to Testing May 20, 2026 20:42 — with GitHub Actions Inactive

Fix lint: remove unused import, wrap long line

b6ad2ba

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

AliMahmoudzadeh temporarily deployed to Testing May 20, 2026 21:45 — with GitHub Actions Inactive

salma-elshafey approved these changes May 21, 2026

View reviewed changes

Merge branch 'main' into amah/multi-turn-quality-tests

8b509dc

AliMahmoudzadeh temporarily deployed to Testing May 21, 2026 18:32 — with GitHub Actions Inactive

AliMahmoudzadeh temporarily deployed to Testing May 21, 2026 18:33 — with GitHub Actions Inactive

AliMahmoudzadeh merged commit f202b50 into main May 21, 2026
38 checks passed

AliMahmoudzadeh deleted the amah/multi-turn-quality-tests branch May 21, 2026 19:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add multi-turn quality tests for coherence, customer_satisfaction, groundedness, task_completion evaluators#5072

Add multi-turn quality tests for coherence, customer_satisfaction, groundedness, task_completion evaluators#5072
AliMahmoudzadeh merged 3 commits into
mainfrom
amah/multi-turn-quality-tests

AliMahmoudzadeh commented May 20, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 20, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AliMahmoudzadeh commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Tests Added

Key Implementation Details

Files Changed

Test Results

Uh oh!

github-actions Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results for assets-test

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AliMahmoudzadeh commented May 20, 2026 •

edited

Loading

github-actions Bot commented May 20, 2026 •

edited

Loading