Skip to content

Add multi-turn quality tests for coherence, customer_satisfaction, groundedness, task_completion evaluators#5072

Merged
AliMahmoudzadeh merged 3 commits into
mainfrom
amah/multi-turn-quality-tests
May 21, 2026
Merged

Add multi-turn quality tests for coherence, customer_satisfaction, groundedness, task_completion evaluators#5072
AliMahmoudzadeh merged 3 commits into
mainfrom
amah/multi-turn-quality-tests

Conversation

@AliMahmoudzadeh
Copy link
Copy Markdown
Contributor

@AliMahmoudzadeh AliMahmoudzadeh commented May 20, 2026

Summary

Adds 16 multi-turn quality tests across 4 evaluators that validate real LLM evaluation with Azure OpenAI (no mocking). Each test trace was validated across 7 judge models with 100% pass rate.

Tests Added

Evaluator Tests Cases
Coherence 4 fail (incoherent), pass (minor tangent), pass (perfect flow), skip (user derails)
Customer Satisfaction 4 fail (dismissive), fail (curt with tool calls), edge (generic advice), pass (full resolution)
Groundedness 5 fail (fabricated meds), fail (contradicts tool), pass (clarifying only), pass (correct but incomplete), pass (fully grounded)
Task Completion 3 pass (flight+hotel with tools), fail (missing charts), fail (ignores refinement)

Key Implementation Details

  • *
    ormalize_messages_for_evaluator()*
    helper in \common_test_data.py\ converts OpenAI-format tool call messages to Azure AI Evaluation content-block format
  • Coherence & Groundedness test classes override \�xpected_result_fields\ and _extract_and_print_result\ because their multi-turn _build_result\ omits the _passed\ field
  • Coherence skip test overrides \�ssert_not_applicable\ since the LLM reason doesn't contain the fixed 'not applicable' string

Files Changed

  • \common_test_data.py\ added
    ormalize_messages_for_evaluator()\ helper
  • _init_.py\ registered 4 new multi-turn test classes
  • 4 new test files (\ est_*_multi_turn.py)

Test Results

All 16 tests pass locally with both gpt-5-nano (~3.5 min) and gpt-5.4 (~2.5 min).

…oundedness, task_completion evaluators

Add 16 multi-turn quality tests (4 coherence, 4 CSAT, 5 groundedness, 3 task_completion)
that validate real LLM evaluation with Azure OpenAI (no mocking). Test traces
validated across 7 judge models with 100% pass rate.

Key implementation details:
- normalize_messages_for_evaluator() helper converts OpenAI-format tool call
  messages to Azure AI Evaluation content-block format
- Coherence/groundedness test classes override expected_result_fields (multi-turn
  _build_result omits _passed) and _extract_and_print_result (derives passed
  from label)
- Coherence skip test overrides assert_not_applicable (LLM reason doesn't
  contain fixed 'not applicable' string)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 20, 2026

Test Results for assets-test

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit 8b509dc.

♻️ This comment has been updated with latest results.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@AliMahmoudzadeh AliMahmoudzadeh merged commit f202b50 into main May 21, 2026
38 checks passed
@AliMahmoudzadeh AliMahmoudzadeh deleted the amah/multi-turn-quality-tests branch May 21, 2026 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants