Update Groundedness Evaluator to support multi-turn evaluation #4942

Merged: AliMahmoudzadeh merged 11 commits into main from selshafey/groundedness_multiturn on Apr 24, 2026
@salma-elshafey (Contributor) commented Apr 20, 2026

@github-actions (bot) commented Apr 20, 2026

Test Results for assets-test

92 tests   92 ✅  2s ⏱️
 1 suites   0 💤
 1 files     0 ❌

Results for commit 97e8668.

♻️ This comment has been updated with latest results.

…#4957)

Data-driven prompt improvements validated on 300 FaithDial traces:
- Added 5-step evaluation procedure with tool-result parsing guidance
- Introduced 'substantive hallucination' vs 'NOT hallucination' taxonomy
- Added score-4 example showing tool-call + paraphrase pattern
- Tightened score-3 to require zero factual propositions
- Added CRITICAL rule: any substantive hallucination -> score <= 2
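
The CRITICAL rule above is stated in the evaluator prompt, not in code. Purely as an illustration, the cap it describes could be written as a post-processing check. This is a minimal sketch that assumes a 1-5 integer score scale; the helper name `apply_hallucination_cap` and its parameters are hypothetical and do not come from this PR.

```python
def apply_hallucination_cap(raw_score: int, substantive_hallucinations: int) -> int:
    """Cap a groundedness score at 2 when any substantive hallucination is present.

    Assumptions (not confirmed by the PR): scores are integers in 1..5 and
    the caller has already counted substantive hallucinations.
    """
    if not 1 <= raw_score <= 5:
        raise ValueError(f"score out of assumed 1-5 range: {raw_score}")
    if substantive_hallucinations < 0:
        raise ValueError("hallucination count cannot be negative")
    if substantive_hallucinations > 0:
        # Any substantive hallucination -> score <= 2
        return min(raw_score, 2)
    return raw_score


capped = apply_hallucination_cap(4, substantive_hallucinations=1)
uncapped = apply_hallucination_cap(4, substantive_hallucinations=0)
```

Here `capped` is 2 and `uncapped` stays at 4. A score that is already 2 or lower passes through unchanged.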

Results (v1 -> v5):
- gpt-5.4-mini: 76.3% -> 83.7% accuracy, F1 88.2%
- gpt-5.4:      65.7% -> 79.7% accuracy, F1 83.1%
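
The v1 -> v5 accuracy changes can be read as percentage-point differences. A small sketch using only the figures quoted above; the dictionary layout and variable names are illustrative.

```python
# (v1 accuracy %, v5 accuracy %) as reported in the results above
accuracy = {
    "gpt-5.4-mini": (76.3, 83.7),
    "gpt-5.4": (65.7, 79.7),
}

# Gain in percentage points, rounded to one decimal to absorb float noise
gains = {model: round(v5 - v1, 1) for model, (v1, v5) in accuracy.items()}

for model, gain in gains.items():
    print(f"{model}: +{gain} percentage points")
```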

Co-authored-by: Ali Mahmoudzadeh <amah@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@AliMahmoudzadeh merged commit 9971fb5 into main on Apr 24, 2026
38 checks passed
@AliMahmoudzadeh deleted the selshafey/groundedness_multiturn branch on April 24, 2026 20:18

3 participants