feat: add flip rate metric and answer flip detection to --diff results #14
Merged
Copilot started reviewing on behalf of NullPointerDepressiveDisorder · April 14, 2026 09:03
Contributor
Pull request overview
Adds “answer flip” analysis to diff runs and surfaces it as a flip_rate metric in the CLI summary, enabling quicker identification of backends that change the functional answer vs the baseline.
Changes:
- Annotate `ComparisonResult.metadata` in `diff()` with extracted answers and a `flipped` boolean using category-aware answer extraction.
- Add a `flip_rate` column to the `infer-check diff` summary table and color-code it for quick scanning.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/infer_check/runner.py | Extracts/compares functional answers during diff() and stores flip-related metadata on each comparison. |
| src/infer_check/cli.py | Displays per-backend flip_rate in the diff summary table (with threshold-based coloring). |
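The threshold-based coloring in the summary table might look like the sketch below. The 5% / 20% cutoffs and the Rich-style markup tags are illustrative assumptions, not the PR's actual values.

```python
def flip_rate_cell(flip_rate: float) -> str:
    """Format a flip_rate value with color markup for the summary table.

    Assumed thresholds: green under 5%, yellow under 20%, red otherwise.
    """
    text = f"{flip_rate:.1%}"
    if flip_rate < 0.05:
        return f"[green]{text}[/green]"
    if flip_rate < 0.20:
        return f"[yellow]{text}[/yellow]"
    return f"[red]{text}[/red]"
```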
…iff command
- Extract answer extraction and flip detection logic into a new `_annotate_flip_metadata` helper method in `TestRunner` to eliminate code duplication between `compare` and `diff`.
- Add unit tests for the `diff` CLI command (`test_cli_diff.py`) to verify summary table rendering, backend metrics, and flip rate formatting.
- Introduce `StubBackend` in `test_runner.py` and add async tests to ensure the `diff` runner method accurately detects answer flips across single and multiple test backends.
This pull request introduces a new metric—flip rate—to the comparison of backend inference results, and implements answer extraction and flip detection to better analyze differences between model outputs. The main changes are:
Metrics and Reporting Improvements:
- Added a `flip_rate` column to the summary table in `src/infer_check/cli.py`, which visually highlights backends with a high rate of answer "flips" (i.e., cases where the answer changes compared to the baseline).
- Computed a per-backend `flip_rate`, with color-coding for easy identification of problematic backends.

Answer Extraction and Flip Detection:
- In `src/infer_check/runner.py`, imported answer extraction and comparison utilities to support the new analysis features.
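Aggregating the per-comparison flags into the per-backend metric could look like this minimal sketch; the `"flipped"` key follows the PR description, while the function name and input shape are assumptions.

```python
def compute_flip_rate(comparisons: list[dict]) -> float:
    """Aggregate per-comparison 'flipped' metadata into one backend-level rate.

    `comparisons` stands in for the metadata dicts attached during diff().
    Returns 0.0 for an empty list to avoid division by zero.
    """
    if not comparisons:
        return 0.0
    flips = sum(1 for meta in comparisons if meta.get("flipped"))
    return flips / len(comparisons)
```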