feat: add flip rate metric and answer flip detection to --diff results #14

Merged
NullPointerDepressiveDisorder merged 2 commits into main from feature/diff-flip-rate on Apr 16, 2026
@NullPointerDepressiveDisorder
Owner

This pull request introduces a new metric—flip rate—to the comparison of backend inference results, and implements answer extraction and flip detection to better analyze differences between model outputs. The main changes are:

Metrics and Reporting Improvements:

  • Added a new flip_rate column to the summary table in src/infer_check/cli.py, which visually highlights backends with a high rate of answer "flips" (i.e., cases where the answer changes compared to the baseline).
  • Updated the table row logic to compute and display the flip_rate for each backend, including color-coding for easy identification of problematic backends.
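As a rough illustration of the metric described above, the flip rate for one backend can be computed as the share of comparisons marked as flipped, then mapped to a display color. This is a minimal sketch, not the code in `src/infer_check/cli.py`: the `Comparison` stand-in, the `flipped` metadata key, and the 5% / 20% thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Comparison:
    """Hypothetical stand-in for a comparison record; only `metadata` is used."""
    metadata: dict[str, Any] = field(default_factory=dict)


def flip_rate(comparisons: list[Comparison]) -> float:
    """Return the fraction of comparisons whose metadata marks a flip."""
    if not comparisons:
        return 0.0  # avoid dividing by zero when a backend has no comparisons
    flipped = sum(1 for c in comparisons if c.metadata.get("flipped", False))
    return flipped / len(comparisons)


def flip_rate_color(rate: float, warn: float = 0.05, bad: float = 0.20) -> str:
    """Map a flip rate to a color name. The thresholds are illustrative only."""
    if rate >= bad:
        return "red"
    if rate >= warn:
        return "yellow"
    return "green"


# One flipped answer out of four comparisons.
results = [
    Comparison({"flipped": True}),
    Comparison({"flipped": False}),
    Comparison({"flipped": False}),
    Comparison({"flipped": False}),
]
rate = flip_rate(results)
cell = f"{rate:.1%}"  # percentage text for the table cell
color = flip_rate_color(rate)
```

Keeping the rate calculation separate from the color mapping means the thresholds can change without touching the metric itself.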

Answer Extraction and Flip Detection:

  • In src/infer_check/runner.py, imported answer extraction and comparison utilities to support new analysis features.
  • For each comparison, extracted answers from both the baseline and test results, determined if a "flip" occurred (i.e., answers do not match), and recorded detailed metadata such as both extracted answers, extraction strategy, and confidence. This metadata is used to calculate the flip rate and provide deeper insight into model behavior.

…sults
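The flip-detection step described above might look roughly like the following. It is a sketch under stated assumptions: the toy `extract_answer` function, the strategy names, the confidence values, and the metadata keys are hypothetical and do not reproduce the extraction utilities imported in `src/infer_check/runner.py`.

```python
import re
from typing import NamedTuple


class Extracted(NamedTuple):
    value: str
    strategy: str
    confidence: float


def extract_answer(text: str) -> Extracted:
    """Toy extractor: prefer the last number in the text, else normalized text."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    if numbers:
        return Extracted(numbers[-1], "last_number", 0.9)
    # Fall back to case- and whitespace-insensitive text comparison.
    return Extracted(" ".join(text.lower().split()), "normalized_text", 0.5)


def flip_metadata(baseline_text: str, test_text: str) -> dict:
    """Build flip metadata for one baseline/test pair (key names are assumed)."""
    base = extract_answer(baseline_text)
    test = extract_answer(test_text)
    return {
        "baseline_answer": base.value,
        "test_answer": test.value,
        "extraction_strategy": test.strategy,
        "extraction_confidence": min(base.confidence, test.confidence),
        "flipped": base.value != test.value,
    }


changed = flip_metadata("The total is 42.", "So the answer is 41")
reworded = flip_metadata("The total is 42.", "Answer: 42")
textual = flip_metadata("Yes", "  yes ")
```

Comparing extracted answers rather than raw text means a reworded response with the same final answer is not counted as a flip.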

Copilot AI review requested due to automatic review settings April 14, 2026 09:02
@codecov

codecov Bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 97.54098% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| tests/unit/test_runner.py | 94.11% | 3 Missing ⚠️ |

Contributor

Copilot AI left a comment

Pull request overview

Adds “answer flip” analysis to diff runs and surfaces it as a flip_rate metric in the CLI summary, enabling quicker identification of backends that change the functional answer vs the baseline.

Changes:

  • Annotate diff() ComparisonResult.metadata with extracted answers and a flipped boolean using category-aware answer extraction.
  • Add a flip_rate column to the infer-check diff summary table and color-code it for quick scanning.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| src/infer_check/runner.py | Extracts/compares functional answers during diff() and stores flip-related metadata on each comparison. |
| src/infer_check/cli.py | Displays per-backend flip_rate in the diff summary table (with threshold-based coloring). |

Comment thread src/infer_check/runner.py Outdated
Comment thread src/infer_check/runner.py Outdated
Comment thread src/infer_check/cli.py
…iff command

- Extract answer extraction and flip detection logic into a new `_annotate_flip_metadata` helper method in `TestRunner` to eliminate code duplication between `compare` and `diff`.
- Add unit tests for the `diff` CLI command (`test_cli_diff.py`) to verify summary table rendering, backend metrics, and flip rate formatting.
- Introduce `StubBackend` in `test_runner.py` and add async tests to ensure the `diff` runner method accurately detects answer flips across single and multiple test backends.
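To illustrate the testing approach in the last bullet, a stub backend with canned responses lets flip detection be exercised without loading a model. The sketch below is an assumption-laden illustration: this `StubBackend` interface (`name`, async `generate`) and the `count_flips` helper are hypothetical and may not match the real classes in `test_runner.py` or the `TestRunner.diff` signature.

```python
import asyncio


class StubBackend:
    """Hypothetical stub: returns canned responses keyed by prompt."""

    def __init__(self, name: str, responses: dict[str, str]) -> None:
        self.name = name
        self._responses = responses

    async def generate(self, prompt: str) -> str:
        await asyncio.sleep(0)  # yield to the event loop, as a real backend call would
        return self._responses[prompt]


async def count_flips(
    baseline: StubBackend, tests: list[StubBackend], prompts: list[str]
) -> dict[str, int]:
    """Count, per test backend, the prompts whose answer differs from the baseline."""
    flips = {backend.name: 0 for backend in tests}
    for prompt in prompts:
        expected = (await baseline.generate(prompt)).strip()
        for backend in tests:
            actual = (await backend.generate(prompt)).strip()
            if actual != expected:
                flips[backend.name] += 1
    return flips


prompts = ["2+2?", "capital of France?"]
baseline = StubBackend("baseline", {"2+2?": "4", "capital of France?": "Paris"})
same = StubBackend("same", {"2+2?": "4", "capital of France?": "Paris"})
drifted = StubBackend("drifted", {"2+2?": "5", "capital of France?": "Paris"})

flips = asyncio.run(count_flips(baseline, [same, drifted], prompts))
```

Running one identical backend and one drifted backend against the same baseline covers both the no-flip case and the multiple-test-backend case in a single test.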
@NullPointerDepressiveDisorder NullPointerDepressiveDisorder merged commit cca00fe into main Apr 16, 2026
5 checks passed
@NullPointerDepressiveDisorder NullPointerDepressiveDisorder deleted the feature/diff-flip-rate branch April 16, 2026 02:18