feat: add flip rate metric and answer flip detection to --diff results by NullPointerDepressiveDisorder · Pull Request #14 · NullPointerDepressiveDisorder/infer-check

NullPointerDepressiveDisorder · 2026-04-14T09:02:43Z

This pull request adds a flip rate metric to the comparison of backend inference results, and implements answer extraction and flip detection to show where model outputs differ in their final answer rather than only in their text. The main changes are:

Metrics and Reporting Improvements:

Added a new flip_rate column to the summary table in src/infer_check/cli.py, which visually highlights backends with a high rate of answer "flips" (i.e., cases where the answer changes compared to the baseline).
Updated the table row logic to compute and display the flip_rate for each backend, including color-coding for easy identification of problematic backends.
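As a rough illustration of the row logic described above, the sketch below computes a flip rate from per-comparison metadata and wraps it in color markup. It is not the code from this PR: the function names, the Rich-style markup strings, and both thresholds are assumptions made for the example, and only the `flipped` metadata flag is mentioned in the review summary.

```python
from typing import Any, Mapping, Sequence

# Hypothetical thresholds; the PR does not state the actual cut-offs.
WARN_THRESHOLD = 0.01
FAIL_THRESHOLD = 0.10


def flip_rate(metadata_rows: Sequence[Mapping[str, Any]]) -> float:
    """Fraction of comparisons whose metadata marks the answer as flipped."""
    if not metadata_rows:
        return 0.0
    flips = sum(1 for row in metadata_rows if row.get("flipped") is True)
    return flips / len(metadata_rows)


def format_flip_rate(rate: float) -> str:
    """Wrap the percentage in color markup (Rich-style tags, assumed here)."""
    if rate >= FAIL_THRESHOLD:
        color = "red"
    elif rate >= WARN_THRESHOLD:
        color = "yellow"
    else:
        color = "green"
    return f"[{color}]{rate:.1%}[/{color}]"


rows = [{"flipped": True}, {"flipped": False}, {}, {"flipped": False}]
print(format_flip_rate(flip_rate(rows)))
```

Comparisons without a `flipped` key count as non-flips in this sketch; whether the real table treats missing metadata that way is not stated in the PR.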

Answer Extraction and Flip Detection:

In src/infer_check/runner.py, imported answer extraction and comparison utilities to support new analysis features.
For each comparison, extracted answers from both the baseline and test results, determined if a "flip" occurred (i.e., answers do not match), and recorded detailed metadata such as both extracted answers, extraction strategy, and confidence. This metadata is used to calculate the flip rate and provide deeper insight into model behavior.
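The flow described above (extract an answer from each side, compare, record metadata) might look roughly like the following. This is a minimal sketch under stated assumptions, not the repository's extraction utilities: `ExtractedAnswer`, `extract_answer`, `flip_metadata`, the two extraction strategies, the confidence values, and the metadata key names other than `flipped` are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ExtractedAnswer:
    value: Optional[str]   # normalized answer, or None if nothing was found
    strategy: str          # which heuristic produced the value
    confidence: float      # illustrative score in [0, 1]


_MARKER = re.compile(r"answer\s*[:=]\s*(.+)", flags=re.IGNORECASE)
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_answer(text: str) -> ExtractedAnswer:
    """Try an explicit 'answer:' marker first, then fall back to the last number."""
    marker = _MARKER.search(text)
    if marker:
        value = marker.group(1).strip().rstrip(".").lower()
        return ExtractedAnswer(value, "marker", 0.9)
    numbers = _NUMBER.findall(text)
    if numbers:
        return ExtractedAnswer(numbers[-1], "last_number", 0.6)
    return ExtractedAnswer(None, "none", 0.0)


def flip_metadata(baseline_text: str, test_text: str) -> Dict[str, Any]:
    """Build per-comparison metadata; a flip needs two extractable, unequal answers."""
    base = extract_answer(baseline_text)
    test = extract_answer(test_text)
    comparable = base.value is not None and test.value is not None
    return {
        "baseline_answer": base.value,
        "test_answer": test.value,
        "extraction_strategy": test.strategy,
        "extraction_confidence": min(base.confidence, test.confidence),
        "flipped": comparable and base.value != test.value,
    }


print(flip_metadata("The answer: 42.", "So the total is 41"))
```

In this sketch a pair is only counted as a flip when both answers could be extracted, so an unparseable output does not inflate the flip rate. That is a design choice of the example, not a documented behavior of the PR.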


codecov · 2026-04-14T09:02:53Z

Codecov Report

❌ Patch coverage is 97.54098% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| tests/unit/test_runner.py | 94.11% | 3 Missing ⚠️ |


Copilot

Pull request overview

Adds “answer flip” analysis to diff runs and surfaces it as a flip_rate metric in the CLI summary, enabling quicker identification of backends that change the functional answer vs the baseline.

Changes:

Annotate diff() ComparisonResult.metadata with extracted answers and a flipped boolean using category-aware answer extraction.
Add a flip_rate column to the infer-check diff summary table and color-code it for quick scanning.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `src/infer_check/runner.py` | Extracts/compares functional answers during `diff()` and stores flip-related metadata on each comparison. |
| `src/infer_check/cli.py` | Displays per-backend `flip_rate` in the `diff` summary table (with threshold-based coloring). |


…iff command
- Extract answer extraction and flip detection logic into a new `_annotate_flip_metadata` helper method in `TestRunner` to eliminate code duplication between `compare` and `diff`.
- Add unit tests for the `diff` CLI command (`test_cli_diff.py`) to verify summary table rendering, backend metrics, and flip rate formatting.
- Introduce `StubBackend` in `test_runner.py` and add async tests to ensure the `diff` runner method accurately detects answer flips across single and multiple test backends.
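For readers unfamiliar with the stub-backend testing pattern mentioned in the last bullet, here is a self-contained sketch of the idea: a canned-response backend lets an async comparison be tested without running a model. The `StubBackend` below only borrows the name from the commit message; its constructor, the `generate` method, and the `count_flips` helper are assumptions for illustration and do not reflect the actual `TestRunner` API. The comparison here uses normalized full text rather than extracted answers, to keep the example short.

```python
import asyncio
from typing import Dict, List


class StubBackend:
    """Hypothetical stand-in for an inference backend: canned text per prompt."""

    def __init__(self, name: str, responses: Dict[str, str]) -> None:
        self.name = name
        self._responses = responses

    async def generate(self, prompt: str) -> str:
        await asyncio.sleep(0)  # yield to the event loop, as a real backend would
        return self._responses[prompt]


async def count_flips(
    baseline: StubBackend, tests: List[StubBackend], prompts: List[str]
) -> Dict[str, int]:
    """Count, per test backend, the prompts whose output differs from the baseline."""
    counts: Dict[str, int] = {}
    for backend in tests:
        flips = 0
        for prompt in prompts:
            base_out, test_out = await asyncio.gather(
                baseline.generate(prompt), backend.generate(prompt)
            )
            if base_out.strip().lower() != test_out.strip().lower():
                flips += 1
        counts[backend.name] = flips
    return counts


prompts = ["2+2?", "capital of France?"]
baseline = StubBackend("fp16", {"2+2?": "4", "capital of France?": "Paris"})
q4 = StubBackend("q4", {"2+2?": "5", "capital of France?": "paris"})
q8 = StubBackend("q8", {"2+2?": "4", "capital of France?": "Paris "})

result = asyncio.run(count_flips(baseline, [q4, q8], prompts))
print(result)
```

The backend names `fp16`, `q4`, and `q8` are placeholders; in a real test suite the assertion would be made on the runner's comparison results rather than on a hand-rolled counter.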

feat: add flip rate metric and answer flip detection to comparison re…

6cad8fa

…sults

Copilot AI review requested due to automatic review settings April 14, 2026 09:02

Copilot started reviewing on behalf of NullPointerDepressiveDisorder April 14, 2026 09:03

Copilot AI reviewed Apr 14, 2026


Comment thread src/infer_check/runner.py Outdated

Comment thread src/infer_check/runner.py Outdated

Comment thread src/infer_check/cli.py

NullPointerDepressiveDisorder merged commit cca00fe into main Apr 16, 2026
5 checks passed

NullPointerDepressiveDisorder deleted the feature/diff-flip-rate branch April 16, 2026 02:18

NullPointerDepressiveDisorder merged 2 commits into main from feature/diff-flip-rate

2 participants
