fix(judge): re-evaluate against human ratings (post-align kappa was always 0)#171
Merged
Merged
Conversation
…real The post-alignment "Re-evaluate" path passed require_human_ratings=False, so the eval rows carried human_rating=None and Cohen's kappa was computed over zero pairs, always reporting 0 on the Judge Tuning screen regardless of how well the aligned judge actually agreed. Flip it to True: the existing require_human_ratings branch joins human ratings (via prepare_alignment_data, keyed by mlflow_trace_id to match the searched eval-tagged traces) and computes agreement against the human baseline. tag_type='eval' and use_registered_judge are unchanged, so the trace set and the aligned judge are preserved. The other three require_human_ratings=False sites (add_traces, begin_annotation, restart_auto_evaluation) are pre-annotation auto-eval and are intentionally left as-is. Adds a behavioral regression test (captures the endpoint's actual call args) and updates JUDGE_EVALUATION_SPEC's Re-Evaluation section + success criteria.
…c coverage Adds an integration test that drives the real run_evaluation_with_answer_sheet generator (MLflow mocked at the boundaries) proving the True path joins human ratings and yields a real Cohen's kappa, and that the old False path reproduces the kappa=0 bug. Aligns @Req markers with the JUDGE_EVALUATION_SPEC criterion so the spec-coverage analyzer links them, and regenerates SPEC_COVERAGE_MAP.md + the coverage baseline (JUDGE_EVALUATION_SPEC 28/30 -> 29/31; gate: no regression).
… test Removes test_re_evaluate_requires_human_ratings: it reached into the endpoint's background daemon thread and polled for the captured call args, which is timing -dependent (flaky under CI load) and structurally unlike the other router tests (which assert on synchronous pre-thread state). The require_human_ratings=True behavior is already proven deterministically by the integration test, the inline comment, and the JUDGE_EVALUATION_SPEC criterion. Net effect: this test file returns to its original state. Coverage map + baseline refreshed; the spec criterion stays covered by the integration test and the gate reports no regression.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to the v1.10 audit (#170). Fixes the post-alignment Re-evaluate on the Judge Tuning
page reporting Cohen's kappa = 0 regardless of how well the aligned judge actually agreed with the
SMEs.
Root cause: the re-evaluate path passed
require_human_ratings=False, so eval rows carriedhuman_rating=Noneand kappa was computed over zero pairs. Fix: passTrue. The existing branchjoins human ratings (via
prepare_alignment_data, keyed bymlflow_trace_idto match the searchedeval-tagged set) and computes agreement against the human baseline.tag_type='eval'anduse_registered_judge=Trueare unchanged, so the trace set and the aligned judge are preserved.The other three
require_human_ratings=Falsesites (add_traces,begin_annotation,restart_auto_evaluation) are pre-annotation auto-eval and are intentionally left as-is.Changes
server/routers/workshops.py: flip the re-evaluate flag (1 line + comment).specs/JUDGE_EVALUATION_SPEC.md: new "Human-Rating Agreement" subsection + a success criterion(spec edit authorized by the owner).
tests/unit/services/test_realign_human_ratings_integration.py: an integration test that drivesthe real
run_evaluation_with_answer_sheetgenerator with MLflow mocked at the boundaries,proving the
Truepath yields a real kappa and the oldFalsepath reproduces the kappa=0 bug.SPEC_COVERAGE_MAP.md+.spec-coverage-baseline.json(
JUDGE_EVALUATION_SPEC28/30 -> 29/31; gate reports no regression).Reviewer focus
workshops.pyis the fix; everything else is the test + spec.silently showing 0%. This can't occur in the normal post-alignment flow (alignment itself
requires SME ratings).
Validation
(
google-genainot installed in the test venv), unrelated to this change.real positive kappa; without, kappa = 0.
spec-coverage-gate: no regression (overall stable at 71%).