Skip to content

fix(judge): re-evaluate against human ratings (post-align kappa was always 0)#171

Merged
forrestmurray-db merged 3 commits into
rc/v1.10.0from
fix/item2-realign-human-ratings
Jun 29, 2026
Merged

fix(judge): re-evaluate against human ratings (post-align kappa was always 0)#171
forrestmurray-db merged 3 commits into
rc/v1.10.0from
fix/item2-realign-human-ratings

Conversation

@jay-wong-db

@jay-wong-db jay-wong-db commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Summary

Follow-up to the v1.10 audit (#170). Fixes the post-alignment Re-evaluate on the Judge Tuning
page reporting Cohen's kappa = 0 regardless of how well the aligned judge actually agreed with the
SMEs.

Root cause: the re-evaluate path passed require_human_ratings=False, so eval rows carried
human_rating=None and kappa was computed over zero pairs. Fix: pass True. The existing branch
joins human ratings (via prepare_alignment_data, keyed by mlflow_trace_id to match the searched
eval-tagged set) and computes agreement against the human baseline. tag_type='eval' and
use_registered_judge=True are unchanged, so the trace set and the aligned judge are preserved.

The other three require_human_ratings=False sites (add_traces, begin_annotation,
restart_auto_evaluation) are pre-annotation auto-eval and are intentionally left as-is.

Changes

  • server/routers/workshops.py: flip the re-evaluate flag (1 line + comment).
  • specs/JUDGE_EVALUATION_SPEC.md: new "Human-Rating Agreement" subsection + a success criterion
    (spec edit authorized by the owner).
  • tests/unit/services/test_realign_human_ratings_integration.py: an integration test that drives
    the real run_evaluation_with_answer_sheet generator with MLflow mocked at the boundaries,
    proving the True path yields a real kappa and the old False path reproduces the kappa=0 bug.
  • Regenerated SPEC_COVERAGE_MAP.md + .spec-coverage-baseline.json
    (JUDGE_EVALUATION_SPEC 28/30 -> 29/31; gate reports no regression).

Reviewer focus

  • The 1-line behavior change in workshops.py is the fix; everything else is the test + spec.
  • Edge case: re-evaluate with no human ratings now fails loudly with a clear error instead of
    silently showing 0%. This can't occur in the normal post-alignment flow (alignment itself
    requires SME ratings).

Validation

  • Backend suite: 915 passed locally; the only 3 failures are pre-existing and environment-only
    (google-genai not installed in the test venv), unrelated to this change.
  • The integration test demonstrates the metric directly: with humans joined, accuracy 0.75 and a
    real positive kappa; without, kappa = 0.
  • spec-coverage-gate: no regression (overall stable at 71%).
  • No CI runs on this repo, so validation is local.

…real

The post-alignment "Re-evaluate" path passed require_human_ratings=False, so the
eval rows carried human_rating=None and Cohen's kappa was computed over zero
pairs, always reporting 0 on the Judge Tuning screen regardless of how well the
aligned judge actually agreed.

Flip it to True: the existing require_human_ratings branch joins human ratings
(via prepare_alignment_data, keyed by mlflow_trace_id to match the searched
eval-tagged traces) and computes agreement against the human baseline.
tag_type='eval' and use_registered_judge are unchanged, so the trace set and
the aligned judge are preserved. The other three require_human_ratings=False
sites (add_traces, begin_annotation, restart_auto_evaluation) are pre-annotation
auto-eval and are intentionally left as-is.

Adds a behavioral regression test (captures the endpoint's actual call args) and
updates JUDGE_EVALUATION_SPEC's Re-Evaluation section + success criteria.
…c coverage

Adds an integration test that drives the real run_evaluation_with_answer_sheet
generator (MLflow mocked at the boundaries) proving the True path joins human
ratings and yields a real Cohen's kappa, and that the old False path reproduces
the kappa=0 bug. Aligns @Req markers with the JUDGE_EVALUATION_SPEC criterion so
the spec-coverage analyzer links them, and regenerates SPEC_COVERAGE_MAP.md +
the coverage baseline (JUDGE_EVALUATION_SPEC 28/30 -> 29/31; gate: no regression).
… test

Removes test_re_evaluate_requires_human_ratings: it reached into the endpoint's
background daemon thread and polled for the captured call args, which is timing
-dependent (flaky under CI load) and structurally unlike the other router tests
(which assert on synchronous pre-thread state). The require_human_ratings=True
behavior is already proven deterministically by the integration test, the inline
comment, and the JUDGE_EVALUATION_SPEC criterion. Net effect: this test file
returns to its original state. Coverage map + baseline refreshed; the spec
criterion stays covered by the integration test and the gate reports no regression.
@forrestmurray-db forrestmurray-db merged commit 28c22d9 into rc/v1.10.0 Jun 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants