fix(judge): re-evaluate against human ratings (post-align kappa was always 0) by jay-wong-db · Pull Request #171 · databricks-solutions/vibescaler

jay-wong-db · 2026-06-19T15:27:50Z

Summary

Follow-up to the v1.10 audit (#170). Fixes the post-alignment Re-evaluate on the Judge Tuning
page reporting Cohen's kappa = 0 regardless of how well the aligned judge actually agreed with the
SMEs.

Root cause: the re-evaluate path passed require_human_ratings=False, so eval rows carried
human_rating=None and kappa was computed over zero pairs. Fix: pass True. The existing branch
joins human ratings (via prepare_alignment_data, keyed by mlflow_trace_id to match the searched
eval-tagged set) and computes agreement against the human baseline. tag_type='eval' and
use_registered_judge=True are unchanged, so the trace set and the aligned judge are preserved.

The other three require_human_ratings=False sites (add_traces, begin_annotation,
restart_auto_evaluation) are pre-annotation auto-eval and are intentionally left as-is.

Changes

server/routers/workshops.py: flip the re-evaluate flag (1 line + comment).
specs/JUDGE_EVALUATION_SPEC.md: new "Human-Rating Agreement" subsection + a success criterion
(spec edit authorized by the owner).
tests/unit/services/test_realign_human_ratings_integration.py: an integration test that drives
the real run_evaluation_with_answer_sheet generator with MLflow mocked at the boundaries,
proving the True path yields a real kappa and the old False path reproduces the kappa=0 bug.
Regenerated SPEC_COVERAGE_MAP.md + .spec-coverage-baseline.json
(JUDGE_EVALUATION_SPEC 28/30 -> 29/31; gate reports no regression).

Reviewer focus

The 1-line behavior change in workshops.py is the fix; everything else is the test + spec.
Edge case: re-evaluate with no human ratings now fails loudly with a clear error instead of
silently showing 0%. This can't occur in the normal post-alignment flow (alignment itself
requires SME ratings).

Validation

Backend suite: 915 passed locally; the only 3 failures are pre-existing and environment-only
(google-genai not installed in the test venv), unrelated to this change.
The integration test demonstrates the metric directly: with humans joined, accuracy 0.75 and a
real positive kappa; without, kappa = 0.
spec-coverage-gate: no regression (overall stable at 71%).
No CI runs on this repo, so validation is local.

…real The post-alignment "Re-evaluate" path passed require_human_ratings=False, so the eval rows carried human_rating=None and Cohen's kappa was computed over zero pairs, always reporting 0 on the Judge Tuning screen regardless of how well the aligned judge actually agreed. Flip it to True: the existing require_human_ratings branch joins human ratings (via prepare_alignment_data, keyed by mlflow_trace_id to match the searched eval-tagged traces) and computes agreement against the human baseline. tag_type='eval' and use_registered_judge are unchanged, so the trace set and the aligned judge are preserved. The other three require_human_ratings=False sites (add_traces, begin_annotation, restart_auto_evaluation) are pre-annotation auto-eval and are intentionally left as-is. Adds a behavioral regression test (captures the endpoint's actual call args) and updates JUDGE_EVALUATION_SPEC's Re-Evaluation section + success criteria.

@Req

…c coverage Adds an integration test that drives the real run_evaluation_with_answer_sheet generator (MLflow mocked at the boundaries) proving the True path joins human ratings and yields a real Cohen's kappa, and that the old False path reproduces the kappa=0 bug. Aligns @Req markers with the JUDGE_EVALUATION_SPEC criterion so the spec-coverage analyzer links them, and regenerates SPEC_COVERAGE_MAP.md + the coverage baseline (JUDGE_EVALUATION_SPEC 28/30 -> 29/31; gate: no regression).

… test Removes test_re_evaluate_requires_human_ratings: it reached into the endpoint's background daemon thread and polled for the captured call args, which is timing -dependent (flaky under CI load) and structurally unlike the other router tests (which assert on synchronous pre-thread state). The require_human_ratings=True behavior is already proven deterministically by the integration test, the inline comment, and the JUDGE_EVALUATION_SPEC criterion. Net effect: this test file returns to its original state. Coverage map + baseline refreshed; the spec criterion stays covered by the integration test and the gate reports no regression.

jay-wong-db added 2 commits June 19, 2026 16:00

jay-wong-db requested review from forrestmurray-db and vivian-xie-db as code owners June 19, 2026 15:27

forrestmurray-db merged commit 28c22d9 into rc/v1.10.0 Jun 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(judge): re-evaluate against human ratings (post-align kappa was always 0)#171

fix(judge): re-evaluate against human ratings (post-align kappa was always 0)#171
forrestmurray-db merged 3 commits into
rc/v1.10.0from
fix/item2-realign-human-ratings

jay-wong-db commented Jun 19, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

jay-wong-db commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Reviewer focus

Validation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jay-wong-db commented Jun 19, 2026 •

edited

Loading