feat: GEPA prompt optimization and pairwise agreement IRR #111
Open
vivian-xie-db wants to merge 10 commits into
- Add prompt optimization page with GEPA (mlflow.genai.optimize_prompts)
  supporting custom target endpoints, auto-format detection (chat vs
  agent), URL parsing, and clean training data extraction from traces
- Replace Krippendorff's Alpha with pairwise agreement percentage for IRR
  (adjacent agreement for Likert, exact for binary, 75% threshold)
- Add Copy and Download buttons for optimization logs
- Add prompt_optimization_runs table with migrations
- Add pairwise_agreement module with comprehensive tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
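For reviewers unfamiliar with the metric, here is a minimal sketch of pairwise agreement as the commit message describes it. It is not the code in pairwise_agreement.py: the function name, the input shape, and the reading of "adjacent" as "within one scale point" are all assumptions made for illustration.

```python
from itertools import combinations

# Threshold quoted in the commit message; how the UI applies it is not shown here.
ACCEPTABLE_THRESHOLD = 75.0


def pairwise_agreement_pct(ratings_by_item, likert=True):
    """Return the percentage of rater pairs that agree, pooled over all items.

    ratings_by_item: list of lists, one inner list of ratings per item.
    likert=True  -> adjacent agreement (assumed: ratings differ by at most 1).
    likert=False -> exact agreement (binary judges).
    Returns None when no item has at least two ratings.
    """
    tolerance = 1 if likert else 0
    agree = 0
    total = 0
    for ratings in ratings_by_item:
        for a, b in combinations(ratings, 2):
            total += 1
            if abs(a - b) <= tolerance:
                agree += 1
    if total == 0:
        return None
    return 100.0 * agree / total
```

With two raters and Likert ratings `[[4, 5], [2, 4], [3, 3], [1, 2]]`, three of the four pairs are within one point of each other, so the sketch returns 75.0, which just meets the threshold.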
When navigating away from the Prompt Optimization page and back, the component state was lost (logs, status, original prompt). Now on mount, the page checks history for any "running" job and automatically reconnects — restoring all logs, the original prompt text, and resuming the polling loop. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rsistence
- Replace Fleiss' Kappa with GDPval A^HH formula (E[1-|H1-H2|]) for human
  inter-rater agreement, supporting 2+ raters with normalized [0,1] ratings
- Compute judge performance metrics (Cohen's κ, accuracy) on-the-fly from
  annotations when auto-eval stores evaluations without human_rating
- Re-enable performance metrics bar on Judge Tuning page for saved prompts
- Persist Prompt Optimization config (prompt, UC catalog/schema, model) to
  localStorage so form data survives page navigation
- Restore full config (model, iterations, candidates, target endpoint, UC
  fields) from optimization history on reconnect
- Add target_endpoint column to prompt_optimization_runs (migration 0014)
- Add 30 unit tests for GDPval agreement calculation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
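The E[1-|H1-H2|] formula can be sketched as below. This is an illustration and not the module added in this PR: the function names are hypothetical, and taking the expectation as "mean over rater pairs within each item, then mean over items" is an assumption, since the commit message does not say how pairs are pooled.

```python
from itertools import combinations
from statistics import mean


def normalize(rating, low, high):
    """Map a rating on [low, high] onto [0, 1]."""
    return (rating - low) / (high - low)


def human_agreement(ratings_by_item, low=1, high=5):
    """Estimate E[1 - |H1 - H2|] over pairs of human raters.

    ratings_by_item: list of lists, one inner list of raw ratings per item.
    Items with fewer than two ratings are skipped.
    Returns a value in [0, 1], or None if no item has two or more ratings.
    """
    per_item = []
    for ratings in ratings_by_item:
        if len(ratings) < 2:
            continue
        scaled = [normalize(r, low, high) for r in ratings]
        pair_scores = [1 - abs(a - b) for a, b in combinations(scaled, 2)]
        per_item.append(mean(pair_scores))
    return mean(per_item) if per_item else None
```

On a 1-5 scale, two raters giving 2 and 4 normalize to 0.25 and 0.75, so that item scores 0.5; identical ratings score 1.0 and opposite ends of the scale score 0.0. Binary judges would pass `low=0, high=1`.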
…upling
- Store human_rating as None (not 0) when missing, fixing ambiguity for
  binary judges where 0 is a valid rating
- Always prefer annotation-sourced human ratings over evaluation records
- Detect and auto-expire stale auto-eval jobs stuck running for >5 minutes
- Correctly use 0/1 labels for binary judges in confusion matrix and
  agreement-by-rating (was incorrectly using 1-5 Likert labels)
- Decouple background auto-eval polling from button disabled states on
  JudgeTuningPage; show informational badge instead of blocking actions
- Fix optimizer model name display with backend-to-frontend mapping;
  default to Claude Opus 4.5 for prompt optimization
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
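The None-versus-0 point is easy to get wrong, so here is a small sketch of why it matters. The function and the tuple layout are hypothetical and are not taken from this PR; the sketch only shows that a missing rating has to be tested with `is not None`, because a truthiness check would also discard legitimate 0 ratings from a binary judge.

```python
def judge_accuracy(pairs):
    """Fraction of rated examples where the judge matches the human.

    pairs: list of (human_rating, judge_rating) tuples; human_rating is
    None when no human has rated the example.
    Returns None when no example has a human rating.
    """
    # `if h is not None` keeps human ratings of 0; `if h` would drop them.
    rated = [(h, j) for h, j in pairs if h is not None]
    if not rated:
        return None
    matches = sum(1 for h, j in rated if h == j)
    return matches / len(rated)
```

For `[(0, 0), (1, 0), (None, 1), (0, 1)]` the sketch scores three rated examples and one match. Had missing ratings been stored as 0, the unrated third example would have been counted as a disagreement.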
- Show GEPA score improvement (initial → final) in prompt optimization
  results card and history, with percentage badge; parse from logs as
  fallback for older runs without metrics
- Include initial_score/final_score in backend optimization metrics
- Fix sidebar not updating after starting annotation: replace removeQueries
  (destroys cache) with invalidateQueries (triggers refetch for active
  observers)
- Rename trace alignment tag from tags.label='align' to
  tags.annotation_status='align'; keep tags.label='eval' unchanged
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…default
- Fix list-scorers endpoint to use correct MLflow setup pattern
(set_tracking_uri('databricks') + set_experiment) matching alignment
and optimization services so registered judges are actually returned
- Add high-water mark for sidebar phase navigation so post-annotation
phases stay green when clicking back to earlier phases
- Gate sidebar blue/green visuals behind isAnnotationComplete
- Change default candidates per iteration from 5 to 3
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sign
- Add judge_names list and max_traces to PromptOptimizationRequest model
- Make target_endpoint required (not optional)
- Backend uses request judge_names over rubric-derived ones when provided
- Add cancel endpoint and stale job detection for running optimizations
- Pass max_traces through to _build_train_data
- Redesign Prompt Optimization page: two-column layout with enriched cards,
  cohesive teal/sky/emerald/blue color palette, polished card headers
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tory UI
- Start Optimization button uses default primary color (matches login)
- Default iterations changed from 3 to 2 (frontend + backend)
- History section redesigned: clean divider-based rows, outline status
  badges, dot-separated metadata, chevron expand indicators, rounded-lg
  containers, and consistent spacing with the rest of the page
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Append 'Z' suffix to created_at ISO strings so JavaScript's Date constructor treats them as UTC and correctly converts to local timezone. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
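A minimal sketch of the timestamp fix, assuming created_at is held as a naive datetime that already represents UTC (the helper name is hypothetical and the actual serialization code in this PR is not shown). JavaScript parses an ISO date-time string without an offset as local time, so the suffix is what tells it the value is UTC.

```python
from datetime import datetime


def to_utc_iso(created_at: datetime) -> str:
    """Serialize a datetime so JavaScript's Date reads it as UTC.

    Assumes naive datetimes are already in UTC. Timezone-aware values are
    left alone because isoformat() already emits their offset.
    """
    text = created_at.isoformat()
    if created_at.tzinfo is None:
        text += "Z"
    return text
```

For example, `to_utc_iso(datetime(2026, 4, 17, 12, 30))` yields `2026-04-17T12:30:00Z`, which the browser then converts to the viewer's local timezone.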
forrestmurray-db added a commit that referenced this pull request Apr 17, 2026
Cherry-pick IRR-specific files from feature/gepa-prompt-optimization:
- pairwise_agreement.py: pairwise agreement % as primary IRR metric
- fleiss_kappa.py: Fleiss' Kappa for multi-rater reliability
- irr_service.py: rewritten to use pairwise agreement primary
- IRRResultsDemo.tsx: UI updates for pairwise display
- All associated tests (63 passing)
Skips GEPA prompt optimization files (unrelated scope).
Summary
- PromptOptimizationService with end-to-end GEPA integration, stale job detection, cancel endpoint, judge_names multi-select, max_traces parameter, and UTC timestamp fix for history
Test plan
- just test-server: all Python unit tests pass
- just ui-test-unit: all React unit tests pass
🤖 Generated with Claude Code