feat: GEPA prompt optimization and pairwise agreement IRR by vivian-xie-db · Pull Request #111 · databricks-solutions/project-0xfffff

vivian-xie-db · 2026-02-18T19:51:35Z

Summary

- GEPA Prompt Optimization page: Full UI for running prompt optimization jobs, with text/URI prompt input, multi-select scorers, configurable parameters (iterations, candidates, max traces), real-time log streaming, job cancellation, auto-reconnect, and optimization history with expandable results
- Pairwise Agreement IRR: Replaced Fleiss' Kappa with pairwise agreement percentage for inter-rater reliability, supporting binary and Likert rubric scales
- Backend: New PromptOptimizationService with end-to-end GEPA integration, stale job detection, cancel endpoint, judge_names multi-select, max_traces parameter, and UTC timestamp fix for history
- UI polish: Cohesive teal/sky/emerald/blue card palette, modernized history section, enriched card design with accent bars

Test plan

- Run just test-server; all Python unit tests pass
- Run just ui-test-unit; all React unit tests pass
- Verify prompt optimization page loads and displays all cards
- Test start/stop optimization flow with a live endpoint
- Verify optimization history shows correct local timestamps
- Check pairwise agreement IRR displays correctly on Results Review page

🤖 Generated with Claude Code

- Add prompt optimization page with GEPA (mlflow.genai.optimize_prompts) supporting custom target endpoints, auto-format detection (chat vs agent), URL parsing, and clean training data extraction from traces
- Replace Krippendorff's Alpha with pairwise agreement percentage for IRR (adjacent agreement for Likert, exact for binary, 75% threshold)
- Add Copy and Download buttons for optimization logs
- Add prompt_optimization_runs table with migrations
- Add pairwise_agreement module with comprehensive tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
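The pairwise agreement rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the contents of the pairwise_agreement module: the function names, the `ratings_by_item` shape, and the reading of "adjacent agreement" as "ratings differ by at most one point" are all assumptions. Only the 75% threshold and the exact/adjacent split come from the commit message.

```python
from itertools import combinations

AGREEMENT_THRESHOLD = 75.0  # percent, as stated in the commit message


def ratings_agree(a, b, scale):
    """Exact match for binary scales; within one point for Likert scales."""
    if scale == "binary":
        return a == b
    if scale == "likert":
        return abs(a - b) <= 1  # assumed meaning of "adjacent agreement"
    raise ValueError(f"unknown scale: {scale!r}")


def pairwise_agreement_pct(ratings_by_item, scale):
    """Percentage of rater pairs, taken within each item, that agree.

    ratings_by_item maps an item id to the list of ratings it received.
    Items with fewer than two ratings contribute no pairs.
    Returns None when there are no pairs to compare.
    """
    agreeing = 0
    total = 0
    for ratings in ratings_by_item.values():
        for a, b in combinations(ratings, 2):
            total += 1
            agreeing += ratings_agree(a, b, scale)
    if total == 0:
        return None
    return 100.0 * agreeing / total


def meets_threshold(pct, threshold=AGREEMENT_THRESHOLD):
    """True when an agreement percentage reaches the acceptance threshold."""
    return pct is not None and pct >= threshold
```

Under these assumptions, three Likert ratings of 4, 5, and 2 on one trace yield three pairs, of which only (4, 5) counts as agreeing.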

When navigating away from the Prompt Optimization page and back, the component state was lost (logs, status, original prompt). Now on mount, the page checks history for any "running" job and automatically reconnects, restoring all logs and the original prompt text and resuming the polling loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…rsistence

- Replace Fleiss' Kappa with GDPval A^HH formula (E[1-|H1-H2|]) for human inter-rater agreement, supporting 2+ raters with normalized [0,1] ratings
- Compute judge performance metrics (Cohen's κ, accuracy) on-the-fly from annotations when auto-eval stores evaluations without human_rating
- Re-enable performance metrics bar on Judge Tuning page for saved prompts
- Persist Prompt Optimization config (prompt, UC catalog/schema, model) to localStorage so form data survives page navigation
- Restore full config (model, iterations, candidates, target endpoint, UC fields) from optimization history on reconnect
- Add target_endpoint column to prompt_optimization_runs (migration 0014)
- Add 30 unit tests for GDPval agreement calculation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
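As a rough illustration of the A^HH formula E[1-|H1-H2|] named above, the sketch below averages 1 - |H1 - H2| over every pair of human ratings on the same item after scaling ratings to [0,1]. The function names, the min-max normalization, and the choice to pool pairs across items are assumptions made for illustration; the PR does not show how the service aggregates pairs.

```python
from itertools import combinations


def normalize(rating, low, high):
    """Scale a rating from [low, high] to [0, 1] (assumed min-max scaling)."""
    return (rating - low) / (high - low)


def human_agreement(ratings_by_item, low, high):
    """Mean of 1 - |H1 - H2| over all rater pairs on the same item.

    ratings_by_item maps an item id to the list of human ratings it received.
    Returns None when no item has at least two ratings.
    """
    scores = []
    for ratings in ratings_by_item.values():
        scaled = [normalize(r, low, high) for r in ratings]
        for h1, h2 in combinations(scaled, 2):
            scores.append(1.0 - abs(h1 - h2))
    if not scores:
        return None
    return sum(scores) / len(scores)
```

On a 1 to 5 Likert scale, two raters giving 2 and 3 differ by 0.25 after normalization, so that pair contributes 0.75.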

…upling

- Store human_rating as None (not 0) when missing, fixing ambiguity for binary judges where 0 is a valid rating
- Always prefer annotation-sourced human ratings over evaluation records
- Detect and auto-expire stale auto-eval jobs stuck running for >5 minutes
- Correctly use 0/1 labels for binary judges in confusion matrix and agreement-by-rating (was incorrectly using 1-5 Likert labels)
- Decouple background auto-eval polling from button disabled states on JudgeTuningPage; show informational badge instead of blocking actions
- Fix optimizer model name display with backend-to-frontend mapping; default to Claude Opus 4.5 for prompt optimization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
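To see why None rather than 0 matters for binary judges, consider a minimal accuracy calculation. The helper name and the (human_rating, judge_rating) pair layout below are hypothetical and are not taken from the PR's code.

```python
from typing import Optional, Sequence, Tuple


def judge_accuracy(
    pairs: Sequence[Tuple[Optional[int], int]],
) -> Optional[float]:
    """Fraction of judge ratings that match the human rating.

    Each pair is (human_rating, judge_rating). Pairs whose human rating is
    None are skipped, because a missing rating is not the same as a 0.
    Returns None when no pair has a human rating.
    """
    scored = [(human, judge) for human, judge in pairs if human is not None]
    if not scored:
        return None
    matches = sum(human == judge for human, judge in scored)
    return matches / len(scored)
```

If the missing rating in `[(0, 0), (1, 0), (None, 0), (0, 1)]` were stored as 0 instead, it would be counted as a match with the judge's 0 and raise the reported accuracy from 1/3 to 1/2.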

- Show GEPA score improvement (initial → final) in prompt optimization results card and history, with percentage badge; parse from logs as fallback for older runs without metrics
- Include initial_score/final_score in backend optimization metrics
- Fix sidebar not updating after starting annotation: replace removeQueries (destroys cache) with invalidateQueries (triggers refetch for active observers)
- Rename trace alignment tag from tags.label='align' to tags.annotation_status='align'; keep tags.label='eval' unchanged

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…default

- Fix list-scorers endpoint to use correct MLflow setup pattern (set_tracking_uri('databricks') + set_experiment) matching alignment and optimization services so registered judges are actually returned
- Add high-water mark for sidebar phase navigation so post-annotation phases stay green when clicking back to earlier phases
- Gate sidebar blue/green visuals behind isAnnotationComplete
- Change default candidates per iteration from 5 to 3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…sign

- Add judge_names list and max_traces to PromptOptimizationRequest model
- Make target_endpoint required (not optional)
- Backend uses request judge_names over rubric-derived ones when provided
- Add cancel endpoint and stale job detection for running optimizations
- Pass max_traces through to _build_train_data
- Redesign Prompt Optimization page: two-column layout with enriched cards, cohesive teal/sky/emerald/blue color palette, polished card headers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
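Stale job detection of the kind mentioned above might look something like the sketch below. Everything here is illustrative: the `is_stale` helper, the "running" status string, and the five-minute default are assumptions (the PR states a five-minute cutoff only for auto-eval jobs, not for optimization runs).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical default; the PR does not state the cutoff for optimization runs.
STALE_AFTER = timedelta(minutes=5)


def is_stale(
    status: str,
    last_update: datetime,
    now: Optional[datetime] = None,
    stale_after: timedelta = STALE_AFTER,
) -> bool:
    """True when a job still marked "running" has not updated recently.

    last_update and now must both be timezone-aware datetimes.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return status == "running" and (now - last_update) > stale_after
```

A caller could use a check like this when listing history, marking matching rows as failed or expired so the UI does not try to reconnect to a job that no longer exists.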

…tory UI

- Start Optimization button uses default primary color (matches login)
- Default iterations changed from 3 to 2 (frontend + backend)
- History section redesigned: clean divider-based rows, outline status badges, dot-separated metadata, chevron expand indicators, rounded-lg containers, and consistent spacing with the rest of the page

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Append 'Z' suffix to created_at ISO strings so JavaScript's Date constructor treats them as UTC and correctly converts to local timezone.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
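On the backend side, the fix amounts to serializing timestamps with an explicit UTC marker, since JavaScript parses an ISO date-time string with no offset as local time. A minimal sketch, assuming naive created_at values are already in UTC (the helper name `to_utc_iso` is hypothetical):

```python
from datetime import datetime, timezone


def to_utc_iso(created_at: datetime) -> str:
    """Serialize a timestamp as ISO 8601 with an explicit 'Z' (UTC) suffix.

    Naive datetimes are assumed to already be in UTC; aware datetimes are
    converted to UTC first.
    """
    if created_at.tzinfo is None:
        created_at = created_at.replace(tzinfo=timezone.utc)
    else:
        created_at = created_at.astimezone(timezone.utc)
    # Drop the "+00:00" offset that isoformat() would emit and use "Z" instead.
    return created_at.replace(tzinfo=None).isoformat() + "Z"
```

With the suffix in place, the frontend can pass the string straight to `new Date(...)` and format it in the viewer's local timezone.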

Cherry-pick IRR-specific files from feature/gepa-prompt-optimization:

- pairwise_agreement.py: pairwise agreement % as primary IRR metric
- fleiss_kappa.py: Fleiss' Kappa for multi-rater reliability
- irr_service.py: rewritten to use pairwise agreement as the primary metric
- IRRResultsDemo.tsx: UI updates for pairwise display
- All associated tests (63 passing)

Skips GEPA prompt optimization files (unrelated scope).

vivian-xie-db and others added 10 commits February 11, 2026 12:37

fix: candidates per iteration slider max from 20 to 10

57c02bc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: optimization history timestamps show UTC instead of local time

5eeb60a


vivian-xie-db wants to merge 10 commits into main from feature/gepa-prompt-optimization
