feat: GEPA prompt optimization and pairwise agreement IRR #111

Open
vivian-xie-db wants to merge 10 commits into main from feature/gepa-prompt-optimization

@vivian-xie-db
Collaborator

Summary

  • GEPA Prompt Optimization page: Full UI for running prompt optimization jobs — text/URI prompt input, multi-select scorers, configurable parameters (iterations, candidates, max traces), real-time log streaming, job cancellation, auto-reconnect, and optimization history with expandable results
  • Pairwise Agreement IRR: Replaced Fleiss' Kappa with pairwise agreement percentage for inter-rater reliability, supporting binary and Likert rubric scales
  • Backend: New PromptOptimizationService with end-to-end GEPA integration, stale job detection, cancel endpoint, judge_names multi-select, max_traces parameter, and UTC timestamp fix for history
  • UI polish: Cohesive teal/sky/emerald/blue card palette, modernized history section, enriched card design with accent bars

Test plan

  • Run `just test-server`: all Python unit tests pass
  • Run `just ui-test-unit`: all React unit tests pass
  • Verify prompt optimization page loads and displays all cards
  • Test start/stop optimization flow with a live endpoint
  • Verify optimization history shows correct local timestamps
  • Check pairwise agreement IRR displays correctly on Results Review page

🤖 Generated with Claude Code

vivian-xie-db and others added 10 commits February 11, 2026 12:37
- Add prompt optimization page with GEPA (mlflow.genai.optimize_prompts)
  supporting custom target endpoints, auto-format detection (chat vs agent),
  URL parsing, and clean training data extraction from traces
- Replace Krippendorff's Alpha with pairwise agreement percentage for IRR
  (adjacent agreement for Likert, exact for binary, 75% threshold)
- Add Copy and Download buttons for optimization logs
- Add prompt_optimization_runs table with migrations
- Add pairwise_agreement module with comprehensive tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
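
The agreement rule described in this commit (exact match for binary rubrics, adjacent match for Likert rubrics, 75% threshold) can be sketched as follows. This is a minimal illustration, not the contents of the pairwise_agreement module: the function name, the input shape, the reading of "adjacent" as "within one scale point", and the use of `>=` for the threshold are all assumptions.

```python
from itertools import combinations


def pairwise_agreement(ratings_by_item, scale="likert", threshold=75.0):
    """Return (agreement_percent, meets_threshold) over all rater pairs.

    ratings_by_item: one inner list per item, holding the ratings that
    different raters gave to that item (hypothetical input shape).
    """
    agree = 0
    total = 0
    for ratings in ratings_by_item:
        # Compare every unordered pair of raters on the same item.
        for a, b in combinations(ratings, 2):
            total += 1
            if scale == "binary":
                # Binary rubrics: only an exact match counts.
                agree += a == b
            else:
                # Likert rubrics: assume "adjacent" means within one point.
                agree += abs(a - b) <= 1
    if total == 0:
        # No item has two or more ratings, so agreement is undefined.
        return None, False
    percent = 100.0 * agree / total
    return percent, percent >= threshold


# Three of four binary pairs match, which sits exactly on the threshold.
print(pairwise_agreement([[1, 1], [0, 1], [1, 1], [0, 0]], scale="binary"))
```

Counting pairs rather than items means an item rated by three people contributes three comparisons, so heavily rated items weigh more; whether the module weights items this way is not stated in the commit message.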
When navigating away from the Prompt Optimization page and back, the
component state was lost (logs, status, original prompt). Now on mount,
the page checks history for any "running" job and automatically
reconnects — restoring all logs, the original prompt text, and resuming
the polling loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rsistence

- Replace Fleiss' Kappa with GDPval A^HH formula (E[1-|H1-H2|]) for human
  inter-rater agreement, supporting 2+ raters with normalized [0,1] ratings
- Compute judge performance metrics (Cohen's κ, accuracy) on-the-fly from
  annotations when auto-eval stores evaluations without human_rating
- Re-enable performance metrics bar on Judge Tuning page for saved prompts
- Persist Prompt Optimization config (prompt, UC catalog/schema, model) to
  localStorage so form data survives page navigation
- Restore full config (model, iterations, candidates, target endpoint, UC
  fields) from optimization history on reconnect
- Add target_endpoint column to prompt_optimization_runs (migration 0014)
- Add 30 unit tests for GDPval agreement calculation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
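
The E[1-|H1-H2|] expression named in this commit can be illustrated with a short sketch. It assumes ratings are normalized to [0,1] by min-max scaling over the rubric range and that the expectation is taken as the mean over items of the mean over rater pairs; both choices, and every name below, are assumptions rather than the code added in the PR.

```python
from itertools import combinations
from statistics import mean


def normalize(rating, low, high):
    """Map a raw rating onto [0, 1] given the rubric's range."""
    return (rating - low) / (high - low)


def human_agreement(ratings_by_item, low=1, high=5):
    """Estimate E[1 - |H1 - H2|] from per-item human ratings.

    Items with fewer than two ratings are skipped because no pair exists.
    """
    per_item = []
    for ratings in ratings_by_item:
        if len(ratings) < 2:
            continue
        norm = [normalize(r, low, high) for r in ratings]
        # Average 1 - |h1 - h2| over every unordered pair of raters.
        per_item.append(mean(1 - abs(a - b) for a, b in combinations(norm, 2)))
    return mean(per_item) if per_item else None


# Raters one point apart on a 1-5 scale differ by 0.25 after normalization.
print(human_agreement([[2, 3]]))
```

For a binary rubric, calling `human_agreement(items, low=0, high=1)` makes the score equal to the fraction of matching pairs per item.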
…upling

- Store human_rating as None (not 0) when missing, fixing ambiguity for
  binary judges where 0 is a valid rating
- Always prefer annotation-sourced human ratings over evaluation records
- Detect and auto-expire stale auto-eval jobs stuck running for >5 minutes
- Correctly use 0/1 labels for binary judges in confusion matrix and
  agreement-by-rating (was incorrectly using 1-5 Likert labels)
- Decouple background auto-eval polling from button disabled states on
  JudgeTuningPage; show informational badge instead of blocking actions
- Fix optimizer model name display with backend-to-frontend mapping;
  default to Claude Opus 4.5 for prompt optimization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
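
The stale-job rule in this commit (auto-eval jobs stuck in "running" for more than 5 minutes are expired) might look roughly like the sketch below. The job dictionary shape, the field names, and the "expired" status value are hypothetical; only the 5-minute cutoff comes from the commit message.

```python
from datetime import datetime, timedelta, timezone

# Cutoff taken from the commit message; everything else is illustrative.
STALE_AFTER = timedelta(minutes=5)


def expire_if_stale(job, now=None):
    """Return a copy of the job marked expired if it has run too long.

    job: dict with hypothetical keys "status" and "started_at"
    (a timezone-aware datetime).
    """
    now = now or datetime.now(timezone.utc)
    if job["status"] == "running" and now - job["started_at"] > STALE_AFTER:
        # Leave the input untouched and return an updated copy.
        return {**job, "status": "expired"}
    return job


now = datetime(2026, 2, 11, 12, 0, tzinfo=timezone.utc)
stuck = {"status": "running", "started_at": now - timedelta(minutes=10)}
print(expire_if_stale(stuck, now=now)["status"])
```

Passing `now` explicitly keeps the check deterministic in unit tests.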
- Show GEPA score improvement (initial → final) in prompt optimization
  results card and history, with percentage badge; parse from logs as
  fallback for older runs without metrics
- Include initial_score/final_score in backend optimization metrics
- Fix sidebar not updating after starting annotation: replace
  removeQueries (destroys cache) with invalidateQueries (triggers
  refetch for active observers)
- Rename trace alignment tag from tags.label='align' to
  tags.annotation_status='align'; keep tags.label='eval' unchanged

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…default

- Fix list-scorers endpoint to use correct MLflow setup pattern
  (set_tracking_uri('databricks') + set_experiment) matching alignment
  and optimization services so registered judges are actually returned
- Add high-water mark for sidebar phase navigation so post-annotation
  phases stay green when clicking back to earlier phases
- Gate sidebar blue/green visuals behind isAnnotationComplete
- Change default candidates per iteration from 5 to 3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sign

- Add judge_names list and max_traces to PromptOptimizationRequest model
- Make target_endpoint required (not optional)
- Backend uses request judge_names over rubric-derived ones when provided
- Add cancel endpoint and stale job detection for running optimizations
- Pass max_traces through to _build_train_data
- Redesign Prompt Optimization page: two-column layout with enriched cards,
  cohesive teal/sky/emerald/blue color palette, polished card headers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tory UI

- Start Optimization button uses default primary color (matches login)
- Default iterations changed from 3 to 2 (frontend + backend)
- History section redesigned: clean divider-based rows, outline status
  badges, dot-separated metadata, chevron expand indicators, rounded-lg
  containers, and consistent spacing with the rest of the page

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Append 'Z' suffix to created_at ISO strings so JavaScript's Date
constructor treats them as UTC and correctly converts to local timezone.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
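
The timestamp fix matters because JavaScript parses an ISO date-time string without an offset as local time, so a UTC value stored without a suffix is displayed shifted by the viewer's offset. A minimal server-side sketch of the serialization follows; the helper name is hypothetical, and it assumes that naive `created_at` values read from the database are already in UTC.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(created_at):
    """Serialize a datetime as an ISO 8601 string with an explicit 'Z'."""
    if created_at.tzinfo is None:
        # Assumption: naive values are stored in UTC, so only the
        # suffix is missing.
        return created_at.isoformat() + "Z"
    # Aware values are converted to UTC first, then given the same suffix.
    return created_at.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")


naive = datetime(2026, 2, 11, 12, 37, 0)
aware = datetime(2026, 2, 11, 13, 37, 0, tzinfo=timezone(timedelta(hours=1)))
print(to_utc_iso(naive))
print(to_utc_iso(aware))
```

Both calls produce the same string, which the browser can then convert to the viewer's local timezone.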
forrestmurray-db added a commit that referenced this pull request Apr 17, 2026
Cherry-pick IRR-specific files from feature/gepa-prompt-optimization:
- pairwise_agreement.py: pairwise agreement % as primary IRR metric
- fleiss_kappa.py: Fleiss' Kappa for multi-rater reliability
- irr_service.py: rewritten to use pairwise agreement primary
- IRRResultsDemo.tsx: UI updates for pairwise display
- All associated tests (63 passing)

Skips GEPA prompt optimization files (unrelated scope).