From 7a5fa40d57b665b8d03b6c3ed9ea205271094f0e Mon Sep 17 00:00:00 2001
From: Wenwen Xie
Date: Thu, 12 Feb 2026 19:52:05 -0500
Subject: [PATCH 1/2] docs: Add GEPA prompt optimization and human agreement specs

Add declarative specifications for two key features:
- PROMPT_OPTIMIZATION_SPEC.md: GEPA optimizer pipeline, training data,
  score normalization, predict_fn, config persistence, auto-reconnect
- HUMAN_AGREEMENT_SPEC.md: GDPVal A^HH human-to-human agreement metric,
  rating normalization, pairwise agreement %, IRR integration

Co-Authored-By: Claude Opus 4.6
---
 specs/HUMAN_AGREEMENT_SPEC.md     | 340 ++++++++++++++++++++++++++++
 specs/PROMPT_OPTIMIZATION_SPEC.md | 358 ++++++++++++++++++++++++++++++
 2 files changed, 698 insertions(+)
 create mode 100644 specs/HUMAN_AGREEMENT_SPEC.md
 create mode 100644 specs/PROMPT_OPTIMIZATION_SPEC.md

diff --git a/specs/HUMAN_AGREEMENT_SPEC.md b/specs/HUMAN_AGREEMENT_SPEC.md
new file mode 100644
index 00000000..6e80e898
--- /dev/null
+++ b/specs/HUMAN_AGREEMENT_SPEC.md
@@ -0,0 +1,340 @@
+# Human Agreement Specification (GDPVal A^HH)
+
+## Overview
+
+This specification defines the GDPVal Human Inter-Rater Agreement system for the Human Evaluation Workshop. Based on the GDPVal paper (OpenAI), it measures **human-to-human** agreement between SME annotators using the A^HH metric — a normalized pairwise agreement score in [0, 1]. This is used alongside pairwise agreement percentages to determine whether annotators are calibrated before proceeding to judge alignment.
+
+## Key Distinction
+
+| Metric | What It Measures | Where It's Used |
+|--------|-----------------|-----------------|
+| **GDPVal A^HH** (this spec) | Human vs Human agreement | IRR Results page |
+| Pairwise Agreement % | Human vs Human agreement (percentage) | IRR Results page |
+| Cohen's Kappa | Judge vs Human agreement | Judge Tuning page |
+
+GDPVal A^HH and Pairwise Agreement % both measure inter-rater reliability between human annotators, but use different formulas. Cohen's Kappa is a separate metric used on the Judge Tuning page.
+
+## Position in Workshop Pipeline
+
+```
+┌──────────────┐    ┌──────────────┐    ┌──────────────┐
+│  Annotation  │    │     IRR      │    │    Judge     │
+│    Phase     │───▶│   Analysis   │───▶│    Tuning    │
+│  (2+ SMEs)   │    │   (GDPVal)   │    │ (Alignment)  │
+└──────────────┘    └──────────────┘    └──────────────┘
+  Multiple SMEs       A^HH score          Only proceed if
+  rate same traces    per rubric Q        humans agree
+```
+
+GDPVal is a quality gate: if human annotators don't agree with each other, aligning a judge to them would be unreliable.
+
+## Core Concepts
+
+### GDPVal A^HH Formula
+
+For a given sample (trace) `s`, with human scores `H_1, H_2` normalized to [0, 1]:
+
+```
+A_s^HH = E[1 - |H_1 - H_2|]
+```
+
+Estimated by the empirical mean over all pairs of ratings for that sample. The final score is the mean of sample-level scores over all samples with at least two human raters.
+
+### Rating Normalization
+
+Ratings are normalized to [0, 1] before computing A^HH:
+
+| Scale | Raw Range | Normalization | Examples |
+|-------|-----------|---------------|----------|
+| Likert | 1-5 | `(rating - 1) / 4` | 1→0.0, 3→0.5, 5→1.0 |
+| Binary | 0-1 | As-is | 0→0.0, 1→1.0 |
+
+### Score Interpretation
+
+| A^HH Score | Interpretation |
+|-----------|----------------|
+| >= 0.90 | Excellent agreement |
+| >= 0.75 | Good agreement |
+| >= 0.60 | Moderate agreement |
+| >= 0.50 | Fair agreement |
+| < 0.50 | Poor agreement |
+
+### Worked Examples
+
+**Perfect agreement (Likert):**
+```
+Trace 1: raters give [4, 4, 4] → normalized [0.75, 0.75, 0.75]
+  All pairs: |0.75-0.75| = 0 → 1 - 0 = 1.0 for each pair
+  Sample score = 1.0
+A^HH = 1.0
+```
+
+**Adjacent ratings (Likert):**
+```
+Trace 1: [3, 4] → normalized [0.5, 0.75]
+  |0.5 - 0.75| = 0.25 → 1 - 0.25 = 0.75
+Trace 2: [2, 3] → normalized [0.25, 0.5]
+  |0.25 - 0.5| = 0.25 → 1 - 0.25 = 0.75
+A^HH = (0.75 + 0.75) / 2 = 0.75
+```
+
+**Maximum disagreement (Likert):**
+```
+Trace 1: [1, 5] → normalized [0.0, 1.0]
+  |0.0 - 1.0| = 1.0 → 1 - 1.0 = 0.0
+A^HH = 0.0
+```
+
+**Binary partial agreement (3 raters):**
+```
+Trace 1: [1, 1, 0] → already [0, 1]
+  Pairs: |1-1|=0, |1-0|=1, |1-0|=1
+  1-diff: 1.0, 0.0, 0.0 → mean = 1/3
+Trace 2: [0, 0, 1]
+  Pairs: |0-0|=0, |0-1|=1, |0-1|=1
+  1-diff: 1.0, 0.0, 0.0 → mean = 1/3
+A^HH = (1/3 + 1/3) / 2 = 1/3 ≈ 0.333
+```
+
+## Relationship to Pairwise Agreement Percentage
+
+Both metrics appear on the IRR Results page. They measure the same thing (human-human agreement) but differently:
+
+| Aspect | GDPVal A^HH | Pairwise Agreement % |
+|--------|-------------|---------------------|
+| Range | [0, 1] | [0, 100]% |
+| Formula | `E[1 - \|H_1 - H_2\|]` (normalized) | `agreeing_pairs / total_pairs × 100` |
+| Agreement check | Continuous (uses distance) | Discrete (exact or ±1 match) |
+| Likert primary | Uses full distance | Adjacent agreement (within ±1) |
+| Binary primary | Same as exact (0/1 distance) | Exact agreement |
+| Threshold | >= 0.75 = Good | >= 75% = Ready to proceed |
+
+### Integration in IRR Service
+
+GDPVal A^HH is computed alongside pairwise agreement in `irr_service.py`:
+
+```python
+# irr_service.py: calculate_irr_for_workshop()
+result = _calculate_pairwise_agreement_result(annotations, analysis)  # Pairwise %
+
+# Then add GDPVal A^HH per metric
+ha_per_metric = calculate_human_agreement_per_metric(annotations)
+for question_id, ha_score in ha_per_metric.items():
+    result['per_metric_scores'][question_id]['human_agreement'] = ha_score
+
+# Overall A^HH = average across metrics
+result['human_agreement'] = mean(ha_per_metric.values())
+```
+
+## Binary Detection
+
+Automatic detection of binary vs Likert scale from actual rating values:
+
+```python
+def _detect_binary(ratings: List[int]) -> bool:
+    return all(r in (0, 1) for r in ratings)
+```
+
+This affects normalization:
+- **Binary detected**: ratings used as-is (already [0, 1])
+- **Likert detected**: ratings normalized via `(rating - 1) / 4`
+
+## Data Model
+
+### Input: Annotations
+
+```python
+class Annotation:
+    trace_id: str
+    user_id: str
+    rating: int  # Legacy single rating (1-5)
+    ratings: Dict[str, int]  # Per-question ratings {"q_uuid": 4, ...}
+```
+
+### Output: Per-Metric Result
+
+Each metric in `per_metric_scores` includes:
+
+```python
+{
+    'score': 85.0,  # Primary pairwise % (adjacent for Likert, exact for binary)
+    'exact_agreement': 40.0,  # Pairwise exact %
+    'adjacent_agreement': 85.0,  # Pairwise adjacent %
+    'human_agreement': 0.812,  # GDPVal A^HH score [0, 1]
+    'interpretation': 'Good agreement',
+    'acceptable': True,
+    'is_binary': False,
+    'krippendorff_alpha': 0.234,  # Secondary detail
+    'suggestions': [],
+}
+```
+
+### Output: Overall Result
+
+```python
+{
+    'metric_used': 'Pairwise Agreement',
+    'score': 82.5,  # Overall pairwise %
+    'human_agreement': 0.812,  # Overall A^HH (average across metrics)
+    'ready_to_proceed': True,  # score >= 75.0
+    'threshold': 75.0,
+    'per_metric_scores': { ... },  # Per-question breakdown (includes A^HH)
+    'problematic_patterns': [ ... ],  # Detected issues
+    'num_raters': 3,
+    'num_traces': 10,
+}
+```
+
+## API Endpoint
+
+### Calculate IRR
+
+```
+GET /workshops/{workshop_id}/irr
+```
+
+Response: `IRRResult` with `details` containing both pairwise agreement and GDPVal A^HH scores.
+
+## Frontend Display
+
+### File: `client/src/pages/IRRResultsDemo.tsx`
+
+A^HH is displayed as the **primary score** per rubric question:
+
+```
+┌─────────────────────────────────────────────────────┐
+│  Human Agreement A^HH (GDPval)                      │
+│                                                     │
+│  0.812                                              │
+│  Good agreement                                     │
+│                                                     │
+│  Score of 1.0 = raters always agree.                │
+│  Score of 0.0 = maximum disagreement.               │
+│  Ratings normalized to [0, 1] scale.                │
+└─────────────────────────────────────────────────────┘
+```
+
+### Color Coding
+
+| A^HH Score | Color | Label |
+|-----------|-------|-------|
+| >= 0.75 | Green | Good/Excellent |
+| >= 0.60 | Yellow | Moderate |
+| >= 0.50 | Orange | Fair |
+| < 0.50 | Red | Poor |
+
+### Fallback Display
+
+When `human_agreement` is `null` (insufficient data), the page falls back to displaying the pairwise agreement percentage instead.
+
+## Implementation Files
+
+| File | Role |
+|------|------|
+| `server/services/fleiss_kappa.py` | GDPVal A^HH calculation (`calculate_human_agreement`, `calculate_human_agreement_per_metric`) |
+| `server/services/irr_service.py` | Orchestration: integrates A^HH into IRR results |
+| `server/services/pairwise_agreement.py` | Pairwise agreement % (companion metric) |
+| `server/services/irr_utils.py` | Validation, formatting, problematic pattern detection |
+| `server/services/krippendorff_alpha.py` | Krippendorff's Alpha (secondary detail) |
+| `client/src/pages/IRRResultsDemo.tsx` | Frontend: displays A^HH as primary per-metric score |
+| `tests/unit/services/test_fleiss_kappa.py` | Tests for A^HH calculation (20+ tests) |
+
+## Algorithm Detail
+
+### Per-Sample Calculation
+
+```python
+def calculate_human_agreement(annotations, question_id=None):
+    # 1. Group ratings by trace
+    traces = group_by_trace(annotations, question_id)
+
+    # 2. Detect binary scale
+    is_binary = all(r in (0, 1) for r in all_ratings)
+
+    # 3. Per-sample: enumerate all rater pairs
+    sample_scores = []
+    for trace_id, ratings in traces.items():
+        if len(ratings) < 2:
+            continue  # Need 2+ raters
+
+        normalized = [normalize(r, is_binary) for r in ratings]
+
+        # All unique pairs: N*(N-1)/2
+        pair_scores = []
+        for i in range(len(normalized)):
+            for j in range(i + 1, len(normalized)):
+                pair_scores.append(1.0 - abs(normalized[i] - normalized[j]))
+
+        sample_scores.append(mean(pair_scores))
+
+    # 4. A^HH = mean across all samples
+    return mean(sample_scores)
+```
+
+### Per-Metric Calculation
+
+Computes A^HH independently for each rubric question:
+
+```python
+def calculate_human_agreement_per_metric(annotations):
+    question_ids = collect_all_question_ids(annotations)
+    return {qid: calculate_human_agreement(annotations, question_id=qid)
+            for qid in question_ids}
+```
+
+## Edge Cases
+
+- **Single rater per trace**: trace is excluded (need 2+ raters)
+- **Single annotation total**: returns `None`
+- **No ratings dict**: `calculate_human_agreement_per_metric` returns `{}`
+- **Mixed scales across questions**: each question detected independently
+- **All raters identical**: returns 1.0 (perfect agreement)
+- **4+ raters**: all unique pairs enumerated (N*(N-1)/2 pairs)
+
+## Success Criteria
+
+- [ ] A^HH correctly computed using `E[1 - |H_1 - H_2|]` formula
+- [ ] Likert ratings normalized via `(rating - 1) / 4` to [0, 1]
+- [ ] Binary ratings used as-is (already [0, 1])
+- [ ] Binary auto-detected from actual rating values
+- [ ] Per-question A^HH computed independently
+- [ ] Overall A^HH = average across per-question scores
+- [ ] Traces with < 2 raters excluded from calculation
+- [ ] Multi-rater traces enumerate all N*(N-1)/2 pairs
+- [ ] A^HH displayed as primary score on IRR Results page
+- [ ] Falls back to pairwise % when A^HH unavailable
+- [ ] Integrated into `per_metric_scores` alongside pairwise agreement
+
+## Testing Scenarios
+
+### Test 1: Perfect Agreement (Likert)
+1. All raters give same ratings on all traces
+2. Verify A^HH = 1.0
+
+### Test 2: Maximum Disagreement (Likert)
+1. Raters give 1 and 5 on all traces
+2. Verify A^HH = 0.0
+
+### Test 3: Adjacent Ratings
+1. Raters differ by 1 on Likert scale (e.g., 3 vs 4)
+2. Verify A^HH = 0.75
+
+### Test 4: Binary Perfect Agreement
+1. All raters agree on Pass/Fail per trace
+2. Verify A^HH = 1.0 (using `question_id` parameter)
+
+### Test 5: Binary Complete Disagreement
+1. One rater says Pass, other says Fail on all traces
+2. Verify A^HH = 0.0
+
+### Test 6: Three Raters Known Value
+1. Trace 1: [3, 4, 5], Trace 2: [1, 1, 2]
+2. Verify A^HH = 0.75 (hand-calculated)
+
+### Test 7: Per-Metric Independence
+1. Two questions: q_1 has perfect agreement, q_2 has max disagreement
+2. Verify A^HH(q_1) = 1.0, A^HH(q_2) = 0.0
+
+### Test 8: Single-Rater Traces Excluded
+1. Some traces have only 1 rater
+2. Verify those traces are excluded, result only from multi-rater traces
diff --git a/specs/PROMPT_OPTIMIZATION_SPEC.md b/specs/PROMPT_OPTIMIZATION_SPEC.md
new file mode 100644
index 00000000..ba658453
--- /dev/null
+++ b/specs/PROMPT_OPTIMIZATION_SPEC.md
@@ -0,0 +1,358 @@
+# Prompt Optimization Specification (GEPA)
+
+## Overview
+
+This specification defines the GEPA (Genetic-Pareto) prompt optimization system for the Human Evaluation Workshop. GEPA iteratively improves an agent's system prompt by using human evaluation feedback as training data and aligned judges as scorers via `mlflow.genai.optimize_prompts()`.
+
+## MLflow Integration Context
+
+### Position in Workshop Pipeline
+
+```
+┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
+│  Annotation  │    │    Judge     │    │    Prompt    │    │   Improved   │
+│    Phase     │───▶│    Tuning    │───▶│  Optimization│───▶│    Agent     │
+│ (Human SMEs) │    │ (Alignment)  │    │    (GEPA)    │    │    Prompt    │
+└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
+  Collect ratings     Align judge(s)      Optimize prompt     Deploy via UC
+  & annotations       to human scores     using judges        prompt alias
+```
+
+GEPA requires:
+1. **Human annotations** — training data from the annotation phase
+2. **Aligned judges** — scorers created during judge tuning/alignment
+3. **Agent prompt** — the system prompt to optimize (entered directly or loaded from MLflow)
+
+### MLflow API Dependencies
+
+| API | Usage |
+|-----|-------|
+| `mlflow.genai.register_prompt()` | Register user-entered prompt to MLflow/UC |
+| `mlflow.genai.load_prompt()` | Load prompt for GEPA interception |
+| `mlflow.genai.optimize_prompts()` | Core GEPA optimization call |
+| `mlflow.genai.set_prompt_alias()` | Set "champion" alias on optimized version |
+| `mlflow.genai.scorers.get_scorer()` | Load aligned judge as scorer |
+| `mlflow.search_traces()` | Find annotated traces for training data |
+
+## Core Concepts
+
+### GEPA Optimizer
+- Uses `GepaPromptOptimizer` from `mlflow.genai.optimize`
+- Evolutionary approach: generates candidate prompts, evaluates them, selects best
+- Configurable iterations (1-10) and candidates per iteration (2-20)
+- `reflection_model` parameter controls which LLM generates candidate prompts
+- `max_metric_calls` = iterations x candidates x max(dataset_size, 5)
+
+### Prompt Input Modes
+Two ways to provide the agent prompt:
+1. **Direct text entry** — paste prompt text, optionally register to UC (catalog.schema.name)
+2. **MLflow URI** — load existing registered prompt (e.g., `prompts:/main.my_schema.agent_prompt/1`)
+
+### Training Data
+Built from human-annotated traces:
+- Primary: traces tagged `tags.annotation_status = 'align'`
+- Fallback: traces tagged `tags.label = 'eval'`
+- Format: `{"inputs": {"request": user_message}, "outputs": agent_response}`
+- User messages extracted from trace request (handles JSON `messages` array or plain text)
+
+### Score Normalization
+Judges return different scales. The `aggregation_fn` normalizes to 0-1 for GEPA:
+
+| Judge Scale | Raw Range | Normalization |
+|-------------|-----------|---------------|
+| Binary | 0 or 1 | As-is (already 0-1) |
+| Likert | 1-5 | Divide by 5.0 |
+| Percentage | >5 | Divide by 100.0 |
+
+When multiple judges are used, scores are averaged across all judges to produce a composite score.
+
+## Data Model
+
+### PromptOptimizationRun (Database)
+
+```sql
+CREATE TABLE prompt_optimization_runs (
+    id VARCHAR PRIMARY KEY,
+    workshop_id VARCHAR NOT NULL REFERENCES workshops(id),
+    job_id VARCHAR NOT NULL,
+    prompt_uri VARCHAR NOT NULL,
+    original_prompt TEXT,
+    optimized_prompt TEXT,
+    optimized_version INTEGER,
+    optimized_uri VARCHAR,
+    optimizer_model VARCHAR,
+    num_iterations INTEGER,
+    num_candidates INTEGER,
+    target_endpoint VARCHAR,
+    metrics TEXT,  -- JSON: {original_length, optimized_length, num_iterations, num_candidates, train_data_size, initial_score?, final_score?}
+    status VARCHAR DEFAULT 'pending',  -- pending | running | completed | failed
+    error TEXT,
+    created_at DATETIME,
+    updated_at DATETIME
+);
+```
+
+### PromptOptimizationRequest (API)
+
+```python
+class PromptOptimizationRequest(BaseModel):
+    prompt_text: Optional[str]  # Direct prompt text (alternative to URI)
+    prompt_uri: Optional[str]  # MLflow prompt URI (alternative to text)
+    prompt_name: Optional[str]  # Name for UC registration
+    uc_catalog: Optional[str]  # Unity Catalog catalog
+    uc_schema: Optional[str]  # Unity Catalog schema
+    optimizer_model_name: str  # Default: 'databricks-claude-sonnet-4-5'
+    num_iterations: int  # 1-10, default: 3
+    num_candidates: int  # 2-20, default: 5
+    judge_name: Optional[str]  # Fallback judge name
+    target_endpoint: Optional[str]  # Custom serving endpoint
+```
+
+### Metrics Dict
+
+```json
+{
+  "original_length": 500,
+  "optimized_length": 720,
+  "num_iterations": 3,
+  "num_candidates": 5,
+  "train_data_size": 10,
+  "initial_score": 0.680,
+  "final_score": 0.800
+}
+```
+
+## API Endpoints
+
+### Start Optimization
+
+```
+POST /workshops/{workshop_id}/start-prompt-optimization
+```
+
+Request: `PromptOptimizationRequest` body
+
+Response:
+```json
+{
+  "job_id": "uuid",
+  "status": "running"
+}
+```
+
+Behavior:
+1. Validates prompt input (text or URI required)
+2. Loads MLflow config and Databricks token
+3. Discovers aligned judges from rubric questions
+4. Creates DB record and in-memory job
+5. Spawns background thread for optimization
+6. Returns job_id immediately for polling
+
+### Poll Job Status
+
+```
+GET /workshops/{workshop_id}/prompt-optimization-job/{job_id}?since_log_index={n}
+```
+
+Response:
+```json
+{
+  "status": "running|completed|failed",
+  "logs": ["log line 1", "log line 2"],
+  "log_count": 42,
+  "result": { ... },
+  "error": "..."
+}
+```
+
+- `since_log_index` enables incremental log fetching (only new logs since last poll)
+- Frontend polls every 2 seconds while job is running
+
+### Get History
+
+```
+GET /workshops/{workshop_id}/prompt-optimization-history
+```
+
+Response: Array of `PromptOptimizationRun` objects, newest first.
+
+## Optimization Pipeline
+
+### Step-by-Step Flow
+
+```
+1. Setup MLflow environment
+   |-- Set DATABRICKS_HOST, TOKEN (or OAuth)
+   +-- Set experiment ID
+
+2. Load/register prompt
+   |-- Direct text -> mlflow.genai.register_prompt() -> get URI
+   +-- MLflow URI -> mlflow.genai.load_prompt()
+
+3. Build training data
+   |-- Search traces: tags.annotation_status='align'
+   |-- Fallback: tags.label='eval'
+   +-- Extract request/response pairs
+
+4. Load aligned judges as scorers
+   |-- Get judge names from rubric questions
+   +-- mlflow.genai.scorers.get_scorer() for each
+
+5. Create GEPA optimizer
+   +-- GepaPromptOptimizer(reflection_model=model_uri)
+
+6. Define predict_fn
+   |-- load_prompt() (GEPA intercepts to swap candidates)
+   |-- format() triggers GEPA interception
+   +-- Call model (OpenAI client or custom endpoint)
+
+7. Define aggregation_fn
+   +-- Normalize judge scores to 0-1 range
+
+8. Run mlflow.genai.optimize_prompts()
+   |-- Yields log messages during execution
+   +-- Returns PromptOptimizationResult
+
+9. Extract result
+   |-- Get optimized prompt text from result
+   |-- Register if GEPA didn't auto-register
+   +-- Set "champion" alias
+
+10. Return final result dict with metrics
+```
+
+### predict_fn Behavior
+
+The predict function is called by GEPA for each (candidate_prompt, training_example) pair:
+
+1. `mlflow.genai.load_prompt(uri)` — GEPA intercepts this to substitute candidate prompts
+2. `prompt.format()` — triggers GEPA's prompt interception
+3. Build messages: `[{role: "system", content: candidate_prompt}, {role: "user", content: request}]`
+4. Route to model:
+   - **Custom endpoint**: auto-detect chat vs agent format (try `messages` key, fall back to `input` key)
+   - **Default**: Databricks OpenAI client `chat.completions.create()`
+
+### Custom Endpoint Auto-Detection
+
+When a target endpoint is provided, the system auto-detects the request format:
+
+| Format | Request Shape | Response Shape |
+|--------|--------------|----------------|
+| Chat (`messages`) | `{"messages": [...], "max_tokens": 1024}` | `{"choices": [{"message": {"content": "..."}}]}` |
+| Agent (`input`) | `{"input": [...], "context": {}}` | `{"output": [{"type": "message", "content": [...]}]}` |
+
+Detection is cached after the first successful call.
+
+### Log Capture
+
+Three-layer log capture ensures all GEPA/DSPy output reaches the frontend:
+
+1. **Python logging handler** — captures `mlflow.genai`, `gepa`, `dspy` loggers
+2. **stdout capture** — DSPy prints iteration scores directly to stdout
+3. **stderr capture** — catches any error output
+
+Important: stdout/stderr capture must be installed BEFORE importing DSPy, which caches `sys.stdout` at import time.
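The stdout/stderr layers of the log capture described above could be sketched as follows. This is a minimal illustration only, assuming an in-memory list stands in for the job's log store; `TeeStream` and `capture_output` are hypothetical names and do not come from the codebase.

```python
import io
import sys
from contextlib import contextmanager


class TeeStream(io.TextIOBase):
    """Forward writes to the wrapped stream and record each completed line."""

    def __init__(self, wrapped, sink, prefix=""):
        self._wrapped = wrapped
        self._sink = sink        # hypothetical log store: a plain list
        self._prefix = prefix
        self._buffer = ""

    def write(self, text):
        self._wrapped.write(text)
        self._buffer += text
        # Only complete lines are recorded; partial output stays buffered.
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            if line.strip():
                self._sink.append(self._prefix + line)
        return len(text)

    def flush(self):
        self._wrapped.flush()


@contextmanager
def capture_output(sink):
    """Install the tee on stdout and stderr, restoring the originals on exit."""
    old_out, old_err = sys.stdout, sys.stderr
    sys.stdout = TeeStream(old_out, sink)
    sys.stderr = TeeStream(old_err, sink, prefix="[stderr] ")
    try:
        yield sink
    finally:
        sys.stdout, sys.stderr = old_out, old_err


logs = []
with capture_output(logs):
    # A library that caches sys.stdout at import time would have to be
    # imported inside this block so that it caches the tee, not the original.
    print("Iteration 1: score 0.68")
    print("partial", end="")
    print(" line")
```

Because the sink is an ordinary list, a poller in this sketch could serve incremental logs with a slice such as `logs[since_log_index:]`.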
+
+## Frontend
+
+### File: `client/src/pages/PromptOptimizationPage.tsx`
+
+### State Management
+
+| State | Purpose |
+|-------|---------|
+| `promptInputMode` | 'text' or 'uri' toggle |
+| `promptText` / `promptUri` | User input |
+| `ucCatalog`, `ucSchema`, `promptName` | UC registration fields |
+| `optimizerModel` | Selected model (display name) |
+| `numIterations`, `numCandidates` | GEPA parameters |
+| `targetEndpoint` | Optional custom endpoint |
+| `jobId`, `jobStatus` | Current job tracking |
+| `jobLogs` | Log lines for display |
+| `jobResult` | Final optimization result |
+| `history` | Past optimization runs |
+
+### Configuration Persistence
+
+All configuration fields are persisted to `localStorage` under key `prompt-opt-config-{workshopId}`:
+- Saves on every change via `useEffect`
+- Restored on mount
+- Survives page navigation (not just refresh)
+
+### Auto-Reconnect
+
+When the page mounts with existing history:
+1. Prefer running job; fall back to most recent completed/failed
+2. Restore configuration from the history entry
+3. Fetch all logs from job store (`since_log_index=0`)
+4. Resume polling if job is still running
+
+### Score Improvement Display
+
+Two sources for score data (prioritized):
+1. **metrics.initial_score / metrics.final_score** — from backend result
+2. **Log parsing fallback** — regex match on `Score improvement: X.XXX -> Y.YYY` in log text
+
+Displayed as:
+```
+Score 0.680 -> 0.800 +12.0%
+```
+
+### Log Viewer
+
+- Dark terminal-style panel (`bg-gray-900`)
+- Color-coded log lines (errors=red, warnings=yellow, sections=violet, iterations=amber, etc.)
+- Auto-scrolls to bottom on new logs
+- Copy and Download buttons for full log text
+- Polling indicator shows entry count and refresh interval
+
+### Access Control
+
+- Only facilitators can configure and run optimization
+- Non-facilitators see a waiting card
+
+## Success Criteria
+
+- [ ] Users can enter prompt text directly or load from MLflow URI
+- [ ] Prompts are registered to Unity Catalog when catalog/schema provided
+- [ ] Training data correctly built from annotated traces
+- [ ] All aligned judges loaded as scorers (one per rubric question)
+- [ ] GEPA optimization runs with real-time log streaming
+- [ ] Score improvement (initial -> final) displayed in results
+- [ ] Optimized prompt registered with "champion" alias
+- [ ] Configuration persists across page navigation
+- [ ] Auto-reconnect to running job after navigation
+- [ ] Optimization history with expandable run details
+- [ ] Custom endpoint support with auto-format detection
+
+## Testing Scenarios
+
+### Test 1: Direct Prompt Entry
+1. Enter agent system prompt text
+2. Set UC catalog and schema
+3. Start optimization
+4. Verify prompt registered to MLflow
+5. Verify optimization completes with score improvement
+
+### Test 2: MLflow URI Load
+1. Enter existing prompt URI
+2. Start optimization
+3. Verify prompt loaded correctly
+4. Verify optimized version saved
+
+### Test 3: Auto-Reconnect
+1. Start optimization
+2. Navigate away from page
+3. Navigate back
+4. Verify job state restored (logs, status, config)
+
+### Test 4: Custom Endpoint
+1. Enter serving endpoint name
+2. Start optimization
+3. Verify format auto-detection (chat vs agent)
+4. Verify evaluation completes
+
+### Test 5: Score Display Fallback
+1. Complete optimization
+2. Verify scores shown from metrics
+3. For older runs without metrics, verify log parsing fallback

From 5913eddf14bd015e32b8318477a7ceb9ff3d9ace Mon Sep 17 00:00:00 2001
From: Wenwen Xie
Date: Sat, 14 Feb 2026 20:28:39 -0500
Subject: [PATCH 2/2] docs: Replace IRR/Krippendorff's Alpha with Cohen's Kappa metrics panel in judge evaluation spec

Remove the generic IRR section (Krippendorff's Alpha, Cohen's Kappa for
rater pairs) from JUDGE_EVALUATION_SPEC and replace with a detailed
Cohen's Kappa Metrics Panel spec covering the judge-vs-human agreement
metrics displayed after evaluation on the Judge Tuning page.

Co-Authored-By: Claude Opus 4.6
---
 specs/HUMAN_AGREEMENT_SPEC.md  |   3 +-
 specs/JUDGE_EVALUATION_SPEC.md | 244 ++++++++++++++++++++++++---------
 2 files changed, 180 insertions(+), 67 deletions(-)

diff --git a/specs/HUMAN_AGREEMENT_SPEC.md b/specs/HUMAN_AGREEMENT_SPEC.md
index 6e80e898..d4d0fb19 100644
--- a/specs/HUMAN_AGREEMENT_SPEC.md
+++ b/specs/HUMAN_AGREEMENT_SPEC.md
@@ -10,9 +10,8 @@ This specification defines the GDPVal Human Inter-Rater Agreement system for the
 |--------|-----------------|-----------------|
 | **GDPVal A^HH** (this spec) | Human vs Human agreement | IRR Results page |
 | Pairwise Agreement % | Human vs Human agreement (percentage) | IRR Results page |
-| Cohen's Kappa | Judge vs Human agreement | Judge Tuning page |
 
-GDPVal A^HH and Pairwise Agreement % both measure inter-rater reliability between human annotators, but use different formulas. Cohen's Kappa is a separate metric used on the Judge Tuning page.
+GDPVal A^HH and Pairwise Agreement % both measure inter-rater reliability between human annotators, but use different formulas.
 
 ## Position in Workshop Pipeline
 
diff --git a/specs/JUDGE_EVALUATION_SPEC.md b/specs/JUDGE_EVALUATION_SPEC.md
index 1cba1019..68694a49 100644
--- a/specs/JUDGE_EVALUATION_SPEC.md
+++ b/specs/JUDGE_EVALUATION_SPEC.md
@@ -403,44 +403,6 @@ def aggregate_feedback(annotations: List[Annotation]) -> float:
     return 1.0 if sum(ratings) > len(ratings) / 2 else 0.0
 ```
 
-## Inter-Rater Reliability (IRR)
-
-### Metrics
-
-| Metric | Use Case | Range |
-|--------|----------|-------|
-| **Krippendorff's Alpha** | Multiple raters, any scale | -1 to 1 |
-| **Cohen's Kappa** | Two raters, categorical | -1 to 1 |
-
-### Interpretation
-
-| Value | Interpretation |
-|-------|----------------|
-| < 0 | Less than chance agreement |
-| 0.0 - 0.20 | Slight agreement |
-| 0.21 - 0.40 | Fair agreement |
-| 0.41 - 0.60 | Moderate agreement |
-| 0.61 - 0.80 | Substantial agreement |
-| 0.81 - 1.00 | Almost perfect agreement |
-
-### Calculation
-
-```python
-from server.services.krippendorff_alpha import calculate_krippendorff_alpha
-from server.services.cohens_kappa import calculate_cohens_kappa
-
-# Krippendorff's Alpha (multiple raters)
-alpha = calculate_krippendorff_alpha(
-    annotations=all_annotations,
-    scale='ordinal'  # or 'nominal' for binary
-)
-
-# Cohen's Kappa (two raters)
-kappa = calculate_cohens_kappa(
-    rater1_annotations=user_a_annotations,
-    rater2_annotations=user_b_annotations
-)
-```
 
 ## Data Model
 
@@ -518,6 +480,185 @@ Response:
   ]
 }
 ```
+## Cohen's Kappa Metrics Panel
+
+### Overview
+
+After evaluation completes on the Judge Tuning page, a **Performance Metrics Panel** is displayed showing Cohen's Kappa and related agreement metrics between the LLM judge's ratings and human annotations. This provides immediate feedback on judge quality so facilitators can iterate on the prompt.
+
+### When Displayed
+
+The metrics panel renders **only** when both conditions are met:
+1. `metrics` object is non-null (evaluation returned results)
+2. `hasEvaluated` is true (at least one evaluation has been run in this session)
+
+The panel is hidden before the first evaluation and if no valid evaluations were produced.
+
+### Metrics Computed
+
+#### 1. Cohen's Kappa (κ)
+
+Measures agreement between the LLM judge and human annotators, corrected for chance agreement.
+
+**Formula**: `κ = (p_o - p_e) / (1 - p_e)`
+- `p_o` = observed agreement (proportion of exact matches)
+- `p_e` = expected agreement by chance (based on marginal distributions)
+
+**Implementation**: `sklearn.metrics.cohen_kappa_score(human_ratings, predicted_ratings)`
+
+**Stored as**: `JudgePerformanceMetrics.correlation` (float, -1.0–1.0; negative values mean agreement below chance)
+
+**Edge cases**:
+- If κ is `NaN` (e.g., all ratings identical), falls back to simple agreement ratio: `matches / total`
+- If `cohen_kappa_score` raises an exception, falls back to simple agreement ratio
+- If κ is `NaN` after fallback, stored as `0.0`
+
+**Display**: Shown as percentage (e.g., `85.3%`) with label "Cohen's κ"
+
+#### 2. Accuracy (Exact Match)
+
+Proportion of evaluations where the judge's rating exactly matches the human rating.
+
+**Formula**: `accuracy = count(human == predicted) / total`
+
+**Implementation**: `sklearn.metrics.accuracy_score(human_ratings, predicted_ratings)`
+
+**Stored as**: `JudgePerformanceMetrics.accuracy` (float, 0.0–1.0)
+
+**Display**: Shown as percentage (e.g., `72.0%`) with label "Accuracy"
+
+#### 3. Total Evaluations
+
+Count of evaluations with **both** valid human and judge ratings.
+
+**Stored as**: `JudgePerformanceMetrics.total_evaluations` (int)
+
+**Display**: Shown as integer. If some evaluations had invalid/missing judge ratings, shows `valid / total` with a count of missing ratings below.
+
+#### 4. Agreement by Rating
+
+Per-rating-level accuracy showing how well the judge performs when the human gave each specific rating.
+
+**Formula** (for each rating level r in 1–5):
+```
+agreement[r] = accuracy_score(
+    [h for h, p in pairs if h == r],
+    [p for h, p in pairs if h == r]
+)
+```
+
+If no human annotations exist for a rating level, agreement is `0.0`.
+
+**Stored as**: `JudgePerformanceMetrics.agreement_by_rating` (Dict[str, float], keys "1"–"5")
+
+**Display**: Five pill-shaped cards labeled `1★` through `5★`, each showing agreement percentage.
+
+#### 5. Confusion Matrix
+
+Full 5×5 confusion matrix of human (rows) vs. predicted (columns) ratings.
+
+**Implementation**: `sklearn.metrics.confusion_matrix(human_ratings, predicted_ratings, labels=[1, 2, 3, 4, 5])`
+
+**Stored as**: `JudgePerformanceMetrics.confusion_matrix` (List[List[int]])
+
+**Display**: Not directly rendered in the metrics panel; available in the data model for downstream analysis and export.
+
+### Color Thresholds
+
+Both Cohen's κ and Accuracy use the same color scale:
+
+| Value | Color | Meaning |
+|-------|-------|---------|
+| ≥ 80% | Green (`text-green-600`) | Strong agreement |
+| 60%–79% | Amber (`text-yellow-600`) | Moderate agreement |
+| < 60% | Red (`text-red-600`) | Weak agreement |
+
+Agreement by Rating pills use the same thresholds with left-border color indicators:
+
+| Value | Border Color |
+|-------|-------------|
+| ≥ 80% | Green (`border-green-500`) |
+| 60%–79% | Amber (`border-amber-500`) |
+| < 60% | Red (`border-red-500`) |
+
+### Warnings
+
+#### Small Sample Warning (< 3 evaluations)
+
+When `total_evaluations < 3`:
+- Cohen's κ label shows "(limited data)" suffix
+- κ value shows asterisk (`*`) suffix
+- Warning banner displayed:
+  > "Cohen's kappa with fewer than 3 evaluations shows simple agreement rate instead of statistical kappa. Get more annotation data for reliable inter-rater agreement metrics."
+
+**Rationale**: With fewer than 3 data points, kappa is statistically unreliable and may produce misleading values.
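The κ formula, its fallbacks, and the color thresholds above can be tied together in a short sketch. It is a standard-library stand-in for illustration, not the `sklearn`-based implementation the spec names; `cohen_kappa`, `judge_agreement`, and `color_for` are hypothetical helper names, and the empty-input behavior is an assumption.

```python
import math
from collections import Counter


def cohen_kappa(human, predicted):
    """kappa = (p_o - p_e) / (1 - p_e); NaN when chance agreement is 1."""
    n = len(human)
    p_o = sum(h == p for h, p in zip(human, predicted)) / n
    h_counts, p_counts = Counter(human), Counter(predicted)
    # Expected agreement from the two marginal rating distributions.
    p_e = sum(h_counts[k] * p_counts[k] for k in h_counts) / (n * n)
    if p_e == 1.0:
        return math.nan
    return (p_o - p_e) / (1.0 - p_e)


def judge_agreement(human, predicted, min_samples=3):
    """Return (value, limited_data), applying the fallback rules above."""
    n = len(human)
    if n == 0:
        return 0.0, True  # assumption: no data is reported as 0.0
    simple = sum(h == p for h, p in zip(human, predicted)) / n
    if n < min_samples:
        return simple, True  # small sample: simple agreement rate
    kappa = cohen_kappa(human, predicted)
    if math.isnan(kappa):
        return simple, False  # undefined kappa: simple agreement rate
    return kappa, False


def color_for(value):
    """Map a 0-1 metric to the panel's color scale."""
    pct = value * 100
    if pct >= 80:
        return "green"
    if pct >= 60:
        return "amber"
    return "red"


value, limited = judge_agreement([1, 2, 3, 4, 5, 3], [1, 2, 3, 4, 4, 3])
badge = color_for(value)
```

Kappa can be negative when the judge agrees with humans less often than chance would predict; under the thresholds above such a value falls in the red band.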
+
+#### Missing Ratings Warning
+
+When `total_evaluations_all > total_evaluations` (some evaluations had invalid judge responses):
+- Total count shows `valid / total`
+- Warning banner displayed with count of missing evaluations
+- Explains that invalid responses (e.g., binary judge returning 3.0) are excluded from metrics
+
+### Data Model
+
+```
+JudgePerformanceMetrics:
+  - prompt_id: string  # ID of the evaluated prompt version
+  - correlation: float  # Cohen's Kappa (-1.0–1.0)
+  - accuracy: float  # Exact match rate (0.0–1.0)
+  - mean_absolute_error: float  # Deprecated (always 0.0)
+  - agreement_by_rating: Dict[str, float]  # Per-rating accuracy {"1": 0.8, ...}
+  - confusion_matrix: List[List[int]]  # 5×5 matrix
+  - total_evaluations: int  # Count of valid evaluations
+```
+
+### Computation Flow
+
+```
+1. Evaluation completes (POST /evaluate-judge or /evaluate-judge-direct)
+2. Backend collects List[JudgeEvaluation] with human_rating and predicted_rating
+3. Filter to evaluations with valid ratings only
+4. Calculate: cohen_kappa_score(human_ratings, predicted_ratings)
+5. Calculate: accuracy_score(human_ratings, predicted_ratings)
+6. Calculate: per-rating accuracy for each rating level 1–5
+7. Calculate: confusion_matrix with labels [1, 2, 3, 4, 5]
+8. Return JudgePerformanceMetrics to frontend
+9. Frontend renders metrics panel with color-coded values and warnings
+```
+
+### Binary Scale Adaptation
+
+For binary judges (0/1 scale), the same metrics computation applies but:
+- Ratings are `0` (Fail) and `1` (Pass) instead of 1–5
+- Agreement by Rating shows only keys `"0"` and `"1"` with values
+- Confusion matrix is effectively 2×2 (other entries are zero)
+- The 1★–5★ pills may show 0% for unused rating levels
+
+### Persistence
+
+Metrics are persisted in two ways:
+1. **Backend**: Stored via `POST /workshops/{workshop_id}/evaluation-metrics` for the active prompt version
+2. **Frontend**: Cached in `localStorage` with 24-hour TTL for session continuity
+
+Metrics are re-fetched on page load if a prior evaluation exists for the current prompt.
+
+### Success Criteria
+
+- [ ] Metrics panel displays only after evaluation has been run
+- [ ] Cohen's κ computed via `sklearn.metrics.cohen_kappa_score`
+- [ ] κ falls back to simple agreement ratio when NaN or exception
+- [ ] Accuracy computed as exact match rate
+- [ ] Agreement by Rating shows per-level accuracy for each rating 1–5
+- [ ] Confusion matrix computed with labels [1, 2, 3, 4, 5]
+- [ ] Color thresholds: green ≥ 80%, amber 60–79%, red < 60%
+- [ ] Small sample warning shown when total_evaluations < 3
+- [ ] Missing ratings warning shown when some evaluations have invalid judge responses
+- [ ] Total count displays `valid / total` when missing ratings exist
+- [ ] Metrics persisted to backend for the active prompt version
+- [ ] Binary judges produce valid metrics with 0/1 scale
+- [ ] Panel hidden when no evaluation has been run
+
 
 ### Run Alignment
 
@@ -534,22 +675,6 @@ Response:
 }
 ```
 
-### Calculate IRR
-
-```
-GET /workshops/{workshop_id}/irr
-
-Response:
-{
-  "krippendorff_alpha": 0.72,
-  "cohens_kappa": {
-    "user_a_vs_user_b": 0.68,
-    "user_a_vs_user_c": 0.71
-  },
-  "annotation_count": 150,
-  "annotator_count": 3
-}
-```
 
 ## UI Components
 
@@ -562,7 +687,6 @@ Features:
 - Prompt editor
 - Evaluation results table with pagination
 - Alignment trigger and status
-- IRR display
 
 ### Mode Indicator
 
@@ -603,11 +727,6 @@ Features:
 - [ ] Metrics reported (guideline count, example count)
 - [ ] Works for both Likert and Binary scales
 
-### IRR
-- [ ] Krippendorff's Alpha calculated correctly
-- [ ] Cohen's Kappa calculated for rater pairs
-- [ ] Handles edge cases (no variation, single rater)
-- [ ] Updates when new annotations added
 
 ## Troubleshooting
 
@@ -618,12 +737,7 @@ Check that:
 2. `feedback_value_type=float` (not bool)
 3. Fallback conversion enabled
 
-### IRR Shows NaN
 
-Causes:
-- Only one rater
-- No overlapping traces between raters
-- All ratings identical (no variation)
 
 ### Alignment Fails