Commit 80334e9

feat(irr): merge pairwise agreement IRR from PR #111
Cherry-pick IRR-specific files from feature/gepa-prompt-optimization:
- pairwise_agreement.py: pairwise agreement % as primary IRR metric
- fleiss_kappa.py: Fleiss' Kappa for multi-rater reliability
- irr_service.py: rewritten to use pairwise agreement primary
- IRRResultsDemo.tsx: UI updates for pairwise display
- All associated tests (63 passing)

Skips GEPA prompt optimization files (unrelated scope).
1 parent 6ed7b2f commit 80334e9

10 files changed

Lines changed: 1676 additions & 649 deletions

Lines changed: 189 additions & 0 deletions
@@ -0,0 +1,189 @@
# Fully Wired Eval Mode MVP Implementation Plan

**Spec:** [EVAL_MODE_SPEC](../../specs/EVAL_MODE_SPEC.md)
**Goal:** Finish the remaining eval-mode wiring so facilitators can run criterion-level judge evaluation (milestone scoped), compute IRR on criterion decisions, and run alignment end-to-end.
**Architecture:** Keep eval mode as a parallel path to workshop-mode rubric evaluation: per-trace criteria remain the source of truth, each criterion is evaluated independently, and we reuse existing MLflow + MemAlign integration where possible. Add lineage-aware criterion metadata so judge prompts can be scoped to the correct milestone context instead of whole-trace-only evaluation. Keep changes constrained to eval-mode routes/services plus minimal frontend controls in the existing eval workspace.

**Success Criteria Targeted:**
- SC-1: One independent judge call per criterion
- SC-2: Judge sees trace content + single criterion, not other criteria
- SC-3: Judge returns met (boolean) + rationale
- SC-4: Evaluation runs as background job with progress tracking
- SC-5: Results stored per-criterion with rationale
- SC-6: One task-level judge aligned using all criteria across all traces as examples
- SC-7: Each criterion's human met/not-met decision stored as a separate MLflow assessment on the trace
- SC-8: All assessments share the judge name; extraction yields all (not just most recent)
- SC-9: Re-evaluation compares pre/post alignment accuracy on same trace set

---

## Scope Guardrails

- In scope:
  - Eval-mode judge execution against trace criteria
  - Milestone/lineage scoping for judge context
  - Eval-mode IRR computation from criterion-level decisions
  - Eval-mode alignment wiring using criterion-level assessments
- Out of scope for this MVP:
  - New discovery UX concepts beyond existing social-mode promotion flow
  - Offline eval export enhancements
  - Broad workshop-mode refactors

## File Map

### New Files
| File | Responsibility |
|------|----------------|
| `tests/unit/services/test_eval_mode_judge_execution_service.py` | Unit tests for milestone-scoped criterion judge execution |
| `tests/unit/services/test_eval_mode_irr_service.py` | Unit tests for eval-mode IRR computations |
| `tests/unit/routers/test_eval_mode_execution_router.py` | Router tests for eval evaluate/eval-job/align endpoints |

### Modified Files
| File | Change |
|------|--------|
| `server/models.py` | Extend `TraceCriterion` model with lineage fields used for milestone scoping |
| `server/database.py` | Add lineage columns to `TraceCriterionDB` and indexes for scoped fetches |
| `migrations/versions/*_eval_mode_lineage_fields.py` | Migration for new eval criterion lineage columns |
| `server/services/eval_criteria_service.py` | Persist/read lineage fields, propagate on promote/create/update |
| `server/services/discovery_service.py` | Promote finding lineage (`evidence_milestone_refs`, `evidence_question_refs`) into trace criteria in eval mode |
| `server/services/eval_mode_service.py` | Add judge-run orchestration + eval-mode IRR helpers; keep score aggregation as source of truth |
| `server/routers/eval_mode.py` | Add `POST /evaluate`, `GET /eval-job/{job_id}`, `POST /align`, `GET /alignment-status`, and eval IRR endpoint |
| `server/services/alignment_service.py` | Add eval-mode alignment entrypoint that uses criterion-level assessments and shared eval judge name |
| `client/src/hooks/useWorkshopApi.ts` | Add hooks for eval-mode evaluate job, eval IRR, and eval alignment actions |
| `client/src/components/eval/EvalModeWorkspace.tsx` | Add facilitator controls: run eval, poll progress, view IRR/alignment status |
| `client/src/components/eval/EvalGradingPanel.tsx` | Use structured criterion lineage refs (not text regex) for milestone highlighting |
| `tests/unit/services/test_eval_criteria_service.py` | Add lineage persistence assertions |
| `tests/unit/services/test_discovery_promotion_eval_mode.py` | Verify promoted criteria carry lineage/milestone refs |
| `tests/unit/services/test_eval_mode_service.py` | Add judge-run and IRR edge-case tests |
| `tests/unit/routers/test_eval_mode_router.py` | Extend existing eval-mode route coverage for new endpoints |

---

### Task 1: Add Lineage-Aware Criterion Schema

**Spec criteria:** SC-1, SC-2
**Files:**
- Modify: `server/models.py`, `server/database.py`, `server/services/eval_criteria_service.py`, `server/services/discovery_service.py`
- Create: `migrations/versions/*_eval_mode_lineage_fields.py`
- Test: `tests/unit/services/test_eval_criteria_service.py`, `tests/unit/services/test_discovery_promotion_eval_mode.py`

- [ ] **Step 1: Write failing tests for lineage persistence**
- [ ] **Step 2: Add fields to `TraceCriterion` and `TraceCriterionDB` (e.g., `lineage_refs`, `milestone_refs`, `lineage_scope`)**
- [ ] **Step 3: Add migration for SQLite/Postgres parity**
- [ ] **Step 4: Update create/update/promote flows to persist lineage metadata**
- [ ] **Step 5: Run tests**

Run: `just test-server -k "eval_criteria_service or discovery_promotion_eval_mode" --no-header -q`
Expected: PASS with lineage fields covered

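As a rough illustration of Step 2 and Step 4, the sketch below carries the three lineage fields on a plain dataclass and round-trips them through a flat row. It is not the project's `TraceCriterion` model: the class name `TraceCriterionSketch`, the field types, the default `lineage_scope` of `"trace"`, and the JSON-text storage are all assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TraceCriterionSketch:
    """Hypothetical stand-in for TraceCriterion with lineage metadata."""

    id: str
    trace_id: str
    text: str
    # Lineage fields named in Step 2; the types here are assumptions.
    lineage_refs: List[str] = field(default_factory=list)
    milestone_refs: List[str] = field(default_factory=list)
    lineage_scope: str = "trace"  # assumed values: "trace" or "milestone"

    def to_row(self) -> dict:
        """Store list fields as JSON text so SQLite and Postgres hold the same shape."""
        row = asdict(self)
        row["lineage_refs"] = json.dumps(self.lineage_refs)
        row["milestone_refs"] = json.dumps(self.milestone_refs)
        return row

    @classmethod
    def from_row(cls, row: dict) -> "TraceCriterionSketch":
        """Rebuild the model; rows written before the migration lack lineage columns."""
        return cls(
            id=row["id"],
            trace_id=row["trace_id"],
            text=row["text"],
            lineage_refs=json.loads(row.get("lineage_refs") or "[]"),
            milestone_refs=json.loads(row.get("milestone_refs") or "[]"),
            lineage_scope=row.get("lineage_scope") or "trace",
        )
```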
---

### Task 2: Implement Eval-Mode Judge Execution (Milestone Scoped)

**Spec criteria:** SC-1, SC-2, SC-3, SC-4, SC-5
**Files:**
- Modify: `server/services/eval_mode_service.py`, `server/routers/eval_mode.py`, `server/services/database_service.py`
- Test: `tests/unit/services/test_eval_mode_judge_execution_service.py`, `tests/unit/routers/test_eval_mode_execution_router.py`

- [ ] **Step 1: Write failing tests for evaluate job lifecycle**
- [ ] **Step 2: Add `POST /workshops/{workshop_id}/evaluate` job start endpoint for eval mode**
- [ ] **Step 3: Add background runner that iterates trace criteria and makes one judge call per criterion**
- [ ] **Step 4: Build lineage-scoped prompt context (`trace summary + referenced milestone`, fallback to whole trace)**
- [ ] **Step 5: Store outputs in `criterion_evaluations` and expose progress via `GET /eval-job/{job_id}`**
- [ ] **Step 6: Run tests**

Run: `just test-server -k "eval_mode_execution or eval_mode_router" --no-header -q`
Expected: PASS with per-criterion calls and persisted rationale

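Steps 3 to 5 can be pictured with the sketch below: one judge call per criterion, a milestone-scoped context with a whole-trace fallback, and a job record whose counters a status endpoint could report. Everything named here (`EvalJob`, `build_context`, `run_eval_job`, the dict shapes, and the injected `judge` callable) is hypothetical; the real runner would call the project's judge integration and persist to `criterion_evaluations`.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A judge is any callable taking (context, criterion_text) and returning (met, rationale).
JudgeFn = Callable[[str, str], Tuple[bool, str]]


@dataclass
class EvalJob:
    """Hypothetical in-memory job record that a status endpoint could poll."""

    total: int
    completed: int = 0
    status: str = "running"
    results: List[dict] = field(default_factory=list)


def build_context(trace: dict, milestone_refs: List[str]) -> str:
    """Return summary + referenced milestones, or the whole trace when no ref resolves."""
    milestones = trace.get("milestones", {})
    scoped = [milestones[ref] for ref in milestone_refs if ref in milestones]
    if not scoped:
        return trace["content"]  # fallback: whole-trace evaluation
    return "\n".join([trace.get("summary", ""), *scoped])


def run_eval_job(trace: dict, criteria: List[dict], judge: JudgeFn) -> EvalJob:
    """Evaluate each criterion independently and record met + rationale."""
    job = EvalJob(total=len(criteria))
    for criterion in criteria:
        context = build_context(trace, criterion.get("milestone_refs", []))
        # One call per criterion; the judge never sees sibling criteria.
        met, rationale = judge(context, criterion["text"])
        job.results.append(
            {"criterion_id": criterion["id"], "met": met, "rationale": rationale}
        )
        job.completed += 1
    job.status = "completed"
    return job
```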
---

### Task 3: Compute Eval-Mode IRR from Criterion Decisions

**Spec criteria:** SC-5
**Files:**
- Modify: `server/services/eval_mode_service.py`, `server/routers/eval_mode.py`, `client/src/hooks/useWorkshopApi.ts`, `client/src/components/eval/EvalModeWorkspace.tsx`
- Create: `tests/unit/services/test_eval_mode_irr_service.py`

- [ ] **Step 1: Write failing tests for eval IRR input shaping**
- [ ] **Step 2: Add eval-mode IRR function that compares HUMAN vs judge criterion decisions (per criterion across traces)**
- [ ] **Step 3: Add `GET /workshops/{workshop_id}/eval-irr` endpoint**
- [ ] **Step 4: Add minimal UI block in eval workspace showing eval IRR score + readiness**
- [ ] **Step 5: Run tests**

Run: `just test-server -k "eval_mode_irr" --no-header -q`
Expected: PASS for sparse data, no-human-label data, and normal data cases

---
117+
118+
### Task 4: Wire Eval-Mode Alignment
119+
120+
**Spec criteria:** SC-6, SC-7, SC-8, SC-9
121+
**Files:**
122+
- Modify: `server/routers/eval_mode.py`, `server/services/alignment_service.py`, `server/services/eval_criteria_service.py`, `client/src/components/eval/EvalModeWorkspace.tsx`
123+
- Test: `tests/unit/routers/test_eval_mode_execution_router.py`, `tests/unit/services/test_eval_mode_judge_execution_service.py`
124+
125+
- [ ] **Step 1: Write failing tests for eval alignment trigger/status behavior**
126+
- [ ] **Step 2: Add eval alignment route(s) that use one task-level judge name for all criterion assessments**
127+
- [ ] **Step 3: Ensure HUMAN criterion corrections are logged as separate MLflow assessments with shared judge name**
128+
- [ ] **Step 4: Trigger pre/post re-evaluation comparison and return alignment summary**
129+
- [ ] **Step 5: Add fallback/guard if installed MLflow still collapses multi-assessment traces (explicit warning + actionable remediation)**
130+
- [ ] **Step 6: Run tests**
131+
132+
Run: `just test-server -k "eval_mode.*align or alignment_service" --no-header -q`
133+
Expected: PASS for alignment trigger and pre/post comparison contract
134+
135+
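For Step 4, the pre/post comparison could be as small as the sketch below: score the judge against human decisions before and after alignment, restricted to the decisions present in all three sets so both runs are measured on the same traces. The names `accuracy` and `alignment_summary` and the shape of the returned summary are hypothetical, and no MLflow calls are shown.

```python
from typing import Optional


def accuracy(human: dict, judge: dict) -> Optional[float]:
    """Share of human-labeled decisions the judge reproduces; None if nothing overlaps."""
    shared = [key for key in human if key in judge]
    if not shared:
        return None
    return sum(human[key] == judge[key] for key in shared) / len(shared)


def alignment_summary(human: dict, pre: dict, post: dict) -> dict:
    """Compare judge accuracy before and after alignment on the same decisions.

    All inputs map (trace_id, criterion_id) -> met (bool).
    """
    # Keep only decisions present everywhere so pre and post share one trace set.
    common = set(human) & set(pre) & set(post)
    human_common = {key: human[key] for key in common}
    pre_acc = accuracy(human_common, pre)
    post_acc = accuracy(human_common, post)
    delta = None if pre_acc is None or post_acc is None else post_acc - pre_acc
    return {
        "n": len(common),
        "pre_accuracy": pre_acc,
        "post_accuracy": post_acc,
        "delta": delta,
    }
```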
---

### Task 5: Frontend Tight Wiring for Facilitator Flow

**Spec criteria:** SC-4, SC-5, SC-9
**Files:**
- Modify: `client/src/hooks/useWorkshopApi.ts`, `client/src/components/eval/EvalModeWorkspace.tsx`, `client/src/components/eval/EvalGradingPanel.tsx`
- Test: `client/src/components/eval/CriterionEditor.eval.test.tsx` (extend), add eval workspace unit tests if missing

- [ ] **Step 1: Add hooks for evaluate job start/status, eval IRR, and alignment trigger/status**
- [ ] **Step 2: Add controls in `EvalModeWorkspace` for run eval + run alignment**
- [ ] **Step 3: Replace regex milestone parsing in grading panel with structured lineage refs from criterion data**
- [ ] **Step 4: Add/extend UI tests**

Run: `just ui-test-unit-spec EVAL_MODE_SPEC`
Expected: PASS for eval workspace interactions

---

### Task 6 (Final): Verify, Lint, and Spec Coverage

- [ ] **Step 1: Run backend tests for the spec**

Run: `just test-server-spec EVAL_MODE_SPEC`
Expected: PASS

- [ ] **Step 2: Run frontend tests for the spec**

Run: `just ui-test-unit-spec EVAL_MODE_SPEC`
Expected: PASS

- [ ] **Step 3: Run lint checks**

Run: `just lint-ruff`
Expected: No errors

Run: `just ui-lint`
Expected: No errors

- [ ] **Step 4: Validate and report spec coverage**

Run: `just spec-coverage --specs EVAL_MODE_SPEC`
Expected: Coverage increases for targeted success criteria

Run: `just spec-validate`
Expected: Test tags valid

---

## Execution Notes

- Use the new branch `feat/eval-mode-fully-wired-mvp-plan` as the implementation branch for this plan.
- Keep all eval-mode endpoints gated by workshop mode (`mode == "eval"`), and leave existing workshop-mode behavior unchanged.
- For lineage scoping, prefer explicit criterion metadata over text parsing; retain the compatibility fallback only where needed.
- If MLflow multi-assessment extraction still collapses data in the installed version, ship a guarded fallback and document the exact upgrade/patch requirement before marking alignment complete.
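
One possible shape for the guarded fallback in the last note: compare how many criterion assessments were logged per trace with how many extraction returned, and warn when extraction comes back short. The function name `check_assessment_extraction` and the count-based inputs are assumptions for illustration; obtaining those counts from MLflow is deliberately left out.

```python
import warnings
from typing import Dict, List


def check_assessment_extraction(
    logged: Dict[str, int], extracted: Dict[str, int]
) -> List[str]:
    """Return trace IDs whose extracted assessment count is below what was logged.

    Both arguments map trace_id -> number of assessments under the shared judge name.
    """
    collapsed = sorted(
        trace_id for trace_id, n in logged.items() if extracted.get(trace_id, 0) < n
    )
    if collapsed:
        # Explicit warning plus a remediation hint, as the plan asks for.
        warnings.warn(
            f"Assessment extraction returned fewer entries than logged for "
            f"{len(collapsed)} trace(s); alignment would use incomplete labels. "
            "Upgrade or patch MLflow before marking alignment complete.",
            RuntimeWarning,
        )
    return collapsed
```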
