| 1 | +# Fully Wired Eval Mode MVP Implementation Plan |
| 2 | + |
| 3 | +**Spec:** [EVAL_MODE_SPEC](../../specs/EVAL_MODE_SPEC.md) |
| 4 | +**Goal:** Finish the remaining eval-mode wiring so facilitators can run milestone-scoped, criterion-level judge evaluation, compute inter-rater reliability (IRR) on criterion decisions, and run alignment end-to-end. |
| 5 | +**Architecture:** Keep eval mode as a parallel path to workshop-mode rubric evaluation: per-trace criteria remain the source of truth, each criterion is evaluated independently, and we reuse existing MLflow + MemAlign integration where possible. Add lineage-aware criterion metadata so judge prompts can be scoped to the correct milestone context instead of whole-trace-only evaluation. Keep changes constrained to eval-mode routes/services plus minimal frontend controls in the existing eval workspace. |
| 6 | + |
| 7 | +**Success Criteria Targeted:** |
| 8 | +- SC-1: One independent judge call per criterion |
| 9 | +- SC-2: Judge sees trace content + single criterion, not other criteria |
| 10 | +- SC-3: Judge returns met (boolean) + rationale |
| 11 | +- SC-4: Evaluation runs as background job with progress tracking |
| 12 | +- SC-5: Results stored per-criterion with rationale |
| 13 | +- SC-6: One task-level judge aligned using all criteria across all traces as examples |
| 14 | +- SC-7: Each criterion's human met/not-met decision stored as a separate MLflow assessment on the trace |
| 15 | +- SC-8: All assessments share one judge name; extraction yields every assessment (not just the most recent) |
| 16 | +- SC-9: Re-evaluation compares pre/post alignment accuracy on same trace set |
| 17 | + |
| 18 | +--- |
| 19 | + |
| 20 | +## Scope Guardrails |
| 21 | + |
| 22 | +- In scope: |
| 23 | + - Eval-mode judge execution against trace criteria |
| 24 | + - Milestone/lineage scoping for judge context |
| 25 | + - Eval-mode IRR computation from criterion-level decisions |
| 26 | + - Eval-mode alignment wiring using criterion-level assessments |
| 27 | +- Out of scope for this MVP: |
| 28 | + - New discovery UX concepts beyond existing social-mode promotion flow |
| 29 | + - Offline eval export enhancements |
| 30 | + - Broad workshop-mode refactors |
| 31 | + |
| 32 | +## File Map |
| 33 | + |
| 34 | +### New Files |
| 35 | +| File | Responsibility | |
| 36 | +|------|----------------| |
| 37 | +| `tests/unit/services/test_eval_mode_judge_execution_service.py` | Unit tests for milestone-scoped criterion judge execution | |
| 38 | +| `tests/unit/services/test_eval_mode_irr_service.py` | Unit tests for eval-mode IRR computations | |
| 39 | +| `tests/unit/routers/test_eval_mode_execution_router.py` | Router tests for eval evaluate/eval-job/align endpoints | |
| 40 | + |
| 41 | +### Modified Files |
| 42 | +| File | Change | |
| 43 | +|------|--------| |
| 44 | +| `server/models.py` | Extend `TraceCriterion` model with lineage fields used for milestone scoping | |
| 45 | +| `server/database.py` | Add lineage columns to `TraceCriterionDB` and indexes for scoped fetches | |
| 46 | +| `migrations/versions/*_eval_mode_lineage_fields.py` | Migration for new eval criterion lineage columns | |
| 47 | +| `server/services/eval_criteria_service.py` | Persist/read lineage fields, propagate on promote/create/update | |
| 48 | +| `server/services/discovery_service.py` | Promote finding lineage (`evidence_milestone_refs`, `evidence_question_refs`) into trace criteria in eval mode | |
| 49 | +| `server/services/eval_mode_service.py` | Add judge-run orchestration + eval-mode IRR helpers; keep score aggregation as source of truth | |
| 50 | +| `server/routers/eval_mode.py` | Add `POST /evaluate`, `GET /eval-job/{job_id}`, `POST /align`, `GET /alignment-status`, and `GET /eval-irr` | |
| 51 | +| `server/services/alignment_service.py` | Add eval-mode alignment entrypoint that uses criterion-level assessments and shared eval judge name | |
| 52 | +| `client/src/hooks/useWorkshopApi.ts` | Add hooks for eval-mode evaluate job, eval IRR, and eval alignment actions | |
| 53 | +| `client/src/components/eval/EvalModeWorkspace.tsx` | Add facilitator controls: run eval, poll progress, view IRR/alignment status | |
| 54 | +| `client/src/components/eval/EvalGradingPanel.tsx` | Use structured criterion lineage refs (not text regex) for milestone highlighting | |
| 55 | +| `tests/unit/services/test_eval_criteria_service.py` | Add lineage persistence assertions | |
| 56 | +| `tests/unit/services/test_discovery_promotion_eval_mode.py` | Verify promoted criteria carry lineage/milestone refs | |
| 57 | +| `tests/unit/services/test_eval_mode_service.py` | Add judge-run and IRR edge-case tests | |
| 58 | +| `tests/unit/routers/test_eval_mode_router.py` | Extend existing eval-mode route coverage for new endpoints | |
| 59 | + |
| 60 | +--- |
| 61 | + |
| 62 | +### Task 1: Add Lineage-Aware Criterion Schema |
| 63 | + |
| 64 | +**Spec criteria:** SC-1, SC-2 |
| 65 | +**Files:** |
| 66 | +- Modify: `server/models.py`, `server/database.py`, `server/services/eval_criteria_service.py`, `server/services/discovery_service.py` |
| 67 | +- Create: `migrations/versions/*_eval_mode_lineage_fields.py` |
| 68 | +- Test: `tests/unit/services/test_eval_criteria_service.py`, `tests/unit/services/test_discovery_promotion_eval_mode.py` |
| 69 | + |
| 70 | +- [ ] **Step 1: Write failing tests for lineage persistence** |
| 71 | +- [ ] **Step 2: Add fields to `TraceCriterion` and `TraceCriterionDB` (e.g., `lineage_refs`, `milestone_refs`, `lineage_scope`)** |
| 72 | +- [ ] **Step 3: Add migration for SQLite/Postgres parity** |
| 73 | +- [ ] **Step 4: Update create/update/promote flows to persist lineage metadata** |
| 74 | +- [ ] **Step 5: Run tests** |
| 75 | + |
| 76 | +Run: `just test-server -k "eval_criteria_service or discovery_promotion_eval_mode" --no-header -q` |
| 77 | +Expected: PASS with lineage fields covered |
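
A minimal sketch of what Steps 2 and 4 could look like, offered as an illustration rather than the actual model. The class `TraceCriterionSketch` and the helper `promote_finding` are hypothetical. The field names come from the examples in Step 2. The field types, the `lineage_scope` values, and the mapping of `evidence_question_refs` onto `lineage_refs` are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TraceCriterionSketch:
    """Hypothetical stand-in for TraceCriterion; only the lineage fields matter here."""

    id: str
    trace_id: str
    text: str
    # Field names follow the examples in Step 2; the types are assumptions.
    lineage_refs: list[str] = field(default_factory=list)
    milestone_refs: list[str] = field(default_factory=list)
    lineage_scope: Optional[str] = None  # assumed values: "milestone" or "trace"


def promote_finding(finding: dict, trace_id: str, criterion_id: str) -> TraceCriterionSketch:
    """Copy lineage from a discovery finding onto a newly promoted criterion."""
    milestones = list(finding.get("evidence_milestone_refs") or [])
    questions = list(finding.get("evidence_question_refs") or [])
    return TraceCriterionSketch(
        id=criterion_id,
        trace_id=trace_id,
        text=finding["text"],
        lineage_refs=questions,  # assumed mapping, not confirmed by the plan
        milestone_refs=milestones,
        # Without milestone refs the judge falls back to whole-trace context.
        lineage_scope="milestone" if milestones else "trace",
    )
```

Storing the scope explicitly, instead of deriving it at read time, would let the grading panel and the judge runner agree on the fallback without re-parsing criterion text.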
| 78 | + |
| 79 | +--- |
| 80 | + |
| 81 | +### Task 2: Implement Eval-Mode Judge Execution (Milestone Scoped) |
| 82 | + |
| 83 | +**Spec criteria:** SC-1, SC-2, SC-3, SC-4, SC-5 |
| 84 | +**Files:** |
| 85 | +- Modify: `server/services/eval_mode_service.py`, `server/routers/eval_mode.py`, `server/services/database_service.py` |
| 86 | +- Test: `tests/unit/services/test_eval_mode_judge_execution_service.py`, `tests/unit/routers/test_eval_mode_execution_router.py` |
| 87 | + |
| 88 | +- [ ] **Step 1: Write failing tests for evaluate job lifecycle** |
| 89 | +- [ ] **Step 2: Add `POST /workshops/{workshop_id}/evaluate` job start endpoint for eval mode** |
| 90 | +- [ ] **Step 3: Add background runner that iterates trace criteria and makes one judge call per criterion** |
| 91 | +- [ ] **Step 4: Build lineage-scoped prompt context (`trace summary + referenced milestone`, fallback to whole trace)** |
| 92 | +- [ ] **Step 5: Store outputs in `criterion_evaluations` and expose progress via `GET /eval-job/{job_id}`** |
| 93 | +- [ ] **Step 6: Run tests** |
| 94 | + |
| 95 | +Run: `just test-server -k "eval_mode_judge_execution or eval_mode_execution or eval_mode_router" --no-header -q` |
| 96 | +Expected: PASS with per-criterion calls and persisted rationale |
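
Steps 3 to 5 can be sketched as follows, assuming an in-memory job dict and a judge passed in as a plain callable. `Criterion`, `Verdict`, `build_context`, `run_eval_job`, and the trace dict shape (`content`, `summary`, `milestones`) are hypothetical names for illustration, not the service's real interfaces.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    id: str
    text: str
    milestone_refs: list[str]


@dataclass
class Verdict:
    criterion_id: str
    met: bool
    rationale: str


def build_context(trace: dict, criterion: Criterion) -> str:
    """Return trace summary plus referenced milestones, or the whole trace as fallback."""
    milestones = trace.get("milestones", {})
    scoped = [milestones[ref] for ref in criterion.milestone_refs if ref in milestones]
    if not scoped:
        return trace["content"]  # no usable lineage: whole-trace evaluation
    return trace["summary"] + "\n\n" + "\n\n".join(scoped)


def run_eval_job(
    trace: dict,
    criteria: list[Criterion],
    judge: Callable[[str, str], tuple[bool, str]],
    job: dict,
) -> list[Verdict]:
    """Make one independent judge call per criterion and update job progress."""
    job["total"] = len(criteria)
    job["completed"] = 0
    results = []
    for criterion in criteria:
        # The judge sees the scoped context and exactly one criterion (SC-1, SC-2).
        met, rationale = judge(build_context(trace, criterion), criterion.text)
        results.append(Verdict(criterion.id, met, rationale))
        job["completed"] += 1
    job["status"] = "completed"
    return results
```

A status endpoint such as `GET /eval-job/{job_id}` could then report `completed / total` straight from the job record; in the real service that record would presumably be persisted, not held in memory.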
| 97 | + |
| 98 | +--- |
| 99 | + |
| 100 | +### Task 3: Compute Eval-Mode IRR from Criterion Decisions |
| 101 | + |
| 102 | +**Spec criteria:** SC-5 |
| 103 | +**Files:** |
| 104 | +- Modify: `server/services/eval_mode_service.py`, `server/routers/eval_mode.py`, `client/src/hooks/useWorkshopApi.ts`, `client/src/components/eval/EvalModeWorkspace.tsx` |
| 105 | +- Create: `tests/unit/services/test_eval_mode_irr_service.py` |
| 106 | + |
| 107 | +- [ ] **Step 1: Write failing tests for eval IRR input shaping** |
| 108 | +- [ ] **Step 2: Add eval-mode IRR function that compares HUMAN vs judge criterion decisions (per criterion across traces)** |
| 109 | +- [ ] **Step 3: Add `GET /workshops/{workshop_id}/eval-irr` endpoint** |
| 110 | +- [ ] **Step 4: Add minimal UI block in eval workspace showing eval IRR score + readiness** |
| 111 | +- [ ] **Step 5: Run tests** |
| 112 | + |
| 113 | +Run: `just test-server -k "eval_mode_irr" --no-header -q` |
| 114 | +Expected: PASS for sparse data, no-human-label data, and normal data cases |
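
The plan does not name the agreement statistic, so the sketch below assumes Cohen's kappa over binary met/not-met decisions, pooled across every `(trace_id, criterion_id)` pair that has both a human and a judge decision. The function names and the dict-based input shape are hypothetical. Returning `None` for empty or degenerate input is one possible way to cover the sparse and no-human-label cases.

```python
from typing import Optional


def cohen_kappa(pairs: list[tuple[bool, bool]]) -> Optional[float]:
    """Cohen's kappa for (human_met, judge_met) pairs; None when it is undefined."""
    n = len(pairs)
    if n == 0:
        return None  # no overlapping decisions, e.g. no human labels yet
    observed = sum(1 for h, j in pairs if h == j) / n
    p_human = sum(1 for h, _ in pairs if h) / n
    p_judge = sum(1 for _, j in pairs if j) / n
    # Agreement expected by chance from each rater's marginal "met" rate.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        return None  # both raters gave the same constant answer
    return (observed - expected) / (1 - expected)


def eval_irr(human: dict, judge: dict) -> Optional[float]:
    """Both dicts map (trace_id, criterion_id) -> met; only shared keys are compared."""
    shared = sorted(set(human) & set(judge))
    return cohen_kappa([(human[key], judge[key]) for key in shared])
```

For example, four shared decisions with three agreements and marginal "met" rates of 0.75 (human) and 0.5 (judge) give an observed agreement of 0.75, a chance agreement of 0.5, and a kappa of 0.5.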
| 115 | + |
| 116 | +--- |
| 117 | + |
| 118 | +### Task 4: Wire Eval-Mode Alignment |
| 119 | + |
| 120 | +**Spec criteria:** SC-6, SC-7, SC-8, SC-9 |
| 121 | +**Files:** |
| 122 | +- Modify: `server/routers/eval_mode.py`, `server/services/alignment_service.py`, `server/services/eval_criteria_service.py`, `client/src/components/eval/EvalModeWorkspace.tsx` |
| 123 | +- Test: `tests/unit/routers/test_eval_mode_execution_router.py`, `tests/unit/services/test_eval_mode_judge_execution_service.py` |
| 124 | + |
| 125 | +- [ ] **Step 1: Write failing tests for eval alignment trigger/status behavior** |
| 126 | +- [ ] **Step 2: Add eval alignment route(s) that use one task-level judge name for all criterion assessments** |
| 127 | +- [ ] **Step 3: Ensure HUMAN criterion corrections are logged as separate MLflow assessments with shared judge name** |
| 128 | +- [ ] **Step 4: Trigger pre/post re-evaluation comparison and return alignment summary** |
| 129 | +- [ ] **Step 5: Add fallback/guard if installed MLflow still collapses multi-assessment traces (explicit warning + actionable remediation)** |
| 130 | +- [ ] **Step 6: Run tests** |
| 131 | + |
| 132 | +Run: `just test-server -k "(eval_mode and align) or alignment_service" --no-header -q` |
| 133 | +Expected: PASS for alignment trigger and pre/post comparison contract |
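
The pre/post comparison in Step 4 might reduce to something like the sketch below. It assumes judge and human decisions are available as dicts keyed by `(trace_id, criterion_id)`. The function name and the summary field names are hypothetical, not the alignment service's actual contract.

```python
def compare_alignment(human: dict, pre: dict, post: dict) -> dict:
    """Compare judge accuracy against human labels before and after alignment.

    Only keys present in all three dicts are scored, so both accuracies
    are computed on the same set of criterion decisions (SC-9).
    """
    keys = set(human) & set(pre) & set(post)
    if not keys:
        raise ValueError("no criterion decisions shared by human, pre, and post runs")
    pre_accuracy = sum(human[k] == pre[k] for k in keys) / len(keys)
    post_accuracy = sum(human[k] == post[k] for k in keys) / len(keys)
    return {
        "n": len(keys),
        "pre_accuracy": pre_accuracy,
        "post_accuracy": post_accuracy,
        "delta": post_accuracy - pre_accuracy,
    }
```

Restricting to the intersection keeps the comparison honest if the post-alignment run skipped or failed on some criteria; reporting `n` makes any such shrinkage visible.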
| 134 | + |
| 135 | +--- |
| 136 | + |
| 137 | +### Task 5: Tighten Frontend Wiring for the Facilitator Flow |
| 138 | + |
| 139 | +**Spec criteria:** SC-4, SC-5, SC-9 |
| 140 | +**Files:** |
| 141 | +- Modify: `client/src/hooks/useWorkshopApi.ts`, `client/src/components/eval/EvalModeWorkspace.tsx`, `client/src/components/eval/EvalGradingPanel.tsx` |
| 142 | +- Test: `client/src/components/eval/CriterionEditor.eval.test.tsx` (extend), add eval workspace unit tests if missing |
| 143 | + |
| 144 | +- [ ] **Step 1: Add hooks for evaluate job start/status, eval IRR, and alignment trigger/status** |
| 145 | +- [ ] **Step 2: Add controls in `EvalModeWorkspace` for run eval + run alignment** |
| 146 | +- [ ] **Step 3: Replace regex milestone parsing in grading panel with structured lineage refs from criterion data** |
| 147 | +- [ ] **Step 4: Add/extend UI tests** |
| 148 | + |
| 149 | +Run: `just ui-test-unit-spec EVAL_MODE_SPEC` |
| 150 | +Expected: PASS for eval workspace interactions |
| 151 | + |
| 152 | +--- |
| 153 | + |
| 154 | +### Task 6 (Final): Verify, Lint, and Spec Coverage |
| 155 | + |
| 156 | +- [ ] **Step 1: Run backend tests for the spec** |
| 157 | + |
| 158 | +Run: `just test-server-spec EVAL_MODE_SPEC` |
| 159 | +Expected: PASS |
| 160 | + |
| 161 | +- [ ] **Step 2: Run frontend tests for the spec** |
| 162 | + |
| 163 | +Run: `just ui-test-unit-spec EVAL_MODE_SPEC` |
| 164 | +Expected: PASS |
| 165 | + |
| 166 | +- [ ] **Step 3: Run lint checks** |
| 167 | + |
| 168 | +Run: `just lint-ruff` |
| 169 | +Expected: No errors |
| 170 | + |
| 171 | +Run: `just ui-lint` |
| 172 | +Expected: No errors |
| 173 | + |
| 174 | +- [ ] **Step 4: Validate and report spec coverage** |
| 175 | + |
| 176 | +Run: `just spec-coverage --specs EVAL_MODE_SPEC` |
| 177 | +Expected: Coverage increases for targeted success criteria |
| 178 | + |
| 179 | +Run: `just spec-validate` |
| 180 | +Expected: Test tags valid |
| 181 | + |
| 182 | +--- |
| 183 | + |
| 184 | +## Execution Notes |
| 185 | + |
| 186 | +- Implement this plan on the new branch `feat/eval-mode-fully-wired-mvp-plan`. |
| 187 | +- Keep all eval-mode endpoints gated by workshop mode (`mode == "eval"`), and preserve existing workshop-mode behavior unchanged. |
| 188 | +- For lineage scoping, prefer explicit criterion metadata over text parsing; retain compatibility fallback only where needed. |
| 189 | +- If MLflow multi-assessment extraction still collapses data in the installed version, ship a guarded fallback and document exact upgrade/patch requirement before marking alignment complete. |
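
The mode gate from the notes above could be a single shared guard called at the top of every eval-mode handler. This is a framework-agnostic sketch: `WrongWorkshopModeError` and `require_eval_mode` are hypothetical names, and the workshop is assumed to be a dict with a `mode` key. In the real router the error would presumably be translated into an HTTP error response.

```python
class WrongWorkshopModeError(Exception):
    """Raised when an eval-mode endpoint is called for a non-eval workshop."""


def require_eval_mode(workshop: dict) -> None:
    """Reject the request unless the workshop is in eval mode."""
    mode = workshop.get("mode")
    if mode != "eval":
        raise WrongWorkshopModeError(
            f"eval-mode endpoint called for workshop in mode {mode!r}"
        )
```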