Goal
Systematically evaluate whether our semantic anchors work across different LLMs. Build a multiple-choice evaluation framework that is deterministic, cheap, and reproducible.
Status: Phases 1-4 complete
Phase 1: Pilot ✅
Phase 2: Question authoring ✅
Phase 3: Automation ✅
Phase 4: Execution ✅
Results (193 questions × 10 models)
| Model | Score |
|---|---|
| claude-opus-4-6 | 99% |
| gpt-5.4-2026-03-05 | 99% |
| claude-sonnet-4-20250514 | 99% |
| claude-haiku-4-5-20251001 | 98% |
| gpt-4o | 97% |
| gpt-5.4-mini-2026-03-17 | 97% |
| mistral-large-2512 | 96% |
| devstral-2512 | 96% |
| mistral-medium-2508 | 85% |
| mistral-small-2603 | 74% |
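
For readers who want to see how a per-model score of this kind could be computed deterministically, here is a minimal sketch. It is not the project's actual harness: `Question`, `parse_choice`, and `score` are hypothetical names, and it assumes each model is prompted to reply with a single option letter first. A reply that opens with the English article "A" would be misread as option A, so a real harness would need a stricter output format.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """One multiple-choice item (hypothetical structure, for illustration)."""
    qid: str
    choices: tuple[str, ...]  # option texts, labelled A, B, C, ... in order
    answer: str               # correct option letter, e.g. "B"


def parse_choice(reply: str, n_choices: int) -> str | None:
    """Return the option letter that opens the reply, or None if there is none.

    Accepts forms such as "B", "b) ...", or "(C)". Anything else counts as
    unanswered, which keeps grading free of any LLM-based judging.
    """
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[:n_choices]
    match = re.match(rf"\(?([{letters}])\b", reply.strip().upper())
    return match.group(1) if match else None


def score(questions: list[Question], replies: dict[str, str]) -> float:
    """Fraction of questions answered correctly; missing replies count as wrong."""
    correct = sum(
        1
        for q in questions
        if parse_choice(replies.get(q.qid, ""), len(q.choices)) == q.answer
    )
    return correct / len(questions)
```

Because grading is a plain string comparison against an answer key, the same replies always produce the same score, and re-scoring costs nothing beyond the original model calls.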
Related issues