EPIC: Semantic Anchor Evaluations across LLMs #329

## Goal

Systematically evaluate whether our semantic anchors work across different LLMs. Build a multiple-choice evaluation framework that is deterministic, cheap, and reproducible.
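To make "deterministic" concrete, here is a minimal sketch of how a single multiple-choice reply could be graded without a judge model. The `Question` schema, the field names, and the strict "bare letter only" parsing rule are assumptions for illustration, not the format used by the runner from #335.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """One multiple-choice item (hypothetical schema, not the project's)."""
    anchor: str               # anchor under test
    level: int                # 1 = Recognition, 2 = Application, ...
    prompt: str
    options: dict[str, str]   # option letter -> option text
    correct: str              # letter of the correct option


# Accept only a bare option letter such as "B", "b", "(B)" or "B.".
# Anything longer is treated as unparseable so grading never guesses.
_CHOICE = re.compile(r"\s*\(?([A-Za-z])\)?[.:]?\s*")


def parse_choice(reply: str, question: Question) -> str | None:
    """Return the chosen option letter, or None if the reply is not a bare letter."""
    match = _CHOICE.fullmatch(reply)
    if match is None:
        return None
    letter = match.group(1).upper()
    return letter if letter in question.options else None


def grade(reply: str, question: Question) -> bool:
    """A reply is correct only if it parses to the expected letter."""
    return parse_choice(reply, question) == question.correct


def accuracy(replies: list[str], questions: list[Question]) -> float:
    """Fraction of correct replies; replies[i] answers questions[i]."""
    if not questions or len(replies) != len(questions):
        raise ValueError("need one reply per question and at least one question")
    return sum(grade(r, q) for r, q in zip(replies, questions)) / len(questions)
```

Because grading is a pure string comparison, re-running it on stored replies always yields the same score. Whether unparseable replies should count as wrong (as here) or be retried is a design choice this sketch does not settle.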

## Status: Phases 1 and 4 complete; Phases 2 and 3 have open items (#333, #334, #336)

### Phase 1: Pilot ✅
- [x] #330 — Manual pilot with 5 anchors, 7 models — **CLOSED**

### Phase 2: Question authoring ✅
- [x] #331 — Level 1 (Recognition) questions for 66 anchors — **CLOSED**
- [x] #332 — Level 2 (Application) questions for 59 anchors — **CLOSED**
- [ ] #333 — Level 3 (Differentiation) questions for conflict groups — OPEN
- [ ] #334 — Level 4 (Consistency) variants (aliases + language) — OPEN

### Phase 3: Automation ✅
- [x] #335 — Build evaluation runner script — **CLOSED**
- [ ] #336 — Build results heatmap for the website — OPEN (static HTML report exists)

### Phase 4: Execution ✅
- [x] #337 — Run full evaluation across 10 models — **CLOSED**

### Results (193 questions × 10 models)

| Model | Score |
|-------|:---:|
| claude-opus-4-6 | 99% |
| gpt-5.4-2026-03-05 | 99% |
| claude-sonnet-4-20250514 | 99% |
| claude-haiku-4-5-20251001 | 98% |
| gpt-4o | 97% |
| gpt-5.4-mini-2026-03-17 | 97% |
| mistral-large-2512 | 96% |
| devstral-2512 | 96% |
| mistral-medium-2508 | 85% |
| mistral-small-2603 | 74% |
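The scores above are whole percentages over 193 questions, so each figure corresponds to a small range of possible correct-answer counts. The sketch below lists those counts, assuming the percentages were rounded to the nearest integer (half up); the issue does not say whether they were rounded or truncated, and the function names are illustrative.

```python
def round_half_up_percent(correct: int, total: int) -> int:
    """Percentage of correct answers, rounded half up, using integer arithmetic."""
    # floor(100 * correct / total + 0.5) without floating-point error
    return (200 * correct + total) // (2 * total)


def counts_for_percent(percent: int, total: int) -> list[int]:
    """All correct-answer counts that would be reported as `percent`."""
    return [c for c in range(total + 1) if round_half_up_percent(c, total) == percent]


if __name__ == "__main__":
    for reported in (99, 98, 97, 96, 85, 74):
        print(reported, counts_for_percent(reported, 193))
```

Under that rounding assumption, one question is worth roughly half a percentage point, so a reported 99% is compatible with either 191 or 192 correct answers. Differences of a single point between models can therefore come down to one or two questions.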

### Related issues
- #362 — Integrate evaluation insights into anchor catalog
- #370 — Evaluation framework for Semantic Contracts (compliance testing)
