EPIC: Semantic Anchor Evaluations across LLMs #329

Description

@raifdmueller

Goal

Systematically evaluate whether our semantic anchors work across different LLMs. Build a multiple-choice evaluation framework that is deterministic, cheap, and reproducible.
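A minimal sketch of what such a multiple-choice scorer could look like, assuming each question has lettered options and exactly one correct letter. The names `Question`, `parse_choice`, `score`, and the `ask` callable are hypothetical and not taken from this repository. `ask` stands in for whatever client sends the prompt to a model.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Question:
    """Hypothetical question record: prompt, lettered options, correct letter."""
    prompt: str
    options: Dict[str, str]  # e.g. {"A": "...", "B": "..."}
    answer: str              # correct option letter, e.g. "B"


def parse_choice(reply: str, valid: set) -> Optional[str]:
    """Extract a leading option letter such as 'B', 'b)' or '(C)'.

    Strict on purpose: a reply that does not start with a valid letter
    yields None and is counted as wrong, so scoring needs no judgment call.
    """
    match = re.match(r"\(?([A-Z])\b", reply.strip().upper())
    if match and match.group(1) in valid:
        return match.group(1)
    return None


def score(questions: List[Question], ask: Callable[[Question], str]) -> float:
    """Return the fraction of questions answered with the correct letter."""
    correct = sum(
        parse_choice(ask(q), set(q.options)) == q.answer for q in questions
    )
    return correct / len(questions)
```

Comparing a single letter against a fixed answer key keeps the grading step deterministic and cheap, since no second model is needed as a judge. Reproducibility of the model replies themselves would still depend on settings such as temperature, which this sketch leaves to the `ask` implementation.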

Status: Phase 2 complete, Phases 3-4 partially done

Phase 1: Pilot ✅

Phase 2: Question authoring ✅

Phase 3: Automation ✅

Phase 4: Execution ✅

Results (193 questions × 10 models)

Model                        Score
claude-opus-4-6              99%
gpt-5.4-2026-03-05           99%
claude-sonnet-4-20250514     99%
claude-haiku-4-5-20251001    98%
gpt-4o                       97%
gpt-5.4-mini-2026-03-17      97%
mistral-large-2512           96%
devstral-2512                96%
mistral-medium-2508          85%
mistral-small-2603           74%

Related issues
