
Commit 8a54746

Merge pull request #343 from raifdmueller/feat/eval-l1-questions
feat: evaluation framework with 63 anchor specs and pilot results
2 parents 73cf01a + 2b86352 commit 8a54746

74 files changed

Lines changed: 35624 additions & 0 deletions


evaluations/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
*.pyc

evaluations/README.adoc

Lines changed: 158 additions & 0 deletions
@@ -0,0 +1,158 @@
= Semantic Anchor Evaluations
:toc:

== Overview

Multiple-choice evaluation framework for testing whether semantic anchors work across different LLMs.
See the link:../docs/anchor-evaluations.adoc[full concept document] for background and methodology.

== Quick Start

=== Prerequisites

* Python 3.10+
* `pyyaml` package: `pip install pyyaml`
* At least one of:
** Claude Code CLI (authenticated)
** OpenAI API key (`OPENAI_API_KEY` environment variable)
** Ollama running locally

=== Running the Pilot

[source,bash]
----
cd website

# Claude Sonnet (default, via CLI)
python3 evaluations/pilot.py

# Claude Haiku
python3 evaluations/pilot.py --model claude-haiku

# GPT-4o-mini (requires OPENAI_API_KEY)
python3 evaluations/pilot.py --model openai

# Ollama (requires local server + model)
ollama serve &        # start server if not running
ollama pull qwen3:4b  # pull model (once)
python3 evaluations/pilot.py --model ollama                         # uses qwen3:4b by default
python3 evaluations/pilot.py --model ollama --ollama-model mistral  # any other local model

# Multiple models at once
python3 evaluations/pilot.py --model claude-cli claude-haiku openai

# Dry run (show prompts without sending)
python3 evaluations/pilot.py --dry-run
----

=== Available Models

[cols="1,1,2"]
|===
|Flag |Model |Notes

|`claude-cli`
|Claude Sonnet (via CLI)
|Default. Requires an authenticated `claude` CLI.

|`claude-haiku`
|Claude Haiku (via CLI)
|Smallest Claude model. Good lower-bound test.

|`openai`
|GPT-4o-mini (via API)
|Requires `OPENAI_API_KEY`.

|`claude`
|Claude Sonnet (via API)
|Requires `ANTHROPIC_API_KEY`. Alternative to the CLI.

|`ollama`
|Local model (via Ollama)
|Requires an Ollama server on `localhost:11434`. Default: `qwen3:4b`; override with `--ollama-model`.
|===
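For reference, the `ollama` backend needs nothing beyond the local REST endpoint. A minimal sketch of such a query, assuming Ollama's standard `/api/generate` endpoint (the `ask_ollama` helper is illustrative; `pilot.py`'s actual implementation may differ):

[source,python]
----
import json
import urllib.request

def ask_ollama(prompt: str, model: str = "qwen3:4b") -> str:
    """Send one prompt to a local Ollama server and return the raw completion."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
----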
== Directory Structure

[source]
----
evaluations/
├── README.adoc              # This file
├── pilot.py                 # Evaluation runner script
├── specs/                   # Question specs (YAML)
│   ├── arc42.yaml
│   ├── docs-as-code.yaml
│   ├── mece.yaml
│   ├── tdd-london-school.yaml
│   └── timtowtdi.yaml
└── results/                 # Raw results (JSON, timestamped)
    └── pilot-*.json
----

== Question Spec Format

Each anchor has a YAML file with multiple-choice questions:

[source,yaml]
----
anchor: tdd-london-school
tier: 3

questions:
  recognition:          # Level 1: Does the model identify the anchor?
    question: |
      Which of the following best describes "TDD, London School"?
    options:
      A: ...            # Distractor (e.g., Chicago School description)
      B: ...            # Correct answer
      C: ...            # Distractor (e.g., BDD description)
      D: ...            # Distractor
    correct: B

  application:          # Level 2: Does it change behavior?
    scenario: |
      You are reviewing a PR. ...
    anchor_prompt: "using TDD, London School principles"
    paraphrase_prompt: "Write isolated tests for the service layer"
    options: ...
    correct: B

  consistency:          # Level 4: Same answer across aliases/languages?
    variants:
      - 'Question with canonical name'
      - 'Question with alias'
    language_variant: 'Frage auf Deutsch'   # the same question, asked in German
    options: ...
    correct: B
----
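Specs are plain YAML, so they can be loaded and sanity-checked with `pyyaml`. A minimal sketch (the `load_spec` helper is illustrative, not part of `pilot.py`):

[source,python]
----
import yaml

def load_spec(path):
    """Load a question spec and sanity-check the recognition question."""
    with open(path, encoding="utf-8") as fh:
        spec = yaml.safe_load(fh)
    recognition = spec["questions"]["recognition"]
    assert recognition["correct"] in recognition["options"], "correct key must be one of the options"
    return spec

spec = load_spec("evaluations/specs/arc42.yaml")
print(spec["anchor"], spec["tier"])
----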
== Scoring

* Each question runs *4 times* with randomized option order (position-bias mitigation)
* Score = percentage of correct answers across the 4 runs
* Response parsing: extracts the first capital letter A–D from the response (see the sketch below)
* Results saved as timestamped JSON in `results/`
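A minimal sketch of that loop, showing the shuffle and the letter extraction (`ask_model` stands in for whichever backend is selected; names are illustrative, not `pilot.py`'s actual API):

[source,python]
----
import random
import re

def first_letter(response: str) -> str | None:
    """Extract the first capital letter A-D from a model response."""
    m = re.search(r"[A-D]", response)
    return m.group(0) if m else None

def score_question(question, options, correct_key, ask_model, runs=4):
    """Ask `runs` times with shuffled option order; return the fraction answered correctly."""
    hits = 0
    for _ in range(runs):
        keys = list(options)                  # e.g. ["A", "B", "C", "D"]
        texts = list(options.values())
        random.shuffle(texts)                 # randomize which slot holds the correct text
        relabeled = dict(zip(keys, texts))
        prompt = question + "\n" + "\n".join(f"{k}: {v}" for k, v in relabeled.items())
        answer = first_letter(ask_model(prompt))
        if answer is not None and relabeled[answer] == options[correct_key]:
            hits += 1
    return hits / runs
----

The shuffle is what makes position bias visible: a model that always favors the same slot hits the correct text in roughly 1 of 4 runs, i.e. the 25% pattern in the pilot results below.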
== Pilot Results (2026-03-24)

[cols="1,1,1,1"]
|===
|Model |Average |Best |Worst

|Claude Sonnet 4.6
|100%
|all 100%
|—

|Claude Haiku 4.5
|100%
|all 100%
|—

|GPT-4o-mini
|81%
|Recognition: arc42, MECE, TIMTOWTDI (100%)
|TDD London School Recognition (25%)
|===

Key finding: *Position bias is real.* GPT-4o-mini recognizes "TDD, London School" only 25% of the time -- it picks the correct answer only when it happens to land in a favorable position.

evaluations/fill-distractors.py

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
#!/usr/bin/env python3
"""
Fill placeholder distractors in evaluation specs using the Claude API.

Reads specs with PLACEHOLDER_A/C/D options and asks Claude to generate
plausible but wrong distractors based on the anchor's domain.

Usage:
    python3 evaluations/fill-distractors.py                  # Fill all placeholders
    python3 evaluations/fill-distractors.py --dry-run        # Preview prompts
    python3 evaluations/fill-distractors.py --anchor arc42   # Single anchor
"""

import argparse
import json
import sys
from pathlib import Path

try:
    import yaml
except ImportError:
    print("PyYAML required: pip install pyyaml")
    sys.exit(1)

SPECS_DIR = Path(__file__).parent / "specs"


def needs_distractors(spec):
    """Check if the spec still has placeholder distractors."""
    q = spec.get("questions", {}).get("recognition", {})
    options = q.get("options", {})
    return any("PLACEHOLDER" in str(v) for v in options.values())


def generate_distractors(spec):
    """Use the Claude API to generate 3 plausible but wrong distractors."""
    try:
        import anthropic
    except ImportError:
        print("anthropic package required: pip install anthropic")
        sys.exit(1)

    q = spec["questions"]["recognition"]
    correct = q["options"]["B"]
    title = q["question"].strip().split('"')[1] if '"' in q["question"] else spec["anchor"]
    related = q.get("_related", [])
    proponents = q.get("_proponents", "")

    prompt = f"""Generate 3 plausible but WRONG multiple-choice distractors for this question:

Question: Which of the following best describes "{title}"?
Correct answer: {correct}

Requirements for distractors:
- Each distractor should be a one-sentence description of a DIFFERENT but related concept
- They must be wrong but sound plausible to someone unfamiliar with the topic
- All 4 options (correct + 3 distractors) should be similar in length
- Do NOT include the correct concept in any distractor
- Draw distractors from adjacent concepts in software engineering, architecture, or methodology
{f"- Related anchors for inspiration: {', '.join(related)}" if related else ""}
{f"- The correct answer is associated with: {proponents}" if proponents else ""}

Return ONLY a JSON object with keys "A", "C", "D" containing the 3 distractor strings. No explanation."""

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=300,
        temperature=0.7,  # some creativity for diverse distractors
        messages=[{"role": "user", "content": prompt}],
    )

    text = response.content[0].text.strip()
    # Parse JSON from the response (might be wrapped in ```json ... ```)
    if "```" in text:
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[4:]
        text = text.strip()

    return json.loads(text)


def main():
    parser = argparse.ArgumentParser(description="Fill placeholder distractors using Claude API")
    parser.add_argument("--dry-run", action="store_true", help="Preview without writing")
    parser.add_argument("--anchor", help="Process single anchor")
    args = parser.parse_args()

    # Collect specs that still contain placeholder options
    specs_to_fill = []
    for f in sorted(SPECS_DIR.glob("*.yaml")):
        spec = yaml.safe_load(f.read_text(encoding="utf-8"))
        if args.anchor and spec["anchor"] != args.anchor:
            continue
        if needs_distractors(spec):
            specs_to_fill.append((f, spec))

    print(f"Found {len(specs_to_fill)} specs needing distractors")

    for filepath, spec in specs_to_fill:
        anchor_id = spec["anchor"]
        print(f"  {anchor_id}...", end=" ", flush=True)

        if args.dry_run:
            print("(dry run)")
            continue

        try:
            distractors = generate_distractors(spec)
            q = spec["questions"]["recognition"]
            q["options"]["A"] = distractors["A"]
            q["options"]["C"] = distractors["C"]
            q["options"]["D"] = distractors["D"]

            # Remove helper notes that only guided distractor generation
            q.pop("_note", None)
            q.pop("_related", None)
            q.pop("_proponents", None)
            q.pop("_also_known_as", None)

            with open(filepath, "w", encoding="utf-8") as fh:
                yaml.dump(spec, fh, default_flow_style=False, allow_unicode=True, sort_keys=False)
            print("OK")

        except Exception as e:
            print(f"ERROR: {e}")

    print("\nDone. Review the generated distractors before running evaluations!")


if __name__ == "__main__":
    main()
