// docs/anchor-evaluations.adoc
For Claude, this means Sonnet 4.6 (not Opus).
For GPT and Gemini, the mid-tier variants are not yet clearly established, so we test the current flagship (GPT-5, Gemini 2.5 Pro) and add smaller variants when they become available.
A follow-up round with the cheapest variants (Haiku, GPT-5 mini, Gemini Flash) would reveal the lower boundary of anchor activation.

IMPORTANT: Always record the *exact model identifier with date suffix* (e.g., `mistral-large-2512`, not `mistral-large-latest`).
Model aliases like `-latest` can change without notice.

*Commercial models (API cost per call):*
[cols="2,1,2"]
|===
|Model |API ID |Rationale

|Claude Sonnet 4.6
|`claude-sonnet-4-20250514`
|Our primary development model. Serves as the baseline.

|GPT-4o / GPT-5
|`gpt-4o` / `gpt-5`
|OpenAI ecosystem. GPT-4o as mid-tier, GPT-5 as flagship.

|Mistral Large 3
|`mistral-large-2512`
|European flagship. Already tested (96%).

|Mistral Medium 3.1
|`mistral-medium-2508`
|European mid-tier. Frontier-class multimodal.

|Mistral Small 4
|`mistral-small-2603`
|European small model. Hybrid reasoning+coding (March 2026).

|Devstral 2
|`devstral-2512`
|Code-specialized model. Tests whether SE-focused training improves anchor recognition.

|Gemini 2.5 Pro
|TBD
|Google, different training approach.
|===
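The pinning rule from the admonition above can be enforced before any evaluation run starts. A minimal sketch, assuming date suffixes follow the vendor patterns seen in the table (`YYYYMMDD` or `YYMM`); the helper name and the heuristic regex are illustrative, and the OpenAI/Gemini IDs are excluded because they carry no date suffix or are still TBD:

```python
import re

# Model IDs as pinned in the evaluation config (mirrors the table above;
# gpt-4o / gpt-5 and the Gemini ID are excluded: no date suffix / TBD).
PINNED_IDS = [
    "claude-sonnet-4-20250514",
    "mistral-large-2512",
    "mistral-medium-2508",
    "mistral-small-2603",
    "devstral-2512",
]

# Heuristic: a pinned ID ends in a date suffix (YYYYMMDD or YYMM),
# while floating aliases end in "-latest".
DATE_SUFFIX = re.compile(r"-(\d{8}|\d{4})$")

def is_pinned(model_id: str) -> bool:
    # Reject aliases outright, then require an explicit date suffix.
    return not model_id.endswith("-latest") and bool(DATE_SUFFIX.search(model_id))

assert all(is_pinned(m) for m in PINNED_IDS)
assert not is_pinned("mistral-large-latest")
```

Running this check at startup turns a silently drifting alias into a hard failure before any API cost is incurred.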
*Open-weight models (run locally via Ollama):*
[cols="2,1,2"]
|===
|Model |Local? |Rationale

|Llama 4 Maverick
|Yes (Ollama)
|Largest open-weight model. Shows whether anchors work without proprietary training.

|DeepSeek V3
|Yes (Ollama)
|Chinese model. Tests whether anchors work across cultural and training-data boundaries.

|Ministral 3 8B
|Yes (Ollama)
|Mistral's tiny model. Lower boundary test.
|===

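A local run against one of these open-weight models can go through Ollama's HTTP API. A minimal sketch, assuming an Ollama server on the default port; the model tag and function names are illustrative:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> dict:
    # Non-streaming generation request for Ollama's /api/generate endpoint.
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    # Send the request and return the model's text completion.
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama server and a pulled model;
# "llama4:maverick" is a placeholder tag):
# print(ask("llama4:maverick", "Which anchor applies here? ..."))
```

Keeping the payload construction in its own function makes the local path testable without a live server.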
=== Effort Estimate
Each question runs 4 times (randomized option order) to control for position bias.
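The four randomized runs per question can be produced by shuffling the option order with a seeded generator and keeping a back-mapping, so a model's positional answer is scored against the canonical option order. A sketch; function and variable names are illustrative:

```python
import random

def shuffled_runs(options, n_runs=4, seed=0):
    """Yield (run_index, shown_options, back_map) for each run.

    back_map[i] is the canonical index of the option shown at position i,
    so a positional answer p maps back to options[back_map[p]].
    """
    rng = random.Random(seed)  # seeded for reproducible run sheets
    for run in range(n_runs):
        order = list(range(len(options)))
        rng.shuffle(order)
        shown = [options[i] for i in order]
        yield run, shown, order

options = ["anchor A", "anchor B", "anchor C", "anchor D"]
for run, shown, back_map in shuffled_runs(options):
    # Every run presents the same options, only in a different order.
    assert sorted(shown) == sorted(options)
```

Recording `seed` alongside each run makes the position-bias control auditable after the fact.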