
Commit 8bf9002

Merge pull request #276 from LLM-Coding/copilot/add-llm-evaluations-anchor
Add LLM-Evaluations semantic anchor
2 parents bc1f707 + e22839e commit 8bf9002

7 files changed

Lines changed: 156 additions & 5 deletions


docs/anchors/llm-evaluations.adoc

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
= LLM-Evaluations
:categories: testing-quality
:roles: data-scientist, software-developer, qa-engineer, software-architect
:related: chain-of-thought, sota, mutation-testing
:proponents: Percy Liang (Stanford HELM), EleutherAI (Open LLM Leaderboard), LMSYS (Chatbot Arena)
:tags: llm, evaluation, benchmarks, metrics, leaderboard, nlp, ai
:tier: 3

[%collapsible]
====
Full Name:: Large Language Model Evaluations

Also known as:: LLM Benchmarking, LLM Assessment, Foundation Model Evaluation

[discrete]
== *Core Concepts*:

Benchmark Suites:: Standardized datasets and tasks used to compare LLM capabilities — MMLU (Massive Multitask Language Understanding), HellaSwag, HumanEval, BIG-Bench, GSM8K, TruthfulQA, ARC

Evaluation Metrics:: Quantitative measures of model quality — perplexity, accuracy, BLEU, ROUGE, F1, pass@k (code generation), exact match, calibration

Automatic vs. Human Evaluation:: Automated scoring via metrics or reference outputs (fast, scalable) vs. human judgment (nuanced, expensive); hybrid approaches such as LLM-as-judge

HELM (Holistic Evaluation of Language Models):: Stanford framework evaluating models across multiple scenarios and metrics simultaneously to surface trade-offs across accuracy, robustness, fairness, and efficiency

Chatbot Arena / Elo Rating:: Human preference-based evaluation where two models respond to the same prompt and humans choose the better answer; produces Elo-style rankings

Open LLM Leaderboard:: Hugging Face / EleutherAI hosted ranking of open-source models across standardized benchmarks enabling reproducible comparisons

Red-Teaming & Safety Evaluation:: Systematic adversarial probing for harmful outputs, jailbreaks, and failure modes; a required step before production deployment

Contamination & Overfitting:: Risk that a model's training data includes benchmark test sets, inflating apparent performance; mitigated by held-out or dynamic benchmarks

Task-Specific vs. General Evaluation:: Targeted evaluation for a specific use case (e.g., code, summarization, RAG retrieval) vs. broad capability assessment across diverse domains

Key Proponents:: Percy Liang et al. (Stanford, "Holistic Evaluation of Language Models"), EleutherAI ("Language Model Evaluation Harness"), LMSYS ("Chatbot Arena: Benchmarking LLMs in the Wild")

[discrete]
== *When to Use*:

* Selecting a foundation model for a specific application domain
* Comparing fine-tuned model versions during iterative training
* Validating that a model meets quality, safety, and fairness requirements before deployment
* Reproducing or challenging published model capability claims
* Establishing regression baselines when updating a deployed model
* Communicating model strengths and limitations to non-technical stakeholders

[discrete]
== *Related Anchors*:

* <<chain-of-thought,Chain of Thought (CoT)>>
* <<sota,SOTA (State-of-the-Art)>>
* <<mutation-testing,Mutation Testing>>
====
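The pass@k metric listed under Evaluation Metrics in the anchor above is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass the tests, and estimate the chance that at least one of k samples would pass. A minimal Python sketch; the function name and example numbers are illustrative and not part of this repository:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: chance that at least one of k samples,
    drawn without replacement from n generations of which c are correct,
    passes the unit tests."""
    if n - c < k:
        return 1.0  # every possible k-subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 generations per problem, 37 of them pass
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185
print(round(pass_at_k(n=200, c=37, k=10), 3))  # substantially higher than pass@1
```

Per-problem estimates are then averaged over the whole benchmark to report a single pass@k score.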
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
= LLM-Evaluations
:categories: testing-quality
:roles: data-scientist, software-developer, qa-engineer, software-architect
:related: chain-of-thought, sota, mutation-testing
:proponents: Percy Liang (Stanford HELM), EleutherAI (Open LLM Leaderboard), LMSYS (Chatbot Arena)
:tags: llm, evaluation, benchmarks, metrics, leaderboard, nlp, ai

[%collapsible]
====
Vollständiger Name:: Large Language Model Evaluations (Bewertung großer Sprachmodelle)

Auch bekannt als:: LLM-Benchmarking, LLM-Bewertung, Foundation-Model-Evaluation

[discrete]
== *Kernkonzepte*:

Benchmark-Suiten:: Standardisierte Datensätze und Aufgaben zum Vergleich von LLM-Fähigkeiten — MMLU (Massive Multitask Language Understanding), HellaSwag, HumanEval, BIG-Bench, GSM8K, TruthfulQA, ARC

Evaluationsmetriken:: Quantitative Maße für Modellqualität — Perplexity, Genauigkeit, BLEU, ROUGE, F1, pass@k (Code-Generierung), Exact Match, Kalibrierung

Automatische vs. menschliche Evaluation:: Automatisierte Bewertung über Metriken oder Referenzausgaben (schnell, skalierbar) vs. menschliches Urteil (differenziert, kostenintensiv); hybride Ansätze wie LLM-as-Judge

HELM (Holistic Evaluation of Language Models):: Stanford-Framework, das Modelle über mehrere Szenarien und Metriken gleichzeitig bewertet, um Kompromisse bei Genauigkeit, Robustheit, Fairness und Effizienz sichtbar zu machen

Chatbot Arena / Elo-Rating:: Präferenzbasierte Evaluation, bei der zwei Modelle auf denselben Prompt antworten und Menschen die bessere Antwort wählen; erzeugt Elo-ähnliche Ranglisten

Open LLM Leaderboard:: Von Hugging Face / EleutherAI gehostetes Ranking von Open-Source-Modellen anhand standardisierter Benchmarks für reproduzierbare Vergleiche

Red-Teaming & Sicherheitsevaluation:: Systematisches adversariales Testen auf schädliche Ausgaben, Jailbreaks und Fehlerszenarien; notwendiger Schritt vor dem Produktionseinsatz

Datenkontamination & Overfitting:: Risiko, dass Trainingsdaten eines Modells die Test-Sets der Benchmarks enthalten und so die scheinbare Leistung aufblähen; Gegenmaßnahmen: zurückgehaltene oder dynamische Benchmarks

Aufgabenspezifische vs. allgemeine Evaluation:: Gezielte Bewertung für einen spezifischen Anwendungsfall (z. B. Code, Zusammenfassung, RAG-Retrieval) vs. breite Fähigkeitsbewertung über diverse Domänen

Schlüsselvertreter:: Percy Liang et al. (Stanford, "Holistic Evaluation of Language Models"), EleutherAI ("Language Model Evaluation Harness"), LMSYS ("Chatbot Arena: Benchmarking LLMs in the Wild")

[discrete]
== *Wann zu verwenden*:

* Auswahl eines Foundation-Modells für eine spezifische Anwendungsdomäne
* Vergleich feinjustierter Modellversionen während des iterativen Trainings
* Validierung, dass ein Modell Qualitäts-, Sicherheits- und Fairness-Anforderungen vor dem Deployment erfüllt
* Reproduzieren oder Hinterfragen veröffentlichter Modell-Leistungsaussagen
* Erstellen von Regressions-Baselines beim Update eines eingesetzten Modells
* Kommunikation von Modellstärken und -grenzen an nicht-technische Stakeholder

[discrete]
== *Verwandte Anker*:

* <<chain-of-thought,Chain of Thought (CoT)>>
* <<sota,SOTA (State-of-the-Art)>>
* <<mutation-testing,Mutation Testing>>
====
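The Chatbot Arena / Elo rating concept described in both language versions of this anchor turns pairwise human votes into a ranking. Below is a minimal sketch of a classic online Elo update, assuming a K-factor of 32 and the standard 400-point logistic scale; it illustrates the idea only and is not the exact LMSYS pipeline, which has since moved to a Bradley-Terry style fit over all votes:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of model A against model B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one human vote.
    score_a is 1.0 if A's answer was preferred, 0.0 if B's, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Example: both models start at 1000; a voter prefers model A's answer
print(elo_update(1000.0, 1000.0, score_a=1.0))  # (1016.0, 984.0)
```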

skill/semantic-anchor-translator/references/catalog.md

Lines changed: 5 additions & 0 deletions
@@ -56,6 +56,11 @@ Source: https://github.com/LLM-Coding/Semantic-Anchors
- **Proponents:** Loren Kohnfelder, Praerit Garg (Microsoft), Adam Shostack
- **Core:** Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege — structured threat categorization for security design

+### LLM-Evaluations
+- **Also known as:** LLM Benchmarking, LLM Assessment, Foundation Model Evaluation
+- **Proponents:** Percy Liang (Stanford HELM), EleutherAI (Open LLM Leaderboard), LMSYS (Chatbot Arena)
+- **Core:** Frameworks and metrics for assessing LLM capabilities — benchmark suites (MMLU, HumanEval, BIG-Bench), automatic vs. human evaluation, HELM, Chatbot Arena Elo ratings, red-teaming, contamination detection
+

## Software Architecture

### Clean Architecture
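The contamination detection named in the LLM-Evaluations catalog entry above (and in the anchor's Contamination & Overfitting concept) is often screened with verbatim n-gram overlap between benchmark test items and the training corpus. A rough sketch of that heuristic; the 8-gram window and the in-memory corpus are simplifying assumptions, not a prescribed procedure:

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams of a document."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_item: str, training_docs: list[str], n: int = 8) -> bool:
    """Flag a benchmark item if any of its n-grams appears verbatim in training data."""
    item_grams = ngrams(test_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)

# Usage: drop flagged items before reporting benchmark scores
# clean_items = [x for x in test_set if not is_contaminated(x, corpus)]
```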

website/public/data/anchors.json

Lines changed: 34 additions & 0 deletions
@@ -1590,6 +1590,40 @@
    "filePath": "docs/anchors/linddun.adoc",
    "tier": 3
  },
+ {
+   "id": "llm-evaluations",
+   "title": "LLM-Evaluations",
+   "categories": [
+     "testing-quality"
+   ],
+   "roles": [
+     "data-scientist",
+     "software-developer",
+     "qa-engineer",
+     "software-architect"
+   ],
+   "related": [
+     "chain-of-thought",
+     "sota",
+     "mutation-testing"
+   ],
+   "proponents": [
+     "Percy Liang (Stanford HELM)",
+     "EleutherAI (Open LLM Leaderboard)",
+     "LMSYS (Chatbot Arena)"
+   ],
+   "tags": [
+     "llm",
+     "evaluation",
+     "benchmarks",
+     "metrics",
+     "leaderboard",
+     "nlp",
+     "ai"
+   ],
+   "filePath": "docs/anchors/llm-evaluations.adoc",
+   "tier": 3
+ },
  {
    "id": "madr",
    "title": "MADR",

website/public/data/categories.json

Lines changed: 1 addition & 0 deletions
@@ -161,6 +161,7 @@
    "gherkin",
    "iec-61508-sil-levels",
    "linddun",
+   "llm-evaluations",
    "mutation-testing",
    "owasp-top-10",
    "property-based-testing",

website/public/data/metadata.json

Lines changed: 5 additions & 5 deletions
@@ -1,15 +1,15 @@
 {
-  "generatedAt": "2026-03-17T19:53:57.003Z",
+  "generatedAt": "2026-03-18T09:45:19.699Z",
   "version": "1.0.0",
   "counts": {
-    "anchors": 104,
+    "anchors": 105,
     "categories": 12,
     "roles": 12
   },
   "statistics": {
-    "averageRolesPerAnchor": "3.08",
+    "averageRolesPerAnchor": "3.09",
     "averageCategoriesPerAnchor": "1.01",
-    "anchorsWithTags": 64,
-    "anchorsWithRelated": 35
+    "anchorsWithTags": 65,
+    "anchorsWithRelated": 36
   }
 }
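These metadata.json deltas are consistent with adding one anchor that carries tags, related entries, and four roles: (104 × 3.08 + 4) / 105 ≈ 3.09 for the roles average, and the tag and related counters each increase by one. The generator script is not part of this commit, so the following is only an assumed sketch of how those fields could be recomputed from anchors.json:

```python
import json
from datetime import datetime, timezone

def build_metadata(anchors_path: str = "website/public/data/anchors.json") -> dict:
    """Recompute the counts/statistics fields seen in metadata.json (assumed derivation)."""
    with open(anchors_path, encoding="utf-8") as f:
        anchors = json.load(f)  # assumed: top level is the list of anchor records
    categories = {c for a in anchors for c in a.get("categories", [])}
    roles = {r for a in anchors for r in a.get("roles", [])}
    return {
        "generatedAt": datetime.now(timezone.utc).isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "version": "1.0.0",
        "counts": {"anchors": len(anchors), "categories": len(categories), "roles": len(roles)},
        "statistics": {
            "averageRolesPerAnchor": f"{sum(len(a.get('roles', [])) for a in anchors) / len(anchors):.2f}",
            "averageCategoriesPerAnchor": f"{sum(len(a.get('categories', [])) for a in anchors) / len(anchors):.2f}",
            "anchorsWithTags": sum(bool(a.get("tags")) for a in anchors),
            "anchorsWithRelated": sum(bool(a.get("related")) for a in anchors),
        },
    }
```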

website/public/data/roles.json

Lines changed: 4 additions & 0 deletions
@@ -69,6 +69,7 @@
    "anchors": [
      "chain-of-thought",
      "control-chart-shewhart",
+     "llm-evaluations",
      "mece",
      "nelson-rules",
      "sota",
@@ -152,6 +153,7 @@
      "gherkin",
      "iec-61508-sil-levels",
      "linddun",
+     "llm-evaluations",
      "mutation-testing",
      "owasp-top-10",
      "property-based-testing",
@@ -214,6 +216,7 @@
      "iec-61508-sil-levels",
      "lasr",
      "linddun",
+     "llm-evaluations",
      "madr",
      "mece",
      "mikado-method",
@@ -286,6 +289,7 @@
      "invest",
      "lasr",
      "linddun",
+     "llm-evaluations",
      "madr",
      "mental-model-according-to-naur",
      "mikado-method",
