Merge pull request #276 from LLM-Coding/copilot/add-llm-evaluations-anchor

rdmueller · web-flow · commit 8bf900223e16 · 2026-03-18T13:40:57.000+01:00
Add LLM-Evaluations semantic anchor
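
Reviewer note: two of the concepts named in the new anchor reduce to short formulas, pass@k (listed under Evaluation Metrics) and Elo-style ranking (listed under Chatbot Arena). The sketch below illustrates both and is not part of the change set. The function names `pass_at_k` and `elo_update` are hypothetical. The pass@k code uses the commonly cited unbiased estimator 1 - C(n-c, k) / C(n, k). The K-factor of 32 and the 400-point scale are conventional Elo defaults assumed here, not values taken from Chatbot Arena.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, of which c passed the tests.

    Hypothetical helper: probability that at least one of k samples drawn
    without replacement from the n generated samples is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k_factor: float = 32.0) -> tuple[float, float]:
    """Apply one pairwise Elo update after a human preference vote.

    score_a is 1.0 if model A was preferred, 0.0 if model B was, 0.5 for a tie.
    k_factor=32 and the 400-point scale are assumed conventional defaults.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k_factor * (score_a - expected_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # 10 samples, 3 correct: chance that a single drawn sample passes.
    print(pass_at_k(n=10, c=3, k=1))
    # Two equally rated models; A wins the comparison.
    print(elo_update(1000.0, 1000.0, score_a=1.0))
```

The estimator draws without replacement, so it is less noisy than generating exactly k samples per problem and checking whether any passed. The Elo update is order-dependent, so rankings built this way can shift with the order in which votes are processed.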
diff --git a/docs/anchors/llm-evaluations.adoc b/docs/anchors/llm-evaluations.adoc
@@ -0,0 +1,54 @@
+= LLM-Evaluations
+:categories: testing-quality
+:roles: data-scientist, software-developer, qa-engineer, software-architect
+:related: chain-of-thought, sota, mutation-testing
+:proponents: Percy Liang (Stanford HELM), EleutherAI (Open LLM Leaderboard), LMSYS (Chatbot Arena)
+:tags: llm, evaluation, benchmarks, metrics, leaderboard, nlp, ai
+:tier: 3
+
+[%collapsible]
+====
+Full Name:: Large Language Model Evaluations
+
+Also known as:: LLM Benchmarking, LLM Assessment, Foundation Model Evaluation
+
+[discrete]
+== *Core Concepts*:
+
+Benchmark Suites:: Standardized datasets and tasks used to compare LLM capabilities — MMLU (Massive Multitask Language Understanding), HellaSwag, HumanEval, BIG-Bench, GSM8K, TruthfulQA, ARC
+
+Evaluation Metrics:: Quantitative measures of model quality — perplexity, accuracy, BLEU, ROUGE, F1, pass@k (code generation), exact match, calibration
+
+Automatic vs. Human Evaluation:: Automated scoring via metrics or reference outputs (fast, scalable) vs. human judgment (nuanced, expensive); hybrid approaches such as LLM-as-judge
+
+HELM (Holistic Evaluation of Language Models):: Stanford framework evaluating models across multiple scenarios and metrics simultaneously to surface trade-offs across accuracy, robustness, fairness, and efficiency
+
+Chatbot Arena / Elo Rating:: Human preference-based evaluation where two models respond to the same prompt and humans choose the better answer; produces Elo-style rankings
+
+Open LLM Leaderboard:: Hugging Face-hosted ranking of open models, scored on standardized benchmarks via EleutherAI's Language Model Evaluation Harness, enabling reproducible comparisons
+
+Red-Teaming & Safety Evaluation:: Systematic adversarial probing for harmful outputs, jailbreaks, and failure modes; commonly treated as a prerequisite for production deployment
+
+Contamination & Overfitting:: Risk that a model's training data includes benchmark test sets, inflating apparent performance; mitigated by held-out or dynamic benchmarks
+
+Task-Specific vs. General Evaluation:: Targeted evaluation for a specific use case (e.g., code, summarization, RAG retrieval) vs. broad capability assessment across diverse domains
+
+Key Proponents:: Percy Liang et al. (Stanford, "Holistic Evaluation of Language Models"), EleutherAI ("Language Model Evaluation Harness"), LMSYS ("Chatbot Arena: Benchmarking LLMs in the Wild")
+
+[discrete]
+== *When to Use*:
+
+* Selecting a foundation model for a specific application domain
+* Comparing fine-tuned model versions during iterative training
+* Validating that a model meets quality, safety, and fairness requirements before deployment
+* Reproducing or challenging published model capability claims
+* Establishing regression baselines when updating a deployed model
+* Communicating model strengths and limitations to non-technical stakeholders
+
+[discrete]
+== *Related Anchors*:
+
+* <<chain-of-thought,Chain of Thought (CoT)>>
+* <<sota,SOTA (State-of-the-Art)>>
+* <<mutation-testing,Mutation Testing>>
+====
diff --git a/docs/anchors/llm-evaluations.de.adoc b/docs/anchors/llm-evaluations.de.adoc
@@ -0,0 +1,53 @@
+= LLM-Evaluations
+:categories: testing-quality
+:roles: data-scientist, software-developer, qa-engineer, software-architect
+:related: chain-of-thought, sota, mutation-testing
+:proponents: Percy Liang (Stanford HELM), EleutherAI (Open LLM Leaderboard), LMSYS (Chatbot Arena)
+:tags: llm, evaluation, benchmarks, metrics, leaderboard, nlp, ai
+
+[%collapsible]
+====
+Vollständiger Name:: Large Language Model Evaluations (Bewertung großer Sprachmodelle)
+
+Auch bekannt als:: LLM-Benchmarking, LLM-Bewertung, Foundation-Model-Evaluation
+
+[discrete]
+== *Kernkonzepte*:
+
+Benchmark-Suiten:: Standardisierte Datensätze und Aufgaben zum Vergleich von LLM-Fähigkeiten — MMLU (Massive Multitask Language Understanding), HellaSwag, HumanEval, BIG-Bench, GSM8K, TruthfulQA, ARC
+
+Evaluationsmetriken:: Quantitative Maße für Modellqualität — Perplexity, Genauigkeit, BLEU, ROUGE, F1, pass@k (Code-Generierung), Exact Match, Kalibrierung
+
+Automatische vs. menschliche Evaluation:: Automatisierte Bewertung über Metriken oder Referenzausgaben (schnell, skalierbar) vs. menschliches Urteil (differenziert, kostenintensiv); hybride Ansätze wie LLM-as-Judge
+
+HELM (Holistic Evaluation of Language Models):: Stanford-Framework, das Modelle über mehrere Szenarien und Metriken gleichzeitig bewertet, um Kompromisse bei Genauigkeit, Robustheit, Fairness und Effizienz sichtbar zu machen
+
+Chatbot Arena / Elo-Rating:: Präferenzbasierte Evaluation, bei der zwei Modelle auf denselben Prompt antworten und Menschen die bessere Antwort wählen; erzeugt Elo-ähnliche Ranglisten
+
+Open LLM Leaderboard:: Von Hugging Face gehostetes Ranking offener Modelle, bewertet anhand standardisierter Benchmarks mit dem Language Model Evaluation Harness von EleutherAI, für reproduzierbare Vergleiche
+
+Red-Teaming & Sicherheitsevaluation:: Systematisches adversariales Testen auf schädliche Ausgaben, Jailbreaks und Fehlerszenarien; notwendiger Schritt vor dem Produktionseinsatz
+
+Datenkontamination & Overfitting:: Risiko, dass Trainingsdaten eines Modells die Test-Sets der Benchmarks enthalten und so die scheinbare Leistung aufblähen; Gegenmaßnahmen: zurückgehaltene oder dynamische Benchmarks
+
+Aufgabenspezifische vs. allgemeine Evaluation:: Gezielte Bewertung für einen spezifischen Anwendungsfall (z. B. Code, Zusammenfassung, RAG-Retrieval) vs. breite Fähigkeitsbewertung über diverse Domänen
+
+Schlüsselvertreter:: Percy Liang et al. (Stanford, "Holistic Evaluation of Language Models"), EleutherAI ("Language Model Evaluation Harness"), LMSYS ("Chatbot Arena: Benchmarking LLMs in the Wild")
+
+[discrete]
+== *Wann zu verwenden*:
+
+* Auswahl eines Foundation-Modells für eine spezifische Anwendungsdomäne
+* Vergleich feinjustierter Modellversionen während des iterativen Trainings
+* Validierung, dass ein Modell Qualitäts-, Sicherheits- und Fairness-Anforderungen vor dem Deployment erfüllt
+* Reproduzieren oder Hinterfragen veröffentlichter Modell-Leistungsaussagen
+* Erstellen von Regressions-Baselines beim Update eines eingesetzten Modells
+* Kommunikation von Modellstärken und -grenzen an nicht-technische Stakeholder
+
+[discrete]
+== *Verwandte Anker*:
+
+* <<chain-of-thought,Chain of Thought (CoT)>>
+* <<sota,SOTA (State-of-the-Art)>>
+* <<mutation-testing,Mutation Testing>>
+====
diff --git a/skill/semantic-anchor-translator/references/catalog.md b/skill/semantic-anchor-translator/references/catalog.md
@@ -56,6 +56,11 @@ Source: https://github.com/LLM-Coding/Semantic-Anchors
 - **Proponents:** Loren Kohnfelder, Praerit Garg (Microsoft), Adam Shostack
 - **Core:** Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege — structured threat categorization for security design
 
+### LLM-Evaluations
+- **Also known as:** LLM Benchmarking, LLM Assessment, Foundation Model Evaluation
+- **Proponents:** Percy Liang (Stanford HELM), EleutherAI (Open LLM Leaderboard), LMSYS (Chatbot Arena)
+- **Core:** Frameworks and metrics for assessing LLM capabilities — benchmark suites (MMLU, HumanEval, BIG-Bench), automatic vs. human evaluation, HELM, Chatbot Arena Elo ratings, red-teaming, contamination detection
+
 ## Software Architecture
 
 ### Clean Architecture
diff --git a/website/public/data/anchors.json b/website/public/data/anchors.json
@@ -1590,6 +1590,40 @@
     "filePath": "docs/anchors/linddun.adoc",
     "tier": 3
   },
+  {
+    "id": "llm-evaluations",
+    "title": "LLM-Evaluations",
+    "categories": [
+      "testing-quality"
+    ],
+    "roles": [
+      "data-scientist",
+      "software-developer",
+      "qa-engineer",
+      "software-architect"
+    ],
+    "related": [
+      "chain-of-thought",
+      "sota",
+      "mutation-testing"
+    ],
+    "proponents": [
+      "Percy Liang (Stanford HELM)",
+      "EleutherAI (Open LLM Leaderboard)",
+      "LMSYS (Chatbot Arena)"
+    ],
+    "tags": [
+      "llm",
+      "evaluation",
+      "benchmarks",
+      "metrics",
+      "leaderboard",
+      "nlp",
+      "ai"
+    ],
+    "filePath": "docs/anchors/llm-evaluations.adoc",
+    "tier": 3
+  },
   {
     "id": "madr",
     "title": "MADR",
diff --git a/website/public/data/categories.json b/website/public/data/categories.json
@@ -161,6 +161,7 @@
       "gherkin",
       "iec-61508-sil-levels",
       "linddun",
+      "llm-evaluations",
       "mutation-testing",
       "owasp-top-10",
       "property-based-testing",
diff --git a/website/public/data/metadata.json b/website/public/data/metadata.json
@@ -1,15 +1,15 @@
 {
-  "generatedAt": "2026-03-17T19:53:57.003Z",
+  "generatedAt": "2026-03-18T09:45:19.699Z",
   "version": "1.0.0",
   "counts": {
-    "anchors": 104,
+    "anchors": 105,
     "categories": 12,
     "roles": 12
   },
   "statistics": {
-    "averageRolesPerAnchor": "3.08",
+    "averageRolesPerAnchor": "3.09",
     "averageCategoriesPerAnchor": "1.01",
-    "anchorsWithTags": 64,
-    "anchorsWithRelated": 35
+    "anchorsWithTags": 65,
+    "anchorsWithRelated": 36
   }
 }
diff --git a/website/public/data/roles.json b/website/public/data/roles.json
@@ -69,6 +69,7 @@
     "anchors": [
       "chain-of-thought",
       "control-chart-shewhart",
+      "llm-evaluations",
       "mece",
       "nelson-rules",
       "sota",
@@ -152,6 +153,7 @@
       "gherkin",
       "iec-61508-sil-levels",
       "linddun",
+      "llm-evaluations",
       "mutation-testing",
       "owasp-top-10",
       "property-based-testing",
@@ -214,6 +216,7 @@
       "iec-61508-sil-levels",
       "lasr",
       "linddun",
+      "llm-evaluations",
       "madr",
       "mece",
       "mikado-method",
@@ -286,6 +289,7 @@
       "invest",
       "lasr",
       "linddun",
+      "llm-evaluations",
       "madr",
       "mental-model-according-to-naur",
       "mikado-method",