|
| 1 | +# Model Testing Metrics |
| 2 | + |
| 3 | +|Skill Type (Task)|What It Tests|Example Dataset|Metric to Measure Accuracy| |
| 4 | +|-----------------|-------------|---------------|--------------------------| |
| 5 | +|Reasoning / Logic|Mathematical reasoning|GSM8K|Accuracy (correct answers / total)| |
| 6 | +|Commonsense QA|Everyday reasoning and knowledge|PIQA, BoolQ|Accuracy| |
| 7 | +|Summarization|Condensing information|CNN/DailyMail, XSum|ROUGE-L, BERTScore| |
| 8 | +|Code Generation|Logical structure|HumanEval-lite, MBPP|Pass@k| |
| 9 | + |
| 10 | +## Datasets |
| 11 | + |
| 12 | +### GSM8K (Grade School Math 8K) |
| 13 | + |
| 14 | +GSM8K is a dataset of 8.5K high-quality, linguistically diverse |
| 15 | +grade school math word problems. The dataset was created to support the task |
| 16 | +of question answering on basic mathematical problems that require multi-step |
| 17 | +reasoning. |
| 18 | + |
| 19 | +### BoolQ |
| 20 | + |
| 21 | +BoolQ is a question answering dataset of yes/no questions containing |
| 22 | +15,942 examples. The questions are naturally occurring: they are |
| 23 | +generated in unprompted and unconstrained settings. |
| 24 | + |
| 25 | +### PIQA |
| 26 | + |
| 27 | +PIQA (Physical Interaction: Question Answering) is a benchmark |
| 28 | +dataset that introduces the task of physical commonsense |
| 29 | +reasoning. |
| 30 | + |
| 31 | +### Extreme Summarization (XSum) Dataset |
| 32 | + |
| 33 | +Each example has three features: |
| 34 | +- `document`: Input news article. |
| 35 | +- `summary`: One-sentence summary of the article. |
| 36 | +- `id`: BBC ID of the article. |
| 37 | + |
| 38 | +### The CNN / DailyMail Dataset |
| 39 | + |
| 40 | +It is an English-language dataset containing just over |
| 41 | +300k unique news articles as written by journalists at CNN and the Daily Mail. |
| 42 | +The current version supports both extractive and abstractive summarization. |
| 43 | + |
| 44 | +### The HumanEval Dataset |
| 45 | + |
| 46 | +The HumanEval dataset, released by OpenAI, includes 164 programming problems, |
| 47 | +each with a function signature, docstring, body, and several unit tests. |
| 48 | +The problems were handwritten so that they would not be included in the |
| 49 | +training sets of code generation models. |
| 50 | + |
| 51 | +### MBPP |
| 52 | + |
| 53 | +The benchmark consists of around 1,000 crowd-sourced Python programming |
| 54 | +problems, designed to be solvable by entry level programmers, covering |
| 55 | +programming fundamentals, standard library functionality, and so on. |
| 56 | +Each problem consists of a task description, a code solution, and 3 automated |
| 57 | +test cases. |
| 58 | + |
| 59 | +## Metrics |
| 60 | + |
| 61 | +### Pass@1 |
| 62 | + |
| 63 | +The percentage of problems for which the model’s first generated solution |
| 64 | +passes all tests. |
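
The overview table lists Pass@k, of which Pass@1 is the k = 1 case. As a minimal sketch, assuming `n` samples are generated per problem and `c` of them pass all tests, the unbiased estimator commonly used with HumanEval can be computed as below. The function names and the `(n, c)` input format are illustrative assumptions, not part of any specific evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: number of samples generated for the problem
    c: number of those samples that pass all unit tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per problem (`n = 1`, `k = 1`), this reduces to the fraction of problems whose first generated solution passes all tests.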
| 65 | + |
| 66 | +### BERTScore |
| 67 | + |
| 68 | +It measures how similar two pieces of text are in meaning, not just in word |
| 69 | +overlap. It uses BERT embeddings (or similar transformer embeddings) to |
| 70 | +compare the semantic content of the generated text and the reference text. |
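
The comparison can be illustrated with a small sketch of the greedy matching step: each token embedding on one side is matched to its most similar embedding on the other side by cosine similarity, and the averages give precision, recall, and F1. The toy vectors passed in are assumed to come from some embedding model; the sketch does not load BERT, and `greedy_match_f1` is a hypothetical helper name rather than a library function.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_f1(cand_vecs: list[list[float]], ref_vecs: list[list[float]]) -> float:
    """F1 from greedy token matching over precomputed token embeddings."""
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In practice the embeddings would be contextual vectors from a transformer model, and a dedicated package would normally be used instead of hand-rolled code.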
| 71 | + |
| 72 | +### ROUGE, or Recall-Oriented Understudy for Gisting Evaluation |
| 73 | + |
| 74 | +It is a set of metrics and a software package used for evaluating automatic |
| 75 | +summarization and machine translation software in natural language processing. |
| 76 | +The metrics compare an automatically produced summary or translation against |
| 77 | +one or more human-produced reference summaries or translations. |
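
The overview table lists ROUGE-L, the variant based on the longest common subsequence (LCS) between the generated and reference texts. The following is a minimal sketch, assuming lowercased whitespace tokenization and an F-measure that weights precision and recall equally; established ROUGE packages may tokenize, stem, and weight differently, so scores will not necessarily match theirs.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between a candidate and a single reference text."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_l("the cat sat on the mat", "the cat is on the mat")` has an LCS of five tokens out of six on each side, giving precision and recall of 5/6.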