Commit 323f0c4

Merge pull request #3 from MIT-Emerging-Talent/testing_metrics
adding the model evaluation metrics document
2 parents 35d080d + b751233

1 file changed: 77 additions & 0 deletions
# Model Testing Metrics

|Skill Type (Task)|What It Tests|Example Dataset|Metric to Measure Accuracy|
|-----------------|-------------|---------------|--------------------------|
|Reasoning / Logic|Mathematical reasoning|GSM8K|Accuracy (correct answers / total)|
|Commonsense QA|Everyday reasoning and knowledge|PIQA, BoolQ|Accuracy|
|Summarization|Condensing information|CNN/DailyMail, XSum|ROUGE-L, BERTScore|
|Code Generation|Logical structure|HumanEval-lite, MBPP|Pass@k|

## Datasets

### GSM8K (Grade School Math 8K)

GSM8K is a dataset of 8.5K high-quality, linguistically diverse
grade school math word problems. It was created to support the task
of question answering on basic mathematical problems that require
multi-step reasoning.
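
Per the table above, GSM8K is scored as plain accuracy (correct answers /
total). Reference solutions in GSM8K end with the final numeric answer after
a `####` marker, so a minimal scorer (the helper names here are illustrative,
not part of the dataset release) can compare the extracted answers:

```python
def extract_answer(text: str) -> str:
    """GSM8K solutions place the final answer after a '####' marker."""
    return text.split("####")[-1].strip().replace(",", "")

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions whose final answer matches the reference."""
    correct = sum(
        extract_answer(p) == extract_answer(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)
```

For example, `accuracy(["... so the total is #### 42"], ["... #### 42"])`
returns `1.0` because both final answers normalize to `42`.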

### BoolQ

BoolQ is a question answering dataset of 15,942 yes/no questions.
The questions are naturally occurring: they were generated in
unprompted and unconstrained settings.

### PIQA

PIQA (Physical Interaction: Question Answering) is a benchmark
dataset that introduces the task of physical commonsense reasoning.

### Extreme Summarization (XSum) Dataset

Each example has three features:

- document: the input news article.
- summary: a one-sentence summary of the article.
- id: the BBC ID of the article.

### The CNN / DailyMail Dataset

This is an English-language dataset containing just over 300k unique
news articles written by journalists at CNN and the Daily Mail.
The current version supports both extractive and abstractive
summarization.

### The HumanEval Dataset

Released by OpenAI, HumanEval includes 164 programming problems,
each with a function signature, docstring, body, and several unit
tests. The problems were handwritten to ensure they are not included
in the training sets of code generation models.
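
Scoring HumanEval-style problems means running each generated solution
against the problem's unit tests. A minimal, unsandboxed sketch of that
check (`passes_tests` is an illustrative helper, not part of the released
harness; a real harness isolates execution and enforces timeouts):

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate code runs and its unit tests all pass."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the assertions against it
        return True
    except Exception:
        return False

candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"
# passes_tests(candidate, tests) → True
```

A buggy candidate (say, one that subtracts instead of adding) fails an
assertion inside `exec`, is caught, and returns `False`.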

### MBPP

The benchmark consists of around 1,000 crowd-sourced Python
programming problems, designed to be solvable by entry-level
programmers and covering programming fundamentals, standard library
functionality, and so on. Each problem consists of a task
description, a code solution, and 3 automated test cases.

## Metrics

### Pass@k

Pass@k is the probability that at least one of k generated solutions
passes all tests. Pass@1, the common special case, is the percentage
of problems for which the model's first generated solution passes
all tests.

### BERTScore

BERTScore measures how similar two pieces of text are in meaning,
not just in word overlap. It uses BERT embeddings (or similar
transformer embeddings) to compare the semantic content of the
generated text and the reference text.
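
The greedy-matching idea behind BERTScore can be sketched with plain cosine
similarity. The token vectors below are stand-ins for the contextual BERT
embeddings (and optional IDF weighting) that the real metric computes:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def greedy_f1(cand_emb: list[list[float]], ref_emb: list[list[float]]) -> float:
    """BERTScore-style greedy matching over token embeddings:
    recall matches each reference token to its closest candidate token,
    precision does the reverse, and F1 combines the two."""
    recall = sum(max(cosine(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    precision = sum(max(cosine(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    return 2 * precision * recall / (precision + recall)
```

Identical embedding sequences score 1.0; extra candidate tokens that match
no reference token drag down precision, and therefore F1.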

### ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE is a set of metrics and a software package for evaluating
automatic summarization and machine translation in natural language
processing. The metrics compare an automatically produced summary or
translation against one or more human-produced references.
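
ROUGE-L, the variant listed in the table above, scores the longest common
subsequence (LCS) shared by candidate and reference. A minimal sketch over
whitespace tokens (real implementations add stemming and other
preprocessing):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 from LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Unlike n-gram overlap, the LCS rewards tokens appearing in the same order
even when they are not adjacent, which suits abstractive summaries.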
