Commit 770c403: adding the model evaluation metrics document
# Model Testing Metrics

|Skill Type (Task)|What It Tests|Example Dataset|Metric to Measure Accuracy|
|-----------------|-------------|---------------|--------------------------|
|Reasoning / Logic|Mathematical reasoning|GSM8K|Accuracy (correct answers / total)|
|Commonsense QA|Everyday reasoning and knowledge|PIQA, BoolQ|Accuracy|
|Summarization|Condensing information|CNN/DailyMail, XSum|ROUGE-L, BERTScore|
|Code Generation|Logical structure|HumanEval-lite, MBPP|Pass@k|

## Datasets

### GSM8K (Grade School Math 8K)

GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school
math word problems. It was created to support question answering on basic
mathematical problems that require multi-step reasoning.
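As a toy illustration of the accuracy metric in the table above (correct answers / total), the sketch below compares normalized final answers by exact match. The helper name is hypothetical; real GSM8K harnesses first parse the model output to extract the final number.

```python
# Toy sketch of exact-match accuracy for GSM8K-style answers.
# Real evaluation harnesses extract the final numeric answer from the
# model's full reasoning trace; here we assume predictions are already
# normalized answer strings.

def gsm8k_accuracy(predictions, references):
    """Fraction of items whose predicted final answer matches the reference."""
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

# Example: 2 of 3 answers match, so accuracy is 2/3.
print(round(gsm8k_accuracy(["72", "10", "5"], ["72", "10", "8"]), 3))  # 0.667
```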
### BoolQ

BoolQ is a question answering dataset of 15,942 yes/no questions. The
questions occur naturally: they were generated in unprompted and
unconstrained settings.

### PIQA

PIQA introduces the task of physical commonsense reasoning, together with a
corresponding benchmark dataset, Physical Interaction: Question Answering
(PIQA).

### Extreme Summarization (XSum) Dataset

Each example has three features:

- document: the input news article.
- summary: a one-sentence summary of the article.
- id: the BBC ID of the article.

### The CNN / DailyMail Dataset

This English-language dataset contains just over 300k unique news articles
written by journalists at CNN and the Daily Mail. The current version
supports both extractive and abstractive summarization.

### HumanEval

The HumanEval dataset released by OpenAI includes 164 programming problems,
each with a function signature, docstring, body, and several unit tests.
The problems were handwritten to ensure they were not included in the
training sets of code generation models.

### MBPP

The MBPP benchmark consists of around 1,000 crowd-sourced Python programming
problems designed to be solvable by entry-level programmers, covering
programming fundamentals, standard-library functionality, and so on. Each
problem consists of a task description, a code solution, and three automated
test cases.

## Metrics

### Pass@1

Pass@1 is the percentage of problems for which the model's first generated
solution passes all tests. More generally, Pass@k is the probability that at
least one of k generated samples passes all tests.
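Computing Pass@k naively by sampling exactly k solutions is noisy, so the unbiased estimator popularized by the HumanEval paper is usually used instead: generate n ≥ k samples, count the c that pass, and estimate the chance that a random size-k subset contains at least one passing sample. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c pass.

    pass@k = 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random subset of k samples contains no passing solution.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 3 passing, pass@1 reduces to the plain pass rate 3/10.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```

Note that for k = 1 the estimator collapses to c / n, the ordinary fraction of passing samples.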

### BERTScore

BERTScore measures how similar two pieces of text are in meaning, not just
in word overlap. It uses BERT embeddings (or similar transformer embeddings)
to compare the semantic content of the generated text against the reference
text.
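A minimal sketch of the idea behind BERTScore's precision term, with made-up 2-D token "embeddings" standing in for real BERT vectors: each candidate token is greedily matched to its most similar reference token by cosine similarity, and the best-match similarities are averaged (recall and F1 follow symmetrically). The embedding table and function names here are illustrative, not the real bert-score API.

```python
# Hypothetical 2-D embeddings; real BERTScore uses contextual BERT vectors.
TOY_EMBEDDINGS = {
    "cat": (1.0, 0.1), "feline": (0.9, 0.2),
    "sat": (0.1, 1.0), "rested": (0.2, 0.9),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def bertscore_precision(candidate, reference):
    """Average best-match cosine similarity of candidate tokens to reference tokens."""
    sims = []
    for c in candidate:
        best = max(cosine(TOY_EMBEDDINGS[c], TOY_EMBEDDINGS[r]) for r in reference)
        sims.append(best)
    return sum(sims) / len(sims)

# "feline rested" vs "cat sat": high score despite zero word overlap,
# which is exactly what pure n-gram metrics would miss.
print(round(bertscore_precision(["feline", "rested"], ["cat", "sat"]), 3))
```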

### ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE is a set of metrics and a software package for evaluating automatic
summarization and machine translation in natural language processing. The
metrics compare an automatically produced summary or translation against one
or more human-produced reference summaries or translations.
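ROUGE-L, the variant named in the table above, scores the longest common subsequence (LCS) between candidate and reference tokens. A self-contained sketch of its F-measure form (standard tools like the rouge-score package add tokenization and stemming options not shown here):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)   # fraction of candidate tokens in the LCS
    recall = lcs / len(ref)       # fraction of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)

# LCS is "the cat on the mat" (5 tokens of 6), so precision = recall = 5/6.
print(round(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"), 3))  # 0.833
```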
