yescieval/injector/domains/ecology.py
24 lines changed: 24 additions & 0 deletions
@@ -487,6 +487,30 @@
                 }
             ]
         }
+    },
+    "Gap": {
+        "GapIdentification": {
+            "ResponseA": [
+                {
+                    "rating": "1",
+                    "rationale": "The response only describes known ecological patterns and findings. It does not point out any missing data, unanswered questions, or limitations. It also does not mention underrepresented species, regions, or conflicting results."
+                },
+                {
+                    "rating": "4",
+                    "rationale": "The response clearly identifies important gaps, such as missing long-term data for soil invertebrates in semi-arid regions, limited studies in tropical and boreal systems, and conflicting results on predator reintroduction. It explains why these gaps matter, though one point slightly mixes data gaps with methodological issues."
+                }
+            ],
+            "ResponseB": [
+                {
+                    "rating": "1",
+                    "rationale": "The response gives general observations about deforestation and habitat fragmentation but does not identify any missing information, open questions, or conflicting evidence. It assumes the existing knowledge is complete."
+                },
+                {
+                    "rating": "4",
+                    "rationale": "The response identifies clear gaps, such as lack of controlled studies on edge effects in freshwater areas, missing long-term data for migratory invertebrates, and differences between remote sensing and field data. It explains how these gaps affect conclusions, though it could be more specific about affected species or regions."
yescieval/injector/domains/nlp.py
24 lines changed: 24 additions & 0 deletions
@@ -493,6 +493,30 @@
                 }
             ]
         }
+    },
+    "Gap": {
+        "GapIdentification": {
+            "ResponseA": [
+                {
+                    "rating": "1",
+                    "rationale": "The response is purely descriptive, summarizing reported benchmark scores and model architecture details across several tasks such as named entity recognition and machine translation. It does not identify any missing evaluations, underexplored domains, or unresolved questions relevant to the research question. No mention is made of dataset biases, lack of ablation studies, or conflicting results across benchmarks."
+                },
+                {
+                    "rating": "4",
+                    "rationale": "The response clearly identifies several meaningful gaps in the evidence base: it notes that most evaluations of large language models are conducted on high-resource languages leaving low-resource and morphologically complex languages largely underrepresented, that existing benchmarks for dialogue systems lack coverage of multi-turn reasoning under ambiguity, and that conflicting results across GLUE and SuperGLUE variants suggest evaluation inconsistencies that remain unresolved. The response explains why these gaps limit claims about generalizability, though it slightly overgeneralizes when attributing all performance variance to dataset bias without considering architectural differences."
+                }
+            ],
+            "ResponseB": [
+                {
+                    "rating": "1",
+                    "rationale": "The response describes existing approaches to coreference resolution and summarization, reporting model performance figures and dataset characteristics without identifying any limitations, missing analyses, or unresolved aspects of the research question. Findings are presented as comprehensive and conclusive, with no acknowledgment of underexplored tasks, missing robustness evaluations, or gaps in the comparative analysis across model families."
+                },
+                {
+                    "rating": "4",
+                    "rationale": "The response identifies concrete gaps in the evidence base, such as the absence of out-of-domain generalization evaluations for transformer-based models fine-tuned on narrow corpora, the lack of ablation studies isolating the contribution of pretraining data size versus architectural choices, and the underrepresentation of spoken or informal language registers in standard NLP benchmarks. It explains that these gaps undermine the reliability of conclusions drawn from current leaderboard comparisons, though it could have been more specific about which model families or task categories are most critically affected."
+Scientific question answering and synthesis often require more than listing findings: high-quality scientific writing identifies what remains unknown, insufficiently addressed, or unresolved in existing research. This is commonly expressed through gap identification, where the text specifies limitations, missing knowledge, unresolved inconsistencies, or missing connections in prior work and explains why they matter for the research question.
+
+Gap identification can refer to (a) field-level gaps across the literature (e.g., missing populations/settings, inconsistent measures, lack of comparative studies, limited external validity), and/or (b) recurring study-level limitations that materially constrain conclusions when framed as evidence-base limitations. High-quality gap statements are specific and scoped (what is missing, where, and why), rather than generic (“more research is needed”). If gap emphasis is not central to the user's question, it should not be forced; however, when included, it should remain relevant and well-defined.
+
+The responses may be short paragraphs or long-form reports. Gap identification should be evaluated independently of presentation style or length.
+
+This rubric focuses exclusively on the presence and quality of gap identification within the provided text of two responses that are compared, emphasizing explicit and relevant statements of limitations, unanswered questions, or missing connections in prior work rather than summaries of what is already known. Other aspects of scientific quality (such as factual accuracy, evidential grounding, completeness, or innovation) are intentionally outside its scope and are assessed by separate evaluation criteria.
+</Context>
+
+<Role>
+You are tasked as a scientific writing quality evaluator performing a pairwise comparison between two texts.
+</Role>
+
+<Task-Description>
+A user will provide:
+1) a research question, and
+2) two written responses (Response A and Response B) intended to address that question.
+
+Your task is to:
+- First, independently evaluate each response using the evaluation characteristic below.
+- Then perform a pairwise comparison of the two responses using the evaluation characteristic below.
+- Then grade each response with a comparative rating from 1 (very bad) to 5 (very good) compared to the other response and a subsequent rationale for each comparative response rating.
+- Note that it is possible for both responses to receive the same rating, if they are equally comparably clear with systematic gap identification and provide complementary or identical insights.
+
+Your judgment must be based solely on the provided question and comparing the two responses w.r.t. addressing the question and w.r.t. each other.
+</Task-Description>
+
+<Evaluation-Characteristic>
+GapIdentification: Does the response identify gaps relevant to the research question by specifying limitations, missing knowledge, unresolved issues, or missing connections in prior work (preferably at the evidence-base/field level), rather than only describing existing findings?
+</Evaluation-Characteristic>
+
+<Domain-Vocabulary-Examples>
+Below are terms and phrases that often signal gap identification. They are examples only: their presence is not required, and their presence alone is not sufficient for a high score.
+{GAP_IDENTIFICATION_VOCAB}
+</Domain-Vocabulary-Examples>
+
+<Rating-Scale>
+For the characteristic above, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below.
+
+GapIdentification
+Rating 1. Very bad: The response is purely descriptive, summarizing existing findings or observations with no identification of missing, unknown, inconsistent, or unresolved aspects relevant to the research question.
+Rating 2. Bad: The response refers to gaps only in a vague or generic manner (e.g., “more research is needed”) without clearly specifying what is missing or unresolved in the context of the research question.
+Rating 3. Moderate: The response identifies one or more potential gaps with partial clarity, but the nature, scope, location (where in the literature), or relevance of the gaps is incomplete, unclear, or inconsistently articulated.
+Rating 4. Good: The response clearly identifies specific gaps or limitations in the evidence base that are relevant to the research question (e.g., missing comparisons, populations, settings, measures, time horizons, or conflicting results) and provides some explanation of why they matter; minor ambiguity or imprecision may remain.
+Rating 5. Very good: The response explicitly and clearly identifies well-defined gaps or unanswered questions, specifying what is missing, where it occurs in the evidence base (or across studies), and why it matters for the research question, without relying on vague statements. Gap statements are appropriately scoped (avoiding absolute claims like “no research exists” unless clearly justified within the response).
+</Rating-Scale>
+
+<Response-Format>
+Return your evaluation strictly in JSON format:
+
+{
+  "GapIdentification": {
+    "ResponseA": {
+      "rating": "",
+      "rationale": ""
+    },
+    "ResponseB": {
+      "rating": "",
+      "rationale": ""
+    }
+  }
+}
+
+where:
+- "rating" is a number from 1 to 5.
+- "rationale" is the comparative evaluation rating justification.
+
+</Response-Format>
+
+<Example-Responses>
+{EXAMPLE_RESPONSES}
+</Example-Responses>
+
+
+<Note>
+Your evaluation must be based solely on the provided research question and responses. Do not reward keyword mentions alone. Reward distinct, clearly described gaps in existing knowledge that are relevant to the research question and appropriately scoped. This rubric does not assess factual correctness, evidential grounding, completeness, or innovation.
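The rubric prompt in this diff carries two placeholders, {GAP_IDENTIFICATION_VOCAB} and {EXAMPLE_RESPONSES}, next to a literal JSON skeleton in the Response-Format section. The following is a minimal sketch, not the repository's code, of how such a template could be filled and how a judge reply could be checked against the requested format. The braces of the skeleton appear unescaped in the diff, so `str.format` would try to read them as replacement fields; the sketch therefore uses plain `str.replace`. The names `fill_prompt` and `parse_judgement`, the shortened `TEMPLATE`, and the vocabulary terms in the usage line are all hypothetical.

```python
import json

# Shortened, hypothetical stand-in for the rubric prompt shown in the diff.
TEMPLATE = """<Domain-Vocabulary-Examples>
{GAP_IDENTIFICATION_VOCAB}
</Domain-Vocabulary-Examples>
<Response-Format>
{
  "GapIdentification": {
    "ResponseA": {"rating": "", "rationale": ""},
    "ResponseB": {"rating": "", "rationale": ""}
  }
}
</Response-Format>
<Example-Responses>
{EXAMPLE_RESPONSES}
</Example-Responses>"""


def fill_prompt(template: str, vocab: list[str], examples: dict) -> str:
    """Substitute both placeholders by plain replacement, leaving other braces alone."""
    return (
        template
        .replace("{GAP_IDENTIFICATION_VOCAB}", ", ".join(vocab))
        .replace("{EXAMPLE_RESPONSES}", json.dumps(examples, indent=2))
    )


def parse_judgement(reply: str) -> dict[str, int]:
    """Parse a judge reply and return the two ratings as integers."""
    data = json.loads(reply)["GapIdentification"]
    ratings = {}
    for side in ("ResponseA", "ResponseB"):
        rating = int(data[side]["rating"])  # accepts "4" as well as 4
        if not 1 <= rating <= 5:
            raise ValueError(f"{side}: rating {rating} is outside 1..5")
        if not data[side]["rationale"].strip():
            raise ValueError(f"{side}: rationale is empty")
        ratings[side] = rating
    return ratings


# Usage with made-up vocabulary terms (assumptions, not taken from the repository).
prompt = fill_prompt(TEMPLATE, ["understudied", "remains unclear"], {"GapIdentification": {}})
```

Replacing by exact placeholder name keeps the JSON skeleton intact, which matters because the judge is asked to echo that structure back. Converting the rating with `int` tolerates both the string ratings used in the example entries and the numeric rating the format notes describe.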
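The hunks for ecology.py and nlp.py add the same shape of data: a "Gap" category holding a "GapIdentification" criterion, whose "ResponseA" and "ResponseB" keys each list example objects with a string "rating" and a free-text "rationale". A minimal sketch of a sanity check for that shape is given here. It assumes only the nesting visible in the diff; the function name `check_gap_examples` and the `sample` dict are hypothetical and do not come from the repository.

```python
VALID_RATINGS = {"1", "2", "3", "4", "5"}


def check_gap_examples(examples: dict) -> list[str]:
    """Return a list of problems found in one domain's example dict (hypothetical helper)."""
    criterion = examples.get("Gap", {}).get("GapIdentification")
    if criterion is None:
        return ["missing Gap -> GapIdentification"]
    problems = []
    for side in ("ResponseA", "ResponseB"):
        entries = criterion.get(side)
        if not isinstance(entries, list) or not entries:
            problems.append(f"{side}: expected a non-empty list")
            continue
        for i, entry in enumerate(entries):
            # Ratings in the diff are strings such as "1" and "4".
            if entry.get("rating") not in VALID_RATINGS:
                problems.append(f"{side}[{i}]: rating must be a string from '1' to '5'")
            if not str(entry.get("rationale", "")).strip():
                problems.append(f"{side}[{i}]: rationale is empty")
    return problems


# Hypothetical sample mirroring the nesting in the diff, with placeholder rationales.
sample = {
    "Gap": {
        "GapIdentification": {
            "ResponseA": [{"rating": "1", "rationale": "placeholder"}],
            "ResponseB": [{"rating": "4", "rationale": "placeholder"}],
        }
    }
}
problems = check_gap_examples(sample)
```

A check of this kind could run over every domain file so that a new criterion added to one domain but not another, or a rating typed as a number instead of a string, is caught before the examples reach the prompt.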