Commit 055f1c3

Merge pull request #44 from sciknoworg/dev
add dimension - gap pairwise rubric
2 parents d9be6e4 + 204d107 commit 055f1c3

6 files changed

Lines changed: 154 additions & 3 deletions

README.md

Lines changed: 2 additions & 0 deletions
@@ -124,6 +124,8 @@ from yescieval.rubric.pairwise.depth import MechanisticUnderstanding, CausalReas
 from yescieval.rubric.pairwise.breadth import ContextCoverage, MethodCoverage, DimensionCoverage, ScaleCoverage, ScopeCoverage
 # Rigor rubrics
 from yescieval.rubric.pairwise.rigor import EpistemicCalibration, QuantitativeEvidenceAndUncertainty, ExplicitUncertainty
+# Gap rubrics
+from yescieval.rubric.pairwise.gap import GapIdentification
 ```
 
 A complete list of rubrics is available on the YESciEval [📚 Rubrics](https://yescieval.readthedocs.io/rubrics.html) page.

docs/source/rubrics.rst

Lines changed: 16 additions & 3 deletions
@@ -45,6 +45,8 @@ Each rubric can be used in two ways:
 from yescieval.rubric.pairwise.breadth import ContextCoverage, MethodCoverage, DimensionCoverage, ScaleCoverage, ScopeCoverage
 # Pairwise rigor rubrics
 from yescieval.rubric.pairwise.rigor import EpistemicCalibration, QuantitativeEvidenceAndUncertainty, ExplicitUncertainty
+# Pairwise Gap rubric
+from yescieval.rubric.pairwise.gap import GapIdentification
 
 
 Pairwise Evaluation
@@ -471,9 +473,7 @@ Detects explicit acknowledgment of unanswered questions or understudied areas.
 terms like "research gap" or "understudied"?
 
 
-
-
-.. tab:: Basic Usage
+.. tab:: Pointwise Usage
 
     .. code-block:: python
 
@@ -485,6 +485,19 @@ Detects explicit acknowledgment of unanswered questions or understudied areas.
         print(instruction)
         print(rubric.name)
 
+.. tab:: Pairwise Usage
+
+    .. code-block:: python
+
+        from yescieval.rubric.pairwise.gap import GapIdentification
+
+        rubric = GapIdentification(papers=papers, question=question, answer_a=answer_a, answer_b=answer_b)
+        instruction = rubric.instruct()
+
+        print(instruction)
+        print(rubric.name)
+
+
 .. tab:: Usage with Example and Vocabulary Injectors
 
     .. code-block:: python

yescieval/injector/domains/ecology.py

Lines changed: 24 additions & 0 deletions
@@ -487,6 +487,30 @@
                     }
                 ]
             }
+        },
+        "Gap": {
+            "GapIdentification": {
+                "ResponseA": [
+                    {
+                        "rating": "1",
+                        "rationale": "The response only describes known ecological patterns and findings. It does not point out any missing data, unanswered questions, or limitations. It also does not mention underrepresented species, regions, or conflicting results."
+                    },
+                    {
+                        "rating": "4",
+                        "rationale": "The response clearly identifies important gaps, such as missing long-term data for soil invertebrates in semi-arid regions, limited studies in tropical and boreal systems, and conflicting results on predator reintroduction. It explains why these gaps matter, though one point slightly mixes data gaps with methodological issues."
+                    }
+                ],
+                "ResponseB": [
+                    {
+                        "rating": "1",
+                        "rationale": "The response gives general observations about deforestation and habitat fragmentation but does not identify any missing information, open questions, or conflicting evidence. It assumes the existing knowledge is complete."
+                    },
+                    {
+                        "rating": "4",
+                        "rationale": "The response identifies clear gaps, such as lack of controlled studies on edge effects in freshwater areas, missing long-term data for migratory invertebrates, and differences between remote sensing and field data. It explains how these gaps affect conclusions, though it could be more specific about affected species or regions."
+                    }
+                ]
+            }
         }
     }
 }
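The entries added in this hunk follow a nested shape: a category (`"Gap"`), a rubric name (`"GapIdentification"`), then `"ResponseA"` and `"ResponseB"`, each a list of objects with a string `"rating"` and a `"rationale"`. The following is a minimal sketch of a shape check, assuming that layout holds for every entry. The helper `check_pairwise_examples` and the abbreviated `sample` dict are hypothetical illustrations, not part of yescieval.

```python
def check_pairwise_examples(examples: dict) -> list:
    """Return a list of problems found in a category -> rubric -> response mapping."""
    problems = []
    for category, rubrics in examples.items():
        for rubric_name, responses in rubrics.items():
            for key in ("ResponseA", "ResponseB"):
                entries = responses.get(key)
                if not isinstance(entries, list) or not entries:
                    problems.append(f"{category}/{rubric_name}: missing or empty {key}")
                    continue
                for i, entry in enumerate(entries):
                    where = f"{category}/{rubric_name}/{key}[{i}]"
                    # Ratings in the diff are strings, so the check expects strings too.
                    if entry.get("rating") not in {"1", "2", "3", "4", "5"}:
                        problems.append(f"{where}: rating must be a string '1'..'5'")
                    if not str(entry.get("rationale", "")).strip():
                        problems.append(f"{where}: empty rationale")
    return problems


# Abbreviated stand-in for the ecology entries (rationales shortened for brevity).
sample = {
    "Gap": {
        "GapIdentification": {
            "ResponseA": [
                {"rating": "1", "rationale": "The response only describes known ecological patterns and findings."},
            ],
            "ResponseB": [
                {"rating": "4", "rationale": "The response identifies clear gaps."},
            ],
        }
    }
}

problems = check_pairwise_examples(sample)
```

A check like this could run in a unit test so that a missing `"ResponseB"` list or a numeric rating is caught before the examples reach a prompt.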

yescieval/injector/domains/nlp.py

Lines changed: 24 additions & 0 deletions
@@ -493,6 +493,30 @@
                     }
                 ]
             }
+        },
+        "Gap": {
+            "GapIdentification": {
+                "ResponseA": [
+                    {
+                        "rating": "1",
+                        "rationale": "The response is purely descriptive, summarizing reported benchmark scores and model architecture details across several tasks such as named entity recognition and machine translation. It does not identify any missing evaluations, underexplored domains, or unresolved questions relevant to the research question. No mention is made of dataset biases, lack of ablation studies, or conflicting results across benchmarks."
+                    },
+                    {
+                        "rating": "4",
+                        "rationale": "The response clearly identifies several meaningful gaps in the evidence base: it notes that most evaluations of large language models are conducted on high-resource languages leaving low-resource and morphologically complex languages largely underrepresented, that existing benchmarks for dialogue systems lack coverage of multi-turn reasoning under ambiguity, and that conflicting results across GLUE and SuperGLUE variants suggest evaluation inconsistencies that remain unresolved. The response explains why these gaps limit claims about generalizability, though it slightly overgeneralizes when attributing all performance variance to dataset bias without considering architectural differences."
+                    }
+                ],
+                "ResponseB": [
+                    {
+                        "rating": "1",
+                        "rationale": "The response describes existing approaches to coreference resolution and summarization, reporting model performance figures and dataset characteristics without identifying any limitations, missing analyses, or unresolved aspects of the research question. Findings are presented as comprehensive and conclusive, with no acknowledgment of underexplored tasks, missing robustness evaluations, or gaps in the comparative analysis across model families."
+                    },
+                    {
+                        "rating": "4",
+                        "rationale": "The response identifies concrete gaps in the evidence base, such as the absence of out-of-domain generalization evaluations for transformer-based models fine-tuned on narrow corpora, the lack of ablation studies isolating the contribution of pretraining data size versus architectural choices, and the underrepresentation of spoken or informal language registers in standard NLP benchmarks. It explains that these gaps undermine the reliability of conclusions drawn from current leaderboard comparisons, though it could have been more specific about which model families or task categories are most critically affected."
+                    }
+                ]
+            }
         }
     }
 }
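The pairwise prompt added in this commit contains `{GAP_IDENTIFICATION_VOCAB}` and `{EXAMPLE_RESPONSES}` placeholders next to literal JSON braces in its response-format section. How yescieval's injectors fill those placeholders is not shown in this diff, so the following is only a sketch that assumes plain string substitution. `fill_prompt` and the shortened `TEMPLATE` are hypothetical. The sketch uses `str.replace` because `str.format` raises on the unescaped JSON braces.

```python
import json

# Shortened, hypothetical stand-in for the real prompt template.
TEMPLATE = """<Domain-Vocabulary-Examples>
{GAP_IDENTIFICATION_VOCAB}
</Domain-Vocabulary-Examples>

<Response-Format>
{
    "GapIdentification": {"ResponseA": {"rating": "", "rationale": ""}}
}
</Response-Format>

<Example-Responses>
{EXAMPLE_RESPONSES}
</Example-Responses>
"""


def fill_prompt(template: str, vocab: list, examples: dict) -> str:
    """Substitute the two placeholders and leave every other brace untouched."""
    vocab_text = ", ".join(vocab)
    examples_text = json.dumps(examples, indent=2)
    return (
        template
        .replace("{GAP_IDENTIFICATION_VOCAB}", vocab_text)
        .replace("{EXAMPLE_RESPONSES}", examples_text)
    )


filled = fill_prompt(
    TEMPLATE,
    vocab=["research gap", "understudied"],
    examples={"ResponseA": [{"rating": "1", "rationale": "Purely descriptive."}]},
)
```

Targeted replacement keeps the JSON skeleton in the response-format section byte-for-byte intact, which matters because the judge model is asked to reproduce that structure.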
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+from .gap_identification import GapIdentification
+
+__all__ = ["GapIdentification"]
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
+from ....base import PairwiseRubric
+
+gap_identification_pairwise_prompt = """<Context>
+Scientific question answering and synthesis often require more than listing findings: high-quality scientific writing identifies what remains unknown, insufficiently addressed, or unresolved in existing research. This is commonly expressed through gap identification, where the text specifies limitations, missing knowledge, unresolved inconsistencies, or missing connections in prior work and explains why they matter for the research question.
+
+Gap identification can refer to (a) field-level gaps across the literature (e.g., missing populations/settings, inconsistent measures, lack of comparative studies, limited external validity), and/or (b) recurring study-level limitations that materially constrain conclusions when framed as evidence-base limitations. High-quality gap statements are specific and scoped (what is missing, where, and why), rather than generic (“more research is needed”). If gap emphasis is not central to the user's question, it should not be forced; however, when included, it should remain relevant and well-defined.
+
+The responses may be short paragraphs or long-form reports. Gap identification should be evaluated independently of presentation style or length.
+
+This rubric focuses exclusively on the presence and quality of gap identification within the provided text of two responses that are compared, emphasizing explicit and relevant statements of limitations, unanswered questions, or missing connections in prior work rather than summaries of what is already known. Other aspects of scientific quality (such as factual accuracy, evidential grounding, completeness, or innovation) are intentionally outside its scope and are assessed by separate evaluation criteria.
+</Context>
+
+<Role>
+You are tasked as a scientific writing quality evaluator performing a pairwise comparison between two texts.
+</Role>
+
+<Task-Description>
+A user will provide:
+1) a research question, and
+2) two written responses (Response A and Response B) intended to address that question.
+
+Your task is to:
+- First, independently evaluate each response using the evaluation characteristic below.
+- Then perform a pairwise comparison of the two responses using the evaluation characteristic below.
+- Then grade each response with a comparative rating from 1 (very bad) to 5 (very good) compared to the other response and a subsequent rationale for each comparative response rating.
+- Note that it is possible for both responses to receive the same rating, if they are comparably clear and systematic in their gap identification and provide complementary or identical insights.
+
+Your judgment must be based solely on the provided question and comparing the two responses w.r.t. addressing the question and w.r.t. each other.
+</Task-Description>
+
+<Evaluation-Characteristic>
+GapIdentification: Does the response identify gaps relevant to the research question by specifying limitations, missing knowledge, unresolved issues, or missing connections in prior work (preferably at the evidence-base/field level), rather than only describing existing findings?
+</Evaluation-Characteristic>
+
+<Domain-Vocabulary-Examples>
+Below are terms and phrases that often signal gap identification. They are examples only: their presence is not required, and their presence alone is not sufficient for a high score.
+{GAP_IDENTIFICATION_VOCAB}
+</Domain-Vocabulary-Examples>
+
+<Rating-Scale>
+For the characteristic above, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below.
+
+GapIdentification
+Rating 1. Very bad: The response is purely descriptive, summarizing existing findings or observations with no identification of missing, unknown, inconsistent, or unresolved aspects relevant to the research question.
+Rating 2. Bad: The response refers to gaps only in a vague or generic manner (e.g., “more research is needed”) without clearly specifying what is missing or unresolved in the context of the research question.
+Rating 3. Moderate: The response identifies one or more potential gaps with partial clarity, but the nature, scope, location (where in the literature), or relevance of the gaps is incomplete, unclear, or inconsistently articulated.
+Rating 4. Good: The response clearly identifies specific gaps or limitations in the evidence base that are relevant to the research question (e.g., missing comparisons, populations, settings, measures, time horizons, or conflicting results) and provides some explanation of why they matter; minor ambiguity or imprecision may remain.
+Rating 5. Very good: The response explicitly and clearly identifies well-defined gaps or unanswered questions, specifying what is missing, where it occurs in the evidence base (or across studies), and why it matters for the research question, without relying on vague statements. Gap statements are appropriately scoped (avoiding absolute claims like “no research exists” unless clearly justified within the response).
+</Rating-Scale>
+
+<Response-Format>
+Return your evaluation strictly in JSON format:
+
+{
+    "GapIdentification": {
+        "ResponseA": {
+            "rating": "",
+            "rationale": ""
+        },
+        "ResponseB": {
+            "rating": "",
+            "rationale": ""
+        }
+    }
+}
+
+where:
+- "rating" is a number from 1 to 5.
+- "rationale" is the comparative evaluation rating justification.
+
+</Response-Format>
+
+<Example-Responses>
+{EXAMPLE_RESPONSES}
+</Example-Responses>
+
+
+<Note>
+Your evaluation must be based solely on the provided research question and responses. Do not reward keyword mentions alone. Reward distinct, clearly described gaps in existing knowledge that are relevant to the research question and appropriately scoped. This rubric does not assess factual correctness, evidential grounding, completeness, or innovation.
+</Note>
+"""
+
+class GapIdentification(PairwiseRubric):
+    name: str = "GapIdentification"
+    system_prompt_template: str = gap_identification_pairwise_prompt
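The `<Response-Format>` section of the prompt asks the judge model for a JSON object keyed by the rubric name, with `ResponseA` and `ResponseB` each carrying a `rating` and a `rationale`. A minimal sketch of parsing and validating such a reply follows. It assumes the model returns bare JSON text and that the rating may arrive as either a string or a number. `parse_pairwise_judgment` and `raw_reply` are hypothetical and not part of yescieval.

```python
import json


def parse_pairwise_judgment(raw: str, rubric_name: str = "GapIdentification") -> dict:
    """Parse a judge reply into {"ResponseA": {...}, "ResponseB": {...}} with int ratings."""
    data = json.loads(raw)
    result = {}
    for key in ("ResponseA", "ResponseB"):
        entry = data[rubric_name][key]
        # Accept "4" as well as 4, since the template shows the field as a string.
        rating = int(entry["rating"])
        if not 1 <= rating <= 5:
            raise ValueError(f"{key}: rating {rating} is outside 1..5")
        result[key] = {"rating": rating, "rationale": str(entry["rationale"])}
    return result


# Hypothetical reply in the requested format.
raw_reply = """
{
    "GapIdentification": {
        "ResponseA": {"rating": "4", "rationale": "Names specific, scoped gaps."},
        "ResponseB": {"rating": 2, "rationale": "Only says more research is needed."}
    }
}
"""

judgment = parse_pairwise_judgment(raw_reply)
```

Normalizing the rating to an integer at this boundary lets downstream code compare the two responses without caring how the model serialized the number.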
