Commit 054437f

add dimension - innovation pairwise rubric #46 from sciknoworg/dev
2 parents 055f1c3 + 2dcaaad commit 054437f

6 files changed

Lines changed: 158 additions & 6 deletions

README.md

Lines changed: 5 additions & 3 deletions
@@ -105,13 +105,13 @@ from yescieval.rubric.pointwise.structural import Coherence, Relevancy, Integrat
 from yescieval.rubric.pointwise.stylistic import Cohesion, Readability, Conciseness
 # Depth rubrics
 from yescieval.rubric.pointwise.depth import MechanisticUnderstanding, CausalReasoning, TemporalPrecision
-# Gap rubrics
+# Gap rubric
 from yescieval.rubric.pointwise.gap import GapIdentification
 # Breadth rubrics
 from yescieval.rubric.pointwise.breadth import ContextCoverage, MethodCoverage, DimensionCoverage, ScaleCoverage, ScopeCoverage
 # Rigor rubrics
 from yescieval.rubric.pointwise.rigor import EpistemicCalibration, QuantitativeEvidenceAndUncertainty, ExplicitUncertainty
-# Innovation rubrics
+# Innovation rubric
 from yescieval.rubric.pointwise.innovation import StateOfTheArtAndNovelty
 ```

@@ -124,8 +124,10 @@ from yescieval.rubric.pairwise.depth import MechanisticUnderstanding, CausalReas
 from yescieval.rubric.pairwise.breadth import ContextCoverage, MethodCoverage, DimensionCoverage, ScaleCoverage, ScopeCoverage
 # Rigor rubrics
 from yescieval.rubric.pairwise.rigor import EpistemicCalibration, QuantitativeEvidenceAndUncertainty, ExplicitUncertainty
-# Gap rubrics
+# Gap rubric
 from yescieval.rubric.pairwise.gap import GapIdentification
+# Innovation rubric
+from yescieval.rubric.pairwise.innovation import StateOfTheArtAndNovelty
 ```

 A complete list of rubrics is available on the YESciEval [📚 Rubrics](https://yescieval.readthedocs.io/rubrics.html) page.

docs/source/rubrics.rst

Lines changed: 16 additions & 2 deletions
@@ -45,8 +45,10 @@ Each rubric can be used in two ways:
 from yescieval.rubric.pairwise.breadth import ContextCoverage, MethodCoverage, DimensionCoverage, ScaleCoverage, ScopeCoverage
 # Pairwise rigor rubrics
 from yescieval.rubric.pairwise.rigor import EpistemicCalibration, QuantitativeEvidenceAndUncertainty, ExplicitUncertainty
-# Pairwise Gap rubric
+# Pairwise gap rubric
 from yescieval.rubric.pairwise.gap import GapIdentification
+# Pairwise innovation rubric
+from yescieval.rubric.pairwise.innovation import StateOfTheArtAndNovelty


 Pairwise Evaluation
@@ -426,7 +428,7 @@ Evaluates the novelty of the synthesis.
 - Is the answer identifying specific state-of-the-art and/or novel contributions
 relevant to the research question, using terms like "novel," "state-of-the-art"?

-.. tab:: Basic Usage
+.. tab:: Pointwise Usage

 .. code-block:: python

@@ -438,6 +440,18 @@ Evaluates the novelty of the synthesis.
 print(instruction)
 print(rubric.name)

+.. tab:: Pairwise Usage
+
+.. code-block:: python
+
+from yescieval.rubric.pairwise.innovation import StateOfTheArtAndNovelty
+
+rubric = StateOfTheArtAndNovelty(papers=papers, question=question, answer_a=answer_a, answer_b=answer_b)
+instruction = rubric.instruct()
+
+print(instruction)
+print(rubric.name)
+
 .. tab:: Usage with Example and Vocabulary Injectors

 .. code-block:: python

yescieval/injector/domains/ecology.py

Lines changed: 25 additions & 1 deletion
@@ -511,7 +511,31 @@
 }
 ]
 }
-}
+},
+"Innovation": {
+"StateOfTheArtAndNovelty": {
+"ResponseA": [
+{
+"rating": "1",
+"rationale": "The response provides a general overview of ecological research and mentions common practices like field surveys and biodiversity monitoring. It does not identify any specific state-of-the-art methods or novel contributions and uses vague terms such as 'advanced techniques' without explaining their significance."
+},
+{
+"rating": "4",
+"rationale": "The response identifies specific state-of-the-art ecological approaches, such as the use of satellite remote sensing, LiDAR for vegetation structure analysis, and eDNA/metabarcoding for biodiversity assessment. It explains how these methods improve spatial resolution, allow non-invasive monitoring, and enhance detection of cryptic species, though comparisons to traditional baselines could be more detailed."
+}
+],
+"ResponseB": [
+{
+"rating": "1",
+"rationale": "The response discusses ecological concepts like ecosystem balance and conservation strategies but does not mention any concrete state-of-the-art tools or innovative methods. It relies on generic statements and does not explain what is new or improved in current research."
+},
+{
+"rating": "4",
+"rationale": "The response highlights modern ecological innovations, including machine learning models for ecosystem prediction, high-frequency sensor networks for real-time monitoring, and integration of multi-source data (e.g., remote sensing and field data). It explains how these approaches improve predictive accuracy and temporal coverage, although it provides limited detail on specific benchmarks or comparisons."
+}
+]
+}
+}
 }
 }

yescieval/injector/domains/nlp.py

Lines changed: 24 additions & 0 deletions
@@ -517,6 +517,30 @@
 }
 ]
 }
+},
+"Innovation": {
+"StateOfTheArtAndNovelty": {
+"ResponseA": [
+{
+"rating": "1",
+"rationale": "The response provides a general overview of NLP techniques such as tokenization, word embeddings, and neural networks. It does not mention any specific state-of-the-art models or recent innovations and uses vague terms like 'modern approaches' without explaining what makes them new or how they improve performance."
+},
+{
+"rating": "4",
+"rationale": "The response identifies concrete state-of-the-art NLP contributions, including transformer-based architectures, instruction tuning, and parameter-efficient fine-tuning methods such as LoRA. It explains how these approaches reduce training cost, improve adaptability to downstream tasks, and enable efficient deployment, although it lacks detailed comparison with earlier baselines."
+}
+],
+"ResponseB": [
+{
+"rating": "1",
+"rationale": "The response discusses NLP applications like machine translation and sentiment analysis but does not identify any novel methods or recent advancements. It relies on generic descriptions and mentions 'cutting-edge models' without specifying what they are or what improvements they bring."
+},
+{
+"rating": "4",
+"rationale": "The response highlights specific innovations such as retrieval-augmented generation (RAG), reinforcement learning from human feedback (RLHF/DPO), and multimodal models that combine text and images. It explains how these methods improve factual grounding, alignment with user intent, and cross-modal understanding, though the explanation of trade-offs and benchmarks is somewhat limited."
+}
+]
+}
 }
 }
 }
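
Review note: the domain entries above share one shape, dimension name, then rubric name, then `ResponseA`/`ResponseB` lists of `rating`/`rationale` pairs. The commit does not show how the injector turns such an entry into the text that replaces the `{EXAMPLE_RESPONSES}` placeholder, so the following is only a minimal sketch under that assumption. `render_examples`, `domain_examples`, and `template` are hypothetical names, not part of yescieval, and the rationales are shortened stand-ins.

```python
import json

# Trimmed stand-in for the "Innovation" entry above; rationales are shortened
# for brevity and are not the exact strings from the commit.
domain_examples = {
    "Innovation": {
        "StateOfTheArtAndNovelty": {
            "ResponseA": [
                {"rating": "1", "rationale": "Generic overview; no specific innovations named."},
                {"rating": "4", "rationale": "Names concrete methods such as LoRA."},
            ],
            "ResponseB": [
                {"rating": "1", "rationale": "Mentions 'cutting-edge models' without specifics."},
                {"rating": "4", "rationale": "Highlights RAG, RLHF/DPO, and multimodal models."},
            ],
        }
    }
}


def render_examples(examples: dict, dimension: str, rubric: str) -> str:
    """Hypothetical helper: serialize one rubric's examples as indented JSON."""
    return json.dumps({rubric: examples[dimension][rubric]}, indent=2)


# Stand-in for the part of a prompt template that carries the placeholder.
template = "<Example-Responses>\n{EXAMPLE_RESPONSES}\n</Example-Responses>"

rendered = render_examples(domain_examples, "Innovation", "StateOfTheArtAndNovelty")
# str.replace leaves any other literal braces in a template untouched.
prompt = template.replace("{EXAMPLE_RESPONSES}", rendered)
print(prompt)
```

The sketch uses `str.replace` rather than `str.format` because a prompt that also contains literal JSON braces would not survive `str.format` unescaped. Whether the library substitutes placeholders this way is not visible in this diff.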
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+from .novelty import StateOfTheArtAndNovelty
+
+__all__ = ["StateOfTheArtAndNovelty"]
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
+from ....base import PairwiseRubric
+
+state_of_the_art_and_novelty_pairwise_prompt = """<Context>
+Scientific question answering and synthesis often require more than listing findings: high-quality scientific writing can surface what is genuinely innovative in the literature and explain how it differs from prior or established approaches. In synthesis settings (e.g., reports summarizing multiple papers), this is expressed by identifying specific novel contributions (e.g., new methods, new datasets, new capabilities, new theoretical framings, proof-of-concept results) and situating them relative to an implicit or explicit baseline (what was done before, what limitation is addressed, what capability is newly enabled).
+
+This rubric is not about using buzzwords like “breakthrough” or “state-of-the-art” in isolation. High scores require novelty to be concrete and meaningfully contextualized (new relative to what, and why it matters). Not every research question requires emphasizing novelty; some primarily ask for established consensus or background. In such cases, it can be appropriate to focus on established knowledge, though a strong response may still indicate whether the field is mature versus rapidly evolving.
+
+The responses may be short paragraphs or long-form reports. Innovation should be evaluated independently of presentation style or length.
+
+This rubric focuses exclusively on the presence and quality of innovation identification within the provided text of the two responses being compared (i.e., whether each response highlights specific novel contributions and explains their significance relative to prior work) rather than merely summarizing established knowledge or using generic novelty language. Other aspects of scientific quality (such as factual accuracy, evidential grounding, completeness, or mechanistic depth) are intentionally outside its scope and are assessed by separate evaluation criteria.
+</Context>
+
+<Role>
+You are tasked as a scientific writing quality evaluator performing a pairwise comparison between two texts.
+</Role>
+
+<Task-Description>
+A user will provide:
+1) a research question, and
+2) two written responses (Response A and Response B) intended to address that question.
+
+Your task is to:
+- First, independently evaluate each response using the evaluation characteristic below.
+- Then perform a pairwise comparison of the two responses using the evaluation characteristic below.
+- Then grade each response with a comparative rating from 1 (very bad) to 5 (very good) compared to the other response and a subsequent rationale for each comparative response rating.
+- Note that both responses may receive the same rating if they identify innovation with comparable clarity and consistency and provide complementary or identical insights.
+
+Your judgment must be based solely on the provided question and comparing the two responses w.r.t. addressing the question and w.r.t. each other.
+</Task-Description>
+
+<Evaluation-Characteristic>
+StateOfTheArtAndNovelty: Does the response identify specific state-of-the-art and/or novel contributions relevant to the research question (e.g., new methods, datasets, capabilities, theoretical framings, proof-of-concept results), and meaningfully contextualize them relative to prior or established work (i.e., new relative to what, and why it matters)? If novelty emphasis is not central to the question, does the response avoid forced novelty and (optionally) state that the evidence base is mature or that innovation is not the focus?
+</Evaluation-Characteristic>
+
+<Domain-Vocabulary-Examples>
+Below are terms and phrases that often co-occur with innovation claims. They are examples only: their presence is not required, and their presence alone is not sufficient for a high score.
+{NOVELTY_INDICATORS_VOCAB}
+</Domain-Vocabulary-Examples>
+
+<Rating-Scale>
+For the characteristic above, rate the quality from 1 (very bad) to 5 (very good). Follow the guidelines specified below.
+
+StateOfTheArtAndNovelty
+Rating 1. Very bad: The response provides only established/background knowledge or a generic summary, with no identification of specific state-of-the-art or novel contributions where such identification would be relevant; or it uses novelty buzzwords (“breakthrough”, “SOTA”) without any concrete explanation.
+Rating 2. Bad: The response occasionally signals state-of-the-art or novelty, but claims are vague, generic, or weakly connected to the research question; novelty is not contextualized (no clear “new relative to what”) and/or seems forced.
+Rating 3. Moderate: The response identifies at least one potentially state-of-the-art or novel contribution, but description, relevance, or significance is partially unclear; contextualization relative to prior work is limited or inconsistent; proof-of-concept vs established advances may be conflated.
+Rating 4. Good: The response clearly highlights multiple specific state-of-the-art and/or innovative contributions relevant to the research question and provides reasonable contextualization (what limitation is addressed or what capability is newly enabled), with minor gaps in baseline comparison or scope.
+Rating 5. Very good: The response provides a coherent, well-structured account of state-of-the-art and novelty tightly aligned with the research question: it identifies multiple specific novel contributions, clearly explains how each differs from prior/established approaches (explicit or implicit baseline), and articulates why it matters (capabilities, limitations addressed, or new directions), while appropriately scoping claims (e.g., proof-of-concept vs broadly validated). If novelty emphasis is not central to the question, it avoids forced novelty and explicitly frames the maturity/innovation relevance appropriately.
+
+</Rating-Scale>
+
+<Response-Format>
+Return your evaluation strictly in JSON format:
+
+{
+"StateOfTheArtAndNovelty": {
+"ResponseA": {
+"rating": "",
+"rationale": ""
+},
+"ResponseB": {
+"rating": "",
+"rationale": ""
+}
+}
+}
+
+where:
+- "rating" is a number from 1 to 5.
+- "rationale" is the comparative evaluation rating justification.
+
+</Response-Format>
+
+<Example-Responses>
+{EXAMPLE_RESPONSES}
+</Example-Responses>
+
+<Note>
+Your evaluation must be based solely on the provided research question and responses. Do not reward novelty buzzwords alone. Reward specific identification of what is new, clear contextualization relative to prior work (“new compared to what”), and appropriate scoping of innovation claims. This rubric does not assess factual correctness, evidential grounding, completeness, or mechanistic depth.
+</Note>
+"""
+
+class StateOfTheArtAndNovelty(PairwiseRubric):
+    name: str = "StateOfTheArtAndNovelty"
+    system_prompt_template: str = state_of_the_art_and_novelty_pairwise_prompt
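
Review note: the `<Response-Format>` block in the prompt asks the judge for a JSON object keyed by the rubric name, with a `rating` and `rationale` for each of `ResponseA` and `ResponseB`. A downstream consumer could check a reply against that shape roughly as sketched below. `parse_pairwise_ratings` is a hypothetical helper and the sample reply is made up for illustration; neither comes from yescieval or from this commit.

```python
import json

RUBRIC_NAME = "StateOfTheArtAndNovelty"


def parse_pairwise_ratings(raw: str) -> dict[str, int]:
    """Hypothetical helper: pull the A/B ratings out of a judge reply.

    Assumes the reply follows the <Response-Format> block of the prompt.
    """
    payload = json.loads(raw)[RUBRIC_NAME]
    ratings = {}
    for key in ("ResponseA", "ResponseB"):
        # The template shows the rating as a string, so coerce it to int.
        rating = int(payload[key]["rating"])
        if not 1 <= rating <= 5:
            raise ValueError(f"{key} rating {rating} is outside the 1-5 scale")
        ratings[key] = rating
    return ratings


# Made-up judge reply, used only to exercise the helper.
sample_reply = json.dumps({
    RUBRIC_NAME: {
        "ResponseA": {"rating": "4", "rationale": "Names specific methods and what they enable."},
        "ResponseB": {"rating": "2", "rationale": "Signals novelty but stays generic."},
    }
})

ratings = parse_pairwise_ratings(sample_reply)
print(ratings)
```

Validating the range here keeps an out-of-scale rating from silently entering any later aggregation. A real pipeline would also need to handle replies that are not valid JSON, which this sketch leaves to the caller.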
