Evaluates how well model responses follow instruction constraints. Returns a partial credit score (0.0 to 1.0).
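To make "partial credit" concrete, here is a minimal sketch. It assumes the score is the fraction of constraints a response satisfies; the README does not state the exact formula. The checker helpers (`contains_keywords`, `max_words`) and `partial_credit` are hypothetical names used only for illustration and are not part of this package.

```python
from typing import Callable, List

Check = Callable[[str], bool]


def contains_keywords(keywords: List[str]) -> Check:
    """Hypothetical checker: passes if every keyword appears (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return lambda response: all(k in response.lower() for k in lowered)


def max_words(limit: int) -> Check:
    """Hypothetical checker: passes if the response has at most `limit` words."""
    return lambda response: len(response.split()) <= limit


def partial_credit(response: str, checks: List[Check]) -> float:
    """Assumed scoring rule: fraction of checks that pass, in [0.0, 1.0]."""
    if not checks:
        # Assumption: with no constraints to check, report 0.0.
        return 0.0
    passed = sum(1 for check in checks if check(response))
    return passed / len(checks)


response = "Hello world! This is my response."
checks = [contains_keywords(["hello", "world"]), max_words(3)]
score = partial_credit(response, checks)  # one of two checks passes
```

Under this assumed rule, a response that satisfies every constraint scores 1.0, and one that satisfies half of them scores 0.5.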
import sys
sys.path.insert(0, '/path/to/eval_protocol/rewards/ifeval')
from reward import ifeval_partial_credit_reward
response = "Hello world! This is my response."
ground_truth = {
    "instruction_id": ["keywords:existence"],
    "kwargs": [{"keywords": ["hello", "world"]}]
}
score = ifeval_partial_credit_reward(response, ground_truth)
# Score: 1.0 (all constraints satisfied)

pip install nltk langdetect emoji syllapy immutabledict absl-py

NLTK resources are downloaded automatically on first use.
- Automatically strips `<think>...</think>` tags before evaluation
- Ground truth can be a dict, list, or JSON string
- 112 total constraints (54 IFEval/IFTrain + 58 IFBench OOD)
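
The preprocessing described in the notes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `reward.py`: the helper names `strip_think` and `normalize_ground_truth` are hypothetical, and treating a list as a wrapper around a single ground-truth dict is a guess about the accepted list form.

```python
import json
import re
from typing import Any, Dict, List, Union

# Non-greedy match so several <think> blocks are removed independently;
# DOTALL lets a block span multiple lines.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(response: str) -> str:
    """Remove <think>...</think> blocks and surrounding whitespace."""
    return THINK_BLOCK.sub("", response).strip()


def normalize_ground_truth(
    ground_truth: Union[str, List[Any], Dict[str, Any]]
) -> Dict[str, Any]:
    """Coerce a dict, list, or JSON string into a single dict."""
    if isinstance(ground_truth, str):
        ground_truth = json.loads(ground_truth)
    if isinstance(ground_truth, list):
        # Assumption: a list wraps one ground-truth dict as its first item.
        if not ground_truth:
            raise ValueError("ground truth list is empty")
        ground_truth = ground_truth[0]
    if not isinstance(ground_truth, dict):
        raise TypeError(f"unsupported ground truth type: {type(ground_truth)!r}")
    return ground_truth


raw = "<think>plan\nthe answer</think>Hello world!"
cleaned = strip_think(raw)

expected = {
    "instruction_id": ["keywords:existence"],
    "kwargs": [{"keywords": ["hello", "world"]}],
}
from_json = normalize_ground_truth(json.dumps(expected))
from_list = normalize_ground_truth([expected])
```

Normalizing both inputs up front means the constraint checks only ever see the final answer text and one dict shape.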
Copied from open-instruct/open_instruct/IFEvalG/:
- `ifeval_instructions.py` (from `instructions.py`)
- `ifeval_registry.py` (from `instructions_registry.py`)
- `ifeval_util.py` (from `instructions_util.py`)
Copied from IFBench/ (commit 8e6a9be, 2025-01):
- `ifbench_instructions.py` (from `instructions.py`)
- `ifbench_registry.py` (from `instructions_registry.py`)
- `ifbench_util.py` (from `instructions_util.py`)
New code:
- `reward.py`: main reward function
- `__init__.py`: package exports