Evaluates how well model responses follow instruction constraints. Returns a partial credit score (0.0 to 1.0).
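As a rough illustration of what a partial credit score means, the sketch below scores a response as the fraction of constraints it satisfies. This is an assumption about the scoring rule, not the code in reward.py; `partial_credit` and the two lambda checkers are hypothetical names introduced only for this example.

```python
from typing import Callable, Sequence


def partial_credit(response: str, checks: Sequence[Callable[[str], bool]]) -> float:
    """Return the fraction of checks the response passes (0.0 if there are none)."""
    if not checks:
        return 0.0
    passed = sum(1 for check in checks if check(response))
    return passed / len(checks)


# Two hypothetical constraints, for illustration only:
# the text mentions "hello", and the text ends with a period.
checks = [
    lambda text: "hello" in text.lower(),
    lambda text: text.rstrip().endswith("."),
]

score = partial_credit("Hello world", checks)  # one of two checks passes
```

Under this assumed rule, a response that meets one of two constraints scores 0.5, and a response that meets both scores 1.0.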
pytest eval_protocol/benchmarks/ifeval/test_ifeval.py -v

from eval_protocol.benchmarks.ifeval import ifeval_partial_credit_reward
response = "Hello world! This is my response."
ground_truth = {
"instruction_id": ["keywords:existence"],
"kwargs": [{"keywords": ["hello", "world"]}]
}
score = ifeval_partial_credit_reward(response, ground_truth)
# Score: 1.0 (all constraints satisfied)

pip install nltk langdetect emoji syllapy immutabledict absl-py

NLTK resources are downloaded automatically on first use.
- Automatically strips <think>...</think> tags before evaluation
- Ground truth can be a dict, list, or JSON string
- 112 total constraints (54 IFEval/IFTrain + 58 IFBench OOD)
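The preprocessing described in the notes above could look roughly like the following. It is a minimal sketch, not the code in reward.py: `strip_think` and `normalize_ground_truth` are hypothetical helper names, and treating a list as a wrapper around a single ground-truth dict is an assumption.

```python
import json
import re
from typing import Any

# Non-greedy match so each <think>...</think> block is removed separately;
# DOTALL lets a block span multiple lines.
_THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(response: str) -> str:
    """Remove <think>...</think> blocks and trim surrounding whitespace."""
    return _THINK_RE.sub("", response).strip()


def normalize_ground_truth(ground_truth: Any) -> dict:
    """Coerce a dict, list, or JSON string into a single dict."""
    if isinstance(ground_truth, str):
        ground_truth = json.loads(ground_truth)
    if isinstance(ground_truth, list):
        # Assumption: a list wraps one ground-truth dict; take the first item.
        if not ground_truth:
            raise ValueError("ground truth list is empty")
        ground_truth = ground_truth[0]
    if not isinstance(ground_truth, dict):
        raise TypeError(f"unsupported ground truth type: {type(ground_truth).__name__}")
    return ground_truth


cleaned = strip_think("<think>plan the answer</think>Hello world!")
truth = normalize_ground_truth('{"instruction_id": ["keywords:existence"]}')
```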
Copied from open-instruct/open_instruct/IFEvalG/:
ifeval_instructions.py, ifeval_registry.py, ifeval_util.py
Copied from IFBench/ (commit 8e6a9be, 2025-01):
ifbench_instructions.py, ifbench_registry.py, ifbench_util.py
New code:
- reward.py - scoring function
- test_ifeval.py - eval-protocol benchmark test