[Dataset] Add ZebraLogic benchmark by amanyara · Pull Request #2464 · open-compass/opencompass

amanyara · 2026-05-31T15:38:26Z

Summary

Add ZebraLogic logical reasoning benchmark (WildEval/ZebraLogic) to OpenCompass
Paper: ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
ZebraLogic evaluates LLMs on constraint satisfaction problems (logic grid puzzles) and reveals the "curse of complexity" — accuracy degrades sharply as puzzle size grows, even for larger models with more inference-time compute

Two evaluation modes

| Mode | Samples | Evaluation |
| --- | --- | --- |
| `mc_mode` | 3,259 | Multiple-choice: extract the answer letter, exact match |
| `grid_mode` | 1,000 | Full grid completion: cell-level accuracy via markdown table parsing |
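
To make the `mc_mode` scoring concrete, here is a minimal sketch of letter extraction followed by exact match. It is not the code in this PR: `extract_choice` and `score_mc` are hypothetical names, and the regex is an assumption about what free-form model output typically looks like.

```python
import re
from typing import List, Optional

# Hypothetical pattern: "answer is B", "Answer: (C)", and similar phrasings.
_ANSWER_RE = re.compile(
    r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-Z])\)?(?![A-Za-z])"
)
# Fallback: the whole response is just a letter, e.g. "A" or "(A)."
_BARE_RE = re.compile(r"\(?([A-Z])\)?\.?")


def extract_choice(text: str) -> Optional[str]:
    """Return the answer letter found in free-form output, or None."""
    match = _ANSWER_RE.search(text)
    if match:
        return match.group(1)
    match = _BARE_RE.fullmatch(text.strip())
    return match.group(1) if match else None


def score_mc(predictions: List[str], references: List[str]) -> float:
    """Exact-match accuracy (0-100) of extracted letters against gold letters."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        return 0.0
    correct = sum(
        extract_choice(pred) == ref for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# One correct and one incorrect prediction.
print(score_mc(["The answer is B.", "Answer: (C)"], ["B", "A"]))  # 50.0
```

A response with no recognizable letter yields `None`, which never equals a gold letter and therefore counts as incorrect.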

Files changed

opencompass/datasets/zebralogic.py — ZebraLogicDataset (HuggingFace loader, mc_mode choice formatting), ZebraLogicMCEvaluator (answer extraction + exact match), ZebraLogicGridEvaluator (markdown table parsing + cell-level accuracy)
opencompass/datasets/__init__.py — register new classes
opencompass/configs/datasets/ZebraLogic/zebralogic_gen.py — eval configs for both modes
dataset-index.yml — add zebralogic entry under Reasoning category
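
The `grid_mode` evaluation described above (markdown table parsing plus cell-level accuracy) could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's `ZebraLogicGridEvaluator`: `parse_markdown_table` and `cell_accuracy` are hypothetical names, the reference grid is assumed to be a list of rows of strings without a header, and comparison is assumed to be case-insensitive.

```python
import re
from typing import List

Grid = List[List[str]]


def parse_markdown_table(text: str) -> Grid:
    """Collect the rows of pipe-delimited table lines found in model output."""
    rows: Grid = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # ignore prose around the table
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if all(re.fullmatch(r":?-+:?", cell) for cell in cells):
            continue  # skip the header separator row, e.g. |---|---|
        rows.append(cells)
    return rows


def cell_accuracy(predicted: Grid, reference: Grid) -> float:
    """Percentage of reference cells reproduced at the same row and column."""
    total = sum(len(row) for row in reference)
    if total == 0:
        return 0.0
    correct = 0
    for i, ref_row in enumerate(reference):
        pred_row = predicted[i] if i < len(predicted) else []
        for j, ref_cell in enumerate(ref_row):
            # Missing cells count as wrong.
            if j < len(pred_row) and pred_row[j].lower() == ref_cell.lower():
                correct += 1
    return 100.0 * correct / total


output = """Here is my solution:

| House | Name | Pet |
|-------|------|-----|
| 1 | Alice | cat |
| 2 | Bob | dog |
"""
body = parse_markdown_table(output)[1:]  # drop the header row
reference = [["1", "Alice", "cat"], ["2", "Bob", "dog"]]
print(cell_accuracy(body, reference))  # 100.0
```

Dividing by the number of reference cells means a truncated or missing table is penalized rather than skipped.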

Test plan

All helper functions (_extract_mc_answer, _parse_grid_reference, _extract_grid_from_text) pass unit tests
ZebraLogicMCEvaluator correctly scores predictions (50.0 on mixed correct/incorrect)
ZebraLogicGridEvaluator reports 100% accuracy on a perfect prediction

🤖 Generated with Claude Code

Integrate the ZebraLogic logical reasoning benchmark (WildEval/ZebraLogic) into OpenCompass. ZebraLogic evaluates LLMs on constraint satisfaction problems (logic grid puzzles) and reveals the "curse of complexity": accuracy degrades sharply as puzzle size grows.

Two evaluation modes are supported:
- mc_mode (3,259 samples): multiple-choice, evaluated with exact match on the answer letter extracted from free-form model output.
- grid_mode (1,000 samples): full grid completion, evaluated with cell-level accuracy by parsing markdown tables from model output.

Changes:
- opencompass/datasets/zebralogic.py – ZebraLogicDataset loader, ZebraLogicMCEvaluator, ZebraLogicGridEvaluator and helper functions
- opencompass/datasets/__init__.py – register new classes
- opencompass/configs/datasets/ZebraLogic/zebralogic_gen.py – eval configs
- dataset-index.yml – add zebralogic entry

Paper: https://arxiv.org/abs/2502.01100
HuggingFace: WildEval/ZebraLogic

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

mm-assistant Bot assigned tonysy May 31, 2026

Fix ZebraLogic dataset config

dc9e329

Myhs-phz approved these changes Jun 26, 2026


Myhs-phz merged commit 5589a37 into open-compass:main Jun 26, 2026
2 checks passed

amanyara deleted the add-zebralogic branch June 30, 2026 11:20

Myhs-phz merged 2 commits into open-compass:main from amanyara:add-zebralogic
