[Dataset] Add ZebraLogic benchmark #2464

Merged: Myhs-phz merged 2 commits into open-compass:main from amanyara:add-zebralogic on Jun 26, 2026

@amanyara

Contributor

Summary

Two evaluation modes

| Mode | Samples | Evaluation |
| --- | --- | --- |
| mc_mode | 3 259 | Multiple-choice: extract answer letter, exact match |
| grid_mode | 1 000 | Full grid completion: cell-level accuracy via markdown table parsing |
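The mc_mode scoring described above (pull an answer letter out of free-form output, then compare it exactly with the reference) can be sketched as follows. This is a minimal illustration, not the PR's `_extract_mc_answer` or `ZebraLogicMCEvaluator`: the names `extract_choice_letter` and `exact_match_score`, the regular expressions, and the assumed option range A to F are all hypothetical.

```python
import re
from typing import Optional, Sequence


def extract_choice_letter(text: str, choices: str = "ABCDEF") -> Optional[str]:
    """Return the last answer letter found in free-form text, or None.

    Hypothetical helper: the option range A-F is an assumption, not
    something taken from the benchmark.
    """
    letters = re.escape(choices)
    # Prefer explicit statements such as "the answer is (B)" or "Answer: C".
    explicit = re.findall(
        rf"(?i:answer\s*(?:is|:))\s*\(?([{letters}])\)?(?![A-Za-z])", text
    )
    if explicit:
        return explicit[-1]
    # Fall back to the last standalone option letter, e.g. "B" or "(D)".
    standalone = re.findall(
        rf"(?<![A-Za-z])\(?([{letters}])\)?(?![A-Za-z])", text
    )
    return standalone[-1] if standalone else None


def exact_match_score(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions whose extracted letter equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        extract_choice_letter(pred) == ref
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# One correct and one incorrect prediction.
MIXED_SCORE = exact_match_score(["The answer is (B).", "Answer: C"], ["B", "A"])
```

Taking the last match rather than the first is a design choice: models often discuss several options before committing to one, so the final letter is usually the intended answer.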

Files changed

  • opencompass/datasets/zebralogic.py – ZebraLogicDataset (HuggingFace loader, mc_mode choice formatting), ZebraLogicMCEvaluator (answer extraction + exact match), ZebraLogicGridEvaluator (markdown table parsing + cell-level accuracy)
  • opencompass/datasets/__init__.py – register new classes
  • opencompass/configs/datasets/ZebraLogic/zebralogic_gen.py – eval configs for both modes
  • dataset-index.yml – add zebralogic entry under Reasoning category

Test plan

  • All helper functions (_extract_mc_answer, _parse_grid_reference, _extract_grid_from_text) pass unit tests
  • ZebraLogicMCEvaluator correctly scores predictions (50.0 on mixed correct/incorrect)
  • ZebraLogicGridEvaluator reports 100% accuracy on a perfect prediction
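For the grid_mode check in the test plan, the idea of parsing a markdown table and scoring it cell by cell can be sketched as below. This is an assumption-laden illustration rather than the PR's `_extract_grid_from_text` or `ZebraLogicGridEvaluator`: the function names, the case-insensitive comparison, the house/name/color puzzle, and the choice to count missing cells as wrong are all hypothetical.

```python
from typing import List

Grid = List[List[str]]


def parse_markdown_table(text: str) -> Grid:
    """Collect the data rows of the pipe table(s) found in `text`.

    Hypothetical helper: the first table row is treated as the header
    and dropped; alignment rows such as |---|:---:| are skipped.
    """
    rows: Grid = []
    for raw in text.splitlines():
        line = raw.strip()
        if len(line) < 2 or not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # separator row between header and body
        rows.append(cells)
    return rows[1:]  # drop the header row


def cell_accuracy(predicted: Grid, reference: Grid) -> float:
    """Percentage of reference cells that the prediction reproduces.

    Cells missing from the prediction count as wrong (an assumption).
    """
    total = sum(len(row) for row in reference)
    if total == 0:
        return 0.0
    correct = 0
    for r, ref_row in enumerate(reference):
        pred_row = predicted[r] if r < len(predicted) else []
        for c, ref_cell in enumerate(ref_row):
            pred_cell = pred_row[c] if c < len(pred_row) else ""
            if pred_cell.strip().lower() == ref_cell.strip().lower():
                correct += 1
    return 100.0 * correct / total


# Made-up two-house puzzle used only for illustration.
REFERENCE: Grid = [["1", "Alice", "Red"], ["2", "Bob", "Blue"]]
PERFECT_REPLY = """Here is my solution:

| House | Name | Color |
| --- | --- | --- |
| 1 | Alice | Red |
| 2 | Bob | Blue |
"""
PERFECT_SCORE = cell_accuracy(parse_markdown_table(PERFECT_REPLY), REFERENCE)
```

A perfect reply scores 100.0 here, mirroring the test-plan item above; a reply with no table at all scores 0.0 because every reference cell is missing.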

🤖 Generated with Claude Code

Integrate the ZebraLogic logical reasoning benchmark (WildEval/ZebraLogic)
into OpenCompass. ZebraLogic evaluates LLMs on constraint satisfaction
problems (logic grid puzzles) and reveals the "curse of complexity" —
accuracy degrades sharply as puzzle size grows.

Two evaluation modes are supported:
- mc_mode  (3 259 samples): multiple-choice, evaluated with exact-match on
  the answer letter extracted from free-form model output.
- grid_mode (1 000 samples): full grid completion, evaluated with
  cell-level accuracy by parsing markdown tables from model output.

Changes:
- opencompass/datasets/zebralogic.py      – ZebraLogicDataset loader,
  ZebraLogicMCEvaluator, ZebraLogicGridEvaluator and helper functions
- opencompass/datasets/__init__.py        – register new classes
- opencompass/configs/datasets/ZebraLogic/zebralogic_gen.py – eval configs
- dataset-index.yml                       – add zebralogic entry

Paper: https://arxiv.org/abs/2502.01100
HuggingFace: WildEval/ZebraLogic

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Myhs-phz Myhs-phz merged commit 5589a37 into open-compass:main Jun 26, 2026
2 checks passed
@amanyara amanyara deleted the add-zebralogic branch June 30, 2026 11:20