[Dataset] Add ArxivRollBench by liangzid · Pull Request #2458 · open-compass/opencompass

liangzid · 2026-05-24T05:59:41Z

Summary

This PR adds ArxivRollBench to OpenCompass as a generative exact-match benchmark.

ArxivRollBench is a private-to-public rolling benchmark built from recent arXiv papers. It evaluates whether LLMs can reason over fresh scholarly text through three task formats: sequencing, cloze, and next-fragment prediction. Its key principle is time-aware evaluation: each benchmark snapshot is built from papers that were private or newly released at construction time, and the snapshot is later made public so the community can reproduce and compare results. This measures model generalization on recent scientific content while reducing the risk of overestimated scores caused by benchmark contamination.

The benchmark paper has been accepted at AAAI 2026:
https://ojs.aaai.org/index.php/AAAI/article/view/41098

The public Hugging Face datasets are available under:
https://huggingface.co/liangzid

Changes

Adds opencompass/configs/datasets/arxivrollbench/arxivrollbench_gen.py.
Provides default compact -50 splits for ArxivRollBench 2024b, 2025a, and 2026a across eight arXiv domains and the three SCP (sequencing, cloze, prediction) task formats.
Provides full split configs through arxivrollbench_full_datasets.
Adds registered answer postprocessors for Selection 1-Selection 4 and A-D answers.
Adds ArxivRollBench to dataset-index.yml.
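
The answer postprocessors listed above are not reproduced on this page. As a rough illustration of what normalizing "Selection 1" to "Selection 4" and A to D answers before exact-match scoring could look like, here is a minimal sketch in plain Python. The function names, regular expressions, and the accuracy helper are all hypothetical and are not taken from the PR; the real postprocessors are registered with OpenCompass and may behave differently.

```python
import re

# Hypothetical patterns for illustration only; the PR's actual
# postprocessors may use different rules.
_SELECTION_RE = re.compile(r"selection\s*([1-4])", re.IGNORECASE)
_LETTER_RE = re.compile(r"\b([A-D])\b")


def extract_selection(text: str) -> str:
    """Return 'Selection N' for the first 'Selection 1'..'Selection 4' found, else ''."""
    match = _SELECTION_RE.search(text)
    return f"Selection {match.group(1)}" if match else ""


def extract_letter(text: str) -> str:
    """Return the first standalone capital letter A-D, else ''.

    Caveat: a sentence starting with the article 'A' would match here,
    so a production rule would need to be stricter.
    """
    match = _LETTER_RE.search(text)
    return match.group(1) if match else ""


def exact_match_accuracy(predictions, references):
    """Percentage of predictions that equal their reference exactly."""
    if not references:
        return 0.0
    hits = sum(pred == ref for pred, ref in zip(predictions, references))
    return 100.0 * hits / len(references)


preds = [
    extract_letter("The answer is B."),
    extract_selection("I would pick selection 3 here."),
]
refs = ["B", "Selection 1"]
print(preds, exact_match_accuracy(preds, refs))
```

Normalizing the raw generation first keeps the exact-match comparison strict while tolerating extra wording around the chosen option.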

Why This Benchmark Is Useful For OpenCompass

OpenCompass already covers many static knowledge, reasoning, and domain benchmarks. ArxivRollBench complements them by focusing on recent scientific papers and rolling release snapshots. This gives users a lightweight way to test models on time-sensitive scholarly reasoning tasks, while keeping the default -50 split cost-controlled for API and large-model evaluation. The full split remains available for complete benchmark runs.

Validation

python -m compileall -q opencompass/datasets/arxivrollbench.py opencompass/configs/datasets/arxivrollbench/arxivrollbench_gen.py
YAML parse check for dataset-index.yml.
Config import check: generated 72 compact datasets and 72 full datasets.
Hugging Face smoke load checks for compact and full ArxivRollBench dataset paths.
OpenCompass build_dataset_from_cfg smoke check for one configured dataset.
AccEvaluator smoke check for both registered answer postprocessors.
git diff --check
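
The count of 72 datasets per split family follows from 3 snapshots × 8 domains × 3 task formats. A minimal sketch of that enumeration is shown below; the snapshot labels and task formats come from this PR description, while the domain placeholders and the naming scheme are assumptions made only for illustration and do not reflect the identifiers used in arxivrollbench_gen.py.

```python
from itertools import product

SNAPSHOTS = ["2024b", "2025a", "2026a"]         # named in the PR description
TASKS = ["sequencing", "cloze", "prediction"]   # the three SCP task formats
# The eight arXiv domains are not listed on this page, so placeholders are used.
DOMAINS = [f"domain{i}" for i in range(1, 9)]


def build_names(suffix: str = "") -> list[str]:
    """Build one hypothetical dataset name per (snapshot, domain, task) triple."""
    return [
        f"arxivrollbench-{snapshot}-{domain}-{task}{suffix}"
        for snapshot, domain, task in product(SNAPSHOTS, DOMAINS, TASKS)
    ]


compact_names = build_names("-50")  # compact -50 splits
full_names = build_names()          # full splits

print(len(compact_names), len(full_names))
```

A check of this kind is a cheap way to confirm that no snapshot, domain, or task combination was dropped when the configs were generated.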

Add ArxivRollBench dataset configs

f25f4df

mm-assistant Bot assigned tonysy May 24, 2026

liangzid marked this pull request as ready for review May 24, 2026 06:03

liangzid changed the title from "[codex] Add ArxivRollBench dataset configs" to "[Dataset] Add ArxivRollBench" May 24, 2026

Fix ArxivRollBench config prompts

c9778f8

Myhs-phz approved these changes Jun 26, 2026

Myhs-phz merged commit 2a8740e into open-compass:main Jun 26, 2026
2 checks passed

Myhs-phz merged 2 commits into open-compass:main from liangzid:add-arxivrollbench

4 participants
