
[Dataset] Add ArxivRollBench #2458

Merged: Myhs-phz merged 2 commits into open-compass:main from liangzid:add-arxivrollbench on Jun 26, 2026

@liangzid

Contributor

Summary

This PR adds ArxivRollBench to OpenCompass as a generative exact-match benchmark.

ArxivRollBench is a private-to-public rolling benchmark built from recent arXiv papers. It evaluates whether LLMs can reason over fresh scholarly text through three task formats: sequencing, cloze, and next-fragment prediction. Its key principle is time-aware evaluation: benchmark snapshots are created from papers that were private or newly released at the time of construction, then later become public so the community can reproduce and compare results. This helps measure model generalization on recent scientific content while reducing the overestimation risk caused by benchmark contamination.
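Because every task format is scored by exact match on a short generated answer, the metric itself is simple. The following is a minimal sketch of exact-match accuracy, not the OpenCompass evaluator; the function name `exact_match_accuracy` and the whitespace-stripping behaviour are assumptions made for illustration.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str],
                         references: Sequence[str]) -> float:
    """Percentage of predictions equal to their reference after stripping.

    Hypothetical helper for illustration; it is not part of OpenCompass.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    # Count positions where the stripped strings are identical.
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)
```

A model that answers three of four items correctly would score 75.0 under this sketch.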

The benchmark paper has been accepted by AAAI 2026:
https://ojs.aaai.org/index.php/AAAI/article/view/41098

The public Hugging Face datasets are available under:
https://huggingface.co/liangzid

Changes

  • Adds opencompass/configs/datasets/arxivrollbench/arxivrollbench_gen.py.
  • Provides default compact -50 splits for ArxivRollBench 2024b, 2025a, and 2026a across eight arXiv domains and three SCP task formats.
  • Provides full split configs through arxivrollbench_full_datasets.
  • Adds registered answer postprocessors for the two answer formats: Selection 1 through Selection 4, and A through D.
  • Adds ArxivRollBench to dataset-index.yml.
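To illustrate what an answer postprocessor for these two formats might do, here is a minimal sketch. It is not the code added in this PR: the function names `extract_selection` and `extract_letter`, the regular expressions, and the choice to keep the last match are all assumptions.

```python
import re
from typing import Optional


def extract_selection(text: str) -> Optional[str]:
    """Return the last 'Selection N' (N in 1..4) found in text, normalised.

    Hypothetical helper; the PR's registered postprocessor may differ.
    """
    matches = re.findall(r"selection\s*([1-4])\b", text, flags=re.IGNORECASE)
    return f"Selection {matches[-1]}" if matches else None


def extract_letter(text: str) -> Optional[str]:
    """Return the last standalone capital letter A-D found in text.

    Hypothetical helper; a bare capital 'A' used as an article would also
    match, so a real postprocessor likely needs a stricter pattern.
    """
    matches = re.findall(r"\b([A-D])\b", text)
    return matches[-1] if matches else None
```

Taking the last match rather than the first is a design choice in this sketch: models often discuss several options before committing to a final answer.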

Why This Benchmark Is Useful For OpenCompass

OpenCompass already covers many static knowledge, reasoning, and domain benchmarks. ArxivRollBench complements them by focusing on recent scientific papers and rolling release snapshots. This gives users a lightweight way to test models on time-sensitive scholarly reasoning tasks, while keeping the default -50 split cost-controlled for API and large-model evaluation. The full split remains available for complete benchmark runs.

Validation

  • python -m compileall -q opencompass/datasets/arxivrollbench.py opencompass/configs/datasets/arxivrollbench/arxivrollbench_gen.py
  • YAML parse check for dataset-index.yml.
  • Config import check: generated 72 compact datasets and 72 full datasets.
  • Hugging Face smoke load checks for compact and full ArxivRollBench dataset paths.
  • OpenCompass build_dataset_from_cfg smoke check for one configured dataset.
  • AccEvaluator smoke check for both registered answer postprocessors.
  • git diff --check
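The count of 72 in the config import check is consistent with 3 snapshots × 8 domains × 3 task formats. The sketch below reproduces that arithmetic only. The domain names, the task labels, and the `arxivroll-...` naming scheme are hypothetical placeholders, not the identifiers used in `arxivrollbench_gen.py`.

```python
from itertools import product
from typing import List

SNAPSHOTS = ["2024b", "2025a", "2026a"]        # snapshots named in this PR
TASKS = ["sequencing", "cloze", "prediction"]  # three task formats (labels assumed)
DOMAINS = [f"domain_{i}" for i in range(8)]    # placeholders for the eight arXiv domains


def enumerate_splits(suffix: str = "") -> List[str]:
    """Build one hypothetical dataset name per (snapshot, domain, task)."""
    return [
        f"arxivroll-{snap}-{dom}-{task}{suffix}"
        for snap, dom, task in product(SNAPSHOTS, DOMAINS, TASKS)
    ]


compact = enumerate_splits("-50")  # compact -50 splits
full = enumerate_splits()          # full splits
assert len(compact) == len(full) == 72
```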

@liangzid liangzid marked this pull request as ready for review May 24, 2026 06:03
@liangzid liangzid changed the title from "[codex] Add ArxivRollBench dataset configs" to "[Dataset] Add ArxivRollBench" on May 24, 2026
@Myhs-phz Myhs-phz merged commit 2a8740e into open-compass:main Jun 26, 2026
2 checks passed

4 participants