Commit 898ed1c

[New Task] Add support for benchmark PhyX (#766)
* add support for PhyX
* UPD PhyX
* UPD
* fix following robot
* fix follow robot
1 parent f604dfb commit 898ed1c

8 files changed

Lines changed: 556 additions & 0 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@
 <details>
 <summary>We warmly welcome contributions from the open-source community! Below is a chronological list of recent tasks, models, and features added by our amazing contributors. </summary>
 
+- [2025-07] 🎉🎉 We welcome the new task [PhyX](https://phyx-bench.github.io/), the first large-scale benchmark designed to assess models capacity for physics-grounded reasoning in visual scenarios.
 - [2025-06] 🎉🎉 We welcome the new task [VideoMathQA](https://mbzuai-oryx.github.io/VideoMathQA), designed to evaluate mathematical reasoning in real-world educational videos.
 - [2024-10] 🎉🎉 We welcome the new task [NaturalBench](https://huggingface.co/datasets/BaiqiL/NaturalBench), a vision-centric VQA benchmark (NeurIPS'24) that challenges vision-language models with simple questions about natural imagery.
 - [2024-10] 🎉🎉 We welcome the new task [TemporalBench](https://huggingface.co/datasets/microsoft/TemporalBench) for fine-grained temporal understanding and reasoning for videos, which reveals a huge (>30%) human-AI gap.

lmms_eval/tasks/phyx/phyx.yaml

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+group: phyx
+task:
+- phyx_mc
+- phyx_oe
+- phyx_mini_mc
+- phyx_mini_oe
+metadata:
+  version: 0.0
+  eval_model_name: "deepseek-chat"
+  quick_extract: false
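
The group file above bundles four task names under the single name `phyx`. As a rough illustration of what that grouping means, the sketch below expands a group name into its member tasks. It is not lmms-eval's actual resolution code: `GROUPS` and `expand_tasks` are hypothetical names introduced here, and the mapping is hand-copied from the YAML rather than loaded from disk.

```python
from typing import Dict, List

# Hypothetical mapping, hand-copied from phyx.yaml: group name -> member tasks.
GROUPS: Dict[str, List[str]] = {
    "phyx": ["phyx_mc", "phyx_oe", "phyx_mini_mc", "phyx_mini_oe"],
}


def expand_tasks(requested: List[str], groups: Dict[str, List[str]]) -> List[str]:
    """Replace each group name with its member tasks, keeping order and dropping duplicates."""
    expanded: List[str] = []
    for name in requested:
        # A name that is not a known group is treated as a plain task name.
        for task in groups.get(name, [name]):
            if task not in expanded:
                expanded.append(task)
    return expanded


if __name__ == "__main__":
    # Requesting the group yields all four PhyX variants.
    print(expand_tasks(["phyx"], GROUPS))
```

Under this reading, requesting `phyx` would run the multiple-choice and open-ended variants on both the full and mini splits, while a single name such as `phyx_mini_mc` selects just one of them.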
