
feat: add physics reasoning benchmarks (PhysBench, ContPhy, PhysGame, PhysicsRW, PhysReason) #1272

Open
Luodian wants to merge 1 commit into main from feat/physics-benchmarks

Luodian commented Mar 26, 2026

Summary

  • Add five physics reasoning benchmarks covering diverse physical understanding tasks
  • Add shared MCQ answer extraction utility (mcq_extract.py) for robust multiple-choice parsing

Benchmarks

| Benchmark | Paper | Description |
|---|---|---|
| PhysBench | ICLR 2025 | Multi-domain physics reasoning across mechanics, optics, thermodynamics, etc. |
| ContPhy | ICML 2024 | Continuum physics understanding from video (fluid, cloth, rope dynamics) |
| PhysGame | - | Physics understanding from game/simulation environments |
| PhysicsRW | - | Real-world physics scenario questions |
| PhysReason | - | Physics reasoning with full and mini splits |

New shared utility

  • _task_utils/mcq_extract.py: Robust MCQ answer extraction supporting 10+ answer formats with priority ranking (parentheses, periods, colons, natural language phrases, positional fallbacks)
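
To illustrate what priority-ranked extraction could look like, here is a minimal sketch. It is not the code in `mcq_extract.py`: the function name `extract_choice`, the pattern list, and the assumption that priority follows the order in which the formats are listed above are all hypothetical, and only five of the 10+ formats are covered.

```python
import re
from typing import Optional

# Hypothetical pattern table, highest priority first. The ordering mirrors the
# list in the PR description; the real utility may rank formats differently.
_PATTERNS = [
    ("parentheses", re.compile(r"\(([A-Za-z])\)")),                       # "(B)"
    ("period", re.compile(r"(?<![A-Za-z])([A-Z])\.(?=\s|$)")),            # "B. text"
    ("colon", re.compile(r"(?<![A-Za-z])([A-Z]):")),                      # "B: text"
    ("phrase", re.compile(r"answer\s+is\s+([A-Za-z])\b", re.IGNORECASE)), # "the answer is b"
    ("fallback", re.compile(r"(?<![A-Za-z])([A-Z])(?![A-Za-z])")),        # first standalone capital
]


def extract_choice(response: str, valid: str = "ABCD") -> Optional[str]:
    """Return the first valid option letter found by the highest-priority pattern."""
    for _name, pattern in _PATTERNS:
        for match in pattern.finditer(response):
            letter = match.group(1).upper()
            if letter in valid:
                return letter
    return None  # nothing that looks like an option letter


if __name__ == "__main__":
    for text in ["(B) friction", "D. It floats", "The answer is c", "no idea"]:
        print(repr(text), "->", extract_choice(text))
```

Trying patterns in a fixed order keeps the behaviour predictable when a response contains several candidate letters: a stricter format such as `(B)` always beats a loose positional match.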

Test plan

  • Verify task registration with lmms-eval --tasks list | grep phys
  • Run PhysBench with an image-capable model
  • Run ContPhy with a video-capable model
  • Confirm MCQ extraction handles edge cases (reasoning tags, multi-format answers)
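
The reasoning-tag edge case can be checked with a small pre-processing step like the sketch below. It assumes the model wraps its reasoning in `<think>...</think>`; the tag name and the helper `strip_reasoning` are illustrative assumptions, not taken from this PR.

```python
import re

# Assumed tag format: <think> ... </think>. Adjust for the model under test.
_REASONING = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(response: str) -> str:
    """Drop reasoning blocks so option letters mentioned while thinking are ignored."""
    cleaned = _REASONING.sub("", response)
    # Some models emit only the closing tag; keep the text that follows it.
    if "</think>" in cleaned:
        cleaned = cleaned.rsplit("</think>", 1)[1]
    return cleaned.strip()


if __name__ == "__main__":
    raw = "<think>Maybe (A)? No, friction dominates.</think>The answer is (C)."
    print(strip_reasoning(raw))
```

Running the extractor on the stripped text rather than the raw response avoids picking up a letter such as `(A)` that the model considered and then rejected.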

… PhysicsRW, PhysReason)

Add five physics reasoning benchmarks:
- PhysBench: multi-domain physics reasoning (ICLR 2025)
- ContPhy: continuum physics understanding from videos (ICML 2024)
- PhysGame: physics understanding from game environments
- PhysicsRW: real-world physics scenarios
- PhysReason: physics reasoning with mini split

Also adds shared MCQ answer extraction utility used by PhysBench.