fix: improve evaluation logic across 10+ existing benchmarks #1274

Merged
Luodian merged 1 commit into main from fix/benchmark-task-improvements on Apr 11, 2026

Luodian commented on Mar 26, 2026

Summary

Fixes and enhancements across multiple existing benchmark tasks to improve evaluation accuracy and code quality.

Changes by benchmark

| Task | Change |
| --- | --- |
| egotempo | Enhanced temporal reasoning evaluation with detailed per-category metrics |
| lvbench | Simplified evaluation logic, removed redundant option-matching code |
| mlvu | Streamlined answer extraction, removed unused imports |
| mmstar | Simplified MCQ extraction with cleaner regex patterns |
| omnidocbench | Major refactor: improved document evaluation metrics, better text/table/formula handling |
| realworldqa | Cleaned up YAML config (removed hardcoded options), improved utils |
| spatialviz | Added missing metric aggregation functions |
| videomme | Simplified video evaluation, removed redundant frame extraction code |
| videommmu | Enhanced video processing with better multi-image support |
| mmlongbench_doc | Improved document benchmark with better page-level evaluation |
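
For readers unfamiliar with this kind of change, here is a minimal sketch of regex-based MCQ answer extraction followed by per-category accuracy aggregation, the two patterns the mmstar, egotempo, and spatialviz entries refer to. It is not the code from this PR: the function names `extract_choice` and `per_category_accuracy`, the regex patterns, and the `(category, response, gold)` record layout are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def extract_choice(response, choices="ABCD"):
    """Return the option letter found in a model response, or None.

    Hypothetical helper: the patterns below are illustrative and are
    not the ones used by the benchmark tasks in this PR.
    """
    patterns = [
        # "The answer is (B)", "Answer: B"; the letter must not start a word.
        rf"(?i:answer)(?i:\s+is)?\s*:?\s*\(?([{choices}])\)?(?![A-Za-z])",
        # A leading letter followed by ")", ".", ":" or the end of the string.
        rf"^\s*\(?([{choices}])(?:[).:]|\s*$)",
    ]
    for pattern in patterns:
        match = re.search(pattern, response)
        if match:
            return match.group(1)
    return None


def per_category_accuracy(records):
    """Aggregate accuracy per category.

    `records` is an iterable of (category, response, gold_letter) tuples;
    this layout is an assumption, not the tasks' real document schema.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, response, gold in records:
        totals[category] += 1
        if extract_choice(response) == gold:
            correct[category] += 1
    return {category: correct[category] / totals[category] for category in totals}


if __name__ == "__main__":
    sample = [
        ("temporal", "The answer is (B).", "B"),
        ("temporal", "A dog is visible", "A"),  # no explicit choice, scored wrong
        ("spatial", "C", "C"),
    ]
    print(per_category_accuracy(sample))
```

The second pattern deliberately requires punctuation or end-of-string after a leading letter, so a response that merely begins with the article "A" is not counted as choosing option A.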

Test plan

  • Run affected benchmarks with a compatible model to verify no regressions
  • Spot-check omnidocbench refactor (largest change) with document-capable model
  • Verify realworldqa YAML changes don't break task loading
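
One way to make the "no regressions" check concrete is to compare per-task scores from a run before and after the change. The sketch below assumes the scores have already been collected into plain dictionaries keyed by task name; the function name `find_regressions`, the tolerance value, and the example scores are hypothetical and are not produced by this PR. Because this PR changes evaluation logic, some score shifts may be intended, so flagged tasks are candidates for review rather than confirmed regressions.

```python
def find_regressions(baseline, candidate, tolerance=0.005):
    """Return {task: (old_score, new_score)} for tasks whose score dropped
    by more than `tolerance`, or which are missing from `candidate`.

    Hypothetical helper; the dictionaries are assumed to map task names
    to a single scalar metric such as accuracy.
    """
    regressions = {}
    for task, old_score in baseline.items():
        new_score = candidate.get(task)
        if new_score is None or old_score - new_score > tolerance:
            regressions[task] = (old_score, new_score)
    return regressions


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    before = {"mmstar": 0.62, "videomme": 0.71}
    after = {"mmstar": 0.60, "videomme": 0.709}
    for task, (old, new) in find_regressions(before, after).items():
        print(f"{task}: {old} -> {new}")
```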

Fixes and enhancements across multiple benchmark tasks:
- egotempo: enhanced temporal reasoning evaluation with detailed metrics
- lvbench/mlvu: simplified evaluation logic, removed redundant code
- mmstar: streamlined answer extraction
- omnidocbench: major refactor with improved document evaluation metrics
- realworldqa: cleaned up yaml config and utils
- spatialviz: added missing metric functions
- videomme/videommmu: improved video evaluation handling
- mmlongbench_doc: enhanced document benchmark processing

Luodian merged commit 9ca4445 into main on Apr 11, 2026
2 checks passed