Commit 8f142bc

Luodian, KelvinDo183, pbcong, YichenG170, b8zhong authored
[Feat] v0.5 Release Pack (#846)
* add scibench task (full) and change medqa (#840)
  * add scibench task (full ) and change medqa
  * run precommit
  ---------
  Co-authored-by: pbcong <congphamba2005@gmail.com>
* add csbench (#841)
  * add csbench
  * run precommit
  ---------
  Co-authored-by: pbcong <congphamba2005@gmail.com>
* fix linting (#842)
* [Feature] Add WenetSpeech Dataset (#837)
  * [fix] batch size in openai compatible endpoint (#835)
  * more * more * more * more * more * more * more * more * more * more * more * more * more * more
  * [Feature] Add WenetSpeech Dataset
  * add lmms-eval-0.5 doc's 1st draft
  * remove unneccessary parts in lmms-eval-0.5.md
  ---------
  Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
* This commit documents the official release of **LMMS-Eval v0.5: Multimodal Expansion**, detailing significant new features including:
  * A comprehensive **audio evaluation suite** (Step2 Audio Paralinguistic, VoiceBench, WenetSpeech).
  * A production-ready **response caching system**.
  * Integration of **five new models** (e.g., GPT-4o Audio Preview, Gemma-3).
  * Addition of **numerous new benchmarks** across vision, coding, and STEM domains.
  * Support for the **Model Context Protocol (MCP)** and improvements to **Async OpenAI integration**.
* This commit formally announces and documents the **LMMS-Eval v0.5: Multimodal Expansion** release, updating the `README.md` and refining the `v0.5` release notes with improved structure and reproducibility validation for new benchmarks.
* Updates the status legend for reproducibility validation in the LMMS-Eval v0.5 release notes, changing '†' to '+-'.
* Revise metrics and model integration in lmms-eval doc

  Updated metrics and model integration details in the documentation.
* Fix model name in LMMs-Eval v0.5 announcement

  Corrected the name of the model 'GPT-4o Audio' to 'GPT-4o Audio Preview' in the announcement section.

---------

Co-authored-by: Do Duc Anh (Erwin) <104162175+KelvinDo183@users.noreply.github.com>
Co-authored-by: pbcong <congphamba2005@gmail.com>
Co-authored-by: Cong <101887866+pbcong@users.noreply.github.com>
Co-authored-by: JAM_Yichen <110095482+YichenG170@users.noreply.github.com>
Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
1 parent 36dcfbb commit 8f142bc

20 files changed

Lines changed: 1184 additions & 151 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@
 
 ## Annoucement
 
+- [2025-10] 🚀🚀 **LMMs-Eval v0.5** is here! This major release introduces comprehensive audio evaluation, response caching, 5 new models (GPT-4o Audio Preview, Gemma-3, LongViLA-R1, LLaVA-OneVision 1.5, Thyme), and 50+ new benchmark variants spanning audio (Step2, VoiceBench, WenetSpeech), vision (CharXiv, Lemonade), and reasoning (CSBench, SciBench, MedQA, SuperGPQA) with reproducible results. Please refer to the [release notes](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/lmms-eval-0.5.md) for details.
 - [2025-07] 🚀🚀 We have released the `lmms-eval-0.4`. Please refer to the [release notes](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/lmms-eval-0.4.md) for more details. This is a major update with new features and improvements, for users wish to use `lmms-eval-0.3` please refer to the branch `stable/v0d3`. For our mission to better reproductability, we've opened a specific thread to discuss about the model's eval results in [discussion](https://github.com/EvolvingLMMs-Lab/lmms-eval/discussions/779).
 - [2025-07] 🎉🎉 We welcome the new task [PhyX](https://phyx-bench.github.io/), the first large-scale benchmark designed to assess models capacity for physics-grounded reasoning in visual scenarios.
 - [2025-06] 🎉🎉 We welcome the new task [VideoMathQA](https://mbzuai-oryx.github.io/VideoMathQA), designed to evaluate mathematical reasoning in real-world educational videos.
