docs: add MOVA Benchmark for Arena to README (#50)
Release the 732-sample evaluation benchmark on Hugging Face,
covering MOVA-Bench (132 samples, 7 categories) and bilingual
VerseBench (600 samples). Add News entry, Evaluation subsection
with dataset link, and mark Evaluation Benchmark as done in TODO.
README.md (15 additions, 0 deletions)
@@ -27,6 +27,7 @@ We introduce **MOVA** (**MO**SS **V**ideo and **A**udio), a foundation model des
 - **Asymmetric Dual-Tower Architecture**: Leverages the power of pre-trained video and audio towers, fused via a bidirectional cross-attention mechanism for rich modality interaction.
 
 ## 🔥News!!!
+- 2026/03/14: 🎉We released the **MOVA Benchmark for Arena** (732 samples) on [🤗 Hugging Face](https://huggingface.co/datasets/zhiyuzhang-0212/MOVA_benchmark_for_arena) for reproducible evaluation.
 - 2026/03/09: 🎉**MOVA API** is now available! Apply for your API key at [studio.mosi.cn](https://studio.mosi.cn/docs/models/mova?src=github) to start generating videos programmatically.
 - 2026/03/09: 🎉**ComfyUI support** is here! Thanks to [@richservo](https://github.com/richservo), you can now use MOVA in ComfyUI at low cost via [comfyui-mova](https://github.com/richservo/comfyui-mova).
 - 2026/02/10: 🎉We released the **MOVA** [technical report](https://arxiv.org/abs/2602.08794) and updated the [inference workflow](https://github.com/OpenMOSS/MOVA/pull/29).
@@ -153,6 +154,19 @@ Below are the Elo scores and win rates comparing MOVA to existing open-source mo
 We release the **MOVA Benchmark for Arena** on Hugging Face for reproducible subjective evaluation. The benchmark contains **732 samples** organized into two subsets:
+
+| Subset | Samples | Description |
+|--------|---------|-------------|
+| MOVA-Bench | 132 | Real-world scenarios across 7 categories: multi-speaker (27), movie (12), sports (20), games (20), shot-effect (30), anime (20), and others (3) |
+| VerseBench (Bilingual) | 600 | Bilingual English-Chinese speech data adapted from [VerseBench](https://huggingface.co/datasets/dorni/Verse-Bench), split into set1 (205), set2 (295), and set3 (100) |
+
+Each sample includes a **first-frame image** and a **prompt** (rewritten by the workflow introduced in the paper) for joint image-text to video-audio generation.
 
 [SGLang](https://github.com/sgl-project/sglang) provides Day-0 support for MOVA. You can use the latest SGLang release and the examples below for high-throughput inference.
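The per-category and per-set counts in the added table can be cross-checked against the published subset totals. A minimal sketch (the numbers are copied from the PR description; the variable names are ours, not part of the dataset):

```python
# Sanity-check that the per-category counts sum to the published totals.
# Counts taken from the PR description; dictionary names are illustrative only.
mova_bench = {
    "multi-speaker": 27, "movie": 12, "sports": 20,
    "games": 20, "shot-effect": 30, "anime": 20, "others": 3,
}
verse_bench = {"set1": 205, "set2": 295, "set3": 100}

total = sum(mova_bench.values()) + sum(verse_bench.values())
assert sum(mova_bench.values()) == 132   # MOVA-Bench subset
assert sum(verse_bench.values()) == 600  # VerseBench (Bilingual) subset
assert total == 732                      # full benchmark
```

All three assertions hold, so the table, the News entry, and the commit description are mutually consistent.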
@@ -309,6 +323,7 @@ All peak usage numbers below are measured on **360p, 8-second** video training s