- May 1, 2026: MMBench-Live has been accepted at ICML 2026!
Evaluation benchmarks are essential for measuring the capabilities of vision–language models (VLMs). Existing multimodal benchmarks are mostly static, which leaves them vulnerable to data contamination and temporal staleness and makes them costly to construct. To address these challenges, we propose MMBench-Live, a dynamic, multi-agent-driven benchmark that updates continuously without human intervention. It integrates structured benchmark descriptions, real-time data collection, and automated QA generation, enabling scalable, low-cost evaluation. In addition, a distribution-consistent updating strategy keeps evaluation reliable and fair across versions while maintaining data diversity and quality.
conda create -n mmbench-live python=3.10.16 -y
conda activate mmbench-live
git clone https://github.com/SupineYoke123/MMBench-Live.git
cd MMBench-Live
pip install -r requirements.txt
pip install playwright
python -m playwright install chromium

Before running the project, you need to set up the necessary API keys and configuration parameters. The project will look for environment variables by default, but you can also modify the configuration file directly.
| Environment Variable | Description |
|---|---|
| OPENAI_API_KEY | API key for OpenAI models |
| GEMINI_API_KEY | API key for Google AI Studio |
| SERPER_API_KEY | API key for Google Image search |

To run the agents, you must configure the following:
| Environment Variable | Description |
|---|---|
| FLICKR_API_KEY | API key for Flickr image search |
| TAVILY_API_KEY | API key for Tavily image search |
| BENCHMARK_SUMMARY_AGENT_MCP_PORT | Port for the benchmark summary agent |
| DATA_ACQUISITION_AGENT_MCP_PORT | Port for the data acquisition agent |
| QA_GENERATION_AGENT_MCP_PORT | Port for the QA generation agent |
| QA_VALIDATE_AGENT_MCP_PORT | Port for the QA validation agent |
| SEGMENTATION_MCP | URL for segmentation service |
| VISION_MCP | URL for vision service |
| DEPTH_MCP | URL for depth service |

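Because both pipelines depend on these variables, it can help to confirm they are set before launching anything. The snippet below is a minimal sketch, not part of the project: the variable names are copied from the tables above, while the helper `missing_vars` and the split into `BASE_VARS` and `AGENT_VARS` are hypothetical names introduced here for illustration.

```python
import os

# Names copied from the configuration tables; the two groups mirror the
# two tables (general keys vs. agent-specific settings).
BASE_VARS = ["OPENAI_API_KEY", "GEMINI_API_KEY", "SERPER_API_KEY"]
AGENT_VARS = [
    "FLICKR_API_KEY",
    "TAVILY_API_KEY",
    "BENCHMARK_SUMMARY_AGENT_MCP_PORT",
    "DATA_ACQUISITION_AGENT_MCP_PORT",
    "QA_GENERATION_AGENT_MCP_PORT",
    "QA_VALIDATE_AGENT_MCP_PORT",
    "SEGMENTATION_MCP",
    "VISION_MCP",
    "DEPTH_MCP",
]


def missing_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(BASE_VARS + AGENT_VARS)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All listed environment variables are set.")
```

If you only plan to use the non-agent pipeline, checking `BASE_VARS` alone is likely sufficient, since the second table is described as a requirement for running the agents.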
This section provides instructions for quickly running the project in two modes: Non-Agent Pipeline and Agent Pipeline.
If you do not want to run the agents, a simplified pipeline is provided.
Notes:
- Only the Google image API is used for data acquisition.
- The QA validation step is skipped.
Steps:
- Run the benchmark summary script:
python pipeline/benchmark_summary.py
- Run the data acquisition script:
python pipeline/data_acquisition.py
- Run the QA generation script:
python pipeline/qa_generation.py

If you want to run the agents, make sure all required API keys, MCP endpoints, and ports are properly configured (see ⚙️ Configuration section).
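One way to verify the port settings before starting the agents is a small preflight check. The following is a sketch under stated assumptions: it assumes each `*_MCP_PORT` value is a plain TCP port number and that the four agents need distinct ports because they run as separate processes on one machine. Neither point is confirmed by the project, and `check_ports` is a hypothetical helper, not part of the codebase.

```python
import os

# Names copied from the configuration table. Treating each value as a
# plain TCP port number is an assumption.
PORT_VARS = [
    "BENCHMARK_SUMMARY_AGENT_MCP_PORT",
    "DATA_ACQUISITION_AGENT_MCP_PORT",
    "QA_GENERATION_AGENT_MCP_PORT",
    "QA_VALIDATE_AGENT_MCP_PORT",
]


def check_ports(env=None):
    """Return a list of problem descriptions; an empty list means no issue found."""
    env = os.environ if env is None else env
    problems = []
    seen = {}  # port number -> first variable that claimed it
    for name in PORT_VARS:
        raw = env.get(name, "")
        try:
            port = int(raw)
        except ValueError:
            port = 0  # unset or non-numeric values fall through to the range check
        if not 1 <= port <= 65535:
            problems.append(f"{name}: expected a port in 1-65535, got {raw!r}")
            continue
        if port in seen:
            problems.append(f"{name}: port {port} is already used by {seen[port]}")
        else:
            seen[port] = name
    return problems


if __name__ == "__main__":
    for problem in check_ports():
        print(problem)
```

The check only inspects the configured values; it does not test whether a port is free or whether an agent is listening on it.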
Steps:
- Open separate terminals for each agent and run:
# Terminal 1
python agents/benchmark_summary_agent.py
# Terminal 2
python agents/data_acquisition_agent.py
# Terminal 3
python agents/qa_generation_agent.py
# Terminal 4
python agents/qa_validate_agent.py
- Run the main agent runner:
python agent_run.py

If you find this work useful for your research, please kindly cite our paper:
@inproceedings{liu2026mmbenchlive,
title={{MMB}ench-Live: A Continuously Evolving Benchmark for Multimodal Models},
author={Yuanzhi Liu and Shousheng Zhao and Bo Zhou and Kongming Liang and Zhanyu Ma},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
}
