
Commit 52c5620

feat: add Video-MME-v2 benchmark task (#1289)
* feat(videomme_v2): add task config and default template
  - Dataset: MME-Benchmarks/Video-MME-v2 (800 videos, 3200 questions)
  - 8-option MCQ (A-H) with grouped non-linear scoring
  - Generation config: max_new_tokens=64, temperature=0
* feat(videomme_v2): add scoring, prompts, and evaluation logic
  - Grouped non-linear scoring: relevance (quadratic) + logic (chain-based)
  - 3 group structures: [1,2,3,4], [1,[2,3],4], [[1,2],3,4]
  - Answer extraction with 11 prefix patterns (A-H range)
  - Per-level, per-category, per-group-type breakdown reporting
  - Prompt aligned with official INSTRUCT_PROMPT
  - Verified against VLMEvalKit implementation
* feat(videomme_v2): add subtitle variant (concatenated mode)
  - Load word-level JSONL subtitles and prepend them to the prompt
  - Graceful fallback when the subtitle file is missing
  - Task: videomme_v2_w_subtitle
* feat(videomme_v2): add reasoning mode variant
  - Chain-of-thought prompt requiring the "Final Answer: <letter>" format
  - max_new_tokens=4096 to leave room for reasoning
  - Task: videomme_v2_reasoning
* fix(videomme_v2): report sub-category scores as separate metrics
  Per review feedback (ref: PR #1285 pattern):
  - Report relevance/logic group-type scores separately
  - Report per-level (1/2/3) scores separately
  - Refactor aggregate logic into a _compute_all_subscores helper
  - process_results returns the same entry under all 6 metric keys
  - Detailed second_head/third_head breakdowns are still logged

Co-authored-by: mwxely <mwxely@users.noreply.github.com>
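The commit message describes the grouped non-linear scoring only at a high level. Below is a minimal sketch of one plausible reading; the function names, the exact quadratic and chain formulas, and the flat (non-nested) group layout are all assumptions — the real logic lives in the task's utils.py and is stated to mirror VLMEvalKit.

```python
# Hypothetical sketch of the two grouped scoring modes (assumed semantics).
def relevance_score(correct_flags):
    # Quadratic relevance: fraction of the group answered correctly,
    # squared, so partial credit grows super-linearly with coverage.
    frac = sum(correct_flags) / len(correct_flags)
    return frac ** 2

def logic_score(correct_flags):
    # Chain-based logic: credit only the longest prefix of consecutive
    # correct answers, so an early mistake caps the whole group's score.
    chain = 0
    for ok in correct_flags:
        if not ok:
            break
        chain += 1
    return chain / len(correct_flags)
```

The nested group structures ([1,[2,3],4] and [[1,2],3,4]) would add one more layer, scoring a sub-group as a unit before applying the same formula to the outer list.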
Parent: 295b35d

5 files changed: 496 additions & 0 deletions

File shown below — the task config YAML (46 additions & 0 deletions):
@@ -0,0 +1,46 @@
dataset_path: MME-Benchmarks/Video-MME-v2
dataset_kwargs:
  token: True
  cache_dir: videomme_v2
  video: True
test_split: test
output_type: generate_until
doc_to_visual: !function utils.videomme_v2_doc_to_visual
doc_to_text: !function utils.videomme_v2_doc_to_text
doc_to_target: "answer"
generation_kwargs:
  max_new_tokens: 64
  temperature: 0
  top_p: 1.0
  num_beams: 1
  do_sample: false
process_results: !function utils.videomme_v2_process_results
metric_list:
  - metric: videomme_v2_score
    aggregation: !function utils.videomme_v2_aggregate_results
    higher_is_better: true
  - metric: videomme_v2_relevance_score
    aggregation: !function utils.videomme_v2_aggregate_relevance
    higher_is_better: true
  - metric: videomme_v2_logic_score
    aggregation: !function utils.videomme_v2_aggregate_logic
    higher_is_better: true
  - metric: videomme_v2_level_1
    aggregation: !function utils.videomme_v2_aggregate_level_1
    higher_is_better: true
  - metric: videomme_v2_level_2
    aggregation: !function utils.videomme_v2_aggregate_level_2
    higher_is_better: true
  - metric: videomme_v2_level_3
    aggregation: !function utils.videomme_v2_aggregate_level_3
    higher_is_better: true
lmms_eval_specific_kwargs:
  default:
    pre_prompt: ""
    post_prompt: "\nAnswer with the option's letter from the given choices directly."
  qwen3_vl:
    format: "qwen3_vl"
    pre_prompt: "Question: "
    post_prompt: "Answer with the option letter only."
metadata:
  - version: 0.0