feat: add HD-EPIC VQA benchmark (CVPR 2025) by aliazani · Pull Request #1316 · EvolvingLMMs-Lab/lmms-eval

aliazani · 2026-04-30T17:31:29Z

Summary

Adds HD-EPIC VQA benchmark (Perrett et al., CVPR 2025), a 26,550-question egocentric kitchen video QA benchmark across 30 prototypes and 7 categories
Provides per-prototype, per-category, and full benchmark task configs runnable via --tasks hd_epic_<prototype>, --tasks hd_epic_<category>, or --tasks hd_epic
Validated against published zero-shot Qwen2.5-VL-7B baseline from the HD-EPIC challenge community report

In scope

30 per-prototype YAML task configs + 7 category group YAMLs + 1 master group YAML (hd_epic)
utils.py with doc_to_messages, doc_to_visual, doc_to_text, process_results, doc_to_target, and <TIME>/<BBOX> tag resolution
hd_epic_to_hf.py converter from official annotation JSONs to lmms-eval JSONL format
generate_task_yamls.py script for regenerating all YAMLs
README.md with setup instructions and validation results

Out of scope

Hosting videos or JSONL on HuggingFace (videos are 41 hours; JSONL is generated locally from official annotations)
Fine-tuned model evaluation
Changes to any existing tasks or models

Validation

Command: python -m lmms_eval --model qwen2_5_vl --model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct,fps=1,max_num_frames=32,min_pixels=50176,max_pixels=50176 --tasks hd_epic_ingredient_ingredient_weight --batch_size 1
Sample size: N=50
Key metric: accuracy
Result: pass (26%; R3 community report baseline: ~28%; the 2 pp gap is within one standard error of ~6 pp at N=50)
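The "within SE" claim can be checked with a short calculation. The sketch below assumes each question is an independent Bernoulli trial and uses the binomial standard error of a proportion. The PR does not state how its SE was computed, so this is one plausible reading rather than the author's method.

```python
import math


def binomial_standard_error(accuracy: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent questions."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


observed = 0.26   # accuracy reported in this PR on the 50-question subset
baseline = 0.28   # approximate R3 community report baseline quoted above
n = 50

se = binomial_standard_error(observed, n)
gap = abs(observed - baseline)
print(f"SE = {se:.3f}, gap = {gap:.3f}, within one SE: {gap <= se}")
# prints: SE = 0.062, gap = 0.020, within one SE: True
```

With only 50 questions the standard error is about 6 percentage points, so a 2-point gap to the baseline is not distinguishable from sampling noise.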

Risk / Compatibility

No changes to existing tasks, models, or shared infrastructure
Requires ffmpeg in PATH and HD_EPIC_VIDEO_DIR env var set at eval time; gracefully falls back to full video if clip extraction fails
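A preflight check for the two runtime requirements above might look like the following. This is a minimal sketch and not part of the PR: the function name `check_hd_epic_environment` is hypothetical, and only the `ffmpeg` binary name and the `HD_EPIC_VIDEO_DIR` variable come from the description.

```python
import os
import shutil
from pathlib import Path


def check_hd_epic_environment(env=None) -> list[str]:
    """Return a list of problems; an empty list means the environment looks ready.

    `env` is a mapping of environment variables (defaults to os.environ), which
    makes the check easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    problems = []

    # ffmpeg must be discoverable on PATH for clip extraction.
    if shutil.which("ffmpeg", path=env.get("PATH")) is None:
        problems.append("ffmpeg not found on PATH")

    # HD_EPIC_VIDEO_DIR must point at the locally downloaded videos.
    video_dir = env.get("HD_EPIC_VIDEO_DIR")
    if not video_dir:
        problems.append("HD_EPIC_VIDEO_DIR is not set")
    elif not Path(video_dir).is_dir():
        problems.append(f"HD_EPIC_VIDEO_DIR is not a directory: {video_dir}")

    return problems
```

Calling `check_hd_epic_environment()` before launching an evaluation would surface a missing binary or directory up front instead of partway through a run.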

Type of Change

New benchmark/task

- 30 question prototypes across 7 categories (Recipe, Ingredient, Nutrition, Fine-grained Actions, 3D Perception, Object Motion, Gaze)
- 26,550 multiple-choice questions from 41 hours of egocentric video
- Runnable per-prototype, per-category, or full benchmark
- Validated: Qwen2.5-VL-7B 26% on Ingredient Weight (R3 report: ~28%)

-c copy snaps to nearest keyframe, causing clips to start early. Replace with -c:v libx264 -preset ultrafast -crf 23 for exact cuts.

kcz358

Hi, the tasks LGTM. It would be better if you can split the yaml into sub folders so that the categories are clearer for the user. Thanks

Split the 30 per-prototype YAMLs and the 7 category-group YAMLs into one subfolder per HD-EPIC category (recipe/, ingredient/, nutrition/, fine_grained/, 3d_perception/, object_motion/, gaze/) so the category structure is visible at a glance. _hd_epic_base.yaml and the master group YAML stay at the top level. Per-prototype YAMLs now use `include: ../_hd_epic_base.yaml`. Task names, group wiring, and --tasks invocations are unchanged.

Also:
- update generate_task_yamls.py to emit the new layout
- add a Repository layout section to the README
- fix stale clip extraction note (-c copy → libx264/ultrafast/crf23)

lmms-eval resolves !function module names relative to each YAML's own directory. After moving per-prototype YAMLs into <category>/ subfolders, references like `!function utils.filter_X` could no longer find the top-level utils.py and threw ImportError on task load. Each category subfolder now contains a small utils.py shim that prepends the parent directory to sys.path and re-exports the shared helpers, so the existing !function references resolve transparently. No YAML or top-level utils.py changes required. generate_task_yamls.py also writes the shim alongside each subfolder so regenerations stay consistent.
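The shim idea can be sketched as below. This is an illustration, not the PR's code. The commit describes prepending the parent directory to `sys.path`; this sketch instead loads the parent `utils.py` by explicit file path with `importlib`, because both files share the name `utils` and a path-based load avoids the module-name clash. The alias `hd_epic_shared_utils` and both helper names are assumptions.

```python
import importlib.util
import sys
from pathlib import Path
from types import ModuleType


def load_parent_utils(shim_path: str, alias: str = "hd_epic_shared_utils") -> ModuleType:
    """Load ../utils.py (relative to the shim file) under a distinct module name."""
    parent_utils = Path(shim_path).resolve().parent.parent / "utils.py"
    spec = importlib.util.spec_from_file_location(alias, parent_utils)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {parent_utils}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[alias] = module  # register before executing, as importlib recommends
    spec.loader.exec_module(module)
    return module


def reexport_public_names(module: ModuleType, namespace: dict) -> None:
    """Copy every public attribute of `module` into `namespace`."""
    for name in dir(module):
        if not name.startswith("_"):
            namespace[name] = getattr(module, name)


# Inside a per-category shim (for example <category>/utils.py), usage would be roughly:
#   _shared = load_parent_utils(__file__)
#   reexport_public_names(_shared, globals())
```

After the two shim lines run, a reference such as `!function utils.filter_X` in a YAML in the same subfolder would find `filter_X` in the shim's namespace, provided the top-level `utils.py` defines it.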

Pure input seek (-ss before -i) snaps to the nearest keyframe, which can start a clip several seconds early. This caused the model to see a different time window than the question intended, dropping accuracy on ingredient_ingredient_weight from 26% to 22% (~1 question per 12).

Pure output seek (-ss after -i) is frame-accurate but decodes from the start of the file, making extraction 10-20x slower on the long HD-EPIC recordings (36 min for 50 questions vs ~2 min with input seek).

Switch to two-pass seek: fast keyframe-aligned input seek to ~2s before the target, then a short precise output seek for the remaining offset. Frame accuracy is equivalent to pure output seek; extraction time is equivalent to input seek. Validated: accuracy returns to 26% (matching R3 baseline, within SE) at 6:38 total for 50 questions.

Also update the docstring for _extract_clip to document the strategy, caching behaviour, and fallback path.
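The two-pass seek described above can be sketched as an ffmpeg argument builder. This is an illustration under stated assumptions, not the PR's `_extract_clip`. The function name, the 2-second default margin, and the choice to keep the libx264/ultrafast/crf 23 settings from the earlier commit are all assumptions. The flags themselves (`-ss`, `-i`, `-t`, `-c:v`, `-preset`, `-crf`, `-y`) are standard ffmpeg options.

```python
def build_two_pass_seek_cmd(src: str, dst: str, start_s: float, end_s: float,
                            margin_s: float = 2.0) -> list[str]:
    """Build an ffmpeg command that cuts [start_s, end_s) from src into dst.

    Pass 1 (input seek, before -i): jump quickly to `margin_s` seconds before
    the target start. Pass 2 (output seek, after -i): decode and discard only
    the remaining offset, so the expensive precise seek covers ~margin_s
    seconds instead of the whole file.
    """
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    coarse = max(0.0, start_s - margin_s)  # fast input seek position
    fine = start_s - coarse                # residual offset for the output seek
    duration = end_s - start_s
    return [
        "ffmpeg", "-y",
        "-ss", f"{coarse:.3f}",   # input seek: applied before opening the file
        "-i", src,
        "-ss", f"{fine:.3f}",     # output seek: relative to the input seek point
        "-t", f"{duration:.3f}",  # clip length
        "-c:v", "libx264", "-preset", "ultrafast", "-crf", "23",
        dst,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. For a clip starting at 125.5 s, the input seek lands at 123.5 s and the output seek covers the remaining 2 s.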

aliazani · 2026-05-06T09:46:00Z

> Hi, the tasks LGTM. It would be better if you can split the yaml into sub folders so that the categories are clearer for the user. Thanks

@kcz358
Done, pushed in the latest commits. Each of the 7 HD-EPIC categories now has its own subfolder under tasks/hd_epic/ containing its per-prototype YAMLs and category group YAML. _hd_epic_base.yaml and the master group YAML stay at the top level. generate_task_yamls.py has been updated to emit this layout going forward.

One thing worth noting: lmms-eval resolves !function module names relative to each YAML's own directory, so moving the YAMLs one level deeper broke !function utils.filter_* lookups. Fixed by adding a small utils.py shim to each category subfolder that re-exports the shared helpers from the top-level utils.py. No YAML or top-level code changes were required.

Also took the opportunity to add a "Repository layout" section to the README documenting the new structure, and fixed a stale note in the Notes section (clip extraction now uses two-pass ffmpeg seek rather than -c copy).

Task names and --tasks invocations are unchanged throughout. Let me know if anything needs adjusting!

Original Benchmark link: https://github.com/hd-epic/hd-epic-vqa-eval/tree/main

aliazani and others added 2 commits April 30, 2026 19:23

fix: use libx264 re-encoding in _extract_clip for frame-exact trimming

87097c7

kcz358 reviewed May 6, 2026

aliazani added 3 commits May 6, 2026 09:13

fix cluster_key to include time windows, not just video ids

a5a3700

aliazani wants to merge 6 commits into EvolvingLMMs-Lab:main from aliazani:feat/hd-epic-benchmark
