feat: add HD-EPIC VQA benchmark (CVPR 2025) #1316

Open
aliazani wants to merge 6 commits into EvolvingLMMs-Lab:main from aliazani:feat/hd-epic-benchmark

@aliazani

Summary

  • Adds HD-EPIC VQA benchmark (Perrett et al., CVPR 2025), a 26,550-question egocentric kitchen video QA benchmark across 30 prototypes and 7 categories
  • Provides per-prototype, per-category, and full benchmark task configs runnable via --tasks hd_epic_<prototype>, --tasks hd_epic_<category>, or --tasks hd_epic
  • Validated against published zero-shot Qwen2.5-VL-7B baseline from the HD-EPIC challenge community report

In scope

  • 30 per-prototype YAML task configs + 7 category group YAMLs + 1 master group YAML (hd_epic)
  • utils.py with doc_to_messages, doc_to_visual, doc_to_text, process_results, doc_to_target, and <TIME>/<BBOX> tag resolution
  • hd_epic_to_hf.py converter from official annotation JSONs to lmms-eval JSONL format
  • generate_task_yamls.py script for regenerating all YAMLs
  • README.md with setup instructions and validation results

Out of scope

  • Hosting videos or JSONL on HuggingFace (videos are 41 hours; JSONL is generated locally from official annotations)
  • Fine-tuned model evaluation
  • Changes to any existing tasks or models

Validation

  • python -m lmms_eval --model qwen2_5_vl --model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct,fps=1,max_num_frames=32,min_pixels=50176,max_pixels=50176 --tasks hd_epic_ingredient_ingredient_weight --batch_size 1 | sample size: N=50 | key metrics: accuracy | result: pass (26%, R3 community report baseline: ~28%, within SE of 6pp)
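
The "within SE of 6pp" claim can be checked with the usual binomial standard error for a proportion. The sketch below assumes the 26% accuracy and N=50 quoted above and treats the ~28% baseline as a fixed reference value; the function name `binomial_se` is illustrative and not part of the PR.

```python
import math


def binomial_se(p_hat: float, n: int) -> float:
    """Standard error of an accuracy estimate p_hat measured on n samples."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n)


# Figures quoted in the validation line above.
observed, baseline, n = 0.26, 0.28, 50

se = binomial_se(observed, n)
gap = abs(observed - baseline)

print(f"SE = {se * 100:.1f} pp, gap = {gap * 100:.1f} pp, within one SE: {gap <= se}")
# → SE = 6.2 pp, gap = 2.0 pp, within one SE: True
```

With only 50 samples the standard error is about three times the gap to the baseline, so this run is a sanity check rather than a precise reproduction.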

Risk / Compatibility

  • No changes to existing tasks, models, or shared infrastructure
  • Requires ffmpeg in PATH and HD_EPIC_VIDEO_DIR env var set at eval time; gracefully falls back to full video if clip extraction fails
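
A small preflight check can surface the two runtime requirements before a long evaluation starts. This is a minimal sketch, not code from the PR: the function name `missing_prerequisites` and its injectable `env` and `which` parameters are assumptions made for illustration; only `ffmpeg` and `HD_EPIC_VIDEO_DIR` come from the description above.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of problems; an empty list means the environment looks ready.

    `env` and `which` are injectable so the check can be exercised without
    touching the real environment (hypothetical helper, not part of the PR).
    """
    env = os.environ if env is None else env
    problems = []

    # ffmpeg must be discoverable on PATH for clip extraction.
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")

    # HD_EPIC_VIDEO_DIR must be set and point at an existing directory.
    video_dir = env.get("HD_EPIC_VIDEO_DIR")
    if not video_dir:
        problems.append("HD_EPIC_VIDEO_DIR is not set")
    elif not os.path.isdir(video_dir):
        problems.append(f"HD_EPIC_VIDEO_DIR is not a directory: {video_dir}")

    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("warning:", problem)
```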

Type of Change

  • New benchmark/task

aliazani and others added 2 commits April 30, 2026 19:23
- 30 question prototypes across 7 categories (Recipe, Ingredient,
  Nutrition, Fine-grained Actions, 3D Perception, Object Motion, Gaze)
- 26,550 multiple-choice questions from 41 hours of egocentric video
- Runnable per-prototype, per-category, or full benchmark
- Validated: Qwen2.5-VL-7B 26% on Ingredient Weight (R3 report: ~28%)

-c copy snaps to nearest keyframe, causing clips to start early.
Replace with -c:v libx264 -preset ultrafast -crf 23 for exact cuts.
Collaborator

@kcz358 kcz358 left a comment

Hi, the tasks LGTM. It would be better if you can split the yaml into sub folders so that the categories are clearer for the user. Thanks

aliazani added 3 commits May 6, 2026 09:13
Split the 30 per-prototype YAMLs and the 7 category-group YAMLs into one subfolder per HD-EPIC category (recipe/, ingredient/, nutrition/, fine_grained/, 3d_perception/, object_motion/, gaze/) so the category structure is visible at a glance. _hd_epic_base.yaml and the master group YAML stay at the top level.

Per-prototype YAMLs now use `include: ../_hd_epic_base.yaml`. Task names, group wiring, and --tasks invocations are unchanged.

Also: update generate_task_yamls.py to emit the new layout; add a Repository layout section to the README; fix stale clip extraction note (-c copy → libx264/ultrafast/crf23).
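
The layout described in this commit can be sketched as a small generator. This is an illustration only, not the PR's generate_task_yamls.py: the file naming scheme and the two-key YAML body are assumptions (the real configs carry more fields), while the seven folder names, the `include: ../_hd_epic_base.yaml` line, and the `hd_epic_<prototype>` task naming come from the text above.

```python
from pathlib import Path

# Category subfolders named in the commit message.
CATEGORY_DIRS = [
    "recipe", "ingredient", "nutrition", "fine_grained",
    "3d_perception", "object_motion", "gaze",
]


def prototype_yaml(prototype: str) -> str:
    """Minimal per-prototype config body (assumed shape, real files have more keys)."""
    return f"include: ../_hd_epic_base.yaml\ntask: hd_epic_{prototype}\n"


def write_prototype_yaml(root, category: str, prototype: str) -> Path:
    """Write one prototype config into its category subfolder and return the path."""
    if category not in CATEGORY_DIRS:
        raise ValueError(f"unknown category: {category}")
    folder = Path(root) / category
    folder.mkdir(parents=True, exist_ok=True)
    # Hypothetical file name; the PR does not state the naming scheme.
    path = folder / f"hd_epic_{prototype}.yaml"
    path.write_text(prototype_yaml(prototype))
    return path
```

Because the base file stays one level up, every per-prototype file uses the same relative include regardless of category.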

lmms-eval resolves !function module names relative to each YAML's own
directory. After moving per-prototype YAMLs into <category>/ subfolders,
references like `!function utils.filter_X` could no longer find the
top-level utils.py and threw ImportError on task load.

Each category subfolder now contains a small utils.py shim that prepends
the parent directory to sys.path and re-exports the shared helpers, so
the existing !function references resolve transparently. No YAML or
top-level utils.py changes required.

generate_task_yamls.py also writes the shim alongside each subfolder so
regenerations stay consistent.
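
One way to picture the re-export step is the sketch below. It is a variant, not the PR's shim: the commit describes prepending the parent directory to sys.path, whereas this version loads the parent utils.py by file path with importlib, which avoids any ambiguity between two files both named utils.py. The helper name `reexport_parent_utils` and the internal module name are hypothetical.

```python
import importlib.util
from pathlib import Path


def reexport_parent_utils(shim_file, namespace, module_name="_hd_epic_shared_utils"):
    """Load ../utils.py relative to `shim_file` and copy its public names.

    `namespace` is normally the shim's own globals(), so that lookups such as
    `utils.filter_X` made against the shim find the shared helper.
    """
    parent_utils = Path(shim_file).resolve().parent.parent / "utils.py"

    # Load the shared module under a distinct name to avoid clashing with the shim.
    spec = importlib.util.spec_from_file_location(module_name, parent_utils)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # Re-export everything that does not start with an underscore.
    for name, value in vars(module).items():
        if not name.startswith("_"):
            namespace[name] = value
    return module
```

Inside a category subfolder's utils.py, the call would be `reexport_parent_utils(__file__, globals())`.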

Pure input seek (-ss before -i) snaps to the nearest keyframe, which
can start a clip several seconds early. This caused the model to see
a different time window than the question intended, dropping accuracy
on ingredient_ingredient_weight from 26% to 22% (2 of the 50 sampled questions).

Pure output seek (-ss after -i) is frame-accurate but decodes from
the start of the file, making extraction 10-20x slower on the long
HD-EPIC recordings (36 min for 50 questions vs ~2 min with input seek).

Switch to two-pass seek: fast keyframe-aligned input seek to ~2s
before the target, then a short precise output seek for the remaining
offset. Frame accuracy is equivalent to pure output seek; extraction
time is equivalent to input seek. Validated: accuracy returns to 26%
(matching R3 baseline, within SE) at 6:38 total for 50 questions.

Also update the docstring for _extract_clip to document the strategy,
caching behaviour, and fallback path.
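
The two-pass strategy can be sketched as a command builder. This is not the PR's _extract_clip: the function name `build_clip_cmd`, its parameters, and the exact 2-second margin constant are assumptions based on the commit message, and the encoder flags are the ones named in the earlier commit. Only standard ffmpeg options (-ss, -i, -t, -c:v, -preset, -crf) are used.

```python
def build_clip_cmd(src, dst, start, end, margin=2.0):
    """Build an ffmpeg argument list for a frame-accurate clip [start, end).

    Pass 1: -ss before -i jumps quickly to a keyframe at or before `coarse`.
    Pass 2: -ss after -i decodes and discards the short remainder `fine`,
            so the output begins at `start` without decoding the whole file.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    coarse = max(0.0, start - margin)   # fast, keyframe-aligned input seek
    fine = start - coarse               # short, precise output seek
    duration = end - start
    return [
        "ffmpeg",
        "-ss", f"{coarse:.3f}", "-i", str(src),
        "-ss", f"{fine:.3f}", "-t", f"{duration:.3f}",
        # Re-encode so the cut does not depend on keyframe positions.
        "-c:v", "libx264", "-preset", "ultrafast", "-crf", "23",
        str(dst),
    ]


# Example: a 5-second clip starting at 10 s seeks to 8 s, then skips 2 s.
print(" ".join(build_clip_cmd("in.mp4", "out.mp4", 10.0, 15.0)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Clips that start within the first `margin` seconds fall back to a coarse seek of 0, which makes the whole offset a precise output seek.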
@aliazani
Author

aliazani commented May 6, 2026

> Hi, the tasks LGTM. It would be better if you can split the yaml into sub folders so that the categories are clearer for the user. Thanks

@kcz358
Done — pushed in the latest commits. Each of the 7 HD-EPIC categories now has its own subfolder under tasks/hd_epic/ containing its per-prototype YAMLs and category group YAML. _hd_epic_base.yaml and the master group YAML stay at the top level. generate_task_yamls.py has been updated to emit this layout going forward.

One thing worth noting: lmms-eval resolves !function module names relative to each YAML's own directory, so moving the YAMLs one level deeper broke !function utils.filter_* lookups. Fixed by adding a small utils.py shim to each category subfolder that re-exports the shared helpers from the top-level utils.py — no YAML or top-level code changes required.

Also took the opportunity to add a "Repository layout" section to the README documenting the new structure, and fixed a stale note in the Notes section (clip extraction now uses two-pass ffmpeg seek rather than -c copy).

Task names and --tasks invocations are unchanged throughout. Let me know if anything needs adjusting!

Original Benchmark link: https://github.com/hd-epic/hd-epic-vqa-eval/tree/main
