1 change: 1 addition & 0 deletions .gitignore
@@ -84,3 +84,4 @@ CLAUDE.md
Bagel/
MMaDA/
.codex
lmms_eval/tasks/hd_epic/hd_epic_questions.jsonl
@@ -0,0 +1,10 @@
# HD-EPIC '3d_perception' category -- bundles its 4 prototypes.
# Run all of them with --tasks hd_epic_3d_perception
group: hd_epic_3d_perception
task:
- hd_epic_3d_perception_fixture_location
- hd_epic_3d_perception_object_location
- hd_epic_3d_perception_object_contents_retrieval
- hd_epic_3d_perception_fixture_interaction_counting
metadata:
version: 1.0
@@ -0,0 +1,13 @@
# HD-EPIC subtask: 3d_perception_fixture_interaction_counting
# "How many times did I close the item indicated by <BBOX> in TIME?"
include: ../_hd_epic_base.yaml

task: hd_epic_3d_perception_fixture_interaction_counting

# Filter the combined JSONL down to this prototype's rows.
process_docs: !function utils.filter_3d_perception_fixture_interaction_counting

metadata:
version: 1.0
task_type: 3d_perception_fixture_interaction_counting
description: "\"How many times did I close the item indicated by <BBOX> in TIME?\""
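The `process_docs` hooks referenced in these YAMLs live in the shared `utils.py`. A minimal sketch of one such filter, assuming `process_docs` receives a `datasets.Dataset` and that rows carry a `task_type` column matching the `metadata.task_type` values (the column name is an assumption, not confirmed by this diff):

```python
import datasets


def filter_3d_perception_fixture_interaction_counting(dataset: datasets.Dataset) -> datasets.Dataset:
    # Keep only the rows that belong to this prototype (column name assumed).
    return dataset.filter(lambda doc: doc["task_type"] == "3d_perception_fixture_interaction_counting")
```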
@@ -0,0 +1,13 @@
# HD-EPIC subtask: 3d_perception_fixture_location
# "Given the direction I am looking at TIME, where is the X located?" (clock-face directions: 1 o'clock ... 12 o'clock)
include: ../_hd_epic_base.yaml

task: hd_epic_3d_perception_fixture_location

# Filter the combined JSONL down to this prototype's rows.
process_docs: !function utils.filter_3d_perception_fixture_location

metadata:
version: 1.0
task_type: 3d_perception_fixture_location
description: "\"Given the direction I am looking at TIME, where is the X located?\" (clock-face directions: 1 o'clock ... 12 o'clock)"
@@ -0,0 +1,13 @@
# HD-EPIC subtask: 3d_perception_object_contents_retrieval
# "Which of these objects did the person put in/on the item indicated by <BBOX> in TIME?"
include: ../_hd_epic_base.yaml

task: hd_epic_3d_perception_object_contents_retrieval

# Filter the combined JSONL down to this prototype's rows.
process_docs: !function utils.filter_3d_perception_object_contents_retrieval

metadata:
version: 1.0
task_type: 3d_perception_object_contents_retrieval
description: "\"Which of these objects did the person put in/on the item indicated by <BBOX> in TIME?\""
@@ -0,0 +1,13 @@
# HD-EPIC subtask: 3d_perception_object_location
# "Where did I put the object identified by <BBOX> at TIME after taking it at TIME?"
include: ../_hd_epic_base.yaml

task: hd_epic_3d_perception_object_location

# Filter the combined JSONL down to this prototype's rows.
process_docs: !function utils.filter_3d_perception_object_location

metadata:
version: 1.0
task_type: 3d_perception_object_location
description: "\"Where did I put the object identified by <BBOX> at TIME after taking it at TIME?\""
15 changes: 15 additions & 0 deletions lmms_eval/tasks/hd_epic/3d_perception/utils.py
@@ -0,0 +1,15 @@
"""
Shim: re-export the shared HD-EPIC utils so YAMLs in this subfolder can
reference !function utils.filter_* without a path prefix.

lmms-eval resolves !function module names relative to each YAML's own
directory, so each category subfolder needs `utils` reachable from here.
"""
import os
import sys

_PARENT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if _PARENT_DIR not in sys.path:
sys.path.insert(0, _PARENT_DIR)

from utils import * # noqa: F401,F403,E402
161 changes: 161 additions & 0 deletions lmms_eval/tasks/hd_epic/README.md
@@ -0,0 +1,161 @@
# HD-EPIC VQA Benchmark

[HD-EPIC](https://hd-epic.github.io/) (A Highly-Detailed Egocentric Video Dataset) is a video question answering benchmark from Perrett et al., CVPR 2025. It covers egocentric kitchen activities with 30 question prototypes across 7 categories, generating 26,550 multiple-choice questions from 41 hours of video.

## Categories and Prototypes

| Category | Prototypes | Count |
|---|---|---|
| Recipe | Recipe Recognition, Multi-Recipe Recognition, Multi-Step Localisation, Step Localisation, Prep Localisation, Step Recognition, Rough Step Localisation, Following Activity Recognition | 8 |
| Ingredient | Ingredient Retrieval, Ingredient Weight, Ingredients Order, Ingredient Adding Localisation, Ingredient Recognition, Exact Ingredient Recognition | 6 |
| Nutrition | Image Nutrition Estimation, Nutrition Change, Video Nutrition Estimation | 3 |
| Fine-grained Actions | Action Recognition, How Recognition, Why Recognition, Action Localisation | 4 |
| 3D Perception | Fixture Location, Object Location, Object Contents Retrieval, Fixture Interaction Counting | 4 |
| Object Motion | Object Movement Itinerary, Object Movement Counting, Stationary Object Localisation | 3 |
| Gaze | Gaze Estimation, Interaction Anticipation | 2 |

## Repository layout

The 30 prototype YAMLs are organised into one subfolder per HD-EPIC category:

```
lmms_eval/tasks/hd_epic/
├── _hd_epic_base.yaml # shared base config
├── _group_hd_epic.yaml # master group (all 30 prototypes)
├── utils.py
├── hd_epic_to_hf.py
├── recipe/ # 8 prototypes + category group
├── ingredient/ # 6 prototypes + category group
├── nutrition/ # 3 prototypes + category group
├── fine_grained/ # 4 prototypes + category group
├── 3d_perception/ # 4 prototypes + category group
├── object_motion/ # 3 prototypes + category group
└── gaze/ # 2 prototypes + category group
```

Each category subfolder contains a `_group_hd_epic_<category>.yaml` (the
category group definition) plus the per-prototype YAMLs for that category.
Task names are flat — e.g. `hd_epic_gaze_gaze_estimation` — so `--tasks`
invocations don't need to know about the folder layout.

## Setup

### 1. Download videos

Follow the instructions at [hd-epic.github.io](https://hd-epic.github.io/) to download the dataset videos. Videos should be organised as:

```
/path/to/videos/
P01/
P01-20240427-151808.mp4
...
P02/
...
```

### 2. Download annotations

```bash
git clone https://github.com/hd-epic/hd-epic-annotations
```

### 3. Convert annotations to JSONL

```bash
python lmms_eval/tasks/hd_epic/hd_epic_to_hf.py \
--questions-dir hd-epic-annotations/vqa-benchmark \
--output lmms_eval/tasks/hd_epic/hd_epic_questions.jsonl \
--video-dir /path/to/videos
```
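
For orientation, a converted row might look roughly like the following. This is illustrative only: `video_dir` and `cluster_key` appear in this PR's configs, but the other field names are assumptions about the converter's output, not taken from `hd_epic_to_hf.py` itself.

```json
{"question": "How many times did I close the item indicated by <BBOX> in TIME?", "choices": ["A. 1", "B. 2", "C. 3", "D. 4", "E. 5"], "answer": "B", "videos": ["P01/P01-20240427-151808.mp4"], "video_dir": "/path/to/videos", "task_type": "3d_perception_fixture_interaction_counting", "cluster_key": "P01-20240427-151808"}
```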

### 4. Set the video directory environment variable

```bash
export HD_EPIC_VIDEO_DIR=/path/to/videos
```

This environment variable overrides the `video_dir` field baked into the JSONL at conversion time, so you can move videos without reconverting.
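
A minimal sketch of that override pattern (the general idea only, not the actual code in `utils.py`; the `videos` field name is an assumption):

```python
import os


def resolve_video_path(doc: dict) -> str:
    # HD_EPIC_VIDEO_DIR, when set, wins over the video_dir baked into the JSONL.
    video_dir = os.environ.get("HD_EPIC_VIDEO_DIR", doc["video_dir"])
    return os.path.join(video_dir, doc["videos"][0])
```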

## Running Evaluations

### Full benchmark (all 30 prototypes)

```bash
python -m lmms_eval \
--model qwen2_5_vl \
--model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
--tasks hd_epic \
--batch_size 1 \
--output_path ./logs/hd_epic_full
```

### By category

```bash
# e.g. recipe category only
python -m lmms_eval \
--model qwen2_5_vl \
--model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
--tasks hd_epic_recipe \
--batch_size 1 \
--output_path ./logs/hd_epic_recipe
```

Available category tasks: `hd_epic_recipe`, `hd_epic_ingredient`, `hd_epic_nutrition`, `hd_epic_fine_grained`, `hd_epic_3d_perception`, `hd_epic_object_motion`, `hd_epic_gaze`.

### Single prototype

```bash
# e.g. gaze estimation only
python -m lmms_eval \
--model qwen2_5_vl \
--model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
--tasks hd_epic_gaze_gaze_estimation \
--batch_size 1 \
--output_path ./logs/hd_epic_gaze_estimation
```

## Validation Results

Validated using Qwen2.5-VL-7B-Instruct with settings matching the R3 community report (Zhang et al., 2025): `fps=1, max_num_frames=32, min_pixels=50176, max_pixels=50176` (224×224 per frame).

```bash
python -m lmms_eval \
--model qwen2_5_vl \
--model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct,fps=1,max_num_frames=32,min_pixels=50176,max_pixels=50176 \
--tasks hd_epic_ingredient_ingredient_weight \
--batch_size 1 \
--output_path ./logs/qwen_r3_match
```

| Prototype | R3 report (Zhang et al.) | This integration | Δ |
|---|---|---|---|
| Ingredient Weight | ~28% | 26% | -2pp (within SE) |

The 2pp difference is within statistical noise for n=50 questions: SE = √(p(1−p)/n) = √(0.26 × 0.74 / 50) ≈ 0.062, i.e. about 6pp. Results are consistent with the published zero-shot Qwen2.5-VL-7B baseline.

**Note on frame sampling:** Higher frame counts or resolution improve accuracy on short-clip prototypes. The default lmms-eval Qwen2.5-VL settings (32 frames, default pixel budget) give ~40% on Ingredient Weight — above R3's 28% because more frames are sampled per second for short clips. To reproduce R3's numbers exactly, use `fps=1,max_num_frames=32,min_pixels=50176,max_pixels=50176`.

## Notes

- **BBOX coordinates**: Normalised from the native 1408×1408 Project Aria RGB frame to a 1000×1000 grid, matching the official HD-EPIC eval protocol (see the sketch after this list).
- **TIME tags**: Made relative to clip start, consistent with the original `hd-epic-vqa-eval` repository.
- **Multi-video questions**: Videos are passed in order as separate `{"type": "video", "url": ...}` blocks in the user message (also shown in the sketch below).
- **JSONL file**: `hd_epic_questions.jsonl` is generated locally and should not be committed to the repository (it is listed in `.gitignore`).
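
A short sketch of the first and third conventions above (`normalize_bbox` is a hypothetical helper name, and the `"text"` content block is an assumption alongside the `"video"` blocks described in the notes):

```python
def normalize_bbox(x1, y1, x2, y2, native=1408, target=1000):
    # Scale Aria-native pixel coordinates (1408x1408) onto the 1000x1000
    # grid used by the official HD-EPIC eval protocol.
    scale = target / native
    return [round(v * scale) for v in (x1, y1, x2, y2)]


# Multi-video question: one video block per clip, in order, then the prompt text.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "/path/to/videos/P01/clip_a.mp4"},
            {"type": "video", "url": "/path/to/videos/P01/clip_b.mp4"},
            {"type": "text", "text": "Where did I put the object identified by <BBOX> at TIME after taking it at TIME?"},
        ],
    }
]
```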

## Citation

```bibtex
@inproceedings{perrett2025hdepic,
title = {{HD-EPIC}: A Highly-Detailed Egocentric Video Dataset},
author = {Perrett, Toby and Darkhalil, Ahmad and Sinha, Saptarshi and
Emara, Omar and Pollard, Sam and Parida, Kranti and Liu, Kaiting and
Gatti, Prajwal and Bansal, Siddhant and Flanagan, Kevin and
Chalk, Jacob and Zhu, Zhifan and Guerrier, Rhodri and
Abdelazim, Fahd and Zhu, Bin and Moltisanti, Davide and
Wray, Michael and Doughty, Hazel and Damen, Dima},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR)},
year = {2025}
}
```
15 changes: 15 additions & 0 deletions lmms_eval/tasks/hd_epic/_group_hd_epic.yaml
@@ -0,0 +1,15 @@
# HD-EPIC -- top-level group covering all 30 prototypes.
# Use --tasks hd_epic to evaluate the full benchmark.
# The 7 category sub-groups are also runnable individually,
# e.g. --tasks hd_epic_recipe, --tasks hd_epic_gaze.
group: hd_epic
task:
- hd_epic_recipe
- hd_epic_ingredient
- hd_epic_nutrition
- hd_epic_fine_grained
- hd_epic_3d_perception
- hd_epic_object_motion
- hd_epic_gaze
metadata:
version: 1.0
55 changes: 55 additions & 0 deletions lmms_eval/tasks/hd_epic/_hd_epic_base.yaml
@@ -0,0 +1,55 @@
# HD-EPIC base task configuration.
# All per-task YAMLs inherit from this base via `include`.
#
# The dataset is expected to be loaded from a local JSONL file that was produced
# by running `hd_epic_to_hf.py`, OR from a HuggingFace Hub dataset.
#
# Required setup:
# 1. Run `python hd_epic_to_hf.py` to generate the JSONL dataset files, OR
# set `dataset_path` below to your HuggingFace Hub dataset ID.
# 2. Export HD_EPIC_VIDEO_DIR=/path/to/your/videos
# 3. Run lmms-eval:
# python -m lmms_eval \
# --model qwen2_5_vl \
# --model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
# --tasks hd_epic \
# --batch_size 1 \
# --include_path /path/to/this/folder

dataset_path: json # use 'json' for local JSONL files
# OR replace with your HF Hub ID, e.g.:
# hd-epic/hd-epic-benchmark
dataset_kwargs:
data_files:
test: lmms_eval/tasks/hd_epic/hd_epic_questions.jsonl # path to the output of hd_epic_to_hf.py, relative to the repo root; adjust if you generated it elsewhere

test_split: test
output_type: generate_until

# Preferred: chat-model interface (doc_to_messages)
doc_to_messages: !function utils.hd_epic_doc_to_messages

# Legacy fallback (simple models using doc_to_visual + doc_to_text)
doc_to_visual: !function utils.hd_epic_doc_to_visual
doc_to_text: !function utils.hd_epic_doc_to_text
doc_to_target: !function utils.hd_epic_doc_to_target

process_results: !function utils.hd_epic_process_results

metric_list:
- metric: accuracy
aggregation: !function utils.hd_epic_aggregate_accuracy
higher_is_better: true

generation_kwargs:
max_new_tokens: 10
temperature: 0
do_sample: false
stop_sequences:
- "\n"

cluster_key: cluster_key


metadata:
version: 1.0
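For reference, a hedged sketch of what the `process_results` / aggregation pair wired up above could look like; the `answer` field name is an assumption, and the real `utils.py` may differ:

```python
def hd_epic_process_results(doc, results):
    # results[0] is the raw generation; compare its first letter to the answer key.
    pred = results[0].strip().upper()[:1]
    return {"accuracy": {"correct": pred == doc["answer"], "cluster_key": doc["cluster_key"]}}


def hd_epic_aggregate_accuracy(items):
    # Plain micro-average over all answered questions.
    return sum(item["correct"] for item in items) / max(len(items), 1)
```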
@@ -0,0 +1,10 @@
# HD-EPIC 'fine_grained' category -- bundles its 4 prototypes.
# Run all of them with --tasks hd_epic_fine_grained
group: hd_epic_fine_grained
task:
- hd_epic_fine_grained_action_recognition
- hd_epic_fine_grained_how_recognition
- hd_epic_fine_grained_why_recognition
- hd_epic_fine_grained_action_localization
metadata:
version: 1.0
@@ -0,0 +1,13 @@
# HD-EPIC subtask: fine_grained_action_localization
# "When did the action <X> happen in the video?" (pick from 5 time-range choices)
include: ../_hd_epic_base.yaml

task: hd_epic_fine_grained_action_localization

# Filter the combined JSONL down to this prototype's rows.
process_docs: !function utils.filter_fine_grained_action_localization

metadata:
version: 1.0
task_type: fine_grained_action_localization
description: "\"When did the action <X> happen in the video?\" (pick from 5 time-range choices)"
@@ -0,0 +1,13 @@
# HD-EPIC subtask: fine_grained_action_recognition
# "Which of these sentences best describe the ongoing action(s) in the video?"
include: ../_hd_epic_base.yaml

task: hd_epic_fine_grained_action_recognition

# Filter the combined JSONL down to this prototype's rows.
process_docs: !function utils.filter_fine_grained_action_recognition

metadata:
version: 1.0
task_type: fine_grained_action_recognition
description: "\"Which of these sentences best describe the ongoing action(s) in the video?\""
@@ -0,0 +1,13 @@
# HD-EPIC subtask: fine_grained_how_recognition
# "What is the best description for HOW the person carried out the action <X>?"
include: ../_hd_epic_base.yaml

task: hd_epic_fine_grained_how_recognition

# Filter the combined JSONL down to this prototype's rows.
process_docs: !function utils.filter_fine_grained_how_recognition

metadata:
version: 1.0
task_type: fine_grained_how_recognition
description: "\"What is the best description for HOW the person carried out the action <X>?\""
@@ -0,0 +1,13 @@
# HD-EPIC subtask: fine_grained_why_recognition
# "What is the best description for WHY the person performed the action <X>?"
include: ../_hd_epic_base.yaml

task: hd_epic_fine_grained_why_recognition

# Filter the combined JSONL down to this prototype's rows.
process_docs: !function utils.filter_fine_grained_why_recognition

metadata:
version: 1.0
task_type: fine_grained_why_recognition
description: "\"What is the best description for WHY the person performed the action <X>?\""