1 change: 1 addition & 0 deletions .gitignore
@@ -84,3 +84,4 @@ CLAUDE.md
Bagel/
MMaDA/
.codex
lmms_eval/tasks/hd_epic/hd_epic_questions.jsonl
@@ -0,0 +1,10 @@
# HD-EPIC '3d_perception' category -- bundles its 4 prototypes.
# Run all of them with --tasks hd_epic_3d_perception
group: hd_epic_3d_perception
task:
- hd_epic_3d_perception_fixture_location
- hd_epic_3d_perception_object_location
- hd_epic_3d_perception_object_contents_retrieval
- hd_epic_3d_perception_fixture_interaction_counting
metadata:
version: 1.0
@@ -0,0 +1,13 @@
# HD-EPIC subtask: 3d_perception_fixture_interaction_counting
# "How many times did I close the item indicated by <BBOX> in TIME?"
include: ../_hd_epic_base.yaml

task: hd_epic_3d_perception_fixture_interaction_counting

# Filter the combined JSONL down to this prototype's rows.
process_docs: !function utils.filter_3d_perception_fixture_interaction_counting

metadata:
version: 1.0
task_type: 3d_perception_fixture_interaction_counting
description: "\"How many times did I close the item indicated by <BBOX> in TIME?\""
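The `process_docs` hooks referenced in these YAMLs live in the shared `utils.py`. A minimal sketch of one such filter, assuming `process_docs` receives a `datasets.Dataset` and that rows carry a `task_type` column matching the `metadata.task_type` values (the column name is an assumption, not confirmed by this diff):

```python
import datasets


def filter_3d_perception_fixture_interaction_counting(dataset: datasets.Dataset) -> datasets.Dataset:
    # Keep only the rows that belong to this prototype (column name assumed).
    return dataset.filter(lambda doc: doc["task_type"] == "3d_perception_fixture_interaction_counting")
```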
@@ -0,0 +1,13 @@
# HD-EPIC subtask: 3d_perception_fixture_location
# "Given the direction I am looking at TIME, where is the X located?" (clock-face directions: 1 o'clock ... 12 o'clock)
include: ../_hd_epic_base.yaml

task: hd_epic_3d_perception_fixture_location

# Filter the combined JSONL down to this prototype's rows.
process_docs: !function utils.filter_3d_perception_fixture_location

metadata:
version: 1.0
task_type: 3d_perception_fixture_location
description: "\"Given the direction I am looking at TIME, where is the X located?\" (clock-face directions: 1 o'clock ... 12 o'clock)"
@@ -0,0 +1,13 @@
# HD-EPIC subtask: 3d_perception_object_contents_retrieval
# "Which of these objects did the person put in/on the item indicated by <BBOX> in TIME?"
include: ../_hd_epic_base.yaml

task: hd_epic_3d_perception_object_contents_retrieval

# Filter the combined JSONL down to this prototype's rows.
process_docs: !function utils.filter_3d_perception_object_contents_retrieval

metadata:
version: 1.0
task_type: 3d_perception_object_contents_retrieval
description: "\"Which of these objects did the person put in/on the item indicated by <BBOX> in TIME?\""
@@ -0,0 +1,13 @@
# HD-EPIC subtask: 3d_perception_object_location
# "Where did I put the object identified by <BBOX> at TIME after taking it at TIME?"
include: ../_hd_epic_base.yaml

task: hd_epic_3d_perception_object_location

# Filter the combined JSONL down to this prototype's rows.
process_docs: !function utils.filter_3d_perception_object_location

metadata:
version: 1.0
task_type: 3d_perception_object_location
description: "\"Where did I put the object identified by <BBOX> at TIME after taking it at TIME?\""
15 changes: 15 additions & 0 deletions lmms_eval/tasks/hd_epic/3d_perception/utils.py
@@ -0,0 +1,15 @@
"""
Shim: re-export the shared HD-EPIC utils so YAMLs in this subfolder can
reference !function utils.filter_* without a path prefix.

lmms-eval resolves !function module names relative to each YAML's own
directory, so each category subfolder needs `utils` reachable from here.
"""
import os
import sys

_PARENT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if _PARENT_DIR not in sys.path:
sys.path.insert(0, _PARENT_DIR)

from utils import * # noqa: F401,F403,E402
161 changes: 161 additions & 0 deletions lmms_eval/tasks/hd_epic/README.md
@@ -0,0 +1,161 @@
# HD-EPIC VQA Benchmark

[HD-EPIC](https://hd-epic.github.io/) (A Highly-Detailed Egocentric Video Dataset) is a video question answering benchmark from Perrett et al., CVPR 2025. It covers egocentric kitchen activities with 30 question prototypes across 7 categories, generating 26,550 multiple-choice questions from 41 hours of video.

## Categories and Prototypes

| Category | Prototypes | Count |
|---|---|---|
| Recipe | Recipe Recognition, Multi-Recipe Recognition, Multi-Step Localisation, Step Localisation, Prep Localisation, Step Recognition, Rough Step Localisation, Following Activity Recognition | 8 |
| Ingredient | Ingredient Retrieval, Ingredient Weight, Ingredients Order, Ingredient Adding Localisation, Ingredient Recognition, Exact Ingredient Recognition | 6 |
| Nutrition | Image Nutrition Estimation, Nutrition Change, Video Nutrition Estimation | 3 |
| Fine-grained Actions | Action Recognition, How Recognition, Why Recognition, Action Localisation | 4 |
| 3D Perception | Fixture Location, Object Location, Object Contents Retrieval, Fixture Interaction Counting | 4 |
| Object Motion | Object Movement Itinerary, Object Movement Counting, Stationary Object Localisation | 3 |
| Gaze | Gaze Estimation, Interaction Anticipation | 2 |

## Repository layout

The 30 prototype YAMLs are organised into one subfolder per HD-EPIC category:

```
lmms_eval/tasks/hd_epic/
├── _hd_epic_base.yaml # shared base config
├── _group_hd_epic.yaml # master group (all 30 prototypes)
├── utils.py
├── hd_epic_to_hf.py
├── recipe/ # 8 prototypes + category group
├── ingredient/ # 6 prototypes + category group
├── nutrition/ # 3 prototypes + category group
├── fine_grained/ # 4 prototypes + category group
├── 3d_perception/ # 4 prototypes + category group
├── object_motion/ # 3 prototypes + category group
└── gaze/ # 2 prototypes + category group
```

Each category subfolder contains a `_group_hd_epic_<category>.yaml` (the
category group definition) plus the per-prototype YAMLs for that category.
Task names are flat — e.g. `hd_epic_gaze_gaze_estimation` — so `--tasks`
invocations don't need to know about the folder layout.

## Setup

### 1. Download videos

Follow the instructions at [hd-epic.github.io](https://hd-epic.github.io/) to download the dataset videos. Videos should be organised as:

```
/path/to/videos/
P01/
P01-20240427-151808.mp4
...
P02/
...
```

### 2. Download annotations

```bash
git clone https://github.com/hd-epic/hd-epic-annotations
```

### 3. Convert annotations to JSONL

```bash
python lmms_eval/tasks/hd_epic/hd_epic_to_hf.py \
--questions-dir hd-epic-annotations/vqa-benchmark \
--output lmms_eval/tasks/hd_epic/hd_epic_questions.jsonl \
--video-dir /path/to/videos
```
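
For orientation, a converted row might look roughly like the following. This is illustrative only: `video_dir` and `cluster_key` appear in this PR's configs, but the other field names are assumptions about the converter's output, not taken from `hd_epic_to_hf.py` itself.

```json
{"question": "How many times did I close the item indicated by <BBOX> in TIME?", "choices": ["A. 1", "B. 2", "C. 3", "D. 4", "E. 5"], "answer": "B", "videos": ["P01/P01-20240427-151808.mp4"], "video_dir": "/path/to/videos", "task_type": "3d_perception_fixture_interaction_counting", "cluster_key": "P01-20240427-151808"}
```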

### 4. Set the video directory environment variable

```bash
export HD_EPIC_VIDEO_DIR=/path/to/videos
```

This environment variable overrides the `video_dir` field baked into the JSONL at conversion time, so you can move videos without reconverting.
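
A minimal sketch of that override pattern (the general idea only, not the actual code in `utils.py`; the `videos` field name is an assumption):

```python
import os


def resolve_video_path(doc: dict) -> str:
    # HD_EPIC_VIDEO_DIR, when set, wins over the video_dir baked into the JSONL.
    video_dir = os.environ.get("HD_EPIC_VIDEO_DIR", doc["video_dir"])
    return os.path.join(video_dir, doc["videos"][0])
```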

## Running Evaluations

### Full benchmark (all 30 prototypes)

```bash
python -m lmms_eval \
--model qwen2_5_vl \
--model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
--tasks hd_epic \
--batch_size 1 \
--output_path ./logs/hd_epic_full
```

### By category

```bash
# e.g. recipe category only
python -m lmms_eval \
--model qwen2_5_vl \
--model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
--tasks hd_epic_recipe \
--batch_size 1 \
--output_path ./logs/hd_epic_recipe
```

Available category tasks: `hd_epic_recipe`, `hd_epic_ingredient`, `hd_epic_nutrition`, `hd_epic_fine_grained`, `hd_epic_3d_perception`, `hd_epic_object_motion`, `hd_epic_gaze`.

### Single prototype

```bash
# e.g. gaze estimation only
python -m lmms_eval \
--model qwen2_5_vl \
--model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
--tasks hd_epic_gaze_gaze_estimation \
--batch_size 1 \
--output_path ./logs/hd_epic_gaze_estimation
```

## Validation Results

Validated using Qwen2.5-VL-7B-Instruct with settings matching the R3 community report (Zhang et al., 2025): `fps=1, max_num_frames=32, min_pixels=50176, max_pixels=50176` (224×224 per frame).

```bash
python -m lmms_eval \
--model qwen2_5_vl \
--model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct,fps=1,max_num_frames=32,min_pixels=50176,max_pixels=50176 \
--tasks hd_epic_ingredient_ingredient_weight \
--batch_size 1 \
--output_path ./logs/qwen_r3_match
```

| Prototype | R3 report (Zhang et al.) | This integration | Δ |
|---|---|---|---|
| Ingredient Weight | ~28% | 26% | -2pp (within SE) |

The 2pp difference is within statistical noise for n=50 questions: SE = √(p(1−p)/n) = √(0.26 × 0.74 / 50) ≈ 0.062, i.e. about 6pp. Results are consistent with the published zero-shot Qwen2.5-VL-7B baseline.

**Note on frame sampling:** Higher frame counts or resolution improve accuracy on short-clip prototypes. The default lmms-eval Qwen2.5-VL settings (32 frames, default pixel budget) give ~40% on Ingredient Weight — above R3's 28% because more frames are sampled per second for short clips. To reproduce R3's numbers exactly, use `fps=1,max_num_frames=32,min_pixels=50176,max_pixels=50176`.

## Notes

- **BBOX coordinates**: Normalised from the native 1408×1408 Project Aria RGB frame to a 1000×1000 grid, matching the official HD-EPIC eval protocol (see the sketch after this list).
- **TIME tags**: Made relative to clip start, consistent with the original `hd-epic-vqa-eval` repository.
- **Multi-video questions**: Videos are passed in order as separate `{"type": "video", "url": ...}` blocks in the user message (also shown in the sketch below).
- **JSONL file**: `hd_epic_questions.jsonl` is generated locally and should not be committed to the repository (it is listed in `.gitignore`).
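
A short sketch of the first and third conventions above (`normalize_bbox` is a hypothetical helper name, and the `"text"` content block is an assumption alongside the `"video"` blocks described in the notes):

```python
def normalize_bbox(x1, y1, x2, y2, native=1408, target=1000):
    # Scale Aria-native pixel coordinates (1408x1408) onto the 1000x1000
    # grid used by the official HD-EPIC eval protocol.
    scale = target / native
    return [round(v * scale) for v in (x1, y1, x2, y2)]


# Multi-video question: one video block per clip, in order, then the prompt text.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "/path/to/videos/P01/clip_a.mp4"},
            {"type": "video", "url": "/path/to/videos/P01/clip_b.mp4"},
            {"type": "text", "text": "Where did I put the object identified by <BBOX> at TIME after taking it at TIME?"},
        ],
    }
]
```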

## Citation

```bibtex
@inproceedings{perrett2025hdepic,
title = {{HD-EPIC}: A Highly-Detailed Egocentric Video Dataset},
author = {Perrett, Toby and Darkhalil, Ahmad and Sinha, Saptarshi and
Emara, Omar and Pollard, Sam and Parida, Kranti and Liu, Kaiting and
Gatti, Prajwal and Bansal, Siddhant and Flanagan, Kevin and
Chalk, Jacob and Zhu, Zhifan and Guerrier, Rhodri and
Abdelazim, Fahd and Zhu, Bin and Moltisanti, Davide and
Wray, Michael and Doughty, Hazel and Damen, Dima},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR)},
year = {2025}
}
```
15 changes: 15 additions & 0 deletions lmms_eval/tasks/hd_epic/_group_hd_epic.yaml
@@ -0,0 +1,15 @@
# HD-EPIC -- top-level group covering all 30 prototypes.
# Use --tasks hd_epic to evaluate the full benchmark.
# The 7 category sub-groups are also runnable individually,
# e.g. --tasks hd_epic_recipe, --tasks hd_epic_gaze.
group: hd_epic
task:
- hd_epic_recipe
- hd_epic_ingredient
- hd_epic_nutrition
- hd_epic_fine_grained
- hd_epic_3d_perception
- hd_epic_object_motion
- hd_epic_gaze
metadata:
version: 1.0
55 changes: 55 additions & 0 deletions lmms_eval/tasks/hd_epic/_hd_epic_base.yaml
@@ -0,0 +1,55 @@
# HD-EPIC base task configuration.
# All per-task YAMLs inherit from this base via `include`.
#
# The dataset is expected to be loaded from a local JSONL file that was produced
# by running `hd_epic_to_hf.py`, OR from a HuggingFace Hub dataset.
#
# Required setup:
# 1. Run `python hd_epic_to_hf.py` to generate the JSONL dataset files, OR
# set `dataset_path` below to your HuggingFace Hub dataset ID.
# 2. Export HD_EPIC_VIDEO_DIR=/path/to/your/videos
# 3. Run lmms-eval:
# python -m lmms_eval \
# --model qwen2_5_vl \
# --model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
# --tasks hd_epic \
# --batch_size 1 \
# --include_path /path/to/this/folder

dataset_path: json # use 'json' for local JSONL files
# OR replace with your HF Hub ID, e.g.:
# hd-epic/hd-epic-benchmark
dataset_kwargs:
data_files:
test: lmms_eval/tasks/hd_epic/hd_epic_questions.jsonl # path to the output of hd_epic_to_hf.py, relative to the repo root; adjust if you generated it elsewhere

test_split: test
output_type: generate_until

# Preferred: chat-model interface (doc_to_messages)
doc_to_messages: !function utils.hd_epic_doc_to_messages

# Legacy fallback (simple models using doc_to_visual + doc_to_text)
doc_to_visual: !function utils.hd_epic_doc_to_visual
doc_to_text: !function utils.hd_epic_doc_to_text
doc_to_target: !function utils.hd_epic_doc_to_target

process_results: !function utils.hd_epic_process_results

metric_list:
- metric: accuracy
aggregation: !function utils.hd_epic_aggregate_accuracy
higher_is_better: true

generation_kwargs:
max_new_tokens: 10
temperature: 0
do_sample: false
stop_sequences:
- "\n"

cluster_key: cluster_key


metadata:
version: 1.0
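For reference, a hedged sketch of what the `process_results` / aggregation pair wired up above could look like; the `answer` field name is an assumption, and the real `utils.py` may differ:

```python
def hd_epic_process_results(doc, results):
    # results[0] is the raw generation; compare its first letter to the answer key.
    pred = results[0].strip().upper()[:1]
    return {"accuracy": {"correct": pred == doc["answer"], "cluster_key": doc["cluster_key"]}}


def hd_epic_aggregate_accuracy(items):
    # Plain micro-average over all answered questions.
    return sum(item["correct"] for item in items) / max(len(items), 1)
```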
@@ -0,0 +1,10 @@
# HD-EPIC 'fine_grained' category -- bundles its 4 prototypes.
# Run all of them with --tasks hd_epic_fine_grained
group: hd_epic_fine_grained
task:
- hd_epic_fine_grained_action_recognition
- hd_epic_fine_grained_how_recognition
- hd_epic_fine_grained_why_recognition
- hd_epic_fine_grained_action_localization
metadata:
version: 1.0
@@ -0,0 +1,13 @@
# HD-EPIC subtask: fine_grained_action_localization
# "When did the action <X> happen in the video?" (pick from 5 time-range choices)
include: ../_hd_epic_base.yaml

task: hd_epic_fine_grained_action_localization

# Filter the combined JSONL down to this prototype's rows.
process_docs: !function utils.filter_fine_grained_action_localization

metadata:
version: 1.0
task_type: fine_grained_action_localization
description: "\"When did the action <X> happen in the video?\" (pick from 5 time-range choices)"
@@ -0,0 +1,13 @@
# HD-EPIC subtask: fine_grained_action_recognition
# "Which of these sentences best describe the ongoing action(s) in the video?"
include: ../_hd_epic_base.yaml

task: hd_epic_fine_grained_action_recognition

# Filter the combined JSONL down to this prototype's rows.
process_docs: !function utils.filter_fine_grained_action_recognition

metadata:
version: 1.0
task_type: fine_grained_action_recognition
description: "\"Which of these sentences best describe the ongoing action(s) in the video?\""
@@ -0,0 +1,13 @@
# HD-EPIC subtask: fine_grained_how_recognition
# "What is the best description for HOW the person carried out the action <X>?"
include: ../_hd_epic_base.yaml

task: hd_epic_fine_grained_how_recognition

# Filter the combined JSONL down to this prototype's rows.
process_docs: !function utils.filter_fine_grained_how_recognition

metadata:
version: 1.0
task_type: fine_grained_how_recognition
description: "\"What is the best description for HOW the person carried out the action <X>?\""
@@ -0,0 +1,13 @@
# HD-EPIC subtask: fine_grained_why_recognition
# "What is the best description for WHY the person performed the action <X>?"
include: ../_hd_epic_base.yaml

task: hd_epic_fine_grained_why_recognition

# Filter the combined JSONL down to this prototype's rows.
process_docs: !function utils.filter_fine_grained_why_recognition

metadata:
version: 1.0
task_type: fine_grained_why_recognition
description: "\"What is the best description for WHY the person performed the action <X>?\""