Commit bb1ebe7

Add lemonade benchmark to the evaluation (#813)
* Video loader with caching and download
* Video loader with caching and download
* black and isort formating
* clean imports
* Video loader with caching and download
* black and isort formating
* clean imports
* implement coderabbitai comments
* download data in cache
* add README for lemonade
* remove custom download def, move max_num_frames to config
* add lemonade to current_tasks

---------

Co-authored-by: Matea Tashkovska <matea_tas@yahoo.com>
1 parent 8818d45 commit bb1ebe7

4 files changed

Lines changed: 409 additions & 0 deletions

docs/current_tasks.md

Lines changed: 1 addition & 0 deletions
@@ -245,6 +245,7 @@ python -m lmms_eval --tasks list_with_num
 - egoschema_mcppl
 - egoschema_subset_mcppl
 - egoschema_subset
+- [LEMONADE](https://huggingface.co/datasets/amathislab/LEMONADE) (lemonade)
 - [LongVideoBench](https://github.com/longvideobench/LongVideoBench)
 - [MovieChat](https://github.com/rese1f/MovieChat) (moviechat)
 - Global Mode for entire video (moviechat_global)

lmms_eval/tasks/lemonade/README.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
# LEMONADE

## Task Description

**LEMONADE** (Language models Evaluation of MOtion aNd Action-Driven Enquiries) is a QA benchmark extracted from the **EPFL-Smart-Kitchen-30** dataset (see [arXiv](https://arxiv.org/abs/2506.01608)). It consists of **36,521 closed-ended QA pairs** linked to egocentric video clips.

Questions are organized into three groups and six subcategories:

- **Behavior Understanding**
  - *Perception*: recognizing perceived actions
  - *Reasoning*: reasoning over unseen behaviors
- **Long-term Understanding**
  - *Summarization*: summarizing over longer clips
  - *Session Properties*: inferring session-level information
- **Motion & Biomechanics**
  - *Physical Attributes*: inferring hand shapes, joint angles, etc.
  - *Kinematics*: inferring trajectory velocities

The benchmark was evaluated using **`lmms-eval`** in the associated publication.


## Implementation

- **utils.py**: Handles data loading from Hugging Face, video loading, answer parsing, and metric evaluation.
- **lemonade.yaml**: Contains the default prompts and evaluation settings.
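
The `utils.py` diff is not shown on this page, so the video loading step can only be illustrated, not quoted. The sketch below shows one common way to respect a frame budget such as the `max_num_frames: 8` setting in `lemonade.yaml`: picking evenly spaced frame indices. The function name `sample_frame_indices` and the midpoint-sampling strategy are assumptions made for illustration, not the code added by this commit.

```python
def sample_frame_indices(total_frames: int, max_num_frames: int = 8) -> list[int]:
    """Pick at most `max_num_frames` frame indices spread evenly over a clip.

    Hypothetical helper: the real loader in utils.py may sample differently.
    """
    if total_frames <= 0 or max_num_frames <= 0:
        return []
    if total_frames <= max_num_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    # Split the clip into equal-width segments and take each segment's midpoint.
    step = total_frames / max_num_frames
    return [int(step * i + step / 2) for i in range(max_num_frames)]


if __name__ == "__main__":
    print(sample_frame_indices(80, 8))
    print(sample_frame_indices(5, 8))
```

Midpoint sampling avoids always landing on the first and last frames of a clip, which are often transitions; other strategies (for example fixed-FPS sampling) would fit the same interface.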

When running LEMONADE through `lmms-eval`, the data is automatically downloaded. For direct dataset access, please refer to [Hugging Face](https://huggingface.co/datasets/amathislab/LEMONADE) or [Zenodo](https://zenodo.org/records/15535461).

Performance is evaluated in terms of accuracy against the ground truth, with results reported overall as well as per category and subcategory.
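
As a rough illustration of the evaluation described above, the following sketch extracts an answer letter from a free-form model response and then computes accuracy overall, per category, and per subcategory. Everything here is an assumption for illustration: the helper names, the record keys (`pred`, `answer`, `category`, `subcategory`), and the option range A to E are hypothetical and are not taken from `utils.py`.

```python
import re
from collections import defaultdict
from typing import Optional


def parse_answer_letter(response: str) -> Optional[str]:
    """Return the first standalone capital letter A-E in a response, if any.

    Hypothetical parser; the A-E range is an assumption about the option count.
    """
    match = re.search(r"\b([A-E])\b", response.strip())
    return match.group(1) if match else None


def aggregate_accuracy(results: list[dict]) -> dict[str, float]:
    """Compute accuracy overall, per category, and per category/subcategory."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        # An unparseable response (pred is None) counts as incorrect.
        hit = int(r["pred"] is not None and r["pred"] == r["answer"])
        keys = ("overall", r["category"], f'{r["category"]}/{r["subcategory"]}')
        for key in keys:
            correct[key] += hit
            total[key] += 1
    return {key: correct[key] / total[key] for key in total}


if __name__ == "__main__":
    results = [
        {"category": "Behavior Understanding", "subcategory": "Perception",
         "pred": parse_answer_letter("A"), "answer": "A"},
        {"category": "Behavior Understanding", "subcategory": "Reasoning",
         "pred": parse_answer_letter("Answer: B."), "answer": "C"},
        {"category": "Motion & Biomechanics", "subcategory": "Kinematics",
         "pred": parse_answer_letter("not sure"), "answer": "D"},
    ]
    for key, acc in aggregate_accuracy(results).items():
        print(f"{key}: {acc:.2f}")
```

A simple regex like this can misread a response that starts with the article "A", which is one reason the task's `post_prompt` asks the model to respond with the letter only.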

## Citation

If you use **LEMONADE**, please cite:

```bibtex
@misc{bonnetto2025epflsmartkitchen,
  title={EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models},
  author={Andy Bonnetto and Haozhe Qi and Franklin Leong and Matea Tashkovska and Mahdi Rad and Solaiman Shokur and Friedhelm Hummel and Silvestro Micera and Marc Pollefeys and Alexander Mathis},
  year={2025},
  eprint={2506.01608},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.01608},
}
```
lmms_eval/tasks/lemonade/lemonade.yaml

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
dataset_path: amathislab/LEMONADE
dataset_kwargs:
  video: true
  cache_dir: lemonade_data
  force_unzip: true
task: "lemonade"
test_split: test
output_type: generate_until
doc_to_visual: !function utils.lemonade_doc_to_visual
doc_to_text: !function utils.lemonade_doc_to_text
doc_to_target: "Correct Answer"

generation_kwargs:
  max_new_tokens: 128
  temperature: 0
  do_sample: false

process_results: !function utils.lemonade_process_results
metric_list:
  - metric: acc
    aggregation: !function utils.lemonade_aggregate_results
    higher_is_better: true

lmms_eval_specific_kwargs:
  default:
    pre_prompt: "Answer the following multiple-choice question using the given images.\n"
    post_prompt: "\nRespond only with the letter of the correct answer."
    max_num_frames: 8
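
To make the role of `pre_prompt` and `post_prompt` concrete, here is a minimal sketch of how a text prompt might be assembled around a multiple-choice question. The two prompt strings are copied from the config above; the function `build_prompt`, its `question` and `options` arguments, and the `A.`/`B.` option layout are hypothetical and do not reproduce `utils.lemonade_doc_to_text`, which is not shown in this diff.

```python
import string

# Prompt strings as they appear in the task config.
PRE_PROMPT = "Answer the following multiple-choice question using the given images.\n"
POST_PROMPT = "\nRespond only with the letter of the correct answer."


def build_prompt(
    question: str,
    options: list[str],
    pre_prompt: str = PRE_PROMPT,
    post_prompt: str = POST_PROMPT,
) -> str:
    """Wrap a question and lettered options between the configured prompts.

    Hypothetical layout: the real task may format options differently.
    """
    letters = string.ascii_uppercase
    lines = [question] + [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    return pre_prompt + "\n".join(lines) + post_prompt


if __name__ == "__main__":
    print(build_prompt("What is the person doing?", ["Cutting", "Stirring"]))
```

Keeping the prompt text in the YAML rather than in code means a different instruction can be tried by editing the config alone, without touching the task logic.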
