Add lemonade benchmark to the evaluation (#813)

andybonnetto · TashkovskaMatea · web-flow · commit bb1ebe76e7a9 · 2025-09-29T09:41:22.000+08:00
* Video loader with caching and download

* black and isort formatting

* clean imports

* implement coderabbitai comments

* download data in cache

* add README for lemonade

* remove custom download def, move max_num_frames to config

* add lemonade to current_tasks

---------

Co-authored-by: Matea Tashkovska <matea_tas@yahoo.com>
diff --git a/docs/current_tasks.md b/docs/current_tasks.md
@@ -245,6 +245,7 @@ python -m lmms_eval --tasks list_with_num
   - egoschema_mcppl
   - egoschema_subset_mcppl
   - egoschema_subset
+- [LEMONADE](https://huggingface.co/datasets/amathislab/LEMONADE) (lemonade)
 - [LongVideoBench](https://github.com/longvideobench/LongVideoBench)
 - [MovieChat](https://github.com/rese1f/MovieChat) (moviechat)
   - Global Mode for entire video (moviechat_global)
diff --git a/lmms_eval/tasks/lemonade/README.md b/lmms_eval/tasks/lemonade/README.md
@@ -0,0 +1,45 @@
+# LEMONADE
+
+## Task Description  
+
+**LEMONADE** (Language models Evaluation of MOtion aNd Action-Driven Enquiries) is a QA benchmark extracted from the **EPFL-Smart-Kitchen-30** dataset (see [arXiv](https://arxiv.org/abs/2506.01608)). It consists of **36,521 closed-ended QA pairs** linked to egocentric video clips.  
+
+Questions are organized into three groups and six subcategories:  
+
+- **Behavior Understanding**  
+  - *Perception*: recognizing perceived actions  
+  - *Reasoning*: reasoning over unseen behaviors  
+- **Long-term Understanding**  
+  - *Summarization*: summarizing over longer clips  
+  - *Session Properties*: inferring session-level information  
+- **Motion & Biomechanics**  
+  - *Physical Attributes*: inferring hand shapes, joint angles, etc.  
+  - *Kinematics*: inferring trajectory velocities  
+
+In the associated publication, models were evaluated on this benchmark using **`lmms-eval`**.  
+
+
+## Implementation  
+
+- **utils.py**: Handles data loading from Hugging Face, video loading, answer parsing, and metric evaluation.  
+- **lemonade.yaml**: Contains the default prompts and evaluation settings.
+
+When running LEMONADE through `lmms-eval`, the data is automatically downloaded. For direct dataset access, please refer to [Hugging Face](https://huggingface.co/datasets/amathislab/LEMONADE) or [Zenodo](https://zenodo.org/records/15535461).  
+
+Performance is evaluated in terms of accuracy against the ground truth, with results reported overall as well as per category and subcategory.
+
+## Citation  
+
+If you use **LEMONADE**, please cite:  
+
+```bibtex
+@misc{bonnetto2025epflsmartkitchen,
+      title={EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models}, 
+      author={Andy Bonnetto and Haozhe Qi and Franklin Leong and Matea Tashkovska and Mahdi Rad and Solaiman Shokur and Friedhelm Hummel and Silvestro Micera and Marc Pollefeys and Alexander Mathis},
+      year={2025},
+      eprint={2506.01608},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2506.01608}, 
+}
+```
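Reviewer note: the README says answers are parsed and accuracy is reported overall, per category, and per subcategory. The sketch below illustrates one way that could work. It is not the code in `utils.py` (not shown in this part of the diff); the function names `parse_letter` and `aggregate_accuracy` and the record keys `category`, `subcategory`, and `correct` are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional


def parse_letter(response: str) -> Optional[str]:
    """Return the leading answer letter (upper-cased), or None if absent.

    Assumes the model follows the prompt and starts with a single letter,
    optionally wrapped in parentheses, e.g. "B", "b.", or "(C) ...".
    """
    match = re.match(r"\s*\(?([A-Za-z])\)?(?![A-Za-z])", response)
    return match.group(1).upper() if match else None


def aggregate_accuracy(results: Iterable[dict]) -> Dict[str, float]:
    """Compute accuracy overall, per category, and per category/subcategory.

    Each record is assumed (hypothetically) to carry the keys
    'category', 'subcategory', and a boolean 'correct'.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for r in results:
        keys = ("overall", r["category"], f'{r["category"]}/{r["subcategory"]}')
        for key in keys:
            totals[key][0] += int(r["correct"])
            totals[key][1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}
```

A record's `correct` flag would be obtained by comparing `parse_letter(response)` with the ground-truth letter stored under the `Correct Answer` field named in `lemonade.yaml`.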
diff --git a/lmms_eval/tasks/lemonade/lemonade.yaml b/lmms_eval/tasks/lemonade/lemonade.yaml
@@ -0,0 +1,28 @@
+dataset_path: amathislab/LEMONADE
+dataset_kwargs:
+  video: true
+  cache_dir: lemonade_data
+  force_unzip: true
+task: "lemonade"
+test_split: test
+output_type: generate_until
+doc_to_visual: !function utils.lemonade_doc_to_visual
+doc_to_text: !function utils.lemonade_doc_to_text
+doc_to_target: "Correct Answer"
+
+generation_kwargs:
+  max_new_tokens: 128
+  temperature: 0
+  do_sample: false
+
+process_results: !function utils.lemonade_process_results
+metric_list:
+  - metric: acc
+    aggregation: !function utils.lemonade_aggregate_results
+    higher_is_better: true
+
+lmms_eval_specific_kwargs:
+  default:
+    pre_prompt: "Answer the following multiple-choice question using the given images.\n"
+    post_prompt: "\nRespond only with the letter of the correct answer."
+  max_num_frames: 8
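Reviewer note: a rough illustration of how the `max_num_frames`, `pre_prompt`, and `post_prompt` settings above might be consumed. This is a sketch under assumptions, not the `utils.py` implementation: uniform midpoint sampling is only one plausible frame-selection strategy, and the helper names `uniform_frame_indices` and `build_prompt` are hypothetical.

```python
from typing import List


def uniform_frame_indices(total_frames: int, max_num_frames: int = 8) -> List[int]:
    """Pick at most max_num_frames indices spread evenly across a clip.

    The clip is split into max_num_frames equal segments and the frame
    at the midpoint of each segment is selected.
    """
    if total_frames <= 0 or max_num_frames <= 0:
        return []
    if total_frames <= max_num_frames:
        return list(range(total_frames))  # short clip: keep every frame
    step = total_frames / max_num_frames
    return [int(step * i + step / 2) for i in range(max_num_frames)]


def build_prompt(question: str, pre_prompt: str, post_prompt: str) -> str:
    """Wrap the question text with the configured pre- and post-prompt."""
    return f"{pre_prompt}{question}{post_prompt}"
```

With the defaults from the YAML, an 80-frame clip would yield frames 5, 15, ..., 75, and each question would be framed by the multiple-choice instruction and the "letter only" reminder.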
diff --git a/lmms_eval/tasks/lemonade/utils.py b/lmms_eval/tasks/lemonade/utils.py