Classification/ZeroshotClassification:
- [x] [mteb/kinetics-400](https://huggingface.co/datasets/mteb/kinetics-400)
- [x] [mteb/HMDB51](https://huggingface.co/datasets/mteb/HMDB51)
- [x] [mteb/Breakfast](https://huggingface.co/datasets/mteb/Breakfast)
- [x] [mteb/kinetics-700-2020](https://huggingface.co/datasets/mteb/kinetics-700-2020)
- [x] [mteb/UCF101-51VA](https://huggingface.co/datasets/mteb/UCF101-51VA) Task implemented. Pending: #4520
- [ ] [mteb/SomethingSomethingV2](https://huggingface.co/datasets/mteb/SomethingSomethingV2) Implemented. Pending: #4542
- [x] [mteb/kinetics-600](https://huggingface.co/datasets/mteb/kinetics-600)
- [x] [mteb/VGGSound](https://huggingface.co/datasets/mteb/VGGSound) Implemented. Pending: #4589
- [x] [mteb/AVE-Dataset](https://huggingface.co/datasets/mteb/AVE-Dataset)
- [x] [mteb/Human-Animal-Cartoon](https://huggingface.co/datasets/mteb/Human-Animal-Cartoon)
- [x] [mteb/RAVDESS_AV](https://huggingface.co/datasets/mteb/RAVDESS_AV)
- [x] [mteb/MUSIC-AVQA_cls-preprocessed](https://huggingface.co/datasets/mteb/MUSIC-AVQA_cls-preprocessed)
- [x] [mteb/MELD](https://huggingface.co/datasets/mteb/MELD)
- [x] [mteb/AVMeme-Exam](https://huggingface.co/datasets/mteb/AVMeme-Exam)
- [x] [mteb/WorldSense_1min](https://huggingface.co/datasets/mteb/WorldSense_1min)

Pair Classification:
- [x] [VideoConPairClassification](https://github.com/embeddings-benchmark/mteb/pull/4471)
- [x] [VinogroundPairClassification](https://github.com/embeddings-benchmark/mteb/pull/4471)
- [x] [AVSpeakerBenchPairClassification](https://github.com/embeddings-benchmark/mteb/pull/4471)

---------------

- [x] [mteb/Human-Animal-Cartoon](https://huggingface.co/datasets/mteb/Human-Animal-Cartoon)
- [x] [mteb/AVE-Dataset](https://huggingface.co/datasets/mteb/AVE-Dataset)
- [x] [mteb/RAVDESS_AV](https://huggingface.co/datasets/mteb/RAVDESS_AV)
- [x] [mteb/MELD](https://huggingface.co/datasets/mteb/MELD)
- [x] [mteb/MUSIC-AVQA_cls-preprocessed](https://huggingface.co/datasets/mteb/MUSIC-AVQA_cls-preprocessed)

Clustering:
- [x] #4382
- [x] #4409
- [x] [mteb/RAVDESS_AV](https://huggingface.co/datasets/mteb/RAVDESS_AV)
- [x] #4498 https://github.com/embeddings-benchmark/mteb/pull/4534
- [x] #4422
- [x] #4447
- [x] #4468
- [ ] VideoMME by domain?

Retrieval (VA2T, V2T, T2VA, T2V):
- [x] [mteb/MSR-VTT](https://huggingface.co/datasets/mteb/MSR-VTT) Task implemented. Pending: #4375
- [x] [mteb/MSVD](https://huggingface.co/datasets/mteb/MSVD)
- [x] [mteb/DiDeMo](https://huggingface.co/datasets/mteb/DiDeMo)
- [x] [mteb/TUNA-Bench_1K](https://huggingface.co/datasets/mteb/TUNA-Bench_1K)
- [x] [mteb/ActivityNet_Captions_val2](https://huggingface.co/datasets/mteb/ActivityNet_Captions_val2)
- [x] [mteb/YouCook2_val](https://huggingface.co/datasets/mteb/YouCook2_val)
- [x] [mteb/VATEX_test_1k](https://huggingface.co/datasets/mteb/VATEX_test_1k)
- [x] [mteb/Shot2Story20K_test](https://huggingface.co/datasets/mteb/Shot2Story20K_test)
- [x] [mteb/VGGSound_AV_RETRIEVAL](https://huggingface.co/datasets/mteb/VGGSound_AV_RETRIEVAL)
- [x] [mteb/VALOR-32K](https://huggingface.co/datasets/mteb/VALOR-32K)
- [x] [mteb/AudioCaps_AV](https://huggingface.co/datasets/mteb/AudioCaps_AV)
- [x] [mteb/panda-70m](https://huggingface.co/datasets/mteb/panda-70m)
- [x] [mteb/AVMeme-Exam](https://huggingface.co/datasets/mteb/AVMeme-Exam)

Retrieval (V2A, A2V):
- [x] ... (from above datasets)

Video Question Answering:
- [x] [mteb/worldqa](https://huggingface.co/datasets/mteb/worldqa)
- [x] [mteb/EgoSchema_subset](https://huggingface.co/datasets/mteb/EgoSchema_subset)
- [x] [mteb/NExT-QA](https://huggingface.co/datasets/mteb/NExT-QA)
- [x] [mteb/PerceptionTest_val](https://huggingface.co/datasets/mteb/PerceptionTest_val)
- [ ] [mteb/star_bench_val](https://huggingface.co/datasets/mteb/star_bench_val) (blocked?)
- [x] [mteb/AV-SpeakerBench](https://huggingface.co/datasets/mteb/AV-SpeakerBench)
- [x] [mteb/WorldSense_1min](https://huggingface.co/datasets/mteb/WorldSense_1min)
- [x] [mteb/Daily-Omni](https://huggingface.co/datasets/mteb/Daily-Omni)
- [ ] [mteb/Video-MME_short](https://huggingface.co/datasets/mteb/Video-MME_short) (blocked?)
- [x] [mteb/OmniVideoBench_subset](https://huggingface.co/datasets/mteb/OmniVideoBench_subset)
- [x] [mteb/AVQA_val](https://huggingface.co/datasets/mteb/AVQA_val)
- [x] [mteb/AVMeme-Exam](https://huggingface.co/datasets/mteb/AVMeme-Exam)
- [ ] [mteb/MVBench](https://huggingface.co/datasets/mteb/MVBench) (blocked?)
- [ ] [MME-Benchmarks/Video-MME-v2](https://huggingface.co/datasets/MME-Benchmarks/Video-MME-v2) (new release; not processed like others)

Models:
- [x] [LCO](https://huggingface.co/LCO-Embedding/LCO-Embedding-Omni-7B)
- [x] [PE-AV (Facebook)](https://huggingface.co/facebook/pe-av-large) https://github.com/embeddings-benchmark/mteb/issues/3797
- [x] #4385
- [x] #4386
- [x] [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B)
- [x] [e5-omni](https://huggingface.co/Haon-Chen/e5-omni-7B)
- [x] #4387
- [x] #4414

--------

- [ ] [Seed-1.6-Embedding](https://seed1-6-embedding.github.io/)
- [ ] [ImageBind](https://github.com/facebookresearch/ImageBind)
- [ ] [LanguageBind](https://huggingface.co/LanguageBind/LanguageBind_Video)
- [ ] [ONE-PEACE](https://github.com/OFA-Sys/ONE-PEACE)
- [ ] [OmniBind](https://github.com/zehanwang01/OmniBind)
- [x] #4512
- [x] #4513
- [x] #4514