-  title={LAION-5B: An open large-scale dataset for training next generation image-text models},
-  author={Schuhmann, Christoph and Beaumont, Romain and Vencu, Richard and Gordon, Cade and Wightman, Ross and Cherti, Mehdi and Coombes, Theo and Katta, Aarush and Mullis, Clayton and Wortsman, Mitchell and others},
-  journal={Advances in Neural Information Processing Systems},
-  volume={35},
-  pages={25278--25294},
-  year={2022}
+@inproceedings{emonet_face,
+  title={{EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition}},
+  author={Schuhmann, Christoph and Kaczmarczyk, Robert and Rabby, Gollam and Kraus, Maurice and Friedrich, Felix and Nguyen, Huu and Kalyan, Krishna and Nadi, Kourosh and Kersting, Kristian and Auer, S\"{o}ren},
+  booktitle={NeurIPS},
+  year={2025},
+  doi={10.48550/arXiv.2505.20033}
+}
+
+@article{emonet_voice,
+  title={{EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection}},
+  author={Schuhmann, Christoph and Kaczmarczyk, Robert and Rabby, Gollam and Friedrich, Felix and Kraus, Maurice and Nadi, Kourosh and Nguyen, Huu and Kersting, Kristian and Auer, S\"{o}ren},
+  journal={arXiv preprint arXiv:2506.09827},
+  year={2025},
+  doi={10.48550/arXiv.2506.09827}
+}
+
+@inproceedings{clip,
+  title={Learning Transferable Visual Models From Natural Language Supervision},
+  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
+  booktitle={Proceedings of the 38th International Conference on Machine Learning},

paper/paper.md: 2 additions & 2 deletions
@@ -39,7 +39,7 @@ bibliography: paper.bib

# Summary

-VideoAnnotator is an open-source Python toolkit for automated video annotation, designed for behavioral, social, and health research at scale. It ships ten declaratively configured pipelines spanning four modalities: person tracking via YOLOv11 with ByteTrack [@yolo11; @bytetrack]; facial analysis using DeepFace [@deepface], LAION CLIP face embeddings [@laion], and OpenFace 3 [@openface2]; scene detection with PySceneDetect and CLIP-based labelling [@pyscenedetect]; and audio processing comprising Whisper speech recognition [@whisper], pyannote speaker diarization [@pyannote], and LAION empathic voice emotion analysis. All pipelines share a uniform interface behind a local-first FastAPI service [@fastapi], with Docker images for consistent CPU and GPU execution. Outputs are standardized to established formats (COCO JSON, RTTM, WebVTT) and accompanied by provenance metadata suitable for downstream modeling and review.
+VideoAnnotator is an open-source Python toolkit for automated video annotation, designed for behavioral, social, and health research at scale. It ships ten declaratively configured pipelines spanning four modalities: person tracking via YOLOv11 with ByteTrack [@yolo11; @bytetrack]; facial analysis using DeepFace [@deepface], LAION EmoNet face emotion analysis [@emonet_face], and OpenFace 3 [@openface3]; scene detection with PySceneDetect and CLIP-based labelling [@pyscenedetect; @clip]; and audio processing comprising Whisper speech recognition [@whisper], pyannote speaker diarization [@pyannote], and LAION EmoNet voice emotion analysis [@emonet_voice]. All pipelines share a uniform interface behind a local-first FastAPI service [@fastapi], with Docker images for consistent CPU and GPU execution. Outputs are standardized to established formats (COCO JSON, RTTM, WebVTT) and accompanied by provenance metadata suitable for downstream modeling and review.

A companion web application, Video Annotation Viewer [@viewer], provides an interactive interface for overlaying annotations on source video — rendering pose skeletons, speaker timelines, subtitle tracks, and scene boundaries — so that researchers can visually inspect and validate pipeline outputs before downstream analysis.
@@ -55,7 +55,7 @@ VideoAnnotator addresses this gap by providing a maintainable software layer tha

Existing tools for behavioral video analysis fall into two broad categories. Manual annotation platforms such as ELAN [@elan] and Datavyu [@datavyu] provide flexible coding interfaces but require trained human coders and do not scale to large corpora. At the other end, specialized computer-vision libraries such as DeepLabCut [@deeplabcut] and YOLO [@yolo11] offer powerful pose estimation and object detection but target a single modality and leave integration, output standardization, and batch orchestration to the user.

-For facial affect, toolkits such as Py-Feat [@pyfeat] and OpenFace [@openface2] extract action units, landmarks, and emotion labels from video frames. On the audio side, openSMILE [@opensmile] remains widely cited for acoustic feature extraction but has seen limited maintenance, and no current open-source toolkit offers end-to-end speech emotion analysis integrated with video. These tools all run locally but each addresses a single modality and produces its own output schema. Scene-detection libraries such as PySceneDetect [@pyscenedetect] and speaker-diarization toolkits like pyannote [@pyannote] similarly solve one piece of the puzzle. A researcher studying parent–child interaction, for example, may need person tracking, facial expression analysis, speech segmentation, and scene detection applied to the same set of videos — currently requiring ad-hoc glue code across four or more libraries with no shared output format or batch orchestration.
+For facial affect, toolkits such as Py-Feat [@pyfeat] and OpenFace [@openface3] extract action units, landmarks, and emotion labels from video frames. On the audio side, openSMILE [@opensmile] remains widely cited for acoustic feature extraction but has seen limited maintenance, and no current open-source toolkit offers end-to-end speech emotion analysis integrated with video. These tools all run locally but each addresses a single modality and produces its own output schema. Scene-detection libraries such as PySceneDetect [@pyscenedetect] and speaker-diarization toolkits like pyannote [@pyannote] similarly solve one piece of the puzzle. A researcher studying parent–child interaction, for example, may need person tracking, facial expression analysis, speech segmentation, and scene detection applied to the same set of videos — currently requiring ad-hoc glue code across four or more libraries with no shared output format or batch orchestration.

VideoAnnotator was built rather than contributed to an existing project because no single package offered the combination of multi-modal pipeline composition, declarative configuration, standardized output formats (COCO, RTTM, WebVTT), and local-first batch orchestration that our research workflows required. The closest comparable systems are either commercial, cloud-dependent, or tightly coupled to a single detector.
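
The revised Summary above describes a uniform, local-first FastAPI job interface with standardized outputs (COCO JSON, RTTM, WebVTT). As a rough illustration of that workflow, the sketch below submits a video to a locally running server and polls for COCO-format tracking results; the port, endpoint paths, payload fields, and pipeline identifiers are assumptions made for illustration only and are not taken from this diff or the VideoAnnotator documentation.

```python
"""Hedged sketch: driving a local VideoAnnotator-style FastAPI service.

All endpoint paths, payload fields, and pipeline identifiers below are
assumed for illustration; consult the actual VideoAnnotator docs for the
real API surface.
"""
import time

import requests

BASE = "http://localhost:8000"  # assumed port for the local-first service

# Hypothetical job-submission endpoint: request two of the pipelines
# mentioned in the Summary for a single video file.
resp = requests.post(
    f"{BASE}/jobs",
    json={
        "video_path": "/data/session_01.mp4",                 # assumed payload field
        "pipelines": ["person_tracking", "scene_detection"],  # assumed identifiers
    },
    timeout=30,
)
resp.raise_for_status()
job_id = resp.json()["id"]

# Poll the hypothetical job-status endpoint until processing finishes.
while True:
    status = requests.get(f"{BASE}/jobs/{job_id}", timeout=30).json()
    if status.get("state") in {"completed", "failed"}:
        break
    time.sleep(5)

# Fetch a standardized output, e.g. COCO JSON for person tracking.
if status.get("state") == "completed":
    coco = requests.get(
        f"{BASE}/jobs/{job_id}/results/person_tracking", timeout=30
    ).json()
    print("tracked annotations:", len(coco.get("annotations", [])))
```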