@@ -1,89 +1,7 @@
 ---
-title: "VideoAnnotator: an extensible, reproducible toolkit for automated and manual video annotation in behavioral research"
-tags:
-  - Python
-  - video analysis
-  - behavioral science
-  - reproducibility
-  - machine learning
-authors:
-  - name: Caspar Addyman
-    orcid: https://orcid.org/0000-0003-0001-9548
-    affiliation: 1
-  - name: Jeremiah Ishaya
-    affiliation: 1
-  - name: Irene Uwerikowe
-    affiliation: 1
-  - name: Daniel Stamate
-    affiliation: 2
-  - name: Jamie Lachman
-    affiliation: 3
-  - name: Mark Tomlinson
-    affiliation: 1
-affiliations:
-  - name: Institute for Life Course Health Research (ILCHR), Stellenbosch University, South Africa
-    index: 1
-  - name: Department of Computing, Goldsmiths, University of London, United Kingdom
-    index: 2
-  - name: Department of Social Policy and Intervention (DSPI), University of Oxford, United Kingdom
-    index: 3
-date: 18 December 2025
-bibliography: paper.bib
+title: "JOSS manuscript"
 ---
 
-# Summary
+The canonical JOSS manuscript for VideoAnnotator is maintained in `paper/paper.md` with its references in `paper/paper.bib`.
 
-**VideoAnnotator** is an open-source Python toolkit for _automated and manual annotation of video_, designed for behavioral, social, and health research at scale. It provides:
-
-- a pluggable pipeline that wraps commonly used detectors (e.g., **OpenFace 3**, **DeepFace**) for face, action-unit, affect, gaze, speech, and motion features;
-- a **FastAPI** service for local or server deployment;
-- **Docker** images for fully reproducible execution, with GPU support where available;
-- a clear data contract for inputs/outputs (JSON/CSV/Parquet), timestamped tracks, and provenance metadata suitable for downstream modeling and review.
-
-The toolkit targets researchers who need _auditable, explainable feature timelines_ (e.g., smiles, gaze-on/off, vocal activity, proximity), while remaining domain-agnostic for use in psychology, HCI, education research, clinical observation, sports science, or any scenario where video behaviors must be measured consistently.
-
-# Statement of need
-
-Across the behavioral sciences, observational methods remain the gold standard for assessing rich interpersonal phenomena, but manual coding is costly, subjective, and difficult to scale. Prior work on parent–child interaction assessment, for example, highlights both the value of holistic constructs and the practical limits of human macro-coding (training burden, reliability drift, cultural variance) when datasets grow beyond small lab cohorts. These concerns generalize to many video-based fields (therapy sessions, classroom interactions, telehealth triage), where the measurement gap—the lack of scalable, standardized, and transparent coding—constrains progress.
-
-**VideoAnnotator** addresses this need by (i) standardizing access to modern open models for faces, pose, and voice; (ii) emitting **timestamped micro-events** that are inspectable and auditable; and (iii) packaging the whole stack for reproducible, resource-constrained deployment (laptops, on-prem servers, or cloud GPUs). The library does _not_ prescribe a single theory of behavior; rather, it provides the _feature scaffolding_ upon which diverse constructs or downstream models can be built (e.g., sensitivity, synchrony, rapport), with outputs suitable for both qualitative review and quantitative ML.
-
-# Functionality
-
-- **Pipelines & plugins.** Modular wrappers for detectors (face/affect, keypoints, diarization/transcripts) chained into pipelines via declarative YAML/JSON configs. New detectors can be added with small adapter classes.
-- **Annotations as first-class data.** Event schemas for segments, point events, and tracks (with confidences), plus per-stage provenance and hashes for audit.
-- **Batch & service modes.** Run from the CLI for batch processing, or as a **FastAPI** service to integrate into lab workflows, notebooks, or web apps.
-- **Reproducible runs.** Dockerfiles/compose recipes and pinned environments for CPU/GPU, designed to minimize “works on my machine” bugs.
-- **Privacy-aware processing.** Intended to run locally/on-prem; supports redaction steps (e.g., face-blurring tracks) as optional pipeline stages.
-- **Interoperability.** Outputs align with common tabular formats and can be visualized in the companion _Video Annotation Viewer_ (separate submission).
-
-# Illustrative use cases
-
-- **Education/HCI:** quantifying joint attention and participation in classroom videos.
-- **Clinical & therapy:** triaging sessions by indicators such as engagement or agitation.
-- **Team interaction / sports:** timing of gaze, gestures, or proximity changes in drills.
-- **Developmental science:** producing objective micro-codes that later map to global constructs in a transparent, two-stage analysis.
-
-# Design & architecture
-
-VideoAnnotator exposes:
-
-1. **`annotate()` API & CLI** to run configured pipelines over folders or manifests.
-2. **Detectors layer** (e.g., wrappers for OpenFace 3, DeepFace, pose/landmarks, ASR/diarization) with consistent batching and GPU utilization.
-3. **Event store** building timestamped tracks with confidences and per-stage provenance.
-4. **FastAPI service** exposing health, queue, and processing endpoints for integration.
-5. **Dockerized runtimes** (CPU/GPU) with pinned models and test fixtures.
-
-# Quality control
-
-We provide minimal smoke tests for pipeline execution, schema validation for outputs, and example configs with short clips to verify end-to-end operation. Reproducibility is validated by hashing the pipeline configuration, model versions, and container image used for each run (included in the metadata footer of outputs).
-
-# Statement of limitations
-
-- The toolkit depends on upstream detectors; accuracy and cultural generalizability reflect those models and the recording conditions (lighting, angle, mic).
-- Inference on long videos may require GPU resources for real-time or near-real-time performance.
-- Ethical deployment (consent, data governance, redaction) remains the responsibility of adopters; the library offers hooks to implement these steps.
-
-# Acknowledgements
-
-We thank colleagues in developmental science and global health for formative discussions on scalable, interpretable video measurement, including prior analyses of macro-coding, measurement scalability, and const
+This file is intentionally kept as a pointer to avoid divergence between multiple manuscript copies.