@@ -110,29 +110,35 @@ Additional Specs:
 - CLI Validation: `uv run videoannotator validate-emotion path/to/file.emotion.json` returns non-zero exit on failure
 Client tools (e.g. the Video Annotation Viewer) should rely on those sources or the `/api/v1/pipelines` endpoint rather than hard-coding pipeline assumptions.
 
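For example, a client can pair the validation CLI with the pipelines endpoint instead of hard-coding assumptions. The sketch below is illustrative rather than repository code: it assumes the API server is reachable on port 18011 (the port used in the Docker example) and makes no assumption about the response schema.

```python
import subprocess

import requests

# The validate command signals failure through a non-zero exit code.
result = subprocess.run(
    ["uv", "run", "videoannotator", "validate-emotion", "path/to/file.emotion.json"]
)
print("emotion file valid" if result.returncode == 0 else "emotion file failed validation")

# Ask the running server which pipelines it actually exposes.
resp = requests.get("http://localhost:18011/api/v1/pipelines", timeout=10)
resp.raise_for_status()
print(resp.json())  # inspect the documented schema rather than assuming one
```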
-### **Person Tracking Pipeline**
+### Person Tracking (1 pipeline)
 
-- **Technology**: YOLO11 + ByteTrack multi-object tracking
-- **Outputs**: Bounding boxes, pose keypoints, persistent person IDs
-- **Use cases**: Movement analysis, social interaction tracking, activity recognition
+| Pipeline | Technology | Outputs | Stability |
+|----------|------------|---------|-----------|
+| **Person Tracking & Pose** | YOLO11 + ByteTrack | COCO bounding boxes, 17-point pose keypoints, persistent person IDs | beta |
 
-### **Face Analysis Pipeline**
+### Face Analysis (3 pipelines)
 
-- **Technology**: [OpenFace 3.0](https://github.com/CMU-MultiComp-Lab/OpenFace-3.0), LAION Face ([LAION](https://laion.ai/)), OpenCV backends
-- **Outputs**: 68-point landmarks, emotions, action units, gaze direction, head pose
-- **Use cases**: Emotional analysis, attention tracking, facial expression studies
+| Pipeline | Technology | Outputs | Stability |
+|----------|------------|---------|-----------|
+| **Face Analysis** | DeepFace (TensorFlow/OpenCV) | Emotion labels, age/gender, action units | stable |
+| **LAION CLIP Face Embedding** | LAION CLIP-derived model | 512-D semantic embeddings, zero-shot attribute & emotion tagging | experimental |
+| **OpenFace3 Face Embedding** | OpenFace 3.0 (ONNX/PyTorch) | 512-D face embeddings for recognition or clustering | experimental |
 
-### **Scene Detection Pipeline**
+### Scene Detection (1 pipeline)
 
-- **Technology**: PySceneDetect + CLIP environment classification
-- **Outputs**: Scene boundaries, environment labels, temporal segmentation
-- **Use cases**: Context analysis, setting classification, behavioral context
+| Pipeline | Technology | Outputs | Stability |
+|----------|------------|---------|-----------|
+| **Scene Detection** | PySceneDetect + CLIP | Scene boundaries, environment classification, temporal segmentation | beta |
 
-### **Audio Processing Pipeline**
+### Audio Processing (4 pipelines + 1 combined)
 
-- **Technology**: OpenAI Whisper + pyannote speaker diarization
-- **Outputs**: Speech transcripts, speaker identification, voice emotions
-- **Use cases**: Conversation analysis, language development, vocal behavior
+| Pipeline | Technology | Outputs | Stability |
+|----------|------------|---------|-----------|
+| **Speech Recognition** | OpenAI Whisper | WebVTT transcripts with word-level timestamps | stable |
+| **Speaker Diarization** | pyannote.audio | RTTM speaker turns with timestamps | stable |
+| **Audio Processing** | Whisper + pyannote (combined) | WebVTT transcripts + RTTM speaker turns | beta |
+| **LAION Empathic Voice** | LAION Empathic Insight + Whisper embeddings | Emotion segments, empathic scores, emotion timeline | stable |
+| **Voice Emotion Baseline** | Spectral CNN over Whisper embeddings | _(planned; not yet implemented)_ | experimental |
 
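The transcript and diarization outputs above use plain-text standards (WebVTT and RTTM), so downstream tools can read them without VideoAnnotator-specific code. The following sketch is illustrative only; the file name is hypothetical, and it assumes the conventional ten-field RTTM layout for `SPEAKER` records.

```python
from pathlib import Path


def read_rttm(path: str) -> list[dict]:
    """Collect speaker turns (onset, duration, speaker id) from an RTTM file."""
    turns = []
    for line in Path(path).read_text().splitlines():
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip blank lines and non-speaker records
        turns.append({
            "onset": float(fields[3]),     # turn start in seconds
            "duration": float(fields[4]),  # turn length in seconds
            "speaker": fields[7],          # speaker label assigned by diarization
        })
    return turns


# Example use: total speaking time per speaker (hypothetical output file name).
totals: dict[str, float] = {}
for turn in read_rttm("path/to/file.diarization.rttm"):
    totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + turn["duration"]
print(totals)
```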
 ## 💡 Why VideoAnnotator?
 
@@ -275,8 +281,11 @@ docker run -p 18011:18011 --gpus all videoannotator:dev
 
 - **FastAPI** - High-performance REST API with automatic documentation
 - **YOLO11** - State-of-the-art object detection and pose estimation
-- **OpenFace 3.0** - Comprehensive facial behavior analysis
+- **DeepFace / OpenFace 3.0 / LAION CLIP** - Facial analysis, embeddings, and emotion recognition
 - **Whisper** - Robust speech recognition and transcription
+- **pyannote.audio** - Speaker diarization and segmentation
+- **LAION Empathic Insight** - Voice emotion analysis from Whisper embeddings
+- **PySceneDetect + CLIP** - Scene boundary detection and environment classification
 - **PyTorch** - GPU-accelerated machine learning inference
 
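Because inference runs on PyTorch and the Docker example passes `--gpus all`, it is worth confirming that the environment actually sees a GPU. This is a generic PyTorch check, not VideoAnnotator-specific code, and it does not assume how the pipelines choose their device.

```python
import torch

# Report whether CUDA is visible; per-pipeline device selection is not assumed here.
if torch.cuda.is_available():
    print(f"CUDA device available: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device visible; inference will fall back to CPU where pipelines support it.")
```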
 ### **Performance Characteristics**
@@ -340,9 +349,13 @@ MIT License - Full terms in [LICENSE](LICENSE)
 
 Built with and grateful to:
 
-- **[YOLO & Ultralytics](https://ultralytics.com/)** - Object detection and tracking
-- **[OpenFace 3.0](https://github.com/CMU-MultiComp-Lab/OpenFace-3.0)** - Facial behavior analysis
+- **[YOLO & Ultralytics](https://ultralytics.com/)** - Object detection, tracking, and pose estimation
+- **[DeepFace](https://github.com/serengil/deepface)** - Face detection and emotion recognition
+- **[OpenFace 3.0](https://github.com/CMU-MultiComp-Lab/OpenFace-3.0)** - Facial behavior analysis and embeddings
+- **[LAION](https://laion.ai/)** - CLIP face embeddings and empathic voice emotion models
 - **[OpenAI Whisper](https://github.com/openai/whisper)** - Speech recognition
+- **[pyannote.audio](https://github.com/pyannote/pyannote-audio)** - Speaker diarization
+- **[PySceneDetect](https://www.scenedetect.com/)** - Scene boundary detection
 - **[FastAPI](https://github.com/tiangolo/fastapi)** - Modern web framework
 - **[PyTorch](https://pytorch.org/)** - Machine learning infrastructure
 