This module performs automated scene segmentation, person detection, multi-object tracking, and scene-level classification using Ultralytics YOLO11. Compared to previous designs, YOLO11 supports unified detection, pose estimation, and tracking within one model family—reducing dependency overhead and improving performance.
| Stage | Tool/Model | Output |
|---|---|---|
| 1. Scene Segmentation | PySceneDetect | Shot boundaries (start/end timestamps) |
| 2. Scene Classification | CLIP / ImageBind | Scene labels (e.g. “living room”) |
| 3. Person Detection & Pose | YOLO11-pose | Bounding boxes, skeleton keypoints, confidences |
| 4. Object Tracking | YOLO11-pose with track mode (ByteTrack/BoT-SORT integrated) | Consistent person_ids across frames |
| 5. Audio Context (optional) | CLAP / VGGish | Audio tags per scene (e.g. “speech”) |
- Input: Raw video
- Tool: `PySceneDetect`
- Output: List of `{start_time, end_time}` for each scene
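A minimal sketch of this stage using PySceneDetect's `detect` API. The post-hoc `min_scene_length` filter is an assumption about where that option is enforced; PySceneDetect's detectors can also enforce a minimum length natively.

```python
def filter_short_scenes(scenes, min_len=2.0):
    """Drop scenes shorter than min_len seconds; scenes is [(start_s, end_s), ...]."""
    return [(s, e) for s, e in scenes if e - s >= min_len]

def split_scenes(video_path, threshold=30.0, min_len=2.0):
    """Return [{start_time, end_time}, ...] for each detected scene."""
    # Imported here so the pure helper above stays usable without the library.
    from scenedetect import detect, ContentDetector
    scene_list = detect(video_path, ContentDetector(threshold=threshold))
    secs = [(s.get_seconds(), e.get_seconds()) for s, e in scene_list]
    return [{"start_time": s, "end_time": e}
            for s, e in filter_short_scenes(secs, min_len)]
```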
- Input: Keyframes from each scene
- Tool: `CLIP` or `ImageBind`
- Output: Semantic label per scene with confidence
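A zero-shot classification sketch against the configured prompts, assuming the Hugging Face `transformers` CLIP implementation; the model name and the way keyframes are loaded are placeholders, not prescribed by this module.

```python
def best_label(prompts, probs):
    """Pick the highest-probability prompt; probs is a plain list of floats."""
    i = max(range(len(probs)), key=probs.__getitem__)
    return {"label": prompts[i], "confidence": probs[i]}

def classify_keyframe(image, prompts):
    """Score one keyframe (a PIL image) against the scene prompts with CLIP."""
    from transformers import CLIPModel, CLIPProcessor  # optional heavy deps
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0].tolist()
    return best_label(prompts, probs)
```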
- Input: Full video or scenes
- Model: `YOLO11-pose`
- Mode: Run in `track` mode to get detection + tracking in one pass
- Output JSON:
```json
{
  "type": "person_skeleton",
  "video_id": "vid123",
  "t": 12.34,
  "person_id": 1,
  "bbox": [x, y, w, h],
  "keypoints": [
    { "joint": "nose", "x": 123, "y": 456, "conf": 0.98 },
    ...
  ]
}
```

Example `config.yaml`:

```yaml
scene_detect:
  threshold: 30
  min_scene_length: 2.0
clip:
  prompts: ["living room", "clinic", "outdoor", "nursery"]
tracking:
  model: yolo11n-pose
  track_mode: true
  conf_thresh: 0.4
```

Each scene produces:
```json
{
  "scene_id": "scene_02",
  "start": 15.2,
  "end": 32.4,
  "label": "living room",
  "people": [
    {
      "person_id": 1,
      "entry": 15.3,
      "exit": 31.2,
      "trajectory": [
        { "t": 15.3, "bbox": [x1, y1, x2, y2], "keypoints": {...} },
        ...
      ]
    }
  ],
  "audio_tags": ["speech"]
}
```

- Unified model support: Pose estimation, detection, tracking, and optional segmentation via `yolo11-pose` or `yolo11-seg` models in a single package (docs.ultralytics.com, medium.com, ultralytics.com)
- Track mode built-in: Supports object tracking out of the box via `track` mode using ByteTrack or BoT-SORT, removing extra dependencies like DeepSORT
- Efficient and accurate: Offers higher mAP with fewer parameters than YOLOv8, i.e. the same or better accuracy at lower compute overhead (docs.ultralytics.com)
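The single-pass detect-and-track call described above can be sketched with the Ultralytics API. The JSON assembly helper mirrors the `person_skeleton` record; the COCO keypoint ordering and the fixed 30 fps timestamp conversion are assumptions.

```python
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def skeleton_record(video_id, t, person_id, bbox, kp_xy, kp_conf):
    """Build one person_skeleton record from per-person tracker output."""
    return {
        "type": "person_skeleton",
        "video_id": video_id,
        "t": t,
        "person_id": person_id,
        "bbox": list(bbox),  # [x, y, w, h]
        "keypoints": [
            {"joint": name, "x": float(x), "y": float(y), "conf": float(c)}
            for name, (x, y), c in zip(COCO_KEYPOINTS, kp_xy, kp_conf)
        ],
    }

def track_people(video_path, video_id, conf=0.4):
    """Yield person_skeleton records for every tracked person in every frame."""
    from ultralytics import YOLO
    model = YOLO("yolo11n-pose.pt")
    # stream=True yields one Results object per frame; persist=True keeps IDs.
    results = model.track(video_path, conf=conf, tracker="bytetrack.yaml",
                          stream=True, persist=True)
    for frame_idx, r in enumerate(results):
        if r.boxes.id is None:  # no tracked people in this frame
            continue
        t = frame_idx / 30.0  # assumption: known, constant frame rate
        for pid, box, xy, kc in zip(r.boxes.id.tolist(), r.boxes.xywh.tolist(),
                                    r.keypoints.xy.tolist(),
                                    r.keypoints.conf.tolist()):
            yield skeleton_record(video_id, t, int(pid), box, xy, kc)
```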
```
/scene_person_module/
├── scene_splitter/
├── scene_classifier/
├── person_pose_and_track/
├── outputs/
│   ├── scenes/
│   ├── tracking/
│   └── merged_annotations.json
├── config.yaml
└── run_pipeline.py
```
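A sketch of the merge step that `run_pipeline.py` would run last, combining per-stage outputs into the per-scene records shown earlier and writing `merged_annotations.json`; the function names here are illustrative, not part of an existing API.

```python
import json

def merge_scene(scene_id, scene, label, people, audio_tags=None):
    """Combine stage outputs into one scene-level annotation record."""
    return {
        "scene_id": scene_id,
        "start": scene["start_time"],
        "end": scene["end_time"],
        "label": label,
        "people": people,
        "audio_tags": audio_tags or [],
    }

def write_annotations(records, path="outputs/merged_annotations.json"):
    """Serialize all scene records to the merged annotations file."""
    with open(path, "w") as f:
        json.dump(records, f, indent=2)
```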
Leveraging Ultralytics YOLO11, this module simplifies the stack by combining detection, pose, and tracking features in one open-source model. It provides robust, GPU-accelerated annotations and scene context in JSON format, aligning well with your annotation and viewing workflows.