This module performs automated scene segmentation, person detection, multi-object tracking, and scene-level classification using Ultralytics YOLO11. Compared to previous designs, YOLO11 supports unified detection, pose estimation, and tracking within one model family—reducing dependency overhead and improving performance.
| Stage | Tool/Model | Output |
|---|---|---|
| 1. Scene Segmentation | PySceneDetect | Shot boundaries (start/end timestamps) |
| 2. Scene Classification | CLIP / ImageBind | Scene labels (e.g. “living room”) |
| 3. Person Detection & Pose | YOLO11-pose | Bounding boxes, skeleton keypoints, confidences |
| 4. Object Tracking | YOLO11-pose with track mode (ByteTrack/BoT-SORT integrated) | Consistent person_ids across frames |
| 5. Audio Context (optional) | CLAP / VGGish | Audio tags per scene (e.g. “speech”) |
- Input: Raw video
- Tool: `PySceneDetect`
- Output: List of `{start_time, end_time}` for each scene
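A minimal sketch of this stage using PySceneDetect's `detect` API. The post-hoc `min_scene_length` filter is an assumption about where that option is enforced; PySceneDetect's detectors can also enforce a minimum length natively.

```python
def filter_short_scenes(scenes, min_len=2.0):
    """Drop scenes shorter than min_len seconds; scenes is [(start_s, end_s), ...]."""
    return [(s, e) for s, e in scenes if e - s >= min_len]

def split_scenes(video_path, threshold=30.0, min_len=2.0):
    """Return [{start_time, end_time}, ...] for each detected scene."""
    # Imported here so the pure helper above stays usable without the library.
    from scenedetect import detect, ContentDetector
    scene_list = detect(video_path, ContentDetector(threshold=threshold))
    secs = [(s.get_seconds(), e.get_seconds()) for s, e in scene_list]
    return [{"start_time": s, "end_time": e}
            for s, e in filter_short_scenes(secs, min_len)]
```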
- Input: Keyframes from each scene
- Tool: `CLIP` or `ImageBind`
- Output: Semantic label per scene with confidence
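A zero-shot classification sketch against the configured prompts, assuming the Hugging Face `transformers` CLIP implementation; the model name and the way keyframes are loaded are placeholders, not prescribed by this module.

```python
def best_label(prompts, probs):
    """Pick the highest-probability prompt; probs is a plain list of floats."""
    i = max(range(len(probs)), key=probs.__getitem__)
    return {"label": prompts[i], "confidence": probs[i]}

def classify_keyframe(image, prompts):
    """Score one keyframe (a PIL image) against the scene prompts with CLIP."""
    from transformers import CLIPModel, CLIPProcessor  # optional heavy deps
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0].tolist()
    return best_label(prompts, probs)
```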
- Input: Full video or scenes
- Model: `YOLO11-pose`
- Mode: Run in `track` mode to get detection + tracking in one pass
- Output JSON:
```json
{
  "type": "person_skeleton",
  "video_id": "vid123",
  "t": 12.34,
  "person_id": 1,
  "bbox": [x, y, w, h],
  "keypoints": [
    { "joint": "nose", "x": 123, "y": 456, "conf": 0.98 },
    ...
  ]
}
```

Example `config.yaml`:

```yaml
scene_detect:
  threshold: 30
  min_scene_length: 2.0
clip:
  prompts: ["living room", "clinic", "outdoor", "nursery"]
tracking:
  model: yolo11n-pose
  track_mode: true
  conf_thresh: 0.4
```

Each scene produces:
```json
{
  "scene_id": "scene_02",
  "start": 15.2,
  "end": 32.4,
  "label": "living room",
  "people": [
    {
      "person_id": 1,
      "entry": 15.3,
      "exit": 31.2,
      "trajectory": [
        { "t": 15.3, "bbox": [x1, y1, x2, y2], "keypoints": {...} },
        ...
      ]
    }
  ],
  "audio_tags": ["speech"]
}
```

- Unified model support: Pose estimation, detection, tracking, and optional segmentation via `yolo11-pose` or `yolo11-seg` models in a single package (docs.ultralytics.com, medium.com, ultralytics.com)
- Track mode built-in: Supports object tracking out of the box via `track` mode using ByteTrack or BoT-SORT, removing extra dependencies like DeepSORT
- Efficient and accurate: Offers higher mAP with fewer parameters than YOLOv8, i.e. the same or better accuracy at lower compute overhead (docs.ultralytics.com)
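The single-pass detect-and-track call described above can be sketched with the Ultralytics API. The JSON assembly helper mirrors the `person_skeleton` record; the COCO keypoint ordering and the fixed 30 fps timestamp conversion are assumptions.

```python
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def skeleton_record(video_id, t, person_id, bbox, kp_xy, kp_conf):
    """Build one person_skeleton record from per-person tracker output."""
    return {
        "type": "person_skeleton",
        "video_id": video_id,
        "t": t,
        "person_id": person_id,
        "bbox": list(bbox),  # [x, y, w, h]
        "keypoints": [
            {"joint": name, "x": float(x), "y": float(y), "conf": float(c)}
            for name, (x, y), c in zip(COCO_KEYPOINTS, kp_xy, kp_conf)
        ],
    }

def track_people(video_path, video_id, conf=0.4):
    """Yield person_skeleton records for every tracked person in every frame."""
    from ultralytics import YOLO
    model = YOLO("yolo11n-pose.pt")
    # stream=True yields one Results object per frame; persist=True keeps IDs.
    results = model.track(video_path, conf=conf, tracker="bytetrack.yaml",
                          stream=True, persist=True)
    for frame_idx, r in enumerate(results):
        if r.boxes.id is None:  # no tracked people in this frame
            continue
        t = frame_idx / 30.0  # assumption: known, constant frame rate
        for pid, box, xy, kc in zip(r.boxes.id.tolist(), r.boxes.xywh.tolist(),
                                    r.keypoints.xy.tolist(),
                                    r.keypoints.conf.tolist()):
            yield skeleton_record(video_id, t, int(pid), box, xy, kc)
```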
```
/scene_person_module/
├── scene_splitter/
├── scene_classifier/
├── person_pose_and_track/
├── outputs/
│   ├── scenes/
│   ├── tracking/
│   └── merged_annotations.json
├── config.yaml
└── run_pipeline.py
```
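A sketch of the merge step that `run_pipeline.py` would run last, combining per-stage outputs into the per-scene records shown earlier and writing `merged_annotations.json`; the function names here are illustrative, not part of an existing API.

```python
import json

def merge_scene(scene_id, scene, label, people, audio_tags=None):
    """Combine stage outputs into one scene-level annotation record."""
    return {
        "scene_id": scene_id,
        "start": scene["start_time"],
        "end": scene["end_time"],
        "label": label,
        "people": people,
        "audio_tags": audio_tags or [],
    }

def write_annotations(records, path="outputs/merged_annotations.json"):
    """Serialize all scene records to the merged annotations file."""
    with open(path, "w") as f:
        json.dump(records, f, indent=2)
```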
Leveraging Ultralytics YOLO11, this module simplifies the stack by combining detection, pose, and tracking features in one open-source model. It provides robust, GPU-accelerated annotations and scene context in JSON format, aligning well with your annotation and viewing workflows.