Supported File Formats

Video Annotation Viewer supports industry-standard formats produced by the VideoAnnotator pipeline and other computer vision tools.

Video Files

Supported Formats

  • MP4 (H.264/H.265) - Recommended
  • WebM (VP8/VP9)
  • AVI (various codecs)
  • MOV (QuickTime)

Requirements

  • Maximum file size: 2GB (browser limitation)
  • Recommended resolution: 640x480 to 1920x1080
  • Frame rate: 15-60 fps

Annotation File Formats

1. Person Tracking (COCO Format)

File Extension: .json
Format: COCO keypoint annotations with timestamps

{
  "annotations": [
    {
      "id": 1,
      "image_id": "frame_0001",
      "category_id": 1,
      "keypoints": [
        x1, y1, v1,  // nose
        x2, y2, v2,  // left_eye
        x3, y3, v3,  // right_eye
        // ... 17 keypoints total (x, y, visibility)
      ],
      "bbox": [x, y, width, height],
      "area": 1234.5,
      "score": 0.95,
      "track_id": 1,
      "timestamp": 1.25,
      "frame_number": 37
    }
  ],
  "categories": [
    {
      "id": 1,
      "name": "person",
      "supercategory": "person",
      "keypoints": ["nose", "left_eye", "right_eye", ...],
      "skeleton": [[16, 14], [14, 12], ...]
    }
  ]
}

COCO Keypoint Order (17 points):

  1. nose
  2. left_eye
  3. right_eye
  4. left_ear
  5. right_ear
  6. left_shoulder
  7. right_shoulder
  8. left_elbow
  9. right_elbow
  10. left_wrist
  11. right_wrist
  12. left_hip
  13. right_hip
  14. left_knee
  15. right_knee
  16. left_ankle
  17. right_ankle
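
The flat `keypoints` array can be unpacked into named points using the order above. The following is a minimal Python sketch, not the viewer's own code; the names `COCO_KEYPOINTS`, `Keypoint`, and `unpack_keypoints` are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple

# Keypoint names in the COCO order listed above.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


class Keypoint(NamedTuple):
    x: float
    y: float
    visibility: float


def unpack_keypoints(flat):
    """Turn a flat [x1, y1, v1, x2, y2, v2, ...] list into a name -> Keypoint dict."""
    expected = len(COCO_KEYPOINTS) * 3  # 17 keypoints x 3 values = 51
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")
    return {
        name: Keypoint(*flat[i * 3 : i * 3 + 3])
        for i, name in enumerate(COCO_KEYPOINTS)
    }
```

For example, `unpack_keypoints(annotation["keypoints"])["left_wrist"].x` would give the x coordinate of the left wrist for one annotation.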

2. Speech Recognition (WebVTT)

File Extension: .vtt
Format: Web Video Text Tracks

WEBVTT

NOTE
Generated by VideoAnnotator Speech Recognition Pipeline

00:00:01.000 --> 00:00:03.500
Hello, how are you doing today?

00:00:04.000 --> 00:00:06.200
I'm doing great, thanks for asking.

00:00:07.500 --> 00:00:09.800
That's wonderful to hear!

Format Requirements:

  • Must start with WEBVTT header
  • Timestamps in HH:MM:SS.mmm format
  • Start and end times of each cue separated by -->
  • UTF-8 encoding
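
The requirements above can be checked with a small reader. The sketch below is a simplified Python illustration that only handles the subset of WebVTT shown in this document (header, NOTE blocks, and HH:MM:SS.mmm cue timings); `parse_vtt` is a hypothetical helper, not part of the viewer or of any WebVTT library.

```python
import re

# Matches "HH:MM:SS.mmm --> HH:MM:SS.mmm" at the start of a line.
CUE_TIMING = re.compile(
    r"^(\d{2,}):(\d{2}):(\d{2})\.(\d{3}) --> (\d{2,}):(\d{2}):(\d{2})\.(\d{3})"
)


def _seconds(hours, minutes, seconds, millis):
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_vtt(text):
    """Return a list of (start_seconds, end_seconds, cue_text) tuples."""
    lines = text.lstrip("\ufeff").splitlines()  # tolerate a UTF-8 byte order mark
    if not lines or not lines[0].startswith("WEBVTT"):
        raise ValueError("file must start with WEBVTT")
    cues = []
    i = 1
    while i < len(lines):
        match = CUE_TIMING.match(lines[i].strip())
        if match is None:
            i += 1  # blank lines, NOTE blocks, and cue identifiers are skipped
            continue
        groups = match.groups()
        start, end = _seconds(*groups[:4]), _seconds(*groups[4:])
        i += 1
        payload = []
        while i < len(lines) and lines[i].strip():  # cue text runs until a blank line
            payload.append(lines[i])
            i += 1
        cues.append((start, end, "\n".join(payload)))
    return cues
```

Read the file with `open(path, encoding="utf-8")` so that the UTF-8 requirement is enforced at load time.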

3. Speaker Diarization (RTTM)

File Extension: .rttm
Format: Rich Transcription Time Marked (NIST)

SPEAKER filename 1 0.00 2.50 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER filename 1 3.00 1.80 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER filename 1 5.20 3.10 <NA> <NA> SPEAKER_00 <NA> <NA>

Field Description:

  1. SPEAKER - Record type
  2. filename - File identifier
  3. channel - Channel number (usually 1)
  4. start_time - Start time in seconds
  5. duration - Duration in seconds
  6. <NA> - Orthography (not used)
  7. <NA> - Speaker type (not used)
  8. speaker_id - Speaker identifier (SPEAKER_00, SPEAKER_01, etc.)
  9. <NA> - Confidence (not used)
  10. <NA> - Signal lookahead time (not used)
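
The ten fields map naturally onto a small record type. The following Python sketch assumes whitespace-separated fields as described above; `SpeakerSegment` and `parse_rttm` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpeakerSegment:
    file_id: str
    channel: int
    start: float      # seconds
    duration: float   # seconds
    speaker_id: str

    @property
    def end(self):
        return self.start + self.duration


def parse_rttm(text):
    """Parse SPEAKER records from RTTM text into SpeakerSegment objects."""
    segments = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 10:
            raise ValueError(f"line {line_no}: expected 10 fields, got {len(fields)}")
        if fields[0] != "SPEAKER":
            continue  # assumption: other record types are ignored in this sketch
        segments.append(
            SpeakerSegment(
                file_id=fields[1],
                channel=int(fields[2]),
                start=float(fields[3]),
                duration=float(fields[4]),
                speaker_id=fields[7],
            )
        )
    return segments
```

Because RTTM stores a start time and a duration rather than an end time, the `end` property is derived on demand.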

4. Scene Detection

File Extension: .json
Format: JSON array of scene boundaries

[
  {
    "id": 1,
    "video_id": "sample_video",
    "timestamp": 0.0,
    "start_time": 0.0,
    "end_time": 5.2,
    "duration": 5.2,
    "scene_type": "conversation",
    "bbox": [10, 20, 300, 400],
    "score": 0.89
  },
  {
    "id": 2,
    "video_id": "sample_video", 
    "timestamp": 5.2,
    "start_time": 5.2,
    "end_time": 12.8,
    "duration": 7.6,
    "scene_type": "transition",
    "bbox": [15, 25, 310, 420],
    "score": 0.75
  }
]
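
A typical use of this array is looking up the scene that is active at a given playback time. The Python sketch below shows one way to do that; `scene_at` is a hypothetical helper, and treating each scene as the half-open interval [start_time, end_time) is an assumption made here so that a shared boundary belongs to the later scene.

```python
import bisect


def scene_at(scenes, t):
    """Return the scene whose [start_time, end_time) interval contains t, or None."""
    ordered = sorted(scenes, key=lambda s: s["start_time"])
    starts = [s["start_time"] for s in ordered]
    # Index of the last scene that starts at or before t.
    i = bisect.bisect_right(starts, t) - 1
    if i >= 0 and t < ordered[i]["end_time"]:
        return ordered[i]
    return None
```

With the two scenes shown above, a lookup at 5.2 seconds would return the scene with `"id": 2`.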

5. Audio Files

File Extension: .wav
Format: Uncompressed WAV audio

Requirements:

  • Sample rate: 16kHz - 48kHz
  • Bit depth: 16-bit or 24-bit
  • Channels: Mono or stereo
  • Must match video duration
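
These requirements can be checked before loading with Python's standard `wave` module. This is a sketch only: `check_wav` is a hypothetical helper, the 0.1 second duration tolerance is an assumption rather than a documented threshold, and the `wave` module reads only uncompressed PCM files.

```python
import wave


def check_wav(source, video_duration=None, tolerance=0.1):
    """Return a list of problems found in a WAV file (empty if none).

    source may be a path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate

    problems = []
    if not 16000 <= rate <= 48000:
        problems.append(f"sample rate {rate} Hz is outside 16-48 kHz")
    if bits not in (16, 24):
        problems.append(f"bit depth {bits} is not 16 or 24")
    if channels not in (1, 2):
        problems.append(f"{channels} channels; expected mono or stereo")
    if video_duration is not None and abs(duration - video_duration) > tolerance:
        problems.append(
            f"audio lasts {duration:.3f}s but video lasts {video_duration:.3f}s"
        )
    return problems
```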

File Naming Conventions

VideoAnnotator Standard Naming

VideoAnnotator produces files with consistent naming patterns:

{video_name}_person_tracking.json
{video_name}_speech_recognition.vtt  
{video_name}_speaker_diarization.rttm
{video_name}_scene_detection.json
{video_name}_audio.wav

Example:

sample_video.mp4
sample_video_person_tracking.json
sample_video_speech_recognition.vtt
sample_video_speaker_diarization.rttm
sample_video_scene_detection.json
sample_video_audio.wav
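
Because the suffixes are fixed, a set of files can be grouped by video name with simple string matching. The Python sketch below illustrates the idea; `SUFFIXES`, `split_name`, and `group_by_video` are hypothetical names, and files that do not follow the pattern (including the video itself) are simply skipped.

```python
from pathlib import PurePath

# Suffixes from the naming pattern above, mapped to a short label.
SUFFIXES = {
    "_person_tracking.json": "person_tracking",
    "_speech_recognition.vtt": "speech_recognition",
    "_speaker_diarization.rttm": "speaker_diarization",
    "_scene_detection.json": "scene_detection",
    "_audio.wav": "audio",
}


def split_name(filename):
    """Return (video_name, kind) for a standard name, or None if it does not match."""
    name = PurePath(filename).name
    for suffix, kind in SUFFIXES.items():
        if name.endswith(suffix):
            return name[: -len(suffix)], kind
    return None


def group_by_video(filenames):
    """Group annotation files by the video name they belong to."""
    groups = {}
    for filename in filenames:
        parsed = split_name(filename)
        if parsed is None:
            continue
        video_name, kind = parsed
        groups.setdefault(video_name, {})[kind] = filename
    return groups
```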

Loading Multiple Files

Drag and Drop Interface

  1. Select all files related to your video
  2. Drag them together onto the upload area
  3. The system automatically detects file types
  4. Validation occurs before processing

File Type Detection

The application uses multiple methods to identify file types:

  • Extension-based: .mp4, .vtt, .rttm, .json, .wav
  • Content analysis: JSON structure validation, WebVTT header detection
  • MIME type: Browser-provided content type hints
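
Combining the extension and content checks might look like the Python sketch below. It is an illustration of the approach, not the application's actual detection code: `detect_type` and its return labels are hypothetical, and telling the two `.json` formats apart by their top-level structure (an object with `annotations` versus an array) is an assumption based on the examples in this document.

```python
import json

VIDEO_EXTENSIONS = {"mp4", "webm", "avi", "mov"}


def detect_type(filename, text=None):
    """Guess the file type from its extension and, if given, its text content."""
    ext = filename.lower().rsplit(".", 1)[-1] if "." in filename else ""
    if ext in VIDEO_EXTENSIONS:
        return "video"
    if ext == "wav":
        return "audio"
    if ext == "vtt" or (text and text.lstrip("\ufeff").startswith("WEBVTT")):
        return "speech_recognition"
    if ext == "rttm":
        return "speaker_diarization"
    if ext == "json" and text is not None:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            return "unknown"
        if isinstance(data, dict) and "annotations" in data:
            return "person_tracking"   # COCO-style object
        if isinstance(data, list):
            return "scene_detection"   # array of scene boundaries
    return "unknown"
```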

Validation Errors

Common validation issues and solutions:

COCO Format:

  • ❌ Missing or incomplete keypoints array → Provide all 51 values (17 keypoints × 3)
  • ❌ Invalid timestamp → Must be numeric value in seconds
  • ❌ Missing required fields → Include id, bbox, keypoints

WebVTT Format:

  • ❌ Missing header → File must start with WEBVTT
  • ❌ Invalid timestamps → Use HH:MM:SS.mmm --> HH:MM:SS.mmm format
  • ❌ Text encoding → Use UTF-8 encoding

RTTM Format:

  • ❌ Invalid line format → Must have exactly 10 space-separated fields
  • ❌ Non-numeric times → Start time and duration must be numbers
  • ❌ Missing SPEAKER records → File should contain speaker segments

Scene Detection:

  • ❌ Invalid JSON → Must be valid JSON array
  • ❌ Missing required fields → Include start_time, end_time, scene_type
  • ❌ Negative timestamps → All times must be ≥ 0
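
As an illustration of how such checks can be expressed, the Python sketch below applies the three scene detection rules above and collects messages rather than stopping at the first failure. `validate_scenes` and the message wording are hypothetical and do not reproduce the viewer's actual error output.

```python
import json
from numbers import Real

REQUIRED_FIELDS = ("start_time", "end_time", "scene_type")


def validate_scenes(text):
    """Return a list of validation messages for scene detection JSON (empty if valid)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, list):
        return ["top-level value must be a JSON array"]

    errors = []
    for index, scene in enumerate(data):
        if not isinstance(scene, dict):
            errors.append(f"scene {index}: not an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in scene:
                errors.append(f"scene {index}: missing {field}")
        for field in ("start_time", "end_time"):
            value = scene.get(field)
            # bool is a numeric type in Python, so exclude it explicitly.
            if isinstance(value, Real) and not isinstance(value, bool) and value < 0:
                errors.append(f"scene {index}: {field} must be >= 0")
    return errors
```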

Performance Considerations

File Size Limits

  • Video: 2GB maximum (browser limitation)
  • Annotations: 100MB recommended maximum
  • Audio: 500MB maximum

Optimization Tips

  • Use H.264 encoding for videos
  • Compress large annotation files
  • Remove unnecessary precision from timestamps (3 decimal places sufficient)
  • Use relative paths in file references
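
The timestamp precision tip can be applied when writing annotation files. The Python sketch below rounds only time fields to three decimal places and writes compact JSON; `TIME_KEYS`, `round_times`, and `compact_json` are hypothetical names, and the set of keys treated as timestamps is an assumption based on the field names used in this document.

```python
import json

# Assumption: these are the keys that hold times in seconds.
TIME_KEYS = {"timestamp", "start_time", "end_time", "duration"}


def round_times(value, places=3):
    """Recursively round float values stored under TIME_KEYS; leave the rest untouched."""
    if isinstance(value, dict):
        return {
            key: round(item, places)
            if key in TIME_KEYS and isinstance(item, float)
            else round_times(item, places)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [round_times(item, places) for item in value]
    return value


def compact_json(data, places=3):
    """Serialize with rounded times and no extra whitespace."""
    return json.dumps(round_times(data, places), separators=(",", ":"))
```

Coordinates and scores are left at full precision here, since the tip above concerns timestamps only.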

Export Compatibility

Video Annotation Viewer maintains compatibility with:

  • VideoAnnotator: Native input/output format
  • CVAT: COCO format compatibility
  • Label Studio: JSON format compatibility
  • ELAN: WebVTT subtitle export capability

For questions about specific formats or integration requirements, please refer to the VideoAnnotator documentation.