Supported File Formats

Video Annotation Viewer supports industry-standard formats produced by the VideoAnnotator pipeline and other computer vision tools.

Video Files

Supported Formats

MP4 (H.264/H.265) - Recommended
WebM (VP8/VP9)
AVI (various codecs)
MOV (QuickTime)

Requirements

Maximum file size: 2GB (browser limitation)
Recommended resolution: 640x480 to 1920x1080
Frame rate: 15-60 fps

Annotation File Formats

1. Person Tracking (COCO Format)

File Extension: .json
Format: COCO keypoint annotations with timestamps

{
  "annotations": [
    {
      "id": 1,
      "image_id": "frame_0001",
      "category_id": 1,
      "keypoints": [
        x1, y1, v1,  // nose
        x2, y2, v2,  // left_eye
        x3, y3, v3,  // right_eye
        // ... 17 keypoints total (x, y, visibility)
      ],
      "bbox": [x, y, width, height],
      "area": 1234.5,
      "score": 0.95,
      "track_id": 1,
      "timestamp": 1.25,
      "frame_number": 37
    }
  ],
  "categories": [
    {
      "id": 1,
      "name": "person",
      "supercategory": "person",
      "keypoints": ["nose", "left_eye", "right_eye", ...],
      "skeleton": [[16, 14], [14, 12], ...]
    }
  ]
}

COCO Keypoint Order (17 points):

nose
left_eye
right_eye
left_ear
right_ear
left_shoulder
right_shoulder
left_elbow
right_elbow
left_wrist
right_wrist
left_hip
right_hip
left_knee
right_knee
left_ankle
right_ankle
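
The flat `keypoints` array can be unpacked into named triples using the order above. This is a minimal sketch, not code from the viewer itself; the function name `unpack_keypoints` is hypothetical.

```python
from typing import Dict, List, Tuple

# Keypoint names in the order listed above.
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def unpack_keypoints(flat: List[float]) -> Dict[str, Tuple[float, float, int]]:
    """Map a flat [x1, y1, v1, x2, y2, v2, ...] list to named (x, y, v) triples.

    Hypothetical helper for illustration only.
    """
    expected = len(COCO_KEYPOINT_NAMES) * 3  # 17 keypoints x 3 values = 51
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")
    return {
        name: (flat[3 * i], flat[3 * i + 1], int(flat[3 * i + 2]))
        for i, name in enumerate(COCO_KEYPOINT_NAMES)
    }
```

For example, `unpack_keypoints(annotation["keypoints"])["left_wrist"]` gives the `(x, y, visibility)` triple for the left wrist.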

2. Speech Recognition (WebVTT)

File Extension: .vtt
Format: Web Video Text Tracks

WEBVTT

NOTE
Generated by VideoAnnotator Speech Recognition Pipeline

00:00:01.000 --> 00:00:03.500
Hello, how are you doing today?

00:00:04.000 --> 00:00:06.200
I'm doing great, thanks for asking.

00:00:07.500 --> 00:00:09.800
That's wonderful to hear!

Format Requirements:

Must start with WEBVTT header
Timestamps in HH:MM:SS.mmm format
Cue separators with -->
UTF-8 encoding
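
The requirements above can be checked with a small parser. The sketch below is an illustration under stated assumptions, not the viewer's actual parser: the names `parse_vtt` and `_seconds` are hypothetical, cue settings and styling blocks are ignored, and the timestamp pattern also accepts the `MM:SS.mmm` form that the WebVTT specification permits.

```python
import re
from typing import List, Optional, Tuple

# One timestamp: optional hours, then MM:SS.mmm.
_TS = r"(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})"
# A cue timing line: "<start> --> <end>".
_CUE_TIMING = re.compile(rf"^{_TS}\s+-->\s+{_TS}")


def _seconds(h: Optional[str], m: str, s: str, ms: str) -> float:
    """Convert timestamp components to seconds."""
    return int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_vtt(text: str) -> List[Tuple[float, float, str]]:
    """Return (start, end, text) for each cue (hypothetical helper)."""
    text = text.lstrip("\ufeff")  # tolerate a UTF-8 byte order mark
    if not text.startswith("WEBVTT"):
        raise ValueError("file must start with the WEBVTT header")
    cues = []
    # Blocks (header, NOTE comments, cues) are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n")):
        lines = block.strip().split("\n")
        for i, line in enumerate(lines):
            match = _CUE_TIMING.match(line)
            if match:
                g = match.groups()
                payload = "\n".join(lines[i + 1:])
                cues.append((_seconds(*g[:4]), _seconds(*g[4:]), payload))
                break  # blocks without a timing line (e.g. NOTE) are skipped
    return cues
```

Read the file with `open(path, encoding="utf-8")` so that the UTF-8 requirement is enforced at load time.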

3. Speaker Diarization (RTTM)

File Extension: .rttm
Format: Rich Transcription Time Marked (NIST)

SPEAKER filename 1 0.00 2.50 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER filename 1 3.00 1.80 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER filename 1 5.20 3.10 <NA> <NA> SPEAKER_00 <NA> <NA>

Field Description:

SPEAKER - Record type
filename - File identifier
channel - Channel number (usually 1)
start_time - Start time in seconds
duration - Duration in seconds
<NA> - Orthography (not used)
<NA> - Speaker type (not used)
speaker_id - Speaker identifier (SPEAKER_00, SPEAKER_01, etc.)
<NA> - Confidence (not used)
<NA> - Signal lookahead time (not used)
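
The ten fields described above map directly onto a small record type. The following is a minimal sketch; `SpeakerTurn` and `parse_rttm` are hypothetical names, and non-SPEAKER record types are simply skipped, which is an assumption rather than documented viewer behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeakerTurn:
    """One SPEAKER record (hypothetical type for illustration)."""
    file_id: str
    channel: int
    start: float      # seconds
    duration: float   # seconds
    speaker_id: str

    @property
    def end(self) -> float:
        return self.start + self.duration


def parse_rttm(text: str) -> List[SpeakerTurn]:
    """Parse RTTM text into speaker turns, ignoring the unused <NA> fields."""
    turns = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 10:
            raise ValueError(f"line {line_no}: expected 10 fields, got {len(fields)}")
        if fields[0] != "SPEAKER":
            continue  # assumption: other record types are ignored
        turns.append(
            SpeakerTurn(
                file_id=fields[1],
                channel=int(fields[2]),
                start=float(fields[3]),
                duration=float(fields[4]),
                speaker_id=fields[7],
            )
        )
    return turns
```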

4. Scene Detection

File Extension: .json
Format: JSON array of scene boundaries

[
  {
    "id": 1,
    "video_id": "sample_video",
    "timestamp": 0.0,
    "start_time": 0.0,
    "end_time": 5.2,
    "duration": 5.2,
    "scene_type": "conversation",
    "bbox": [10, 20, 300, 400],
    "score": 0.89
  },
  {
    "id": 2,
    "video_id": "sample_video",
    "timestamp": 5.2,
    "start_time": 5.2,
    "end_time": 12.8,
    "duration": 7.6,
    "scene_type": "transition",
    "bbox": [15, 25, 310, 420],
    "score": 0.75
  }
]
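
A typical use of this array is finding the scene that is active at the current playback time. The sketch below assumes each scene covers the half-open interval from `start_time` up to, but not including, `end_time`; that convention and the name `scene_at` are assumptions, not documented behavior.

```python
from typing import Any, Dict, List, Optional


def scene_at(scenes: List[Dict[str, Any]], t: float) -> Optional[Dict[str, Any]]:
    """Return the first scene whose [start_time, end_time) interval contains t.

    Hypothetical helper; the half-open interval is an assumption.
    """
    for scene in scenes:
        if scene["start_time"] <= t < scene["end_time"]:
            return scene
    return None  # t falls outside every scene
```

With the example above loaded through `json.load`, `scene_at(scenes, 5.2)` returns the scene with `id` 2, because the boundary time belongs to the scene that starts there.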

5. Audio Files

File Extension: .wav
Format: Uncompressed WAV audio

Requirements:

Sample rate: 16kHz - 48kHz
Bit depth: 16-bit or 24-bit
Channels: Mono or stereo
Must match video duration
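
These requirements can be checked before upload with Python's standard `wave` module. This is a sketch only: the function name `check_wav` and the 0.1 second duration tolerance are assumptions, since the document does not say how closely the durations must match.

```python
import wave
from typing import List


def check_wav(path: str, video_duration: float, tolerance: float = 0.1) -> List[str]:
    """Return a list of problems found in a PCM WAV file (empty if none).

    Hypothetical helper; `tolerance` (seconds) is an assumed value.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate

    problems = []
    if not 16_000 <= rate <= 48_000:
        problems.append(f"sample rate {rate} Hz is outside 16 kHz to 48 kHz")
    if bits not in (16, 24):
        problems.append(f"bit depth {bits} is not 16-bit or 24-bit")
    if channels not in (1, 2):
        problems.append(f"{channels} channels; expected mono or stereo")
    if abs(duration - video_duration) > tolerance:
        problems.append(f"audio lasts {duration:.3f}s but video lasts {video_duration:.3f}s")
    return problems
```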

File Naming Conventions

VideoAnnotator Standard Naming

VideoAnnotator produces files with consistent naming patterns:

{video_name}_person_tracking.json
{video_name}_speech_recognition.vtt  
{video_name}_speaker_diarization.rttm
{video_name}_scene_detection.json
{video_name}_audio.wav

Example:

sample_video.mp4
sample_video_person_tracking.json
sample_video_speech_recognition.vtt
sample_video_speaker_diarization.rttm
sample_video_scene_detection.json
sample_video_audio.wav
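
Given a video path, the expected companion file names follow mechanically from the patterns above. A minimal sketch, with hypothetical names `ANNOTATION_SUFFIXES` and `expected_companions`:

```python
from pathlib import Path
from typing import Dict

# Suffixes taken from the naming patterns listed above.
ANNOTATION_SUFFIXES = {
    "person_tracking": "_person_tracking.json",
    "speech_recognition": "_speech_recognition.vtt",
    "speaker_diarization": "_speaker_diarization.rttm",
    "scene_detection": "_scene_detection.json",
    "audio": "_audio.wav",
}


def expected_companions(video_path: str) -> Dict[str, Path]:
    """Return the companion file path for each annotation kind (hypothetical helper)."""
    video = Path(video_path)
    return {
        kind: video.with_name(video.stem + suffix)
        for kind, suffix in ANNOTATION_SUFFIXES.items()
    }
```

For `sample_video.mp4` this yields `sample_video_person_tracking.json`, `sample_video_speech_recognition.vtt`, and so on, in the same directory as the video.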

Loading Multiple Files

Drag and Drop Interface

Select all files related to your video
Drag them together onto the upload area
The system automatically detects file types
Validation occurs before processing

File Type Detection

The application uses multiple methods to identify file types:

Extension-based: .mp4, .vtt, .rttm, .json, .wav
Content analysis: JSON structure validation, WebVTT header detection
MIME type: Browser-provided content type hints
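
The first two methods can be combined roughly as follows. This is an illustrative sketch, not the application's detection code: `detect_file_type` is a hypothetical name, the video extension set is inferred from the supported formats listed earlier, the JSON checks are simplified assumptions, and MIME type hints are left out.

```python
import json
from pathlib import Path

# Assumed from the "Supported Formats" list; not confirmed detection rules.
VIDEO_EXTENSIONS = {".mp4", ".webm", ".avi", ".mov"}


def detect_file_type(name: str, content: str = "") -> str:
    """Guess a file's role from its extension, falling back to its content."""
    ext = Path(name).suffix.lower()
    if ext in VIDEO_EXTENSIONS:
        return "video"
    if ext == ".wav":
        return "audio"
    if ext == ".vtt" or content.lstrip("\ufeff").startswith("WEBVTT"):
        return "speech_recognition"
    if ext == ".rttm":
        return "speaker_diarization"
    if ext == ".json":
        try:
            data = json.loads(content)
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            return "unknown"
        # Assumption: a COCO file is an object with "annotations",
        # while scene detection output is a top-level array.
        if isinstance(data, dict) and "annotations" in data:
            return "person_tracking"
        if isinstance(data, list):
            return "scene_detection"
    return "unknown"
```

Content analysis matters most for `.json`, because person tracking and scene detection share that extension.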

Validation Errors

Common validation issues and solutions:

COCO Format:

❌ Missing keypoints array → Ensure 51 values (17 × 3)
❌ Invalid timestamp → Must be numeric value in seconds
❌ Missing required fields → Include id, bbox, keypoints

WebVTT Format:

❌ Missing header → File must start with WEBVTT
❌ Invalid timestamps → Use HH:MM:SS.mmm --> HH:MM:SS.mmm format
❌ Text encoding → Use UTF-8 encoding

RTTM Format:

❌ Invalid line format → Must have exactly 10 space-separated fields
❌ Non-numeric times → Start time and duration must be numbers
❌ Missing SPEAKER records → File should contain speaker segments

Scene Detection:

❌ Invalid JSON → Must be valid JSON array
❌ Missing required fields → Include start_time, end_time, scene_type
❌ Negative timestamps → All times must be ≥ 0
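
Several of these checks are easy to run before upload. The sketch below covers the COCO and scene detection rules listed above; the function names and error strings are hypothetical and will not match the viewer's own messages.

```python
from numbers import Real
from typing import Any, Dict, List


def _is_number(value: Any) -> bool:
    """True for ints and floats, but not booleans."""
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_coco_annotation(ann: Dict[str, Any]) -> List[str]:
    """Check one COCO annotation against the rules listed above."""
    errors = [f"missing required field: {f}" for f in ("id", "bbox", "keypoints") if f not in ann]
    keypoints = ann.get("keypoints")
    if keypoints is not None and len(keypoints) != 51:  # 17 keypoints x 3 values
        errors.append(f"keypoints must have 51 values, got {len(keypoints)}")
    if "timestamp" in ann and not _is_number(ann["timestamp"]):
        errors.append("timestamp must be a numeric value in seconds")
    return errors


def validate_scene(scene: Dict[str, Any]) -> List[str]:
    """Check one scene entry against the rules listed above."""
    required = ("start_time", "end_time", "scene_type")
    errors = [f"missing required field: {f}" for f in required if f not in scene]
    for field in ("start_time", "end_time"):
        value = scene.get(field)
        if _is_number(value) and value < 0:
            errors.append(f"{field} must be >= 0")
    return errors
```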

Performance Considerations

File Size Limits

Video: 2GB maximum (browser limitation)
Annotations: 100MB recommended maximum
Audio: 500MB maximum

Optimization Tips

Use H.264 encoding for videos
Compress large annotation files
Remove unnecessary precision from timestamps (three decimal places, i.e. millisecond resolution, are sufficient)
Use relative paths in file references
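
The timestamp tip can be applied with a short pre-processing step. This sketch rounds only the time fields that appear in the formats above and leaves other values such as `score` untouched; `TIME_KEYS` and `round_times` are hypothetical names.

```python
import json
from typing import Any

# Time-valued keys used in the annotation formats described in this document.
TIME_KEYS = {"timestamp", "start_time", "end_time", "duration"}


def round_times(node: Any, ndigits: int = 3) -> Any:
    """Return a copy of a parsed JSON structure with time fields rounded."""
    if isinstance(node, dict):
        return {
            key: round(value, ndigits)
            if key in TIME_KEYS and isinstance(value, float)
            else round_times(value, ndigits)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [round_times(item, ndigits) for item in node]
    return node


def to_compact_json(data: Any) -> str:
    """Serialize with rounded times and no extra whitespace."""
    return json.dumps(round_times(data), separators=(",", ":"))
```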

Export Compatibility

Video Annotation Viewer maintains compatibility with:

VideoAnnotator: Native input/output format
CVAT: COCO format compatibility
Label Studio: JSON format compatibility
ELAN: WebVTT subtitle export capability

For questions about specific formats or integration requirements, please refer to the VideoAnnotator documentation.
