Video Annotation Viewer supports industry-standard formats produced by the VideoAnnotator pipeline and other computer vision tools.
- MP4 (H.264/H.265) - Recommended
- WebM (VP8/VP9)
- AVI (various codecs)
- MOV (QuickTime)
- Maximum file size: 2GB (browser limitation)
- Recommended resolution: 640x480 to 1920x1080
- Frame rate: 15-60 fps
File Extension: .json
Format: COCO keypoint annotations with timestamps
{
  "annotations": [
    {
      "id": 1,
      "image_id": "frame_0001",
      "category_id": 1,
      "keypoints": [
        x1, y1, v1,  // nose
        x2, y2, v2,  // left_eye
        x3, y3, v3,  // right_eye
        // ... 17 keypoints total (x, y, visibility)
      ],
      "bbox": [x, y, width, height],
      "area": 1234.5,
      "score": 0.95,
      "track_id": 1,
      "timestamp": 1.25,
      "frame_number": 37
    }
  ],
  "categories": [
    {
      "id": 1,
      "name": "person",
      "supercategory": "person",
      "keypoints": ["nose", "left_eye", "right_eye", ...],
      "skeleton": [[16, 14], [14, 12], ...]
    }
  ]
}
COCO Keypoint Order (17 points):
- nose
- left_eye
- right_eye
- left_ear
- right_ear
- left_shoulder
- right_shoulder
- left_elbow
- right_elbow
- left_wrist
- right_wrist
- left_hip
- right_hip
- left_knee
- right_knee
- left_ankle
- right_ankle
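As a rough illustration of how the flat keypoints array relates to the order above, the following Python sketch maps the 51 values to named points. The names `Keypoint` and `unpack_keypoints` are hypothetical helpers for this example, not part of VideoAnnotator or the viewer.

```python
from typing import NamedTuple

# COCO keypoint names in the order listed above.
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


class Keypoint(NamedTuple):
    x: float
    y: float
    visibility: int


def unpack_keypoints(flat):
    """Map a flat [x1, y1, v1, x2, ...] list to {name: Keypoint}."""
    expected = len(COCO_KEYPOINT_NAMES) * 3  # 17 x 3 = 51
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")
    return {
        name: Keypoint(flat[3 * i], flat[3 * i + 1], int(flat[3 * i + 2]))
        for i, name in enumerate(COCO_KEYPOINT_NAMES)
    }
```

With this helper, `unpack_keypoints(annotation["keypoints"])["left_wrist"]` would give the x, y, and visibility values for the left wrist of one tracked person.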
File Extension: .vtt
Format: Web Video Text Tracks
WEBVTT

NOTE
Generated by VideoAnnotator Speech Recognition Pipeline

00:00:01.000 --> 00:00:03.500
Hello, how are you doing today?

00:00:04.000 --> 00:00:06.200
I'm doing great, thanks for asking.

00:00:07.500 --> 00:00:09.800
That's wonderful to hear!
Format Requirements:
- Must start with the WEBVTT header
- Timestamps in HH:MM:SS.mmm format
- Cue separators with -->
- UTF-8 encoding
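To make these requirements concrete, here is a minimal Python sketch that checks the header and extracts cues. It assumes the strict HH:MM:SS.mmm timestamp form described here (the WebVTT standard also permits other forms), and `parse_webvtt` is a hypothetical name, not the viewer's actual parser.

```python
import re

# HH:MM:SS.mmm --> HH:MM:SS.mmm, following the requirements listed here.
CUE_TIMING = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2})\.(\d{3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{3})"
)


def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_webvtt(text):
    """Return a list of (start, end, text) tuples from WebVTT content."""
    lines = text.lstrip("\ufeff").splitlines()  # tolerate a UTF-8 BOM
    if not lines or not lines[0].startswith("WEBVTT"):
        raise ValueError("file must start with the WEBVTT header")
    cues = []
    i = 1
    while i < len(lines):
        match = CUE_TIMING.match(lines[i].strip())
        if match is None:
            i += 1  # skip NOTE blocks, blank lines, and cue identifiers
            continue
        groups = match.groups()
        start, end = to_seconds(*groups[:4]), to_seconds(*groups[4:])
        i += 1
        payload = []
        # Cue text runs until a blank line or the next timing line.
        while (i < len(lines) and lines[i].strip()
               and not CUE_TIMING.match(lines[i].strip())):
            payload.append(lines[i].strip())
            i += 1
        cues.append((start, end, " ".join(payload)))
    return cues
```

Reading the file with `open(path, encoding="utf-8")` before calling `parse_webvtt` would also surface encoding problems early, since a non-UTF-8 file typically raises `UnicodeDecodeError`.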
File Extension: .rttm
Format: Rich Transcription Time Marked (NIST)
SPEAKER filename 1 0.00 2.50 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER filename 1 3.00 1.80 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER filename 1 5.20 3.10 <NA> <NA> SPEAKER_00 <NA> <NA>
Field Description:
- SPEAKER - Record type
- filename - File identifier
- channel - Channel number (usually 1)
- start_time - Start time in seconds
- duration - Duration in seconds
- <NA> - Orthography (not used)
- <NA> - Speaker type (not used)
- speaker_id - Speaker identifier (SPEAKER_00, SPEAKER_01, etc.)
- <NA> - Confidence (not used)
- <NA> - Signal lookahead (not used)
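The field layout can be read with a few lines of Python. This is a sketch under the assumption that every record has exactly ten whitespace-separated fields, as in the sample; `SpeakerTurn` and `parse_rttm` are illustrative names only.

```python
from dataclasses import dataclass


@dataclass
class SpeakerTurn:
    file_id: str
    channel: int
    start: float      # seconds
    duration: float   # seconds
    speaker_id: str

    @property
    def end(self):
        return self.start + self.duration


def parse_rttm(text):
    """Return the SPEAKER records of an RTTM document as SpeakerTurn objects."""
    turns = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 10:
            raise ValueError(f"line {line_no}: expected 10 fields, got {len(fields)}")
        if fields[0] != "SPEAKER":
            continue  # other record types are skipped in this sketch
        turns.append(SpeakerTurn(
            file_id=fields[1],
            channel=int(fields[2]),
            start=float(fields[3]),      # raises ValueError if non-numeric
            duration=float(fields[4]),
            speaker_id=fields[7],
        ))
    return turns
```

Because RTTM stores a start time and a duration rather than an end time, the `end` property is computed on demand for timeline display.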
File Extension: .json
Format: JSON array of scene boundaries
[
  {
    "id": 1,
    "video_id": "sample_video",
    "timestamp": 0.0,
    "start_time": 0.0,
    "end_time": 5.2,
    "duration": 5.2,
    "scene_type": "conversation",
    "bbox": [10, 20, 300, 400],
    "score": 0.89
  },
  {
    "id": 2,
    "video_id": "sample_video",
    "timestamp": 5.2,
    "start_time": 5.2,
    "end_time": 12.8,
    "duration": 7.6,
    "scene_type": "transition",
    "bbox": [15, 25, 310, 420],
    "score": 0.75
  }
]
File Extension: .wav
Format: Uncompressed WAV audio
Requirements:
- Sample rate: 16kHz - 48kHz
- Bit depth: 16-bit or 24-bit
- Channels: Mono or stereo
- Must match video duration
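These requirements can be checked before upload with Python's standard `wave` module, which reads PCM WAV files. The sketch below is only an illustration: `check_wav` is a hypothetical helper, and the 0.1 second duration tolerance is an assumption, since no tolerance is specified here.

```python
import wave


def check_wav(path, video_duration, tolerance=0.1):
    """Return a list of problems found; an empty list means the file passes."""
    problems = []
    # wave.open raises wave.Error if the file is not a readable PCM WAV.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    if not 16000 <= rate <= 48000:
        problems.append(f"sample rate {rate} Hz is outside 16-48 kHz")
    if bits not in (16, 24):
        problems.append(f"bit depth {bits} is not 16 or 24")
    if channels not in (1, 2):
        problems.append(f"{channels} channels; expected mono or stereo")
    # Assumed tolerance: small differences from the video length are accepted.
    if abs(duration - video_duration) > tolerance:
        problems.append(f"audio lasts {duration:.2f}s, video {video_duration:.2f}s")
    return problems
```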
VideoAnnotator produces files with consistent naming patterns:
{video_name}_person_tracking.json
{video_name}_speech_recognition.vtt
{video_name}_speaker_diarization.rttm
{video_name}_scene_detection.json
{video_name}_audio.wav
Example:
sample_video.mp4
sample_video_person_tracking.json
sample_video_speech_recognition.vtt
sample_video_speaker_diarization.rttm
sample_video_scene_detection.json
sample_video_audio.wav
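Given the naming patterns above, the companion file names for a video can be derived from its base name. The following Python sketch shows one way to do that; `expected_companions` and the dictionary keys are hypothetical names chosen for this example.

```python
from pathlib import Path

# Suffixes taken from the naming patterns listed above.
ANNOTATION_SUFFIXES = {
    "person_tracking": "_person_tracking.json",
    "speech_recognition": "_speech_recognition.vtt",
    "speaker_diarization": "_speaker_diarization.rttm",
    "scene_detection": "_scene_detection.json",
    "audio": "_audio.wav",
}


def expected_companions(video_path):
    """Return {kind: file name} for the files expected next to a video."""
    stem = Path(video_path).stem  # "sample_video.mp4" -> "sample_video"
    return {kind: stem + suffix for kind, suffix in ANNOTATION_SUFFIXES.items()}
```

Comparing the values of `expected_companions("sample_video.mp4")` against a directory listing is a quick way to spot a missing annotation file before uploading.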
- Select all files related to your video
- Drag them together onto the upload area
- The system automatically detects file types
- Validation occurs before processing
The application uses multiple methods to identify file types:
- Extension-based: .mp4, .vtt, .rttm, .json, .wav
- Content analysis: JSON structure validation, WebVTT header detection
- MIME type: Browser-provided content type hints
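A simplified Python sketch of the first two methods might look like the following. It is not the application's actual detection logic: `detect_file_type` and its return labels are hypothetical, the JSON rules (an object with "annotations" versus a top-level array) are inferred from the examples in this document, and MIME type hints are left out because they come from the browser.

```python
import json
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4", ".webm", ".avi", ".mov"}


def detect_file_type(filename, content=""):
    """Guess the file type from the extension, falling back to the content."""
    ext = Path(filename).suffix.lower()
    if ext in VIDEO_EXTENSIONS:
        return "video"
    if ext == ".wav":
        return "audio"
    if ext == ".vtt" or content.lstrip("\ufeff").startswith("WEBVTT"):
        return "speech_recognition"
    if ext == ".rttm" or content.startswith("SPEAKER "):
        return "speaker_diarization"
    if ext == ".json":
        try:
            data = json.loads(content)
        except json.JSONDecodeError:
            return "unknown"
        # Person tracking is a COCO object; scene detection is a JSON array.
        if isinstance(data, dict) and "annotations" in data:
            return "person_tracking"
        if isinstance(data, list):
            return "scene_detection"
    return "unknown"
```

The content check matters most for .json files, since person tracking and scene detection share the same extension.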
Common validation issues and solutions:
COCO Format:
- ❌ Missing keypoints array → Ensure 51 values (17 × 3)
- ❌ Invalid timestamp → Must be numeric value in seconds
- ❌ Missing required fields → Include id, bbox, keypoints
WebVTT Format:
- ❌ Missing header → File must start with WEBVTT
- ❌ Invalid timestamps → Use HH:MM:SS.mmm --> HH:MM:SS.mmm format
- ❌ Text encoding → Use UTF-8 encoding
RTTM Format:
- ❌ Invalid line format → Must have exactly 10 space-separated fields
- ❌ Non-numeric times → Start time and duration must be numbers
- ❌ Missing SPEAKER records → File should contain speaker segments
Scene Detection:
- ❌ Invalid JSON → Must be valid JSON array
- ❌ Missing required fields → Include start_time, end_time, scene_type
- ❌ Negative timestamps → All times must be ≥ 0
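The JSON rules above can be checked with a short pre-upload script. The Python sketch below covers the COCO and scene detection rules only; the function names are hypothetical and the checks mirror the list above rather than the viewer's own validator.

```python
from numbers import Real


def is_number(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_coco_annotation(ann):
    """Return a list of error messages for one COCO annotation dict."""
    errors = []
    for field in ("id", "bbox", "keypoints"):
        if field not in ann:
            errors.append(f"missing required field: {field}")
    if "keypoints" in ann:
        keypoints = ann["keypoints"]
        if not isinstance(keypoints, list) or len(keypoints) != 51:
            errors.append("keypoints must hold 51 values (17 x 3)")
    if "timestamp" in ann and not is_number(ann["timestamp"]):
        errors.append("timestamp must be a numeric value in seconds")
    return errors


def validate_scene(scene):
    """Return a list of error messages for one scene detection entry."""
    errors = []
    for field in ("start_time", "end_time", "scene_type"):
        if field not in scene:
            errors.append(f"missing required field: {field}")
    for field in ("start_time", "end_time"):
        value = scene.get(field)
        if is_number(value) and value < 0:
            errors.append(f"{field} must be >= 0")
    return errors
```

Both functions return an empty list for a valid entry, so they can be applied to every element of a file and the messages collected into a single report.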
- Video: 2GB maximum (browser limitation)
- Annotations: 100MB recommended maximum
- Audio: 500MB maximum
- Use H.264 encoding for videos
- Compress large annotation files
- Remove unnecessary precision from timestamps (3 decimal places sufficient)
- Use relative paths in file references
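As one possible way to apply the timestamp and file size tips, the Python sketch below rounds time values to three decimals and writes JSON without extra whitespace. The set of keys treated as timestamps is an assumption based on the examples in this document, and `round_timestamps` and `compact_json` are hypothetical helper names.

```python
import json

# Keys assumed to hold times in seconds, based on the examples in this document.
TIME_KEYS = {"timestamp", "start_time", "end_time", "duration"}


def round_timestamps(data, digits=3):
    """Recursively round float values stored under known time keys."""
    if isinstance(data, dict):
        return {
            key: round(value, digits)
            if key in TIME_KEYS and isinstance(value, float)
            else round_timestamps(value, digits)
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [round_timestamps(item, digits) for item in data]
    return data


def compact_json(data):
    """Serialize with rounded timestamps and no optional whitespace."""
    return json.dumps(round_timestamps(data), separators=(",", ":"))
```

Coordinates and scores are left untouched here; only values under the listed time keys are rounded.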
Video Annotation Viewer maintains compatibility with:
- VideoAnnotator: Native input/output format
- CVAT: COCO format compatibility
- Label Studio: JSON format compatibility
- ELAN: WebVTT subtitle export capability
For questions about specific formats or integration requirements, please refer to the VideoAnnotator documentation.