Temporal event boundary detection for video using UBoCo (contrastive kernel), Qwen3-VL segmenter, and Qwen3-Omni describer.
For a fast handoff flow, start with QUICKSTART.md. For canonical commands only, see PIPELINES.md. Non-core and research/analysis scripts are organized under extra/.
This project provides tools for detecting event boundaries in videos:
- UBoCo – Unsupervised Boundary Contrastive Learning (Kang et al., CVPR 2022) with RTP contrastive kernel boundary detection
- Qwen segment – Qwen3-VL sliding-window binary boundary prediction
- Qwen describer – Qwen3-Omni detailed video understanding (scene descriptions, audio transcription)
Outputs can be evaluated against reference boundaries using evaluate_boundaries.py.
Collaborators must create their own environment. The verified setup below was run from a local env at ../envs/bseg (outside this repo).
Exact versions observed:
| Package | Version |
|---|---|
| Python | 3.12.12 |
| pip | 25.2 |
| torch | 2.8.0 |
| torchvision | 0.23.0 |
| transformers | 4.57.0 |
| opencv-python-headless | 4.12.0.88 |
| numpy | 2.2.6 |
| pandas | 2.3.2 |
| librosa | 0.11.0 |
| soundfile | 0.13.1 |
| tqdm | 4.67.1 |
| hmmlearn | 0.3.3 |
| scikit-learn | 1.7.2 |
| matplotlib | 3.10.6 |
| seaborn | 0.13.2 |
Additional: Pillow (imported as PIL), qwen-vl-utils (for Qwen3-VL), qwen_omni_utils (for Qwen3-Omni). Optional: ffmpeg for audio/video extraction.
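To compare a freshly created environment against the versions in the table, a small check along the following lines may help. It is a sketch, not a script shipped with this repo: the names `EXPECTED`, `installed_version`, and `find_mismatches` are hypothetical, and it assumes the table names match the installed distribution names.

```python
from importlib import metadata

# Pins copied from the version table (Python and pip themselves are omitted).
EXPECTED = {
    "torch": "2.8.0",
    "torchvision": "0.23.0",
    "transformers": "4.57.0",
    "opencv-python-headless": "4.12.0.88",
    "numpy": "2.2.6",
    "pandas": "2.3.2",
    "librosa": "0.11.0",
    "soundfile": "0.13.1",
    "tqdm": "4.67.1",
    "hmmlearn": "0.3.3",
    "scikit-learn": "1.7.2",
    "matplotlib": "3.10.6",
    "seaborn": "0.13.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # Assumption: a local build tag such as "+cu128" still counts as a match.
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{package}: expected {wanted}, found {found or 'not installed'}")
```

Run it with the environment's own interpreter; no output means every listed package matches its pin.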
Place the Sherlock intro clip at:
boundry_segmentation/sherlock.mp4
(or use an absolute path when invoking scripts). The clip is used for sanity checks and short eval runs.
Run commands either from inside boundry_segmentation/ (e.g. python uboco_gebd.py sherlock.mp4 ...) or from the parent directory with a prefixed path (e.g. python boundry_segmentation/uboco_gebd.py boundry_segmentation/sherlock.mp4 ...).
python uboco_gebd.py sherlock.mp4 \
--boundary_method peaks \
--rtp_kernel_size 5 \
--rtp_min_length 50 \
--rtp_threshold_diff 0.3 \
--rtp_max_depth 3 \
--rtp_max_boundaries 30 \
--peaks_distance 30 \
--peaks_prominence 0.6 \
--peaks_max_boundaries 25 \
--n_epochs 2 \
--end_time 60 \
--output_dir outputs/sanity_uboco

python qwen.py sherlock.mp4 \
--model Qwen/Qwen3-VL-30B-A3B-Instruct \
--response-mode binary \
--end-time 10 \
--output-dir outputs/sanity_qwen_segment

Quick sanity alternative (smaller model):
python qwen.py sherlock.mp4 \
--model Qwen/Qwen3-VL-2B-Instruct \
--response-mode binary \
--end-time 10 \
--output-dir outputs/sanity_qwen_segment

python qwen_omni_describer.py sherlock.mp4 \
--model Qwen/Qwen3-Omni-30B-A3B-Instruct \
--debug-save outputs/sanity_qwen_describer/debug_first_window \
--end-time 8 \
--window-size 4 \
--stride 4 \
--sample-fps 1 \
--output-dir outputs/sanity_qwen_describer

The describer can be slow on a first run because Qwen3-Omni-30B has a heavy cold load plus generation.
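To see what the --end-time, --window-size, and --stride values in the describer command imply, the sketch below enumerates the windows they would produce. It is only an illustration: `sliding_windows` is a hypothetical helper, and clipping the last partial window (rather than dropping it) is an assumption, not confirmed behavior of qwen_omni_describer.py.

```python
def sliding_windows(end_time, window_size, stride, start_time=0.0):
    """Return (start, end) pairs in seconds covering start_time..end_time."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    end_time = float(end_time)
    windows = []
    start = float(start_time)
    while start < end_time:
        # Assumption: a final partial window is clipped, not dropped.
        windows.append((start, min(start + window_size, end_time)))
        start += stride
    return windows


# Values from the describer sanity command: --end-time 8 --window-size 4 --stride 4
print(sliding_windows(end_time=8, window_size=4, stride=4))
# [(0.0, 4.0), (4.0, 8.0)]
```

With these settings the sanity run covers two non-overlapping 4-second windows; a stride smaller than the window size would make consecutive windows overlap.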
Sherlock annotation data is included as Sherlock_Segments_1000_NN_2017.xlsx.
The sanity script defaults to references/sherlock_reference_boundaries_from_ubeco.txt for a lightweight, reproducible smoke check. You can also evaluate directly against the spreadsheet GT.
- Reference boundaries: use references/sherlock_reference_boundaries_from_ubeco.txt (included in repo), or extract manually if needed:

python -c "
import json
with open('outputs/captions_stride12/transfer boundaries/ubeco_sherlock.json') as f:
    d = json.load(f)
for t in d['boundary_times']:
    print(t)
" > outputs/sanity_eval/sherlock_reference_boundaries.txt
- Evaluate UBoCo and Qwen outputs (using default reference in references/):

python evaluate_boundaries.py outputs/sanity_uboco/boundary_times.txt \
references/sherlock_reference_boundaries_from_ubeco.txt \
--fps 25 --tolerances 5 10 15 \
--output outputs/sanity_eval/uboco_vs_reference.json

python evaluate_boundaries.py outputs/sanity_qwen_segment/boundaries.json \
references/sherlock_reference_boundaries_from_ubeco.txt \
--fps 25 --tolerances 5 10 15 \
--output outputs/sanity_eval/qwen_vs_reference.json
- Evaluate against Sherlock spreadsheet GT (segment end times):

python evaluate_boundaries.py outputs/sanity_uboco/boundary_times.txt \
Sherlock_Segments_1000_NN_2017.xlsx \
--fps 25 --tolerances 5 10 15 \
--gt_column "End Time (s) " \
--output outputs/sanity_eval/uboco_vs_sherlock_xlsx.json

python evaluate_boundaries.py outputs/sanity_qwen_segment/boundaries.json \
Sherlock_Segments_1000_NN_2017.xlsx \
--fps 25 --tolerances 5 10 15 \
--gt_column "End Time (s) " \
--output outputs/sanity_eval/qwen_vs_sherlock_xlsx.json
- Optional coarse-scene boundary eval from spreadsheet scene markers:

python evaluate_boundaries.py outputs/sanity_uboco/boundary_times.txt \
Sherlock_Segments_1000_NN_2017.xlsx \
--fps 25 --tolerances 5 10 15 \
--coarse-scenes \
--output outputs/sanity_eval/uboco_vs_sherlock_coarse_scenes.json
Or run the full sanity script: bash run_sherlock_sanity.sh
By default the sanity script uses Qwen/Qwen3-VL-2B-Instruct for segmentation and skips describer (RUN_DESCRIBER=0).
Enable describer explicitly with: RUN_DESCRIBER=1 bash run_sherlock_sanity.sh
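For readers unfamiliar with tolerance-based boundary evaluation, the sketch below shows one common way to score predicted boundaries against a reference. It is not taken from evaluate_boundaries.py: `match_boundaries` is a hypothetical name, the one-to-one greedy matching is an assumption, and the sketch expects predictions, references, and the tolerance in the same unit (seconds here). How the script itself interprets --tolerances and --fps is not stated in this README.

```python
def match_boundaries(predicted, reference, tolerance):
    """Greedy one-to-one matching; returns (precision, recall, f1)."""
    predicted = sorted(predicted)
    reference = sorted(reference)
    hits = 0
    i = j = 0
    while i < len(predicted) and j < len(reference):
        diff = predicted[i] - reference[j]
        if abs(diff) <= tolerance:
            # Each reference boundary can be claimed by at most one prediction.
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # prediction is too early to match anything that remains
        else:
            j += 1  # reference boundary was missed
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical boundary times in seconds; 0.4 s equals 10 frames at 25 fps.
precision, recall, f1 = match_boundaries([1.0, 5.2, 9.0], [1.1, 5.0, 12.0], tolerance=0.4)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting the score at several tolerances, as the commands above do with 5, 10, and 15, shows how quickly agreement improves as the matching window widens.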
From references/uboco_sherlock_params.txt and existing outputs:
UBoCo (peaks):
--boundary_method peaks --rtp_kernel_size 5 --rtp_min_length 50 --rtp_threshold_diff 0.3 --rtp_max_depth 3 --rtp_max_boundaries 30 --peaks_distance 30 --peaks_prominence 0.6 --peaks_max_boundaries 25
UBoCo (RTP):
--boundary_method rtp, with the same RTP params as above
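As a rough intuition for what the peaks parameters control, the sketch below picks boundaries from a per-frame score curve with a minimum spacing and a cap on the count. It is a simplified stand-in, not the uboco_gebd.py implementation: `pick_peaks` is a hypothetical helper, and true peak prominence is replaced by a plain height threshold (`min_score`). The flag names resemble the `distance` and `prominence` arguments of `scipy.signal.find_peaks`, but whether the script uses that function is an assumption.

```python
def pick_peaks(scores, min_distance, min_score, max_boundaries):
    """Return sorted indices of the strongest, well-separated local maxima."""
    # Candidate peaks: interior local maxima at or above the height threshold.
    candidates = [
        i
        for i in range(1, len(scores) - 1)
        if scores[i] >= min_score
        and scores[i] > scores[i - 1]
        and scores[i] >= scores[i + 1]
    ]
    # Visit the strongest candidates first so they win spacing conflicts.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        if len(kept) == max_boundaries:
            break
        if all(abs(i - k) >= min_distance for k in kept):
            kept.append(i)
    return sorted(kept)


# Toy score curve; real runs would use values like distance 30 and a cap of 25.
scores = [0.1, 0.9, 0.2, 0.8, 0.1, 0.1, 0.7, 0.1]
print(pick_peaks(scores, min_distance=3, min_score=0.6, max_boundaries=25))
# [1, 6]
```

In the toy example the peak at index 3 is dropped because it sits within 3 frames of the stronger peak at index 1, which is the kind of suppression a larger distance value produces.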
For full GEBD evaluation (requires GT pickle and video dir):
# UBoCo: limit to 2 videos
python extra/run_uboco_on_gebd_eval.py --video-dir /path/to/videos --max-videos 2
# UBoCo: single video by ID
python extra/run_uboco_on_gebd_eval.py --video-dir /path/to/videos --only-video <vid_id>
# Qwen: limit to 2 videos
python extra/run_qwen_on_gebd_eval.py --video-dir /path/to/videos --max-videos 2
# Qwen: single video
python extra/run_qwen_on_gebd_eval.py --video-dir /path/to/videos --only-video <vid_id>