
Commit 9a79d89 ("updated")
1 parent d7559d5

1 file changed: README.md (12 additions, 3 deletions)
@@ -7,7 +7,7 @@
 </p>

 <p align="center">
-<strong>OneVision Encoder</strong>
+<strong>HEVC-Style Vision Transformer</strong>
 </p>

 ## 📖 Table of Contents
## 📖 Table of Contents
@@ -24,7 +24,11 @@

 ## 🔍 Introduction

-OneVision Encoder is a vision encoder designed for multimodal large language models, featuring efficient video representation with sparse video input. This project provides training code, data processing tools, and model evaluation utilities.
+Video understanding models face a fundamental trade-off: processing more frames captures richer temporal information but increases computation quadratically with token count. Traditional approaches address this through sparse frame sampling, which discards fine-grained motion dynamics and treats all spatial regions equally, wasting computation on static backgrounds.
+
+We present OneVision Encoder, a vision transformer that resolves this trade-off using principles from HEVC video compression. Instead of sampling sparse frames densely (all patches from a few frames), we sample dense frames sparsely (important patches from many frames). Our codec-style patch selection identifies temporally salient regions, such as areas with motion, object interactions, or semantic changes, and processes only these informative patches.
+
+Combined with global contrastive learning over a 2M-concept bank, OneVision Encoder achieves state-of-the-art results on video benchmarks (MVBench, VideoMME, Perception Test) and image understanding tasks (DocVQA, ChartQA, OCRBench).

 ### Method Overview

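As a rough illustration of the codec-style patch selection described in the new introduction, the sketch below scores non-overlapping patches by temporal change and keeps only the top fraction across all frames. The function name, the difference-based saliency score, and all parameters are assumptions for illustration, not the repository's actual implementation.

```python
import numpy as np

def select_salient_patches(frames: np.ndarray, patch: int = 16, keep_ratio: float = 0.1):
    """frames: (T, H, W, C) video array. Returns arrays (t, row, col) of kept patch indices."""
    frames = frames.astype(np.float32)
    T, H, W, C = frames.shape
    # Temporal saliency: absolute difference to the previous frame; frame 0 gets zero score.
    diff = np.concatenate([np.zeros_like(frames[:1]), np.abs(np.diff(frames, axis=0))], axis=0)
    # Average the saliency over each non-overlapping patch.
    ph, pw = H // patch, W // patch
    scores = diff[:, :ph * patch, :pw * patch].reshape(T, ph, patch, pw, patch, C).mean(axis=(2, 4, 5))
    # Keep the globally top-scoring patches across all frames (codec-style sparsity).
    k = max(1, int(keep_ratio * T * ph * pw))
    top = np.argpartition(scores.reshape(-1), -k)[-k:]
    return np.unravel_index(top, (T, ph, pw))
```

With `keep_ratio` around 0.02-0.25 this selection would correspond to the 75-98% compression range quoted later in the README.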

@@ -34,11 +38,16 @@ OneVision Encoder is a vision encoder designed for multimodal large language mod

 ### Cluster Discrimination Visualization

+Standard contrastive learning (e.g., CLIP) is limited by batch size: negative samples are drawn only from the current batch, typically 32K-64K examples, which gives a narrow view of the embedding space and leads to suboptimal representations. Our approach instead maintains a global concept bank of 2M clustered centers, so each training sample contrasts against a diverse, representative set of negatives regardless of batch composition, yielding more discriminative embeddings with better-separated semantic clusters.
+
+
 <p align="center">
 <img src="pages/images/global_contrastive_comparison.gif" alt="Global Contrastive Comparison" width="800" style="max-width: 100%;">
 </p>
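The global-contrastive idea added above can be pictured as a cross-entropy over similarities to bank centers instead of in-batch negatives. This is a minimal sketch with assumed names and shapes (`concept_bank_loss`, a precomputed `bank` of normalized centers), not the training code from this repository.

```python
import numpy as np

def concept_bank_loss(embeds, labels, bank, temperature=0.07):
    """embeds: (B, D) L2-normalized sample embeddings; labels: (B,) index of each
    sample's assigned concept center; bank: (K, D) L2-normalized cluster centers."""
    logits = embeds @ bank.T / temperature                 # (B, K) similarity to every center
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Pull each sample toward its own center; every other center acts as a negative.
    return -log_probs[np.arange(len(labels)), labels].mean()
```

The negative set is then the whole bank (2M centers in the paper's setting) rather than the current batch, which is the point of the comparison visualized below.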

-### Case Demonstrations
+### Video Processing Pipeline
+
+The animation below walks through the complete video processing pipeline in four stages: (1) Original Video, a continuous 64-frame stream capturing the full temporal context; (2) Uniform Frame Sampling, the traditional approach of selecting 4-8 evenly spaced frames, which is simple but lossy and misses inter-frame motion; (3) Temporal Saliency Detection, which analyzes all 64 frames to identify regions with high temporal information such as motion, appearance changes, and semantic events; and (4) Codec-Style Patch Extraction, which extracts only the salient patches in zigzag order, achieving 75-98% compression while preserving temporal dynamics.

 <table>
 <tr>
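For the zigzag ordering mentioned in stage (4), a plain anti-diagonal scan over the patch grid, as in classic codec coefficient scanning, could look like the following. This helper is hypothetical and only illustrates one common traversal convention; the repository's actual scan order may differ.

```python
def zigzag_order(ph: int, pw: int):
    """Return (row, col) coordinates of a ph x pw patch grid in zigzag scan order."""
    order = []
    for s in range(ph + pw - 1):
        # All cells on the anti-diagonal row + col == s, rows ascending.
        diag = [(i, s - i) for i in range(ph) if 0 <= s - i < pw]
        # Alternate direction per anti-diagonal, as in JPEG/HEVC coefficient scans.
        order.extend(diag if s % 2 else diag[::-1])
    return order
```

Emitting selected patches in this order keeps spatially neighboring patches adjacent in the token sequence, mirroring how codecs serialize transform coefficients.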
