We introduce OneVision Encoder, a vision transformer that resolves this trade-off.
Coupled with global contrastive learning over a 2M-scale concept memory bank, OneVision Encoder achieves state-of-the-art performance across major video benchmarks (MVBench, VideoMME, Perception Test), while also delivering strong results on image understanding tasks (DocVQA, ChartQA, and OCRBench).
### Key Features
- **Unified Vision Foundation**: A single base model for consistent understanding of images, videos, and OCR.
The visualization below illustrates four different video processing pipelines.
Standard contrastive learning methods (e.g., CLIP) are fundamentally constrained by batch size, as negative samples are drawn only from the current batch, typically limited to 32K–64K examples. This restriction yields a narrow and incomplete view of the embedding space, often resulting in suboptimal representation learning. In contrast, our approach maintains a global concept bank comprising 2M clustered centers, allowing each training sample to contrast against a diverse and representative set of negatives independent of batch composition. This global contrasting mechanism leads to more discriminative embeddings and well-separated semantic clusters.
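The global contrasting mechanism described above can be sketched as an InfoNCE-style loss whose negatives come from a fixed bank of cluster centers rather than the current batch. This is an illustrative reconstruction, not the project's training code; the function name, temperature value, and toy bank size are assumptions.

```python
import numpy as np

def global_contrastive_loss(embeddings, targets, concept_bank, temperature=0.07):
    """InfoNCE-style loss where negatives come from a global concept bank
    (cluster centers) instead of the current batch."""
    # Normalize so the dot product is cosine similarity.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    bank = concept_bank / np.linalg.norm(concept_bank, axis=1, keepdims=True)
    # Similarity of each sample to every concept center: [B, K]
    logits = emb @ bank.T / temperature
    # Cross-entropy against each sample's assigned concept center.
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy example: 4 samples, a bank of 1,000 centers (the paper-scale bank is 2M).
rng = np.random.default_rng(0)
bank = rng.standard_normal((1000, 64))
targets = np.array([3, 17, 256, 999])
emb = bank[targets] + 0.01 * rng.standard_normal((4, 64))  # near their centers
loss = global_contrastive_loss(emb, targets, bank)
```

Because the bank is independent of batch composition, every sample is contrasted against the same diverse, representative set of negatives regardless of batch size.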
---
### LMM Probe Results
We train the model on a mixed dataset comprising 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT, proceeding directly to Stage-2 fine-tuning. Following a streamlined native-resolution strategy inspired by LLaVA-OneVision, input frames that match the model’s native resolution are fed directly into the network without tiling or cropping, allowing us to fully evaluate the ViT’s native-resolution modeling capability.
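Because frames at the native resolution are fed directly without tiling or cropping, the token count scales with input size. A small sanity-check sketch, assuming the standard ViT-L/14 patch size (an assumption, but consistent with the stated 256 tokens per 224×224 frame):

```python
def num_patch_tokens(height, width, patch_size=14):
    """Patch-token count for a native-resolution input (no tiling/cropping).
    patch_size=14 is assumed, consistent with 224x224 -> 256 tokens."""
    return (height // patch_size) * (width // patch_size)

assert num_patch_tokens(224, 224) == 256   # video frames: 16 x 16 patches
assert num_patch_tokens(448, 448) == 1024  # pre-trained image resolution
```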
## ⚡ Quick Start
> [!IMPORTANT]
> **Transformers Version Compatibility:**
>
> - ✅ **`transformers==4.57.3`** (Recommended): Works with `AutoModel.from_pretrained()`
> - ⚠️ **`transformers>=5.0.0`**: Not currently supported. We are actively working on a fix.
>
> **Note:** This model supports native resolution input. For optimal performance:
>
> - **Image**: 448×448 resolution (pre-trained)
> - **Video**: 224×224 resolution with 256 tokens per frame (pre-trained)
>
> Use CLIP preprocessing from the [model repository](https://huggingface.co/lmms-lab-encoder/onevision-encoder-large).

### Using AutoModel (Recommended: transformers==4.57.3)
```python
import numpy as np
import torch
from transformers import AutoModel, AutoImageProcessor

repo = "lmms-lab-encoder/onevision-encoder-large"
# trust_remote_code may be required to load the custom architecture
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(repo, trust_remote_code=True)

# Image inference: [B, C, H, W] at the native 448x448 resolution
image = np.random.randint(0, 256, (448, 448, 3), dtype=np.uint8)  # placeholder image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# outputs.pooler_output: [B, hidden_size]

# Video inference: [B, C, T, H, W] with patch_positions
```
- `cache_dir` (optional): Directory for cached codec patches. Use this to specify where codec-selected patches are stored/loaded when you want to persist or reuse them.
#### Shared Parameters
The following parameters are common to both evaluation methods:
- `dataset`: Dataset to evaluate on (e.g., `diving48`, `ssv2`, `kinetics400`). Prepare the dataset according to the Attentive Probe format.
- `num_frames`: Total number of frames in the video sequence (e.g., 8 for sampling, 64 for codec).
- `model_weight`: Path to the pre-trained model. Use `lmms-lab-encoder/onevision-encoder-large` to load directly from HuggingFace, or provide a local path.
- `model_name`: Model architecture name (e.g., `hf_llava_vit_large_ln`).
- `embedding_size`: Size of the embedding dimension (e.g., 1024).
- `batch_size`: Training batch size (varies by evaluation type).
- `default_lr_list`: Learning rate for the probe training.
- `default_weight_decay`: Weight decay for optimization.
- `eval_freq`: Evaluation frequency during training.
- `dali_py_num_workers`: Number of DALI data loading workers.
- `data_root`: Root directory containing your prepared dataset (codec evaluation only).
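As a sketch only, the shared parameters above might be gathered into a config like the following; the dict layout and any concrete values not quoted in the descriptions (batch size, learning rate, weight decay, worker count) are hypothetical, not the evaluation script's actual interface:

```python
# Hypothetical config sketch using the shared parameters documented above.
# Unquoted values (batch_size, lr, weight decay, workers) are illustrative only.
probe_config = {
    "dataset": "diving48",            # Attentive Probe format dataset
    "num_frames": 8,                  # 8 for sampling, 64 for codec
    "model_weight": "lmms-lab-encoder/onevision-encoder-large",
    "model_name": "hf_llava_vit_large_ln",
    "embedding_size": 1024,
    "batch_size": 64,                 # varies by evaluation type
    "default_lr_list": [1e-3],        # probe learning rate(s)
    "default_weight_decay": 0.01,
    "eval_freq": 1,
    "dali_py_num_workers": 4,
}
```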