Commit f7e7781

Update patch_positions
1 parent d7d82a4 commit f7e7781

File tree

7 files changed: +240 −173 lines changed

README.md

Lines changed: 24 additions & 59 deletions
@@ -1,6 +1,3 @@
-<!-- <p align="center">
-  <img alt="OneVision Encoder" src="asset/onevision_encoder.png" width="1200" style="max-width: 100%;">
-</p> -->
 <picture>
   <source media="(prefers-color-scheme: dark)" srcset="asset/logo_dark.png">
   <source media="(prefers-color-scheme: light)" srcset="asset/logo_light.png">
@@ -21,7 +18,6 @@
 
 </div>
 
-
 <p align="center">
   <picture>
     <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/method_github_dark.png">
@@ -52,7 +48,6 @@ We introduce OneVision Encoder, a vision transformer that resolves this trade-of
 
 Coupled with global contrastive learning over a 2M-scale concept memory bank, OneVision Encoder achieves state-of-the-art performance across major video benchmarks (MVBench, VideoMME, Perception Test), while also delivering strong results on image understanding tasks (DocVQA, ChartQA, and OCRBench).
 
-
 ### Key Features
 
 - **Unified Vision Foundation**: A single base model for consistent understanding of images, videos, and OCR.
@@ -98,7 +93,6 @@ The visualization below illustrates four different video processing pipelines.
 
 Standard contrastive learning methods (e.g., CLIP) are fundamentally constrained by batch size, as negative samples are drawn only from the current batch, typically limited to 32K–64K examples. This restriction yields a narrow and incomplete view of the embedding space, often resulting in suboptimal representation learning. In contrast, our approach maintains a global concept bank comprising 2M clustered centers, allowing each training sample to contrast against a diverse and representative set of negatives independent of batch composition. This global contrasting mechanism leads to more discriminative embeddings and well-separated semantic clusters.
 
-
 <p align="center">
   <picture>
     <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/loss_github_dark.gif">
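The global-contrast mechanism described in the README paragraph above (negatives drawn from a fixed concept bank rather than from the batch) can be sketched as cross-entropy over similarities against the whole bank. The sizes below are toy values made up for illustration; this is a sketch of the idea, not the project's training code:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, bank_size, batch = 8, 1000, 4  # toy sizes; the real bank holds 2M centers

# Concept bank: L2-normalized cluster centers (stand-in for the clustered bank).
bank = rng.standard_normal((bank_size, dim))
bank /= np.linalg.norm(bank, axis=1, keepdims=True)

# A batch of embeddings, each assigned a target center in the bank.
emb = rng.standard_normal((batch, dim))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
targets = rng.integers(0, bank_size, size=batch)

# Similarities against the WHOLE bank: negatives are independent of batch composition.
logits = emb @ bank.T / 0.07  # temperature-scaled cosine similarity

# Numerically stable log-softmax + cross-entropy over bank entries.
m = logits.max(axis=1, keepdims=True)
log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(batch), targets].mean()
```

Because the denominator runs over every bank entry, the set of negatives stays the same regardless of which samples share a batch.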
@@ -107,10 +101,8 @@ Standard contrastive learning methods (e.g., CLIP) are fundamentally constrained
   </picture>
 </p>
 
-
 ---
 
-
 ### LMM Probe Results
 
 We train the model on a mixed dataset comprising 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT, proceeding directly to Stage-2 fine-tuning. Following a streamlined native-resolution strategy inspired by LLaVA-OneVision, input frames that match the model’s native resolution are fed directly into the network without tiling or cropping, allowing us to fully evaluate the ViT’s native-resolution modeling capability.
@@ -123,25 +115,22 @@ We train the model on a mixed dataset comprising 740K samples from LLaVA-OneVisi
   </picture>
 </p>
 
-
-
-
-
 ## ⚡ Quick Start
 
 > [!IMPORTANT]
 > **Transformers Version Compatibility:**
-> -**`transformers==4.53.1`** (Recommended): Works with `AutoModel.from_pretrained()`
+>
+> -**`transformers==4.57.3`** (Recommended): Works with `AutoModel.from_pretrained()`
 > - ⚠️ **`transformers>=5.0.0`**: Not currently supported. We are actively working on a fix.
 
-
 > **Note:** This model supports native resolution input. For optimal performance:
+>
 > - **Image**: 448×448 resolution (pre-trained)
 > - **Video**: 224×224 resolution with 256 tokens per frame (pre-trained)
 >
 > Use CLIP preprocessing from the [model repository](https://huggingface.co/lmms-lab-encoder/onevision-encoder-large).
 
-### Using AutoModel (Recommended: transformers==4.53.1)
+### Using AutoModel (Recommended: transformers==4.57.3)
 
 ```python
 from transformers import AutoModel, AutoImageProcessor
@@ -169,31 +158,32 @@ with torch.no_grad():
 # outputs.pooler_output: [B, hidden_size]
 
 # Video inference: [B, C, T, H, W] with patch_positions
-import math
-num_frames, frame_tokens, target_frames = 16, 256, 64
-patches_per_side = int(math.sqrt(frame_tokens))  # 16 for 256 tokens
+num_frames, target_frames = 16, 64
+patch_size = 14
 # Load video frames and preprocess each frame (replace with your video frame paths)
 frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
 video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
 # Reshape from [T, C, H, W] to [B, C, T, H, W]
 video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")
 
 # Build patch_positions for temporal sampling: [B, num_frames * frame_tokens, 3]
-# Each position is (t, h, w) where t is temporal index, h/w are spatial patch coordinates
-frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()  # [num_frames]
-per = torch.arange(frame_tokens).cuda()  # [frame_tokens]
-
-# Temporal positions: frame index for each patch
-t_positions = frame_pos.unsqueeze(-1).expand(-1, frame_tokens).reshape(1, -1)  # [1, num_frames * frame_tokens]
-# Spatial positions: h and w within each frame's patch grid
-h_positions = (per // patches_per_side).unsqueeze(0).expand(num_frames, -1).reshape(1, -1)
-w_positions = (per % patches_per_side).unsqueeze(0).expand(num_frames, -1).reshape(1, -1)
-# Stack to create patch_positions: [B, num_frames * frame_tokens, 3]
-patch_positions = torch.stack([t_positions, h_positions, w_positions], dim=-1)
-# patch_positions example (with 256 tokens per frame, 16x16 patch grid):
-# patch_positions[0, 0:4, :] -> [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]  # Frame 0 (t=0), first 4 patches
-# patch_positions[0, 256:260, :] -> [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]  # Frame 1 (t=4, since 16 frames map to 64 positions)
-# Each [t, h, w] represents: t=temporal frame index (0-63), h=row in patch grid, w=column in patch grid
+frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()  # [T]
+grid_h, grid_w = video.shape[-2] // patch_size, video.shape[-1] // patch_size  # patch grid
+frame_tokens = grid_h * grid_w
+
+t_positions = frame_pos[:, None].repeat(1, frame_tokens).reshape(-1)  # [T * frame_tokens]
+h_positions = torch.arange(grid_h, device="cuda").repeat_interleave(grid_w)
+h_positions = h_positions.repeat(num_frames)  # [T * frame_tokens]
+w_positions = torch.arange(grid_w, device="cuda").repeat(grid_h)
+w_positions = w_positions.repeat(num_frames)  # [T * frame_tokens]
+
+patch_positions = torch.stack([t_positions, h_positions, w_positions], dim=-1).unsqueeze(0)
+# patch_positions example (256 tokens per frame, 16x16 patch grid):
+# Each row is [t, h, w].
+# First 4 patches of frame 0 (t=0):
+# patch_positions[0, 0:4, :] -> [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
+# First 4 patches of frame 1 (t=4):
+# patch_positions[0, 256:260, :] -> [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]
 
 with torch.no_grad():
     outputs = model(video, patch_positions=patch_positions)
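The hunk above derives the patch grid from the input tensor at runtime. The resulting [t, h, w] layout can be checked with a small NumPy sketch that mirrors the torch logic under the README's assumed sizes (224×224 frames, patch size 14, 16 frames mapped onto 64 temporal slots); this is an illustrative check, not part of the repository:

```python
import numpy as np

# Assumed sizes from the README example: 16 input frames spread over 64
# temporal positions, 224x224 frames, patch size 14 -> a 16x16 patch grid.
num_frames, target_frames = 16, 64
grid_h = grid_w = 224 // 14          # 16 patches per side
frame_tokens = grid_h * grid_w       # 256 tokens per frame

# Temporal index per frame: linspace then truncate, as in the torch code.
frame_pos = np.linspace(0, target_frames - 1, num_frames).astype(np.int64)

t = np.repeat(frame_pos, frame_tokens)                          # frame index per patch
h = np.tile(np.repeat(np.arange(grid_h), grid_w), num_frames)   # row in patch grid
w = np.tile(np.tile(np.arange(grid_w), grid_h), num_frames)     # column in patch grid
patch_positions = np.stack([t, h, w], axis=-1)[None]            # [1, T * frame_tokens, 3]

print(patch_positions[0, 0:4].tolist())      # [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
print(patch_positions[0, 256:260].tolist())  # [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]
```

The same indexing reproduces the example rows quoted in the diff's comments, which makes it a quick sanity check before trying a non-square or non-default grid.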
@@ -242,7 +232,6 @@ pip install -r requirements.txt
 
 ### Option 2 (Docker)
 
-
 ```bash
 docker build -t onevision-encoder:2601 .
 
@@ -252,7 +241,6 @@ docker run -it --rm --gpus all --ipc host --net host --privileged \
   onevision-encoder:2601 bash
 ```
 
-
 ### Install Package
 
 Inside the container, install the package in editable mode:
@@ -283,21 +271,18 @@ git clone https://huggingface.co/lmms-lab-encoder/onevision-encoder-large-si
 
 Download the pretraining data and prepare the data directory as per the instructions in `data/README.md`.
 
-
 More documentation will be added soon.
 
 ```bash
 bash shells/ov_encoder_large_stage2_residual_8gpus.sh
 ```
 
-
 Training configurations and hyperparameters will be documented soon. For now, please refer to `--help` for available options.
 
 ## 📊 Evaluation
 
 ### Attentive Probe Evaluation
 
-
 #### Chunk-wise Sampling Evaluation
 
 To evaluate the encoder with uniform frame sampling, first navigate to the evaluation directory:
@@ -314,6 +299,7 @@ bash shells_eval_ap/eval_ov_encoder_large_16frames.sh
 ```
 
 **Sampling-Specific Parameters:**
+
 - `frames_token_num`: Number of tokens per frame (e.g., 256 tokens for standard sampling).
 
 #### OV-Encoder Codec Evaluation
@@ -330,27 +316,6 @@ Then run the following command:
 bash shells_eval_ap/eval_ov_encoder_large_2kpatches_codec.sh
 ```
 
-**Codec-Specific Parameters:**
-- `K_keep`: Number of patches to keep.
-- `cache_dir` (optional): Directory for cached codec patches. Use this to specify where codec-selected patches are stored/loaded when you want to persist or reuse them.
-
-#### Shared Parameters
-
-The following parameters are common to both evaluation methods:
-
-- `dataset`: Dataset to evaluate on (e.g., `diving48`, `ssv2`, `kinetics400`). Prepare the dataset according to the Attentive Probe format.
-- `num_frames`: Total number of frames in the video sequence (e.g., 8 for sampling, 64 for codec).
-- `model_weight`: Path to the pre-trained model. Use `lmms-lab-encoder/onevision-encoder-large` to load directly from HuggingFace, or provide a local path.
-- `model_name`: Model architecture name (e.g., `hf_llava_vit_large_ln`).
-- `embedding_size`: Size of the embedding dimension (e.g., 1024).
-- `batch_size`: Training batch size (varies by evaluation type).
-- `default_lr_list`: Learning rate for the probe training.
-- `default_weight_decay`: Weight decay for optimization.
-- `eval_freq`: Evaluation frequency during training.
-- `dali_py_num_workers`: Number of DALI data loading workers.
-- `data_root`: Root directory containing your prepared dataset (codec evaluation only).
-
-
 ## 👥 Contributors
 
 <!-- Add contributor list here -->

dataloader/ap_dataloader_dali.py

Lines changed: 33 additions & 25 deletions
@@ -1,8 +1,3 @@
-#
-# Created by anxiangsir
-# Date: 2025-11-13 12:26:36 (UTC)
-#
-
 import os
 import warnings
 from typing import Any, Dict, List, Tuple
@@ -13,14 +8,16 @@
 import nvidia.dali.types as types
 from nvidia.dali.pipeline import pipeline_def
 from nvidia.dali.plugin.pytorch import DALIGenericIterator
+
+
 try:
     import cv2
+
     _HAS_CV2 = True
 except ImportError:
     _HAS_CV2 = False
 
 
-
 # ----------------------------------------------------------------------------
 # 1. DALI Iterator Wrapper (modified - returns indices, total_frames and file_name)
 # ----------------------------------------------------------------------------
@@ -47,13 +44,14 @@ def __len__(self) -> int:
     def reset(self):
         self.iter.reset()
 
+
 # ----------------------------------------------------------------------------
 # 2. DALI External Source for Video Data (modified - returns indices, total_frames and file_name)
 # ----------------------------------------------------------------------------
 class VideoExternalSource:
     def __init__(self, mode: str, source_params: Dict[str, Any]):
         self.mode = mode
-        self.file_list: List[Tuple[str, int]] = source_params["file_list"] 
+        self.file_list: List[Tuple[str, int]] = source_params["file_list"]
         self.num_shards: int = source_params["num_shards"]
         self.shard_id: int = source_params["shard_id"]
         self.batch_size: int = source_params["batch_size"]
@@ -73,7 +71,6 @@ def __init__(self, mode: str, source_params: Dict[str, Any]):
         self.fallback_example = self.file_list[0] if self.file_list else ("", 0)
 
     def _get_frame_indices(self, num_frames: int) -> List[int]:
-
         if num_frames < self.sequence_length:
             indices = list(range(num_frames))
             indices += [num_frames - 1] * (self.sequence_length - num_frames)
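The short-video branch shown in this hunk pads the index list by repeating the final frame until the requested sequence length is reached. A standalone sketch of just that padding rule (a hypothetical extraction for illustration, not the repository's full `_get_frame_indices`):

```python
from typing import List

def pad_frame_indices(num_frames: int, sequence_length: int) -> List[int]:
    """Short-video rule: keep every available frame, then repeat the
    last frame index until the sequence is full."""
    indices = list(range(num_frames))
    indices += [num_frames - 1] * (sequence_length - num_frames)
    return indices

print(pad_frame_indices(5, 8))  # [0, 1, 2, 3, 4, 4, 4, 4]
```

Repeating the last frame keeps the output shape fixed for batching while leaving the temporal order of real frames intact.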
@@ -110,7 +107,8 @@ def __call__(self, sample_info) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.
         except Exception as e:
             warnings.warn(f"Failed to load video: {video_path}, error: {e}. Using fallback.")
             fallback_path, _ = self.fallback_example
-            if not fallback_path: raise IOError(f"Fallback video path is empty!")
+            if not fallback_path:
+                raise IOError(f"Fallback video path is empty!")
             video_data, frame_indices, total_frames = self._load_video_data(fallback_path)
 
         return video_data, np.int64([int(video_label)]), frame_indices, np.int64([total_frames])
@@ -198,7 +196,7 @@ def dali_video_pipeline(mode: str, source_params: Dict[str, Any]):
         batch=False,
         parallel=True,
         dtype=[types.UINT8, types.INT64, types.INT64, types.INT64],
-        layout=["FHWC", "C", "C", "C"]  # Empty layout for file_name byte array (variable length)
+        layout=["FHWC", "C", "C", "C"],  # Empty layout for file_name byte array (variable length)
     )
 
     videos = videos.gpu()
@@ -209,6 +207,7 @@ def dali_video_pipeline(mode: str, source_params: Dict[str, Any]):
     videos = preprocess_videos(videos, mode, input_size, mean, std)
     return videos, labels, indices, total_frames
 
+
 # ----------------------------------------------------------------------------
 # 4. Main Dataloader Function (modified - output_map adds indices, total_frames and file_name)
 # ----------------------------------------------------------------------------
@@ -229,8 +228,7 @@ def get_dali_dataloader(
     seed: int = 0,
     feature_extract: bool = True,
 ) -> DALIWarper:
-    """
-    """
+    """ """
     print(f"[{mode} loader] Reading for: {data_csv_path}")
     file_list = []
     try:
@@ -249,26 +247,36 @@ def get_dali_dataloader(
     world_size = int(os.getenv("WORLD_SIZE", "1"))
 
     source_params = {
-        "num_shards": world_size, "shard_id": rank, "file_list": file_list,
-        "batch_size": batch_size, "sequence_length": sequence_length, "seed": seed + rank,
-        "use_rgb": use_rgb, "input_size": input_size, "short_side_size": short_side_size,
-        "mean": mean, "std": std,
-        "decord_num_threads": decord_num_threads, "feature_extract": feature_extract
+        "num_shards": world_size,
+        "shard_id": rank,
+        "file_list": file_list,
+        "batch_size": batch_size,
+        "sequence_length": sequence_length,
+        "seed": seed + rank,
+        "use_rgb": use_rgb,
+        "input_size": input_size,
+        "short_side_size": short_side_size,
+        "mean": mean,
+        "std": std,
+        "decord_num_threads": decord_num_threads,
+        "feature_extract": feature_extract,
     }
 
     pipe = dali_video_pipeline(
-        batch_size=batch_size, num_threads=dali_num_threads, device_id=local_rank,
-        seed=seed + rank, py_num_workers=dali_py_num_workers, py_start_method="forkserver",
-        prefetch_queue_depth=2, mode=mode, source_params=source_params,
+        batch_size=batch_size,
+        num_threads=dali_num_threads,
+        device_id=local_rank,
+        seed=seed + rank,
+        py_num_workers=dali_py_num_workers,
+        py_start_method="forkserver",
+        prefetch_queue_depth=2,
+        mode=mode,
+        source_params=source_params,
     )
     pipe.build()
 
     # ===> output_map adds "indices", "total_frames" and "file_name" <===
-    dali_iter = DALIGenericIterator(
-        pipelines=[pipe],
-        output_map=["videos", "labels", "indices", "total_frames"],
-        auto_reset=True
-    )
+    dali_iter = DALIGenericIterator(pipelines=[pipe], output_map=["videos", "labels", "indices", "total_frames"], auto_reset=True)
     steps_per_epoch = len(file_list) // world_size // batch_size
     dataloader = DALIWarper(dali_iter=dali_iter, steps_per_epoch=steps_per_epoch)
 
dataloader/data_v2_ocr.py

Lines changed: 0 additions & 24 deletions
@@ -204,27 +204,3 @@ def dali_dataloader(
         label_select=label_select,
     )
     return dataloader
-
-
-if __name__ == "__main__":
-    import cv2
-    import numpy as np
-
-    loader = dali_dataloader(
-        "/data_4/coyo_ocr_v0/train_00", 4, [336, 336], workers=4, is_training=True, mean=[0, 0, 0], std=[1, 1, 1], label_select=None, seed=1437, num_shards=None, shard_id=None, max_side=336
-    )
-
-    image_list = []
-    big_img = np.zeros((3360, 3360, 3), dtype=np.uint8)
-    for image, label in loader:
-        print(label)
-        image = image[0].permute(1, 2, 0).cpu().numpy()
-        image_list.append(image)
-        if len(image_list) == 100:
-            break
-
-    for i in range(100):
-        row = i // 10
-        col = i % 10
-        big_img[row * 336 : (row + 1) * 336, col * 336 : (col + 1) * 336] = image_list[i]
-    cv2.imwrite("output_big_image.jpg", big_img[:, :, ::-1])

dataset/dataset_onevision_encoder.py

Lines changed: 2 additions & 3 deletions
@@ -142,8 +142,8 @@ def onevision_encoder_si_cfs_single_node():
     WARNING: This is NOT a recommended approach as it can cause severe data imbalance.
     """
     patterns = [
-        "/datasets_ov_encoder/coyo400m/*.rec",
-        "/datasets_ov_encoder/laion260m/*.rec",
+        "datasets_ov_encoder/coyo400m/*.rec",
+        "datasets_ov_encoder/laion260m/*.rec",
     ]
 
     all_files = [f for pattern in patterns for f in glob.glob(pattern)]
@@ -277,7 +277,6 @@ def onevision_encoder_video_codec():
     """
     assert world_size <= 128
     list_mp4_label_path = f"train_how_to_100m_panda70m_k710_square_with_index_filtered_split_128/part_{rank:03d}"
-
     return Property(
         name="onevision_encoder_video_codec",
         prefixes=[list_mp4_label_path],
