OneVision Encoder Large is a vision transformer that resolves the fundamental trade-off in video understanding: processing more frames captures richer temporal information but increases computation quadratically. Using principles from HEVC video compression, it implements codec-style patch selection that identifies temporally-salient regions—areas with motion, object interactions, or semantic changes—and processes only these informative patches.
| Property | Value |
|---|---|
| Model Type | Vision Transformer (ViT) |
| Architecture | HEVC-Style Vision Transformer |
| Hidden Size | 1024 |
| Intermediate Size | 4096 |
| Number of Layers | 24 |
| Number of Attention Heads | 16 |
| Patch Size | 14 |
| Image Resolution | 448×448 (pre-trained) |
| Video Resolution | 224×224 with 256 tokens per frame |
| Positional Encoding | 3D RoPE (4:6:6 split for T:H:W) |
| Normalization | Layer Normalization |
| Activation Function | GELU |
| Attention Implementation | Flash Attention 2 |
| License | Apache 2.0 |
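These properties can also be read programmatically from the checkpoint's config (a minimal sketch; the attribute names below follow common Transformers conventions and may differ in this repo's remote code):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large", trust_remote_code=True
)
# Attribute names assumed from common ViT configs; check the repo's remote code
print(cfg.hidden_size, cfg.num_hidden_layers, cfg.num_attention_heads)
```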
- Unified Vision Foundation: A single base model for consistent understanding of images, videos, and OCR.
- Codec-Style Patch Selection: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
- 3D Rotary Position Embedding: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships.
- Global Contrastive Learning: Trained with a 2M concept bank for better-separated semantic clusters.
- Native Resolution Support: Supports native resolution input without tiling or cropping.
OneVision Encoder uses a unified architecture to process three types of visual inputs—images, video chunks (uniform frame sampling), and codec-style sparse patches—through the same Vision Transformer backbone. The key insight is that all inputs are converted to a sequence of patch tokens with 3D position encodings, enabling a single model to handle diverse visual modalities.
For single image input, the ViT processes data in the standard 4D tensor format [B, C, H, W]:
Input: [B, C, H, W] → e.g., [1, 3, 448, 448]
↓
Patch Embedding (Conv2d with kernel=14, stride=14)
↓
Flatten: [B, num_patches, hidden_size]
e.g., [1, 1024, 1024] for 448×448 image
↓
3D RoPE Position Encoding (T=1, H=32, W=32)
↓
Transformer Encoder (24 layers)
↓
Output: [B, num_patches, hidden_size]
Key points:
- Images are internally treated as single-frame videos with T=1
- Position encoding uses the same 3D RoPE with the temporal dimension fixed at 1
- All patches are processed (no masking), resulting in (H/patch_size) × (W/patch_size) tokens (see the arithmetic after this list)
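As a quick sanity check, the token count for the pre-trained image resolution works out as follows (a small illustrative calculation, not model code):

```python
patch_size = 14
H = W = 448
num_patches = (H // patch_size) * (W // patch_size)  # 32 × 32 patch grid
print(num_patches)  # 1024 patch tokens for a 448×448 image
```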
For video input with uniform frame sampling, the ViT processes data in the 5D tensor format [B, C, T, H, W]:
Input: [B, C, T, H, W] → e.g., [1, 3, 16, 224, 224]
↓
Patch Embedding (per-frame Conv2d)
↓
Flatten: [B, T × H_patches × W_patches, hidden_size]
e.g., [1, 16 × 16 × 16, 1024] = [1, 4096, 1024]
↓
Build patch_positions for temporal mapping
↓
3D RoPE Position Encoding with frame positions
↓
Transformer Encoder (24 layers)
↓
Output: [B, num_visible_patches, hidden_size]
The patch_positions mechanism:
The patch_positions tensor provides explicit 3D position coordinates [t, h, w] for each patch, enabling proper temporal position encoding even with sparse frame sampling. Each patch is mapped to a position in a virtual temporal grid (e.g., 64 virtual frames):
# Example: 16 frames sampled from a video, mapped to 64 virtual frame positions
num_frames = 16 # Actual number of sampled frames
patches_per_side = 16 # Patches per side (16×16 for 224×224 with patch_size=14)
frame_tokens = patches_per_side * patches_per_side # 256 patches per frame
target_frames = 64 # Virtual temporal grid size (model's RoPE temporal dimension)
seq_len = num_frames * frame_tokens # Total number of patches
# Map 16 actual frames to positions in the 64-frame virtual grid
frame_pos = torch.linspace(0, target_frames - 1, num_frames).long()
# frame_pos: 16 evenly spaced frame indices in [0, 63], e.g. [0, 4, 8, ..., 58, 63]
# Build temporal positions: each frame's patches get the same temporal position
t_positions = frame_pos.unsqueeze(-1).expand(-1, frame_tokens).reshape(-1) # [seq_len]
# Build spatial positions: h and w within each frame
h_ids = torch.arange(patches_per_side).repeat_interleave(patches_per_side) # [0,0,...,0,1,1,...,15]
w_ids = torch.arange(patches_per_side).repeat(patches_per_side) # [0,1,...,15,0,1,...,15]
h_positions = h_ids.unsqueeze(0).expand(num_frames, -1).reshape(-1) # [seq_len]
w_positions = w_ids.unsqueeze(0).expand(num_frames, -1).reshape(-1) # [seq_len]
# Build patch_positions: [batch_size, seq_len, 3] with [t, h, w] for each patch
patch_positions = torch.stack([t_positions, h_positions, w_positions], dim=-1) # [seq_len, 3]
patch_positions = patch_positions.unsqueeze(0) # [1, seq_len, 3] for batch_size=1
# Shape: [1, 4096, 3] (16 frames × 256 patches, each with 3D coordinates)

This enables the model to understand temporal relationships even when frames are not densely sampled.
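For the model's `visible_indices` interface, the same frame mapping can be flattened into per-patch indices (reusing `frame_pos` and `frame_tokens` from the snippet above; this mirrors the video usage example later in this card):

```python
# Flat index of patch (t, h, w) = t * frame_tokens + h * patches_per_side + w;
# taking all 256 patches of each mapped frame yields 16 × 256 = 4096 indices
visible_indices = (frame_pos.unsqueeze(-1) * frame_tokens +
                   torch.arange(frame_tokens)).reshape(1, -1)  # [1, 4096]
```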
Codec-style input is the most sophisticated processing mode, inspired by HEVC video compression. Instead of processing all patches from all frames, it selectively processes only temporally-salient patches identified through motion and residual analysis.
Input Video: 64 frames
↓
┌───────────────────────────────────────────────┐
│ HEVC Feature Extraction │
│ ├── Motion Vectors (MV): quarter-pel motion │
│ └── Residuals: prediction error signals │
└───────────────────────────────────────────────┘
↓
┌───────────────────────────────────────────────┐
│ Temporal Saliency Detection │
│ ├── MV Energy: camera-compensated motion mag │
│ ├── Residual Energy: prediction error mag │
│ └── Fused Energy: weighted combination │
└───────────────────────────────────────────────┘
↓
┌───────────────────────────────────────────────┐
│ Top-K Patch Selection │
│ ├── Score each patch by fused energy │
│ ├── Select K most salient patches │
│ └── Build sparse visible_indices │
└───────────────────────────────────────────────┘
↓
┌───────────────────────────────────────────────┐
│ ViT Processing with Sparse visible_indices │
│ ├── Input: [B, C, T, H, W] full video │
│ ├── visible_indices: [B, K] selected patches │
│ └── Output: [B, K, hidden_size] │
└───────────────────────────────────────────────┘
Detailed Codec Processing Pipeline:
1. Motion Vector Analysis: Extract motion vectors from the HEVC codec at quarter-pixel precision. Apply camera motion compensation (median, similarity, or affine model) to isolate object motion from camera movement.
2. Residual Analysis: Extract prediction residuals that capture texture changes and fine-grained motion not captured by block-based motion compensation.
3. Energy Fusion: Combine MV energy and residual energy with configurable weights to produce a unified saliency map.
4. Top-K Selection: Rank all patches (across all frames) by their saliency scores and select the top K patches. This achieves 75-98% compression while retaining critical temporal dynamics (see the sketch after this list).
5. Sparse Processing: The selected patches are processed by the ViT with proper 3D position encodings, enabling the model to understand the spatiotemporal context of each selected patch.
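The fusion and selection steps (3-4 above) can be sketched as follows; this is a minimal illustration that assumes per-patch energy maps have already been extracted from the codec, and the fusion weight `alpha` is a hypothetical parameter:

```python
import torch

def select_topk_patches(mv_energy, res_energy, K, alpha=0.5):
    """Fuse per-patch MV and residual energies, keep the K most salient patches.

    mv_energy, res_energy: [T, Hp, Wp] maps with one saliency score per patch.
    Returns flat indices into the T*Hp*Wp patch grid, sorted ascending.
    """
    fused = alpha * mv_energy + (1.0 - alpha) * res_energy  # weighted combination
    topk = torch.topk(fused.reshape(-1), K).indices         # K most salient patches
    return torch.sort(topk).values                          # restore temporal order

# Example: 64 frames of 16×16 patches; keep 2048 of 16384 (87.5% compression)
mv = torch.rand(64, 16, 16)
res = torch.rand(64, 16, 16)
visible_indices = select_topk_patches(mv, res, K=2048).unsqueeze(0)  # [1, 2048]
```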
Example codec-style inference:
# Codec-style: select 2048 most salient patches from 64 frames
# (equivalent to 8 full frames worth of tokens)
K_keep = 2048 # 256 patches/frame × 8 frames equivalent
# visible_indices are computed by the codec saliency detection
# Each index points to a specific (frame, h, w) position in the patch grid
visible_indices = compute_codec_visible_indices(
    video_path,
    K=K_keep,
    mv_compensate="similarity",  # Camera motion compensation
    patch_size=14
)
# Process with the model
outputs = model(video, visible_indices=visible_indices)
# Output: [B, 2048, 1024] - features for the 2048 selected patches

| Mode | Input Shape | visible_indices | Output Shape | Use Case |
|---|---|---|---|---|
| Image | [B, 3, H, W] | All patches | [B, (H/14)×(W/14), 1024] | Single image understanding |
| Video Chunk | [B, 3, T, H, W] | Frame-mapped | [B, T×(H/14)×(W/14), 1024] | Uniform temporal sampling |
| Codec-Style | [B, 3, T, H, W] | Top-K salient | [B, K, 1024] | Efficient dense temporal coverage |
All three input modes share the same 3D Rotary Position Embedding (RoPE) with a 4:6:6 split:
- Temporal (T): 4/16 of head dimension → captures frame ordering
- Height (H): 6/16 of head dimension → captures vertical position
- Width (W): 6/16 of head dimension → captures horizontal position
# 3D position encoding construction
head_dim = hidden_size // num_heads # 1024 // 16 = 64
half = head_dim // 2 # 32
# Split dimensions with 4:6:6 ratio (4+6+6 = 16 units total)
unit = half // 16 # 32 // 16 = 2
t_size = 4 * unit # 4 * 2 = 8 dims for temporal
h_size = 6 * unit # 6 * 2 = 12 dims for height
w_size = 6 * unit # 6 * 2 = 12 dims for width
# Total: 8 + 12 + 12 = 32 = half of head_dim
# Compute frequencies for each dimension
freqs = torch.cat([
    freq_temporal[t_positions],  # based on each patch's frame index
    freq_height[h_positions],    # based on each patch's row
    freq_width[w_positions],     # based on each patch's column
], dim=-1)

This unified position encoding allows the model to maintain consistent spatial and temporal understanding across all input modalities.
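For reference, here is a minimal sketch of how fused frequencies of this kind are typically applied to queries and keys; the interleaved pairing convention is an assumption, and the actual checkpoint may pair dimensions differently:

```python
import torch

def apply_rope(x, freqs):
    # x: [B, num_heads, seq_len, head_dim]; freqs: [seq_len, head_dim // 2]
    cos = freqs.cos().repeat_interleave(2, dim=-1)  # [seq_len, head_dim]
    sin = freqs.sin().repeat_interleave(2, dim=-1)
    x1, x2 = x[..., 0::2], x[..., 1::2]             # even/odd dimension pairs
    rotated = torch.stack((-x2, x1), dim=-1).flatten(-2)  # (-x_odd, x_even) per pair
    return x * cos + rotated * sin                  # 2D rotation of each pair
```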
- Video Understanding: Action recognition, video captioning, video question answering
- Image Understanding: Document understanding (DocVQA), chart understanding (ChartQA), OCR tasks
- Vision-Language Models: As the vision encoder backbone for multimodal large language models
- Video benchmarks: MVBench, VideoMME, Perception Test
- Image understanding: DocVQA, ChartQA, OCRBench
- Action recognition: SSv2, UCF101, Kinetics
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch
# Load model and preprocessor
model = AutoModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # Flash Attention 2 requires fp16/bf16 weights
    attn_implementation="flash_attention_2"
).to("cuda").eval()
preprocessor = AutoImageProcessor.from_pretrained(
"lmms-lab-encoder/onevision-encoder-large",
trust_remote_code=True
)
# Image inference: [B, C, H, W]
image = Image.open("path/to/your/image.jpg")
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda", torch.bfloat16)
with torch.no_grad():
    outputs = model(pixel_values)
    # outputs.last_hidden_state: [B, num_patches, hidden_size]
    # outputs.pooler_output: [B, hidden_size]
# Video inference: [B, C, T, H, W] with visible_indices
num_frames, frame_tokens, target_frames = 16, 256, 64
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda", torch.bfloat16)  # [1, C, T, H, W]
# Build visible_indices for temporal sampling
frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()
visible_indices = (frame_pos.unsqueeze(-1) * frame_tokens + torch.arange(frame_tokens).cuda()).reshape(1, -1)
with torch.no_grad():
    outputs = model(video, visible_indices=visible_indices)

Performance is measured with Attentive Probe evaluation on single-clip input, trained for 10 epochs across 8 action recognition datasets.
The model is trained on a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT; the training pipeline proceeds directly to Stage 2 fine-tuning with a native-resolution strategy.
- The model is pre-trained at specific resolutions (448×448 for images, 224×224 for video)
- Performance may vary on domains significantly different from training data
- Video processing requires proper temporal sampling configuration
@misc{onevision-encoder,
title={OneVision Encoder: HEVC-Style Vision Transformer},
author={EvolvingLMMs-Lab},
year={2024},
url={https://github.com/EvolvingLMMs-Lab/OneVision-Encoder}
}

For questions and issues, please open an issue on the GitHub repository.
OneVision Encoder uses a unified training approach that simultaneously processes images, video codec-style patches, video frame sampling, and video collage within the same batch. This multi-modal training enables the model to learn robust representations across different input modalities.
Within each training batch, samples are divided into different processing modes:
Training Batch (bs=16)
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  ┌─────────────────┐  ┌───────────────────┐  ┌─────────────────┐    │
│  │   Image Head    │  │    Video Head     │  │    OCR Head     │    │
│  │    (origin)     │  │ (decord_residual) │  │      (ocr)      │    │
│  │                 │  │                   │  │                 │    │
│  │  [B, 3, H, W]   │  │  Split by mode:   │  │  [B, 3, H, W]   │    │
│  │                 │  │  • Codec 50%      │  │                 │    │
│  │                 │  │  • Sampling 37.5% │  │                 │    │
│  │                 │  │  • Collage 12.5%  │  │                 │    │
│  └─────────────────┘  └───────────────────┘  └─────────────────┘    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
For video inputs, the batch is further split into three processing modes:
| Mode | Batch % | Description | Input → Output |
|---|---|---|---|
| Codec-Style | 50% | Select top-K salient patches based on HEVC residual | [n, 3, 64, 224, 224] → [n, 3, 8, 224, 224] |
| Frame Sampling | 37.5% | Uniform temporal sampling, 1 frame per bin | [n, 3, 64, 224, 224] → [n, 3, 8, 224, 224] |
| Collage | 12.5% | 8 frames concatenated into tall image | [n, 3, 64, 224, 224] → [n, 3, 1792, 224] |
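A minimal sketch of how such a per-batch split can be constructed (the mask names mirror those in the snippets below; the contiguous sample ordering follows the examples in this section):

```python
import torch

bs = 16
n_codec = int(bs * 0.50)             # 8 samples → codec-style
n_sample = int(bs * 0.375)           # 6 samples → frame sampling
n_collage = bs - n_codec - n_sample  # 2 samples → collage

sample_ids = torch.arange(bs)
mask_residual = sample_ids < n_codec                                  # samples 0-7
mask_frame_sampling = (sample_ids >= n_codec) & (sample_ids < n_codec + n_sample)  # 8-13
mask_collage = sample_ids >= n_codec + n_sample                       # samples 14-15
```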
Video Input: [bs, 3, 64, 224, 224]
│
┌──────────────────┼──────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Codec-Style │ │Frame Sampling │ │ Collage │
│ (50%) │ │ (37.5%) │ │ (12.5%) │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Patchify │ │ Sample frames │ │ Sample frames │
│ [n,3,16384,p²]│ │ from 8 bins │ │ from 8 bins │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Select top-K │ │ Build indices │ │ Concat frames │
│ by vis_idx │ │ for 8 frames │ │ vertically │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Unpatchify │ │ │ │ │
│[n,3,8,224,224]│ │[n,3,8,224,224]│ │[n,3,1792,224] │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
└──────────────────┼──────────────────┘
│
▼
┌───────────────┐
│ ViT Backbone │
│ with RoPE │
└───────┬───────┘
│
▼
[bs, hidden_size]
This mode uses HEVC-extracted saliency information to select the most informative patches:
# Example: bs=16, first 8 samples use codec-style
# visible_indices contains pre-computed salient patch indices from HEVC analysis
# Step 1: Use pre-computed visible_indices (sorted by saliency)
out[mask_residual] = visible_indices[mask_residual, :target_num] # [8, 2048]
# Step 2: Patchify full video
# [8, 3, 64, 224, 224] → [8, 3, 16384, 14, 14] (64 frames × 256 patches/frame)
patches = (video.view(n, C, T, Hp, patch_size, Wp, patch_size)
                .permute(0, 1, 2, 3, 5, 4, 6)
                .reshape(n, C, T * Hp * Wp, patch_size, patch_size))
# Step 3: Select top-K patches by visible_indices
# idx: visible_indices expanded to match patches' dims for gather along dim 2
selected = torch.gather(patches, 2, idx)  # [8, 3, 2048, 14, 14]
# Step 4: Unpatchify back to video format
# 2048 patches = 8 frames × 256 patches/frame
combined_head_input = (selected.view(n, C, 8, Hp, Wp, patch_size, patch_size)
                               .permute(0, 1, 2, 3, 5, 4, 6)
                               .reshape(n, C, 8, H, W))  # [8, 3, 8, 224, 224]
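For reference, here is a self-contained version of the patchify → gather → unpatchify round trip on random data (illustrative only; in training, `visible_indices` comes from the HEVC saliency analysis rather than the random stand-in used here):

```python
import torch

n, C, T, H, W, p = 8, 3, 64, 224, 224, 14
Hp, Wp = H // p, W // p          # 16 × 16 patches per frame
K = 2048                         # 8 frames' worth of patches

video = torch.randn(n, C, T, H, W)
# Stand-in for codec saliency output: K sorted patch indices per sample
visible_indices = torch.sort(torch.randperm(T * Hp * Wp)[:K]).values.repeat(n, 1)

# Patchify: [n, C, T, H, W] → [n, C, T*Hp*Wp, p, p]
patches = (video.view(n, C, T, Hp, p, Wp, p)
                .permute(0, 1, 2, 3, 5, 4, 6)
                .reshape(n, C, T * Hp * Wp, p, p))

# Gather the K selected patches along the patch axis
idx = visible_indices.view(n, 1, K, 1, 1).expand(-1, C, -1, p, p)
selected = torch.gather(patches, 2, idx)   # [n, C, K, p, p]

# Unpatchify: 2048 patches = 8 virtual frames × 256 patches/frame
out = (selected.view(n, C, 8, Hp, Wp, p, p)
               .permute(0, 1, 2, 3, 5, 4, 6)
               .reshape(n, C, 8, H, W))    # [8, 3, 8, 224, 224]
```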
This mode uniformly samples frames from temporal bins:

# Example: samples 8-13 in batch use frame sampling
# Divide 64 frames into 8 bins of 8 frames each, sample 1 from each bin
# Step 1: Sample frame indices
# bins: [0-7], [8-15], [16-23], [24-31], [32-39], [40-47], [48-55], [56-63]
frames = torch.arange(8) * 8 + torch.randint(8, (nB, 8)) # [6, 8]
# Step 2: Build patch indices for all patches in selected frames
# Each frame has 256 patches
out[mask_frame_sampling] = (frames.unsqueeze(-1) * 256 +
                            torch.arange(256)).reshape(nB, -1)  # [6, 2048]
# Step 3: Same patchify → select → unpatchify as codec-style
# Result: [6, 3, 8, 224, 224]

This mode concatenates sampled frames into a single tall image:
# Example: samples 14-15 in batch use collage
# Sample 8 frames and concatenate vertically
# Step 1: Sample 8 frames (same bin-based sampling)
frames_idx = base + offsets # [2, 8], values in [0, 63]
# Step 2: Gather selected frames
sel_frames = torch.gather(video, 2, idx_expand) # [2, 3, 8, 224, 224]
# Step 3: Concatenate frames vertically
sel_frames = sel_frames.permute(0, 2, 1, 3, 4) # [2, 8, 3, 224, 224]
grid = torch.cat([sel_frames[:, i] for i in range(8)], dim=-2) # [2, 3, 1792, 224]
# Result: processed as a tall image (1792 = 224 × 8)

- Unified Architecture: The same ViT backbone handles all modalities through different input preprocessing
- Complementary Learning:
- Codec-style: Learns to focus on temporally salient regions
- Frame sampling: Learns uniform temporal understanding
- Collage: Learns spatial arrangement of temporal information
- Robust Representations: Exposure to diverse input formats improves generalization
- Efficient Training: Single forward pass processes all modalities together
All video modes use the same 3D RoPE position encoding:
# visible_indices maps selected patches to positions in a 64-frame virtual grid
# This enables consistent temporal position encoding across all modes
# Codec-style: patches scattered across 64 frames
# Frame sampling: 8 complete frames with gaps
# Collage: treated as single image (T=1)
# The model learns to handle all patterns through the unified RoPE mechanism
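To make the mapping concrete, flat indices can be decoded back into [t, h, w] coordinates (a small illustrative sketch assuming the 256-patches-per-frame layout used throughout this card):

```python
import torch

visible_indices = torch.tensor([[0, 257, 4095]])  # example flat indices, [B=1, K=3]

frame_tokens, patches_per_side = 256, 16
t = visible_indices // frame_tokens                       # frame index: [0, 1, 15]
h = (visible_indices % frame_tokens) // patches_per_side  # patch row:   [0, 0, 15]
w = visible_indices % patches_per_side                    # patch col:   [0, 1, 15]
patch_positions = torch.stack([t, h, w], dim=-1)          # [B, K, 3], as consumed by 3D RoPE
```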