
Commit 66b27db

Copilot and anxiangsir committed
Add model card and data card documentation
Co-authored-by: anxiangsir <31175974+anxiangsir@users.noreply.github.com>
1 parent 18e0759 commit 66b27db

2 files changed: 277 additions & 0 deletions

docs/datacard.md (146 additions & 0 deletions)
# Data Card: OneVision Encoder Training Data

## Overview

This document describes the datasets used for training OneVision Encoder. The training data consists of both image and video datasets, totaling approximately 754 million samples.

## Dataset Summary

| Category | Total Samples |
|----------|---------------|
| **Image** | ~694M |
| **Video** | ~60M+ |
| **Total** | ~754M+ |

---

## Image Datasets

| Dataset | Samples | Description |
|---------|---------|-------------|
| **LAION-400M** | 250M | Large-scale image-text dataset curated from Common Crawl, filtered for high-quality image-text pairs |
| **COYO-700M** | 400M | Comprehensive image-text dataset with diverse web-sourced content |
| **OBELICS** | 15M | Interleaved image-text documents for multimodal understanding |
| **Zero250M** | 15M | High-quality image dataset for visual representation learning |
| **ImageNet-21K** | 14M | Large-scale hierarchical image dataset covering 21,841 synsets |

### Image Dataset Details

#### LAION-400M (250M samples used)
- **Source**: Common Crawl web data
- **Content**: Diverse web images with associated alt-text captions
- **Usage**: Pre-training for general visual understanding

#### COYO-700M (400M samples used)
- **Source**: Web-crawled image-text pairs
- **Content**: Large-scale diverse visual content
- **Usage**: Pre-training for broad visual coverage

#### OBELICS (15M samples)
- **Source**: Curated multimodal documents
- **Content**: Interleaved image-text documents
- **Usage**: Learning from contextual image-text relationships

#### Zero250M (15M samples used)
- **Source**: Curated image collection
- **Content**: High-quality images for representation learning
- **Usage**: Visual representation pre-training

#### ImageNet-21K (14M samples)
- **Source**: ImageNet project
- **Content**: Hierarchically organized images across 21,841 categories
- **Usage**: Fine-grained visual recognition pre-training

---

## Video Datasets

| Dataset | Samples | Description |
|---------|---------|-------------|
| **HowTo100M** | 30M | Instructional videos with narrated activities |
| **Panda-70M** | 30M | Large-scale video-text dataset with high-quality captions |
| **Kinetics-710** | - | Human action recognition benchmark (for evaluation/fine-tuning) |
| **Something-Something V2 (SSv2)** | - | Fine-grained temporal reasoning benchmark (for evaluation/fine-tuning) |

### Video Dataset Details

#### HowTo100M (30M samples)
- **Source**: YouTube instructional videos
- **Content**: How-to videos with automatic speech recognition transcripts
- **Usage**: Learning temporal dynamics and action understanding

#### Panda-70M (30M samples)
- **Source**: Curated video-text pairs
- **Content**: High-quality video clips with detailed captions
- **Usage**: Video-language alignment pre-training

#### Kinetics-710 (K710)
- **Source**: YouTube videos of human actions
- **Content**: Human action video clips
- **Usage**: Action recognition evaluation and fine-tuning

#### Something-Something V2 (SSv2)
- **Source**: Crowdsourced human actions
- **Content**: Fine-grained hand-object interactions
- **Usage**: Temporal reasoning evaluation and fine-tuning

---

## Data Processing

### Image Processing
- Native resolution support up to 448×448
- CLIP-style preprocessing (see the sketch below)
- No tiling or cropping for native resolution matching
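
A minimal sketch of what CLIP-style preprocessing looks like at the 448×448 maximum resolution. The normalization constants are the standard CLIP values and are an assumption here; the processor released with the model is authoritative:

```python
from torchvision import transforms

# Standard CLIP normalization statistics (assumed; check the released processor config).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

# Fixed-resolution variant for illustration only; the native-resolution path
# would keep the original aspect ratio instead of forcing a square resize.
preprocess = transforms.Compose([
    transforms.Resize((448, 448), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),                      # PIL image -> CHW float tensor in [0, 1]
    transforms.Normalize(CLIP_MEAN, CLIP_STD),  # per-channel normalization
])
```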

### Video Processing
- Frame sampling with temporal saliency detection (see the sketch below)
- Codec-style patch extraction for efficient processing
- Support for dense temporal sampling (up to 64 frames)
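
To make the codec-style idea concrete, here is a toy sketch that scores patches by how much they change between frames and keeps only the most salient ones. The frame-differencing saliency signal and the top-k rule are illustrative assumptions, not the released selection pipeline:

```python
import torch
import torch.nn.functional as F

def select_salient_patches(video, patch_size=16, keep_ratio=0.25):
    """Toy codec-style selection: keep the patches that change most over time.

    video: [T, C, H, W] float tensor; returns sorted indices into the
    flattened [T * (H // patch_size) * (W // patch_size)] patch grid.
    """
    T, C, H, W = video.shape
    diff = torch.zeros_like(video)
    diff[1:] = (video[1:] - video[:-1]).abs()               # per-pixel change vs. previous frame
    saliency = F.avg_pool2d(diff.mean(dim=1), patch_size)   # [T, H/ps, W/ps] patch scores
    scores = saliency.flatten()
    k = max(1, int(keep_ratio * scores.numel()))
    return scores.topk(k).indices.sort().values             # indices of the patches to keep

# Example: 64 frames at 224x224 with 16x16 patches -> 64 * 14 * 14 = 12544 patches, keep 25%.
video = torch.rand(64, 3, 224, 224)
visible_indices = select_salient_patches(video)
```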

---

## Data Licensing

Please refer to the original dataset licenses for usage terms:

- **LAION-400M**: CC-BY 4.0
- **COYO-700M**: CC-BY 4.0
- **OBELICS**: Various (see original source)
- **ImageNet-21K**: ImageNet License
- **HowTo100M**: Various (YouTube content)
- **Panda-70M**: Various (see original source)
- **Kinetics-710**: Various (YouTube content)
- **Something-Something V2**: Non-commercial research use

---

## Citation

If you use this data configuration, please cite the original dataset papers:

```bibtex
@article{schuhmann2021laion,
  title={LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs},
  author={Schuhmann, Christoph and others},
  year={2021}
}

@article{kakaobrain2022coyo-700m,
  title={COYO-700M: Image-Text Pair Dataset},
  author={Kakao Brain},
  year={2022}
}

@article{miech19howto100m,
  title={HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips},
  author={Miech, Antoine and others},
  year={2019}
}

@article{chen2024panda70m,
  title={Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers},
  author={Chen, Tsai-Shien and others},
  year={2024}
}
```

docs/model_card.md (131 additions & 0 deletions)

# Model Card: OneVision Encoder Large

## Model Overview

**OneVision Encoder Large** is a vision transformer that resolves the fundamental trade-off in video understanding: processing more frames captures richer temporal information but increases computation quadratically. Using principles from HEVC video compression, it implements codec-style patch selection that identifies temporally salient regions (areas with motion, object interactions, or semantic changes) and processes only these informative patches.

### Model Details

| Property | Value |
|----------|-------|
| **Model Type** | Vision Transformer (ViT) |
| **Architecture** | HEVC-Style Vision Transformer |
| **Hidden Size** | 1024 |
| **Intermediate Size** | 4096 |
| **Number of Layers** | 24 |
| **Number of Attention Heads** | 16 |
| **Patch Size** | 16 |
| **Image Resolution** | 448×448 (pre-trained) |
| **Video Resolution** | 224×224 with 256 tokens per frame |
| **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
| **Normalization** | Layer Normalization |
| **Activation Function** | GELU |
| **Attention Implementation** | Flash Attention 2 |
| **License** | Apache 2.0 |

## Key Features

- **Codec-Style Patch Selection**: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
- **3D Rotary Position Embedding**: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships (see the sketch after this list).
- **Global Contrastive Learning**: Trained with a 2M concept bank for better-separated semantic clusters.
- **Native Resolution Support**: Supports native resolution input without tiling or cropping.
- **Flash Attention 2**: Efficient attention implementation for improved performance and memory efficiency.
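
One way to read the 4:6:6 split is as an allocation of each attention head's rotary channels across the temporal, height, and width axes. A rough sketch under that reading (the head dimension of 64 follows from the table above; the exact rotation scheme in the released code may differ):

```python
import torch

head_dim = 1024 // 16                 # hidden_size / num_heads = 64 channels per head
ratio = torch.tensor([4, 6, 6])       # T : H : W split from the model card
t_dim, h_dim, w_dim = (head_dim * ratio // ratio.sum()).tolist()  # -> 16, 24, 24

def axis_inv_freq(dim, base=10000.0):
    # Standard RoPE inverse frequencies for one axis (dim must be even).
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

# Each token's (t, h, w) position rotates its own slice of the 64 channels.
inv_freq = {
    "t": axis_inv_freq(t_dim),  # 8 frequency pairs for time
    "h": axis_inv_freq(h_dim),  # 12 frequency pairs for height
    "w": axis_inv_freq(w_dim),  # 12 frequency pairs for width
}
```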

## Intended Use

### Primary Use Cases

- **Video Understanding**: Action recognition, video captioning, video question answering
- **Image Understanding**: Document understanding (DocVQA), chart understanding (ChartQA), OCR tasks
- **Vision-Language Models**: As the vision encoder backbone for multimodal large language models

### Downstream Tasks

- Video benchmarks: MVBench, VideoMME, Perception Test
- Image understanding: DocVQA, ChartQA, OCRBench
- Action recognition: SSv2, UCF101, Kinetics

## Quick Start

```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
# (Flash Attention 2 expects half-precision weights, hence the bfloat16 dtype.)
model = AutoModel.from_pretrained(
    "lmms-lab/onevision-encoder-large",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab/onevision-encoder-large",
    trust_remote_code=True
)

# Image inference: [B, C, H, W]
image = Image.open("path/to/your/image.jpg")
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda", torch.bfloat16)
with torch.no_grad():
    outputs = model(pixel_values)
# outputs.last_hidden_state: [B, num_patches, hidden_size]
# outputs.pooler_output: [B, hidden_size]

# Video inference: [B, C, T, H, W] with visible_indices
num_frames, frame_tokens, target_frames = 16, 256, 64
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda", torch.bfloat16)

# Build visible_indices for temporal sampling: place the 16 sampled frames
# evenly on a 64-frame grid, each contributing all 256 of its patch tokens.
frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()
visible_indices = (frame_pos.unsqueeze(-1) * frame_tokens + torch.arange(frame_tokens).cuda()).reshape(1, -1)

with torch.no_grad():
    outputs = model(video, visible_indices=visible_indices)
```

## Training

### Training Data

See [datacard.md](datacard.md) for detailed information about the training datasets.

### Training Procedure & Tips

1. **Pre-training**: Global contrastive learning with a 2M concept bank for discriminative embeddings
2. **Scale-up is the final step**: Maximize model capabilities before scaling, and ensure generalization phenomena emerge
3. **Avoid direct supervision from existing models**: Indirect usage is preferred over direct distillation, which may limit scaling capabilities
4. **Progressive training when resources are limited**: Start with low resolution/frame rate, then gradually fine-tune to higher settings (a toy schedule is sketched below)
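
For tip 4, a progressive schedule simply raises resolution and frame count stage by stage, warm-starting each stage from the previous one. The numbers below are placeholders, not the recipe behind the released checkpoints:

```python
# Hypothetical progressive schedule; each stage fine-tunes from the previous checkpoint.
stages = [
    {"resolution": 224, "num_frames": 8,  "epochs": 2},
    {"resolution": 336, "num_frames": 16, "epochs": 1},
    {"resolution": 448, "num_frames": 32, "epochs": 1},
]
for stage in stages:
    print(f"fine-tune at {stage['resolution']}px with {stage['num_frames']} frames for {stage['epochs']} epoch(s)")
```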

## Evaluation Results

### Attentive Probe Results

Performance is evaluated with an attentive probe on single-clip input, trained for 10 epochs across 8 action recognition datasets.

### LMM Probe Results

The LMM probe is trained on a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT. The training pipeline proceeds directly to Stage 2 fine-tuning with a native-resolution strategy.

## Limitations

- The model is pre-trained at specific resolutions (448×448 for images, 224×224 for video)
- Performance may vary on domains significantly different from the training data
- Video processing requires proper temporal sampling configuration

## Citation

```bibtex
@misc{onevision-encoder,
  title={OneVision Encoder: HEVC-Style Vision Transformer},
  author={EvolvingLMMs-Lab},
  year={2024},
  url={https://github.com/EvolvingLMMs-Lab/OneVision-Encoder}
}
```

## Contact

For questions and issues, please open an issue on the [GitHub repository](https://github.com/EvolvingLMMs-Lab/OneVision-Encoder).
