# Data Card: OneVision Encoder Training Data
## Overview

This document describes the datasets used for training OneVision Encoder. The pretraining corpus combines large-scale image and video datasets for unified visual representation learning.
## OneVision-Encoder Pretraining Dataset

| Source | Samples | Type | Modality | Temporal | Curation |
|--------|---------|------|----------|----------|----------|
| **LAION-400M** | 250M | WebImages | Image | -- | Yes |
| **COYO-700M** | 400M | WebImages | Image | -- | Yes |
| **OBELICS** | 15M | Documents | Image | -- | Yes |
| **Zero250M** | 15M | CuratedImages | Image | -- | Yes |
| **ImageNet-21K** | 14M | Images | Image | -- | Yes |
| **HowTo100M** | 50M | ExoVideo | Video | Short | No |
| **Panda-70M** | 50M | ExoVideo | Video | Long | Yes |
| **Kinetics-710** | 658K | ActionVideo | Video | Short | Yes |
| **SSV2** | 221K | ActionVideo | Video | Short | Yes |
### Dataset Summary

| Category | Total Samples |
|----------|---------------|
| **Image** | ~694M |
| **Video** | ~100M+ |
| **Total** | ~794M+ |
## Image Data Annotation
For image data, we primarily process LAION-400M and COYO-700M with the following pipeline:

**Deduplication:** We employ a Union-Find (disjoint-set) algorithm to strictly deduplicate the dataset: candidate duplicate pairs are merged into connected components, so every group of mutual duplicates can be collapsed at once.
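
The card does not include the dedup code itself; the following is a minimal sketch of the Union-Find step, under the assumption that duplicate candidate pairs come from some upstream similarity search (hashing or embedding nearest neighbours, not specified in the card) and that one representative image is kept per connected component:

```python
class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Path halving: flatten the tree while walking to the root.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def deduplicate(num_images, duplicate_pairs):
    """Merge duplicate pairs into components; keep one image per component.

    `duplicate_pairs` is assumed to come from an upstream similarity
    search; the data card does not specify how candidates are found.
    """
    uf = UnionFind(num_images)
    for i, j in duplicate_pairs:
        uf.union(i, j)
    # Keep the canonical root of each component as the surviving sample.
    return sorted({uf.find(i) for i in range(num_images)})


# Six images where (0, 1), (1, 2) and (4, 5) are duplicates -> [0, 3, 4]
print(deduplicate(6, [(0, 1), (1, 2), (4, 5)]))
```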
**Clustering and Multi-label Annotation:** We utilize the metaclip-h14-fullcc2.5b model to extract image features and cluster all images into 2 million classes. Based on this clustering, each image is annotated with its Top-10 nearest class centers as a multi-label supervision signal.
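
As an illustrative sketch (not the authors' code): once features are extracted and cluster centers are fitted, the multi-label target for an image is simply the IDs of its ten most similar centroids. The shapes and random placeholder data below are assumptions; the real pipeline has 2 million centers and would use `np.argpartition` or an approximate-nearest-neighbour index rather than a full sort.

```python
import numpy as np

def topk_cluster_labels(features, centers, k=10):
    """Return the k nearest cluster centers per image as multi-label ids.

    `features`: (N, D) image embeddings; `centers`: (C, D) centroids.
    Both are assumed L2-normalised, so a dot product is cosine similarity.
    """
    sims = features @ centers.T                 # (N, C) similarity matrix
    # A full argsort is fine for a toy C; with 2M centers you would use
    # np.argpartition or an ANN index instead.
    return np.argsort(-sims, axis=1)[:, :k]    # (N, k) class ids, best first

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 64))
cents = rng.normal(size=(32, 64))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
cents /= np.linalg.norm(cents, axis=1, keepdims=True)
print(topk_cluster_labels(feats, cents).shape)  # (4, 10)
```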
**OCR-based Fine-grained Tagging:** Furthermore, we incorporate the OBELICS and Zero250M datasets. We utilize PaddleOCR to recognize text within images and perform word segmentation on the recognized content; the resulting vocabulary is used as multi-labels to construct a supervision signal of exactly 100 fine-grained tags per image.
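
A sketch of what this tagging step could look like, assuming the PaddleOCR 2.x Python interface and jieba for the word-segmentation step (the card names neither the segmenter nor how the 100-tag budget is filled, so the frequency ranking and `<pad>` filler below are illustrative assumptions):

```python
# pip install paddleocr jieba
from collections import Counter

import jieba                       # assumed segmenter; not named in the card
from paddleocr import PaddleOCR

NUM_TAGS = 100                     # the card fixes 100 fine-grained tags/image
PAD_TAG = "<pad>"                  # hypothetical filler for text-sparse images

ocr_engine = PaddleOCR(lang="ch")  # classic PaddleOCR 2.x entry point

def image_ocr_tags(image_path):
    """Recognize text, segment it into words, return exactly NUM_TAGS tags."""
    result = ocr_engine.ocr(image_path)        # [[ [box, (text, score)], ... ]]
    words = Counter()
    for line in result[0] or []:               # result[0] is None if no text
        text = line[1][0]
        words.update(w for w in jieba.cut(text) if w.strip())
    tags = [w for w, _ in words.most_common(NUM_TAGS)]
    tags += [PAD_TAG] * (NUM_TAGS - len(tags))  # pad up to the fixed budget
    return tags
```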