
Commit 5f843a7

Copilot and anxiangsir authored

Simplify data card documentation (#75)

* Initial plan
* Update data_card.md with simplified pretraining dataset table and annotation info
* Improve readability of Image Data Annotation section
* Fix numerical consistency in annotation section
* Simplify data_card.md by removing detailed content

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: anxiangsir <31175974+anxiangsir@users.noreply.github.com>

1 parent 1cb8023 · commit 5f843a7

1 file changed: docs/data_card.md (20 additions, 119 deletions)
@@ -1,136 +1,37 @@
  # Data Card: OneVision Encoder Training Data

- > **📦 Data Availability Notice:** The training data requires approximately **200TB** of storage. We are currently looking for suitable storage solutions. If you need access to the data immediately, please contact [anxiangsir@outlook.com](mailto:anxiangsir@outlook.com).
+ ## Overview

+ This document describes the datasets used for training OneVision Encoder. The pretraining corpus combines large-scale image and video datasets for unified visual representation learning.

- ## Overview
+ ## OneVision-Encoder Pretraining Dataset

- This document describes the datasets used for training OneVision Encoder. The training data consists of both image and video datasets.
+ | Source | Samples | Type | Modality | Temporal | Curation |
+ |--------|---------|------|----------|----------|----------|
+ | **LAION-400M** | 250M | WebImages | Image | -- | Yes |
+ | **COYO-700M** | 400M | WebImages | Image | -- | Yes |
+ | **OBELICS** | 15M | Documents | Image | -- | Yes |
+ | **Zero250M** | 15M | CuratedImages | Image | -- | Yes |
+ | **ImageNet-21K** | 14M | Images | Image | -- | Yes |
+ | **HowTo100M** | 50M | ExoVideo | Video | Short | No |
+ | **Panda-70M** | 50M | ExoVideo | Video | Long | Yes |
+ | **Kinetics-710** | 658K | ActionVideo | Video | Short | Yes |
+ | **SSV2** | 221K | ActionVideo | Video | Short | Yes |

- ## Dataset Summary
+ ### Dataset Summary

  | Category | Total Samples |
  |----------|---------------|
  | **Image** | ~694M |
  | **Video** | ~100M+ |
  | **Total** | ~794M+ |

- ---
-
- ## Image Datasets
-
- | Dataset | Samples | Description |
- |---------|---------|-------------|
- | **LAION-400M** | 250M | Large-scale image-text dataset curated from Common Crawl, filtered for high-quality image-text pairs |
- | **COYO-700M** | 400M | Comprehensive image-text dataset with diverse web-sourced content |
- | **OBELICS** | 15M | Interleaved image-text documents for multimodal understanding |
- | **Zero250M** | 15M | High-quality image dataset for visual representation learning |
- | **ImageNet-21K** | 14M | Large-scale hierarchical image dataset covering 21,841 synsets |
-
- ### Image Dataset Details
-
- #### LAION-400M (250M samples used)
- - **Source**: Common Crawl web data
- - **Content**: Diverse web images with associated alt-text captions
- - **Usage**: Pre-training for general visual understanding
-
- #### COYO-700M (400M samples used)
- - **Source**: Web-crawled image-text pairs
- - **Content**: Large-scale diverse visual content
- - **Usage**: Pre-training for broad visual coverage
-
- #### OBELICS (15M samples)
- - **Source**: Curated multimodal documents
- - **Content**: Interleaved image-text documents
- - **Usage**: Learning from contextual image-text relationships
-
- #### Zero250M (15M samples used)
- - **Source**: Curated image collection
- - **Content**: High-quality images for representation learning
- - **Usage**: Visual representation pre-training
-
- #### ImageNet-21K (14M samples)
- - **Source**: ImageNet project
- - **Content**: Hierarchically organized images across 21,841 categories
- - **Usage**: Fine-grained visual recognition pre-training
-
- ---
-
- ## Video Datasets
-
- | Dataset | Samples | Description |
- |---------|---------|-------------|
- | **HowTo100M** | 50M | Instructional videos with narrated activities |
- | **Panda-70M** | 50M | Large-scale video-text dataset with high-quality captions |
- | **Kinetics-710** | - | Human action recognition benchmark (for evaluation/fine-tuning) |
- | **Something-Something V2 (SSv2)** | - | Fine-grained temporal reasoning benchmark (for evaluation/fine-tuning) |
-
- ### Video Dataset Details
-
- #### HowTo100M
- - **Source**: YouTube instructional videos
- - **Content**: How-to videos with automatic speech recognition transcripts
- - **Usage**: Learning temporal dynamics and action understanding
-
- #### Panda-70M
- - **Source**: Curated video-text pairs
- - **Content**: High-quality video clips with detailed captions
- - **Usage**: Video-language alignment pre-training
-
- #### Kinetics-710 (K710)
- - **Source**: YouTube videos of human actions
- - **Content**: Human action video clips
- - **Usage**: Action recognition evaluation and fine-tuning
-
- #### Something-Something V2 (SSv2)
- - **Source**: Crowdsourced human actions
- - **Content**: Fine-grained hand-object interactions
- - **Usage**: Temporal reasoning evaluation and fine-tuning
-
- ---
-
-
- ## Data Licensing
-
- Please refer to the original dataset licenses for usage terms:
-
- - **LAION-400M**: CC-BY 4.0
- - **COYO-700M**: CC-BY 4.0
- - **OBELICS**: Various (see original source)
- - **ImageNet-21K**: ImageNet License
- - **HowTo100M**: Various (YouTube content)
- - **Panda-70M**: Various (see original source)
- - **Kinetics-710**: Various (YouTube content)
- - **Something-Something V2**: Non-commercial research use
-
- ---
-
- ## Citation
-
- If you use this data configuration, please cite the original dataset papers:
+ ## Image Data Annotation

- ```bibtex
- @article{schuhmann2021laion,
-   title={LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs},
-   author={Schuhmann, Christoph and others},
-   year={2021}
- }
+ For image data, we primarily process LAION-400M and COYO-700M with the following pipeline:

- @article{kakaobrain2022coyo-700m,
-   title={COYO-700M: Image-Text Pair Dataset},
-   author={Kakao Brain},
-   year={2022}
- }
+ **Deduplication:** We employ a Union-Find algorithm to strictly deduplicate the dataset.

- @article{miech19howto100m,
-   title={HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips},
-   author={Miech, Antoine and others},
-   year={2019}
- }
+ **Clustering and Multi-label Annotation:** We utilize the metaclip-h14-fullcc2.5b model to extract image features and cluster all images into 2 million classes. Based on this clustering, each image sample is annotated with the nearest Top-10 class centers as its multi-label supervision signal.

- @article{chen2024panda70m,
-   title={Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers},
-   author={Chen, Tsai-Shien and others},
-   year={2024}
- }
- ```
+ **OCR-based Fine-grained Tagging:** Furthermore, we incorporate the OBELICS and Zero250M datasets. We utilize PaddleOCR to recognize text within images and perform word segmentation on the recognized content; the resulting vocabulary is used as multi-labels to construct a supervision signal containing exactly 100 fine-grained tags per image.
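The annotation section added above names three techniques without showing code. The sketches that follow are one plausible reading of each step, written in Python; they are not the implementation shipped with OneVision Encoder. First, deduplication: the diff states only that a Union-Find algorithm strictly deduplicates the dataset, so the sketch below assumes an upstream near-duplicate detector (perceptual hashing, embedding similarity, or similar) has already proposed candidate duplicate pairs of image ids.

```python
# Minimal Union-Find (disjoint set) sketch for duplicate grouping.
# Assumption: `duplicate_pairs` comes from an upstream near-duplicate
# detector; the commit only states that Union-Find merges duplicates.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:      # walk up to the root
            root = self.parent[root]
        while self.parent[x] != root:         # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra              # merge the two groups

def deduplicate(image_ids, duplicate_pairs):
    """Keep one representative image per duplicate group."""
    uf = UnionFind()
    for a, b in duplicate_pairs:
        uf.union(a, b)
    kept, seen = [], set()
    for img in image_ids:
        root = uf.find(img)
        if root not in seen:
            seen.add(root)
            kept.append(img)
    return kept

# Duplicates are transitive: (1, 2) and (2, 3) collapse 1, 2, 3 into one group.
print(deduplicate([1, 2, 3, 4], [(1, 2), (2, 3)]))  # -> [1, 4]
```

Union-Find makes duplicate grouping transitive: if pairs (A, B) and (B, C) are reported, all three images collapse into one group even though A and C were never directly compared.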

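Second, clustering and multi-label annotation: the diff says features from metaclip-h14-fullcc2.5b are clustered into 2 million classes and each image is labeled with its Top-10 nearest class centers. A minimal sketch of the assignment step, assuming unit-normalized features and precomputed centroids; the names `features`/`centers` and the toy sizes are illustrative, and at 2 million centers an approximate-nearest-neighbor index would stand in for the dense similarity matrix.

```python
# Sketch: assign each image its Top-10 nearest cluster centers as multi-labels.
# Assumptions: `features` are unit-normalized image embeddings and `centers`
# are unit-normalized centroids from a prior clustering; the real pipeline
# operates at 2M centers, where a dense matmul would be replaced by ANN search.

import numpy as np

def top_k_center_labels(features, centers, k=10):
    """Indices of the k nearest centers per image, by cosine similarity."""
    sims = features @ centers.T                       # (num_images, num_centers)
    topk = np.argpartition(-sims, k, axis=1)[:, :k]   # top-k columns, unordered
    order = np.argsort(-np.take_along_axis(sims, topk, axis=1), axis=1)
    return np.take_along_axis(topk, order, axis=1)    # top-k, best match first

# Toy example: 4 images, 32 centers, 8-dim features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
cents = rng.normal(size=(32, 8))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
cents /= np.linalg.norm(cents, axis=1, keepdims=True)
print(top_k_center_labels(feats, cents).shape)        # (4, 10)
```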
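Third, OCR-based fine-grained tagging: the diff says PaddleOCR recognizes in-image text, the text is word-segmented, and the resulting vocabulary supplies exactly 100 tags per image. The sketch below assumes PaddleOCR's classic `ocr()` call (newer releases expose a different `predict()` interface); the regex word segmentation and the pad/truncate rule for reaching a fixed 100 tags are assumptions, since the commit specifies neither.

```python
# Sketch: OCR text -> fixed-size multi-label tag set.
# Assumptions: PaddleOCR's classic ocr() interface; simple regex word
# segmentation (the commit does not name a segmenter); and pad/truncate to
# exactly 100 tags, since the rule for fixing the tag count is not specified.

import re
from collections import Counter
from paddleocr import PaddleOCR

NUM_TAGS = 100
PAD_TAG = "<pad>"          # hypothetical filler label

ocr = PaddleOCR(lang="en")

def image_tags(image_path):
    result = ocr.ocr(image_path)
    words = []
    # Classic interface: one list per image of (box, (text, confidence)).
    for detection in (result[0] or []):
        text = detection[1][0].lower()
        words.extend(re.findall(r"[a-z0-9]+", text))
    # Most frequent words first, then pad so every image has exactly 100 tags.
    tags = [w for w, _ in Counter(words).most_common(NUM_TAGS)]
    tags += [PAD_TAG] * (NUM_TAGS - len(tags))
    return tags
```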