Commit 567bea7

Copilot and anxiangsir committed
Improve readability of Image Data Annotation section
Co-authored-by: anxiangsir <31175974+anxiangsir@users.noreply.github.com>
1 parent 8fa506e commit 567bea7

1 file changed

Lines changed: 7 additions & 1 deletion

docs/data_card.md

@@ -35,7 +35,13 @@ The table below shows the pretraining dataset composition. We use "ExoVideo" to
 
 ## Image Data Annotation
 
-**Image Data Annotation.** For image data, we primarily process LAION-400M and COYO-700M. First, we employ a Union-Find algorithm to strictly deduplicate the dataset. Subsequently, we utilize the metaclip-h14-fullcc2.5b model to extract image features and cluster all images into two million classes. Based on this clustering, each image sample is annotated with the nearest Top-10 class centers as its multi-label supervision signal. Furthermore, we incorporate the OBELICS and Zero250M datasets. We utilize PaddleOCR to recognize text within images and perform word segmentation on the recognized content; the resulting vocabulary is used as multi-labels to construct a supervision signal containing exactly 100 fine-grained tags per image.
+For image data, we primarily process LAION-400M and COYO-700M with the following pipeline:
+
+**Deduplication:** We employ a Union-Find algorithm to strictly deduplicate the dataset.
+
+**Clustering and Multi-label Annotation:** We utilize the metaclip-h14-fullcc2.5b model to extract image features and cluster all images into two million classes. Based on this clustering, each image sample is annotated with the nearest Top-10 class centers as its multi-label supervision signal.
+
+**OCR-based Fine-grained Tagging:** Furthermore, we incorporate the OBELICS and Zero250M datasets. We utilize PaddleOCR to recognize text within images and perform word segmentation on the recognized content; the resulting vocabulary is used as multi-labels to construct a supervision signal containing exactly 100 fine-grained tags per image.
 
 ---
 
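Note on the Deduplication step: the data card names Union-Find but does not say how duplicate pairs are detected. The sketch below is therefore only an illustration, assuming the pairs are already available from some upstream matching step (for example, matching hashes or near-identical features). The names `UnionFind`, `deduplicate`, and `duplicate_pairs` are hypothetical and do not come from the repository.

```python
class UnionFind:
    """Disjoint-set forest with path halving and union by size."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        # Walk up to the root, shortening the path along the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def deduplicate(num_samples, duplicate_pairs):
    """Return one representative index per group of linked duplicates.

    `duplicate_pairs` is an assumed input: (i, j) index pairs already
    judged to be duplicates by an upstream step not described here.
    """
    uf = UnionFind(num_samples)
    for a, b in duplicate_pairs:
        uf.union(a, b)
    keep = {}
    for i in range(num_samples):
        root = uf.find(i)
        # Indices are visited in ascending order, so the first index
        # seen for a root is the smallest member of that group.
        if root not in keep:
            keep[root] = i
    return sorted(keep.values())


kept = deduplicate(6, [(0, 1), (1, 2), (4, 5)])
print(kept)  # [0, 3, 4]
```

Because duplicates are merged transitively (0 matches 1, 1 matches 2, so 0, 1, and 2 form one group), only one sample survives per connected component, which is one plausible reading of "strictly deduplicate".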

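Note on the Clustering and Multi-label Annotation step: assigning each image its nearest Top-10 class centers can be sketched as a top-k similarity search. The example below is a toy, pure-Python illustration that assumes cosine similarity as the distance measure (the data card does not state the metric); `top_k_centers` and `normalize` are hypothetical helper names, and the two-dimensional vectors stand in for real image features.

```python
import heapq
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def top_k_centers(feature, centers, k=10):
    """Return indices of the k centers most similar to `feature`.

    Assumption: similarity is cosine similarity between the image
    feature and each cluster center.
    """
    f = normalize(feature)
    sims = []
    for idx, center in enumerate(centers):
        c = normalize(center)
        sims.append((sum(a * b for a, b in zip(f, c)), idx))
    # nlargest returns entries ordered from most to least similar.
    return [idx for _, idx in heapq.nlargest(k, sims)]


centers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
labels = top_k_centers([0.9, 0.1], centers, k=2)
print(labels)  # [0, 2]
```

With two million centers, a linear scan like this would be far too slow; a vectorized or approximate nearest-neighbor search would be needed in practice, but the resulting label set per image has the same shape: a ranked list of center indices.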