Commit 95b206f

Copilot and anxiangsir committed
Fix numerical consistency in annotation section
Co-authored-by: anxiangsir <31175974+anxiangsir@users.noreply.github.com>
1 parent 567bea7 commit 95b206f

1 file changed

Lines changed: 1 addition & 1 deletion

docs/data_card.md

@@ -39,7 +39,7 @@ For image data, we primarily process LAION-400M and COYO-700M with the following
 **Deduplication:** We employ a Union-Find algorithm to strictly deduplicate the dataset.

-**Clustering and Multi-label Annotation:** We utilize the metaclip-h14-fullcc2.5b model to extract image features and cluster all images into two million classes. Based on this clustering, each image sample is annotated with the nearest Top-10 class centers as its multi-label supervision signal.
+**Clustering and Multi-label Annotation:** We utilize the metaclip-h14-fullcc2.5b model to extract image features and cluster all images into 2 million classes. Based on this clustering, each image sample is annotated with the nearest Top-10 class centers as its multi-label supervision signal.

 **OCR-based Fine-grained Tagging:** Furthermore, we incorporate the OBELICS and Zero250M datasets. We utilize PaddleOCR to recognize text within images and perform word segmentation on the recognized content; the resulting vocabulary is used as multi-labels to construct a supervision signal containing exactly 100 fine-grained tags per image.
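
The Deduplication line in the hunk above names Union-Find but gives no detail. The following is a minimal sketch of how such a step could work, not the authors' implementation. It assumes duplicate pairs have already been found by some upstream similarity check; the `duplicate_pairs` input and the `deduplicate` helper are hypothetical names introduced for illustration.

```python
from typing import Dict, Iterable, List, Tuple


class UnionFind:
    """Disjoint-set structure with path halving and union by size."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        # Walk to the root, shortening the path along the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def deduplicate(num_items: int,
                duplicate_pairs: Iterable[Tuple[int, int]]) -> List[int]:
    """Return one representative index per group of duplicates.

    duplicate_pairs is an assumed input: pairs (i, j) already judged
    to be duplicates of each other.
    """
    uf = UnionFind(num_items)
    for i, j in duplicate_pairs:
        uf.union(i, j)

    # Indices are visited in ascending order, so the first index seen
    # for each root is the smallest member of that group.
    kept: Dict[int, int] = {}
    for idx in range(num_items):
        root = uf.find(idx)
        if root not in kept:
            kept[root] = idx
    return sorted(kept.values())
```

Merging pairs transitively is the point of the structure: if items 0 and 1 match and items 1 and 2 match, all three collapse into one group even though 0 and 2 were never compared directly.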

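
The changed line describes annotating each image with its nearest Top-10 class centers. A rough sketch of that lookup is shown here under stated assumptions: cosine similarity is assumed as the distance measure (the data card does not say which is used), the features and centers are plain Python lists, and `top_k_centers` is a hypothetical helper name.

```python
import heapq
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def top_k_centers(feature: Sequence[float],
                  centers: Sequence[Sequence[float]],
                  k: int = 10) -> List[int]:
    """Return indices of the k centers most similar to `feature`.

    Similarity is cosine similarity (an assumption). Results are
    ordered from most to least similar.
    """
    f = _normalize(feature)
    scored = []
    for idx, center in enumerate(centers):
        c = _normalize(center)
        similarity = sum(a * b for a, b in zip(f, c))
        scored.append((similarity, idx))
    # nlargest returns at most k items, highest similarity first.
    return [idx for _, idx in heapq.nlargest(k, scored)]


# Toy usage with 2-D features and k=2 instead of 10.
centers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
labels = top_k_centers([1.0, 0.1], centers, k=2)
```

With 2 million centers, a linear scan like this would probably be replaced by an approximate nearest-neighbor index; the sketch only shows what the multi-label output looks like for a single image.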