Improve readability of Image Data Annotation section

Copilot · anxiangsir · Copilot · commit 567bea7f0f50 · 2026-02-05T05:23:03.000Z
Co-authored-by: anxiangsir <31175974+anxiangsir@users.noreply.github.com>
diff --git a/docs/data_card.md b/docs/data_card.md
@@ -35,7 +35,13 @@ The table below shows the pretraining dataset composition. We use "ExoVideo" to
 
 ## Image Data Annotation
 
-**Image Data Annotation.** For image data, we primarily process LAION-400M and COYO-700M. First, we employ a Union-Find algorithm to strictly deduplicate the dataset. Subsequently, we utilize the metaclip-h14-fullcc2.5b model to extract image features and cluster all images into two million classes. Based on this clustering, each image sample is annotated with the nearest Top-10 class centers as its multi-label supervision signal. Furthermore, we incorporate the OBELICS and Zero250M datasets. We utilize PaddleOCR to recognize text within images and perform word segmentation on the recognized content; the resulting vocabulary is used as multi-labels to construct a supervision signal containing exactly 100 fine-grained tags per image.
+For image data, we primarily process LAION-400M and COYO-700M with the following pipeline:
+
+**Deduplication:** We employ a Union-Find algorithm to strictly deduplicate the dataset.
+
+**Clustering and Multi-label Annotation:** We utilize the metaclip-h14-fullcc2.5b model to extract image features and cluster all images into two million classes. Based on this clustering, each image sample is annotated with the nearest Top-10 class centers as its multi-label supervision signal.
+
+**OCR-based Fine-grained Tagging:** Furthermore, we incorporate the OBELICS and Zero250M datasets. We utilize PaddleOCR to recognize text within images and perform word segmentation on the recognized content; the resulting vocabulary is used as multi-labels to construct a supervision signal containing exactly 100 fine-grained tags per image.
 
 ---
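
Reviewer note: the deduplication and Top-10 annotation steps described in this hunk can be illustrated with a minimal sketch. It is not the authors' implementation. The data card does not say how duplicate pairs are detected or which distance is used against the class centers, so the sketch assumes duplicate pairs arrive as index pairs and uses squared Euclidean distance. The names `UnionFind`, `deduplicate`, and `top_k_centers` are hypothetical.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Path halving: point each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def deduplicate(num_items, duplicate_pairs):
    """Keep one representative (the smallest index) per duplicate group.

    Assumption: duplicate_pairs is produced by some upstream near-duplicate
    detector that the data card does not describe.
    """
    uf = UnionFind(num_items)
    for a, b in duplicate_pairs:
        uf.union(a, b)
    groups = defaultdict(list)
    for i in range(num_items):
        groups[uf.find(i)].append(i)
    return sorted(min(members) for members in groups.values())


def top_k_centers(feature, centers, k=10):
    """Return indices of the k class centers nearest to a feature vector.

    Assumption: squared Euclidean distance; the actual metric is not stated.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    ranked = sorted(range(len(centers)), key=lambda c: sq_dist(feature, centers[c]))
    return ranked[:k]
```

Union-Find merges transitive duplicates: if image 0 matches 1 and 1 matches 2, all three collapse into one group even though 0 and 2 were never compared directly. At the scale of two million class centers, the linear scan in `top_k_centers` would realistically be replaced by an approximate nearest-neighbor index.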