
Commit 66b27db

Copilot and anxiangsir committed
Add model card and data card documentation
Co-authored-by: anxiangsir <31175974+anxiangsir@users.noreply.github.com>
1 parent 18e0759 commit 66b27db

2 files changed: 277 additions & 0 deletions

docs/datacard.md (146 additions & 0 deletions)
# Data Card: OneVision Encoder Training Data

## Overview

This document describes the datasets used for training OneVision Encoder. The training data consists of both image and video datasets, totaling approximately 754 million samples.

## Dataset Summary

| Category | Total Samples |
|----------|---------------|
| **Image** | ~694M |
| **Video** | ~60M+ |
| **Total** | ~754M+ |

---

## Image Datasets

| Dataset | Samples | Description |
|---------|---------|-------------|
| **LAION-400M** | 250M | Large-scale image-text dataset curated from Common Crawl, filtered for high-quality image-text pairs |
| **COYO-700M** | 400M | Comprehensive image-text dataset with diverse web-sourced content |
| **OBELICS** | 15M | Interleaved image-text documents for multimodal understanding |
| **Zero250M** | 15M | High-quality image dataset for visual representation learning |
| **ImageNet-21K** | 14M | Large-scale hierarchical image dataset covering 21,841 synsets |

### Image Dataset Details

#### LAION-400M (250M samples used)
- **Source**: Common Crawl web data
- **Content**: Diverse web images with associated alt-text captions
- **Usage**: Pre-training for general visual understanding

#### COYO-700M (400M samples used)
- **Source**: Web-crawled image-text pairs
- **Content**: Large-scale diverse visual content
- **Usage**: Pre-training for broad visual coverage

#### OBELICS (15M samples)
- **Source**: Curated multimodal documents
- **Content**: Interleaved image-text documents
- **Usage**: Learning from contextual image-text relationships

#### Zero250M (15M samples used)
- **Source**: Curated image collection
- **Content**: High-quality images for representation learning
- **Usage**: Visual representation pre-training

#### ImageNet-21K (14M samples)
- **Source**: ImageNet project
- **Content**: Hierarchically organized images across 21,841 categories
- **Usage**: Fine-grained visual recognition pre-training

---

## Video Datasets

| Dataset | Samples | Description |
|---------|---------|-------------|
| **HowTo100M** | 30M | Instructional videos with narrated activities |
| **Panda-70M** | 30M | Large-scale video-text dataset with high-quality captions |
| **Kinetics-710** | - | Human action recognition benchmark (for evaluation/fine-tuning) |
| **Something-Something V2 (SSv2)** | - | Fine-grained temporal reasoning benchmark (for evaluation/fine-tuning) |

### Video Dataset Details

#### HowTo100M (30M samples)
- **Source**: YouTube instructional videos
- **Content**: How-to videos with automatic speech recognition transcripts
- **Usage**: Learning temporal dynamics and action understanding

#### Panda-70M (30M samples)
- **Source**: Curated video-text pairs
- **Content**: High-quality video clips with detailed captions
- **Usage**: Video-language alignment pre-training

#### Kinetics-710 (K710)
- **Source**: YouTube videos of human actions
- **Content**: Human action video clips
- **Usage**: Action recognition evaluation and fine-tuning

#### Something-Something V2 (SSv2)
- **Source**: Crowdsourced human actions
- **Content**: Fine-grained hand-object interactions
- **Usage**: Temporal reasoning evaluation and fine-tuning

---

## Data Processing

### Image Processing
- Native resolution support up to 448×448
- CLIP-style preprocessing (see the sketch below)
- No tiling or cropping for native resolution matching
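
A minimal sketch of what CLIP-style preprocessing looks like at the 448×448 maximum resolution. The normalization constants are the standard CLIP values and are an assumption here; the processor released with the model is authoritative:

```python
from torchvision import transforms

# Standard CLIP normalization statistics (assumed; check the released processor config).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

# Fixed-resolution variant for illustration only; the native-resolution path
# would keep the original aspect ratio instead of forcing a square resize.
preprocess = transforms.Compose([
    transforms.Resize((448, 448), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),                      # PIL image -> CHW float tensor in [0, 1]
    transforms.Normalize(CLIP_MEAN, CLIP_STD),  # per-channel normalization
])
```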

### Video Processing
- Frame sampling with temporal saliency detection (see the sketch below)
- Codec-style patch extraction for efficient processing
- Support for dense temporal sampling (up to 64 frames)
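
To make the codec-style idea concrete, here is a toy sketch that scores patches by how much they change between frames and keeps only the most salient ones. The frame-differencing saliency signal and the top-k rule are illustrative assumptions, not the released selection pipeline:

```python
import torch
import torch.nn.functional as F

def select_salient_patches(video, patch_size=16, keep_ratio=0.25):
    """Toy codec-style selection: keep the patches that change most over time.

    video: [T, C, H, W] float tensor; returns sorted indices into the
    flattened [T * (H // patch_size) * (W // patch_size)] patch grid.
    """
    T, C, H, W = video.shape
    diff = torch.zeros_like(video)
    diff[1:] = (video[1:] - video[:-1]).abs()               # per-pixel change vs. previous frame
    saliency = F.avg_pool2d(diff.mean(dim=1), patch_size)   # [T, H/ps, W/ps] patch scores
    scores = saliency.flatten()
    k = max(1, int(keep_ratio * scores.numel()))
    return scores.topk(k).indices.sort().values             # indices of the patches to keep

# Example: 64 frames at 224x224 with 16x16 patches -> 64 * 14 * 14 = 12544 patches, keep 25%.
video = torch.rand(64, 3, 224, 224)
visible_indices = select_salient_patches(video)
```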

---

## Data Licensing

Please refer to the original dataset licenses for usage terms:

- **LAION-400M**: CC-BY 4.0
- **COYO-700M**: CC-BY 4.0
- **OBELICS**: Various (see original source)
- **ImageNet-21K**: ImageNet License
- **HowTo100M**: Various (YouTube content)
- **Panda-70M**: Various (see original source)
- **Kinetics-710**: Various (YouTube content)
- **Something-Something V2**: Non-commercial research use

---

## Citation

If you use this data configuration, please cite the original dataset papers:

```bibtex
@article{schuhmann2021laion,
  title={LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs},
  author={Schuhmann, Christoph and others},
  year={2021}
}

@article{kakaobrain2022coyo-700m,
  title={COYO-700M: Image-Text Pair Dataset},
  author={Kakao Brain},
  year={2022}
}

@article{miech19howto100m,
  title={HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips},
  author={Miech, Antoine and others},
  year={2019}
}

@article{chen2024panda70m,
  title={Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers},
  author={Chen, Tsai-Shien and others},
  year={2024}
}
```

docs/model_card.md (131 additions & 0 deletions)

# Model Card: OneVision Encoder Large

## Model Overview

**OneVision Encoder Large** is a vision transformer that resolves the fundamental trade-off in video understanding: processing more frames captures richer temporal information but increases computation quadratically. Using principles from HEVC video compression, it implements codec-style patch selection that identifies temporally salient regions (areas with motion, object interactions, or semantic changes) and processes only these informative patches.

### Model Details

| Property | Value |
|----------|-------|
| **Model Type** | Vision Transformer (ViT) |
| **Architecture** | HEVC-Style Vision Transformer |
| **Hidden Size** | 1024 |
| **Intermediate Size** | 4096 |
| **Number of Layers** | 24 |
| **Number of Attention Heads** | 16 |
| **Patch Size** | 16 |
| **Image Resolution** | 448×448 (pre-trained) |
| **Video Resolution** | 224×224 with 256 tokens per frame |
| **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
| **Normalization** | Layer Normalization |
| **Activation Function** | GELU |
| **Attention Implementation** | Flash Attention 2 |
| **License** | Apache 2.0 |

## Key Features

- **Codec-Style Patch Selection**: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
- **3D Rotary Position Embedding**: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships (see the sketch after this list).
- **Global Contrastive Learning**: Trained with a 2M concept bank for better-separated semantic clusters.
- **Native Resolution Support**: Supports native resolution input without tiling or cropping.
- **Flash Attention 2**: Efficient attention implementation for improved performance and memory efficiency.
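
One way to read the 4:6:6 split is as an allocation of each attention head's rotary channels across the temporal, height, and width axes. A rough sketch under that reading (the head dimension of 64 follows from the table above; the exact rotation scheme in the released code may differ):

```python
import torch

head_dim = 1024 // 16                 # hidden_size / num_heads = 64 channels per head
ratio = torch.tensor([4, 6, 6])       # T : H : W split from the model card
t_dim, h_dim, w_dim = (head_dim * ratio // ratio.sum()).tolist()  # -> 16, 24, 24

def axis_inv_freq(dim, base=10000.0):
    # Standard RoPE inverse frequencies for one axis (dim must be even).
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

# Each token's (t, h, w) position rotates its own slice of the 64 channels.
inv_freq = {
    "t": axis_inv_freq(t_dim),  # 8 frequency pairs for time
    "h": axis_inv_freq(h_dim),  # 12 frequency pairs for height
    "w": axis_inv_freq(w_dim),  # 12 frequency pairs for width
}
```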

## Intended Use

### Primary Use Cases

- **Video Understanding**: Action recognition, video captioning, video question answering
- **Image Understanding**: Document understanding (DocVQA), chart understanding (ChartQA), OCR tasks
- **Vision-Language Models**: As the vision encoder backbone for multimodal large language models

### Downstream Tasks

- Video benchmarks: MVBench, VideoMME, Perception Test
- Image understanding: DocVQA, ChartQA, OCRBench
- Action recognition: SSv2, UCF101, Kinetics

## Quick Start

```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
# (Flash Attention 2 expects half-precision weights, hence the bfloat16 dtype.)
model = AutoModel.from_pretrained(
    "lmms-lab/onevision-encoder-large",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab/onevision-encoder-large",
    trust_remote_code=True
)

# Image inference: [B, C, H, W]
image = Image.open("path/to/your/image.jpg")
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda", torch.bfloat16)
with torch.no_grad():
    outputs = model(pixel_values)
# outputs.last_hidden_state: [B, num_patches, hidden_size]
# outputs.pooler_output: [B, hidden_size]

# Video inference: [B, C, T, H, W] with visible_indices
num_frames, frame_tokens, target_frames = 16, 256, 64
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda", torch.bfloat16)

# Build visible_indices for temporal sampling: place the 16 sampled frames
# evenly on a 64-frame grid, each contributing all 256 of its patch tokens.
frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()
visible_indices = (frame_pos.unsqueeze(-1) * frame_tokens + torch.arange(frame_tokens).cuda()).reshape(1, -1)

with torch.no_grad():
    outputs = model(video, visible_indices=visible_indices)
```

## Training

### Training Data

See [datacard.md](datacard.md) for detailed information about the training datasets.

### Training Procedure & Tips

1. **Pre-training**: Global contrastive learning with a 2M concept bank for discriminative embeddings
2. **Scale-up is the final step**: Maximize model capabilities before scaling, and ensure generalization phenomena emerge
3. **Avoid direct supervision from existing models**: Indirect usage is preferred over direct distillation, which may limit scaling capabilities
4. **Progressive training when resources are limited**: Start with low resolution/frame rate, then gradually fine-tune to higher settings (a toy schedule is sketched below)
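
For tip 4, a progressive schedule simply raises resolution and frame count stage by stage, warm-starting each stage from the previous one. The numbers below are placeholders, not the recipe behind the released checkpoints:

```python
# Hypothetical progressive schedule; each stage fine-tunes from the previous checkpoint.
stages = [
    {"resolution": 224, "num_frames": 8,  "epochs": 2},
    {"resolution": 336, "num_frames": 16, "epochs": 1},
    {"resolution": 448, "num_frames": 32, "epochs": 1},
]
for stage in stages:
    print(f"fine-tune at {stage['resolution']}px with {stage['num_frames']} frames for {stage['epochs']} epoch(s)")
```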

## Evaluation Results

### Attentive Probe Results

Performance is evaluated with an attentive probe on single-clip input, trained for 10 epochs across 8 action recognition datasets.

### LMM Probe Results

The LMM probe is trained on a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT. The training pipeline proceeds directly to Stage 2 fine-tuning with a native-resolution strategy.

## Limitations

- The model is pre-trained at specific resolutions (448×448 for images, 224×224 for video)
- Performance may vary on domains significantly different from the training data
- Video processing requires proper temporal sampling configuration

## Citation

```bibtex
@misc{onevision-encoder,
  title={OneVision Encoder: HEVC-Style Vision Transformer},
  author={EvolvingLMMs-Lab},
  year={2024},
  url={https://github.com/EvolvingLMMs-Lab/OneVision-Encoder}
}
```

## Contact

For questions and issues, please open an issue on the [GitHub repository](https://github.com/EvolvingLMMs-Lab/OneVision-Encoder).
