---
license: apache-2.0
---


### ⚡ Quick Start

> **Note:** This model supports native-resolution input. For optimal performance:
> - **Image**: 448×448 resolution (pre-trained)
> - **Video**: 224×224 resolution with 256 tokens per frame (pre-trained)
>
> Use the CLIP preprocessing from the [model repository](https://huggingface.co/lmms-lab/onevision-encoder-large).

```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
model = AutoModel.from_pretrained(
    "lmms-lab/onevision-encoder-large",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab/onevision-encoder-large",
    trust_remote_code=True
)

# Image inference: [B, C, H, W]
image = Image.open("path/to/your/image.jpg")  # Replace with your image path
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda")
with torch.no_grad():
    outputs = model(pixel_values)
    # outputs.last_hidden_state: [B, num_patches, hidden_size]
    # outputs.pooler_output: [B, hidden_size]

# Video inference: [B, C, T, H, W] with visible_indices
num_frames, frame_tokens, target_frames = 16, 256, 64
# Load video frames and preprocess each frame (replace with your video frame paths)
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
# Reshape from [T, C, H, W] to [B, C, T, H, W]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")

# Build visible_indices for temporal sampling
frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()
visible_indices = (frame_pos.unsqueeze(-1) * frame_tokens + torch.arange(frame_tokens).cuda()).reshape(1, -1)
# visible_indices example (with 256 tokens per frame):
# Frame 0 (pos=0):   indices [0, 1, 2, ..., 255]
# Frame 1 (pos=4):   indices [1024, 1025, 1026, ..., 1279]
# Frame 2 (pos=8):   indices [2048, 2049, 2050, ..., 2303]
# ...
# Frame 15 (pos=63): indices [16128, 16129, ..., 16383]

with torch.no_grad():
    outputs = model(video, visible_indices=visible_indices)
```
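
As a quick check of the index layout built above (purely illustrative, reusing the variables from the snippet):

```python
# Sanity check (illustrative): 16 visible frames x 256 tokens = 4096 indices,
# spread over a 64-frame x 256-token grid whose last position is 16383.
assert visible_indices.shape == (1, num_frames * frame_tokens)            # (1, 4096)
assert visible_indices.min().item() == 0
assert visible_indices.max().item() == target_frames * frame_tokens - 1   # 16383
```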


### LMM Probe Results

We train on a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT, and the training pipeline proceeds directly to Stage 2 fine-tuning. We adopt a streamlined native-resolution strategy inspired by LLaVA-OneVision: when the input frame resolution matches the model's native input size, the frame is fed to the encoder directly, without tiling or cropping, to evaluate the ViT's native-resolution capability.
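
As a rough illustration of this routing (not the actual training code: the 448×448 native size comes from the Quick Start note above, and the bicubic resize for the non-native case is an assumption):

```python
from PIL import Image

NATIVE_SIZE = (448, 448)  # native input size from the Quick Start note

def prepare_frame(frame: Image.Image) -> Image.Image:
    """Illustrative routing: frames at the native resolution pass through
    unchanged; anything else falls back to a plain resize (assumed here)
    rather than tiling or cropping."""
    if frame.size == NATIVE_SIZE:
        return frame  # fed directly, no tiling or cropping
    return frame.resize(NATIVE_SIZE, Image.Resampling.BICUBIC)
```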

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_dark_fixed.png">
    <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png">
    <img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
  </picture>
</p>

### Attentive Probe Results

Performance comparison of different vision encoders under Attentive Probe evaluation. Each model is evaluated with single-clip input and trained for 10 epochs on 8 action recognition datasets. Results report the average and per-dataset scores for 8-frame and 16-frame configurations.
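
The probe itself is a small trainable head on top of frozen encoder features. A minimal sketch of a typical attentive probe (one learnable query cross-attending to the patch tokens, followed by a linear classifier) is shown below; the layer sizes and names are placeholders, not the exact evaluation code.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Minimal attentive-probe head: a single learnable query pools frozen
    encoder tokens via cross-attention, then a linear layer classifies."""

    def __init__(self, hidden_size: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, hidden_size))
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [B, num_tokens, hidden_size] from the frozen vision encoder
        query = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(query, tokens, tokens)           # [B, 1, hidden_size]
        return self.classifier(self.norm(pooled.squeeze(1)))   # [B, num_classes]
```

During probing, only such a head is typically trained while the encoder stays frozen.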

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_dark.png">
    <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_light.png">
    <img alt="Attentive Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_light.png" width="900" style="max-width: 100%;">
  </picture>
</p>


### Codec Input

> **TODO:** Add codec-style input documentation for temporal saliency-based patch selection.
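
Until that documentation lands, the sketch below gives one rough, hypothetical reading of the idea using the `visible_indices` interface from the Quick Start: score each patch position by its temporal difference to the previous frame and keep only the most salient fraction. It is an illustration of the concept, not the model's actual codec input pipeline.

```python
import torch

def saliency_visible_indices(video: torch.Tensor, frame_tokens: int = 256,
                             keep_ratio: float = 0.25) -> torch.Tensor:
    """Hypothetical codec-style selection: rank patch positions by temporal
    change and keep the top `keep_ratio` fraction as visible token indices.

    video: [B, C, T, H, W], assumed to tile evenly into `frame_tokens` patches per frame.
    """
    b, c, t, h, w = video.shape
    grid = int(frame_tokens ** 0.5)                              # e.g. 16 x 16 patches per frame
    patches = video.reshape(b, c, t, grid, h // grid, grid, w // grid)
    patches = patches.permute(0, 2, 3, 5, 1, 4, 6).reshape(b, t, frame_tokens, -1)
    diff = (patches[:, 1:] - patches[:, :-1]).abs().mean(-1)     # [B, T-1, frame_tokens]
    saliency = torch.cat([diff[:, :1], diff], dim=1)             # reuse first diff for frame 0
    flat = saliency.reshape(b, -1)                               # [B, T * frame_tokens]
    k = int(flat.size(1) * keep_ratio)
    return flat.topk(k, dim=1).indices.sort(dim=1).values        # sorted token indices to keep
```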

---