
Commit 51a7aa3

Merge pull request #16 from EvolvingLMMs-Lab/anxiang_v2
updated
2 parents cb00f9f + 7f8c3af commit 51a7aa3

4 files changed

Lines changed: 100 additions & 29 deletions


README.md

Lines changed: 3 additions & 11 deletions
````diff
@@ -219,23 +219,15 @@ with torch.no_grad():
 
 ### Codec Input
 
-> **TODO:** Add codec-style input documentation for temporal saliency-based patch selection.
+Add codec-style input documentation for temporal saliency-based patch selection.
 
 ---
 
 ## 🚀 Training
 
-### Single Node
+### Single Node & Multi Node
 
-```bash
-torchrun --nproc_per_node 8 -m training.train --help
-```
-
-### Multi Node
-
-For multi-node distributed training, configure your training script according to your cluster setup. See example scripts in the `shells/` directory.
-
----
+Training configurations and hyperparameters will be documented soon. For now, please refer to `--help` for available options.
 
 ## 📊 Evaluation
 
````

docs/datacard.md

Lines changed: 6 additions & 6 deletions
```diff
@@ -9,8 +9,8 @@ This document describes the datasets used for training OneVision Encoder. The tr
 | Category | Total Samples |
 |----------|---------------|
 | **Image** | ~694M |
-| **Video** | ~60M+ |
-| **Total** | ~754M+ |
+| **Video** | ~100M+ |
+| **Total** | ~794M+ |
 
 ---
 
@@ -57,19 +57,19 @@ This document describes the datasets used for training OneVision Encoder. The tr
 
 | Dataset | Samples | Description |
 |---------|---------|-------------|
-| **HowTo100M** | 30M | Instructional videos with narrated activities |
-| **Panda-70M** | 30M | Large-scale video-text dataset with high-quality captions |
+| **HowTo100M** | 50M | Instructional videos with narrated activities |
+| **Panda-70M** | 50M | Large-scale video-text dataset with high-quality captions |
 | **Kinetics-710** | - | Human action recognition benchmark (for evaluation/fine-tuning) |
 | **Something-Something V2 (SSv2)** | - | Fine-grained temporal reasoning benchmark (for evaluation/fine-tuning) |
 
 ### Video Dataset Details
 
-#### HowTo100M (30M samples)
+#### HowTo100M
 - **Source**: YouTube instructional videos
 - **Content**: How-to videos with automatic speech recognition transcripts
 - **Usage**: Learning temporal dynamics and action understanding
 
-#### Panda-70M (30M samples)
+#### Panda-70M
 - **Source**: Curated video-text pairs
 - **Content**: High-quality video clips with detailed captions
 - **Usage**: Video-language alignment pre-training
```
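The revised video total follows from the per-dataset updates in the second hunk, and the overall total from the unchanged image count; a quick arithmetic check (plain Python, using only the numbers shown in the tables above):

```python
# Consistency check of the revised datacard counts (all values taken from the tables above).
howto100m = 50_000_000        # HowTo100M: revised from 30M to 50M
panda70m = 50_000_000         # Panda-70M: revised from 30M to 50M
images = 694_000_000          # image samples (~694M, unchanged)

video_total = howto100m + panda70m      # 100_000_000  -> "~100M+"
overall_total = images + video_total    # 794_000_000  -> "~794M+"
print(f"video ~{video_total / 1e6:.0f}M, overall ~{overall_total / 1e6:.0f}M")
```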

docs/model_card.md

Lines changed: 0 additions & 12 deletions
````diff
@@ -86,18 +86,6 @@ with torch.no_grad():
     outputs = model(video, visible_indices=visible_indices)
 ```
 
-## Training
-
-### Training Data
-
-See [datacard.md](datacard.md) for detailed information about the training datasets.
-
-### Training Procedure & Tips
-
-1. **Pre-training**: Global contrastive learning with 2M concept bank for discriminative embeddings
-2. **Scale-up is the final step** - Maximize model capabilities before scaling, and ensure generalization phenomena emerge
-3. **Avoid direct supervision from existing models** - Indirect usage is preferred over direct distillation, which may limit scaling capabilities
-4. **Progressive training when resources are limited** - Start with low resolution/frame rate, then gradually fine-tune to higher settings
 
 ## Evaluation Results
 
````

onevision_encoder/README.md

Lines changed: 91 additions & 0 deletions
````diff
@@ -0,0 +1,91 @@
+---
+license: apache-2.0
+---
+
+
+### ⚡ Quick Start
+
+> **Note:** This model supports native resolution input. For optimal performance:
+> - **Image**: 448×448 resolution (pre-trained)
+> - **Video**: 224×224 resolution with 256 tokens per frame (pre-trained)
+>
+> Use CLIP preprocessing from the [model repository](https://huggingface.co/lmms-lab/onevision-encoder-large).
+
+```python
+from transformers import AutoModel, AutoImageProcessor
+from PIL import Image
+import torch
+
+# Load model and preprocessor
+model = AutoModel.from_pretrained(
+    "lmms-lab/onevision-encoder-large",
+    trust_remote_code=True,
+    attn_implementation="flash_attention_2"
+).to("cuda").eval()
+
+preprocessor = AutoImageProcessor.from_pretrained(
+    "lmms-lab/onevision-encoder-large",
+    trust_remote_code=True
+)
+
+# Image inference: [B, C, H, W]
+image = Image.open("path/to/your/image.jpg")  # Replace with your image path
+pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda")
+with torch.no_grad():
+    outputs = model(pixel_values)
+# outputs.last_hidden_state: [B, num_patches, hidden_size]
+# outputs.pooler_output: [B, hidden_size]
+
+# Video inference: [B, C, T, H, W] with visible_indices
+num_frames, frame_tokens, target_frames = 16, 256, 64
+# Load video frames and preprocess each frame (replace with your video frame paths)
+frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
+video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
+# Reshape from [T, C, H, W] to [B, C, T, H, W]
+video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")
+
+# Build visible_indices for temporal sampling
+frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()
+visible_indices = (frame_pos.unsqueeze(-1) * frame_tokens + torch.arange(frame_tokens).cuda()).reshape(1, -1)
+# visible_indices example (with 256 tokens per frame):
+# Frame 0 (pos=0): indices [0, 1, 2, ..., 255]
+# Frame 1 (pos=4): indices [1024, 1025, 1026, ..., 1279]
+# Frame 2 (pos=8): indices [2048, 2049, 2050, ..., 2303]
+# ...
+# Frame 15 (pos=63): indices [16128, 16129, ..., 16383]
+
+with torch.no_grad():
+    outputs = model(video, visible_indices=visible_indices)
+```
+
+
+### LMM Probe Results
+
+Training uses a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT, and the pipeline proceeds directly to Stage 2 fine-tuning. We adopt a streamlined native-resolution strategy inspired by LLaVA-OneVision: when the input frame resolution matches the model's native input size, it is fed directly, without tiling or cropping, to evaluate the ViT's native-resolution capability.
+
+<p align="center">
+  <picture>
+    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_dark_fixed.png">
+    <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png">
+    <img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
+  </picture>
+</p>
+
+### Attentive Probe Results
+
+Performance comparison of different vision encoders using Attentive Probe evaluation. Models are evaluated with single-clip input and trained for 10 epochs across 8 action recognition datasets. Results show average performance and per-dataset scores for 8-frame and 16-frame configurations.
+
+<p align="center">
+  <picture>
+    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_dark.png">
+    <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_light.png">
+    <img alt="Attentive Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_light.png" width="900" style="max-width: 100%;">
+  </picture>
+</p>
+
+
+### Codec Input
+
+> **TODO:** Add codec-style input documentation for temporal saliency-based patch selection.
+
+---
````
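The `visible_indices` construction in the quick start above is the core of the video path: the 16 sampled frames are placed on a 64-slot temporal grid, and each selected frame contributes a contiguous block of 256 token indices. A minimal sketch that reproduces just this arithmetic on CPU (only `torch` is assumed; the settings are the ones used above):

```python
import torch

# Reproduce the visible_indices arithmetic from the quick start
# (16 sampled frames, 64-slot temporal grid, 256 tokens per frame).
num_frames, frame_tokens, target_frames = 16, 256, 64

# 16 frame positions spread evenly over the 64-slot grid; linspace keeps both
# endpoints, so grid positions 0 and 63 are always selected.
frame_pos = torch.linspace(0, target_frames - 1, num_frames).long()

# The frame at grid position p covers token indices [p * 256, p * 256 + 255].
visible_indices = (frame_pos.unsqueeze(-1) * frame_tokens
                   + torch.arange(frame_tokens)).reshape(1, -1)

print(visible_indices.shape)             # torch.Size([1, 4096])   (16 x 256)
print(visible_indices[0, :3].tolist())   # [0, 1, 2]               (grid position 0)
print(visible_indices[0, -3:].tolist())  # [16381, 16382, 16383]   (grid position 63)
```

Because both endpoints of the grid are kept, the selected indices span 0 through 16383 (64 × 256 - 1), matching the inline comments in the quick start.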
