---
license: apache-2.0
---


### ⚡ Quick Start

> **Note:** This model supports native-resolution input. For optimal performance:
> - **Image**: 448×448 resolution (pre-trained)
> - **Video**: 224×224 resolution with 256 tokens per frame (pre-trained)
>
> Use the CLIP preprocessing from the [model repository](https://huggingface.co/lmms-lab/onevision-encoder-large).

```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
model = AutoModel.from_pretrained(
    "lmms-lab/onevision-encoder-large",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab/onevision-encoder-large",
    trust_remote_code=True
)

# Image inference: [B, C, H, W]
image = Image.open("path/to/your/image.jpg")  # Replace with your image path
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda")
with torch.no_grad():
    outputs = model(pixel_values)
    # outputs.last_hidden_state: [B, num_patches, hidden_size]
    # outputs.pooler_output: [B, hidden_size]

# Video inference: [B, C, T, H, W] with visible_indices
num_frames, frame_tokens, target_frames = 16, 256, 64
# Load video frames and preprocess each frame (replace with your video frame paths)
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
# Reshape from [T, C, H, W] to [B, C, T, H, W]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")

# Build visible_indices for temporal sampling
frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()
visible_indices = (frame_pos.unsqueeze(-1) * frame_tokens + torch.arange(frame_tokens).cuda()).reshape(1, -1)
# visible_indices example (with 256 tokens per frame):
# Frame 0 (pos=0):   indices [0, 1, 2, ..., 255]
# Frame 1 (pos=4):   indices [1024, 1025, 1026, ..., 1279]
# Frame 2 (pos=8):   indices [2048, 2049, 2050, ..., 2303]
# ...
# Frame 15 (pos=63): indices [16128, 16129, ..., 16383]

with torch.no_grad():
    outputs = model(video, visible_indices=visible_indices)
```
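
As a quick check of the index layout built above (purely illustrative, reusing the variables from the snippet):

```python
# Sanity check (illustrative): 16 visible frames x 256 tokens = 4096 indices,
# spread over a 64-frame x 256-token grid whose last position is 16383.
assert visible_indices.shape == (1, num_frames * frame_tokens)            # (1, 4096)
assert visible_indices.min().item() == 0
assert visible_indices.max().item() == target_frames * frame_tokens - 1   # 16383
```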


### LMM Probe Results

We train on a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT, and the training pipeline proceeds directly to Stage 2 fine-tuning. We adopt a streamlined native-resolution strategy inspired by LLaVA-OneVision: when the input frame resolution matches the model's native input size, the frame is fed to the encoder directly, without tiling or cropping, to evaluate the ViT's native-resolution capability.
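
As a rough illustration of this routing (not the actual training code: the 448×448 native size comes from the Quick Start note above, and the bicubic resize for the non-native case is an assumption):

```python
from PIL import Image

NATIVE_SIZE = (448, 448)  # native input size from the Quick Start note

def prepare_frame(frame: Image.Image) -> Image.Image:
    """Illustrative routing: frames at the native resolution pass through
    unchanged; anything else falls back to a plain resize (assumed here)
    rather than tiling or cropping."""
    if frame.size == NATIVE_SIZE:
        return frame  # fed directly, no tiling or cropping
    return frame.resize(NATIVE_SIZE, Image.Resampling.BICUBIC)
```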

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_dark_fixed.png">
    <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png">
    <img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
  </picture>
</p>

### Attentive Probe Results

Performance comparison of different vision encoders under Attentive Probe evaluation. Each model is evaluated with single-clip input and trained for 10 epochs on 8 action recognition datasets. Results report the average and per-dataset scores for 8-frame and 16-frame configurations.
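
The probe itself is a small trainable head on top of frozen encoder features. A minimal sketch of a typical attentive probe (one learnable query cross-attending to the patch tokens, followed by a linear classifier) is shown below; the layer sizes and names are placeholders, not the exact evaluation code.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Minimal attentive-probe head: a single learnable query pools frozen
    encoder tokens via cross-attention, then a linear layer classifies."""

    def __init__(self, hidden_size: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, hidden_size))
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [B, num_tokens, hidden_size] from the frozen vision encoder
        query = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(query, tokens, tokens)           # [B, 1, hidden_size]
        return self.classifier(self.norm(pooled.squeeze(1)))   # [B, num_classes]
```

During probing, only such a head is typically trained while the encoder stays frozen.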

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_dark.png">
    <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_light.png">
    <img alt="Attentive Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_light.png" width="900" style="max-width: 100%;">
  </picture>
</p>


### Codec Input

> **TODO:** Add codec-style input documentation for temporal saliency-based patch selection.
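
Until that documentation lands, the sketch below gives one rough, hypothetical reading of the idea using the `visible_indices` interface from the Quick Start: score each patch position by its temporal difference to the previous frame and keep only the most salient fraction. It is an illustration of the concept, not the model's actual codec input pipeline.

```python
import torch

def saliency_visible_indices(video: torch.Tensor, frame_tokens: int = 256,
                             keep_ratio: float = 0.25) -> torch.Tensor:
    """Hypothetical codec-style selection: rank patch positions by temporal
    change and keep the top `keep_ratio` fraction as visible token indices.

    video: [B, C, T, H, W], assumed to tile evenly into `frame_tokens` patches per frame.
    """
    b, c, t, h, w = video.shape
    grid = int(frame_tokens ** 0.5)                              # e.g. 16 x 16 patches per frame
    patches = video.reshape(b, c, t, grid, h // grid, grid, w // grid)
    patches = patches.permute(0, 2, 3, 5, 1, 4, 6).reshape(b, t, frame_tokens, -1)
    diff = (patches[:, 1:] - patches[:, :-1]).abs().mean(-1)     # [B, T-1, frame_tokens]
    saliency = torch.cat([diff[:, :1], diff], dim=1)             # reuse first diff for frame 0
    flat = saliency.reshape(b, -1)                               # [B, T * frame_tokens]
    k = int(flat.size(1) * keep_ratio)
    return flat.topk(k, dim=1).indices.sort(dim=1).values        # sorted token indices to keep
```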

---