## 📖 Table of Contents

- [Introduction](#-introduction)
- [Codec Style Patch Selection](#-codec-style-patch-selection)
- [Setup](#-setup)
- [Quick Start](#-quick-start)
- [Training](#-training)

---

## 🎯 Codec Style Patch Selection

OneVision Encoder implements a codec-inspired patch selection mechanism that identifies and processes only the most informative patches from video frames. Inspired by HEVC (High Efficiency Video Coding), this approach enables efficient video understanding by focusing computation on temporally salient regions.

### Implementation in `llava_next`

The codec style patch selection is implemented across several key components in the [`llava_next`](llava_next) directory:

#### 1. **Patch Selection Pipeline** ([`Compressed_Video_Reader/tool/`](llava_next/Compressed_Video_Reader/tool/))

- **Stage 1** ([`stage1.py`](llava_next/Compressed_Video_Reader/tool/stage1.py)): extracts codec information from videos
  - Computes fused Motion Vector (MV) and residual energy per frame
  - Performs global top-k selection over temporal-spatial patches
  - Outputs `visidx_thw.npy` containing the selected patch indices

- **Stage 2** ([`stage2.py`](llava_next/Compressed_Video_Reader/tool/stage2.py)): packs the selected patches into the training format
  - Generates mosaic images from the selected patches
  - Creates `positions_thw.npy` files with [t, h, w] coordinates for each patch
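The two-stage flow above can be sketched roughly as follows. This is a minimal illustration of global top-k selection over fused per-patch energies; the fusion weight, keep ratio, and patch-grid size are illustrative assumptions, not the values used by `stage1.py`, which derives its energies from the actual HEVC bitstream.

```python
import numpy as np

def select_patches(mv_energy, residual_energy, keep_ratio=0.25, alpha=0.5):
    """Global top-k selection over temporal-spatial patches.

    mv_energy, residual_energy: arrays of shape (T, H, W) holding per-patch
    motion-vector and residual energy (illustrative inputs).
    Returns a (k, 3) array of [t, h, w] indices, analogous to visidx_thw.npy.
    """
    fused = alpha * mv_energy + (1.0 - alpha) * residual_energy  # fuse the two signals
    k = int(keep_ratio * fused.size)                             # one global budget
    flat_idx = np.argpartition(fused.ravel(), -k)[-k:]           # top-k over all frames
    t, h, w = np.unravel_index(flat_idx, fused.shape)            # back to (t, h, w)
    return np.stack([t, h, w], axis=1)

# Example: 8 frames on a 14x14 patch grid, keeping the top 25% of patches
rng = np.random.default_rng(0)
idx = select_patches(rng.random((8, 14, 14)), rng.random((8, 14, 14)))
print(idx.shape)  # (392, 3)
```

Because the top-k is global rather than per-frame, busy frames contribute more patches than static ones, which is what concentrates compute on temporally salient regions.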
#### 2. **Training Integration** ([`llava/train/train.py`](llava_next/llava/train/train.py))

The training pipeline loads the codec patch positions (lines 1267-1268):

```python
if "positions_thw" in sources[0]:
    patch_positions = torch.tensor(np.load(sources[0]["positions_thw"])).unsqueeze(0)
```
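As a small illustration of the data shape involved (assuming `positions_thw.npy` stores an `(N, 3)` integer array of [t, h, w] coordinates, as produced by the pipeline above), the `unsqueeze(0)` call adds a leading batch dimension; the NumPy sketch below mirrors that step:

```python
import numpy as np

# Hypothetical (N, 3) array of [t, h, w] patch coordinates, standing in for
# the contents of a positions_thw.npy file.
positions = np.array([[0, 2, 5], [0, 3, 5], [1, 2, 6]], dtype=np.int64)

# Equivalent of torch.tensor(positions).unsqueeze(0): add a batch dimension
# so the model receives shape (1, N, 3).
patch_positions = np.expand_dims(positions, axis=0)
print(patch_positions.shape)  # (1, 3, 3)
```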
#### 3. **Model Architecture** ([`llava/model/llava_arch.py`](llava_next/llava/model/llava_arch.py))

The model passes the patch positions to the vision encoder (line 199):

```python
def encode_images(self, images, grid_thw=None, patch_positions=None):
    ...
    image_features = vision_tower(images, patch_positions=patch_positions)
```
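One common way a vision tower can consume such [t, h, w] coordinates is a factorized 3D position embedding, splitting the embedding dimension across the three axes. The sketch below is a hypothetical illustration of that design (sinusoidal encoding with an arbitrary 1/4, 3/8, 3/8 split), not the repository's actual implementation:

```python
import numpy as np

def sincos_1d(pos, dim):
    """Standard 1-D sinusoidal encoding for integer positions (dim must be even)."""
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = pos[:, None] * freqs[None, :]                           # (N, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (N, dim)

def pos_embed_3d(patch_positions, dim=768):
    """Factorize dim across the t, h, w axes (illustrative split, not the repo's)."""
    t, h, w = patch_positions[:, 0], patch_positions[:, 1], patch_positions[:, 2]
    d_t, d_h = dim // 4, 3 * dim // 8
    d_w = dim - d_t - d_h
    return np.concatenate(
        [sincos_1d(t, d_t), sincos_1d(h, d_h), sincos_1d(w, d_w)], axis=1
    )  # (N, dim) -- added to the selected patches' token embeddings

pos = np.array([[0, 2, 5], [3, 7, 1]])
print(pos_embed_3d(pos).shape)  # (2, 768)
```

Whatever the concrete encoding, the key point is that sparse, non-contiguous patches keep their original spatiotemporal coordinates, so attention can still reason about where and when each patch occurred.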
### How It Works

1. **Temporal Saliency Detection**: analyzes all frames to identify regions with motion, appearance variation, and semantic change
2. **Selective Patch Extraction**: extracts only the salient patches in zigzag order, achieving 75-98% compression
3. **3D Position Encoding**: uses [t, h, w] coordinates to preserve spatiotemporal relationships
4. **Efficient Processing**: processes many frames sparsely instead of a few frames densely
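The zigzag traversal in step 2 is the same diagonal scan order that JPEG and HEVC use for coefficient scanning. A minimal sketch for an H×W patch grid (an illustrative helper, not the repository's exact routine):

```python
def zigzag_order(h, w):
    """Return (row, col) pairs of an h x w grid in diagonal zigzag scan order,
    as used by JPEG/HEVC coefficient scanning (illustrative helper)."""
    order = []
    for s in range(h + w - 1):                  # walk each anti-diagonal
        diag = [(r, s - r) for r in range(h) if 0 <= s - r < w]
        if s % 2 == 0:
            diag.reverse()                      # alternate direction per diagonal
        order.extend(diag)
    return order

print(zigzag_order(3, 3))
# [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (1, 2), (2, 1), (2, 2)]
```

Scanning in this order keeps spatially adjacent patches close together in the packed sequence, which helps the mosaic images produced by Stage 2 stay locally coherent.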
For detailed usage instructions, see the [LLaVA-Next README](llava_next/README.md).

---
### LMM Probe Results

We train the model on a mixed dataset comprising 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT, proceeding directly to Stage-2 fine-tuning. Following a streamlined native-resolution strategy inspired by LLaVA-OneVision, input frames that match the model's native resolution are fed directly into the network without tiling or cropping, allowing us to fully evaluate the ViT's native-resolution modeling capability.