|
29 | 29 | ## 📖 Table of Contents |
30 | 30 |
|
31 | 31 | - [Introduction](#-introduction) |
32 | | -- [Codec Style Patch Selection](#-codec-style-patch-selection) |
33 | 32 | - [Setup](#-setup) |
34 | 33 | - [Quick Start](#-quick-start) |
35 | 34 | - [Training](#-training) |
36 | 35 | - [Evaluation](#-evaluation) |
| 36 | +- [Codec Style Patch Selection](#-codec-style-patch-selection) |
37 | 37 | - [Contributors](#-contributors) |
38 | 38 | - [License](#-license) |
39 | 39 | - [Documentation](#-documentation) |
@@ -104,59 +104,6 @@ Standard contrastive learning methods (e.g., CLIP) are fundamentally constrained |
104 | 104 |
|
105 | 105 | --- |
106 | 106 |
|
107 | | -## 🎬 Codec Style Patch Selection |
108 | | - |
109 | | -OneVision Encoder implements a codec-inspired patch selection mechanism that intelligently identifies and processes only the most informative patches from video frames. This approach is inspired by HEVC (High-Efficiency Video Coding) and enables efficient video understanding by focusing computation on temporally salient regions. |
110 | | - |
111 | | -### Implementation in `llava_next` |
112 | | - |
113 | | -The codec style patch selection is implemented across several key components in the [`llava_next`](llava_next) directory: |
114 | | - |
115 | | -### 1. Patch Selection Pipeline |
116 | | - |
117 | | -Location: [`Compressed_Video_Reader/tool/`](llava_next/Compressed_Video_Reader/tool/) |
118 | | - |
119 | | -- **Stage 1** ([`stage1.py`](llava_next/Compressed_Video_Reader/tool/stage1.py)): Extracts codec information from videos |
120 | | - - Computes fused Motion Vector (MV) and Residual energy per frame |
121 | | - - Performs global top-k selection over temporal-spatial patches |
122 | | - - Outputs `visidx_thw.npy` containing selected patch indices |
123 | | - |
124 | | -- **Stage 2** ([`stage2.py`](llava_next/Compressed_Video_Reader/tool/stage2.py)): Packs selected patches into training format |
125 | | - - Generates mosaic images from selected patches |
126 | | - - Creates `positions_thw.npy` files with [t, h, w] coordinates for each patch |
127 | | - |
128 | | -### 2. Training Integration |
129 | | - |
130 | | -Location: [`llava/train/train.py`](llava_next/llava/train/train.py) |
131 | | - |
132 | | -The training pipeline loads codec patch positions (lines 1267-1268): |
133 | | -```python |
134 | | -if "positions_thw" in sources[0]: |
135 | | - patch_positions = torch.tensor(np.load(sources[0]["positions_thw"])).unsqueeze(0) |
136 | | -``` |
137 | | - |
138 | | -### 3. Model Architecture |
139 | | - |
140 | | -Location: [`llava/model/llava_arch.py`](llava_next/llava/model/llava_arch.py) |
141 | | - |
142 | | -The model passes patch positions to the vision encoder (line 199): |
143 | | -```python |
144 | | -def encode_images(self, images, grid_thw=None, patch_positions=None): |
145 | | - ... |
146 | | - image_features = vision_tower(images, patch_positions=patch_positions) |
147 | | -``` |
148 | | - |
149 | | -### How It Works |
150 | | - |
151 | | -1. **Temporal Saliency Detection**: Analyzes all frames to identify regions with motion, appearance variations, and semantic changes |
152 | | -2. **Selective Patch Extraction**: Extracts only salient patches in a zigzag order, achieving 75-98% compression |
153 | | -3. **3D Position Encoding**: Uses [t, h, w] coordinates to maintain spatiotemporal relationships |
154 | | -4. **Efficient Processing**: Processes many frames sparsely instead of few frames densely |
155 | | - |
156 | | -For detailed usage instructions, see the [LLaVA-Next README](llava_next/README.md). |
157 | | - |
158 | | ---- |
159 | | - |
160 | 107 | ### LMM Probe Results |
161 | 108 |
|
162 | 109 | We train the model on a mixed dataset comprising 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT, proceeding directly to Stage-2 fine-tuning. Following a streamlined native-resolution strategy inspired by LLaVA-OneVision, input frames that match the model’s native resolution are fed directly into the network without tiling or cropping, allowing us to fully evaluate the ViT’s native-resolution modeling capability. |
@@ -514,6 +461,22 @@ bash shells_eval_ap/eval_ov_encoder_large_2kpatches_codec.sh |
514 | 461 |
|
515 | 462 | </details> |
516 | 463 |
|
| 464 | +--- |
| 465 | + |
| 466 | +## 🎬 Codec Style Patch Selection |
| 467 | + |
| 468 | +The codec-style patch selection mechanism, inspired by HEVC (High-Efficiency Video Coding), identifies and processes only the most informative patches from video frames, focusing computation on temporally salient regions. |
| 469 | + |
| 470 | +**Implementation in [`llava_next`](llava_next):** |
| 471 | + |
| 472 | +- **Pipeline**: [`Compressed_Video_Reader/tool/`](llava_next/Compressed_Video_Reader/tool/) - Stage 1 extracts codec information (fused Motion Vector and Residual energy) and performs a global top-k patch selection; Stage 2 packs the selected patches into mosaics with [t, h, w] position coordinates |
| 473 | +- **Training**: [`llava/train/train.py`](llava_next/llava/train/train.py) - Loads `positions_thw.npy` patch positions |
| 474 | +- **Model**: [`llava/model/llava_arch.py`](llava_next/llava/model/llava_arch.py) - Passes positions to vision encoder |
| 475 | + |
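The Stage 1 step above (fusing Motion Vector and Residual energy, then taking a global top-k over all temporal-spatial patches) can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the fusion weight `alpha`, the `keep_ratio`, and the array shapes are assumptions made for the example.

```python
import numpy as np

def select_patches(mv_energy, res_energy, keep_ratio=0.25, alpha=0.5):
    """Fuse per-patch Motion Vector and Residual energy, then take a
    global top-k over all temporal-spatial patches.

    mv_energy, res_energy: float arrays of shape (T, H, W), one saliency
    score per patch. Returns a (k, 3) int64 array of [t, h, w] indices,
    the layout used by positions_thw.npy.
    """
    energy = alpha * mv_energy + (1.0 - alpha) * res_energy  # fused saliency map
    k = max(1, int(keep_ratio * energy.size))                # global budget, e.g. keep 25%
    flat_idx = np.argpartition(energy.ravel(), -k)[-k:]      # indices of the k largest scores
    t, h, w = np.unravel_index(flat_idx, energy.shape)       # back to (t, h, w) coordinates
    return np.stack([t, h, w], axis=1).astype(np.int64)

# Example: 8 frames of a 4x4 patch grid, keeping 25% => 32 patches survive
rng = np.random.default_rng(0)
positions = select_patches(rng.random((8, 4, 4)), rng.random((8, 4, 4)))
print(positions.shape)  # (32, 3)
```

Keeping the selection global (rather than per-frame) lets busy frames contribute more patches than static ones, which is what makes the sparse many-frame regime work.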
| 476 | +For detailed usage, see the [LLaVA-Next README](llava_next/README.md). |
| 477 | + |
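On the training side, the loader follows the `train.py` pattern (`torch.tensor(np.load(...)).unsqueeze(0)`). The dependency-free sketch below uses NumPy's equivalent batch-dimension insert; the temporary file path is hypothetical and stands in for the `positions_thw` entry of a real sample.

```python
import os
import tempfile

import numpy as np

# Write a toy positions file the way Stage 2 would: int64 rows of [t, h, w]
positions = np.array([[0, 1, 2], [3, 0, 1]], dtype=np.int64)
path = os.path.join(tempfile.mkdtemp(), "positions_thw.npy")
np.save(path, positions)

# Training-side load; [None, ...] adds the batch dimension that
# torch.tensor(...).unsqueeze(0) adds in train.py
loaded = np.load(path)[None, ...]
print(loaded.shape)  # (1, 2, 3)
```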
| 478 | +--- |
| 479 | + |
517 | 480 | ## 👥 Contributors |
518 | 481 |
|
519 | 482 | <!-- Add contributor list here --> |
|