Commit fe4365d

Copilot and anxiangsir committed
Add codec style patch selection section to README with implementation details
Co-authored-by: anxiangsir <31175974+anxiangsir@users.noreply.github.com>

1 parent 3ec3fea commit fe4365d

1 file changed: 48 additions & 0 deletions

File tree

README.md

@@ -29,6 +29,7 @@
 ## 📖 Table of Contents

 - [Introduction](#-introduction)
+- [Codec Style Patch Selection](#-codec-style-patch-selection)
 - [Setup](#-setup)
 - [Quick Start](#-quick-start)
 - [Training](#-training)
@@ -103,6 +104,53 @@ Standard contrastive learning methods (e.g., CLIP) are fundamentally constrained

---

## 🎯 Codec Style Patch Selection

OneVision Encoder implements a codec-inspired patch selection mechanism that identifies and processes only the most informative patches from video frames. Inspired by HEVC (High Efficiency Video Coding), this approach enables efficient video understanding by concentrating computation on temporally salient regions.

### Implementation in `llava_next`

The codec style patch selection is implemented across several key components in the [`llava_next`](llava_next) directory:

#### 1. **Patch Selection Pipeline** ([`Compressed_Video_Reader/tool/`](llava_next/Compressed_Video_Reader/tool/))

- **Stage 1** ([`stage1.py`](llava_next/Compressed_Video_Reader/tool/stage1.py)): Extracts codec information from videos
  - Computes fused Motion Vector (MV) and Residual energy per frame
  - Performs global top-k selection over temporal-spatial patches
  - Outputs `visidx_thw.npy` containing the selected patch indices

- **Stage 2** ([`stage2.py`](llava_next/Compressed_Video_Reader/tool/stage2.py)): Packs selected patches into training format
  - Generates mosaic images from the selected patches
  - Creates `positions_thw.npy` files with [t, h, w] coordinates for each patch

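The core of Stage 1 can be sketched as a global top-k over a fused per-patch energy map (a minimal NumPy sketch; `keep_ratio`, `alpha`, and the patch-grid shape are illustrative assumptions, not the repository's actual parameters):

```python
import numpy as np

def select_patches(mv_energy, res_energy, keep_ratio=0.1, alpha=0.5):
    """Globally select the top-k most salient (t, h, w) patches.

    mv_energy, res_energy: arrays of shape (T, H, W) holding per-patch
    motion-vector and residual energy. Returns [t, h, w] index triples.
    """
    energy = alpha * mv_energy + (1.0 - alpha) * res_energy  # fused saliency
    flat = energy.ravel()
    k = max(1, int(keep_ratio * flat.size))
    top = np.argpartition(flat, -k)[-k:]        # global top-k, unordered
    top = top[np.argsort(-flat[top])]           # sort by descending energy
    # Convert flat indices back to (t, h, w) coordinates.
    thw = np.stack(np.unravel_index(top, energy.shape), axis=1)
    return thw  # shape (k, 3), like the contents of visidx_thw.npy

# Example: 8 frames on a 14x14 patch grid, keeping 5% of patches
rng = np.random.default_rng(0)
idx = select_patches(rng.random((8, 14, 14)), rng.random((8, 14, 14)),
                     keep_ratio=0.05)
print(idx.shape)  # (78, 3)
```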

#### 2. **Training Integration** ([`llava/train/train.py`](llava_next/llava/train/train.py))

The training pipeline loads the codec patch positions (lines 1267-1268):

```python
if "positions_thw" in sources[0]:
    patch_positions = torch.tensor(np.load(sources[0]["positions_thw"])).unsqueeze(0)
```
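The file contract between Stage 2 and this loader can be illustrated end to end (a toy example; the three [t, h, w] triples are made up, and only the `(N, 3)` layout and the `unsqueeze(0)` batching follow the description above):

```python
import numpy as np
import torch

# Stage 2 writes one [t, h, w] triple per selected patch.
positions = np.array([[0, 2, 5], [0, 3, 5], [3, 7, 1]], dtype=np.int64)
np.save("positions_thw.npy", positions)

# The training pipeline reloads the triples and adds a batch dimension.
patch_positions = torch.tensor(np.load("positions_thw.npy")).unsqueeze(0)
print(patch_positions.shape)  # torch.Size([1, 3, 3])
```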

#### 3. **Model Architecture** ([`llava/model/llava_arch.py`](llava_next/llava/model/llava_arch.py))

The model passes patch positions to the vision encoder (line 199):

```python
def encode_images(self, images, grid_thw=None, patch_positions=None):
    ...
    image_features = vision_tower(images, patch_positions=patch_positions)
```
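One way a vision tower can consume such [t, h, w] triples is a factorized position-embedding lookup summed over the three axes (a minimal sketch under assumed grid sizes and embedding width; the encoder's actual scheme may differ):

```python
import torch
import torch.nn as nn

class Factorized3DPositionEmbedding(nn.Module):
    """Sum of separate t/h/w embeddings for sparse patch positions.

    max_t/max_h/max_w are assumed maximum grid sizes, chosen for
    illustration only.
    """
    def __init__(self, dim=768, max_t=64, max_h=32, max_w=32):
        super().__init__()
        self.t_emb = nn.Embedding(max_t, dim)
        self.h_emb = nn.Embedding(max_h, dim)
        self.w_emb = nn.Embedding(max_w, dim)

    def forward(self, patch_positions):
        # patch_positions: (B, N, 3) integer [t, h, w] triples
        t, h, w = patch_positions.unbind(-1)
        return self.t_emb(t) + self.h_emb(h) + self.w_emb(w)  # (B, N, dim)

pos = torch.tensor([[[0, 2, 5], [3, 7, 1]]])  # one batch, two patches
emb = Factorized3DPositionEmbedding(dim=16)(pos)
print(emb.shape)  # torch.Size([1, 2, 16])
```

Because each selected patch carries its own coordinates, the embedding stays correct even though the patches are a sparse, non-contiguous subset of the original video grid.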

### How It Works

1. **Temporal Saliency Detection**: Analyzes all frames to identify regions with motion, appearance variations, and semantic changes
2. **Selective Patch Extraction**: Extracts only the salient patches in a zigzag order, achieving 75-98% compression
3. **3D Position Encoding**: Uses [t, h, w] coordinates to maintain spatiotemporal relationships
4. **Efficient Processing**: Processes many frames sparsely instead of a few frames densely

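The zigzag traversal in step 2 can be sketched for a single frame's patch grid (a generic zigzag scan in the JPEG/HEVC spirit; the repository's exact traversal may differ):

```python
def zigzag_order(h, w):
    """Return (row, col) pairs of an h x w grid in zigzag scan order."""
    order = []
    for s in range(h + w - 1):  # walk the anti-diagonals
        diag = [(i, s - i) for i in range(h) if 0 <= s - i < w]
        # Alternate direction on every diagonal.
        order.extend(diag if s % 2 else diag[::-1])
    return order

print(zigzag_order(3, 3))
# [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (1, 2), (2, 1), (2, 2)]
```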
For detailed usage instructions, see the [LLaVA-Next README](llava_next/README.md).

---
### LMM Probe Results
We train the model on a mixed dataset comprising 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT, proceeding directly to Stage-2 fine-tuning. Following a streamlined native-resolution strategy inspired by LLaVA-OneVision, input frames that match the model’s native resolution are fed directly into the network without tiling or cropping, allowing us to fully evaluate the ViT’s native-resolution modeling capability.
