
Commit 5cca3e1

Copilot and anxiangsir committed
Shorten and move codec patch selection section down
Co-authored-by: anxiangsir <31175974+anxiangsir@users.noreply.github.com>
1 parent 6b97e12 commit 5cca3e1

File tree: 1 file changed, +17 −54 lines


README.md

Lines changed: 17 additions & 54 deletions
@@ -29,11 +29,11 @@
 ## 📖 Table of Contents
 
 - [Introduction](#-introduction)
-- [Codec Style Patch Selection](#-codec-style-patch-selection)
 - [Setup](#-setup)
 - [Quick Start](#-quick-start)
 - [Training](#-training)
 - [Evaluation](#-evaluation)
+- [Codec Style Patch Selection](#-codec-style-patch-selection)
 - [Contributors](#-contributors)
 - [License](#-license)
 - [Documentation](#-documentation)
@@ -104,59 +104,6 @@ Standard contrastive learning methods (e.g., CLIP) are fundamentally constrained
 
 ---
 
-## 🎬 Codec Style Patch Selection
-
-OneVision Encoder implements a codec-inspired patch selection mechanism that intelligently identifies and processes only the most informative patches from video frames. This approach is inspired by HEVC (High-Efficiency Video Coding) and enables efficient video understanding by focusing computation on temporally salient regions.
-
-### Implementation in `llava_next`
-
-The codec style patch selection is implemented across several key components in the [`llava_next`](llava_next) directory:
-
-### 1. Patch Selection Pipeline
-
-Location: [`Compressed_Video_Reader/tool/`](llava_next/Compressed_Video_Reader/tool/)
-
-- **Stage 1** ([`stage1.py`](llava_next/Compressed_Video_Reader/tool/stage1.py)): Extracts codec information from videos
-  - Computes fused Motion Vector (MV) and Residual energy per frame
-  - Performs global top-k selection over temporal-spatial patches
-  - Outputs `visidx_thw.npy` containing selected patch indices
-
-- **Stage 2** ([`stage2.py`](llava_next/Compressed_Video_Reader/tool/stage2.py)): Packs selected patches into training format
-  - Generates mosaic images from selected patches
-  - Creates `positions_thw.npy` files with [t, h, w] coordinates for each patch
-
-### 2. Training Integration
-
-Location: [`llava/train/train.py`](llava_next/llava/train/train.py)
-
-The training pipeline loads codec patch positions (lines 1267-1268):
-```python
-if "positions_thw" in sources[0]:
-    patch_positions = torch.tensor(np.load(sources[0]["positions_thw"])).unsqueeze(0)
-```
-
-### 3. Model Architecture
-
-Location: [`llava/model/llava_arch.py`](llava_next/llava/model/llava_arch.py)
-
-The model passes patch positions to the vision encoder (line 199):
-```python
-def encode_images(self, images, grid_thw=None, patch_positions=None):
-    ...
-    image_features = vision_tower(images, patch_positions=patch_positions)
-```
-
-### How It Works
-
-1. **Temporal Saliency Detection**: Analyzes all frames to identify regions with motion, appearance variations, and semantic changes
-2. **Selective Patch Extraction**: Extracts only salient patches in a zigzag order, achieving 75-98% compression
-3. **3D Position Encoding**: Uses [t, h, w] coordinates to maintain spatiotemporal relationships
-4. **Efficient Processing**: Processes many frames sparsely instead of few frames densely
-
-For detailed usage instructions, see the [LLaVA-Next README](llava_next/README.md).
-
----
-
 ### LMM Probe Results
 
 We train the model on a mixed dataset comprising 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT, proceeding directly to Stage-2 fine-tuning. Following a streamlined native-resolution strategy inspired by LLaVA-OneVision, input frames that match the model’s native resolution are fed directly into the network without tiling or cropping, allowing us to fully evaluate the ViT’s native-resolution modeling capability.
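The Stage-1 flow removed above (fused Motion Vector/Residual energy, then a global top-k over temporal-spatial patches, saved as `visidx_thw.npy`) can be sketched as follows. This is a minimal illustration rather than the actual `stage1.py` code: the function name, the `keep_ratio` parameter, and the assumption that codec energies have already been fused into a per-patch `[T, H, W]` array are all ours.

```python
import numpy as np

def select_topk_patches(energy_thw: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """Global top-k selection over temporal-spatial patches.

    energy_thw: fused MV + residual energy per patch, shape [T, H, W]
                (assumed precomputed; the real pipeline derives it from
                codec information).
    Returns [k, 3] integer [t, h, w] indices of the kept patches,
    analogous in spirit to the contents of visidx_thw.npy.
    """
    t, h, w = energy_thw.shape
    k = max(1, int(keep_ratio * t * h * w))
    flat = energy_thw.reshape(-1)
    # Indices of the k most energetic patches across all frames at once
    top = np.argpartition(flat, -k)[-k:]
    # Sorting flat indices of a C-ordered array yields (t, h, w) order,
    # so downstream packing is deterministic
    top = np.sort(top)
    return np.stack(np.unravel_index(top, (t, h, w)), axis=1)

# Example: 8 frames on a 4x4 patch grid, keep 25% of patches
rng = np.random.default_rng(0)
energy = rng.random((8, 4, 4))
positions = select_topk_patches(energy, keep_ratio=0.25)
print(positions.shape)  # (32, 3)
```

Because selection is global rather than per-frame, frames with little motion contribute few patches while busy frames contribute many, which is the behavior the section attributes to the codec-style selector.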
@@ -514,6 +461,22 @@ bash shells_eval_ap/eval_ov_encoder_large_2kpatches_codec.sh
 
 </details>
 
+---
+
+## 🎬 Codec Style Patch Selection
+
+The codec-inspired patch selection mechanism identifies and processes only the most informative patches from video frames, inspired by HEVC video coding.
+
+**Implementation in [`llava_next`](llava_next):**
+
+- **Pipeline**: [`Compressed_Video_Reader/tool/`](llava_next/Compressed_Video_Reader/tool/) - Stage 1 extracts codec info (MV/Residual energy), Stage 2 packs patches with position coordinates
+- **Training**: [`llava/train/train.py`](llava_next/llava/train/train.py) - Loads `positions_thw.npy` patch positions
+- **Model**: [`llava/model/llava_arch.py`](llava_next/llava/model/llava_arch.py) - Passes positions to vision encoder
+
+For detailed usage, see the [LLaVA-Next README](llava_next/README.md).
+
+---
+
 ## 👥 Contributors
 
 <!-- Add contributor list here -->
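The `[t, h, w]` coordinates that Stage 2 stores in `positions_thw.npy` exist so that sparsely selected patches keep their spatiotemporal layout after the dense grid is discarded. A minimal sketch of that idea, mapping each coordinate axis to sinusoidal features: the function name, the per-axis channel split, and the frequency schedule are illustrative assumptions, not the repository's actual position encoder.

```python
import numpy as np

def positions_to_3d_encoding(patch_positions: np.ndarray, dim: int = 96) -> np.ndarray:
    """Illustrative sinusoidal 3D position encoding for selected patches.

    patch_positions: [N, 3] integer [t, h, w] coordinates, e.g. the array
    stored in positions_thw.npy. Each of the three axes gets dim // 3
    channels (assumed split), so every kept patch carries its own
    temporal and spatial location.
    """
    d_axis = dim // 3
    # Transformer-style frequency schedule, applied per axis
    freqs = np.exp(-np.arange(0, d_axis, 2) * (np.log(10000.0) / d_axis))
    parts = []
    for axis in range(3):  # t, h, w
        angles = patch_positions[:, axis:axis + 1].astype(np.float64) * freqs
        parts.append(np.concatenate([np.sin(angles), np.cos(angles)], axis=1))
    return np.concatenate(parts, axis=1)  # [N, dim]

positions = np.array([[0, 1, 2], [3, 0, 1]])  # two patches with [t, h, w] coords
enc = positions_to_3d_encoding(positions, dim=96)
print(enc.shape)  # (2, 96)
```

An encoding of this shape can be added to (or concatenated with) the patch embeddings before they reach the vision tower, which is the hand-off the `patch_positions` argument in `encode_images` exists for.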
