Commit 6b97e12

Copilot and anxiangsir committed
Improve codec patch selection section structure and emoji
Co-authored-by: anxiangsir <31175974+anxiangsir@users.noreply.github.com>
1 parent fe4365d

1 file changed: README.md (10 additions, 4 deletions)
````diff
@@ -104,15 +104,17 @@ Standard contrastive learning methods (e.g., CLIP) are fundamentally constrained
 
 ---
 
-## 🎯 Codec Style Patch Selection
+## 🎬 Codec Style Patch Selection
 
 OneVision Encoder implements a codec-inspired patch selection mechanism that intelligently identifies and processes only the most informative patches from video frames. This approach is inspired by HEVC (High-Efficiency Video Coding) and enables efficient video understanding by focusing computation on temporally salient regions.
````
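To illustrate the idea behind this paragraph (not the repository's actual implementation — all names and the fusion weight below are assumptions), codec-side signals such as per-patch motion-vector magnitude and residual energy can be fused into a saliency score, keeping only the top-k patches per frame:

```python
import numpy as np

def select_salient_patches(mv_energy, residual_energy, k, alpha=0.5):
    """Toy sketch: fuse motion-vector and residual energy maps
    (shape [T, H, W], one score per patch) and keep the k
    highest-scoring patches in each frame."""
    score = alpha * mv_energy + (1 - alpha) * residual_energy  # [T, H, W]
    t, h, w = score.shape
    flat = score.reshape(t, -1)
    # indices of the k highest-scoring patches per frame (argsort is ascending)
    top = np.argsort(flat, axis=1)[:, -k:]
    # convert flat indices back to (t, h, w) coordinates
    coords = [(ti, idx // w, idx % w) for ti in range(t) for idx in top[ti]]
    return np.array(coords)  # one [t, h, w] row per kept patch

rng = np.random.default_rng(0)
mv = rng.random((2, 4, 4))
res = rng.random((2, 4, 4))
pos = select_salient_patches(mv, res, k=3)
print(pos.shape)  # (6, 3)
```

The output format mirrors the `[t, h, w]` coordinate rows that the pipeline's `positions_thw.npy` files are described as containing.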
````diff
 
 ### Implementation in `llava_next`
 
 The codec style patch selection is implemented across several key components in the [`llava_next`](llava_next) directory:
 
-#### 1. **Patch Selection Pipeline** ([`Compressed_Video_Reader/tool/`](llava_next/Compressed_Video_Reader/tool/))
+### 1. Patch Selection Pipeline
+
+Location: [`Compressed_Video_Reader/tool/`](llava_next/Compressed_Video_Reader/tool/)
 
 - **Stage 1** ([`stage1.py`](llava_next/Compressed_Video_Reader/tool/stage1.py)): Extracts codec information from videos
   - Computes fused Motion Vector (MV) and Residual energy per frame
````
````diff
@@ -123,15 +125,19 @@ The codec style patch selection is implemented across several key components in
   - Generates mosaic images from selected patches
   - Creates `positions_thw.npy` files with [t, h, w] coordinates for each patch
 
````
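A minimal sketch of what writing and reading such a file could look like — the coordinate values here are made up for illustration; only the `[t, h, w]`-rows layout and the `positions_thw.npy` name come from the text above:

```python
import numpy as np

# Hypothetical example: 4 selected patches, one [t, h, w] row each
positions = np.array([
    [0, 2, 5],
    [0, 3, 1],
    [1, 0, 7],
    [1, 4, 4],
], dtype=np.int64)
np.save("positions_thw.npy", positions)

# Round-trip: consumers load the same [N, 3] array back
loaded = np.load("positions_thw.npy")
print(loaded.shape)  # (4, 3)
```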

````diff
-#### 2. **Training Integration** ([`llava/train/train.py`](llava_next/llava/train/train.py))
+### 2. Training Integration
+
+Location: [`llava/train/train.py`](llava_next/llava/train/train.py)
 
 The training pipeline loads codec patch positions (lines 1267-1268):
 
 ```python
 if "positions_thw" in sources[0]:
     patch_positions = torch.tensor(np.load(sources[0]["positions_thw"])).unsqueeze(0)
 ```
````
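In context, the two lines quoted in the diff amount to something like the following; the `sources` entry is a stand-in fabricated for this sketch, and only the `if`/`torch.tensor(np.load(...)).unsqueeze(0)` pattern comes from the source:

```python
import numpy as np
import torch

# Hypothetical training sample pointing at a saved positions file
np.save("positions_thw.npy", np.array([[0, 1, 2], [1, 3, 4]]))
sources = [{"positions_thw": "positions_thw.npy"}]

if "positions_thw" in sources[0]:
    # np.load yields [N, 3]; unsqueeze(0) adds a batch dim -> [1, N, 3]
    patch_positions = torch.tensor(np.load(sources[0]["positions_thw"])).unsqueeze(0)

print(patch_positions.shape)  # torch.Size([1, 2, 3])
```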

````diff
-#### 3. **Model Architecture** ([`llava/model/llava_arch.py`](llava_next/llava/model/llava_arch.py))
+### 3. Model Architecture
+
+Location: [`llava/model/llava_arch.py`](llava_next/llava/model/llava_arch.py)
 
 The model passes patch positions to the vision encoder (line 199):
 
 ```python
````
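The diff is truncated before the line-199 snippet, so here is only a schematic of what "passing patch positions to the vision encoder" could look like; the encoder class, its layers, and the positional-embedding scheme are all hypothetical, not the `llava_arch.py` code:

```python
import torch
import torch.nn as nn

class ToyVisionEncoder(nn.Module):
    """Hypothetical stand-in for a patch-position-aware vision tower."""
    def __init__(self, dim=8):
        super().__init__()
        self.pos_embed = nn.Linear(3, dim)  # embed [t, h, w] coordinates
        self.proj = nn.Linear(dim, dim)

    def forward(self, patch_feats, patch_positions):
        # patch_feats: [B, N, dim]; patch_positions: [B, N, 3]
        pos = self.pos_embed(patch_positions.float())
        return self.proj(patch_feats + pos)

enc = ToyVisionEncoder()
feats = torch.zeros(1, 5, 8)
positions = torch.randint(0, 4, (1, 5, 3))
out = enc(feats, positions)
print(out.shape)  # torch.Size([1, 5, 8])
```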
