| **Intermediate Size** | 4096 |
| **Number of Layers** | 24 |
| **Number of Attention Heads** | 16 |
| **Patch Size** | 14 |
| **Image Resolution** | 448×448 (pre-trained) |
| **Video Resolution** | 224×224 with 256 tokens per frame |
| **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
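
As a quick sanity check, the numbers above pin down the token counts used throughout this section. This is plain-Python arithmetic, not model code:

```python
patch_size = 14

# Image mode: 448×448 pre-training resolution → 32×32 grid of patches
image_tokens = (448 // patch_size) ** 2   # 1024 tokens

# Video mode: 224×224 per frame → 16×16 grid of patches
frame_tokens = (224 // patch_size) ** 2   # 256 tokens per frame, matching the table

assert image_tokens == 1024 and frame_tokens == 256
```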

For single image input, the ViT processes data in the standard 4D tensor format:

```
Input: [B, C, H, W] → e.g., [1, 3, 448, 448]
        ↓
Patch Embedding (Conv2d with kernel=14, stride=14)
        ↓
Flatten: [B, num_patches, hidden_size]
e.g., [1, 1024, 1024] for 448×448 image
        ↓
3D RoPE Position Encoding (T=1, H=32, W=32)
        ↓
Transformer Encoder (24 layers)
        ↓
Output: [B, 1024, 1024]
```
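
A minimal PyTorch sketch of this path, assuming a standard Conv2d patch embedder; the names here (`patch_embed`, `tokens`) are illustrative, not the repository's API:

```python
import torch
import torch.nn as nn

hidden_size, patch_size = 1024, 14

# Non-overlapping 14×14 patches, each projected to hidden_size channels
patch_embed = nn.Conv2d(3, hidden_size, kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, 448, 448)            # [B, C, H, W]
feat = patch_embed(x)                      # [1, 1024, 32, 32]
tokens = feat.flatten(2).transpose(1, 2)   # [B, num_patches, hidden_size]
print(tokens.shape)                        # torch.Size([1, 1024, 1024])
```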

For video input, the pipeline operates on 5D tensors:

```
Input: [B, C, T, H, W] → e.g., [1, 3, 16, 224, 224]
        ↓
Patch Embedding (per-frame Conv2d)
        ↓
Flatten: [B, T × H_patches × W_patches, hidden_size]
e.g., [1, 16 × 16 × 16, 1024] = [1, 4096, 1024]
        ↓
Build visible_indices for temporal mapping
```
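
The per-frame embedding can be sketched by folding time into the batch dimension before the same Conv2d, under the same illustrative assumptions as the image sketch above:

```python
import torch
import torch.nn as nn

B, C, T, H, W = 1, 3, 16, 224, 224
hidden_size, patch_size = 1024, 14
patch_embed = nn.Conv2d(C, hidden_size, kernel_size=patch_size, stride=patch_size)

video = torch.randn(B, C, T, H, W)
frames = video.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)   # [16, 3, 224, 224]
feat = patch_embed(frames)                                      # [16, 1024, 16, 16]
h_p, w_p = feat.shape[-2:]                                      # 16×16 patches per frame
tokens = feat.flatten(2).transpose(1, 2).reshape(B, T * h_p * w_p, hidden_size)
print(tokens.shape)                                             # torch.Size([1, 4096, 1024])
```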

The `visible_indices` tensor maps actual frame positions to a virtual temporal grid:

```python
# Example: 16 frames sampled from a video, mapped to 64 virtual frame positions
num_frames = 16      # Actual number of sampled frames
frame_tokens = 256   # Patches per frame (16×16 for 224×224 with patch_size=14)
target_frames = 64   # Virtual temporal grid size (model's RoPE temporal dimension)

# Map 16 actual frames to positions in the 64-frame virtual grid
```
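
One plausible way to finish that mapping, continuing the variables above (a sketch only; the repository's exact indexing may differ), is to spread the 16 real frames evenly over the 64 virtual slots and expand each slot into its 256 patch positions:

```python
import torch

# Virtual frame slot for each real frame: roughly 0, 4, 8, ..., 63
frame_positions = torch.linspace(0, target_frames - 1, num_frames).round().long()

# Expand each frame slot into the flat indices of its 256 patches
patch_offsets = torch.arange(frame_tokens)                # 0..255 within a frame
visible_indices = (frame_positions[:, None] * frame_tokens
                   + patch_offsets[None, :]).reshape(-1)  # [16 × 256] = [4096]
```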

For codec-style sampling, the helper computes `visible_indices` directly from the video file:

```python
visible_indices = compute_codec_visible_indices(
    video_path,
    K=K_keep,
    mv_compensate="similarity",  # Camera motion compensation
    patch_size=14
)

# Process with the model
outputs = model(video, visible_indices=visible_indices)
```
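
For intuition about what "Top-K salient" means, here is a toy stand-in that ranks patches by temporal-difference energy. The real `compute_codec_visible_indices` reads motion vectors and residuals from the compressed stream, so this mimics only the interface, not the actual algorithm:

```python
import torch

def toy_topk_visible_indices(video: torch.Tensor, K: int, patch_size: int = 14) -> torch.Tensor:
    """Toy saliency selector over a [C, T, H, W] clip: keep the K patches
    whose pixels change the most between consecutive frames."""
    C, T, H, W = video.shape
    diff = (video[:, 1:] - video[:, :-1]).abs()          # [C, T-1, H, W]
    diff = torch.cat([diff[:, :1], diff], dim=1)         # pad so frame 0 gets a score
    energy = diff.sum(0)                                 # [T, H, W]
    # Sum energy inside each non-overlapping patch_size×patch_size patch
    energy = energy.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    energy = energy.sum(dim=(-1, -2)).reshape(-1)        # [T × (H/14) × (W/14)]
    return energy.topk(K).indices.sort().values          # sorted flat patch indices

# e.g., keep the 1024 most dynamic patches out of 16 × 256 = 4096
idx = toy_topk_visible_indices(torch.randn(3, 16, 224, 224), K=1024)
```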

| Mode | Input Shape | visible_indices | Output Shape | Use Case |
|------|-------------|-----------------|--------------|----------|
| **Image** | `[B, 3, H, W]` | All patches | `[B, (H/14)×(W/14), 1024]` | Single image understanding |
| **Video Chunk** | `[B, 3, T, H, W]` | Frame-mapped | `[B, T×(H/14)×(W/14), 1024]` | Uniform temporal sampling |
| **Codec-Style** | `[B, 3, T, H, W]` | Top-K salient | `[B, K, 1024]` | Efficient dense temporal |

### 3D RoPE Position Encoding