
Commit ec6a0cc

Authored by Copilot and anxiangsir
Fix patch_size documentation in model_card.md from 16 to 14 (#27)

* Initial plan

* Fix patch_size documentation in model_card.md from 16 to 14

Co-authored-by: anxiangsir <31175974+anxiangsir@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: anxiangsir <31175974+anxiangsir@users.noreply.github.com>
1 parent 9662d3d commit ec6a0cc

1 file changed: docs/model_card.md (9 additions, 9 deletions)
````diff
@@ -14,7 +14,7 @@
 | **Intermediate Size** | 4096 |
 | **Number of Layers** | 24 |
 | **Number of Attention Heads** | 16 |
-| **Patch Size** | 16 |
+| **Patch Size** | 14 |
 | **Image Resolution** | 448×448 (pre-trained) |
 | **Video Resolution** | 224×224 with 256 tokens per frame |
 | **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
````
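The corrected value is also the one consistent with the table's own resolution rows; a quick arithmetic check (illustrative only, not code from the repo):

```python
# Illustrative check: patch_size=14 makes the table self-consistent.
assert 448 % 14 == 0 and 448 // 14 == 32  # image grid: 32×32 patches at 448×448
assert (224 // 14) ** 2 == 256            # video: 256 tokens per frame, matching the table
assert (224 // 16) ** 2 == 196            # patch_size=16 would contradict the 256-token row
```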
````diff
@@ -42,12 +42,12 @@ For single image input, the ViT processes data in the standard 4D tensor format
 ```
 Input: [B, C, H, W] → e.g., [1, 3, 448, 448]
 
-Patch Embedding (Conv2d with kernel=16, stride=16)
+Patch Embedding (Conv2d with kernel=14, stride=14)
 
 Flatten: [B, num_patches, hidden_size]
-e.g., [1, 784, 1024] for 448×448 image
+e.g., [1, 1024, 1024] for 448×448 image
 
-3D RoPE Position Encoding (T=1, H=28, W=28)
+3D RoPE Position Encoding (T=1, H=32, W=32)
 
 Transformer Encoder (24 layers)
````
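To see where the corrected 1024-patch figure comes from, here is a minimal PyTorch sketch of a kernel=14, stride=14 patch embedding; the module construction and variable names are assumptions for illustration, not the repo's actual code:

```python
import torch
import torch.nn as nn

# Sketch of the patch-embedding step described above (assumed layout).
patch_embed = nn.Conv2d(in_channels=3, out_channels=1024, kernel_size=14, stride=14)
x = torch.randn(1, 3, 448, 448)           # [B, C, H, W]
feat = patch_embed(x)                     # [1, 1024, 32, 32]: 448 / 14 = 32 patches per side
tokens = feat.flatten(2).transpose(1, 2)  # [B, num_patches, hidden_size]
print(tokens.shape)                       # torch.Size([1, 1024, 1024]): 32 × 32 = 1024 patches
```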
````diff
@@ -69,7 +69,7 @@ Input: [B, C, T, H, W] → e.g., [1, 3, 16, 224, 224]
 Patch Embedding (per-frame Conv2d)
 
 Flatten: [B, T × H_patches × W_patches, hidden_size]
-e.g., [1, 16 × 14 × 14, 1024] = [1, 3136, 1024]
+e.g., [1, 16 × 16 × 16, 1024] = [1, 4096, 1024]
 
 Build visible_indices for temporal mapping
````
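The corrected video figure follows from the same arithmetic; a short worked calculation (variable names are hypothetical):

```python
# Token count for a 16-frame 224×224 clip with patch_size=14 (illustrative).
patch_size, frames, side = 14, 16, 224
patches_per_side = side // patch_size     # 224 // 14 = 16
tokens_per_frame = patches_per_side ** 2  # 16 × 16 = 256
total_tokens = frames * tokens_per_frame  # 16 × 256 = 4096
print(total_tokens)                       # 4096, i.e. [1, 4096, 1024] after flattening
```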
````diff
@@ -87,7 +87,7 @@ The `visible_indices` tensor maps actual frame positions to a virtual temporal g
 ```python
 # Example: 16 frames sampled from a video, mapped to 64 virtual frame positions
 num_frames = 16     # Actual number of sampled frames
-frame_tokens = 256  # Patches per frame (16×16 for 256×256 with patch_size=16)
+frame_tokens = 256  # Patches per frame (16×16 for 224×224 with patch_size=14)
 target_frames = 64  # Virtual temporal grid size (model's RoPE temporal dimension)
 
 # Map 16 actual frames to positions in the 64-frame virtual grid
````
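The snippet in this hunk is cut off before the mapping itself; one plausible completion spreads the 16 sampled frames evenly over the 64-slot virtual grid. This is a sketch under that even-spacing assumption, not necessarily the repo's exact scheme:

```python
import torch

num_frames, target_frames, frame_tokens = 16, 64, 256

# Assign each sampled frame an evenly spaced slot in the virtual grid
# (roughly 0, 4, 8, ..., 63), then give token t of frame f the flat index
# slots[f] * frame_tokens + t.
slots = torch.linspace(0, target_frames - 1, num_frames).long()
visible_indices = (slots[:, None] * frame_tokens
                   + torch.arange(frame_tokens)[None, :]).flatten()
print(visible_indices.shape)  # torch.Size([4096]) = 16 frames × 256 tokens
```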
````diff
@@ -162,7 +162,7 @@ visible_indices = compute_codec_visible_indices(
     video_path,
     K=K_keep,
     mv_compensate="similarity",  # Camera motion compensation
-    patch_size=16
+    patch_size=14
 )
 
 # Process with the model
````
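For the codec-style path, patch_size fixes the per-frame token grid that the top-K selection indexes into; the rough sketch below uses a random stand-in for the saliency scores, since the internals of compute_codec_visible_indices are not shown in this diff:

```python
import torch

# With patch_size=14, each 224×224 frame is a 16×16 grid, so codec-style
# selection keeps K indices out of T × 256 candidate patch positions.
T, grid = 16, 224 // 14
saliency = torch.rand(T * grid * grid)  # stand-in scores; real ones come from codec analysis
K_keep = 512
visible_indices = saliency.topk(K_keep).indices.sort().values  # K most salient, in order
print(visible_indices.shape)  # torch.Size([512]): output would be [B, 512, 1024]
```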
````diff
@@ -174,8 +174,8 @@ outputs = model(video, visible_indices=visible_indices)
 
 | Mode | Input Shape | visible_indices | Output Shape | Use Case |
 |------|-------------|-----------------|--------------|----------|
-| **Image** | `[B, 3, H, W]` | All patches | `[B, (H/16)×(W/16), 1024]` | Single image understanding |
-| **Video Chunk** | `[B, 3, T, H, W]` | Frame-mapped | `[B, T×(H/16)×(W/16), 1024]` | Uniform temporal sampling |
+| **Image** | `[B, 3, H, W]` | All patches | `[B, (H/14)×(W/14), 1024]` | Single image understanding |
+| **Video Chunk** | `[B, 3, T, H, W]` | Frame-mapped | `[B, T×(H/14)×(W/14), 1024]` | Uniform temporal sampling |
 | **Codec-Style** | `[B, 3, T, H, W]` | Top-K salient | `[B, K, 1024]` | Efficient dense temporal |
 
 ### 3D RoPE Position Encoding
````
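A hypothetical helper (not part of the repo) that reproduces the table's Output Shape column under patch_size=14:

```python
def output_shape(mode, B, H, W, T=1, K=0, patch=14, hidden=1024):
    """Token-level output shape for each processing mode (illustrative)."""
    per_frame = (H // patch) * (W // patch)
    if mode == "image":
        return (B, per_frame, hidden)
    if mode == "video_chunk":
        return (B, T * per_frame, hidden)
    if mode == "codec":
        return (B, K, hidden)
    raise ValueError(f"unknown mode: {mode}")

print(output_shape("image", 1, 448, 448))               # (1, 1024, 1024)
print(output_shape("video_chunk", 1, 224, 224, T=16))   # (1, 4096, 1024)
print(output_shape("codec", 1, 224, 224, T=16, K=512))  # (1, 512, 1024)
```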
