- **Native Resolution Support**: Supports native-resolution input without tiling or cropping.
- **Flash Attention 2**: Efficient attention implementation for higher throughput and lower memory use.

## Unified Input Processing

OneVision Encoder processes three types of visual input through the same Vision Transformer backbone: images, video chunks (uniform frame sampling), and codec-style sparse patches. The key idea is that every input is converted into a sequence of patch tokens with 3D position encodings, so a single model can handle all of these visual modalities.

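In practice the three modes share one calling convention; only the input shape and the optional `visible_indices` argument change. The sketch below is illustrative only: `model` stands for an already-loaded encoder (the same assumed handle used in the codec-style example later in this section), and `image`, `video`, `frame_indices`, and `salient_indices` are placeholder tensors whose construction is described in the subsections that follow.

```python
# Shared interface across modes (assumed handle `model`; shapes are examples)
image_feats = model(image)                                   # image: [B, 3, H, W]
chunk_feats = model(video, visible_indices=frame_indices)    # video: [B, 3, T, H, W], uniform frames
codec_feats = model(video, visible_indices=salient_indices)  # video: [B, 3, T, H, W], top-K salient patches
```
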
### Image Processing

For single-image input, the ViT processes data in the standard 4D tensor format `[B, C, H, W]`:

```
Input: [B, C, H, W] → e.g., [1, 3, 448, 448]
        ↓
Patch Embedding (Conv2d with kernel=16, stride=16)
        ↓
Flatten: [B, num_patches, hidden_size]
        e.g., [1, 784, 1024] for a 448×448 image
        ↓
3D RoPE Position Encoding (T=1, H=28, W=28)
        ↓
Transformer Encoder (24 layers)
        ↓
Output: [B, num_patches, hidden_size]
```

**Key points:**
- Images are internally treated as single-frame videos with `T=1`
- Position encoding uses the same 3D RoPE, with the temporal dimension fixed at 1
- All patches are processed (no masking), yielding `(H/patch_size) × (W/patch_size)` tokens

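For a concrete sense of the shapes involved, the sketch below works through the token count for a 448×448 image and the corresponding forward call. It assumes `model` is the already-loaded encoder (the same assumed handle as in the codec-style example later in this section), so treat it as illustrative rather than a definitive API reference.

```python
import torch

patch_size, hidden_size = 16, 1024
image = torch.randn(1, 3, 448, 448)                      # [B, C, H, W], native resolution
num_tokens = (448 // patch_size) * (448 // patch_size)   # 28 × 28 = 784 patches

# Images need no visible_indices: every patch is visible and T is fixed at 1
features = model(image)                                  # assumed loaded encoder
assert features.shape == (1, num_tokens, hidden_size)    # [1, 784, 1024]
```
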
### Video Chunk Sampling

For video input with uniform frame sampling, the ViT processes the 5D tensor format `[B, C, T, H, W]`:

```
Input: [B, C, T, H, W] → e.g., [1, 3, 16, 224, 224]
        ↓
Patch Embedding (per-frame Conv2d)
        ↓
Flatten: [B, T × H_patches × W_patches, hidden_size]
        e.g., [1, 16 × 14 × 14, 1024] = [1, 3136, 1024]
        ↓
Build visible_indices for temporal mapping
        ↓
3D RoPE Position Encoding with frame positions
        ↓
Transformer Encoder (24 layers)
        ↓
Output: [B, num_visible_patches, hidden_size]
```

**The `visible_indices` mechanism:**

The `visible_indices` tensor maps the actual frame positions onto a virtual temporal grid (e.g., 64 virtual frames), enabling correct temporal position encoding even with sparse frame sampling:

```python
import torch

# Example: 16 frames sampled from a video, mapped onto 64 virtual frame positions
num_frames = 16      # Actual number of sampled frames
frame_tokens = 256   # Patches per frame (16×16 patches for 256×256 input with patch_size=16)
target_frames = 64   # Virtual temporal grid size (the model's RoPE temporal dimension)

# Map the 16 actual frames to roughly evenly spaced positions in the 64-frame virtual grid
frame_pos = torch.linspace(0, target_frames - 1, num_frames).long()
# frame_pos ≈ [0, 4, 8, 12, 16, ..., 58, 63]

# Build visible_indices: each frame's patches receive position encodings based on frame_pos
visible_indices = (frame_pos.unsqueeze(-1) * frame_tokens +
                   torch.arange(frame_tokens)).reshape(1, -1)
# Shape: [1, 4096] (16 frames × 256 patches per frame)
```

This enables the model to understand temporal relationships even when frames are not densely sampled.

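Continuing from the snippet above, the sampled clip and its `visible_indices` go to the encoder together. This is a hedged sketch: `model` stands for the already-loaded encoder and follows the same calling convention as the codec-style example in the next subsection.

```python
import torch

video = torch.randn(1, 3, 16, 256, 256)                  # [B, C, T, H, W]: 16 sampled frames
outputs = model(video, visible_indices=visible_indices)  # indices from the snippet above, [1, 4096]
# outputs: [1, 4096, 1024], one feature vector per visible patch (16 frames × 256 patches)
```
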
### Codec-Style Input

Codec-style input is the most sophisticated processing mode, inspired by HEVC video compression. Instead of processing every patch from every frame, it processes only the temporally salient patches identified through motion and residual analysis.

```
Input Video: 64 frames
                       ↓
┌───────────────────────────────────────────────┐
│ HEVC Feature Extraction                       │
│ ├── Motion Vectors (MV): quarter-pel motion   │
│ └── Residuals: prediction error signals       │
└───────────────────────────────────────────────┘
                       ↓
┌───────────────────────────────────────────────┐
│ Temporal Saliency Detection                   │
│ ├── MV Energy: camera-compensated motion mag  │
│ ├── Residual Energy: prediction error mag     │
│ └── Fused Energy: weighted combination        │
└───────────────────────────────────────────────┘
                       ↓
┌───────────────────────────────────────────────┐
│ Top-K Patch Selection                         │
│ ├── Score each patch by fused energy          │
│ ├── Select K most salient patches             │
│ └── Build sparse visible_indices              │
└───────────────────────────────────────────────┘
                       ↓
┌───────────────────────────────────────────────┐
│ ViT Processing with Sparse visible_indices    │
│ ├── Input: [B, C, T, H, W] full video         │
│ ├── visible_indices: [B, K] selected patches  │
│ └── Output: [B, K, hidden_size]               │
└───────────────────────────────────────────────┘
```

**Detailed Codec Processing Pipeline:**

1. **Motion Vector Analysis**: Extract motion vectors from the HEVC codec at quarter-pixel precision, then apply camera motion compensation (median, similarity, or affine model) to isolate object motion from camera movement.

2. **Residual Analysis**: Extract prediction residuals, which capture texture changes and fine-grained motion that block-based motion compensation misses.

3. **Energy Fusion**: Combine MV energy and residual energy with configurable weights to produce a unified saliency map.

4. **Top-K Selection**: Rank all patches (across all frames) by saliency score and keep the top K (see the sketch after this list). This achieves 75-98% compression while retaining the critical temporal dynamics.

5. **Sparse Processing**: The selected patches are processed by the ViT with proper 3D position encodings, so the model retains the spatiotemporal context of each selected patch.

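Steps 1, 3, and 4 can be summarized with plain tensor operations. The sketch below is illustrative only: the shapes, the median camera-motion model, and the fusion weight `alpha` are assumptions standing in for the actual codec saliency pipeline, and the energies are random stand-ins for real codec features.

```python
import torch

# Toy setup: 64 frames at 224×224 with patch_size=16, i.e. a 14×14 patch grid per frame
T, h_p, w_p = 64, 14, 14
mv = torch.randn(T, h_p, w_p, 2)               # per-patch motion vectors (dx, dy)
residual_energy = torch.rand(T, h_p, w_p)      # prediction-error magnitude per patch

# Step 1 (median model): subtract each frame's median MV to remove camera motion
camera = mv.flatten(1, 2).median(dim=1).values                # [T, 2]
mv_energy = (mv - camera[:, None, None, :]).norm(dim=-1)      # [T, 14, 14]

# Step 3: weighted fusion into one saliency map (alpha is an illustrative choice)
alpha = 0.5
fused = alpha * mv_energy + (1 - alpha) * residual_energy

# Step 4: keep the K most salient patches across all frames, in grid order
K = 2048
visible_indices = fused.flatten().topk(K).indices.sort().values.unsqueeze(0)  # [1, K]
```
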
**Example codec-style inference:**

```python
# Codec-style: select the 2048 most salient patches from 64 frames
# (equivalent to 8 full frames' worth of tokens at 256 patches per frame)
K_keep = 2048  # 256 patches/frame × 8 frames

# visible_indices are computed by the codec saliency detection;
# each index points to a specific (frame, h, w) position in the patch grid.
# `video_path`, `video`, and `model` are assumed to be defined/loaded beforehand.
visible_indices = compute_codec_visible_indices(
    video_path,
    K=K_keep,
    mv_compensate="similarity",  # Camera motion compensation model
    patch_size=16
)

# Process with the model
outputs = model(video, visible_indices=visible_indices)
# Output: [B, 2048, 1024] - features for the 2048 selected patches
```

### Comparison of Input Modes

| Mode | Input Shape | visible_indices | Output Shape | Use Case |
|------|-------------|-----------------|--------------|----------|
| **Image** | `[B, 3, H, W]` | All patches | `[B, (H/16)×(W/16), 1024]` | Single-image understanding |
| **Video Chunk** | `[B, 3, T, H, W]` | Frame-mapped | `[B, T×(H/16)×(W/16), 1024]` | Uniform temporal sampling |
| **Codec-Style** | `[B, 3, T, H, W]` | Top-K salient | `[B, K, 1024]` | Token-efficient dense temporal coverage |

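For a rough sense of scale (illustrative numbers, not a benchmark): with a 64-frame clip at 224×224 and patch size 16, codec-style selection at K = 2048 keeps roughly a sixth of the dense token budget, consistent with the 75-98% compression range quoted above.

```python
frames, patches_per_frame = 64, (224 // 16) ** 2     # 196 patches per frame
dense_tokens = frames * patches_per_frame            # 12544 tokens for a full video chunk
codec_tokens = 2048                                  # top-K codec-style budget
print(f"{1 - codec_tokens / dense_tokens:.0%} fewer tokens")  # ~84%
```
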
### 3D RoPE Position Encoding

All three input modes share the same 3D Rotary Position Embedding (RoPE) with a 4:6:6 split:

- **Temporal (T)**: 4/16 of head dimension → captures frame ordering
- **Height (H)**: 6/16 of head dimension → captures vertical position
- **Width (W)**: 6/16 of head dimension → captures horizontal position

```python
import torch

# 3D rotary position encoding construction (simplified, illustrative sketch)
hidden_size, num_heads = 1024, 16
head_dim = hidden_size // num_heads   # 1024 // 16 = 64
half = head_dim // 2                  # 32 rotary frequency slots

# Split the frequency slots with a 4:6:6 ratio (4 + 6 + 6 = 16 units total)
unit = half // 16                     # 32 // 16 = 2
t_size = 4 * unit                     # 4 * 2 = 8  frequencies for temporal
h_size = 6 * unit                     # 6 * 2 = 12 frequencies for height
w_size = 6 * unit                     # 6 * 2 = 12 frequencies for width
# Total: 8 + 12 + 12 = 32 = half of head_dim

# Per-axis angle tables: table[pos] = pos * inverse frequencies
# (the standard RoPE frequency schedule is assumed here for illustration)
def make_table(max_pos, size):
    inv_freq = 1.0 / (10000 ** (torch.arange(size, dtype=torch.float32) / size))
    return torch.outer(torch.arange(max_pos, dtype=torch.float32), inv_freq)

freq_temporal = make_table(64, t_size)   # frame positions on the virtual temporal grid
freq_height   = make_table(64, h_size)   # patch rows
freq_width    = make_table(64, w_size)   # patch columns

# (t, h, w) ids per token, e.g. a single 448×448 image: T=1 and a 28×28 patch grid
t_ids, h_ids, w_ids = torch.meshgrid(
    torch.arange(1), torch.arange(28), torch.arange(28), indexing="ij")
t_ids, h_ids, w_ids = (x.reshape(-1) for x in (t_ids, h_ids, w_ids))

# Look up each token's angles by frame index, patch row, and patch column
freqs = torch.cat([
    freq_temporal[t_ids],   # based on frame index
    freq_height[h_ids],     # based on patch row
    freq_width[w_ids],      # based on patch column
], dim=-1)                  # [784, 32]
```

This unified position encoding allows the model to maintain consistent spatial and temporal understanding across all input modalities.

## Intended Use

### Primary Use Cases