Commit 39ce74c

committed
updated
1 parent 10bdce0 commit 39ce74c

46 files changed

Lines changed: 5 additions & 19160 deletions

README.md

Lines changed: 5 additions & 352 deletions
@@ -9,7 +9,6 @@
 <h1 align="center">
 OneVision-Encoder: HEVC-Style Vision Transformer
 </h1>
----
 
 ## 📖 Table of Contents
 
@@ -25,7 +24,7 @@
 
 ## 🔍 Introduction
 
-LLaVA-ViT is a vision encoder designed for multimodal large language models, featuring efficient video representation with sparse video input. This project provides training code, data processing tools, and model evaluation utilities.
+OneVision Encoder is a vision encoder designed for multimodal large language models, featuring efficient video representation with sparse video input. This project provides training code, data processing tools, and model evaluation utilities.
 
 ### Input Method Comparison
 
@@ -97,8 +96,8 @@ docker tag $(docker images -q | head -n 1) llava_vit:25.11.22
 
 ```bash
 docker run -it --gpus all --ipc host --net host --privileged \
--v "$(pwd)":/workspace/LLaVA-ViT \
--w /workspace/LLaVA-ViT \
+-v "$(pwd)":/workspace/OneVision Encoder \
+-w /workspace/OneVision Encoder \
 llava_vit:25.11.22 bash
 ```
 
@@ -110,10 +109,10 @@ docker run -it --gpus all --ipc host --net host --privileged \
 ```bash
 docker run -it --gpus all --ipc host --net host --privileged --cap-add IPC_LOCK \
 --ulimit memlock=-1 --ulimit stack=67108864 --rm \
--v "$(pwd)":/workspace/LLaVA-ViT -v /train_tmp:/train_tmp \
+-v "$(pwd)":/workspace/OneVision Encoder -v /train_tmp:/train_tmp \
 -v /vlm:/vlm -v /video_vit:/video_vit -v /rice_ocr:/rice_ocr \
 -v /data_0:/data_0 -v /data_1:/data_1 -v /data_2:/data_2 -v /data_3:/data_3 \
--w /workspace/LLaVA-ViT/ \
+-w /workspace/OneVision Encoder/ \
 -e NCCL_TIMEOUT=1800 -e CUDA_DEVICE_MAX_CONNECTIONS=1 -e NCCL_SOCKET_IFNAME=eth0 -e NCCL_IB_GID_INDEX=3 -e NCCL_IB_DISABLE=0 -e NCCL_IB_HCA="mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_1" -e NCCL_NET_GDR_LEVEL=2 -e NCCL_IB_QPS_PER_CONNECTION=4 -e NCCL_IB_TC=160 -e NCCL_IB_TIMEOUT=22 -e NCCL_CROSS_NIC=1 -e NCCL_MIN_NCHANNELS=8 -e NCCL_MAX_NCHANNELS=16 \
 -e http_proxy=http://172.16.5.77:8889 -e https_proxy=http://172.16.5.77:8889 \
 llava_vit:25.11.22 bash -c "service ssh restart; bash"
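
The added lines in the two `docker run` hunks above place an unquoted space in the container path (`/workspace/OneVision Encoder`). The following is a minimal sketch, assuming POSIX-style word splitting as modeled by Python's standard `shlex` module, of how such an argument would be tokenized. The quoted variant is a hypothetical fix for illustration and is not part of this commit.

```python
import shlex

# Argument text as it appears on an added line of this commit (path contains a space).
unquoted = '-w /workspace/OneVision Encoder'
# Hypothetical alternative: quote the path so it stays a single argument.
quoted = '-w "/workspace/OneVision Encoder"'

unquoted_args = shlex.split(unquoted)
quoted_args = shlex.split(quoted)

print(unquoted_args)  # ['-w', '/workspace/OneVision', 'Encoder']
print(quoted_args)    # ['-w', '/workspace/OneVision Encoder']
```

Under these assumptions the unquoted form yields an extra, separate `Encoder` argument. Quoting the path, or using a directory name without a space, would likely avoid that.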
@@ -169,352 +168,6 @@ torchrun --nproc_per_node 8 --master_port 15555 \
 
 ---
 
-## 📦 Packing ViT Model
-
-LLaVA-ViT provides a packing model (`LlavaViTPackingModel`) for efficient variable-length sequence processing with FlashAttention support, similar to Qwen2VL's vision encoder.
-
-> **Detailed documentation**: See [`model_factory/README_PACKING.md`](model_factory/README_PACKING.md) for complete usage guide.
-
-### Requirements
-
-```bash
-# FlashAttention 2 is required
-pip install flash-attn --no-build-isolation
-```
-
-### Understanding `patch_positions`
-
-The `patch_positions` parameter allows you to explicitly specify the RoPE (Rotary Position Embedding) positions for each patch. This is essential for:
-- Achieving consistent outputs between the source model and the packing model
-- Processing videos with non-uniform frame sampling (e.g., uniform sampling from long videos)
-- Enabling flexible spatial-temporal position encoding
-
-#### `patch_positions` Format
-
-`patch_positions` is a tensor of shape `(seq_len, 3)` where each row contains `[t, h, w]`:
-- `t`: Temporal position (frame index)
-- `h`: Height position (patch row index)
-- `w`: Width position (patch column index)
-
-### How to Prepare `patch_positions`
-
-#### Method 1: Using `compute_patch_positions_from_grid_thw` (Recommended for Images)
-
-For simple image processing where patches are arranged sequentially:
-
-```python
-from model_factory.vit_preview_v0_packing_hf import (
-    LlavaViTPackingModel,
-    compute_patch_positions_from_grid_thw,
-)
-import torch
-
-# For a 224x224 image with patch_size=16
-# h_patches = w_patches = 224 // 16 = 14
-grid_thw = torch.tensor([[1, 14, 14]], dtype=torch.long, device='cuda') # [t=1, h=14, w=14]
-
-# Compute patch positions automatically
-patch_positions = compute_patch_positions_from_grid_thw(grid_thw)
-# Shape: (196, 3) for 14*14=196 patches
-# Values: [[0, 0, 0], [0, 0, 1], ..., [0, 13, 13]]
-# [t, h, w] for each patch
-
-# Forward pass
-outputs = model(
-    hidden_states=hidden_states,
-    grid_thw=grid_thw,
-    patch_positions=patch_positions,
-)
-```
-
-#### Method 2: Using Interpolated Temporal Positions (For Video)
-
-For video with uniform frame sampling (e.g., 8 frames from a 64-frame context):
-
-```python
-from model_factory.convert_llava_vit_packing_to_hf import (
-    interpolate_frame_indices,
-    compute_patch_positions_with_interpolated_temporal,
-)
-import torch
-
-# Example: 8 frames uniformly sampled from 64-frame context
-num_frames = 8
-target_frames = 64 # The source model's expected temporal context
-h_patches, w_patches = 14, 14 # For 224x224 image with patch_size=16
-
-# Step 1: Compute interpolated frame indices
-frame_indices = torch.arange(num_frames).unsqueeze(0).cuda() # [1, 8] = [[0,1,2,3,4,5,6,7]]
-total_frames = torch.tensor([num_frames]).cuda() # [8]
-
-interpolated_indices = interpolate_frame_indices(frame_indices, total_frames, target_frames)
-# Result: [[0, 9, 18, 27, 36, 45, 54, 63]] - evenly spaced in 64-frame context
-
-# Step 2: Compute patch positions with interpolated temporal positions
-patch_positions = compute_patch_positions_with_interpolated_temporal(
-    interpolated_indices, h_patches, w_patches, device='cuda'
-)
-# Shape: (num_frames * h_patches * w_patches, 3) = (8*14*14, 3) = (1568, 3)
-# Each row: [t_interpolated, h, w]
-# The temporal values are 0, 9, 18, 27, 36, 45, 54, 63 (interpolated to 64-frame context)
-
-# Create grid_thw for actual frames
-grid_thw = torch.tensor([[num_frames, h_patches, w_patches]], dtype=torch.long, device='cuda')
-
-# Forward pass
-outputs = model(
-    hidden_states=hidden_states,
-    grid_thw=grid_thw,
-    patch_positions=patch_positions,
-)
-```
-
-#### Method 3: Manual Construction (Advanced)
-
-For custom spatial-temporal positions:
-
-```python
-import torch
-
-def manual_patch_positions(t_frames, h_patches, w_patches, device='cuda'):
-    """
-    Manually construct patch_positions tensor.
-
-    Patch ordering: [frame_0_patches, frame_1_patches, ..., frame_t_patches]
-    Within each frame: row-major order (h varies slower than w)
-    """
-    positions = []
-    for t in range(t_frames):
-        for h in range(h_patches):
-            for w in range(w_patches):
-                positions.append([t, h, w])
-    return torch.tensor(positions, dtype=torch.long, device=device)
-
-# Example: 8 frames at 14x14 patches
-patch_positions = manual_patch_positions(8, 14, 14)
-# Shape: (1568, 3)
-# Values: [[0,0,0], [0,0,1], ..., [0,13,13], [1,0,0], ..., [7,13,13]]
-```
-
-### Complete Example: Image Processing
-
-```python
-import torch
-from PIL import Image
-import torchvision.transforms as T
-from model_factory.vit_preview_v0_packing_hf import (
-    LlavaViTPackingModel,
-    compute_patch_positions_from_grid_thw,
-)
-
-# Load model
-model = LlavaViTPackingModel.from_pretrained("path/to/model", torch_dtype=torch.bfloat16)
-model = model.cuda().eval()
-
-# Prepare image
-patch_size = 16
-image = Image.open("image.jpg").resize((448, 448))
-transform = T.Compose([
-    T.ToTensor(),
-    T.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
-                std=[0.26862954, 0.26130258, 0.27577711]),
-])
-pixel_tensor = transform(image) # (3, 448, 448)
-
-# Calculate patch dimensions
-channels, height, width = pixel_tensor.shape
-h_patches = height // patch_size # 28
-w_patches = width // patch_size # 28
-
-# Reshape to patches: (C, H, W) -> (seq_len, patch_dim)
-patches = pixel_tensor.view(channels, h_patches, patch_size, w_patches, patch_size)
-patches = patches.permute(1, 3, 0, 2, 4).contiguous() # (h, w, C, pH, pW)
-hidden_states = patches.view(h_patches * w_patches, patch_size * patch_size * channels)
-hidden_states = hidden_states.cuda().bfloat16()
-
-# Prepare grid_thw and patch_positions
-grid_thw = torch.tensor([[1, h_patches, w_patches]], dtype=torch.long, device='cuda')
-patch_positions = compute_patch_positions_from_grid_thw(grid_thw)
-
-# Forward pass
-with torch.no_grad():
-    outputs = model(
-        hidden_states=hidden_states,
-        grid_thw=grid_thw,
-        patch_positions=patch_positions,
-    )
-
-print(f"Output shape: {outputs.last_hidden_state.shape}") # (784, hidden_size)
-print(f"Pooler shape: {outputs.pooler_output.shape}") # (1, hidden_size)
-```
-
-### Complete Example: Video Processing
-
-```python
-import torch
-from PIL import Image
-import torchvision.transforms as T
-from model_factory.vit_preview_v0_packing_hf import LlavaViTPackingModel
-from model_factory.convert_llava_vit_packing_to_hf import (
-    interpolate_frame_indices,
-    compute_patch_positions_with_interpolated_temporal,
-)
-
-# Load model
-model = LlavaViTPackingModel.from_pretrained("path/to/model", torch_dtype=torch.bfloat16)
-model = model.cuda().eval()
-
-# Video parameters
-patch_size = 16
-num_frames = 8
-frame_size = 224
-target_frames = 64 # Source model's temporal context
-
-# Load video frames (example: list of PIL Images)
-frames = [Image.open(f"frame_{i}.jpg").resize((frame_size, frame_size)) for i in range(num_frames)]
-
-transform = T.Compose([
-    T.ToTensor(),
-    T.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
-                std=[0.26862954, 0.26130258, 0.27577711]),
-])
-
-# Calculate patch dimensions
-h_patches = frame_size // patch_size # 14
-w_patches = frame_size // patch_size # 14
-
-# Process frames and reshape to patches
-all_patches = []
-for frame in frames:
-    pixel_tensor = transform(frame) # (3, 224, 224)
-    channels = pixel_tensor.shape[0]
-    patches = pixel_tensor.view(channels, h_patches, patch_size, w_patches, patch_size)
-    patches = patches.permute(1, 3, 0, 2, 4).contiguous() # (h, w, C, pH, pW)
-    frame_patches = patches.view(h_patches * w_patches, patch_size * patch_size * channels)
-    all_patches.append(frame_patches)
-
-hidden_states = torch.cat(all_patches, dim=0) # (num_frames * h * w, patch_dim)
-hidden_states = hidden_states.cuda().bfloat16()
-
-# Compute interpolated temporal positions for video
-frame_indices = torch.arange(num_frames).unsqueeze(0).cuda()
-total_frames_tensor = torch.tensor([num_frames]).cuda()
-interpolated_indices = interpolate_frame_indices(frame_indices, total_frames_tensor, target_frames)
-
-# Compute patch_positions with interpolated temporal values
-patch_positions = compute_patch_positions_with_interpolated_temporal(
-    interpolated_indices, h_patches, w_patches, device='cuda'
-)
-
-# grid_thw uses actual frame count
-grid_thw = torch.tensor([[num_frames, h_patches, w_patches]], dtype=torch.long, device='cuda')
-
-# Forward pass
-with torch.no_grad():
-    outputs = model(
-        hidden_states=hidden_states,
-        grid_thw=grid_thw,
-        patch_positions=patch_positions,
-    )
-
-print(f"Output shape: {outputs.last_hidden_state.shape}") # (1568, hidden_size)
-print(f"Pooler shape: {outputs.pooler_output.shape}") # (1, hidden_size)
-```
-
-### Model Conversion
-
-Convert weights from source model to packing model format:
-
-```bash
-python model_factory/convert_llava_vit_packing_to_hf.py \
-llava_vit_large_ln \
-/path/to/backbone.pt \
---output_dir /path/to/output
-```
-
-The conversion script automatically verifies both image and video consistency between the source and packing models.
-
----
-
-## 👥 Contributors
-
-Thanks so much to all of our amazing contributors!
-
-<!-- readme: collaborators,contributors -start -->
-<table>
-<tbody>
-<tr>
-<td align="center">
-<a href="https://github.com/GeoffreyChen777">
-<img src="https://avatars.githubusercontent.com/u/14183213?v=4" width="80;" alt="GeoffreyChen777"/>
-<br />
-<sub><b>GeoffreyChen777</b></sub>
-</a>
-</td>
-<td align="center">
-<a href="https://github.com/Luodian">
-<img src="https://avatars.githubusercontent.com/u/15847405?v=4" width="80;" alt="Luodian"/>
-<br />
-<sub><b>Luodian</b></sub>
-</a>
-</td>
-<td align="center">
-<a href="https://github.com/ZhangYuanhan-AI">
-<img src="https://avatars.githubusercontent.com/u/18485270?v=4" width="80;" alt="ZhangYuanhan-AI"/>
-<br />
-<sub><b>ZhangYuanhan-AI</b></sub>
-</a>
-</td>
-<td align="center">
-<a href="https://github.com/anxiangsir">
-<img src="https://avatars.githubusercontent.com/u/31175974?v=4" width="80;" alt="anxiangsir"/>
-<br />
-<sub><b>anxiangsir</b></sub>
-</a>
-</td>
-<td align="center">
-<a href="https://github.com/yiyexy">
-<img src="https://avatars.githubusercontent.com/u/35927125?v=4" width="80;" alt="yiyexy"/>
-<br />
-<sub><b>yiyexy</b></sub>
-</a>
-</td>
-<td align="center">
-<a href="https://github.com/manyuan97">
-<img src="https://avatars.githubusercontent.com/u/70136737?v=4" width="80;" alt="manyuan97"/>
-<br />
-<sub><b>manyuan97</b></sub>
-</a>
-</td>
-<td align="center">
-<a href="https://github.com/YunyaoYan">
-<img src="https://avatars.githubusercontent.com/u/109638667?v=4" width="80;" alt="YunyaoYan"/>
-<br />
-<sub><b>YunyaoYan</b></sub>
-</a>
-</td>
-<td align="center">
-<a href="https://github.com/FeilongTangmonash">
-<img src="https://avatars.githubusercontent.com/u/152372878?v=4" width="80;" alt="FeilongTangmonash"/>
-<br />
-<sub><b>FeilongTangmonash</b></sub>
-</a>
-</td>
-</tr>
-<tr>
-<td align="center">
-<a href="https://github.com/wkzhang636">
-<img src="https://avatars.githubusercontent.com/u/194186498?v=4" width="80;" alt="wkzhang636"/>
-<br />
-<sub><b>wkzhang636</b></sub>
-</a>
-</td>
-</tr>
-<tbody>
-</table>
-<!-- readme: collaborators,contributors -end -->
-
----
 
 ## 📄 License
 
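
The README text removed in the last hunk describes `patch_positions` as rows of `[t, h, w]`, with temporal indices for 8 sampled frames stretched over a 64-frame context. A minimal pure-Python sketch of that layout follows. `interpolate_indices` and `build_patch_positions` are hypothetical helper names, and the rounding rule is an assumption chosen to reproduce the `[0, 9, ..., 63]` values quoted in the removed example. It is not the repository's implementation of `interpolate_frame_indices`.

```python
def interpolate_indices(num_frames, target_frames):
    """Spread num_frames indices evenly over [0, target_frames - 1] (assumed rule)."""
    if num_frames == 1:
        return [0]
    step = (target_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


def build_patch_positions(t_indices, h_patches, w_patches):
    """Return [t, h, w] rows, frame by frame, row-major within each frame."""
    return [
        [t, h, w]
        for t in t_indices
        for h in range(h_patches)
        for w in range(w_patches)
    ]


t_indices = interpolate_indices(8, 64)
positions = build_patch_positions(t_indices, 14, 14)

print(t_indices)       # [0, 9, 18, 27, 36, 45, 54, 63]
print(len(positions))  # 1568
print(positions[-1])   # [63, 13, 13]
```

The row count matches the `(1568, 3)` shape stated in the removed video example (8 frames of 14 x 14 patches).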