99<h1 align="center">
1010 OneVision-Encoder: HEVC-Style Vision Transformer
1111</h1>
12- ---
1312
1413## 📖 Table of Contents
1514
2524
2625## 🔍 Introduction
2726
28- LLaVA-ViT is a vision encoder designed for multimodal large language models, featuring efficient video representation with sparse video input. This project provides training code, data processing tools, and model evaluation utilities.
27+ OneVision Encoder is a vision encoder designed for multimodal large language models, featuring efficient video representation with sparse video input. This project provides training code, data processing tools, and model evaluation utilities.
2928
3029### Input Method Comparison
3130
@@ -97,8 +96,8 @@ docker tag $(docker images -q | head -n 1) llava_vit:25.11.22
9796
9897``` bash
9998docker run -it --gpus all --ipc host --net host --privileged \
100- -v "$(pwd)":/workspace/LLaVA-ViT \
101- -w /workspace/LLaVA-ViT \
99+ -v "$(pwd)":/workspace/OneVision Encoder \
100+ -w /workspace/OneVision Encoder \
102101 llava_vit:25.11.22 bash
103102```
104103
@@ -110,10 +109,10 @@ docker run -it --gpus all --ipc host --net host --privileged \
110109``` bash
111110docker run -it --gpus all --ipc host --net host --privileged --cap-add IPC_LOCK \
112111 --ulimit memlock=-1 --ulimit stack=67108864 --rm \
113- -v "$(pwd)":/workspace/LLaVA-ViT -v /train_tmp:/train_tmp \
112+ -v "$(pwd)":/workspace/OneVision Encoder -v /train_tmp:/train_tmp \
114113 -v /vlm:/vlm -v /video_vit:/video_vit -v /rice_ocr:/rice_ocr \
115114 -v /data_0:/data_0 -v /data_1:/data_1 -v /data_2:/data_2 -v /data_3:/data_3 \
116- -w /workspace/LLaVA-ViT/ \
115+ -w /workspace/OneVision Encoder/ \
117116 -e NCCL_TIMEOUT=1800 -e CUDA_DEVICE_MAX_CONNECTIONS=1 -e NCCL_SOCKET_IFNAME=eth0 -e NCCL_IB_GID_INDEX=3 -e NCCL_IB_DISABLE=0 -e NCCL_IB_HCA="mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_1" -e NCCL_NET_GDR_LEVEL=2 -e NCCL_IB_QPS_PER_CONNECTION=4 -e NCCL_IB_TC=160 -e NCCL_IB_TIMEOUT=22 -e NCCL_CROSS_NIC=1 -e NCCL_MIN_NCHANNELS=8 -e NCCL_MAX_NCHANNELS=16 \
118117 -e http_proxy=http://172.16.5.77:8889 -e https_proxy=http://172.16.5.77:8889 \
119118 llava_vit:25.11.22 bash -c "service ssh restart; bash"
@@ -169,352 +168,6 @@ torchrun --nproc_per_node 8 --master_port 15555 \
169168
170169---
171170
172- ## 📦 Packing ViT Model
173-
174- LLaVA-ViT provides a packing model (`LlavaViTPackingModel`) for efficient variable-length sequence processing with FlashAttention support, similar to Qwen2VL's vision encoder.
175-
176- > **Detailed documentation**: See [`model_factory/README_PACKING.md`](model_factory/README_PACKING.md) for complete usage guide.
177-
178- ### Requirements
179-
180- ``` bash
181- # FlashAttention 2 is required
182- pip install flash-attn --no-build-isolation
183- ```
184-
185- ### Understanding `patch_positions`
186-
187- The `patch_positions` parameter allows you to explicitly specify the RoPE (Rotary Position Embedding) positions for each patch. This is essential for:
188- - Achieving consistent outputs between the source model and the packing model
189- - Processing videos with non-uniform frame sampling (e.g., uniform sampling from long videos)
190- - Enabling flexible spatial-temporal position encoding
191-
192- #### `patch_positions` Format
193-
194- `patch_positions` is a tensor of shape `(seq_len, 3)` where each row contains `[t, h, w]`:
195- - `t`: Temporal position (frame index)
196- - `h`: Height position (patch row index)
197- - `w`: Width position (patch column index)
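
As an illustration of the row layout described above, the following minimal sketch maps a flat patch index to its `[t, h, w]` row. It assumes frame-major, row-major patch ordering. The helper name `index_to_thw` is hypothetical and not part of the repository; it uses plain Python so it runs without PyTorch.

```python
def index_to_thw(index, h_patches, w_patches):
    """Map a flat patch index to [t, h, w], assuming frame-major, row-major order."""
    # Patches of one frame are contiguous, so the frame index is the quotient.
    t, within_frame = divmod(index, h_patches * w_patches)
    # Inside a frame, rows are contiguous: h is the row, w the column.
    h, w = divmod(within_frame, w_patches)
    return [t, h, w]


# One 14x14 frame (224x224 image, patch_size=16) gives 196 rows of [t, h, w].
rows = [index_to_thw(i, 14, 14) for i in range(1 * 14 * 14)]
print(len(rows), rows[0], rows[1], rows[-1])
```

With more than one frame, index 196 would map to `[1, 0, 0]`, the first patch of the second frame.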
198-
199- ### How to Prepare `patch_positions`
200-
201- #### Method 1: Using `compute_patch_positions_from_grid_thw` (Recommended for Images)
202-
203- For simple image processing where patches are arranged sequentially:
204-
205- ```python
206- from model_factory.vit_preview_v0_packing_hf import (
207-     LlavaViTPackingModel,
208-     compute_patch_positions_from_grid_thw,
209- )
210- import torch
211-
212- # For a 224x224 image with patch_size=16
213- # h_patches = w_patches = 224 // 16 = 14
214- grid_thw = torch.tensor([[1, 14, 14]], dtype=torch.long, device='cuda')  # [t=1, h=14, w=14]
215-
216- # Compute patch positions automatically
217- patch_positions = compute_patch_positions_from_grid_thw(grid_thw)
218- # Shape: (196, 3) for 14*14=196 patches
219- # Values: [[0, 0, 0], [0, 0, 1], ..., [0, 13, 13]]
220- # [t, h, w] for each patch
221-
222- # Forward pass
223- outputs = model(
224-     hidden_states=hidden_states,
225-     grid_thw=grid_thw,
226-     patch_positions=patch_positions,
227- )
228- ```
229-
230- #### Method 2: Using Interpolated Temporal Positions (For Video)
231-
232- For video with uniform frame sampling (e.g., 8 frames from a 64-frame context):
233-
234- ```python
235- from model_factory.convert_llava_vit_packing_to_hf import (
236-     interpolate_frame_indices,
237-     compute_patch_positions_with_interpolated_temporal,
238- )
239- import torch
240-
241- # Example: 8 frames uniformly sampled from 64-frame context
242- num_frames = 8
243- target_frames = 64  # The source model's expected temporal context
244- h_patches, w_patches = 14, 14  # For 224x224 image with patch_size=16
245-
246- # Step 1: Compute interpolated frame indices
247- frame_indices = torch.arange(num_frames).unsqueeze(0).cuda()  # [1, 8] = [[0,1,2,3,4,5,6,7]]
248- total_frames = torch.tensor([num_frames]).cuda()  # [8]
249-
250- interpolated_indices = interpolate_frame_indices(frame_indices, total_frames, target_frames)
251- # Result: [[0, 9, 18, 27, 36, 45, 54, 63]] - evenly spaced in 64-frame context
252-
253- # Step 2: Compute patch positions with interpolated temporal positions
254- patch_positions = compute_patch_positions_with_interpolated_temporal(
255-     interpolated_indices, h_patches, w_patches, device='cuda'
256- )
257- # Shape: (num_frames * h_patches * w_patches, 3) = (8*14*14, 3) = (1568, 3)
258- # Each row: [t_interpolated, h, w]
259- # The temporal values are 0, 9, 18, 27, 36, 45, 54, 63 (interpolated to 64-frame context)
260-
261- # Create grid_thw for actual frames
262- grid_thw = torch.tensor([[num_frames, h_patches, w_patches]], dtype=torch.long, device='cuda')
263-
264- # Forward pass
265- outputs = model(
266-     hidden_states=hidden_states,
267-     grid_thw=grid_thw,
268-     patch_positions=patch_positions,
269- )
270- ```
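
The temporal values in the comments above are consistent with spreading the sampled frames evenly over the target context and rounding to the nearest integer. The sketch below reproduces that arithmetic in plain Python. `evenly_spaced_indices` is a hypothetical helper, and the repository's `interpolate_frame_indices` may use a different formula or rounding rule.

```python
def evenly_spaced_indices(num_frames, target_frames):
    """Spread num_frames positions evenly over [0, target_frames - 1].

    Assumption: linear scaling followed by rounding to the nearest integer.
    """
    if num_frames == 1:
        return [0]  # a single frame has nothing to spread; pin it to position 0
    scale = (target_frames - 1) / (num_frames - 1)
    return [round(i * scale) for i in range(num_frames)]


print(evenly_spaced_indices(8, 64))  # [0, 9, 18, 27, 36, 45, 54, 63]
```

The first and last sampled frames always land on positions `0` and `target_frames - 1`, so a short clip still spans the full temporal range.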
271-
272- #### Method 3: Manual Construction (Advanced)
273-
274- For custom spatial-temporal positions:
275-
276- ```python
277- import torch
278-
279- def manual_patch_positions(t_frames, h_patches, w_patches, device='cuda'):
280-     """
281-     Manually construct patch_positions tensor.
282-
283-     Patch ordering: [frame_0_patches, frame_1_patches, ..., frame_t_patches]
284-     Within each frame: row-major order (h varies slower than w)
285-     """
286-     positions = []
287-     for t in range(t_frames):
288-         for h in range(h_patches):
289-             for w in range(w_patches):
290-                 positions.append([t, h, w])
291-     return torch.tensor(positions, dtype=torch.long, device=device)
292-
293- # Example: 8 frames at 14x14 patches
294- patch_positions = manual_patch_positions(8, 14, 14)
295- # Shape: (1568, 3)
296- # Values: [[0,0,0], [0,0,1], ..., [0,13,13], [1,0,0], ..., [7,13,13]]
297- ```
298-
299- ### Complete Example: Image Processing
300-
301- ```python
302- import torch
303- from PIL import Image
304- import torchvision.transforms as T
305- from model_factory.vit_preview_v0_packing_hf import (
306-     LlavaViTPackingModel,
307-     compute_patch_positions_from_grid_thw,
308- )
309-
310- # Load model
311- model = LlavaViTPackingModel.from_pretrained("path/to/model", torch_dtype=torch.bfloat16)
312- model = model.cuda().eval()
313-
314- # Prepare image
315- patch_size = 16
316- image = Image.open("image.jpg").resize((448, 448))
317- transform = T.Compose([
318-     T.ToTensor(),
319-     T.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
320-                 std=[0.26862954, 0.26130258, 0.27577711]),
321- ])
322- pixel_tensor = transform(image)  # (3, 448, 448)
323-
324- # Calculate patch dimensions
325- channels, height, width = pixel_tensor.shape
326- h_patches = height // patch_size  # 28
327- w_patches = width // patch_size  # 28
328-
329- # Reshape to patches: (C, H, W) -> (seq_len, patch_dim)
330- patches = pixel_tensor.view(channels, h_patches, patch_size, w_patches, patch_size)
331- patches = patches.permute(1, 3, 0, 2, 4).contiguous()  # (h, w, C, pH, pW)
332- hidden_states = patches.view(h_patches * w_patches, patch_size * patch_size * channels)
333- hidden_states = hidden_states.cuda().bfloat16()
334-
335- # Prepare grid_thw and patch_positions
336- grid_thw = torch.tensor([[1, h_patches, w_patches]], dtype=torch.long, device='cuda')
337- patch_positions = compute_patch_positions_from_grid_thw(grid_thw)
338-
339- # Forward pass
340- with torch.no_grad():
341-     outputs = model(
342-         hidden_states=hidden_states,
343-         grid_thw=grid_thw,
344-         patch_positions=patch_positions,
345-     )
346-
347- print(f"Output shape: {outputs.last_hidden_state.shape}")  # (784, hidden_size)
348- print(f"Pooler shape: {outputs.pooler_output.shape}")  # (1, hidden_size)
349- ```
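
The shape comments in the example follow directly from the patch arithmetic. As a quick sanity check, this sketch computes the expected sequence length and per-patch feature size. `expected_shapes` is a hypothetical helper introduced only for illustration, and it assumes the image sides are exact multiples of `patch_size`.

```python
def expected_shapes(num_frames, height, width, patch_size=16, channels=3):
    """Return (seq_len, patch_dim) for patchified input.

    Assumption: height and width are exact multiples of patch_size.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    h_patches = height // patch_size
    w_patches = width // patch_size
    seq_len = num_frames * h_patches * w_patches      # rows of hidden_states
    patch_dim = channels * patch_size * patch_size    # columns of hidden_states
    return seq_len, patch_dim


print(expected_shapes(1, 448, 448))  # (784, 768): one 448x448 image
print(expected_shapes(8, 224, 224))  # (1568, 768): 8 frames at 224x224
```

Checking these numbers before the forward pass makes it easier to catch a mismatch between `hidden_states`, `grid_thw`, and `patch_positions`, which must all describe the same number of patches.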
350-
351- ### Complete Example: Video Processing
352-
353- ```python
354- import torch
355- from PIL import Image
356- import torchvision.transforms as T
357- from model_factory.vit_preview_v0_packing_hf import LlavaViTPackingModel
358- from model_factory.convert_llava_vit_packing_to_hf import (
359-     interpolate_frame_indices,
360-     compute_patch_positions_with_interpolated_temporal,
361- )
362-
363- # Load model
364- model = LlavaViTPackingModel.from_pretrained("path/to/model", torch_dtype=torch.bfloat16)
365- model = model.cuda().eval()
366-
367- # Video parameters
368- patch_size = 16
369- num_frames = 8
370- frame_size = 224
371- target_frames = 64  # Source model's temporal context
372-
373- # Load video frames (example: list of PIL Images)
374- frames = [Image.open(f"frame_{i}.jpg").resize((frame_size, frame_size)) for i in range(num_frames)]
375-
376- transform = T.Compose([
377-     T.ToTensor(),
378-     T.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
379-                 std=[0.26862954, 0.26130258, 0.27577711]),
380- ])
381-
382- # Calculate patch dimensions
383- h_patches = frame_size // patch_size  # 14
384- w_patches = frame_size // patch_size  # 14
385-
386- # Process frames and reshape to patches
387- all_patches = []
388- for frame in frames:
389-     pixel_tensor = transform(frame)  # (3, 224, 224)
390-     channels = pixel_tensor.shape[0]
391-     patches = pixel_tensor.view(channels, h_patches, patch_size, w_patches, patch_size)
392-     patches = patches.permute(1, 3, 0, 2, 4).contiguous()  # (h, w, C, pH, pW)
393-     frame_patches = patches.view(h_patches * w_patches, patch_size * patch_size * channels)
394-     all_patches.append(frame_patches)
395-
396- hidden_states = torch.cat(all_patches, dim=0)  # (num_frames * h * w, patch_dim)
397- hidden_states = hidden_states.cuda().bfloat16()
398-
399- # Compute interpolated temporal positions for video
400- frame_indices = torch.arange(num_frames).unsqueeze(0).cuda()
401- total_frames_tensor = torch.tensor([num_frames]).cuda()
402- interpolated_indices = interpolate_frame_indices(frame_indices, total_frames_tensor, target_frames)
403-
404- # Compute patch_positions with interpolated temporal values
405- patch_positions = compute_patch_positions_with_interpolated_temporal(
406-     interpolated_indices, h_patches, w_patches, device='cuda'
407- )
408-
409- # grid_thw uses actual frame count
410- grid_thw = torch.tensor([[num_frames, h_patches, w_patches]], dtype=torch.long, device='cuda')
411-
412- # Forward pass
413- with torch.no_grad():
414-     outputs = model(
415-         hidden_states=hidden_states,
416-         grid_thw=grid_thw,
417-         patch_positions=patch_positions,
418-     )
419-
420- print(f"Output shape: {outputs.last_hidden_state.shape}")  # (1568, hidden_size)
421- print(f"Pooler shape: {outputs.pooler_output.shape}")  # (1, hidden_size)
422- ```
423-
424- ### Model Conversion
425-
426- Convert weights from the source model format to the packing model format:
427-
428- ```bash
429- python model_factory/convert_llava_vit_packing_to_hf.py \
430-     llava_vit_large_ln \
431-     /path/to/backbone.pt \
432-     --output_dir /path/to/output
433- ```
434-
435- The conversion script automatically verifies both image and video consistency between the source and packing models.
436-
437- ---
438-
439- ## 👥 Contributors
440-
441- Thanks so much to all of our amazing contributors!
442-
443- <!-- readme: collaborators,contributors -start -->
444- <table >
445- <tbody>
446- <tr>
447- <td align="center">
448- <a href="https://github.com/GeoffreyChen777">
449- <img src="https://avatars.githubusercontent.com/u/14183213?v=4" width="80;" alt="GeoffreyChen777"/>
450- <br />
451- <sub><b>GeoffreyChen777</b></sub>
452- </a>
453- </td>
454- <td align="center">
455- <a href="https://github.com/Luodian">
456- <img src="https://avatars.githubusercontent.com/u/15847405?v=4" width="80;" alt="Luodian"/>
457- <br />
458- <sub><b>Luodian</b></sub>
459- </a>
460- </td>
461- <td align="center">
462- <a href="https://github.com/ZhangYuanhan-AI">
463- <img src="https://avatars.githubusercontent.com/u/18485270?v=4" width="80;" alt="ZhangYuanhan-AI"/>
464- <br />
465- <sub><b>ZhangYuanhan-AI</b></sub>
466- </a>
467- </td>
468- <td align="center">
469- <a href="https://github.com/anxiangsir">
470- <img src="https://avatars.githubusercontent.com/u/31175974?v=4" width="80;" alt="anxiangsir"/>
471- <br />
472- <sub><b>anxiangsir</b></sub>
473- </a>
474- </td>
475- <td align="center">
476- <a href="https://github.com/yiyexy">
477- <img src="https://avatars.githubusercontent.com/u/35927125?v=4" width="80;" alt="yiyexy"/>
478- <br />
479- <sub><b>yiyexy</b></sub>
480- </a>
481- </td>
482- <td align="center">
483- <a href="https://github.com/manyuan97">
484- <img src="https://avatars.githubusercontent.com/u/70136737?v=4" width="80;" alt="manyuan97"/>
485- <br />
486- <sub><b>manyuan97</b></sub>
487- </a>
488- </td>
489- <td align="center">
490- <a href="https://github.com/YunyaoYan">
491- <img src="https://avatars.githubusercontent.com/u/109638667?v=4" width="80;" alt="YunyaoYan"/>
492- <br />
493- <sub><b>YunyaoYan</b></sub>
494- </a>
495- </td>
496- <td align="center">
497- <a href="https://github.com/FeilongTangmonash">
498- <img src="https://avatars.githubusercontent.com/u/152372878?v=4" width="80;" alt="FeilongTangmonash"/>
499- <br />
500- <sub><b>FeilongTangmonash</b></sub>
501- </a>
502- </td>
503- </tr>
504- <tr>
505- <td align="center">
506- <a href="https://github.com/wkzhang636">
507- <img src="https://avatars.githubusercontent.com/u/194186498?v=4" width="80;" alt="wkzhang636"/>
508- <br />
509- <sub><b>wkzhang636</b></sub>
510- </a>
511- </td>
512- </tr>
513- <tbody>
514- </table >
515- <!-- readme: collaborators,contributors -end -->
516-
517- ---
518171
519172## 📄 License
520173