Commit f7e7781

Update patch_positions
1 parent d7d82a4 commit f7e7781

File tree

7 files changed: +240 −173 lines changed

README.md

Lines changed: 24 additions & 59 deletions
@@ -1,6 +1,3 @@
-<!-- <p align="center">
-  <img alt="OneVision Encoder" src="asset/onevision_encoder.png" width="1200" style="max-width: 100%;">
-</p> -->
 <picture>
   <source media="(prefers-color-scheme: dark)" srcset="asset/logo_dark.png">
   <source media="(prefers-color-scheme: light)" srcset="asset/logo_light.png">
@@ -21,7 +18,6 @@
 
 </div>
 
-
 <p align="center">
   <picture>
     <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/method_github_dark.png">
@@ -52,7 +48,6 @@ We introduce OneVision Encoder, a vision transformer that resolves this trade-of
 
 Coupled with global contrastive learning over a 2M-scale concept memory bank, OneVision Encoder achieves state-of-the-art performance across major video benchmarks (MVBench, VideoMME, Perception Test), while also delivering strong results on image understanding tasks (DocVQA, ChartQA, and OCRBench).
 
-
 ### Key Features
 
 - **Unified Vision Foundation**: A single base model for consistent understanding of images, videos, and OCR.
@@ -98,7 +93,6 @@ The visualization below illustrates four different video processing pipelines.
 
 Standard contrastive learning methods (e.g., CLIP) are fundamentally constrained by batch size, as negative samples are drawn only from the current batch, typically limited to 32K–64K examples. This restriction yields a narrow and incomplete view of the embedding space, often resulting in suboptimal representation learning. In contrast, our approach maintains a global concept bank comprising 2M clustered centers, allowing each training sample to contrast against a diverse and representative set of negatives independent of batch composition. This global contrasting mechanism leads to more discriminative embeddings and well-separated semantic clusters.
 
-
 <p align="center">
   <picture>
     <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/loss_github_dark.gif">
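The global-contrast mechanism described in the README paragraph above (negatives drawn from a fixed concept bank rather than from the batch) can be sketched as cross-entropy over similarities against the whole bank. The sizes below are toy values made up for illustration; this is a sketch of the idea, not the project's training code:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, bank_size, batch = 8, 1000, 4  # toy sizes; the real bank holds 2M centers

# Concept bank: L2-normalized cluster centers (stand-in for the clustered bank).
bank = rng.standard_normal((bank_size, dim))
bank /= np.linalg.norm(bank, axis=1, keepdims=True)

# A batch of embeddings, each assigned a target center in the bank.
emb = rng.standard_normal((batch, dim))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
targets = rng.integers(0, bank_size, size=batch)

# Similarities against the WHOLE bank: negatives are independent of batch composition.
logits = emb @ bank.T / 0.07  # temperature-scaled cosine similarity

# Numerically stable log-softmax + cross-entropy over bank entries.
m = logits.max(axis=1, keepdims=True)
log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(batch), targets].mean()
```

Because the denominator runs over every bank entry, the set of negatives stays the same regardless of which samples share a batch.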
@@ -107,10 +101,8 @@ Standard contrastive learning methods (e.g., CLIP) are fundamentally constrained
   </picture>
 </p>
 
-
 ---
 
-
 ### LMM Probe Results
 
 We train the model on a mixed dataset comprising 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT, proceeding directly to Stage-2 fine-tuning. Following a streamlined native-resolution strategy inspired by LLaVA-OneVision, input frames that match the model’s native resolution are fed directly into the network without tiling or cropping, allowing us to fully evaluate the ViT’s native-resolution modeling capability.
@@ -123,25 +115,22 @@ We train the model on a mixed dataset comprising 740K samples from LLaVA-OneVisi
   </picture>
 </p>
 
-
-
-
-
 ## ⚡ Quick Start
 
 > [!IMPORTANT]
 > **Transformers Version Compatibility:**
-> -**`transformers==4.53.1`** (Recommended): Works with `AutoModel.from_pretrained()`
+>
+> -**`transformers==4.57.3`** (Recommended): Works with `AutoModel.from_pretrained()`
 > - ⚠️ **`transformers>=5.0.0`**: Not currently supported. We are actively working on a fix.
 
-
 > **Note:** This model supports native resolution input. For optimal performance:
+>
 > - **Image**: 448×448 resolution (pre-trained)
 > - **Video**: 224×224 resolution with 256 tokens per frame (pre-trained)
 >
 > Use CLIP preprocessing from the [model repository](https://huggingface.co/lmms-lab-encoder/onevision-encoder-large).
 
-### Using AutoModel (Recommended: transformers==4.53.1)
+### Using AutoModel (Recommended: transformers==4.57.3)
 
 ```python
 from transformers import AutoModel, AutoImageProcessor
@@ -169,31 +158,32 @@ with torch.no_grad():
 # outputs.pooler_output: [B, hidden_size]
 
 # Video inference: [B, C, T, H, W] with patch_positions
-import math
-num_frames, frame_tokens, target_frames = 16, 256, 64
-patches_per_side = int(math.sqrt(frame_tokens))  # 16 for 256 tokens
+num_frames, target_frames = 16, 64
+patch_size = 14
 # Load video frames and preprocess each frame (replace with your video frame paths)
 frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
 video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
 # Reshape from [T, C, H, W] to [B, C, T, H, W]
 video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")
 
 # Build patch_positions for temporal sampling: [B, num_frames * frame_tokens, 3]
-# Each position is (t, h, w) where t is temporal index, h/w are spatial patch coordinates
-frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()  # [num_frames]
-per = torch.arange(frame_tokens).cuda()  # [frame_tokens]
-
-# Temporal positions: frame index for each patch
-t_positions = frame_pos.unsqueeze(-1).expand(-1, frame_tokens).reshape(1, -1)  # [1, num_frames * frame_tokens]
-# Spatial positions: h and w within each frame's patch grid
-h_positions = (per // patches_per_side).unsqueeze(0).expand(num_frames, -1).reshape(1, -1)
-w_positions = (per % patches_per_side).unsqueeze(0).expand(num_frames, -1).reshape(1, -1)
-# Stack to create patch_positions: [B, num_frames * frame_tokens, 3]
-patch_positions = torch.stack([t_positions, h_positions, w_positions], dim=-1)
-# patch_positions example (with 256 tokens per frame, 16x16 patch grid):
-# patch_positions[0, 0:4, :] -> [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]  # Frame 0 (t=0), first 4 patches
-# patch_positions[0, 256:260, :] -> [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]  # Frame 1 (t=4, since 16 frames map to 64 positions)
-# Each [t, h, w] represents: t=temporal frame index (0-63), h=row in patch grid, w=column in patch grid
+frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()  # [T]
+grid_h, grid_w = video.shape[-2] // patch_size, video.shape[-1] // patch_size  # patch grid
+frame_tokens = grid_h * grid_w
+
+t_positions = frame_pos[:, None].repeat(1, frame_tokens).reshape(-1)  # [T * frame_tokens]
+h_positions = torch.arange(grid_h, device="cuda").repeat_interleave(grid_w)
+h_positions = h_positions.repeat(num_frames)  # [T * frame_tokens]
+w_positions = torch.arange(grid_w, device="cuda").repeat(grid_h)
+w_positions = w_positions.repeat(num_frames)  # [T * frame_tokens]
+
+patch_positions = torch.stack([t_positions, h_positions, w_positions], dim=-1).unsqueeze(0)
+# patch_positions example (256 tokens per frame, 16x16 patch grid):
+# Each row is [t, h, w].
+# First 4 patches of frame 0 (t=0):
+# patch_positions[0, 0:4, :] -> [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
+# First 4 patches of frame 1 (t=4):
+# patch_positions[0, 256:260, :] -> [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]
 
 with torch.no_grad():
     outputs = model(video, patch_positions=patch_positions)
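The hunk above derives the patch grid from the input tensor at runtime. The resulting [t, h, w] layout can be checked with a small NumPy sketch that mirrors the torch logic under the README's assumed sizes (224×224 frames, patch size 14, 16 frames mapped onto 64 temporal slots); this is an illustrative check, not part of the repository:

```python
import numpy as np

# Assumed sizes from the README example: 16 input frames spread over 64
# temporal positions, 224x224 frames, patch size 14 -> a 16x16 patch grid.
num_frames, target_frames = 16, 64
grid_h = grid_w = 224 // 14          # 16 patches per side
frame_tokens = grid_h * grid_w       # 256 tokens per frame

# Temporal index per frame: linspace then truncate, as in the torch code.
frame_pos = np.linspace(0, target_frames - 1, num_frames).astype(np.int64)

t = np.repeat(frame_pos, frame_tokens)                          # frame index per patch
h = np.tile(np.repeat(np.arange(grid_h), grid_w), num_frames)   # row in patch grid
w = np.tile(np.tile(np.arange(grid_w), grid_h), num_frames)     # column in patch grid
patch_positions = np.stack([t, h, w], axis=-1)[None]            # [1, T * frame_tokens, 3]

print(patch_positions[0, 0:4].tolist())      # [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
print(patch_positions[0, 256:260].tolist())  # [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]
```

The same indexing reproduces the example rows quoted in the diff's comments, which makes it a quick sanity check before trying a non-square or non-default grid.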
@@ -242,7 +232,6 @@ pip install -r requirements.txt
 
 ### Option 2 (Docker)
 
-
 ```bash
 docker build -t onevision-encoder:2601 .
 
@@ -252,7 +241,6 @@ docker run -it --rm --gpus all --ipc host --net host --privileged \
   onevision-encoder:2601 bash
 ```
 
-
 ### Install Package
 
 Inside the container, install the package in editable mode:
@@ -283,21 +271,18 @@ git clone https://huggingface.co/lmms-lab-encoder/onevision-encoder-large-si
 
 Download the pretraining data and prepare the data directory as per the instructions in `data/README.md`.
 
-
 More documentation will be added soon.
 
 ```bash
 bash shells/ov_encoder_large_stage2_residual_8gpus.sh
 ```
 
-
 Training configurations and hyperparameters will be documented soon. For now, please refer to `--help` for available options.
 
 ## 📊 Evaluation
 
 ### Attentive Probe Evaluation
 
-
 #### Chunk-wise Sampling Evaluation
 
 To evaluate the encoder with uniform frame sampling, first navigate to the evaluation directory:
@@ -314,6 +299,7 @@ bash shells_eval_ap/eval_ov_encoder_large_16frames.sh
 ```
 
 **Sampling-Specific Parameters:**
+
 - `frames_token_num`: Number of tokens per frame (e.g., 256 tokens for standard sampling).
 
 #### OV-Encoder Codec Evaluation
@@ -330,27 +316,6 @@ Then run the following command:
 bash shells_eval_ap/eval_ov_encoder_large_2kpatches_codec.sh
 ```
 
-**Codec-Specific Parameters:**
-- `K_keep`: Number of patches to keep.
-- `cache_dir` (optional): Directory for cached codec patches. Use this to specify where codec-selected patches are stored/loaded when you want to persist or reuse them.
-
-#### Shared Parameters
-
-The following parameters are common to both evaluation methods:
-
-- `dataset`: Dataset to evaluate on (e.g., `diving48`, `ssv2`, `kinetics400`). Prepare the dataset according to the Attentive Probe format.
-- `num_frames`: Total number of frames in the video sequence (e.g., 8 for sampling, 64 for codec).
-- `model_weight`: Path to the pre-trained model. Use `lmms-lab-encoder/onevision-encoder-large` to load directly from HuggingFace, or provide a local path.
-- `model_name`: Model architecture name (e.g., `hf_llava_vit_large_ln`).
-- `embedding_size`: Size of the embedding dimension (e.g., 1024).
-- `batch_size`: Training batch size (varies by evaluation type).
-- `default_lr_list`: Learning rate for the probe training.
-- `default_weight_decay`: Weight decay for optimization.
-- `eval_freq`: Evaluation frequency during training.
-- `dali_py_num_workers`: Number of DALI data loading workers.
-- `data_root`: Root directory containing your prepared dataset (codec evaluation only).
-
-
 ## 👥 Contributors
 
 <!-- Add contributor list here -->

dataloader/ap_dataloader_dali.py

Lines changed: 33 additions & 25 deletions
@@ -1,8 +1,3 @@
-#
-# Created by anxiangsir
-# Date: 2025-11-13 12:26:36 (UTC)
-#
-
 import os
 import warnings
 from typing import Any, Dict, List, Tuple
@@ -13,14 +8,16 @@
 import nvidia.dali.types as types
 from nvidia.dali.pipeline import pipeline_def
 from nvidia.dali.plugin.pytorch import DALIGenericIterator
+
+
 try:
     import cv2
+
     _HAS_CV2 = True
 except ImportError:
     _HAS_CV2 = False
 
 
-
 # ----------------------------------------------------------------------------
 # 1. DALI Iterator Wrapper (modified - returns indices, total_frames and file_name)
 # ----------------------------------------------------------------------------
@@ -47,13 +44,14 @@ def __len__(self) -> int:
     def reset(self):
         self.iter.reset()
 
+
 # ----------------------------------------------------------------------------
 # 2. DALI External Source for Video Data (modified - returns indices, total_frames and file_name)
 # ----------------------------------------------------------------------------
 class VideoExternalSource:
     def __init__(self, mode: str, source_params: Dict[str, Any]):
         self.mode = mode
-        self.file_list: List[Tuple[str, int]] = source_params["file_list"] 
+        self.file_list: List[Tuple[str, int]] = source_params["file_list"]
         self.num_shards: int = source_params["num_shards"]
         self.shard_id: int = source_params["shard_id"]
         self.batch_size: int = source_params["batch_size"]
@@ -73,7 +71,6 @@ def __init__(self, mode: str, source_params: Dict[str, Any]):
         self.fallback_example = self.file_list[0] if self.file_list else ("", 0)
 
     def _get_frame_indices(self, num_frames: int) -> List[int]:
-
         if num_frames < self.sequence_length:
             indices = list(range(num_frames))
             indices += [num_frames - 1] * (self.sequence_length - num_frames)
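The short-video branch shown in this hunk pads the index list by repeating the final frame until the requested sequence length is reached. A standalone sketch of just that padding rule (a hypothetical extraction for illustration, not the repository's full `_get_frame_indices`):

```python
from typing import List

def pad_frame_indices(num_frames: int, sequence_length: int) -> List[int]:
    """Short-video rule: keep every available frame, then repeat the
    last frame index until the sequence is full."""
    indices = list(range(num_frames))
    indices += [num_frames - 1] * (sequence_length - num_frames)
    return indices

print(pad_frame_indices(5, 8))  # [0, 1, 2, 3, 4, 4, 4, 4]
```

Repeating the last frame keeps the output shape fixed for batching while leaving the temporal order of real frames intact.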
@@ -110,7 +107,8 @@ def __call__(self, sample_info) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.
         except Exception as e:
             warnings.warn(f"Failed to load video: {video_path}, error: {e}. Using fallback.")
             fallback_path, _ = self.fallback_example
-            if not fallback_path: raise IOError(f"Fallback video path is empty!")
+            if not fallback_path:
+                raise IOError(f"Fallback video path is empty!")
             video_data, frame_indices, total_frames = self._load_video_data(fallback_path)
 
         return video_data, np.int64([int(video_label)]), frame_indices, np.int64([total_frames])
@@ -198,7 +196,7 @@ def dali_video_pipeline(mode: str, source_params: Dict[str, Any]):
         batch=False,
         parallel=True,
         dtype=[types.UINT8, types.INT64, types.INT64, types.INT64],
-        layout=["FHWC", "C", "C", "C"]  # Empty layout for file_name byte array (variable length)
+        layout=["FHWC", "C", "C", "C"],  # Empty layout for file_name byte array (variable length)
     )
 
     videos = videos.gpu()
@@ -209,6 +207,7 @@ def dali_video_pipeline(mode: str, source_params: Dict[str, Any]):
     videos = preprocess_videos(videos, mode, input_size, mean, std)
     return videos, labels, indices, total_frames
 
+
 # ----------------------------------------------------------------------------
 # 4. Main Dataloader Function (modified - output_map adds indices, total_frames and file_name)
 # ----------------------------------------------------------------------------
@@ -229,8 +228,7 @@ def get_dali_dataloader(
     seed: int = 0,
     feature_extract: bool = True,
 ) -> DALIWarper:
-    """
-    """
+    """ """
     print(f"[{mode} loader] Reading for: {data_csv_path}")
     file_list = []
     try:
@@ -249,26 +247,36 @@ def get_dali_dataloader(
     world_size = int(os.getenv("WORLD_SIZE", "1"))
 
     source_params = {
-        "num_shards": world_size, "shard_id": rank, "file_list": file_list,
-        "batch_size": batch_size, "sequence_length": sequence_length, "seed": seed + rank,
-        "use_rgb": use_rgb, "input_size": input_size, "short_side_size": short_side_size,
-        "mean": mean, "std": std,
-        "decord_num_threads": decord_num_threads, "feature_extract": feature_extract
+        "num_shards": world_size,
+        "shard_id": rank,
+        "file_list": file_list,
+        "batch_size": batch_size,
+        "sequence_length": sequence_length,
+        "seed": seed + rank,
+        "use_rgb": use_rgb,
+        "input_size": input_size,
+        "short_side_size": short_side_size,
+        "mean": mean,
+        "std": std,
+        "decord_num_threads": decord_num_threads,
+        "feature_extract": feature_extract,
     }
 
     pipe = dali_video_pipeline(
-        batch_size=batch_size, num_threads=dali_num_threads, device_id=local_rank,
-        seed=seed + rank, py_num_workers=dali_py_num_workers, py_start_method="forkserver",
-        prefetch_queue_depth=2, mode=mode, source_params=source_params,
+        batch_size=batch_size,
+        num_threads=dali_num_threads,
+        device_id=local_rank,
+        seed=seed + rank,
+        py_num_workers=dali_py_num_workers,
+        py_start_method="forkserver",
+        prefetch_queue_depth=2,
+        mode=mode,
+        source_params=source_params,
     )
     pipe.build()
 
     # ===> output_map adds "indices", "total_frames" and "file_name" <===
-    dali_iter = DALIGenericIterator(
-        pipelines=[pipe],
-        output_map=["videos", "labels", "indices", "total_frames"],
-        auto_reset=True
-    )
+    dali_iter = DALIGenericIterator(pipelines=[pipe], output_map=["videos", "labels", "indices", "total_frames"], auto_reset=True)
     steps_per_epoch = len(file_list) // world_size // batch_size
     dataloader = DALIWarper(dali_iter=dali_iter, steps_per_epoch=steps_per_epoch)
 
dataloader/data_v2_ocr.py

Lines changed: 0 additions & 24 deletions
@@ -204,27 +204,3 @@ def dali_dataloader(
         label_select=label_select,
     )
     return dataloader
-
-
-if __name__ == "__main__":
-    import cv2
-    import numpy as np
-
-    loader = dali_dataloader(
-        "/data_4/coyo_ocr_v0/train_00", 4, [336, 336], workers=4, is_training=True, mean=[0, 0, 0], std=[1, 1, 1], label_select=None, seed=1437, num_shards=None, shard_id=None, max_side=336
-    )
-
-    image_list = []
-    big_img = np.zeros((3360, 3360, 3), dtype=np.uint8)
-    for image, label in loader:
-        print(label)
-        image = image[0].permute(1, 2, 0).cpu().numpy()
-        image_list.append(image)
-        if len(image_list) == 100:
-            break
-
-    for i in range(100):
-        row = i // 10
-        col = i % 10
-        big_img[row * 336 : (row + 1) * 336, col * 336 : (col + 1) * 336] = image_list[i]
-    cv2.imwrite("output_big_image.jpg", big_img[:, :, ::-1])

dataset/dataset_onevision_encoder.py

Lines changed: 2 additions & 3 deletions
@@ -142,8 +142,8 @@ def onevision_encoder_si_cfs_single_node():
     WARNING: This is NOT a recommended approach as it can cause severe data imbalance.
     """
     patterns = [
-        "/datasets_ov_encoder/coyo400m/*.rec",
-        "/datasets_ov_encoder/laion260m/*.rec",
+        "datasets_ov_encoder/coyo400m/*.rec",
+        "datasets_ov_encoder/laion260m/*.rec",
     ]
 
     all_files = [f for pattern in patterns for f in glob.glob(pattern)]
@@ -277,7 +277,6 @@ def onevision_encoder_video_codec():
     """
     assert world_size <= 128
     list_mp4_label_path = f"train_how_to_100m_panda70m_k710_square_with_index_filtered_split_128/part_{rank:03d}"
-
     return Property(
         name="onevision_encoder_video_codec",
         prefixes=[list_mp4_label_path],
