
Commit ca3dbe6

[restructure] lots of updates
1 parent b4fbc1a commit ca3dbe6

76 files changed

Lines changed: 551 additions & 4330 deletions


README.md

Lines changed: 24 additions & 27 deletions
@@ -42,19 +42,6 @@ Combined with global contrastive learning using a 2M concept bank, OneVision Enc
 </picture>
 </p>
 
-### Cluster Discrimination Visualization
-
-Standard contrastive learning (e.g., CLIP) is limited by batch size—negative samples are drawn only from the current batch, typically 32K-64K examples. This creates a narrow view of the embedding space and leads to suboptimal representations. Our approach maintains a global concept bank of 2M clustered centers, enabling each training sample to contrast against a diverse, representative set of negatives regardless of batch composition. This produces more discriminative embeddings with better-separated semantic clusters.
-
-
-<p align="center">
-<picture>
-<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/loss_github_dark.gif">
-<source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/loss_github_light.gif">
-<img alt="Training Loss Visualization" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/loss_github_light.gif" width="800" style="max-width: 100%;">
-</picture>
-</p>
-
 ### Video Processing Pipeline
 
 The visualization below demonstrates our complete video processing pipeline. The animation shows four key stages: (1) Original Video - a continuous 64-frame stream capturing the full temporal context, (2) Uniform Frame Sampling - traditional approach selecting 4-8 evenly-spaced frames, which is simple but lossy and misses inter-frame motion, (3) Temporal Saliency Detection - analysis of all 64 frames to identify regions with high temporal information such as motion, appearance changes, and semantic events, and (4) Codec-Style Patch Extraction - extraction of only the salient patches in zigzag order, achieving 75-98% compression while preserving temporal dynamics.
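Stage (2) of the pipeline, uniform frame sampling, is simple enough to sketch directly. The snippet below is illustrative only (it is not code from this commit); it uses the 64-frame clip and the 8-frame budget mentioned in the paragraph above, and picks the midpoint of each equal segment, a common convention for uniform sampling:

```python
def uniform_sample_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples evenly spaced frame indices from a clip.

    Uses the midpoint of each of num_samples equal segments.
    """
    segment = num_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]

# 8 evenly spaced frames out of a 64-frame clip
print(uniform_sample_indices(64, 8))  # [4, 12, 20, 28, 36, 44, 52, 60]
```

Note how this illustrates the lossiness the paragraph describes: any motion between index 4 and index 12 is simply never seen by the model.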
@@ -72,6 +59,19 @@ The visualization below demonstrates our complete video processing pipeline. The
 </tr>
 </table>
 
+### Cluster Discrimination Visualization
+
+Standard contrastive learning (e.g., CLIP) is limited by batch size—negative samples are drawn only from the current batch, typically 32K-64K examples. This creates a narrow view of the embedding space and leads to suboptimal representations. Our approach maintains a global concept bank of 2M clustered centers, enabling each training sample to contrast against a diverse, representative set of negatives regardless of batch composition. This produces more discriminative embeddings with better-separated semantic clusters.
+
+
+<p align="center">
+<picture>
+<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/loss_github_dark.gif">
+<source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/loss_github_light.gif">
+<img alt="Training Loss Visualization" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/loss_github_light.gif" width="800" style="max-width: 100%;">
+</picture>
+</p>
+
 ### Pre-training Tips
 
 1. **Scale-up is the final step** - Maximize model capabilities before scaling, and ensure generalization phenomena emerge
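The "Cluster Discrimination Visualization" section above describes contrasting each sample against a global bank of cluster centers instead of in-batch negatives. A toy NumPy sketch of such a loss follows (illustrative only; the 100-center bank, the 16-dim embeddings, and the function name are all assumptions here, while the README's actual bank holds 2M centers):

```python
import numpy as np

def cluster_discrimination_loss(features, labels, centers, temperature=0.07):
    """Cross-entropy of each embedding against every center in the concept bank."""
    logits = features @ centers.T / temperature          # (B, K) similarity logits
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
centers = rng.normal(size=(100, 16))          # toy bank: 100 centers, 16-dim
centers /= np.linalg.norm(centers, axis=1, keepdims=True)
labels = np.array([3, 7, 7, 42])              # assigned concept per sample

feats = rng.normal(size=(4, 16))              # random embeddings: hard to classify
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
print(cluster_discrimination_loss(feats, labels, centers))

# Embeddings sitting exactly on their assigned centers give a much smaller loss.
print(cluster_discrimination_loss(centers[labels], labels, centers))
```

The key point mirrored from the text: the set of negatives is the whole bank `centers`, so its size is independent of the batch that produced `features`.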
@@ -149,23 +149,20 @@ docker run -it --gpus all --ipc host --net host --privileged \
 docker run -it --gpus all --ipc host --net host --privileged --cap-add IPC_LOCK \
 --ulimit memlock=-1 --ulimit stack=67108864 --rm \
 -v "$(pwd)":/workspace/OneVision-Encoder \
--v /train_tmp:/train_tmp \
--v /vlm:/vlm -v /video_vit:/video_vit -v /rice_ocr:/rice_ocr \
--v /data_0:/data_0 -v /data_1:/data_1 -v /data_2:/data_2 -v /data_3:/data_3 \
 -w /workspace/OneVision-Encoder \
 -e NCCL_TIMEOUT=1800 \
 -e CUDA_DEVICE_MAX_CONNECTIONS=1 \
--e NCCL_SOCKET_IFNAME=eth0 \
--e NCCL_IB_GID_INDEX=3 \
--e NCCL_IB_DISABLE=0 \
--e NCCL_IB_HCA="mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_1" \
--e NCCL_NET_GDR_LEVEL=2 \
--e NCCL_IB_QPS_PER_CONNECTION=4 \
--e NCCL_IB_TC=160 \
--e NCCL_IB_TIMEOUT=22 \
--e NCCL_CROSS_NIC=1 \
--e NCCL_MIN_NCHANNELS=8 \
--e NCCL_MAX_NCHANNELS=16 \
+-e NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-eth0} \
+-e NCCL_IB_GID_INDEX=${NCCL_IB_GID_INDEX:-3} \
+-e NCCL_IB_DISABLE=${NCCL_IB_DISABLE:-0} \
+-e NCCL_IB_HCA="${NCCL_IB_HCA:-mlx5_0}" \
+-e NCCL_NET_GDR_LEVEL=${NCCL_NET_GDR_LEVEL:-2} \
+-e NCCL_IB_QPS_PER_CONNECTION=${NCCL_IB_QPS_PER_CONNECTION:-4} \
+-e NCCL_IB_TC=${NCCL_IB_TC:-160} \
+-e NCCL_IB_TIMEOUT=${NCCL_IB_TIMEOUT:-22} \
+-e NCCL_CROSS_NIC=${NCCL_CROSS_NIC:-1} \
+-e NCCL_MIN_NCHANNELS=${NCCL_MIN_NCHANNELS:-8} \
+-e NCCL_MAX_NCHANNELS=${NCCL_MAX_NCHANNELS:-16} \
 llava_vit:25.11.22 bash -c "service ssh restart; bash"
 ```
 
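The rewritten `-e` flags rely on shell parameter expansion: `${NCCL_IB_TC:-160}` takes the caller's value when the variable is set and non-empty, and falls back to `160` otherwise. A small Python sketch of that resolution rule (a hypothetical helper, purely for illustration):

```python
import os

def resolve(name: str, default: str) -> str:
    """Mimic shell ${NAME:-default}: fall back when NAME is unset *or* empty."""
    value = os.environ.get(name)
    return value if value else default

os.environ["NCCL_IB_TC"] = "128"       # caller override
os.environ["NCCL_SOCKET_IFNAME"] = ""  # empty string: ':-' still falls back
print(resolve("NCCL_IB_TC", "160"))           # 128
print(resolve("NCCL_SOCKET_IFNAME", "eth0"))  # eth0
```

The empty-string case is the reason the diff uses `:-` rather than `-`; with plain `${VAR-default}` an exported-but-empty variable would pass through as empty.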
dataloader/data_decord_video.py

Lines changed: 2 additions & 2 deletions
@@ -458,7 +458,7 @@ def _save_image(img_hwc_uint8, path):
 def main():
     parser = argparse.ArgumentParser(description="Quick DALI dataloader visual check")
     group = parser.add_mutually_exclusive_group(required=False)
-    group.add_argument("--file-list", type=str, default="/video_vit/train_UniViT/mp4_list.txt")
+    group.add_argument("--file-list", type=str, default="${DATA_DIR}")
     parser.add_argument("--outdir", type=str, default="./dali_debug_out", help="Output dir to save frames")
     parser.add_argument("--batch-size", type=int, default=1)
     parser.add_argument("--sequence-length", type=int, default=8)

@@ -479,7 +479,7 @@ def main():
     if not file_list:
         raise SystemExit("file_list is empty")
 
-    labels = np.load("/video_vit/train_UniViT/list_merged.npy")
+    labels = np.load("/video_vit/list_merged.npy")
 
     # Important: set num_shards=1, shard_id=0 to avoid None handling issues in source_params
     dataloader = dali_dataloader(
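One caveat with the new default: argparse stores `${DATA_DIR}` as a literal string and performs no shell expansion of its own. If the intent is to read the path from the environment, `os.path.expandvars` is one way to resolve it at runtime (a sketch with an assumed example value):

```python
import os

os.environ["DATA_DIR"] = "/tmp/mp4_list.txt"  # assumed example value

raw_default = "${DATA_DIR}"                   # what argparse stores, verbatim
resolved = os.path.expandvars(raw_default)
print(resolved)                               # /tmp/mp4_list.txt

# Variables that are not set are left untouched, so a missing env is easy to spot:
print(os.path.expandvars("${SURELY_UNSET_VAR}"))  # ${SURELY_UNSET_VAR}
```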

dataloader/data_decord_video_fix_ip_fix_size.py

Lines changed: 6 additions & 6 deletions
@@ -127,9 +127,9 @@ def sparse_sampling_get_frameid_data(
         return video_data
 
     def get_label_and_visible_indices(self, video_path):
-        # label: /video_vit/dataset/clips_square_aug_k710_ssv2/6/3/rank_068_sample_0000145663_label.npy
-        # video: /video_vit/dataset/clips_square_aug_k710_ssv2_hevc_v2/6/3/rank_068_sample_0000145663.mp4
-        # residual: /video_vit/dataset/clips_square_aug_k710_ssv2_hevc_v2_residual/6/3/rank_068_sample_0000145663.visidx.npy
+        # /path/to/data/...
+        # /path/to/data/...
+        # /path/to/data/...
 
         label_path = video_path.replace("clips_square_aug_k710_ssv2_hevc_v2", "clips_square_aug_k710_ssv2")
         label_path = label_path.replace(".mp4", "_label.npy")

@@ -161,9 +161,9 @@ def __call__(self, sample_info):
 
         video_path = self.file_list[sample_idx]
 
-        # label: /video_vit/dataset/clips_square_aug_k710_ssv2/6/3/rank_068_sample_0000145663_label.npy
-        # video: /video_vit/dataset/clips_square_aug_k710_ssv2_hevc_v2/6/3/rank_068_sample_0000145663.mp4
-        # residual: /video_vit/dataset/clips_square_aug_k710_ssv2_hevc_v2_residual/6/3/rank_068_sample_0000145663.visidx.npy
+        # /path/to/data/...
+        # /path/to/data/...
+        # /path/to/data/...
         test_info = None
         # print(video_path)
         try:
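The `replace()` chain kept in context above derives the label path from the video path by swapping the dataset directory and then the file suffix. A self-contained sketch of that mapping (the example path is generic, since the commit redacts the real dataset paths):

```python
def video_to_label_path(video_path: str) -> str:
    """Swap the dataset directory, then the suffix, mirroring the replace() chain."""
    label_path = video_path.replace(
        "clips_square_aug_k710_ssv2_hevc_v2", "clips_square_aug_k710_ssv2"
    )
    return label_path.replace(".mp4", "_label.npy")

print(video_to_label_path("/data/clips_square_aug_k710_ssv2_hevc_v2/6/3/sample_001.mp4"))
# /data/clips_square_aug_k710_ssv2/6/3/sample_001_label.npy
```

Because the `_hevc_v2` directory name contains the plain name as a prefix, the order of the two `replace()` calls does not matter here, but relying on exact substring matches like this is brittle if the directory layout changes.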

dataloader/data_decord_video_fix_ip_fix_size_residual_mv.py

Lines changed: 6 additions & 6 deletions
@@ -104,9 +104,9 @@ def sparse_sampling_get_frameid_data(
         return video_data
 
     def get_label_and_visible_indices(self, video_path):
-        # label: /video_vit/dataset/clips_square_aug_k710_ssv2/6/3/rank_068_sample_0000145663_label.npy
-        # video: /video_vit/dataset/clips_square_aug_k710_ssv2_hevc_v2/6/3/rank_068_sample_0000145663.mp4
-        # residual: /video_vit/dataset/clips_square_aug_k710_ssv2_hevc_v2_residual/6/3/rank_068_sample_0000145663.visidx.npy
+        # /path/to/data/...
+        # /path/to/data/...
+        # /path/to/data/...
 
         label_path = video_path.replace("clips_square_aug_k710_ssv2_hevc_v2", "clips_square_aug_k710_ssv2")
         label_path = label_path.replace(".mp4", "_label.npy")

@@ -138,9 +138,9 @@ def __call__(self, sample_info):
 
         video_path = self.file_list[sample_idx]
 
-        # label: /video_vit/dataset/clips_square_aug_k710_ssv2/6/3/rank_068_sample_0000145663_label.npy
-        # video: /video_vit/dataset/clips_square_aug_k710_ssv2_hevc_v2/6/3/rank_068_sample_0000145663.mp4
-        # residual: /video_vit/dataset/clips_square_aug_k710_ssv2_hevc_v2_residual/6/3/rank_068_sample_0000145663.visidx.npy
+        # /path/to/data/...
+        # /path/to/data/...
+        # /path/to/data/...
         test_info = None
         # print(video_path)
 
