feat: update Docker setup and dependencies for OneVision Encoder

anxiangsir · anxiangsir · commit 1caf68dbb033 · 2025-12-29T21:00:15.000+08:00
- Rename Docker image to `onevision-encoder:2601`
- Add `--rm` flag to docker run command for cleaner cleanup
- Refactor Dockerfile to improve layer caching and reduce size:
  - Install system deps and static ffmpeg binary
  - Use requirements.txt for Python dependencies
  - Set environment variables for better container behavior
- Add `torchmetrics` to requirements.txt
- Minor formatting fixes in README.md
diff --git a/README.md b/README.md
@@ -35,7 +35,7 @@
 
 ## 🔍 Introduction
 
-Video understanding models face a fundamental trade-off: incorporating more frames enables richer temporal reasoning but increases computational cost quadratically. 
+Video understanding models face a fundamental trade-off: incorporating more frames enables richer temporal reasoning but increases computational cost quadratically.
 Conventional approaches mitigate this by sparsely sampling frames, however, this strategy discards fine-grained motion dynamics and treats all spatial regions uniformly, resulting in wasted computation on static content.
 
 We introduce OneVision Encoder, a vision transformer that resolves this trade-off by drawing inspiration from HEVC (High-Efficiency Video Coding). Rather than densely processing all patches from a few frames, OneVision Encoder sparsely selects informative patches from many frames. This codec-inspired patch selection mechanism identifies temporally salient regions (e.g., motion, object interactions, and semantic changes) and allocates computation exclusively to these informative areas.
@@ -64,7 +64,7 @@ Coupled with global contrastive learning over a 2M-scale concept memory bank, On
 
 ### Video Processing Pipeline
 
-The visualization below illustrates four different video processing pipelines. 
+The visualization below illustrates four different video processing pipelines.
 (1) Original Video: a continuous 64-frame sequence that preserves the complete temporal context.
 (2) Uniform Frame Sampling: a conventional strategy that selects 4–8 evenly spaced frames; while simple and efficient, it is inherently lossy and fails to capture fine-grained inter-frame motion.
 (3) Temporal Saliency Detection: a global analysis of all 64 frames to identify regions rich in temporal information, including motion patterns, appearance variations, and semantic events.
@@ -208,14 +208,14 @@ More documentation will be added soon.
 
 
 ```bash
-docker build -t ov_encoder:25.12 .
+docker build -t onevision-encoder:2601 .
 ```
 
 ```bash
-docker run -it --gpus all --ipc host --net host --privileged \
+docker run -it --rm --gpus all --ipc host --net host --privileged \
     -v "$(pwd)":/workspace/OneVision-Encoder \
     -w /workspace/OneVision-Encoder \
-    ov_encoder:25.12 bash
+    onevision-encoder:2601 bash
 ```
 
 
diff --git a/dockerfile b/dockerfile
@@ -1,21 +1,35 @@
 FROM pytorch/pytorch:2.7.0-cuda11.8-cudnn9-runtime
 
-# Set up environment variables (optional, but often recommended)
-ENV DEBIAN_FRONTEND=noninteractive
+# Set up environment variables
+ENV DEBIAN_FRONTEND=noninteractive \
+    PYTHONUNBUFFERED=1 \
+    PIP_NO_CACHE_DIR=1
 
-# Install Python packages
-RUN pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com --upgrade nvidia-dali-cuda110 \
-    && pip install --no-cache-dir decord==0.6.0 \
-    && pip install --no-cache-dir timm \
-    && pip install --no-cache-dir transformers==4.53.1 \
-    && pip install --no-cache-dir tensorboard \
-    && pip install --no-cache-dir easydict
+# Install system dependencies and ffmpeg in one layer
+RUN set -eux; \
+    apt-get update && apt-get install -y --no-install-recommends \
+        curl \
+        ca-certificates \
+        xz-utils \
+    && rm -rf /var/lib/apt/lists/* \
+    && cd /tmp \
+    && curl -L https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz -o ffmpeg.tar.xz \
+    && tar -xf ffmpeg.tar.xz \
+    && cd ffmpeg-*-static \
+    && install -m 0755 ffmpeg /usr/local/bin/ffmpeg \
+    && install -m 0755 ffprobe /usr/local/bin/ffprobe \
+    && cd / \
+    && rm -rf /tmp/ffmpeg* \
+    && ffprobe -version
 
-# (Optional) Set working directory
-WORKDIR /workspace
+# Copy requirements file first (for better caching)
+COPY requirements.txt /tmp/requirements.txt
 
-# (Optional) Copy your code into the container
-# COPY . /workspace
+# Install Python packages in optimized order
+RUN pip install --no-cache-dir --upgrade pip setuptools wheel && \
+    pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-dali-cuda110 && \
+    pip install --no-cache-dir -r /tmp/requirements.txt && \
+    rm /tmp/requirements.txt
 
-# (Optional) Set entrypoint or CMD
-# CMD ["python", "your_script.py"]
+# Set default command
+CMD ["/bin/bash"]
diff --git a/requirements.txt b/requirements.txt
@@ -3,4 +3,5 @@ easydict
 huggingface-hub>=0.23.2,<1.0
 tensorboard
 timm
-transformers==4.53.1
+transformers==4.53.1
+torchmetrics