Commit 9a390a5

chore: optimize example docker files using multi-stage builds (#232)
Signed-off-by: sapkota-aayush <aayushsapkota1030@gmail.com>
1 parent 42f9fbd commit 9a390a5

28 files changed

Lines changed: 985 additions & 630 deletions

docs/DOCKER_OPTIMIZATION.md

Lines changed: 229 additions & 0 deletions
@@ -0,0 +1,229 @@
# Docker Build Optimization for NumaFlow Python UDFs

## Overview

This document outlines the optimization strategies to reduce Docker build times for NumaFlow Python UDFs from 2+ minutes to under 30 seconds for subsequent builds.

## Current Issues

1. **Redundant dependency installation**: Each UDF rebuilds the entire pynumaflow package
2. **No layer caching**: Dependencies are reinstalled every time
3. **Copying entire project**: The `COPY ./ ./` instruction copies everything, including unnecessary files
4. **No shared base layers**: Each UDF builds its own base environment

## Optimization Strategy: Three-Stage Approach

As suggested by @kohlisid, we implement a three-stage build approach:

### Stage 1: Base Layer
- Common Python environment and tools
- System dependencies (curl, wget, build-essential, git)
- Poetry installation
- dumb-init binary

### Stage 2: Environment Setup
- pynumaflow package installation
- Shared virtual environment creation
- This layer is cached unless `pyproject.toml` or `poetry.lock` changes

### Stage 3: Builder
- UDF-specific code and dependencies
- Reuses the pynumaflow installation from Stage 2
- Minimal additional dependencies

## Implementation Options

### Option 1: Optimized Multi-Stage Build (Recommended)

**File**: `examples/map/even_odd/Dockerfile.optimized`

**Benefits**:
- Better layer caching
- Reduced build time by ~60-70%
- No external dependencies

**Usage**:
```bash
cd examples/map/even_odd
make -f Makefile.optimized image
```

### Option 2: Shared Base Image (Fastest)

**Files**:
- `Dockerfile.base` (shared base image)
- `examples/map/even_odd/Dockerfile.shared-base` (UDF-specific)

**Benefits**:
- Maximum caching efficiency
- Build time reduced by ~80-90% for subsequent builds
- Perfect for CI/CD pipelines

**Usage**:
```bash
# Build base image once
docker build -f Dockerfile.base -t numaflow-python-base .

# Build UDF images (very fast)
cd examples/map/even_odd
make -f Makefile.optimized image-fast
```

## Performance Comparison

| Approach | First Build | Subsequent Builds | Cache Efficiency |
|----------|-------------|-------------------|------------------|
| Current | ~2-3 minutes | ~2-3 minutes | Poor |
| Optimized Multi-Stage | ~2-3 minutes | ~45-60 seconds | Good |
| Shared Base Image | ~2-3 minutes | ~15-30 seconds | Excellent |
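
The percentage reductions quoted under Implementation Options can be cross-checked against the timings in this table. The sketch below does the arithmetic for the boundary values of each range; `reduction_pct` is a hypothetical helper, not part of the repository, and it uses integer arithmetic, so results are rounded down.

```bash
# reduction_pct BEFORE AFTER
# Prints the percentage reduction in build time (both values in seconds),
# rounded down to a whole number.
reduction_pct() {
    before=$1
    after=$2
    echo $(( (before - after) * 100 / before ))
}

# Optimized multi-stage: 2-3 minute baseline vs 45-60 second rebuilds.
reduction_pct 120 45   # prints 62
reduction_pct 180 60   # prints 66

# Shared base image: 2-3 minute baseline vs 15-30 second rebuilds.
reduction_pct 120 15   # prints 87
reduction_pct 180 30   # prints 83
```

These values sit within or near the ~60-70% and ~80-90% ranges stated earlier, so the two sets of figures are consistent with each other.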

## Implementation Steps

### 1. Build Shared Base Image (One-time setup)

```bash
# From project root
docker build -f Dockerfile.base -t numaflow-python-base .
```
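
Because the shared-base UDF builds fail when the base image is missing, it can help to guard the one-time build behind a check. The following is a minimal sketch, assuming the tag and file names used in this document; `ensure_base_image` is a hypothetical helper and is not provided by the repository's Makefiles.

```bash
# ensure_base_image [TAG] [DOCKERFILE]
# Builds the shared base image only when no local image carries TAG.
# Defaults mirror the names used in this document.
ensure_base_image() {
    tag=${1:-numaflow-python-base}
    dockerfile=${2:-Dockerfile.base}
    if docker image inspect "$tag" >/dev/null 2>&1; then
        echo "base image '$tag' already present, skipping build"
    else
        echo "base image '$tag' not found, building from $dockerfile"
        docker build -f "$dockerfile" -t "$tag" .
    fi
}
```

Run `ensure_base_image` from the project root before `make image-fast`. The check only looks at whether the tag exists, so it will not notice a base image that is out of date; rebuild it explicitly after changing `pyproject.toml` or `poetry.lock`.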

### 2. Update UDF Dockerfiles

Replace the current Dockerfile with the optimized version:

```bash
# For each UDF directory
cp Dockerfile.optimized Dockerfile
# or
cp Dockerfile.shared-base Dockerfile
```

### 3. Update Makefiles

Use the optimized Makefile:

```bash
# For each UDF directory
cp Makefile.optimized Makefile
```

### 4. CI/CD Integration

For CI/CD pipelines, add the base image build step:

```yaml
# Example GitHub Actions step
- name: Build base image
  run: docker build -f Dockerfile.base -t numaflow-python-base .

- name: Build UDF images
  run: |
    cd examples/map/even_odd
    make image-fast
```

## Advanced Optimizations

### 1. Dependency Caching

The optimized Dockerfiles implement smart dependency caching:
- `pyproject.toml` and `poetry.lock` are copied first
- pynumaflow installation is cached separately
- UDF-specific dependencies are installed last
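
The reason this ordering helps is that the dependency layer depends only on the manifest files, not on UDF source code. The sketch below illustrates that idea with a checksum over the two manifests. It is an illustration of the principle, not Docker's internal cache-key algorithm, and `deps_key` and the file contents are made up for the demonstration.

```bash
# deps_key DIR
# Prints a checksum (CRC and byte count) over the two dependency manifests in DIR.
deps_key() {
    cat "$1/pyproject.toml" "$1/poetry.lock" | cksum
}

# Demonstration in a throwaway directory with placeholder file contents.
demo=$(mktemp -d)
printf '[tool.poetry]\nname = "demo"\n' > "$demo/pyproject.toml"
printf 'placeholder lock contents\n' > "$demo/poetry.lock"
printf 'print("v1")\n' > "$demo/example.py"

key_initial=$(deps_key "$demo")

# Editing UDF source leaves the key unchanged: the dependency layer is reusable.
printf 'print("v2")\n' > "$demo/example.py"
key_after_code=$(deps_key "$demo")

# Editing a manifest changes the key: dependencies must be reinstalled.
printf 'updated lock contents\n' > "$demo/poetry.lock"
key_after_lock=$(deps_key "$demo")

[ "$key_initial" = "$key_after_code" ] && echo "code edit: dependency layer reused"
[ "$key_initial" != "$key_after_lock" ] && echo "lock edit: dependency layer rebuilt"
rm -rf "$demo"
```

A key like this could also serve as an image tag for the shared base image, so that CI rebuilds the base only when the manifests change; that use is a suggestion, not something this commit implements.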

### 2. Layer Optimization

- Minimal system dependencies in runtime image
- Separate build and runtime stages
- Efficient file copying with specific paths

### 3. Build Context Optimization

- Copy only necessary files
- Use `.dockerignore` to exclude unnecessary files
- Minimize build context size
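
As a starting point for the `.dockerignore` mentioned above, the sketch below writes a small ignore file. Every pattern in it is an assumption about what the image builds do not need, and `write_dockerignore` is a hypothetical helper; check the list against the repository before relying on it.

```bash
# write_dockerignore [DIR]
# Writes a starter .dockerignore into DIR (default: current directory).
# The patterns are assumptions, not taken from the repository.
write_dockerignore() {
    target=${1:-.}
    cat > "$target/.dockerignore" <<'EOF'
.git
**/__pycache__
**/*.pyc
**/.venv
**/.pytest_cache
docs
EOF
}
```

Call `write_dockerignore` from the project root, since that is the build context used by `docker build ... .` in this document. Excluding `**/.venv` matters in particular: the example directory is copied into the image as a whole, so a virtual environment created on the host would otherwise be copied along with it.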

## Migration Guide

### For Existing UDFs

1. **Backup current Dockerfile**:
```bash
cp Dockerfile Dockerfile.backup
```

2. **Choose optimization approach**:
- For single UDF: Use `Dockerfile.optimized`
- For multiple UDFs: Use `Dockerfile.shared-base`

3. **Update Makefile**:
```bash
cp Makefile.optimized Makefile
```

4. **Test the build**:
```bash
make image
# or
make image-fast
```

### For New UDFs

1. **Use the optimized template**:
```bash
cp examples/map/even_odd/Dockerfile.optimized your-udf/Dockerfile
cp examples/map/even_odd/Makefile.optimized your-udf/Makefile
```

2. **Update paths in Dockerfile**:
- Change `EXAMPLE_PATH` to your UDF path
- Update `COPY` commands accordingly
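
The path update in step 2 can be scripted. The sketch below assumes the copied template refers to its example directory as `examples/map/even_odd` in both `EXAMPLE_PATH` and the `COPY` source, by analogy with the flatmap Dockerfile in this commit; `retarget_dockerfile` is a hypothetical helper, not part of the repository.

```bash
# retarget_dockerfile FILE OLD_PATH NEW_PATH
# Replaces every occurrence of OLD_PATH with NEW_PATH in FILE.
# '#' is the sed delimiter, so the paths may contain '/' but not '#'.
# OLD_PATH is treated as a regular expression, so keep it to plain path characters.
retarget_dockerfile() {
    file=$1
    old=$2
    new=$3
    tmp=$(mktemp)
    # Write to a temporary file first, then copy back so FILE keeps its permissions.
    sed "s#$old#$new#g" "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

For example, `retarget_dockerfile your-udf/Dockerfile examples/map/even_odd examples/map/your-udf` would rewrite both lines in one pass. Review the result with `git diff` afterwards, since the template may contain paths this sketch does not anticipate.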

## Troubleshooting

### Common Issues

1. **Base image not found**:
```bash
docker build -f Dockerfile.base -t numaflow-python-base .
```

2. **Permission issues**:
```bash
chmod +x entry.sh
```

3. **Poetry cache issues**:
```bash
poetry cache clear --all pypi
```

### Performance Monitoring

Monitor build times:
```bash
time make image
time make image-fast
```
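
The output of `time` goes to stderr and its format differs between shells, which makes it awkward to log. For before-and-after comparisons, a small wrapper that prints one labeled line per build may be easier to collect. This is a sketch; `time_build` is a hypothetical helper and it measures whole seconds only.

```bash
# time_build LABEL COMMAND [ARGS...]
# Runs COMMAND, then prints "LABEL: N s" with the elapsed wall-clock seconds.
# Returns the exit status of COMMAND.
time_build() {
    label=$1
    shift
    start=$(date +%s)
    "$@" && status=0 || status=$?
    end=$(date +%s)
    echo "$label: $(( end - start )) s"
    return "$status"
}
```

Usage would look like `time_build multi-stage make image` followed by `time_build shared-base make image-fast`. Change a source file between runs to measure a realistic rebuild rather than a fully cached one.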

## Future Enhancements

1. **Registry-based base images**: Push base image to registry for team sharing
2. **BuildKit optimizations**: Use BuildKit to build independent stages in parallel
3. **Multi-platform builds**: Optimize for ARM64 and AMD64
4. **Dependency analysis**: Automate dependency optimization
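
One concrete obstacle to multi-platform builds is that the flatmap example Dockerfile in this commit downloads the `x86_64` build of dumb-init by name. A possible first step is to choose the asset from the machine architecture. The sketch below assumes the v1.2.5 release also publishes an `aarch64` asset under the same naming scheme; `dumb_init_asset` is a hypothetical helper.

```bash
# dumb_init_asset MACHINE
# Maps a machine name (as printed by `uname -m`) to a dumb-init 1.2.5 asset name.
# The aarch64 asset name is an assumption based on the x86_64 naming scheme.
dumb_init_asset() {
    case "$1" in
        x86_64|amd64)  echo "dumb-init_1.2.5_x86_64" ;;
        aarch64|arm64) echo "dumb-init_1.2.5_aarch64" ;;
        *) echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

dumb_init_asset x86_64   # prints dumb-init_1.2.5_x86_64
dumb_init_asset arm64    # prints dumb-init_1.2.5_aarch64
```

Inside a `RUN` step, the same `case` on `$(uname -m)` could select the file name appended to the release download URL. Verify the asset names against the dumb-init release page before adopting this.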

## Contributing

When adding new UDFs or modifying existing ones:

1. Use the optimized Dockerfile templates
2. Follow the three-stage approach
3. Test build times before and after changes
4. Update this documentation if needed

## References

- [Docker Multi-Stage Builds](https://docs.docker.com/develop/dev-best-practices/multistage-build/)
- [Docker Layer Caching](https://docs.docker.com/develop/dev-best-practices/dockerfile_best-practices/#leverage-build-cache)
- [Poetry Docker Best Practices](https://python-poetry.org/docs/configuration/#virtualenvsin-project)

examples/batchmap/flatmap/Dockerfile

Lines changed: 36 additions & 36 deletions
@@ -1,52 +1,52 @@
 ####################################################################################################
-# builder: install needed dependencies
+# Stage 1: Base Builder - installs core dependencies using poetry
 ####################################################################################################
+FROM python:3.10-slim-bullseye AS base-builder
 
-FROM python:3.10-slim-bullseye AS builder
+ENV PYSETUP_PATH="/opt/pysetup"
+WORKDIR $PYSETUP_PATH
+
+# Copy only core dependency files first for better caching
+COPY pyproject.toml poetry.lock README.md ./
+COPY pynumaflow/ ./pynumaflow/
+RUN apt-get update && apt-get install --no-install-recommends -y \
+    curl wget build-essential git \
+    && apt-get clean && rm -rf /var/lib/apt/lists/* \
+    && pip install poetry \
+    && poetry install --no-root --no-interaction
+
+####################################################################################################
+# Stage 2: UDF Builder - adds UDF code and installs UDF-specific deps
+####################################################################################################
+FROM base-builder AS udf-builder
+
+ENV EXAMPLE_PATH="/opt/pysetup/examples/batchmap/flatmap"
+ENV POETRY_VIRTUALENVS_IN_PROJECT=true
+
+WORKDIR $EXAMPLE_PATH
+COPY examples/batchmap/flatmap/ ./
+RUN poetry install --no-root --no-interaction
 
-ENV PYTHONFAULTHANDLER=1 \
-    PYTHONUNBUFFERED=1 \
-    PYTHONHASHSEED=random \
-    PIP_NO_CACHE_DIR=on \
-    PIP_DISABLE_PIP_VERSION_CHECK=on \
-    PIP_DEFAULT_TIMEOUT=100 \
-    POETRY_VERSION=1.2.2 \
-    POETRY_HOME="/opt/poetry" \
-    POETRY_VIRTUALENVS_IN_PROJECT=true \
-    POETRY_NO_INTERACTION=1 \
-    PYSETUP_PATH="/opt/pysetup"
+####################################################################################################
+# Stage 3: UDF Runtime - clean container with only needed stuff
+####################################################################################################
+FROM python:3.10-slim-bullseye AS udf
 
+ENV PYSETUP_PATH="/opt/pysetup"
 ENV EXAMPLE_PATH="$PYSETUP_PATH/examples/batchmap/flatmap"
 ENV VENV_PATH="$EXAMPLE_PATH/.venv"
-ENV PATH="$POETRY_HOME/bin:$VENV_PATH/bin:$PATH"
-
-RUN apt-get update \
-    && apt-get install --no-install-recommends -y \
-    curl \
-    wget \
-    # deps for building python deps
-    build-essential \
-    && apt-get install -y git \
+ENV PATH="$VENV_PATH/bin:$PATH"
+
+RUN apt-get update && apt-get install --no-install-recommends -y wget \
     && apt-get clean && rm -rf /var/lib/apt/lists/* \
-    \
-    # install dumb-init
     && wget -O /dumb-init https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64 \
-    && chmod +x /dumb-init \
-    && curl -sSL https://install.python-poetry.org | python3 -
-
-####################################################################################################
-# udf: used for running the udf vertices
-####################################################################################################
-FROM builder AS udf
+    && chmod +x /dumb-init
 
 WORKDIR $PYSETUP_PATH
-COPY ./ ./
+COPY --from=udf-builder $VENV_PATH $VENV_PATH
+COPY --from=udf-builder $EXAMPLE_PATH $EXAMPLE_PATH
 
 WORKDIR $EXAMPLE_PATH
-RUN poetry lock
-RUN poetry install --no-cache --no-root && \
-    rm -rf ~/.cache/pypoetry/
-
 RUN chmod +x entry.sh
 
 ENTRYPOINT ["/dumb-init", "--"]
