Commit fa6bab9

feat: multi-Python worker images with startup version check (AE-2827) (#89)

* feat: multi-Python worker images with startup version check (AE-2827)

Add Python 3.10 and 3.11 support to GPU worker images via side-by-side torch install in the existing runpod/pytorch base. 3.12 keeps the fast path (torch pre-installed) to avoid the ~7 GB reinstall cost on hot deployments; 3.10/3.11 images pay that cost once per cold start per DC.

Sibling to flash#322 which landed the SDK-level plumbing. Tags follow the same ``py${VERSION}-${TAG}`` scheme already in use for CPU images.

- Dockerfile / Dockerfile-lb (GPU): accept PYTHON_VERSION build arg; install torch from download.pytorch.org/whl/cu128 and repoint /usr/local/bin/python for non-3.12 targets; validate interpreter matches the arg during build.
- Dockerfile-cpu / Dockerfile-lb-cpu (CPU): surface PYTHON_VERSION at runtime via FLASH_PYTHON_VERSION env so the worker's startup check can read it.
- src/version.py: new ``assert_python_version_matches_image``; raises PythonVersionMismatchError at handler boot when ``sys.version_info`` disagrees with the image's stamped FLASH_PYTHON_VERSION. Caught before user code runs; skipped when the env var is unset (local dev).
- src/handler.py / src/lb_handler.py: call the assertion immediately after logging setup, before ``maybe_unpack()`` and handler import.
- tests/unit/test_version.py: 4 new cases covering env-unset skip, match, mismatch raise, and message contents.
- tests/unit/test_lb_handler.py: extend the mocked ``version`` module with ``assert_python_version_matches_image`` so fresh-import tests don't break.
- .github/workflows/ci.yml: expand CI to build GPU and LB images across {3.10, 3.11, 3.12}; align prod CPU and LB-CPU default to 3.12 (matches flash's DEFAULT_PYTHON_VERSION).

* fix(dockerfile): bootstrap pip via get-pip.py for non-3.12 GPU builds

Ubuntu 22.04's system python3.10 has ensurepip disabled by Debian policy, which broke the side-by-side torch install for 3.10 GPU images (CI: docker-test-gpu (3.10), docker-test-lb (3.10)). python3.11 is a separate interpreter without the disable, so only 3.10 was affected. Use urllib+get-pip.py instead of ensurepip: it works for any interpreter regardless of distro patching, and urllib is stdlib so no curl dep.

Also corrects the outdated deadsnakes comment on both Dockerfiles: the runpod/pytorch base image layers alt-Python 3.11/3.12 on top of the system 3.10, not via deadsnakes.
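
The startup check described for src/version.py is not shown in this part of the diff. A minimal sketch of what such a check could look like, assuming only the behavior stated in the commit message (compare ``sys.version_info`` against FLASH_PYTHON_VERSION, skip when the variable is unset), follows. The ``environ`` and ``version_info`` parameters and the error message wording are hypothetical additions to make the sketch testable; the real function may differ.

```python
import os
import sys


class PythonVersionMismatchError(RuntimeError):
    """Raised when the running interpreter differs from the image's stamped version."""


def assert_python_version_matches_image(environ=None, version_info=None):
    # `environ` and `version_info` are hypothetical test seams, not taken
    # from the commit; by default the real process state is used.
    environ = os.environ if environ is None else environ
    version_info = sys.version_info if version_info is None else version_info

    expected = environ.get("FLASH_PYTHON_VERSION")
    if not expected:
        # Env var unset (for example local dev): skip the check.
        return

    # Compare on major.minor only, the same granularity as PYTHON_VERSION.
    actual = f"{version_info[0]}.{version_info[1]}"
    if actual != expected:
        raise PythonVersionMismatchError(
            f"Image was built for Python {expected}, "
            f"but the worker is running Python {actual}"
        )
```

Called once at handler boot, a check of this shape fails before any user code is imported, which matches the ordering the commit message describes.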
1 parent da395cf commit fa6bab9

10 files changed

Lines changed: 277 additions & 46 deletions

.github/workflows/ci.yml

Lines changed: 113 additions & 18 deletions
@@ -73,6 +73,9 @@ jobs:
   docker-test:
     runs-on: ubuntu-latest
     if: github.event_name != 'pull_request' || github.head_ref != 'release-please--branches--main'
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
     steps:
       - name: Checkout repository
         uses: actions/checkout@v4
@@ -89,19 +92,22 @@ jobs:
           push: false
           tags: flash-cpu:test
           build-args: |
-            PYTHON_VERSION=3.11
-          cache-from: type=gha
-          cache-to: type=gha,mode=max
+            PYTHON_VERSION=${{ matrix.python-version }}
+          cache-from: type=gha,scope=cpu-test-py${{ matrix.python-version }}
+          cache-to: type=gha,mode=max,scope=cpu-test-py${{ matrix.python-version }}
           load: true
 
       - name: Test CPU handler execution in Docker environment
         run: |
-          echo "Testing CPU handler in Docker environment..."
+          echo "Testing CPU handler (Python ${{ matrix.python-version }})..."
           docker run --rm flash-cpu:test ./test-handler.sh
 
   docker-test-lb-cpu:
     runs-on: ubuntu-latest
     if: github.event_name != 'pull_request' || github.head_ref != 'release-please--branches--main'
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
     steps:
       - name: Checkout repository
         uses: actions/checkout@v4
@@ -118,24 +124,97 @@ jobs:
           push: false
           tags: flash-lb-cpu:test
           build-args: |
-            PYTHON_VERSION=3.11
-          cache-from: type=gha
-          cache-to: type=gha,mode=max
+            PYTHON_VERSION=${{ matrix.python-version }}
+          cache-from: type=gha,scope=lb-cpu-test-py${{ matrix.python-version }}
+          cache-to: type=gha,mode=max,scope=lb-cpu-test-py${{ matrix.python-version }}
           load: true
 
       - name: Test LB handler execution in Docker environment
         run: |
-          echo "Testing LB handler in Docker environment..."
+          echo "Testing LB handler (Python ${{ matrix.python-version }})..."
           docker run --rm flash-lb-cpu:test ./test-lb-handler.sh
 
+  docker-test-gpu:
+    runs-on: ubuntu-latest
+    if: github.event_name != 'pull_request' || github.head_ref != 'release-please--branches--main'
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
+    steps:
+      - name: Clear space
+        run: |
+          rm -rf /usr/share/dotnet /opt/ghc /usr/local/share/boost "$AGENT_TOOLSDIRECTORY"
+          docker system prune -af
+          df -h
+
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Build GPU Docker image
+        uses: docker/build-push-action@v6
+        with:
+          context: .
+          file: ./Dockerfile
+          platforms: linux/amd64
+          push: false
+          tags: flash-gpu:test
+          build-args: |
+            PYTHON_VERSION=${{ matrix.python-version }}
+          cache-from: type=gha,scope=gpu-test-py${{ matrix.python-version }}
+          cache-to: type=gha,mode=max,scope=gpu-test-py${{ matrix.python-version }}
+
+  docker-test-lb:
+    runs-on: ubuntu-latest
+    if: github.event_name != 'pull_request' || github.head_ref != 'release-please--branches--main'
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
+    steps:
+      - name: Clear space
+        run: |
+          rm -rf /usr/share/dotnet /opt/ghc /usr/local/share/boost "$AGENT_TOOLSDIRECTORY"
+          docker system prune -af
+          df -h
+
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Build GPU Load Balancer Docker image
+        uses: docker/build-push-action@v6
+        with:
+          context: .
+          file: ./Dockerfile-lb
+          platforms: linux/amd64
+          push: false
+          tags: flash-lb:test
+          build-args: |
+            PYTHON_VERSION=${{ matrix.python-version }}
+          cache-from: type=gha,scope=lb-test-py${{ matrix.python-version }}
+          cache-to: type=gha,mode=max,scope=lb-test-py${{ matrix.python-version }}
+
   docker-validation:
     runs-on: ubuntu-latest
-    needs: [test, lint, docker-test, docker-test-lb-cpu]
+    needs: [test, lint, docker-test, docker-test-lb-cpu, docker-test-gpu, docker-test-lb]
     if: always()
     steps:
       - name: Check all jobs succeeded
         run: |
-          results=("${{ needs.test.result }}" "${{ needs.lint.result }}" "${{ needs.docker-test.result }}" "${{ needs.docker-test-lb-cpu.result }}")
+          results=(
+            "${{ needs.test.result }}"
+            "${{ needs.lint.result }}"
+            "${{ needs.docker-test.result }}"
+            "${{ needs.docker-test-lb-cpu.result }}"
+            "${{ needs.docker-test-gpu.result }}"
+            "${{ needs.docker-test-lb.result }}"
+          )
           for result in "${results[@]}"; do
             if [[ "$result" != "success" && "$result" != "skipped" ]]; then
               echo "One or more quality checks failed (got: $result)"
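
The docker-validation gate above accepts a job only when its result is "success" or "skipped". The same rule can be sketched in Python to make the logic easier to reason about; the function name ``failed_jobs`` and the sample results are illustrative and not part of the workflow.

```python
# Results the gate treats as acceptable, mirroring the bash condition above.
ACCEPTED_RESULTS = {"success", "skipped"}


def failed_jobs(results):
    """Return the subset of jobs whose result is neither success nor skipped."""
    return {job: result for job, result in results.items()
            if result not in ACCEPTED_RESULTS}


# Illustrative input: job names come from the `needs` list, results are made up.
sample = {
    "test": "success",
    "lint": "success",
    "docker-test": "success",
    "docker-test-lb-cpu": "skipped",
    "docker-test-gpu": "failure",
    "docker-test-lb": "cancelled",
}

for job, result in failed_jobs(sample).items():
    print(f"quality check failed: {job} (got: {result})")
```

Because the gate job runs with ``if: always()``, a matrix job that fails or is cancelled still reaches this check instead of silently skipping it.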
@@ -168,8 +247,13 @@ jobs:
     needs: [release]
     if: needs.release.outputs.release_created
     strategy:
+      fail-fast: false
       matrix:
         include:
+          - python-version: "3.10"
+            is-default: false
+          - python-version: "3.11"
+            is-default: false
           - python-version: "3.12"
             is-default: true
     steps:
@@ -226,22 +310,25 @@ jobs:
           platforms: linux/amd64
           push: true
           tags: ${{ steps.tags.outputs.tags }}
-          cache-from: type=gha,scope=gpu
-          cache-to: type=gha,mode=max,scope=gpu
+          build-args: |
+            PYTHON_VERSION=${{ matrix.python-version }}
+          cache-from: type=gha,scope=gpu-py${{ matrix.python-version }}
+          cache-to: type=gha,mode=max,scope=gpu-py${{ matrix.python-version }}
 
   docker-prod-cpu:
     runs-on: ubuntu-latest
     needs: [release]
     if: needs.release.outputs.release_created
     strategy:
+      fail-fast: false
       matrix:
         include:
           - python-version: "3.10"
             is-default: false
           - python-version: "3.11"
-            is-default: true
-          - python-version: "3.12"
             is-default: false
+          - python-version: "3.12"
+            is-default: true
     steps:
       - name: Clear Space
         run: |
@@ -306,8 +393,13 @@ jobs:
     needs: [release]
     if: needs.release.outputs.release_created
     strategy:
+      fail-fast: false
       matrix:
         include:
+          - python-version: "3.10"
+            is-default: false
+          - python-version: "3.11"
+            is-default: false
           - python-version: "3.12"
             is-default: true
     steps:
@@ -364,22 +456,25 @@ jobs:
           platforms: linux/amd64
           push: true
           tags: ${{ steps.tags.outputs.tags }}
-          cache-from: type=gha,scope=lb
-          cache-to: type=gha,mode=max,scope=lb
+          build-args: |
+            PYTHON_VERSION=${{ matrix.python-version }}
+          cache-from: type=gha,scope=lb-py${{ matrix.python-version }}
+          cache-to: type=gha,mode=max,scope=lb-py${{ matrix.python-version }}
 
   docker-prod-lb-cpu:
     runs-on: ubuntu-latest
     needs: [release]
     if: needs.release.outputs.release_created
     strategy:
+      fail-fast: false
       matrix:
         include:
           - python-version: "3.10"
             is-default: false
           - python-version: "3.11"
-            is-default: true
-          - python-version: "3.12"
             is-default: false
+          - python-version: "3.12"
+            is-default: true
     steps:
       - name: Clear Space
         run: |
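
The step that computes ``steps.tags.outputs.tags`` is outside the hunks shown, so how ``is-default`` feeds into tagging is not visible here. One plausible reading, sketched below purely as an assumption, is that every matrix entry gets a ``py${VERSION}-${TAG}`` tag (the scheme named in the commit message) and the default entry additionally gets the bare ``${TAG}``. The function ``image_tags``, the repository name, and the release tag are all hypothetical.

```python
def image_tags(repository, release_tag, python_version, is_default):
    """Build the tag list for one matrix entry (hypothetical helper)."""
    # From the commit message: tags follow the py${VERSION}-${TAG} scheme.
    tags = [f"{repository}:py{python_version}-{release_tag}"]
    if is_default:
        # ASSUMPTION: the default Python version also receives the
        # unprefixed tag. The diff does not show this step.
        tags.append(f"{repository}:{release_tag}")
    return tags


# Matrix values as they appear in the prod jobs after this commit.
matrix = [("3.10", False), ("3.11", False), ("3.12", True)]

for python_version, is_default in matrix:
    # "example/flash" and "1.2.3" are placeholders, not real names.
    print(image_tags("example/flash", "1.2.3", python_version, is_default))
```

Under that assumption, moving ``is-default: true`` from 3.11 to 3.12 in the CPU jobs changes which interpreter the unprefixed tag resolves to, which is why the commit aligns it with the GPU jobs.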

Dockerfile

Lines changed: 37 additions & 14 deletions
@@ -1,15 +1,37 @@
-# Base image provides Python 3.12 (from runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2204)
+# Base image (runpod/pytorch:ubuntu2204) ships Ubuntu 22.04 system python3.10
+# plus alt-Python interpreters for 3.11/3.12 with torch pre-installed on 3.12.
+# For non-3.12 targets we reinstall torch from the CUDA 12.8 wheel index
+# (~7 GB overhead) and repoint /usr/local/bin/python so the worker CMD picks
+# up the correct interpreter.
 FROM runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2204
 
-# Use the base image's Python as-is to preserve pre-installed packages (torch, cuda libs).
-# The pytorch base image provides its own Python with torch already installed.
-# Symlinking to /usr/bin/python3.X would switch to a bare system Python without torch.
-# Validate that the base image provides the expected Python version.
-ARG EXPECTED_PYTHON_VERSION=3.12
-RUN python --version && \
-    actual=$(python -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')") && \
-    if [ "$actual" != "$EXPECTED_PYTHON_VERSION" ]; then \
-        echo "ERROR: Expected Python $EXPECTED_PYTHON_VERSION but base image provides $actual" && exit 1; \
+# Target Python version for the worker runtime.
+ARG PYTHON_VERSION=3.12
+ARG TORCH_VERSION=2.9.1+cu128
+ARG TORCH_INDEX_URL=https://download.pytorch.org/whl/cu128
+
+# Expose the target version to the running worker for startup validation.
+ENV FLASH_PYTHON_VERSION=${PYTHON_VERSION}
+
+# Validate the base image provides the requested interpreter and activate it.
+# For non-3.12 targets, install torch for the selected Python and repoint
+# /usr/local/bin/python and python3 so downstream `python` invocations use it.
+# For 3.12 we keep the base image's python/torch untouched to avoid the
+# ~7 GB reinstall cost.
+#
+# pip bootstrap: Ubuntu 22.04's system python3.10 has ensurepip disabled by
+# Debian policy, so we install pip via get-pip.py (works for any interpreter
+# regardless of distro patching). urllib is stdlib, avoiding a curl dependency.
+RUN python${PYTHON_VERSION} --version \
+    && if [ "${PYTHON_VERSION}" != "3.12" ]; then \
+        python${PYTHON_VERSION} -c "import urllib.request; urllib.request.urlretrieve('https://bootstrap.pypa.io/get-pip.py', '/tmp/get-pip.py')" \
+        && python${PYTHON_VERSION} /tmp/get-pip.py --no-cache-dir \
+        && rm -f /tmp/get-pip.py \
+        && python${PYTHON_VERSION} -m pip install --no-cache-dir \
+            --index-url ${TORCH_INDEX_URL} \
+            "torch==${TORCH_VERSION}" \
+        && ln -sf "$(which python${PYTHON_VERSION})" /usr/local/bin/python \
+        && ln -sf "$(which python${PYTHON_VERSION})" /usr/local/bin/python3; \
     fi
 
 WORKDIR /app
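
The RUN step above branches on the target version: 3.12 only confirms the interpreter exists, while other versions bootstrap pip, install torch, and repoint the ``python`` symlinks. The sketch below restates that branching as a Python function that returns the shell commands in order, which may make the two paths easier to compare. It only builds strings and runs nothing; the function name ``interpreter_setup_commands`` is hypothetical, and the defaults are copied from the ARG lines in the Dockerfile.

```python
GET_PIP_URL = "https://bootstrap.pypa.io/get-pip.py"


def interpreter_setup_commands(
    python_version,
    torch_version="2.9.1+cu128",
    torch_index_url="https://download.pytorch.org/whl/cu128",
):
    """Return the shell commands the RUN step would chain for a given target."""
    py = f"python{python_version}"
    # Always executed: fails the build if the interpreter is missing.
    commands = [f"{py} --version"]

    if python_version == "3.12":
        # Fast path: the base image's python/torch are left untouched.
        return commands

    commands += [
        # Download get-pip.py with urllib (stdlib), since ensurepip may be disabled.
        f"{py} -c \"import urllib.request; "
        f"urllib.request.urlretrieve('{GET_PIP_URL}', '/tmp/get-pip.py')\"",
        f"{py} /tmp/get-pip.py --no-cache-dir",
        "rm -f /tmp/get-pip.py",
        # Side-by-side torch install for the selected interpreter.
        f"{py} -m pip install --no-cache-dir "
        f"--index-url {torch_index_url} \"torch=={torch_version}\"",
        # Repoint the default interpreter names used by later steps and CMD.
        f"ln -sf \"$(which {py})\" /usr/local/bin/python",
        f"ln -sf \"$(which {py})\" /usr/local/bin/python3",
    ]
    return commands


for command in interpreter_setup_commands("3.10"):
    print(command)
```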
@@ -41,20 +63,21 @@ RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get install -y --no-ins
     && rm -rf /var/lib/apt/lists/*
 
 # Copy app code and install dependencies
-# Use --python to target the base image's Python (preserves torch in its site-packages)
+# Use --python to target the active interpreter (preserves torch in its site-packages)
 COPY README.md pyproject.toml uv.lock ./
 COPY src/ ./
 RUN uv export --format requirements-txt --no-dev --no-hashes > requirements.txt \
     && uv pip install --python $(which python) --break-system-packages -r requirements.txt
 
-# Install numpy for the base image's Python version.
+# Install numpy for the active Python version.
 # The runpod/pytorch image ships torch but not numpy. Flash build excludes numpy
 # from tarballs (BASE_IMAGE_PACKAGES) to save tarball space (~30 MB), so numpy
 # must be provided here in the base image.
 RUN python -m pip install --no-cache-dir numpy
 
-# Verify torch and numpy are available from the base image
-RUN python -c "import torch; print(f'torch {torch.__version__} CUDA {torch.cuda.is_available()}')" \
+# Verify torch, numpy, and the expected Python version are available.
+RUN python -c "import sys; actual = f'{sys.version_info.major}.{sys.version_info.minor}'; expected = '${PYTHON_VERSION}'; assert actual == expected, f'Expected Python {expected}, got {actual}'; print(f'Python {actual} OK')" \
+    && python -c "import torch; print(f'torch {torch.__version__} CUDA {torch.cuda.is_available()}')" \
     && python -c "import numpy; print(f'numpy {numpy.__version__}')"
 
 CMD ["python", "handler.py"]
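
The final verification step packs its checks into ``python -c`` one-liners. As a rough illustration of the module-availability part, the sketch below uses ``importlib.util.find_spec`` to report which required packages the active interpreter cannot locate. This is a lighter check than the Dockerfile's, which actually imports torch and numpy; the function name ``missing_modules`` is hypothetical.

```python
import importlib.util


def missing_modules(required=("torch", "numpy")):
    """Return the top-level module names the active interpreter cannot find."""
    # find_spec returns None for a top-level module that is not installed.
    # It locates the module without importing it, so a package that is
    # present but broken at import time would not be caught here.
    return [name for name in required if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        raise SystemExit(f"missing required modules: {', '.join(missing)}")
    print("torch and numpy are available")
```

Running a check like this with the repointed ``python`` confirms that the packages landed in the site-packages of the interpreter the worker will actually use, not in another interpreter's.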

Dockerfile-cpu

Lines changed: 5 additions & 0 deletions
@@ -1,6 +1,11 @@
 ARG PYTHON_VERSION=3.12
 FROM python:${PYTHON_VERSION}-slim
 
+# Re-declare after FROM so the value is visible in this build stage, and
+# expose it at runtime for the worker's startup version check.
+ARG PYTHON_VERSION
+ENV FLASH_PYTHON_VERSION=${PYTHON_VERSION}
+
 WORKDIR /app
 
 # Prevent interactive prompts during package installation

Dockerfile-lb

Lines changed: 37 additions & 12 deletions
@@ -1,13 +1,37 @@
-# Base image provides Python 3.12 (from runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2204)
+# Base image (runpod/pytorch:ubuntu2204) ships Ubuntu 22.04 system python3.10
+# plus alt-Python interpreters for 3.11/3.12 with torch pre-installed on 3.12.
+# For non-3.12 targets we reinstall torch from the CUDA 12.8 wheel index
+# (~7 GB overhead) and repoint /usr/local/bin/python so the worker CMD picks
+# up the correct interpreter.
 FROM runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2204
 
-# Use the base image's Python as-is to preserve pre-installed packages (torch, cuda libs).
-# Validate that the base image provides the expected Python version.
-ARG EXPECTED_PYTHON_VERSION=3.12
-RUN python --version && \
-    actual=$(python -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')") && \
-    if [ "$actual" != "$EXPECTED_PYTHON_VERSION" ]; then \
-        echo "ERROR: Expected Python $EXPECTED_PYTHON_VERSION but base image provides $actual" && exit 1; \
+# Target Python version for the worker runtime.
+ARG PYTHON_VERSION=3.12
+ARG TORCH_VERSION=2.9.1+cu128
+ARG TORCH_INDEX_URL=https://download.pytorch.org/whl/cu128
+
+# Expose the target version to the running worker for startup validation.
+ENV FLASH_PYTHON_VERSION=${PYTHON_VERSION}
+
+# Validate the base image provides the requested interpreter and activate it.
+# For non-3.12 targets, install torch for the selected Python and repoint
+# /usr/local/bin/python and python3 so downstream `python` invocations use it.
+# For 3.12 we keep the base image's python/torch untouched to avoid the
+# ~7 GB reinstall cost.
+#
+# pip bootstrap: Ubuntu 22.04's system python3.10 has ensurepip disabled by
+# Debian policy, so we install pip via get-pip.py (works for any interpreter
+# regardless of distro patching). urllib is stdlib, avoiding a curl dependency.
+RUN python${PYTHON_VERSION} --version \
+    && if [ "${PYTHON_VERSION}" != "3.12" ]; then \
+        python${PYTHON_VERSION} -c "import urllib.request; urllib.request.urlretrieve('https://bootstrap.pypa.io/get-pip.py', '/tmp/get-pip.py')" \
+        && python${PYTHON_VERSION} /tmp/get-pip.py --no-cache-dir \
+        && rm -f /tmp/get-pip.py \
+        && python${PYTHON_VERSION} -m pip install --no-cache-dir \
+            --index-url ${TORCH_INDEX_URL} \
+            "torch==${TORCH_VERSION}" \
+        && ln -sf "$(which python${PYTHON_VERSION})" /usr/local/bin/python \
+        && ln -sf "$(which python${PYTHON_VERSION})" /usr/local/bin/python3; \
     fi
 
 WORKDIR /app
@@ -39,20 +63,21 @@ RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get install -y --no-ins
     && rm -rf /var/lib/apt/lists/*
 
 # Copy app code and install dependencies
-# Use --python to target the base image's Python (preserves torch in its site-packages)
+# Use --python to target the active interpreter (preserves torch in its site-packages)
 COPY README.md pyproject.toml uv.lock ./
 COPY src/ ./
 RUN uv export --format requirements-txt --no-dev --no-hashes > requirements.txt \
     && uv pip install --python $(which python) --break-system-packages -r requirements.txt
 
-# Install numpy for the base image's Python version.
+# Install numpy for the active Python version.
 # The runpod/pytorch image ships torch but not numpy. Flash build excludes numpy
 # from tarballs (BASE_IMAGE_PACKAGES) to save tarball space (~30 MB), so numpy
 # must be provided here in the base image.
 RUN python -m pip install --no-cache-dir numpy
 
-# Verify torch and numpy are available from the base image
-RUN python -c "import torch; print(f'torch {torch.__version__} CUDA {torch.cuda.is_available()}')" \
+# Verify torch, numpy, and the expected Python version are available.
+RUN python -c "import sys; actual = f'{sys.version_info.major}.{sys.version_info.minor}'; expected = '${PYTHON_VERSION}'; assert actual == expected, f'Expected Python {expected}, got {actual}'; print(f'Python {actual} OK')" \
+    && python -c "import torch; print(f'torch {torch.__version__} CUDA {torch.cuda.is_available()}')" \
     && python -c "import numpy; print(f'numpy {numpy.__version__}')"
 
 EXPOSE 80

Dockerfile-lb-cpu

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,11 @@
11
ARG PYTHON_VERSION=3.12
22
FROM python:${PYTHON_VERSION}-slim
33

4+
# Re-declare after FROM so the value is visible in this build stage, and
5+
# expose it at runtime for the worker's startup version check.
6+
ARG PYTHON_VERSION
7+
ENV FLASH_PYTHON_VERSION=${PYTHON_VERSION}
8+
49
WORKDIR /app
510

611
# Prevent interactive prompts during package installation
