Commit 161eb22

feat: multi-Python worker images with startup version check (AE-2827)
Add Python 3.10 and 3.11 support to GPU worker images via side-by-side torch install in the existing runpod/pytorch base. 3.12 keeps the fast path (torch pre-installed) to avoid the ~7 GB reinstall cost on hot deployments; 3.10/3.11 images pay that cost once per cold start per DC.

Sibling to flash#322, which landed the SDK-level plumbing. Tags follow the same ``py${VERSION}-${TAG}`` scheme already in use for CPU images.

- Dockerfile / Dockerfile-lb (GPU): accept PYTHON_VERSION build arg; install torch from download.pytorch.org/whl/cu128 and repoint /usr/local/bin/python for non-3.12 targets; validate interpreter matches the arg during build.
- Dockerfile-cpu / Dockerfile-lb-cpu (CPU): surface PYTHON_VERSION at runtime via FLASH_PYTHON_VERSION env so the worker's startup check can read it.
- src/version.py: new ``assert_python_version_matches_image``, which raises PythonVersionMismatchError at handler boot when ``sys.version_info`` disagrees with the image's stamped FLASH_PYTHON_VERSION. Caught before user code runs; skipped when the env var is unset (local dev).
- src/handler.py / src/lb_handler.py: call the assertion immediately after logging setup, before ``maybe_unpack()`` and handler import.
- tests/unit/test_version.py: 4 new cases covering env-unset skip, match, mismatch raise, and message contents.
- tests/unit/test_lb_handler.py: extend the mocked ``version`` module with ``assert_python_version_matches_image`` so fresh-import tests don't break.
- .github/workflows/ci.yml: expand CI to build GPU and LB images across {3.10, 3.11, 3.12}; align prod CPU and LB-CPU default to 3.12 (matches flash's DEFAULT_PYTHON_VERSION).
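
The diff for src/version.py is not shown on this page, so the following is only a minimal sketch of how the startup check described above might look. The function and exception names and the FLASH_PYTHON_VERSION variable come from the commit message; the `env` and `version_info` parameters and the exact error wording are assumptions added here to make the sketch testable.

```python
import os
import sys


class PythonVersionMismatchError(RuntimeError):
    """Raised when the running interpreter differs from the image's stamped version."""


def assert_python_version_matches_image(env=None, version_info=None):
    # `env` and `version_info` are hypothetical injection points for tests;
    # by default the real environment and interpreter are inspected.
    env = os.environ if env is None else env
    version_info = sys.version_info if version_info is None else version_info

    expected = env.get("FLASH_PYTHON_VERSION")
    if not expected:
        # Env var unset (e.g. local dev): skip the check.
        return

    # Compare on major.minor only, matching the "3.12"-style build arg.
    actual = f"{version_info[0]}.{version_info[1]}"
    if actual != expected.strip():
        raise PythonVersionMismatchError(
            f"Image was built for Python {expected.strip()} "
            f"but the worker is running Python {actual}"
        )
```

Calling this before any user code is imported means a mismatched interpreter fails fast with a clear message rather than surfacing later as an import or ABI error.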
1 parent da395cf commit 161eb22

10 files changed

Lines changed: 263 additions & 46 deletions

.github/workflows/ci.yml

Lines changed: 113 additions & 18 deletions
@@ -73,6 +73,9 @@ jobs:
   docker-test:
     runs-on: ubuntu-latest
     if: github.event_name != 'pull_request' || github.head_ref != 'release-please--branches--main'
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
     steps:
       - name: Checkout repository
         uses: actions/checkout@v4
@@ -89,19 +92,22 @@ jobs:
           push: false
           tags: flash-cpu:test
           build-args: |
-            PYTHON_VERSION=3.11
-          cache-from: type=gha
-          cache-to: type=gha,mode=max
+            PYTHON_VERSION=${{ matrix.python-version }}
+          cache-from: type=gha,scope=cpu-test-py${{ matrix.python-version }}
+          cache-to: type=gha,mode=max,scope=cpu-test-py${{ matrix.python-version }}
           load: true
 
       - name: Test CPU handler execution in Docker environment
         run: |
-          echo "Testing CPU handler in Docker environment..."
+          echo "Testing CPU handler (Python ${{ matrix.python-version }})..."
           docker run --rm flash-cpu:test ./test-handler.sh
 
   docker-test-lb-cpu:
     runs-on: ubuntu-latest
     if: github.event_name != 'pull_request' || github.head_ref != 'release-please--branches--main'
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
     steps:
       - name: Checkout repository
         uses: actions/checkout@v4
@@ -118,24 +124,97 @@ jobs:
           push: false
           tags: flash-lb-cpu:test
           build-args: |
-            PYTHON_VERSION=3.11
-          cache-from: type=gha
-          cache-to: type=gha,mode=max
+            PYTHON_VERSION=${{ matrix.python-version }}
+          cache-from: type=gha,scope=lb-cpu-test-py${{ matrix.python-version }}
+          cache-to: type=gha,mode=max,scope=lb-cpu-test-py${{ matrix.python-version }}
           load: true
 
       - name: Test LB handler execution in Docker environment
         run: |
-          echo "Testing LB handler in Docker environment..."
+          echo "Testing LB handler (Python ${{ matrix.python-version }})..."
           docker run --rm flash-lb-cpu:test ./test-lb-handler.sh
 
+  docker-test-gpu:
+    runs-on: ubuntu-latest
+    if: github.event_name != 'pull_request' || github.head_ref != 'release-please--branches--main'
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
+    steps:
+      - name: Clear space
+        run: |
+          rm -rf /usr/share/dotnet /opt/ghc /usr/local/share/boost "$AGENT_TOOLSDIRECTORY"
+          docker system prune -af
+          df -h
+
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Build GPU Docker image
+        uses: docker/build-push-action@v6
+        with:
+          context: .
+          file: ./Dockerfile
+          platforms: linux/amd64
+          push: false
+          tags: flash-gpu:test
+          build-args: |
+            PYTHON_VERSION=${{ matrix.python-version }}
+          cache-from: type=gha,scope=gpu-test-py${{ matrix.python-version }}
+          cache-to: type=gha,mode=max,scope=gpu-test-py${{ matrix.python-version }}
+
+  docker-test-lb:
+    runs-on: ubuntu-latest
+    if: github.event_name != 'pull_request' || github.head_ref != 'release-please--branches--main'
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
+    steps:
+      - name: Clear space
+        run: |
+          rm -rf /usr/share/dotnet /opt/ghc /usr/local/share/boost "$AGENT_TOOLSDIRECTORY"
+          docker system prune -af
+          df -h
+
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Build GPU Load Balancer Docker image
+        uses: docker/build-push-action@v6
+        with:
+          context: .
+          file: ./Dockerfile-lb
+          platforms: linux/amd64
+          push: false
+          tags: flash-lb:test
+          build-args: |
+            PYTHON_VERSION=${{ matrix.python-version }}
+          cache-from: type=gha,scope=lb-test-py${{ matrix.python-version }}
+          cache-to: type=gha,mode=max,scope=lb-test-py${{ matrix.python-version }}
+
   docker-validation:
     runs-on: ubuntu-latest
-    needs: [test, lint, docker-test, docker-test-lb-cpu]
+    needs: [test, lint, docker-test, docker-test-lb-cpu, docker-test-gpu, docker-test-lb]
     if: always()
     steps:
       - name: Check all jobs succeeded
         run: |
-          results=("${{ needs.test.result }}" "${{ needs.lint.result }}" "${{ needs.docker-test.result }}" "${{ needs.docker-test-lb-cpu.result }}")
+          results=(
+            "${{ needs.test.result }}"
+            "${{ needs.lint.result }}"
+            "${{ needs.docker-test.result }}"
+            "${{ needs.docker-test-lb-cpu.result }}"
+            "${{ needs.docker-test-gpu.result }}"
+            "${{ needs.docker-test-lb.result }}"
+          )
           for result in "${results[@]}"; do
             if [[ "$result" != "success" && "$result" != "skipped" ]]; then
               echo "One or more quality checks failed (got: $result)"
@@ -168,8 +247,13 @@ jobs:
     needs: [release]
     if: needs.release.outputs.release_created
     strategy:
+      fail-fast: false
       matrix:
         include:
+          - python-version: "3.10"
+            is-default: false
+          - python-version: "3.11"
+            is-default: false
           - python-version: "3.12"
             is-default: true
     steps:
@@ -226,22 +310,25 @@ jobs:
           platforms: linux/amd64
           push: true
           tags: ${{ steps.tags.outputs.tags }}
-          cache-from: type=gha,scope=gpu
-          cache-to: type=gha,mode=max,scope=gpu
+          build-args: |
+            PYTHON_VERSION=${{ matrix.python-version }}
+          cache-from: type=gha,scope=gpu-py${{ matrix.python-version }}
+          cache-to: type=gha,mode=max,scope=gpu-py${{ matrix.python-version }}
 
   docker-prod-cpu:
     runs-on: ubuntu-latest
     needs: [release]
     if: needs.release.outputs.release_created
     strategy:
+      fail-fast: false
       matrix:
         include:
           - python-version: "3.10"
             is-default: false
           - python-version: "3.11"
-            is-default: true
-          - python-version: "3.12"
             is-default: false
+          - python-version: "3.12"
+            is-default: true
     steps:
       - name: Clear Space
         run: |
306393
needs: [release]
307394
if: needs.release.outputs.release_created
308395
strategy:
396+
fail-fast: false
309397
matrix:
310398
include:
399+
- python-version: "3.10"
400+
is-default: false
401+
- python-version: "3.11"
402+
is-default: false
311403
- python-version: "3.12"
312404
is-default: true
313405
steps:
@@ -364,22 +456,25 @@ jobs:
           platforms: linux/amd64
           push: true
           tags: ${{ steps.tags.outputs.tags }}
-          cache-from: type=gha,scope=lb
-          cache-to: type=gha,mode=max,scope=lb
+          build-args: |
+            PYTHON_VERSION=${{ matrix.python-version }}
+          cache-from: type=gha,scope=lb-py${{ matrix.python-version }}
+          cache-to: type=gha,mode=max,scope=lb-py${{ matrix.python-version }}
 
   docker-prod-lb-cpu:
     runs-on: ubuntu-latest
     needs: [release]
     if: needs.release.outputs.release_created
     strategy:
+      fail-fast: false
       matrix:
         include:
           - python-version: "3.10"
             is-default: false
           - python-version: "3.11"
-            is-default: true
-          - python-version: "3.12"
             is-default: false
+          - python-version: "3.12"
+            is-default: true
     steps:
       - name: Clear Space
         run: |
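
The step that produces `steps.tags.outputs.tags` is outside the hunks shown here, so the exact tag list is not visible. As a rough sketch of how the ``py${VERSION}-${TAG}`` scheme from the commit message could combine with the matrix's `is-default` flag: the helper name `image_tags`, its parameters, and in particular the idea that the default version also receives an unprefixed tag are assumptions for illustration, not something this diff confirms.

```python
def image_tags(repo, release_tag, python_version, is_default):
    """Build the tag list for one matrix entry (hypothetical helper)."""
    # Every image gets a version-qualified tag: py${VERSION}-${TAG}.
    tags = [f"{repo}:py{python_version}-{release_tag}"]
    if is_default:
        # ASSUMPTION: the default Python version additionally gets the
        # plain release tag so unqualified pulls resolve to it.
        tags.append(f"{repo}:{release_tag}")
    return tags


# Mirrors the matrix above: 3.12 is the default, 3.10 and 3.11 are not.
for version, default in [("3.10", False), ("3.11", False), ("3.12", True)]:
    print(image_tags("example/flash", "1.2.3", version, default))
```

Under that assumption, moving `is-default: true` from 3.11 to 3.12 in the CPU jobs would change which interpreter an unqualified tag resolves to, which is why the commit aligns it with the GPU jobs.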

Dockerfile

Lines changed: 30 additions & 14 deletions
@@ -1,15 +1,30 @@
-# Base image provides Python 3.12 (from runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2204)
+# Base image provides Python 3.9-3.13 via deadsnakes; only 3.12 has torch
+# pre-installed. For 3.10 and 3.11 we reinstall torch from the CUDA 12.8
+# wheel index (~7 GB overhead) and repoint /usr/local/bin/python so the
+# worker CMD picks up the correct interpreter.
 FROM runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2204
 
-# Use the base image's Python as-is to preserve pre-installed packages (torch, cuda libs).
-# The pytorch base image provides its own Python with torch already installed.
-# Symlinking to /usr/bin/python3.X would switch to a bare system Python without torch.
-# Validate that the base image provides the expected Python version.
-ARG EXPECTED_PYTHON_VERSION=3.12
-RUN python --version && \
-    actual=$(python -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')") && \
-    if [ "$actual" != "$EXPECTED_PYTHON_VERSION" ]; then \
-        echo "ERROR: Expected Python $EXPECTED_PYTHON_VERSION but base image provides $actual" && exit 1; \
+# Target Python version for the worker runtime.
+ARG PYTHON_VERSION=3.12
+ARG TORCH_VERSION=2.9.1+cu128
+ARG TORCH_INDEX_URL=https://download.pytorch.org/whl/cu128
+
+# Expose the target version to the running worker for startup validation.
+ENV FLASH_PYTHON_VERSION=${PYTHON_VERSION}
+
+# Validate the base image provides the requested interpreter and activate it.
+# For non-3.12 targets, install torch for the selected Python and repoint
+# /usr/local/bin/python and python3 so downstream `python` invocations use it.
+# For 3.12 we keep the base image's python/torch untouched to avoid the
+# ~7 GB reinstall cost.
+RUN python${PYTHON_VERSION} --version \
+    && if [ "${PYTHON_VERSION}" != "3.12" ]; then \
+        python${PYTHON_VERSION} -m ensurepip --upgrade \
+        && python${PYTHON_VERSION} -m pip install --no-cache-dir \
+            --index-url ${TORCH_INDEX_URL} \
+            "torch==${TORCH_VERSION}" \
+        && ln -sf "$(which python${PYTHON_VERSION})" /usr/local/bin/python \
+        && ln -sf "$(which python${PYTHON_VERSION})" /usr/local/bin/python3; \
     fi
 
 WORKDIR /app
@@ -41,20 +56,21 @@ RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get install -y --no-ins
     && rm -rf /var/lib/apt/lists/*
 
 # Copy app code and install dependencies
-# Use --python to target the base image's Python (preserves torch in its site-packages)
+# Use --python to target the active interpreter (preserves torch in its site-packages)
 COPY README.md pyproject.toml uv.lock ./
 COPY src/ ./
 RUN uv export --format requirements-txt --no-dev --no-hashes > requirements.txt \
     && uv pip install --python $(which python) --break-system-packages -r requirements.txt
 
-# Install numpy for the base image's Python version.
+# Install numpy for the active Python version.
 # The runpod/pytorch image ships torch but not numpy. Flash build excludes numpy
 # from tarballs (BASE_IMAGE_PACKAGES) to save tarball space (~30 MB), so numpy
 # must be provided here in the base image.
 RUN python -m pip install --no-cache-dir numpy
 
-# Verify torch and numpy are available from the base image
-RUN python -c "import torch; print(f'torch {torch.__version__} CUDA {torch.cuda.is_available()}')" \
+# Verify torch, numpy, and the expected Python version are available.
+RUN python -c "import sys; actual = f'{sys.version_info.major}.{sys.version_info.minor}'; expected = '${PYTHON_VERSION}'; assert actual == expected, f'Expected Python {expected}, got {actual}'; print(f'Python {actual} OK')" \
+    && python -c "import torch; print(f'torch {torch.__version__} CUDA {torch.cuda.is_available()}')" \
     && python -c "import numpy; print(f'numpy {numpy.__version__}')"
 
 CMD ["python", "handler.py"]
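
To make the branch in the first `RUN` step easier to follow, here is a small sketch that lists the shell commands the Dockerfile would effectively run for a given PYTHON_VERSION. The function name `gpu_setup_commands` and the `FAST_PATH_VERSION` constant are hypothetical; the commands and default values are copied from the hunk above, and nothing here is executed against a real image.

```python
FAST_PATH_VERSION = "3.12"  # the only version with torch pre-installed in the base image


def gpu_setup_commands(
    python_version,
    torch_version="2.9.1+cu128",
    index_url="https://download.pytorch.org/whl/cu128",
):
    """Return the interpreter setup commands for one build (illustrative only)."""
    py = f"python{python_version}"
    # Always confirm the base image ships the requested interpreter.
    commands = [f"{py} --version"]
    if python_version != FAST_PATH_VERSION:
        # Non-3.12 targets: bootstrap pip, install torch for that interpreter,
        # then repoint `python` and `python3` so the worker CMD uses it.
        commands += [
            f"{py} -m ensurepip --upgrade",
            f'{py} -m pip install --no-cache-dir --index-url {index_url} "torch=={torch_version}"',
            f'ln -sf "$(which {py})" /usr/local/bin/python',
            f'ln -sf "$(which {py})" /usr/local/bin/python3',
        ]
    return commands


for cmd in gpu_setup_commands("3.10"):
    print(cmd)
```

For 3.12 the list collapses to the single `--version` probe, which is the fast path the commit message refers to.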

Dockerfile-cpu

Lines changed: 5 additions & 0 deletions
@@ -1,6 +1,11 @@
 ARG PYTHON_VERSION=3.12
 FROM python:${PYTHON_VERSION}-slim
 
+# Re-declare after FROM so the value is visible in this build stage, and
+# expose it at runtime for the worker's startup version check.
+ARG PYTHON_VERSION
+ENV FLASH_PYTHON_VERSION=${PYTHON_VERSION}
+
 WORKDIR /app
 
 # Prevent interactive prompts during package installation

Dockerfile-lb

Lines changed: 30 additions & 12 deletions
@@ -1,13 +1,30 @@
-# Base image provides Python 3.12 (from runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2204)
+# Base image provides Python 3.9-3.13 via deadsnakes; only 3.12 has torch
+# pre-installed. For 3.10 and 3.11 we reinstall torch from the CUDA 12.8
+# wheel index (~7 GB overhead) and repoint /usr/local/bin/python so the
+# worker CMD picks up the correct interpreter.
 FROM runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2204
 
-# Use the base image's Python as-is to preserve pre-installed packages (torch, cuda libs).
-# Validate that the base image provides the expected Python version.
-ARG EXPECTED_PYTHON_VERSION=3.12
-RUN python --version && \
-    actual=$(python -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')") && \
-    if [ "$actual" != "$EXPECTED_PYTHON_VERSION" ]; then \
-        echo "ERROR: Expected Python $EXPECTED_PYTHON_VERSION but base image provides $actual" && exit 1; \
+# Target Python version for the worker runtime.
+ARG PYTHON_VERSION=3.12
+ARG TORCH_VERSION=2.9.1+cu128
+ARG TORCH_INDEX_URL=https://download.pytorch.org/whl/cu128
+
+# Expose the target version to the running worker for startup validation.
+ENV FLASH_PYTHON_VERSION=${PYTHON_VERSION}
+
+# Validate the base image provides the requested interpreter and activate it.
+# For non-3.12 targets, install torch for the selected Python and repoint
+# /usr/local/bin/python and python3 so downstream `python` invocations use it.
+# For 3.12 we keep the base image's python/torch untouched to avoid the
+# ~7 GB reinstall cost.
+RUN python${PYTHON_VERSION} --version \
+    && if [ "${PYTHON_VERSION}" != "3.12" ]; then \
+        python${PYTHON_VERSION} -m ensurepip --upgrade \
+        && python${PYTHON_VERSION} -m pip install --no-cache-dir \
+            --index-url ${TORCH_INDEX_URL} \
+            "torch==${TORCH_VERSION}" \
+        && ln -sf "$(which python${PYTHON_VERSION})" /usr/local/bin/python \
+        && ln -sf "$(which python${PYTHON_VERSION})" /usr/local/bin/python3; \
     fi
 
 WORKDIR /app
@@ -39,20 +56,21 @@ RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get install -y --no-ins
     && rm -rf /var/lib/apt/lists/*
 
 # Copy app code and install dependencies
-# Use --python to target the base image's Python (preserves torch in its site-packages)
+# Use --python to target the active interpreter (preserves torch in its site-packages)
 COPY README.md pyproject.toml uv.lock ./
 COPY src/ ./
 RUN uv export --format requirements-txt --no-dev --no-hashes > requirements.txt \
     && uv pip install --python $(which python) --break-system-packages -r requirements.txt
 
-# Install numpy for the base image's Python version.
+# Install numpy for the active Python version.
 # The runpod/pytorch image ships torch but not numpy. Flash build excludes numpy
 # from tarballs (BASE_IMAGE_PACKAGES) to save tarball space (~30 MB), so numpy
 # must be provided here in the base image.
 RUN python -m pip install --no-cache-dir numpy
 
-# Verify torch and numpy are available from the base image
-RUN python -c "import torch; print(f'torch {torch.__version__} CUDA {torch.cuda.is_available()}')" \
+# Verify torch, numpy, and the expected Python version are available.
+RUN python -c "import sys; actual = f'{sys.version_info.major}.{sys.version_info.minor}'; expected = '${PYTHON_VERSION}'; assert actual == expected, f'Expected Python {expected}, got {actual}'; print(f'Python {actual} OK')" \
+    && python -c "import torch; print(f'torch {torch.__version__} CUDA {torch.cuda.is_available()}')" \
     && python -c "import numpy; print(f'numpy {numpy.__version__}')"
 
 EXPOSE 80

Dockerfile-lb-cpu

Lines changed: 5 additions & 0 deletions
@@ -1,6 +1,11 @@
 ARG PYTHON_VERSION=3.12
 FROM python:${PYTHON_VERSION}-slim
 
+# Re-declare after FROM so the value is visible in this build stage, and
+# expose it at runtime for the worker's startup version check.
+ARG PYTHON_VERSION
+ENV FLASH_PYTHON_VERSION=${PYTHON_VERSION}
+
 WORKDIR /app
 
 # Prevent interactive prompts during package installation
