perf(docker): cache cargo+ccache and drop redundant maturin step (-51% rebuild on .py change)#1885
Conversation
The two heavy RUN steps in py-builder (uv sync + maturin build) re-execute
on every Python source change because the upstream COPY layer for openviking/
invalidates the cache. Each rerun was ~510s + ~115s ≈ 10 min of wasted work
even though Rust/C++ source was unchanged.
Add BuildKit cache mounts so cargo and the C++ engine compilation can skip
work whose inputs are unchanged:
- Mount /cargo-target, cargo registry, and cargo git so cargo's incremental
build artifacts persist across layer reruns. Pin CARGO_TARGET_DIR so the
path stays stable when uv builds wheels in ephemeral isolated tempdirs.
- Install ccache and prepend /usr/lib/ccache to PATH so cmake (which calls
shutil.which("gcc")) resolves the ccache wrapper. ccache is path-agnostic,
so it benefits the cmake_build subdir even though setup.py recreates it
in a fresh tempdir each wheel build.
- Mount /root/.ccache so the ccache hash store persists across reruns.
Expected: hot rebuilds on Python-only changes drop step 15 from ~510s to
~60-120s (uv wheel packaging overhead remains; cargo + g++ skip on cache hit).
The second RUN step in py-builder built ragfs-python a second time and extracted its .so into the installed openviking package. This was redundant: setup.py's build_ragfs_python_artifact() already runs maturin during step 15 (uv sync --no-editable), and because build_meta passes 'bdist_wheel' through PEP 517, _should_require_ragfs_artifact() returns True and the build fails closed if maturin can't produce ragfs_python.so. The .so is then bundled into the wheel via package_data and installed into /app/.venv on wheel install. The second step's only effect was to overwrite the same file, costing ~115s per build. Verified after the fact by inspecting the installed venv and importing ragfs_python in the runtime container.
PR Reviewer Guide 🔍(Review updated until commit 9523b9e)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨No code suggestions found for the PR. |
|
Persistent review updated to latest commit 9523b9e |
PR Code Suggestions ✨No code suggestions found for the PR. |
There was a problem hiding this comment.
Pull request overview
This PR optimizes Docker image rebuild performance by adding BuildKit cache mounts for Rust/C++ compilation artifacts in the py-builder stage and removing a now-redundant post-install maturin build step, relying instead on the existing wheel-build contract in setup.py/pyproject.toml.
Changes:
- Add BuildKit cache mounts for Cargo target/registry/git and
ccache, and pinCARGO_TARGET_DIRto a stable path to maximize cache hits across rebuilds. - Install and enable
ccacheviaPATHso CMake-invokedgcc/g++calls use the ccache wrappers. - Delete the second
maturin buildstep (previously extracting the.sointo the venv), since wheel builds already require and bundle theragfs_pythonartifact.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Description
Speed up Docker rebuilds when only Python source changes by adding BuildKit cache mounts to the heavy
py-builderstage and removing a redundant secondmaturin buildstep.In the current Dockerfile, any change inside
openviking/invalidates step 11 (COPY openviking/ openviking/), which cascades through the two heavyRUNsteps:uv sync --no-editable) — triggerssetup.pyto rebuild theovRust CLI,ragfs-python(via maturin), and the C++ vector engine via cmake.ragfs-pythona second time, overwriting the.soalready installed in the venv.Neither rerun benefits from any persistent cache, so a one-line
.pyedit costs ~10 minutes of native compilation that produces output identical to the previous build.Related Issue
N/A
Type of Change
Changes Made
/cargo-target,/usr/local/cargo/registry,/usr/local/cargo/git, and/root/.ccacheso cargo and gcc/g++ reuse work across rebuilds.CARGO_TARGET_DIR=/cargo-targetso the path stays stable when uv builds wheels in ephemeral isolated tempdirs.ccacheand prepend/usr/lib/ccachetoPATHso cmake'sshutil.which("gcc")resolves to the ccache wrapper.setup.py'sbuild_ragfs_python_artifactalready runs maturin during step 15 (uv invokes setuptoolsbuild_metawithbdist_wheelin argv, so_should_require_ragfs_artifact()returns True and the build fails closed if the.socan't be produced). The.sois bundled into the wheel viapackage_dataand installed into/app/.venvon wheel install.Why step 16 was introduced — and why it became redundant
Step 16 was introduced in #1221 (the agfs → ragfs(Rust) rewrite) and was correct at that time. Two follow-up PRs since then made it obsolete:
pyproject.toml'sbuild-system.requiresdoes not listmaturin, andsetup.py'sbuild_ragfs_python_artifactsilently skips whenshutil.which(\"maturin\")returnsNone. So step 15's isolated wheel build env has no maturin → ragfs is skipped → the wheel contains no.so→ runtime import would fail. Step 16 is the necessary fallback: explicitly install maturin, build ragfs, extract the.sointo the installed venv.\"maturin>=1.0,<2.0\"topyproject.toml'sbuild-system.requires(commit message: "build: move ragfs-python packaging into setup.py"). From this point on, step 15's isolated build env already has maturin, sobuild_ragfs_python_artifactruns successfully inside the wheel build. Step 16 becomes a no-op overwrite of the.sothat step 15 already installed._should_require_ragfs_artifact()so wheel builds fail closed ifragfs_python.abi3.socan't be produced. This upgrades "step 15 produces the .so" from an empirical observation to a code-enforced contract.So step 16's original safety role has been migrated into
setup.py+pyproject.tomlsince #1307, in a strictly more robust way (the wheel itself fails closed; no separate post-install patching needed). This PR removes the now-redundant fallback.Testing
End-to-end verified on the deployment build server (ARM Neoverse-N1, 4 vCPU @ 2.0 GHz, QEMU virt, Debian 13 aarch64 host, LXC container with 12 GiB RAM):
Local sanity check on a developer laptop (Apple M4 Pro, 14-core / 48 GiB host, Linux VM via colima with 14 vCPU / 16 GiB):
docker builder prune -a)Runtime correctness (built image actually starts and serves):
ls /app/.venv/lib/python3.13/site-packages/openviking/lib/→ragfs_python.abi3.so(11.8 MB) is present, written by step 15's wheel install (no longer overwritten by the deleted step 16).python -c \"from openviking.lib import ragfs_python; print(ragfs_python.__file__)\"resolves to the venv path with no errors.docker compose up -d+curl /health→{\"status\":\"ok\",\"healthy\":true,\"version\":\"0.3.15.devN\",\"auth_mode\":\"api_key\"}.I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Dockerfile.I have tested this on the following platforms:
Checklist
Additional Notes
Cache mounts were chosen over restructuring
COPYorder because:COPYsplit +uv sync --reinstall-package openvikingdoes not avoid recompilation:setup.py'sbuild_extension(cmake-driven C++ engine build) does not honor any skip env var, so a Python-only reinstall would still rerun cmake. Adding such a flag would expand the diff intosetup.pyand increase review surface.The largest remaining cost inside step 15 is uv wheel packaging + cmake configure (~3–4 min on the build server, ~80 s locally) that cache mounts cannot address. Reaching ~1 minute hot rebuilds would require switching to an editable install or splitting native vs Python source into separate
COPYlayers, which is out of scope for this PR.