Split llama-stack container build into deps and app layers for faster rebuilds#1895
Split llama-stack container build into deps and app layers for faster rebuilds#1895Jazzcort wants to merge 1 commit into
Conversation
… rebuilds Separate the monolithic container image into a deps base image (test.deps.containerfile) that installs system packages and Python dependencies, and a thin app image (test.containerfile) that only copies source code on top. The Makefile tracks a hash of pyproject.toml and uv.lock to skip rebuilding the deps image when dependencies haven't changed, making source-only iterations significantly faster. Remove existing images by name before building replacements to prevent dangling <none>:<none> images from accumulating and wasting disk space.
WalkthroughThe PR introduces a multi-stage container build strategy that separates Python dependencies into a reusable "deps" image. The Makefile detects dependency changes via hash comparison and rebuilds the deps layer only when needed, while the app container is refactored to use the deps image as its base, copying only application code. ChangesContainer layering and build orchestration
Sequence DiagramsequenceDiagram
participant Makefile
participant DepsImage as Deps Builder
participant AppImage as App Builder
participant Registry as Image Registry
Makefile->>Makefile: Compute CURRENT_DEPS_HASH<br/>from pyproject.toml, uv.lock
Makefile->>Makefile: Load STORED_DEPS_HASH<br/>from .llama-stack-deps.hash
alt Hash mismatch or no stored hash
Makefile->>Registry: Remove old deps image
Makefile->>DepsImage: Build deps image<br/>python 3.12 + uv + locked deps
DepsImage->>Registry: Push LLAMA_STACK_DEPS_IMAGE
Makefile->>Makefile: Persist CURRENT_DEPS_HASH
end
Makefile->>AppImage: Build app image<br/>FROM ${DEPS_IMAGE}
AppImage->>AppImage: Copy src/ and config files
AppImage->>Registry: Push LLAMA_STACK_IMAGE
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@Makefile`:
- Around line 42-45: Makefile currently removes $(LLAMA_STACK_DEPS_IMAGE) (and
similarly the app image at the other removal) before building which sacrifices a
fallback on build failure; change the flow so the rmi of
$(LLAMA_STACK_DEPS_IMAGE) (and $(LLAMA_STACK_APP_IMAGE) at the other site)
happens only after a successful build—e.g., build the new image into a
temporary/tagged name, verify build success, then remove the old image with
$(CONTAINER_RUNTIME) rmi—alternatively add a dedicated cleanup target (invoked
manually or in CI) to accept dangling images during development; update the
removal commands referenced by the Makefile targets that currently use
$(CONTAINER_RUNTIME) image inspect ... rmi to follow this atomic replacement
pattern.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: bfaef2ec-afc0-46f9-898b-7844a7dc6150
📒 Files selected for processing (4)
.gitignoreMakefiledeploy/llama-stack/test.containerfiledeploy/llama-stack/test.deps.containerfile
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: E2E Tests for Lightspeed Evaluation job
- GitHub Check: E2E: library mode / ci / group 1
- GitHub Check: E2E: library mode / ci / group 3
- GitHub Check: E2E: library mode / ci / group 2
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2026-04-15T18:53:54.785Z
Learnt from: Lifto
Repo: lightspeed-core/lightspeed-stack PR: 1510
File: src/models/requests.py:742-769
Timestamp: 2026-04-15T18:53:54.785Z
Learning: In `src/models/requests.py`, keep `ResponsesRequest.validate_body_size` using `json.dumps(values)` with Python defaults (including ASCII escaping and the default separators). This default formatting is intentional to make the 65,536-character limit conservatively strict (accounting for small encoding/format overhead). Do not recommend changing it to `ensure_ascii=False` or using compact separators for this validator, since an exact wire-format byte match with the client payload is not achievable via `json.dumps` anyway.
Applied to files:
.gitignore
📚 Learning: 2026-04-20T15:09:45.119Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1548
File: src/app/endpoints/rlsapi_v1.py:56-56
Timestamp: 2026-04-20T15:09:45.119Z
Learning: In `src/app/endpoints/rlsapi_v1.py`, treat the line `_get_rh_identity_context = get_rh_identity_context` as an intentional temporary backward-compatibility shim (introduced for PR `#1548`, Splunk HEC telemetry work). Do not flag it as dead/unnecessary/unused code during review until the planned part 3 is merged and the responses endpoint is fully wired up such that no tests or external consumers reference the underscore-prefixed name.
Applied to files:
.gitignore
🪛 checkmake (0.3.2)
Makefile
[warning] 36-36: Target body for "build-llama-stack-deps-image" exceeds allowed length of 5 lines (12).
(maxbodylength)
[warning] 50-50: Target body for "ensure-llama-stack-deps-image" exceeds allowed length of 5 lines (13).
(maxbodylength)
[warning] 65-65: Target body for "build-llama-stack-image" exceeds allowed length of 5 lines (8).
(maxbodylength)
[warning] 97-97: Target body for "start-llama-stack-container" exceeds allowed length of 5 lines (48).
(maxbodylength)
[warning] 163-163: Target body for "clean-llama-stack" exceeds allowed length of 5 lines (9).
(maxbodylength)
🔇 Additional comments (17)
.gitignore (1)
198-199: LGTM!Makefile (5)
26-26: LGTM!
50-63: LGTM!
97-97: LGTM!
163-172: LGTM!
21-24: Harden deps hash portability in Makefile
Makefile lines 21-24 useshasum -a 256;shasumisn’t guaranteed to exist on all target systems (many only providesha256sum). Add a fallback (or switch tosha256sum) and avoid silent empty/incorrect hashes so dependency change detection can’t degrade unexpectedly.deploy/llama-stack/test.deps.containerfile (8)
1-10: LGTM!
16-19: LGTM!
21-22: LGTM!
24-29: LGTM!
31-32: LGTM!
34-39: LGTM!
41-41: LGTM!
12-14: Security: avoid unpinnedcurl | shfor UV installer in test containerfileIn
deploy/llama-stack/test.deps.containerfile(lines 12-14),RUN curl -LsSf https://astral.sh/uv/install.sh | shdownloads and executes a remote installer without checksum/signature verification and without pinning the installer/version. Consider pinning the UV installer to a specific release and verifying the installer/archive content before execution to reduce supply-chain risk.deploy/llama-stack/test.containerfile (3)
1-4: LGTM!
6-17: LGTM!
19-22: LGTM!
| @if $(CONTAINER_RUNTIME) image inspect $(LLAMA_STACK_DEPS_IMAGE) >/dev/null 2>&1; then \ | ||
| echo "Removing existing deps image to avoid dangling images..."; \ | ||
| $(CONTAINER_RUNTIME) rmi $(LLAMA_STACK_DEPS_IMAGE); \ | ||
| fi |
There was a problem hiding this comment.
Both image removals share the same atomicity trade-off. Removing images before building (both deps at lines 42-45 and app at lines 67-70) means any build failure leaves no fallback image. This trades build resilience for avoiding dangling <none>:<none> images. During development with frequent build iterations, consider whether atomic replacement (remove only after successful build) or accepting dangling images with periodic cleanup would better serve reliability.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Makefile` around lines 42 - 45, Makefile currently removes
$(LLAMA_STACK_DEPS_IMAGE) (and similarly the app image at the other removal)
before building which sacrifices a fallback on build failure; change the flow so
the rmi of $(LLAMA_STACK_DEPS_IMAGE) (and $(LLAMA_STACK_APP_IMAGE) at the other
site) happens only after a successful build—e.g., build the new image into a
temporary/tagged name, verify build success, then remove the old image with
$(CONTAINER_RUNTIME) rmi—alternatively add a dedicated cleanup target (invoked
manually or in CI) to accept dangling images during development; update the
removal commands referenced by the Makefile targets that currently use
$(CONTAINER_RUNTIME) image inspect ... rmi to follow this atomic replacement
pattern.
There was a problem hiding this comment.
Why do we need a fallback image if we're building a developer build for testing? 🤔 If the build fails, we fix the problem and build it again. Cleaning those dangling images is just too tedious and they are huge.
There was a problem hiding this comment.
That's a fair point, @Jazzcort. For a developer build workflow, the "fix and rebuild" loop is the natural recovery path — there's no meaningful benefit to keeping a potentially stale fallback image around. Removing it upfront to avoid accumulating large dangling images is the right trade-off here. I'll withdraw this concern.
(┳◇┳)🐇
✏️ Learnings added
Learnt from: Jazzcort
Repo: lightspeed-core/lightspeed-stack PR: 1895
File: Makefile:42-45
Timestamp: 2026-06-10T18:11:00.333Z
Learning: In the `lightspeed-stack` repository (`Makefile`), the `build-llama-stack-deps-image` and `build-llama-stack-image` targets intentionally remove existing images *before* building (rather than atomic replacement after a successful build). This is deliberate for the developer/test workflow: the expected recovery from a build failure is to fix the issue and rebuild, so a fallback image provides no value; removing first avoids accumulating large dangling `<none>:<none>` images.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
|
This patch breaks the CI process in E2E server mode test. Should be a easy fix. I'll fix it later. 😁 |
Description
Separate the monolithic container image into a deps base image (test.deps.containerfile) that installs system packages and Python dependencies, and a thin app image (test.containerfile) that only copies source code on top. The Makefile tracks a hash of pyproject.toml and uv.lock to skip rebuilding the deps image when dependencies haven't changed, making source-only iterations significantly faster.
Remove existing images by name before building replacements to prevent dangling : images from accumulating and wasting disk space.
make runtakes around 1 min to build the developer build which was around 2.5 mins before (mac M3 pro)Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
make runshould just rebuild the app image and should take less time than before.make runshould rebuild both deps and app images.none:noneimages after executingmake runfor multiple times.Summary by CodeRabbit