Skip to content

Split llama-stack container build into deps and app layers for faster rebuilds#1895

Open
Jazzcort wants to merge 1 commit into
lightspeed-core:mainfrom
Jazzcort:speed-up-dev-build
Open

Split llama-stack container build into deps and app layers for faster rebuilds#1895
Jazzcort wants to merge 1 commit into
lightspeed-core:mainfrom
Jazzcort:speed-up-dev-build

Conversation

@Jazzcort

@Jazzcort Jazzcort commented Jun 10, 2026

Copy link
Copy Markdown

Description

Separate the monolithic container image into a deps base image (test.deps.containerfile) that installs system packages and Python dependencies, and a thin app image (test.containerfile) that only copies source code on top. The Makefile tracks a hash of pyproject.toml and uv.lock to skip rebuilding the deps image when dependencies haven't changed, making source-only iterations significantly faster.

Remove existing images by name before building replacements to prevent dangling : images from accumulating and wasting disk space.

make run takes around 1 min to build the developer build which was around 2.5 mins before (mac M3 pro)

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement
  • Benchmarks improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: claude-opus-4.6
  • Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  1. If the dependencies are not changed, make run should just rebuild the app image and should take less time than before.
  2. If the dependencies are changed, make run should rebuild both deps and app images.
  3. There shouldn't be any dangling none:none images after executing make run for multiple times.

Summary by CodeRabbit

  • Chores
    • Optimized build infrastructure with improved container architecture and automated dependency change detection for streamlined, more efficient application deployments.

… rebuilds

Separate the monolithic container image into a deps base image
(test.deps.containerfile) that installs system packages and Python
dependencies, and a thin app image (test.containerfile) that only copies
source code on top. The Makefile tracks a hash of pyproject.toml and
uv.lock to skip rebuilding the deps image when dependencies haven't
changed, making source-only iterations significantly faster.

Remove existing images by name before building replacements to prevent
dangling <none>:<none> images from accumulating and wasting disk space.
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

The PR introduces a multi-stage container build strategy that separates Python dependencies into a reusable "deps" image. The Makefile detects dependency changes via hash comparison and rebuilds the deps layer only when needed, while the app container is refactored to use the deps image as its base, copying only application code.

Changes

Container layering and build orchestration

Layer / File(s) Summary
Makefile: Deps image orchestration and lifecycle
Makefile, .gitignore
Adds LLAMA_STACK_DEPS_IMAGE variable and hash tracking via DEPS_HASH_FILE. Introduces build-llama-stack-deps-image and ensure-llama-stack-deps-image targets with container-runtime checks and conditional rebuilds. Updates build-llama-stack-image to depend on deps readiness and pass DEPS_IMAGE build arg. Updates start-llama-stack-container to clean prior state before starting. Expands clean-llama-stack to remove both app and deps images plus the hash file. Gitignore excludes the auto-generated .llama-stack-deps.hash.
Deps container image definition
deploy/llama-stack/test.deps.containerfile
Builds a reusable deps layer from ubi9/python-312: installs build tools, installs uv, copies dependency files and version metadata, runs uv sync --locked for the llslibdev group without installing the project, configures virtualenv/PYTHONPATH/HOME, creates required .llama directories, and runs as non-root user 1001.
App container refactored for deps-image layering
deploy/llama-stack/test.containerfile
Replaces the monolithic in-container build: now accepts DEPS_IMAGE argument and builds on top of it instead of from scratch. Removes OS/build-tools/uv/dependency-sync steps. Copies only application source and runtime configuration files. Sets executable permissions and corrects file ownership on the thin app layer.

Sequence Diagram

sequenceDiagram
  participant Makefile
  participant DepsImage as Deps Builder
  participant AppImage as App Builder
  participant Registry as Image Registry

  Makefile->>Makefile: Compute CURRENT_DEPS_HASH<br/>from pyproject.toml, uv.lock
  Makefile->>Makefile: Load STORED_DEPS_HASH<br/>from .llama-stack-deps.hash
  alt Hash mismatch or no stored hash
    Makefile->>Registry: Remove old deps image
    Makefile->>DepsImage: Build deps image<br/>python 3.12 + uv + locked deps
    DepsImage->>Registry: Push LLAMA_STACK_DEPS_IMAGE
    Makefile->>Makefile: Persist CURRENT_DEPS_HASH
  end
  Makefile->>AppImage: Build app image<br/>FROM ${DEPS_IMAGE}
  AppImage->>AppImage: Copy src/ and config files
  AppImage->>Registry: Push LLAMA_STACK_IMAGE
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • lightspeed-core/lightspeed-stack#1760: Updates the Makefile container workflow targets (build-llama-stack-image, start-llama-stack-container, clean-llama-stack) that serve as the foundation for this PR's deps-image layering strategy.

Suggested reviewers

  • tisnik
  • radofuchs
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the main change: splitting the container build into separate deps and app layers to improve rebuild performance.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Makefile`:
- Around line 42-45: Makefile currently removes $(LLAMA_STACK_DEPS_IMAGE) (and
similarly the app image at the other removal) before building which sacrifices a
fallback on build failure; change the flow so the rmi of
$(LLAMA_STACK_DEPS_IMAGE) (and $(LLAMA_STACK_APP_IMAGE) at the other site)
happens only after a successful build—e.g., build the new image into a
temporary/tagged name, verify build success, then remove the old image with
$(CONTAINER_RUNTIME) rmi—alternatively add a dedicated cleanup target (invoked
manually or in CI) to accept dangling images during development; update the
removal commands referenced by the Makefile targets that currently use
$(CONTAINER_RUNTIME) image inspect ... rmi to follow this atomic replacement
pattern.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: bfaef2ec-afc0-46f9-898b-7844a7dc6150

📥 Commits

Reviewing files that changed from the base of the PR and between be591ad and ca96e96.

📒 Files selected for processing (4)
  • .gitignore
  • Makefile
  • deploy/llama-stack/test.containerfile
  • deploy/llama-stack/test.deps.containerfile
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: E2E Tests for Lightspeed Evaluation job
  • GitHub Check: E2E: library mode / ci / group 1
  • GitHub Check: E2E: library mode / ci / group 3
  • GitHub Check: E2E: library mode / ci / group 2
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2026-04-15T18:53:54.785Z
Learnt from: Lifto
Repo: lightspeed-core/lightspeed-stack PR: 1510
File: src/models/requests.py:742-769
Timestamp: 2026-04-15T18:53:54.785Z
Learning: In `src/models/requests.py`, keep `ResponsesRequest.validate_body_size` using `json.dumps(values)` with Python defaults (including ASCII escaping and the default separators). This default formatting is intentional to make the 65,536-character limit conservatively strict (accounting for small encoding/format overhead). Do not recommend changing it to `ensure_ascii=False` or using compact separators for this validator, since an exact wire-format byte match with the client payload is not achievable via `json.dumps` anyway.

Applied to files:

  • .gitignore
📚 Learning: 2026-04-20T15:09:45.119Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1548
File: src/app/endpoints/rlsapi_v1.py:56-56
Timestamp: 2026-04-20T15:09:45.119Z
Learning: In `src/app/endpoints/rlsapi_v1.py`, treat the line `_get_rh_identity_context = get_rh_identity_context` as an intentional temporary backward-compatibility shim (introduced for PR `#1548`, Splunk HEC telemetry work). Do not flag it as dead/unnecessary/unused code during review until the planned part 3 is merged and the responses endpoint is fully wired up such that no tests or external consumers reference the underscore-prefixed name.

Applied to files:

  • .gitignore
🪛 checkmake (0.3.2)
Makefile

[warning] 36-36: Target body for "build-llama-stack-deps-image" exceeds allowed length of 5 lines (12).

(maxbodylength)


[warning] 50-50: Target body for "ensure-llama-stack-deps-image" exceeds allowed length of 5 lines (13).

(maxbodylength)


[warning] 65-65: Target body for "build-llama-stack-image" exceeds allowed length of 5 lines (8).

(maxbodylength)


[warning] 97-97: Target body for "start-llama-stack-container" exceeds allowed length of 5 lines (48).

(maxbodylength)


[warning] 163-163: Target body for "clean-llama-stack" exceeds allowed length of 5 lines (9).

(maxbodylength)

🔇 Additional comments (17)
.gitignore (1)

198-199: LGTM!

Makefile (5)

26-26: LGTM!


50-63: LGTM!


97-97: LGTM!


163-172: LGTM!


21-24: Harden deps hash portability in Makefile
Makefile lines 21-24 use shasum -a 256; shasum isn’t guaranteed to exist on all target systems (many only provide sha256sum). Add a fallback (or switch to sha256sum) and avoid silent empty/incorrect hashes so dependency change detection can’t degrade unexpectedly.

deploy/llama-stack/test.deps.containerfile (8)

1-10: LGTM!


16-19: LGTM!


21-22: LGTM!


24-29: LGTM!


31-32: LGTM!


34-39: LGTM!


41-41: LGTM!


12-14: Security: avoid unpinned curl | sh for UV installer in test containerfile

In deploy/llama-stack/test.deps.containerfile (lines 12-14), RUN curl -LsSf https://astral.sh/uv/install.sh | sh downloads and executes a remote installer without checksum/signature verification and without pinning the installer/version. Consider pinning the UV installer to a specific release and verifying the installer/archive content before execution to reduce supply-chain risk.

deploy/llama-stack/test.containerfile (3)

1-4: LGTM!


6-17: LGTM!


19-22: LGTM!

Comment thread Makefile
Comment on lines +42 to +45
@if $(CONTAINER_RUNTIME) image inspect $(LLAMA_STACK_DEPS_IMAGE) >/dev/null 2>&1; then \
echo "Removing existing deps image to avoid dangling images..."; \
$(CONTAINER_RUNTIME) rmi $(LLAMA_STACK_DEPS_IMAGE); \
fi

@coderabbitai coderabbitai Bot Jun 10, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Both image removals share the same atomicity trade-off. Removing images before building (both deps at lines 42-45 and app at lines 67-70) means any build failure leaves no fallback image. This trades build resilience for avoiding dangling <none>:<none> images. During development with frequent build iterations, consider whether atomic replacement (remove only after successful build) or accepting dangling images with periodic cleanup would better serve reliability.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Makefile` around lines 42 - 45, Makefile currently removes
$(LLAMA_STACK_DEPS_IMAGE) (and similarly the app image at the other removal)
before building which sacrifices a fallback on build failure; change the flow so
the rmi of $(LLAMA_STACK_DEPS_IMAGE) (and $(LLAMA_STACK_APP_IMAGE) at the other
site) happens only after a successful build—e.g., build the new image into a
temporary/tagged name, verify build success, then remove the old image with
$(CONTAINER_RUNTIME) rmi—alternatively add a dedicated cleanup target (invoked
manually or in CI) to accept dangling images during development; update the
removal commands referenced by the Makefile targets that currently use
$(CONTAINER_RUNTIME) image inspect ... rmi to follow this atomic replacement
pattern.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a fallback image if we're building a developer build for testing? 🤔 If the build fails, we fix the problem and build it again. Cleaning those dangling images is just too tedious and they are huge.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point, @Jazzcort. For a developer build workflow, the "fix and rebuild" loop is the natural recovery path — there's no meaningful benefit to keeping a potentially stale fallback image around. Removing it upfront to avoid accumulating large dangling images is the right trade-off here. I'll withdraw this concern.

(┳◇┳)🐇


✏️ Learnings added
Learnt from: Jazzcort
Repo: lightspeed-core/lightspeed-stack PR: 1895
File: Makefile:42-45
Timestamp: 2026-06-10T18:11:00.333Z
Learning: In the `lightspeed-stack` repository (`Makefile`), the `build-llama-stack-deps-image` and `build-llama-stack-image` targets intentionally remove existing images *before* building (rather than atomic replacement after a successful build). This is deliberate for the developer/test workflow: the expected recovery from a build failure is to fix the issue and rebuild, so a fallback image provides no value; removing first avoids accumulating large dangling `<none>:<none>` images.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@Jazzcort

Jazzcort commented Jun 10, 2026

Copy link
Copy Markdown
Author

This patch breaks the CI process in E2E server mode test. Should be a easy fix. I'll fix it later. 😁

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant