37 changes: 37 additions & 0 deletions crates/lib/src/bootc_composefs/repo.rs
@@ -1,3 +1,40 @@
//! Composefs repository lifecycle and OCI pull paths.
//!
//! This module owns how OCI images get into the composefs object store.
//! There are two pull paths, selected by the `use_unified` flag:
//!
//! ## Direct pull (`use_unified = false`)
//!
//! `pull_composefs_direct` fetches from the source transport (registry, OCI
//! dir, etc.) straight into the composefs repo via `composefs_oci::pull` with
//! default options. No containers-storage involvement.
//!
//! ## Unified pull (`use_unified = true`)
//!
//! `pull_composefs_unified` is the two-stage path that populates all three
//! stores (see [`crate::store`] for the architecture overview):
//!
//! **Stage 1** — Pull into bootc-owned containers-storage via
//! `CStorage::pull_with_progress` (or `pull_from_host_storage` if the image
//! already exists in the default podman store, saving a network round-trip).
//!
//! **Stage 2** — `composefs_oci::pull` with `LocalFetchOpt::ZeroCopy` and
//! `storage_root` pointing at the containers-storage directory. composefs-ctl
//! walks the overlay `diff/` directories and FICLONEs each file into the
//! composefs object store keyed by its SHA-512 fsverity digest. On a
//! reflink-capable filesystem this is near-instantaneous and consumes no
//! additional disk space.
//!
//! The caller provides `storage_path` as an absolute filesystem path string
//! (not a `Dir` fd) because composefs-ctl passes it to a child skopeo process.
//! It is derived from the physical root fd by reading the `/proc/self/fd/{fd}` symlink.
//!
//! ## Entry points
//!
//! - [`pull_composefs_repo`] — upgrade/switch on a composefs-booted system.
//! - [`initialize_composefs_repository`] — `bootc install` with the composefs
//! backend.
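
The `storage_path` derivation described above can be illustrated with a minimal sketch. It assumes Linux with `/proc` mounted; `fd_to_path` is a hypothetical helper name used only for illustration, not a function from this PR.

```rust
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;
use std::path::PathBuf;

/// Resolve an open file descriptor to the absolute path it refers to by
/// reading the `/proc/self/fd/<fd>` symlink (Linux only).
/// Hypothetical helper for illustration, not a bootc API.
fn fd_to_path<F: AsRawFd>(f: &F) -> io::Result<PathBuf> {
    std::fs::read_link(format!("/proc/self/fd/{}", f.as_raw_fd()))
}

fn main() -> io::Result<()> {
    // Opening a directory read-only is allowed on Linux.
    let root = File::open("/")?;
    let path = fd_to_path(&root)?;
    println!("{}", path.display());
    Ok(())
}
```

The resulting string is only a snapshot: if the directory is later renamed or unlinked, the path handed to a child process may no longer match the fd.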

use fn_error_context::context;
use std::sync::Arc;

46 changes: 44 additions & 2 deletions crates/lib/src/deploy.rs
@@ -1,6 +1,48 @@
//! # Write deployments merging image with configmap
//! Pull dispatch and deployment staging for the ostree backend.
//!
//! Create a merged filesystem tree with the image and mounted configmaps.
//! ## Planned pull paths
//!
//! The top-level entry point for upgrade/switch will eventually select
//! among three paths based on the `unified` flag and filesystem capability:
//!
//! - **Unified + reflinks** (`unified = true`, XFS/btrfs): `pull_via_composefs_unified`
//! — the planned three-store pipeline. Pulls into containers-storage first, then
//! ZeroCopy into composefs, then synthesizes the ostree commit via FICLONE.
//!   See [`crate::store`] for the architecture overview.
//!
//! - **Non-unified + reflinks** (`unified = false`, XFS/btrfs): `pull_via_composefs`
//! — fetches from registry directly into composefs (no containers-storage),
//! then synthesizes the ostree commit via FICLONE.
//!
//! - **No reflinks** (ext4): `pull` — the legacy ostree-native tar importer
//! (`ostree_container::store::ImageImporter`).
//!
//! ## Planned composefs → ostree synthesis
//!
//! The synthesis plan relies on `import_from_composefs_repo` from
//! `ostree_ext::container::composefs_import` to walk the composefs
//! filesystem tree and for each regular file:
//!
//! 1. Reads uid/gid/mode/xattrs from composefs metadata. SELinux labels are
//! computed in bulk before the walk via `selabel()` and looked up per-file;
//! a NUL terminator is appended because composefs-rs omits it but the kernel
//! stores it.
//! 2. Computes the ostree content checksum in-memory (SHA-256 of
//! `uid:gid:mode:xattrs:file-content`).
//! 3. Issues `ioctl(FICLONE)` from the composefs object fd into a new `O_TMPFILE`
//! in the ostree object directory.
//! 4. Applies metadata (`fchown`, `fchmod`, `fsetxattr`) and links the tmpfile
//! into the ostree content-addressed path.
//!
//! `/etc` is remapped to `usr/etc`; virtual toplevel paths (`proc`, `sys`,
//! `dev`, etc.) are excluded — matching the ostree-container tar importer.
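
The path handling described above (remapping `/etc` and excluding virtual toplevel paths) might look roughly like the sketch below. `remap_for_ostree` and `VIRTUAL_TOPLEVEL` are hypothetical names, and the exclusion list contains only the three directories the text names; the real importer's list is longer.

```rust
/// Toplevel directories treated as virtual in this sketch. Only the three
/// names spelled out in the text are listed (assumption: the real list is longer).
const VIRTUAL_TOPLEVEL: &[&str] = &["proc", "sys", "dev"];

/// Map an image path to its location in the ostree commit, or return `None`
/// if the path lives under an excluded virtual toplevel directory.
/// Hypothetical helper for illustration only.
fn remap_for_ostree(path: &str) -> Option<String> {
    let rel = path.trim_start_matches('/');
    // The first path component decides how the entry is treated.
    let first = rel.split('/').next().unwrap_or("");
    if VIRTUAL_TOPLEVEL.contains(&first) {
        return None;
    }
    if first == "etc" {
        // `/etc/...` is stored under `usr/etc/...`.
        return Some(format!("usr/{}", rel));
    }
    Some(rel.to_string())
}

fn main() {
    for p in ["/etc/passwd", "/usr/bin/ls", "/proc/1/status"] {
        println!("{} -> {:?}", p, remap_for_ostree(p));
    }
}
```

Matching on the whole first component, rather than a string prefix, keeps a directory such as `/etcetera` from being remapped by accident.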
//!
//! ## Auto-detection
//!
//! `image_exists_in_unified_storage` checks whether the target image is already
//! present in bootc-owned containers-storage. Call sites use this to select
//! `unified = true` automatically without requiring an explicit flag from the
//! user once `bootc image set-unified` has been run.
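
As a rough illustration of the planned dispatch and the auto-detection rule, the selection logic could be sketched as follows. The enum, both function names, and the boolean inputs are assumptions made for this example; only the three path names in the comments come from the module docs.

```rust
/// The three planned pull paths, named after the functions in the module docs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PullPath {
    /// `pull_via_composefs_unified`
    ComposefsUnified,
    /// `pull_via_composefs`
    Composefs,
    /// `pull` (legacy ostree-native tar importer)
    LegacyTar,
}

/// Auto-detection: treat the pull as unified if the user asked for it or the
/// image is already present in bootc-owned containers-storage.
/// Hypothetical helper for illustration only.
fn effective_unified(explicit_flag: bool, image_in_unified_storage: bool) -> bool {
    explicit_flag || image_in_unified_storage
}

/// Choose a pull path from the unified setting and filesystem capability.
/// Hypothetical helper for illustration only.
fn select_pull_path(unified: bool, reflinks_supported: bool) -> PullPath {
    match (reflinks_supported, unified) {
        // Without reflinks the legacy importer is used regardless of the flag.
        (false, _) => PullPath::LegacyTar,
        (true, true) => PullPath::ComposefsUnified,
        (true, false) => PullPath::Composefs,
    }
}

fn main() {
    let unified = effective_unified(false, true);
    println!("{:?}", select_pull_path(unified, true));
}
```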

use std::collections::HashSet;
use std::io::{BufRead, Write};
15 changes: 13 additions & 2 deletions crates/lib/src/image.rs
@@ -1,6 +1,17 @@
//! # Controlling bootc-managed images
//!
//! APIs for operating on container images in the bootc storage.
//!
//! ## `bootc image set-unified`
//!
//! `set_unified_entrypoint` dispatches to `set_unified` (ostree backend) or
//! `set_unified_composefs` (composefs backend). Both pull the currently booted
//! image into bootc-owned containers-storage so that future upgrade/switch
//! operations can use the unified storage path.
//!
//! In the planned three-store architecture (see [`crate::store`]), this will
//! require a reflink-capable filesystem (XFS or btrfs) by default to enable
//! block sharing. The planned `--allow-copy` flag will opt into a byte copy
//! for environments like ext4 where podman access to the OS image matters
//! more than disk efficiency.
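
The planned interaction between the reflink requirement and `--allow-copy` can be sketched as a small decision function. `CopyStrategy`, `choose_copy_strategy`, and the error text are hypothetical and exist only to illustrate the rule stated above.

```rust
/// How image data would be moved between stores in this sketch.
#[derive(Debug, PartialEq, Eq)]
enum CopyStrategy {
    /// Share extents via reflink (XFS, btrfs).
    Reflink,
    /// Fall back to a full byte copy (e.g. ext4 with `--allow-copy`).
    ByteCopy,
}

/// Decide how `set-unified` would proceed. Reflinks are required by default;
/// `allow_copy` opts into a byte copy when they are unavailable.
/// Hypothetical helper for illustration only.
fn choose_copy_strategy(reflinks_supported: bool, allow_copy: bool) -> Result<CopyStrategy, String> {
    match (reflinks_supported, allow_copy) {
        (true, _) => Ok(CopyStrategy::Reflink),
        (false, true) => Ok(CopyStrategy::ByteCopy),
        (false, false) => Err(
            "filesystem does not support reflinks; pass --allow-copy to use a byte copy"
                .to_string(),
        ),
    }
}

fn main() {
    println!("{:?}", choose_copy_strategy(false, true));
    println!("{:?}", choose_copy_strategy(false, false));
}
```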

use anyhow::{Context, Result, bail};
use bootc_utils::CommandRunExt;
74 changes: 73 additions & 1 deletion crates/lib/src/store/mod.rs
@@ -1,5 +1,77 @@
//! The [`Storage`] type holds references to three different types of
//! storage:
//! storage that together implement the unified storage model.
//!
//! # Planned three-store architecture
//!
//! The planned architecture for unified storage involves three content stores that
//! share physical disk blocks on a reflink-capable filesystem (XFS, btrfs):
//!
//! 1. **bootc-owned containers-storage** at `/sysroot/ostree/bootc/storage`
//! (overlay driver) — the image is accessible to podman and shares layers
//! with Logically Bound Images.
//! 2. **composefs object store** at `/sysroot/composefs/objects/`
//! (SHA-512 content-addressed) — used by composefs-boot to mount the
//! rootfs as EROFS. Populated from containers-storage via `FICLONE`
//! (`composefs_oci::pull` with `ZeroCopy`).
//! 3. **ostree bare repo** at `/sysroot/ostree/repo/objects/`
//! (SHA-256 content-addressed) — provides deployment, rollback, fsck, and
//! delta updates. Populated from the composefs object store via `FICLONE`
//! (`import_from_composefs_repo`).
//!
//! Each `FICLONE` ioctl lets the kernel mark source and destination extents as
//! copy-on-write siblings with no userspace data movement. On ext4 (no
//! reflinks), each step falls back to a byte copy.
//!
//! ## Implementation plan
//!
//! The containers-storage → composefs step (store 1 → store 2 above) is already implemented
//! for the composefs boot backend in `crates/lib/src/bootc_composefs/repo.rs`
//! via `pull_composefs_unified`.
//!
//! Wiring all three steps together for the ostree backend is the major planned work.
//! The composefs → ostree step (store 2 → store 3 above) was proven by the `composefs-to-ostree`
//! spike branch. The planned implementation for the ostree backend will:
//!
//! 1. Perform a lazy cached probe (`reflinks_supported`) at install time.
//! 2. Pull into containers-storage first (Stage 1).
//! 3. Use `composefs_oci::pull` with `LocalFetchOpt::ZeroCopy` to populate composefs (Stage 2).
//! 4. Finally, synthesize the ostree commit by walking the composefs tree,
//! reading metadata, computing SELinux labels, computing the ostree checksum,
//! and `FICLONE`ing into the ostree bare repo (Stage 3).
//!
//! ## Long-term: Global composefs store
//!
//! The ultimate planned state (the "composefs-as-storage" plan) is to have podman's
//! composefs backend natively write objects to `/sysroot/composefs` directly, bypassing
//! even `containers-storage`. This would mean flatpak, podman, and bootc all share exactly
//! one global pool of content-addressed, deduplicated files.
//!
//! ## Why composefs in the middle
//!
//! The old unified storage path (containers-storage → skopeo tar → ostree)
//! serialized layers twice. composefs-ctl's `ZeroCopy` pull mode instead walks
//! the overlay `diff/` directories and FICLONEs each file into the composefs
//! object store keyed by SHA-512 fsverity digest — no tar involved.
//! See [container-libs#144](https://github.com/containers/container-libs/issues/144).
//!
//! ## Why reflink and not hardlink between composefs and ostree
//!
//! composefs is content-addressed by the SHA-512 fsverity digest of a file's raw bytes: two paths with
//! identical content share one composefs inode. ostree bare mode stores
//! uid/gid/mode/xattrs including `security.selinux` on each inode. Two files
//! with the same bytes but different SELinux labels produce different ostree
//! checksums but share one composefs object. One inode can hold only one
//! `security.selinux` value, so hardlinking would silently corrupt labels.
//! Reflink gives each ostree object its own inode while sharing disk extents.
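
The argument above can be made concrete with a toy model of the two keying schemes. This is only a sketch: `DefaultHasher` stands in for the real SHA-512 fsverity and SHA-256 ostree checksums, and `FileEntry`, `composefs_key`, and `ostree_key` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Simplified view of one regular file in the image.
struct FileEntry<'a> {
    content: &'a [u8],
    uid: u32,
    gid: u32,
    mode: u32,
    selinux: &'a str,
}

/// composefs-style key: depends on the content bytes only.
/// (Stand-in hash; the real key is a SHA-512 fsverity digest.)
fn composefs_key(f: &FileEntry<'_>) -> u64 {
    let mut h = DefaultHasher::new();
    f.content.hash(&mut h);
    h.finish()
}

/// ostree-style key: depends on metadata (including the SELinux label) and content.
/// (Stand-in hash; the real key is a SHA-256 checksum.)
fn ostree_key(f: &FileEntry<'_>) -> u64 {
    let mut h = DefaultHasher::new();
    (f.uid, f.gid, f.mode, f.selinux, f.content).hash(&mut h);
    h.finish()
}

fn main() {
    let a = FileEntry { content: b"#!/bin/sh\n", uid: 0, gid: 0, mode: 0o100755, selinux: "system_u:object_r:bin_t:s0" };
    let b = FileEntry { content: b"#!/bin/sh\n", uid: 0, gid: 0, mode: 0o100755, selinux: "system_u:object_r:shell_exec_t:s0" };
    // Same bytes: one composefs object. Different labels: two ostree objects.
    println!("composefs keys equal: {}", composefs_key(&a) == composefs_key(&b));
    println!("ostree keys equal: {}", ostree_key(&a) == ostree_key(&b));
}
```

Because the two ostree objects need different `security.selinux` values, they cannot both be hardlinks to the single composefs inode; separate inodes that share extents avoid the conflict.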
//!
//! ## Reflink probe
//!
//! The reflink probe is performed lazily and cached. It creates
//! two anonymous temporary files (via `O_TMPFILE`, no
//! cleanup needed), writes one byte to the source, and attempts
//! `ioctl(FICLONE)`. Returns `true` on success, `false` on `EOPNOTSUPP` or
//! `EXDEV`. The probe directory is `composefs/objects` if it already exists,
//! otherwise the physical root itself.
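
The result handling and caching described above could be sketched as follows, leaving out the actual `ioctl(FICLONE)` call. The errno constants are the values used on x86_64 and aarch64 Linux (an assumption; `EOPNOTSUPP` differs on some architectures). `classify_probe` and `ReflinkSupport` are hypothetical names, and propagating other errors is this sketch's choice, not something the text specifies.

```rust
use std::io;
use std::sync::OnceLock;

// Linux errno values on x86_64/aarch64 (assumption: other architectures may differ).
const EXDEV: i32 = 18;
const EOPNOTSUPP: i32 = 95;

/// Turn the outcome of a FICLONE attempt into a capability answer:
/// success means reflinks work, `EOPNOTSUPP`/`EXDEV` mean they do not.
/// Any other error is passed through (a choice made for this sketch).
fn classify_probe(result: io::Result<()>) -> io::Result<bool> {
    match result {
        Ok(()) => Ok(true),
        Err(e) if matches!(e.raw_os_error(), Some(EXDEV) | Some(EOPNOTSUPP)) => Ok(false),
        Err(e) => Err(e),
    }
}

/// Lazily computed, cached probe result. Hypothetical type for illustration.
struct ReflinkSupport {
    cached: OnceLock<bool>,
}

impl ReflinkSupport {
    const fn new() -> Self {
        Self { cached: OnceLock::new() }
    }

    /// Run `probe` on first use only; later calls return the cached value.
    fn get(&self, probe: impl FnOnce() -> bool) -> bool {
        *self.cached.get_or_init(probe)
    }
}

fn main() {
    let support = ReflinkSupport::new();
    // Simulate a filesystem that rejects FICLONE with EOPNOTSUPP.
    let simulated = classify_probe(Err(io::Error::from_raw_os_error(EOPNOTSUPP)));
    println!("{}", support.get(|| simulated.unwrap_or(false)));
}
```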
//!
//! # OSTree
//!
71 changes: 44 additions & 27 deletions docs/src/experimental-unified-storage.md
@@ -7,10 +7,13 @@ Tracking issue: <https://github.com/bootc-dev/bootc/issues/20>

## Overview

Unified storage is an experimental feature that allows bootc to fetch and store
the default OS image in the same [containers/storage](https://github.com/containers/storage)
backend used for [logically bound images](logically-bound-images.md) (and by podman).
This enables several benefits:
Unified storage is the goal of having bootc share a single storage backend with a
container runtime such as podman.

Currently, bootc stores the host OS image in either ostree or composefs, while
[logically bound images](logically-bound-images.md) use the podman container storage.

## Goals

- Direct support for zstd:chunked: Container images using zstd:chunked compression
can be efficiently pulled with deduplication
@@ -21,20 +24,6 @@ This enables several benefits:
- When used with `bootc image cmd build`, can support direct build into the bootc-owned
storage without a copy from the podman (or other app container) storage.

## Background

Historically, bootc has used two separate storage backends:

1. **ostree**: For the booted host OS image, via [ostree-rs-ext](https://github.com/ostreedev/ostree-rs-ext/)
2. **containers/storage**: For logically bound images (LBIs)

This split created challenges: the booted image couldn't be easily accessed
by podman, and container layer sharing between the host and LBIs wasn't possible.

Unified storage addresses this by pulling the host image into the bootc-owned
container storage (`/usr/lib/bootc/storage`) first, then importing from there
into ostree and setting it up for booting (e.g. performing SELinux labeling).

## Current status

**Status**: Experimental. The unified storage feature is under active development.
@@ -56,16 +45,10 @@ from its container storage into ostree, or when copying between different
container storage instances, each layer is fully re-serialized even when both
storages are on the same filesystem.

With reflink support (as proposed in that issue), copies between storages on
the same filesystem would be nearly instantaneous and use no additional disk
space. Without it, unified storage works but involves redundant I/O and
temporary disk space usage proportional to layer sizes. This is particularly
noticeable with large non-chunked layers.

The architectural fix requires separating metadata from data in the copy path,
allowing file descriptors to be passed and reflinked rather than streamed
through tar. This is related to the composefs approach of content-addressed
storage with distinct metadata and data channels.
through tar. This will be solved by putting [composefs-rs](https://github.com/containers/composefs-rs)
in the middle to orchestrate zero-copy pulls. See [Future plans: composefs-to-ostree](#future-plans-composefs-to-ostree).

## Enabling unified storage

@@ -153,7 +136,41 @@ podman --storage-opt=additionalimagestore=/usr/lib/bootc/storage run localhost/b
Unified storage is complementary to the [composefs backend](experimental-composefs.md).
While unified storage changes *how images are pulled* (using containers/storage),
the composefs backend changes *how the filesystem is stored and verified*.
These features can potentially be combined in the future.

## Future plans: composefs-to-ostree

Unified storage and the composefs backend will be combined in upcoming work to build a "composefs-first"
import pipeline. In this planned model, containers/storage will pull the image,
composefs will import it via reflinks (`FICLONE`), and then ostree will
synthesize its commit by `FICLONE`ing from the composefs objects.

This will eliminate tar serialization entirely, meaning only one physical copy
of the image data will exist on disk, shared across all three stores.

## Future plans: composefs-as-storage

Looking further ahead, the ultimate evolution of unified storage is to make the host's `/sysroot/composefs` object store the single, global source of truth for all content-addressed files on the system.

Instead of `containers/storage` maintaining its own copy of application image layers and merely sharing the *host* OS layers, podman's composefs backend could be configured to write objects directly into `/sysroot/composefs` on bootc-managed systems.

This means there would be exactly one storage pool for:

1. The bootc host OS image
2. Logically bound app containers
3. Standard Podman app containers
4. Flatpak apps (by having flatpak's system helper write to the same object store)

Every file across the entire system, whether part of the base OS, a containerized database, or a desktop application, would be deduplicated automatically at the object level via fsverity digests.
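
To illustrate object-level deduplication, here is a toy in-memory content-addressed store. It is only a sketch: `DefaultHasher` stands in for the fsverity digest, and `ObjectStore` is a hypothetical type, not part of bootc, podman, or flatpak.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy content-addressed store: one entry per distinct file content.
#[derive(Default)]
struct ObjectStore {
    objects: HashMap<u64, Vec<u8>>,
}

impl ObjectStore {
    /// Insert content and return its key. Identical content inserted by any
    /// consumer maps to the same key and is stored once.
    /// (Stand-in hash; the real key would be an fsverity digest.)
    fn insert(&mut self, content: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        content.hash(&mut h);
        let digest = h.finish();
        self.objects.entry(digest).or_insert_with(|| content.to_vec());
        digest
    }

    /// Number of distinct objects held.
    fn len(&self) -> usize {
        self.objects.len()
    }
}

fn main() {
    let mut store = ObjectStore::default();
    // The same library shipped by the host image, a podman image, and a flatpak.
    for _consumer in ["bootc", "podman", "flatpak"] {
        store.insert(b"shared library bytes");
    }
    store.insert(b"application-specific bytes");
    println!("distinct objects: {}", store.len());
}
```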

### Implementation notes

For developers, the internal design and target architecture of this three-store
unified storage model are documented in the rustdoc comments of the relevant source files:

- `crates/lib/src/store/mod.rs` — the target three-store architecture and reflink behavior
- `crates/lib/src/bootc_composefs/repo.rs` — composefs unified pull path stages
- `crates/lib/src/deploy.rs` — pull dispatch and ostree backend synthesis
- `crates/lib/src/image.rs` — `bootc image set-unified` entrypoints

## Limitations
