diff --git a/crates/lib/src/bootc_composefs/repo.rs b/crates/lib/src/bootc_composefs/repo.rs
index 1f59fb8d7..fe58e201b 100644
--- a/crates/lib/src/bootc_composefs/repo.rs
+++ b/crates/lib/src/bootc_composefs/repo.rs
@@ -1,3 +1,40 @@
+//! Composefs repository lifecycle and OCI pull paths.
+//!
+//! This module owns how OCI images get into the composefs object store.
+//! There are two pull paths, selected by the `use_unified` flag:
+//!
+//! ## Direct pull (`use_unified = false`)
+//!
+//! `pull_composefs_direct` fetches from the source transport (registry, OCI
+//! dir, etc.) straight into the composefs repo via `composefs_oci::pull` with
+//! default options. No containers-storage involvement.
+//!
+//! ## Unified pull (`use_unified = true`)
+//!
+//! `pull_composefs_unified` is the two-stage path that populates two of the
+//! three stores (see [`crate::store`] for the architecture overview):
+//!
+//! **Stage 1** — Pull into bootc-owned containers-storage via
+//! `CStorage::pull_with_progress` (or `pull_from_host_storage` if the image
+//! already exists in the default podman store, saving a network round-trip).
+//!
+//! **Stage 2** — `composefs_oci::pull` with `LocalFetchOpt::ZeroCopy` and
+//! `storage_root` pointing at the containers-storage directory. composefs-ctl
+//! walks the overlay `diff/` directories and FICLONEs each file into the
+//! composefs object store keyed by its SHA-512 fsverity digest. On a
+//! reflink-capable filesystem this is near-instantaneous and consumes no
+//! additional disk space.
+//!
+//! The caller provides `storage_path` as an absolute filesystem path string
+//! (not a `Dir` fd) because composefs-ctl passes it to a child skopeo process.
+//! It is derived from the physical root fd via `/proc/self/fd/{fd}` readlink.
+//!
+//! ## Entry points
+//!
+//! - [`pull_composefs_repo`] — upgrade/switch on a composefs-booted system.
+//! - [`initialize_composefs_repository`] — `bootc install` with the composefs
+//!   backend.
+
 use fn_error_context::context;
 use std::sync::Arc;
 
diff --git a/crates/lib/src/deploy.rs b/crates/lib/src/deploy.rs
index e82314629..be92e3128 100644
--- a/crates/lib/src/deploy.rs
+++ b/crates/lib/src/deploy.rs
@@ -1,6 +1,48 @@
-//! # Write deployments merging image with configmap
+//! Pull dispatch and deployment staging for the ostree backend.
 //!
-//! Create a merged filesystem tree with the image and mounted configmaps.
+//! ## Planned pull paths
+//!
+//! The top-level entry point for upgrade/switch will eventually select
+//! among three paths based on the `unified` flag and filesystem capability:
+//!
+//! - **Unified + reflinks** (`unified = true`, XFS/btrfs): `pull_via_composefs_unified`
+//!   — the planned three-store pipeline. Pulls into containers-storage first, then
+//!   ZeroCopy into composefs, then synthesizes the ostree commit via FICLONE.
+//!   See [`crate::store`] for the architecture diagram.
+//!
+//! - **Non-unified + reflinks** (`unified = false`, XFS/btrfs): `pull_via_composefs`
+//!   — fetches from registry directly into composefs (no containers-storage),
+//!   then synthesizes the ostree commit via FICLONE.
+//!
+//! - **No reflinks** (ext4): `pull` — the legacy ostree-native tar importer
+//!   (`ostree_container::store::ImageImporter`).
+//!
+//! ## Planned composefs → ostree synthesis
+//!
+//! The synthesis plan relies on `import_from_composefs_repo` from
+//! `ostree_ext::container::composefs_import`, which walks the composefs
+//! filesystem tree and, for each regular file:
+//!
+//! 1. Reads uid/gid/mode/xattrs from composefs metadata. SELinux labels are
+//!    computed in bulk before the walk via `selabel()` and looked up per-file;
+//!    a NUL terminator is appended because composefs-rs omits it but the kernel
+//!    stores it.
+//! 2. Computes the ostree content checksum in-memory (SHA-256 of
+//!    `uid:gid:mode:xattrs:file-content`).
+//! 3. Issues `ioctl(FICLONE)` from the composefs object fd into a new `O_TMPFILE`
+//!    in the ostree object directory.
+//! 4. Applies metadata (`fchown`, `fchmod`, `fsetxattr`) and links the tmpfile
+//!    into the ostree content-addressed path.
+//!
+//! `/etc` is remapped to `usr/etc`; virtual toplevel paths (`proc`, `sys`,
+//! `dev`, etc.) are excluded — matching the ostree-container tar importer.
+//!
+//! ## Auto-detection
+//!
+//! `image_exists_in_unified_storage` checks whether the target image is already
+//! present in bootc-owned containers-storage. Call sites use this to select
+//! `unified = true` automatically without requiring an explicit flag from the
+//! user once `bootc image set-unified` has been run.
 
 use std::collections::HashSet;
 use std::io::{BufRead, Write};
diff --git a/crates/lib/src/image.rs b/crates/lib/src/image.rs
index 8a6b17a54..adbaace0a 100644
--- a/crates/lib/src/image.rs
+++ b/crates/lib/src/image.rs
@@ -1,6 +1,17 @@
-//! # Controlling bootc-managed images
-//!
 //! APIs for operating on container images in the bootc storage.
+//!
+//! ## `bootc image set-unified`
+//!
+//! `set_unified_entrypoint` dispatches to `set_unified` (ostree backend) or
+//! `set_unified_composefs` (composefs backend). Both pull the currently booted
+//! image into bootc-owned containers-storage so that future upgrade/switch
+//! operations can use the unified storage path.
+//!
+//! In the planned three-store architecture (see [`crate::store`]), this will
+//! require a reflink-capable filesystem (XFS or btrfs) by default to enable
+//! block sharing. The planned `--allow-copy` flag will opt into a byte copy
+//! for environments like ext4 where podman access to the OS image matters
+//! more than disk efficiency.
 
 use anyhow::{Context, Result, bail};
 use bootc_utils::CommandRunExt;
diff --git a/crates/lib/src/store/mod.rs b/crates/lib/src/store/mod.rs
index 940d59100..653fd2a25 100644
--- a/crates/lib/src/store/mod.rs
+++ b/crates/lib/src/store/mod.rs
@@ -1,5 +1,77 @@
 //! The [`Storage`] type holds references to three different types of
-//! storage:
+//! storage that together implement the unified storage model.
+//!
+//! # Planned three-store architecture
+//!
+//! The planned architecture for unified storage involves three content stores that
+//! share physical disk blocks on a reflink-capable filesystem (XFS, btrfs):
+//!
+//! 1. **bootc-owned containers-storage** at `/sysroot/ostree/bootc/storage`
+//!    (overlay driver) — the image is accessible to podman and shares layers
+//!    with Logically Bound Images.
+//! 2. **composefs object store** at `/sysroot/composefs/objects/`
+//!    (SHA-512 content-addressed) — used by composefs-boot to mount the
+//!    rootfs as EROFS. Populated from containers-storage via `FICLONE`
+//!    (`composefs_oci::pull` with `ZeroCopy`).
+//! 3. **ostree bare repo** at `/sysroot/ostree/repo/objects/`
+//!    (SHA-256 content-addressed) — provides deployment, rollback, fsck, and
+//!    delta updates. Populated from the composefs object store via `FICLONE`
+//!    (`import_from_composefs_repo`).
+//!
+//! Each `FICLONE` ioctl lets the kernel mark source and destination extents as
+//! copy-on-write siblings with no userspace data movement. On ext4 (no
+//! reflinks), each step falls back to a byte copy.
+//!
+//! ## Implementation plan
+//!
+//! The containers-storage → composefs step (arrow 1) is already implemented
+//! for the composefs boot backend in `crates/lib/src/bootc_composefs/repo.rs`
+//! via `pull_composefs_unified`.
+//!
+//! Wiring all three steps together for the ostree backend is the major planned work.
+//! The composefs → ostree step (arrow 2) was proven by the `composefs-to-ostree`
+//! spike branch. The planned implementation for the ostree backend will:
+//!
+//! 1. Perform a lazy cached probe (`reflinks_supported`) at install time.
+//! 2. Pull into containers-storage first (Stage 1).
+//! 3. Use `composefs_oci::pull` with `LocalFetchOpt::ZeroCopy` to populate composefs (Stage 2).
+//! 4. Finally, synthesize the ostree commit by walking the composefs tree,
+//!    reading metadata, computing SELinux labels, computing the ostree checksum,
+//!    and `FICLONE`ing into the ostree bare repo (Stage 3).
+//!
+//! ## Long-term: Global composefs store
+//!
+//! The ultimate planned state (the "composefs-as-storage" plan) is to have podman's
+//! composefs backend natively write objects to `/sysroot/composefs` directly, bypassing
+//! even `containers-storage`. This would mean flatpak, podman, and bootc all share exactly
+//! one global pool of content-addressed, deduplicated files.
+//!
+//! ## Why composefs in the middle
+//!
+//! The old unified storage path (containers-storage → skopeo tar → ostree)
+//! serialized layers twice. composefs-ctl's `ZeroCopy` pull mode instead walks
+//! the overlay `diff/` directories and FICLONEs each file into the composefs
+//! object store keyed by SHA-512 fsverity digest — no tar involved.
+//! See [container-libs#144](https://github.com/containers/container-libs/issues/144).
+//!
+//! ## Why reflink and not hardlink between composefs and ostree
+//!
+//! composefs is content-addressed by SHA-512 of raw bytes: two paths with
+//! identical content share one composefs inode. ostree bare mode stores
+//! uid/gid/mode/xattrs including `security.selinux` on each inode. Two files
+//! with the same bytes but different SELinux labels produce different ostree
+//! checksums but share one composefs object. One inode can hold only one
+//! `security.selinux` value, so hardlinking would silently corrupt labels.
+//! Reflink gives each ostree object its own inode while sharing disk extents.
+//!
+//! ## Reflink probe
+//!
+//! The reflink probe is performed lazily and cached. It creates
+//! two anonymous temporary files (via `O_TMPFILE`, no
+//! cleanup needed), writes one byte to the source, and attempts
+//! `ioctl(FICLONE)`. Returns `true` on success, `false` on `EOPNOTSUPP` or
+//! `EXDEV`. The probe directory is `composefs/objects` if it already exists,
+//! otherwise the physical root itself.
 //!
 //! # OSTree
 //!
diff --git a/docs/src/experimental-unified-storage.md b/docs/src/experimental-unified-storage.md
index 68d5ea142..22fabdef1 100644
--- a/docs/src/experimental-unified-storage.md
+++ b/docs/src/experimental-unified-storage.md
@@ -7,10 +7,13 @@ Tracking issue:
 
 ## Overview
 
-Unified storage is an experimental feature that allows bootc to fetch and store
-the default OS image in the same [containers/storage](https://github.com/containers/storage)
-backend used for [logically bound images](logically-bound-images.md) (and by podman).
-This enables several benefits:
+Unified storage is the goal of having all storage for bootc be "unified" with the storage
+used by a container runtime, such as podman.
+
+Currently, bootc uses either ostree or composefs. [Logically bound images](logically-bound-images.md)
+use the podman container storage.
+
+## Goals
 
 - Direct support for zstd:chunked: Container images using zstd:chunked compression
   can be efficiently pulled with deduplication
@@ -21,20 +24,6 @@ This enables several benefits:
 - When used with `bootc image cmd build`, can support direct build into
   the bootc-owned storage without a copy from the podman (or other app container) storage.
 
-## Background
-
-Historically, bootc has used two separate storage backends:
-
-1. **ostree**: For the booted host OS image, via [ostree-rs-ext](https://github.com/ostreedev/ostree-rs-ext/)
-2. **containers/storage**: For logically bound images (LBIs)
-
-This split created challenges: the booted image couldn't be easily accessed
-by podman, and container layer sharing between the host and LBIs wasn't possible.
-
-Unified storage addresses this by pulling the host image into the bootc-owned
-container storage (`/usr/lib/bootc/storage`) first, then importing from there
-into ostree and setting it up for booting (e.g. performing SELinux labeling).
-
 ## Current status
 
 **Status**: Experimental. The unified storage feature is under active development.
@@ -56,16 +45,10 @@ from its container storage into ostree, or when copying between different
 container storage instances, each layer is fully re-serialized even when both
 storages are on the same filesystem.
 
-With reflink support (as proposed in that issue), copies between storages on
-the same filesystem would be nearly instantaneous and use no additional disk
-space. Without it, unified storage works but involves redundant I/O and
-temporary disk space usage proportional to layer sizes. This is particularly
-noticeable with large non-chunked layers.
-
 The architectural fix requires separating metadata from data in the copy path,
 allowing file descriptors to be passed and reflinked rather than streamed
-through tar. This is related to the composefs approach of content-addressed
-storage with distinct metadata and data channels.
+through tar. This will be solved by putting [composefs-rs](https://github.com/containers/composefs-rs)
+in the middle to orchestrate zero-copy pulls. See [Future plans: composefs-to-ostree](#future-plans-composefs-to-ostree).
 
 ## Enabling unified storage
 
@@ -153,7 +136,41 @@ podman --storage-opt=additionalimagestore=/usr/lib/bootc/storage run localhost/b
 Unified storage is complementary to the [composefs backend](experimental-composefs.md).
 While unified storage changes *how images are pulled* (using containers/storage),
 the composefs backend changes *how the filesystem is stored and verified*.
-These features can potentially be combined in the future.
+
+## Future plans: composefs-to-ostree
+
+Unified storage and the composefs backend will be combined in upcoming work to build a "composefs-first"
+import pipeline. In this planned model, containers/storage will pull the image,
+composefs will import it via reflinks (`FICLONE`), and then ostree will
+synthesize its commit by `FICLONE`ing from the composefs objects.
+
+This will eliminate tar serialization entirely; on a reflink-capable filesystem, only one
+physical copy of the image data will exist on disk, shared across all three stores.
+
+## Future plans: composefs-as-storage
+
+Looking further ahead, the ultimate evolution of unified storage is to make the host's `/sysroot/composefs` object store the single, global source of truth for all content-addressed files on the system.
+
+Instead of `containers/storage` maintaining its own copy of application image layers and merely sharing the *host* OS layers, podman's composefs backend could be configured to write objects directly into `/sysroot/composefs` on bootc-managed systems.
+
+This means there would be exactly one storage pool for:
+
+1. The bootc host OS image
+2. Logically bound app containers
+3. Standard podman app containers
+4. Flatpak apps (by having flatpak's system helper write to the same object store)
+
+Every file across the entire system (whether part of the base OS, a containerized database, or a desktop application) would be deduplicated automatically at the object level via fsverity digests.
+
+### Implementation notes
+
+For developers, the internal design and target architecture for this three-store
+unified storage model are documented in the rustdoc comments of the relevant source files:
+
+- `crates/lib/src/store/mod.rs` — the target three-store architecture and reflink behavior
+- `crates/lib/src/bootc_composefs/repo.rs` — composefs unified pull path stages
+- `crates/lib/src/deploy.rs` — pull dispatch and ostree backend synthesis
+- `crates/lib/src/image.rs` — `bootc image set-unified` entrypoints
 
 ## Limitations