Merged
84 changes: 19 additions & 65 deletions README.md
@@ -34,76 +34,32 @@ and Linux kernel integration, but with the *flexibility* of files
for content — avoiding doubled disk usage, partition table management,
and similar headaches.

### Separation between metadata and data

A key aspect of composefs is its separation of "data" (non-empty regular
files) from "metadata" (everything else: directories, symlinks, permissions,
ownership, etc.).

composefs produces an [EROFS](https://erofs.docs.kernel.org) filesystem
image that contains only metadata. The non-empty data files live in a
separate "backing store" directory. The EROFS image includes
`trusted.overlay.redirect` extended attributes that tell the overlayfs
mount how to find the real underlying files.

### Shared backing store

The primary use case for composefs is versioned, immutable filesystem
trees — container images and bootable host systems — where multiple
images may share parts of their storage.

By storing files content-addressed (named by the hash of their content),
shared files need to be stored only once on disk yet can appear in
multiple mounts. Crucially, these data files are also shared in the
[page cache](https://static.lwn.net/kerneldoc/admin-guide/mm/concepts.html#page-cache),
allowing multiple running container images to reliably share memory.

### Filesystem integrity

composefs supports [fs-verity](https://www.kernel.org/doc/html/latest/filesystems/fsverity.html)
validation of content files. The digest of each content file is stored
in the EROFS image via `trusted.overlay.metacopy` extended attributes,
which overlayfs validates when the file is accessed. This means backing
content cannot be changed (by mistake or by malice) without detection.

You can also enable fs-verity on the image file itself and pass the expected
digest as a mount option. This provides full trust of both data and metadata,
solving a weakness of fs-verity alone (which can only verify file data,
not metadata like permissions, ownership, or directory structure).
composefs separates metadata (directories, permissions, xattrs) from data
(file content). An EROFS image carries only the metadata; data files live in
a content-addressed backing store, shared across images and in the Linux
[page cache](https://static.lwn.net/kerneldoc/admin-guide/mm/concepts.html#page-cache).
Optional [fs-verity](https://www.kernel.org/doc/html/latest/filesystems/fsverity.html)
provides end-to-end integrity verification of both data and metadata.
For design details, see the [crate documentation](https://docs.rs/composefs).

## Use cases

### Container images

For [OCI](https://github.com/opencontainers/image-spec/blob/main/spec.md)
container images, a common approach (used by both Docker and Podman) is
to untar each layer separately and use overlayfs to stitch them together.
composefs improves on this by storing file content in a content-addressed
fashion, allowing sharing between images even when metadata like
timestamps or ownership differs.

Combined with approaches like
[zstd:chunked](https://github.com/containers/storage/pull/775),
this speeds up pulling container images and avoids redundantly
creating files that are already present.
composefs improves on the traditional per-layer overlayfs model for
[OCI](https://github.com/opencontainers/image-spec/blob/main/spec.md)
container images by storing file content in a content-addressed store,
enabling sharing between images and faster pulls via
[zstd:chunked](https://github.com/containers/storage/pull/775).

### Bootable host systems

Anywhere one wants versioned immutable filesystem trees ("images"),
composefs provides compelling advantages. In particular, this project
aims to be the successor to [OSTree](https://github.com/ostreedev/ostree/).

OSTree uses a content-addressed object store, but traditionally checks out
into a regular directory (using hardlinks), which is then bind-mounted as
the rootfs. While OSTree supports enabling fs-verity on files in the store,
nothing protects the checkout directories from modification.

composefs replaces this checkout with a directly-mountable image pointing
into the object store. We can enable fs-verity on the composefs image and
embed its digest in the kernel commandline or a Unified Kernel Image (UKI).
Since composefs generation is reproducible, we can verify the generated
image is correct by comparing its digest to one in the metadata produced
at build time. For more on this, see [this tracking issue](https://github.com/ostreedev/ostree/issues/2867).
composefs aims to succeed [OSTree](https://github.com/ostreedev/ostree/)
by replacing hardlink checkouts with directly-mountable images backed by a
shared object store. Combined with fs-verity and a digest embedded in the
kernel commandline or a UKI, this provides cryptographic verification of
the entire filesystem tree. See [this tracking issue](https://github.com/ostreedev/ostree/issues/2867)
for background.

## Components

@@ -147,9 +103,7 @@ helper that supports `mount -t composefs` syntax directly.

## Documentation

- [Repository format](doc/repository.md)
- [OCI integration](doc/oci.md)
- [Splitstream format](doc/splitstream.md)
- [API and design documentation](https://docs.rs/composefs)
- [Examples README](examples/README.md)

## Status
90 changes: 90 additions & 0 deletions crates/composefs-boot/src/design.rs
@@ -0,0 +1,90 @@
//! # Booting from a composefs image
//!
//! This document describes how composefs-rs sets up the root filesystem during
//! early boot. It covers the kernel command-line interface, the expected on-disk
//! layout, kernel requirements, and the step-by-step mount sequence performed by
//! `composefs-setup-root`.
//!
//! The target audience is system integrators and OS developers who are packaging a
//! bootable system using composefs. Familiarity with Linux mount namespaces,
//! overlayfs, and fs-verity is assumed.
//!
//! ## Kernel command-line
//!
//! The initramfs code in composefs accepts multiple kernel arguments, so the
//! digest of an image can be pre-computed with more than one algorithm (e.g. both
//! SHA-256 and SHA-512). On an installed system, the repository currently supports
//! only one digest algorithm by default, and the first matching argument found is
//! selected.
//!
//! Support for v1 EROFS images is opt-in; again, the first compatible version
//! found is used.
//!
//! ```text
//! composefs.digest=v1-sha256-12:<digest> # V1 EROFS image (preferred; RHEL9-era kernels)
//! composefs.digest=v1-sha512-12:<digest> # V1 EROFS image (SHA-512 variant)
//! composefs.digest=v2-sha512-12:<digest> # V2 EROFS image (explicit form)
//! composefs=<digest> # V2 EROFS image (legacy shorthand)
//! ```
//!
//! The value format is `<version>-<hash>-<lg_blocksize>:<hex_digest>`, where
//! `<version>` is `v1` or `v2`, `<hash>` is `sha256` or `sha512`, and
//! `<lg_blocksize>` is the log2 block size (currently always `12`, i.e. 4096
//! bytes). This mirrors how `meta.json` encodes the algorithm as
//! `fsverity-sha256-12`.
//!
//! `composefs.digest=` is checked first. Multiple entries may appear on the cmdline
//! (one per format/algorithm combination); the initramfs tries each in order and
//! mounts the first image that actually exists in the repository.
//!
//! `composefs=<digest>` is a legacy shorthand equivalent to
//! `composefs.digest=v2-<hash>-12:<digest>` -- the algorithm is inferred from the
//! digest length (64 hex chars -> SHA-256, 128 -> SHA-512). It is checked only when
//! no `composefs.digest=` token matches.
//!
//! **Insecure mode.** Placing `?` immediately after `=` (e.g.
//! `composefs.digest=?v1-sha256-12:<digest>` or `composefs=?<digest>`) makes
//! fs-verity verification optional. The system will boot even when the underlying
//! filesystem does not support fs-verity or the image has no verity metadata
//! attached. This mode exists for development and testing only; it must not be used
//! in production.
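The value format described above can be illustrated with a small parser. This is a minimal sketch, not the parser used by `composefs-setup-root`: the names `HashAlg`, `DigestSpec`, `parse_digest_karg`, and `parse_legacy_karg` are hypothetical, and the error handling is simplified to strings.

```rust
/// Hash algorithm named in the karg value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HashAlg {
    Sha256,
    Sha512,
}

impl HashAlg {
    /// Number of hex characters in a digest of this algorithm.
    fn hex_len(self) -> usize {
        match self {
            HashAlg::Sha256 => 64,
            HashAlg::Sha512 => 128,
        }
    }
}

/// Hypothetical parsed form of a boot karg value (not a composefs-rs type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DigestSpec {
    pub insecure: bool,
    pub erofs_version: u8,
    pub hash: HashAlg,
    pub lg_blocksize: u8,
    pub hex_digest: String,
}

/// Splits off the leading `?` that marks insecure mode.
fn split_insecure(value: &str) -> (bool, &str) {
    match value.strip_prefix('?') {
        Some(rest) => (true, rest),
        None => (false, value),
    }
}

fn is_hex(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Parses the value of `composefs.digest=`, e.g. `?v1-sha256-12:<hex>`.
pub fn parse_digest_karg(value: &str) -> Result<DigestSpec, String> {
    let (insecure, rest) = split_insecure(value);
    let (spec, hex) = rest
        .split_once(':')
        .ok_or_else(|| format!("missing ':' in {value:?}"))?;
    let fields: Vec<&str> = spec.split('-').collect();
    let [version, hash, lg] = fields.as_slice() else {
        return Err(format!("expected <version>-<hash>-<lg_blocksize>, got {spec:?}"));
    };
    let erofs_version = match *version {
        "v1" => 1,
        "v2" => 2,
        other => return Err(format!("unknown version {other:?}")),
    };
    let hash = match *hash {
        "sha256" => HashAlg::Sha256,
        "sha512" => HashAlg::Sha512,
        other => return Err(format!("unknown hash {other:?}")),
    };
    let lg_blocksize: u8 = lg
        .parse()
        .map_err(|e| format!("bad block size {lg:?}: {e}"))?;
    if hex.len() != hash.hex_len() || !is_hex(hex) {
        return Err(format!("digest does not look like {hash:?} hex"));
    }
    Ok(DigestSpec { insecure, erofs_version, hash, lg_blocksize, hex_digest: hex.to_string() })
}

/// Parses the value of the legacy `composefs=` karg: always a V2 image with
/// block size 2^12, with the algorithm inferred from the digest length.
pub fn parse_legacy_karg(value: &str) -> Result<DigestSpec, String> {
    let (insecure, hex) = split_insecure(value);
    let hash = match hex.len() {
        64 => HashAlg::Sha256,
        128 => HashAlg::Sha512,
        n => return Err(format!("unexpected digest length {n}")),
    };
    if !is_hex(hex) {
        return Err("digest is not hexadecimal".to_string());
    }
    Ok(DigestSpec { insecure, erofs_version: 2, hash, lg_blocksize: 12, hex_digest: hex.to_string() })
}
```

A caller following the lookup order described above would presumably try every `composefs.digest=` entry first and fall back to `parse_legacy_karg` only when none of them matches an image in the repository.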
//!
//! ## On-disk layout
//!
//! The composefs repository must be present at `/sysroot/composefs` with the
//! standard layout described in the `composefs::repository_format` module.
//!
//! The digest must correspond to a symlink under `images/`.
//!
//! Persistent per-deployment state lives at `/sysroot/state/deploy/<digest>/`,
//! where `<digest>` matches the boot karg digest exactly. The `etc/` and `var/`
//! subdirectories within that directory serve as the upper layers for the
//! corresponding overlayfs mounts.
//!
//! ## Kernel requirements
//!
//! The following kernel features must be available:
//!
//! - **EROFS** filesystem driver (`CONFIG_EROFS_FS`)
//! - **overlayfs** with `metacopy=on` and `redirect_dir=on`
//! (`CONFIG_OVERLAY_FS`, `CONFIG_OVERLAY_FS_METACOPY`, `CONFIG_OVERLAY_FS_REDIRECT_DIR`)
//! - **fs-verity** unless insecure mode is used (`CONFIG_FS_VERITY`)
//! - The modern Linux mount API (`fsopen` / `fsconfig` / `fsmount` / `move_mount`),
//! available since kernel 5.2. Kernel >= 6.15 is required for the atomic root
//! replacement path (the default build). On kernels without `fsconfig_set_fd`
//! support (e.g. RHEL 9 / kernel < 5.15), a loopback device is created
//! automatically by `composefs::mountcompat`.
//!
//! ## Kernel argument
//!
//! The boot karg (`composefs.digest=` or `composefs=`) is the authoritative
//! selector for which image is booted.
//! Without the `?` insecure prefix, every file access through the overlayfs is
//! verified against the object's stored digest by the kernel, combining fs-verity
//! on the data objects with overlayfs `verity=require`.
//!
//! ## Other notes
//!
//! As a workaround for a GPT auto-root issue in systemd
//! ([systemd#35017](https://github.com/systemd/systemd/issues/35017)),
//! `composefs-setup-root` attempts to create `/run/systemd/volatile-root` as a
//! symlink pointing to the real block device before performing any mounts. Failure
//! to do so is non-fatal and does not abort the boot sequence.
3 changes: 3 additions & 0 deletions crates/composefs-boot/src/lib.rs
@@ -15,6 +15,9 @@ pub mod selabel;
pub mod uki;
pub mod write_boot;

#[cfg(doc)]
pub mod design;

use std::ffi::OsStr;

use anyhow::Result;
127 changes: 127 additions & 0 deletions crates/composefs-oci/src/design.rs
@@ -0,0 +1,127 @@
//! # How to create a composefs from an OCI image
//!
//! This document is incomplete. It only serves to document some decisions we've
//! taken about how to resolve ambiguous situations.
//!
//! # Data precision
//!
//! We currently create a composefs image at the granularity of data that
//! typically appears in OCI tarballs:
//! - atime and ctime are not present (neither is physically present in the erofs
//! inode structure at all, in either the compact or the extended form)
//! - mtime is set to the mtime in seconds; the sub-second value is simply
//! truncated (i.e. we always round down). erofs has an nsec field, but
//! sub-second values are not normally present in OCI tarballs, because the usual
//! tar header only stores timestamps in seconds and extended headers are not
//! usually added for this purpose.
//! - we take great care to faithfully represent hardlinks: even though the
//! produced filesystem is read-only and we have data de-duplication via the
//! objects store, we make sure that hardlinks result in an actual shared inode
//! as visible via the `st_ino` and `st_nlink` fields on the mounted filesystem.
//!
//! We also apply these precision restrictions when creating images by scanning the
//! filesystem. For example, even if we get more precise timestamp information,
//! we truncate it to whole seconds.
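The mtime rule can be sketched as follows. This is an illustration only: `mtime_whole_seconds` is a hypothetical helper, and reading "round down" as rounding toward negative infinity for pre-epoch timestamps is an assumption rather than something the text above states.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Returns whole seconds relative to the Unix epoch, discarding the
/// sub-second part by rounding toward negative infinity.
pub fn mtime_whole_seconds(mtime: SystemTime) -> i64 {
    match mtime.duration_since(UNIX_EPOCH) {
        // At or after the epoch: dropping the nanoseconds rounds down.
        Ok(after) => after.as_secs() as i64,
        // Before the epoch: the error carries the distance back to the epoch,
        // so any sub-second remainder pushes the result one second earlier.
        Err(before) => {
            let distance = before.duration();
            let secs = distance.as_secs() as i64;
            if distance.subsec_nanos() == 0 {
                -secs
            } else {
                -secs - 1
            }
        }
    }
}
```

With a `timespec`-style pair (`tv_sec`, `tv_nsec`), where the nanosecond part is never negative, the same result is obtained by keeping `tv_sec` and discarding `tv_nsec`.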
//!
//! # Merging directories
//!
//! This is done according to the OCI spec, with an additional clarification: in
//! case a directory entry is present in multiple layers, we use the tar metadata
//! from the most-derived layer to determine the attributes (owner, permissions,
//! mtime) for the directory.
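The "most-derived layer wins" rule for directory attributes can be sketched like this. The types and the function `merge_dir_attrs` are hypothetical, layers are assumed to be ordered from base to most-derived, and whiteouts and non-directory entries are deliberately left out.

```rust
use std::collections::BTreeMap;

/// Attributes taken from a directory's tar entry (illustrative type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DirAttrs {
    pub uid: u32,
    pub gid: u32,
    pub mode: u32,
    pub mtime: i64,
}

/// Merges per-layer directory attributes. `layers` is ordered from the base
/// layer to the most-derived layer; each map goes from directory path to the
/// attributes recorded for it in that layer's tar.
pub fn merge_dir_attrs(layers: &[BTreeMap<String, DirAttrs>]) -> BTreeMap<String, DirAttrs> {
    let mut merged = BTreeMap::new();
    for layer in layers {
        for (path, attrs) in layer {
            // A later (more-derived) layer replaces whatever an earlier
            // layer recorded for the same directory.
            merged.insert(path.clone(), attrs.clone());
        }
    }
    merged
}
```

A directory that appears only in a lower layer keeps that layer's attributes, since nothing later overwrites its entry.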
//!
//! # The root inode
//!
//! The root inode (/) is a difficult case because OCI container layer tars often
//! don't include a root directory entry, and when they do, container runtimes
//! (Podman, Docker) ignore it and use hardcoded defaults. For example, Podman's
//! [containers/storage](https://github.com/containers/storage) uses root:root
//! ownership, mode `0555`, and epoch (0) mtime when extracting layers, but
//! Docker uses `0755`. In general, the metadata for `/` is not defined.
//!
//! Because composefs aims to provide precise, cryptographically verifiable
//! filesystem trees, we solve this for OCI by copying the metadata from `/usr`
//! to the root directory. The rationale is that `/usr` is always present in
//! standard filesystem layouts and must be defined explicitly in the OCI layers.
//!
//! This is implemented via the `copy_root_metadata_from_usr()` method and the
//! `read_container_root()` convenience function.
//!
//! When building a filesystem from OCI layers programmatically, use
//! `Stat::uninitialized()` to create the initial `FileSystem`. This placeholder
//! has mode `0` (obviously invalid) to make it clear that the root metadata should
//! be set before computing digests - typically by calling
//! `copy_root_metadata_from_usr()` after processing all layers.
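The idea of starting from an invalid placeholder and then copying `/usr` metadata onto the root can be modelled in a few lines. This is a simplified stand-in, not the composefs-rs API: `Stat`, `Dir`, `placeholder_root`, and `copy_root_stat_from_usr` below are hypothetical names, and only directories are modelled.

```rust
use std::collections::BTreeMap;

/// Minimal inode metadata for the illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Stat {
    pub mode: u32,
    pub uid: u32,
    pub gid: u32,
    pub mtime: i64,
}

/// A directory with its metadata and child directories.
#[derive(Debug)]
pub struct Dir {
    pub stat: Stat,
    pub children: BTreeMap<String, Dir>,
}

impl Dir {
    /// A root whose mode is 0, i.e. obviously invalid until it is filled in.
    pub fn placeholder_root() -> Self {
        Dir {
            stat: Stat { mode: 0, uid: 0, gid: 0, mtime: 0 },
            children: BTreeMap::new(),
        }
    }
}

/// Copies the metadata of `/usr` onto the root directory, failing if the
/// tree has no `/usr` entry to copy from.
pub fn copy_root_stat_from_usr(root: &mut Dir) -> Result<(), String> {
    let usr_stat = root
        .children
        .get("usr")
        .map(|usr| usr.stat.clone())
        .ok_or_else(|| "no /usr directory in image".to_string())?;
    root.stat = usr_stat;
    Ok(())
}
```

Running the copy only after all layers have been processed matters in this model, because a later layer may still change the attributes of `/usr`.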
//!
//! # Extended attributes (xattrs)
//!
//! When reading a container filesystem from a mounted root (as opposed to
//! processing OCI layer tars directly), host-side xattrs can leak into the
//! image. This is particularly problematic for `security.selinux` labels:
//! if SELinux is enabled at build time, files will have labels like
//! `container_t` that come from the build host, not from the target system's
//! policy.
//!
//! To ensure reproducibility, `read_container_root()` filters xattrs to only
//! include those in an allowlist. Currently this is just `security.capability`,
//! which represents actual file capabilities that should be preserved.
//!
//! SELinux labels are handled separately by `transform_for_boot()`:
//! - If the target filesystem contains a SELinux policy (in `/etc/selinux`),
//! all files are relabeled according to that policy
//! - If no SELinux policy is found, all `security.selinux` xattrs are stripped
//!
//! This ensures that:
//! - Build-time SELinux labels don't leak into non-SELinux targets
//! - SELinux-enabled targets get correct labels from their own policy
//! - Other host xattrs (overlayfs internals, etc.) don't pollute the image
//!
//! See: <https://github.com/containers/storage/pull/1608#issuecomment-1600915185>
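The allowlist filtering described in this section can be sketched as a simple `retain` over a map of xattrs. The constant `XATTR_ALLOWLIST` and the function `filter_xattrs` are hypothetical names for illustration; only the single allowed name, `security.capability`, comes from the text above.

```rust
use std::collections::BTreeMap;

/// Xattr names that are kept when reading from a mounted root.
const XATTR_ALLOWLIST: &[&str] = &["security.capability"];

/// Drops every xattr whose name is not on the allowlist, so host-side
/// labels and overlayfs internals do not end up in the image.
pub fn filter_xattrs(xattrs: &mut BTreeMap<String, Vec<u8>>) {
    xattrs.retain(|name, _value| XATTR_ALLOWLIST.contains(&name.as_str()));
}
```

Under this scheme `security.selinux` is always removed at read time, which fits the description above of labels being recomputed separately from the target's own policy.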
//!
//! # The /run directory
//!
//! When processing OCI images via `create_filesystem()`, the `/run` directory
//! is emptied if present. This is a tmpfs at runtime and should always be
//! empty in images. Its mtime is set to match `/usr` for consistency with
//! how root directory metadata is handled.
//!
//! This makes it possible to work around a podman/buildah `RUN --mount` issue:
//! cache mounts can leave incomplete directory entries in OCI tar layers, and
//! directories without explicit tar entries inherit incorrect mtimes. The
//! workaround is to point all such mounts into `/run` and redirect their final
//! locations there, e.g. via symlinks into `/run`.
//!
//! ## Container build cache mounts
//!
//! A practical implication of emptying `/run` is that container authors can
//! use it for cache mounts without worrying about polluting the final image.
//!
//! Instead of:
//! ```dockerfile
//! RUN --mount=type=cache,target=/var/cache/dnf dnf install -y ...
//! ```
//!
//! Consider:
//! ```dockerfile
//! RUN rm -rf /var/cache/dnf && ln -sr /run/dnfcache /var/cache/dnf
//! RUN --mount=type=cache,target=/run/dnfcache dnf install -y ...
//! ```
//!
//! This avoids potential mtime inconsistencies in `/var/cache` while still
//! benefiting from build caching.
//!
//! See: <https://github.com/containers/composefs-rs/issues/132>
//!
//! # Emptied directories for boot
//!
//! When preparing a filesystem for boot via `transform_for_boot()`, certain
//! additional directories are emptied because their contents should not be
//! part of the final verified image:
//!
//! - `/boot`: Contains the UKI which embeds the composefs digest, so including
//! it would create a circular dependency
//! - `/sysroot`: Only has content in ostree-container cases, and traversing
//! it for SELinux labeling causes problems
//!
//! These directories are emptied and their mtime is set to match `/usr` for
//! consistency with how the root directory metadata is handled.
3 changes: 3 additions & 0 deletions crates/composefs-oci/src/lib.rs
@@ -35,6 +35,9 @@ pub mod tar;
#[doc(hidden)]
pub mod test_util;

#[cfg(doc)]
pub mod design;

// Re-export the composefs crate for consumers who only need composefs-oci
pub use composefs;
