composefs
diff --git a/‎README.md‎
Lines changed: 19 additions & 65 deletions b/‎README.md‎
diff --git a/‎crates/composefs-boot/src/design.rs‎
Lines changed: 90 additions & 0 deletions b/‎crates/composefs-boot/src/design.rs‎
diff --git a/‎crates/composefs-boot/src/lib.rs‎
Lines changed: 3 additions & 0 deletions b/‎crates/composefs-boot/src/lib.rs‎
diff --git a/‎crates/composefs-oci/src/design.rs‎
Lines changed: 127 additions & 0 deletions b/‎crates/composefs-oci/src/design.rs‎
diff --git a/‎crates/composefs-oci/src/lib.rs‎
Lines changed: 3 additions & 0 deletions b/‎crates/composefs-oci/src/lib.rs‎
@@ -34,76 +34,32 @@ and Linux kernel integration, but with the *flexibility* of files
 for content — avoiding doubled disk usage, partition table management,
 and similar headaches.
 
-### Separation between metadata and data
-
-A key aspect of composefs is its separation of "data" (non-empty regular
-files) from "metadata" (everything else: directories, symlinks, permissions,
-ownership, etc.).
-
-composefs produces an [EROFS](https://erofs.docs.kernel.org) filesystem
-image that contains only metadata. The non-empty data files live in a
-separate "backing store" directory. The EROFS image includes
-`trusted.overlay.redirect` extended attributes that tell the overlayfs
-mount how to find the real underlying files.
-
-### Shared backing store
-
-The primary use case for composefs is versioned, immutable filesystem
-trees — container images and bootable host systems — where multiple
-images may share parts of their storage.
-
-By storing files content-addressed (named by the hash of their content),
-shared files need to be stored only once on disk yet can appear in
-multiple mounts. Crucially, these data files are also shared in the
-[page cache](https://static.lwn.net/kerneldoc/admin-guide/mm/concepts.html#page-cache),
-allowing multiple running container images to reliably share memory.
-
-### Filesystem integrity
-
-composefs supports [fs-verity](https://www.kernel.org/doc/html/latest/filesystems/fsverity.html)
-validation of content files. The digest of each content file is stored
-in the EROFS image via `trusted.overlay.metacopy` extended attributes,
-which overlayfs validates when the file is accessed. This means backing
-content cannot be changed (by mistake or by malice) without detection.
-
-You can also enable fs-verity on the image file itself and pass the expected
-digest as a mount option. This provides full trust of both data and metadata,
-solving a weakness of fs-verity alone (which can only verify file data,
-not metadata like permissions, ownership, or directory structure).
+composefs separates metadata (directories, permissions, xattrs) from data
+(file content). An EROFS image carries only the metadata; data files live in
+a content-addressed backing store, shared across images and in the Linux
+[page cache](https://static.lwn.net/kerneldoc/admin-guide/mm/concepts.html#page-cache).
+Optional [fs-verity](https://www.kernel.org/doc/html/latest/filesystems/fsverity.html)
+provides end-to-end integrity verification of both data and metadata.
+For design details, see the [crate documentation](https://docs.rs/composefs).
 
 ## Use cases
 
 ### Container images
 
-For [OCI](https://github.com/opencontainers/image-spec/blob/main/spec.md)
-container images, a common approach (used by both Docker and Podman) is
-to untar each layer separately and use overlayfs to stitch them together.
-composefs improves on this by storing file content in a content-addressed
-fashion, allowing sharing between images even when metadata like
-timestamps or ownership differs.
-
-Combined with approaches like
-[zstd:chunked](https://github.com/containers/storage/pull/775),
-this speeds up pulling container images and avoids redundantly
-creating files that are already present.
+composefs improves on the traditional per-layer overlayfs model for
+[OCI](https://github.com/opencontainers/image-spec/blob/main/spec.md)
+container images by storing file content in a content-addressed store,
+enabling sharing between images and faster pulls via
+[zstd:chunked](https://github.com/containers/storage/pull/775).
 
 ### Bootable host systems
 
-Anywhere one wants versioned immutable filesystem trees ("images"),
-composefs provides compelling advantages. In particular, this project
-aims to be the successor to [OSTree](https://github.com/ostreedev/ostree/).
-
-OSTree uses a content-addressed object store, but traditionally checks out
-into a regular directory (using hardlinks), which is then bind-mounted as
-the rootfs. While OSTree supports enabling fs-verity on files in the store,
-nothing protects the checkout directories from modification.
-
-composefs replaces this checkout with a directly-mountable image pointing
-into the object store. We can enable fs-verity on the composefs image and
-embed its digest in the kernel commandline or a Unified Kernel Image (UKI).
-Since composefs generation is reproducible, we can verify the generated
-image is correct by comparing its digest to one in the metadata produced
-at build time. For more on this, see [this tracking issue](https://github.com/ostreedev/ostree/issues/2867).
+composefs aims to succeed [OSTree](https://github.com/ostreedev/ostree/)
+by replacing hardlink checkouts with directly-mountable images backed by a
+shared object store. Combined with fs-verity and a digest embedded in the
+kernel commandline or a UKI, this provides cryptographic verification of
+the entire filesystem tree. See [this tracking issue](https://github.com/ostreedev/ostree/issues/2867)
+for background.
 
 ## Components
 
@@ -147,9 +103,7 @@ helper that supports `mount -t composefs` syntax directly.
 
 ## Documentation
 
- - [Repository format](doc/repository.md)
- - [OCI integration](doc/oci.md)
- - [Splitstream format](doc/splitstream.md)
+ - [API and design documentation](https://docs.rs/composefs)
  - [Examples README](examples/README.md)
 
 ## Status
 
@@ -0,0 +1,90 @@
+//! # Booting from a composefs image
+//!
+//! This document describes how composefs-rs sets up the root filesystem during
+//! early boot. It covers the kernel command-line interface, the expected on-disk
+//! layout, kernel requirements, and the step-by-step mount sequence performed by
+//! `composefs-setup-root`.
+//!
+//! The target audience is system integrators and OS developers who are packaging a
+//! bootable system using composefs. Familiarity with Linux mount namespaces,
+//! overlayfs, and fs-verity is assumed.
+//!
+//! ## Kernel command-line
+//!
+//! The initramfs code in composefs accepts multiple kernel arguments, so the
+//! digest of an image can be pre-computed with several algorithms, e.g. both
+//! SHA-256 and SHA-512. On an installed system, the repository supports only one
+//! digest algorithm by default today, and the first match found is selected.
+//!
+//! Support for v1 EROFS is opt-in; here too, the first compatible version found
+//! is selected.
+//!
+//! ```text
+//! composefs.digest=v1-sha256-12:<digest>   # V1 EROFS image (preferred; RHEL9-era kernels)
+//! composefs.digest=v1-sha512-12:<digest>   # V1 EROFS image (SHA-512 variant)
+//! composefs.digest=v2-sha512-12:<digest>   # V2 EROFS image (explicit form)
+//! composefs=<digest>                       # V2 EROFS image (legacy shorthand)
+//! ```
+//!
+//! The value format is `<version>-<hash>-<lg_blocksize>:<hex_digest>`, where
+//! `<version>` is `v1` or `v2`, `<hash>` is `sha256` or `sha512`, and
+//! `<lg_blocksize>` is the log2 block size (currently always `12`, i.e. 4096
+//! bytes). This mirrors how `meta.json` encodes the algorithm as
+//! `fsverity-sha256-12`.
+//!
+//! `composefs.digest=` is checked first. Multiple entries may appear on the cmdline
+//! (one per format/algorithm combination); the initramfs tries each in order and
+//! mounts the first image that actually exists in the repository.
+//!
+//! `composefs=<digest>` is a legacy shorthand equivalent to
+//! `composefs.digest=v2-<hash>-12:<digest>` -- the algorithm is inferred from the
+//! digest length (64 hex chars -> SHA-256, 128 -> SHA-512). It is checked only when
+//! no `composefs.digest=` token matches.
+//!
+//! **Insecure mode.** Placing `?` immediately after `=` (e.g.
+//! `composefs.digest=?v1-sha256-12:<digest>` or `composefs=?<digest>`) makes
+//! fs-verity verification optional. The system will boot even when the underlying
+//! filesystem does not support fs-verity or the image has no verity metadata
+//! attached. This mode exists for development and testing only; it must not be used
+//! in production.
+//!
+//! ## On-disk layout
+//!
+//! The composefs repository must be present at `/sysroot/composefs` with the
+//! standard layout described in the `composefs::repository_format` module.
+//!
+//! The digest must correspond to a symlink under `images/`.
+//!
+//! Persistent per-deployment state lives at `/sysroot/state/deploy/<digest>/`,
+//! where `<digest>` matches the boot karg digest exactly. The `etc/` and `var/`
+//! subdirectories within that directory serve as the upper layers for the
+//! corresponding overlayfs mounts.
+//!
+//! ## Kernel requirements
+//!
+//! The following kernel features must be available:
+//!
+//! - **EROFS** filesystem driver (`CONFIG_EROFS_FS`)
+//! - **overlayfs** with `metacopy=on` and `redirect_dir=on`
+//!   (`CONFIG_OVERLAY_FS`, `CONFIG_OVERLAY_FS_METACOPY`, `CONFIG_OVERLAY_FS_REDIRECT_DIR`)
+//! - **fs-verity** unless insecure mode is used (`CONFIG_FS_VERITY`)
+//! - The modern Linux mount API (`fsopen` / `fsconfig` / `fsmount` / `move_mount`),
+//!   available since kernel 5.2. Kernel >= 6.15 is required for the atomic root
+//!   replacement path (the default build). On kernels without `fsconfig_set_fd`
+//!   support (e.g. RHEL 9 / kernel < 5.15), a loopback device is created
+//!   automatically by `composefs::mountcompat`.
+//!
+//! ## Kernel argument
+//!
+//! The boot karg (`composefs.digest=` or `composefs=`) is the authoritative selector for which image is booted.
+//! Without the `?` insecure prefix, the kernel verifies every file access through
+//! the overlayfs against the object's stored digest, combining fs-verity on the
+//! data objects with overlayfs `verity=require`.
+//!
+//! ## Other notes
+//!
+//! As a workaround for a GPT auto-root issue in systemd
+//! ([systemd#35017](https://github.com/systemd/systemd/issues/35017)),
+//! `composefs-setup-root` attempts to create `/run/systemd/volatile-root` as a
+//! symlink pointing to the real block device before performing any mounts. Failure
+//! to do so is non-fatal and does not abort the boot sequence.
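The karg grammar described in the "Kernel command-line" section of `design.rs` can be sketched as a small parser. This is only an illustration of the documented format, not the code in `composefs-boot`: the type names (`ErofsVersion`, `HashAlg`, `DigestArg`) and functions (`parse_digest_value`, `parse_legacy_value`, `candidates`) are hypothetical, the hex check is an assumed validation step, and the whitespace tokenizer ignores kernel cmdline quoting.

```rust
// Illustrative sketch only: every name below is hypothetical and is not
// part of the composefs-boot API.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ErofsVersion {
    V1,
    V2,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HashAlg {
    Sha256,
    Sha512,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct DigestArg {
    version: ErofsVersion,
    hash: HashAlg,
    lg_blocksize: u8,
    digest: String,
    insecure: bool,
}

/// Splits off the optional leading `?` that marks insecure mode.
fn split_insecure(value: &str) -> (bool, &str) {
    match value.strip_prefix('?') {
        Some(rest) => (true, rest),
        None => (false, value),
    }
}

/// Assumed validation: the digest must be non-empty hex.
fn is_hex(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Parses a `composefs.digest=` value:
/// `[?]<version>-<hash>-<lg_blocksize>:<hex_digest>`.
fn parse_digest_value(value: &str) -> Option<DigestArg> {
    let (insecure, rest) = split_insecure(value);
    let (format, digest) = rest.split_once(':')?;
    let mut fields = format.split('-');
    let version = match fields.next()? {
        "v1" => ErofsVersion::V1,
        "v2" => ErofsVersion::V2,
        _ => return None,
    };
    let hash = match fields.next()? {
        "sha256" => HashAlg::Sha256,
        "sha512" => HashAlg::Sha512,
        _ => return None,
    };
    let lg_blocksize = fields.next()?.parse::<u8>().ok()?;
    if fields.next().is_some() || !is_hex(digest) {
        return None;
    }
    Some(DigestArg { version, hash, lg_blocksize, digest: digest.to_string(), insecure })
}

/// Parses a legacy `composefs=` value: always V2, with the hash inferred
/// from the digest length (64 hex chars -> SHA-256, 128 -> SHA-512).
fn parse_legacy_value(value: &str) -> Option<DigestArg> {
    let (insecure, digest) = split_insecure(value);
    let hash = match digest.len() {
        64 => HashAlg::Sha256,
        128 => HashAlg::Sha512,
        _ => return None,
    };
    if !is_hex(digest) {
        return None;
    }
    Some(DigestArg {
        version: ErofsVersion::V2,
        hash,
        lg_blocksize: 12,
        digest: digest.to_string(),
        insecure,
    })
}

/// Lists candidates in the order they would be tried: every
/// `composefs.digest=` entry in cmdline order, then `composefs=` as fallback.
fn candidates(cmdline: &str) -> Vec<DigestArg> {
    let tokens: Vec<&str> = cmdline.split_whitespace().collect();
    let explicit = tokens
        .iter()
        .filter_map(|t| t.strip_prefix("composefs.digest="))
        .filter_map(parse_digest_value);
    let legacy = tokens
        .iter()
        .filter_map(|t| t.strip_prefix("composefs="))
        .filter_map(parse_legacy_value);
    explicit.chain(legacy).collect()
}
```

A caller would walk the list returned by `candidates` and mount the first entry whose image exists in the repository, which matches the "first found" behaviour the document describes.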
@@ -15,6 +15,9 @@ pub mod selabel;
 pub mod uki;
 pub mod write_boot;
 
+#[cfg(doc)]
+pub mod design;
+
 use std::ffi::OsStr;
 
 use anyhow::Result;
 
@@ -0,0 +1,127 @@
+//! # How to create a composefs from an OCI image
+//!
+//! This document is incomplete.  It only serves to document some decisions we've
+//! taken about how to resolve ambiguous situations.
+//!
+//! # Data precision
+//!
+//! We currently create a composefs image using the granularity of data as it
+//! typically appears in OCI tarballs:
+//!  - atime and ctime are not present (these are not physically present in the
+//!    erofs inode structure at all, in either the compact or extended form)
+//!  - mtime is set to the mtime in seconds; the sub-second value is simply
+//!    truncated (i.e. we always round down).  erofs has an nsec field, but
+//!    sub-second precision is not normally present in OCI tarballs, because the
+//!    usual tar header only has timestamps in seconds and extended headers are
+//!    not usually added for this purpose.
+//!  - we take great care to faithfully represent hardlinks: even though the
+//!    produced filesystem is read-only and we have data de-duplication via the
+//!    objects store, we make sure that hardlinks result in an actual shared inode
+//!    as visible via the `st_ino` and `st_nlink` fields on the mounted filesystem.
+//!
+//! We apply these precision restrictions also when creating images by scanning the
+//! filesystem.  For example: even if we get more-accurate timestamp information,
+//! we'll truncate it to whole seconds.
+//!
+//! # Merging directories
+//!
+//! This is done according to the OCI spec, with an additional clarification: in
+//! case a directory entry is present in multiple layers, we use the tar metadata
+//! from the most-derived layer to determine the attributes (owner, permissions,
+//! mtime) for the directory.
+//!
+//! # The root inode
+//!
+//! The root inode (/) is a difficult case because OCI container layer tars often
+//! don't include a root directory entry, and when they do, container runtimes
+//! (Podman, Docker) ignore it and use hardcoded defaults.  For example, Podman's
+//! [containers/storage](https://github.com/containers/storage) uses root:root
+//! ownership, mode `0555`, and epoch (0) mtime when extracting layers, but
+//! Docker uses `0755`. In general, the metadata for `/` is not defined.
+//!
+//! Because composefs aims to provide precise, cryptographically verifiable
+//! filesystem trees, we solve this for OCI by copying the metadata from `/usr`
+//! to the root directory.  The rationale is that `/usr` is always present in
+//! standard filesystem layouts and must be defined explicitly in the OCI layers.
+//!
+//! This is implemented via the `copy_root_metadata_from_usr()` method and the
+//! `read_container_root()` convenience function.
+//!
+//! When building a filesystem from OCI layers programmatically, use
+//! `Stat::uninitialized()` to create the initial `FileSystem`.  This placeholder
+//! has mode `0` (obviously invalid) to make it clear that the root metadata should
+//! be set before computing digests - typically by calling
+//! `copy_root_metadata_from_usr()` after processing all layers.
+//!
+//! # Extended attributes (xattrs)
+//!
+//! When reading a container filesystem from a mounted root (as opposed to
+//! processing OCI layer tars directly), host-side xattrs can leak into the
+//! image.  This is particularly problematic for `security.selinux` labels:
+//! if SELinux is enabled at build time, files will have labels like
+//! `container_t` that come from the build host, not from the target system's
+//! policy.
+//!
+//! To ensure reproducibility, `read_container_root()` filters xattrs to only
+//! include those in an allowlist.  Currently this is just `security.capability`,
+//! which represents actual file capabilities that should be preserved.
+//!
+//! SELinux labels are handled separately by `transform_for_boot()`:
+//!  - If the target filesystem contains a SELinux policy (in `/etc/selinux`),
+//!    all files are relabeled according to that policy
+//!  - If no SELinux policy is found, all `security.selinux` xattrs are stripped
+//!
+//! This ensures that:
+//!  - Build-time SELinux labels don't leak into non-SELinux targets
+//!  - SELinux-enabled targets get correct labels from their own policy
+//!  - Other host xattrs (overlayfs internals, etc.) don't pollute the image
+//!
+//! See: <https://github.com/containers/storage/pull/1608#issuecomment-1600915185>
+//!
+//! # The /run directory
+//!
+//! When processing OCI images via `create_filesystem()`, the `/run` directory
+//! is emptied if present. This is a tmpfs at runtime and should always be
+//! empty in images. Its mtime is set to match `/usr` for consistency with
+//! how root directory metadata is handled.
+//!
+//! This makes it possible to work around an issue with podman/buildah's
+//! `RUN --mount`, where cache mounts can leave incomplete directory entries in
+//! OCI tar layers (directories without explicit tar entries inherit incorrect
+//! mtimes): point all such mounts into `/run`, and redirect their final
+//! locations there, e.g. via symlinks into `/run`.
+//!
+//! ## Container build cache mounts
+//!
+//! A practical implication of emptying `/run` is that container authors can
+//! use it for cache mounts without worrying about polluting the final image.
+//!
+//! Instead of:
+//! ```dockerfile
+//! RUN --mount=type=cache,target=/var/cache/dnf dnf install -y ...
+//! ```
+//!
+//! Consider:
+//! ```dockerfile
+//! RUN rm -rf /var/cache/dnf && ln -sr /run/dnfcache /var/cache/dnf
+//! RUN --mount=type=cache,target=/run/dnfcache dnf install -y ...
+//! ```
+//!
+//! This avoids potential mtime inconsistencies in `/var/cache` while still
+//! benefiting from build caching.
+//!
+//! See: <https://github.com/containers/composefs-rs/issues/132>
+//!
+//! # Emptied directories for boot
+//!
+//! When preparing a filesystem for boot via `transform_for_boot()`, certain
+//! additional directories are emptied because their contents should not be
+//! part of the final verified image:
+//!
+//! - `/boot`: Contains the UKI which embeds the composefs digest, so including
+//!   it would create a circular dependency
+//! - `/sysroot`: Only has content in ostree-container cases, and traversing
+//!   it for SELinux labeling causes problems
+//!
+//! These directories are emptied and their mtime is set to match `/usr` for
+//! consistency with how the root directory metadata is handled.
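Two of the normalization rules from the OCI design document (the xattr allowlist and mtime truncation) are simple enough to sketch directly. This is a minimal model under stated assumptions, not the `composefs-oci` implementation: `filter_xattrs`, `truncate_mtime` and the `Timestamp` struct are hypothetical names, and the only fact taken from the text is that `security.capability` is currently the sole allowlisted xattr.

```rust
use std::collections::BTreeMap;

/// Xattr names that survive filtering. The design text names only
/// `security.capability`.
const XATTR_ALLOWLIST: &[&str] = &["security.capability"];

/// Hypothetical helper: keeps only allowlisted xattrs, so host-side
/// attributes such as `security.selinux` never reach the image.
fn filter_xattrs(xattrs: BTreeMap<String, Vec<u8>>) -> BTreeMap<String, Vec<u8>> {
    xattrs
        .into_iter()
        .filter(|(name, _)| XATTR_ALLOWLIST.iter().any(|allowed| *allowed == name.as_str()))
        .collect()
}

/// Hypothetical timestamp using the usual (seconds, nanoseconds) split,
/// where `nsec` is a non-negative offset from `sec`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Timestamp {
    sec: i64,
    nsec: u32,
}

/// Drops the sub-second part. Because `nsec` is never negative, discarding
/// it always rounds down rather than to the nearest second.
fn truncate_mtime(ts: Timestamp) -> Timestamp {
    Timestamp { sec: ts.sec, nsec: 0 }
}
```

Applying both functions when scanning a mounted root gives the same per-file metadata that processing the OCI tar layers would, which is the reproducibility property the document is after.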
@@ -35,6 +35,9 @@ pub mod tar;
 #[doc(hidden)]
 pub mod test_util;
 
+#[cfg(doc)]
+pub mod design;
+
 // Re-export the composefs crate for consumers who only need composefs-oci
 pub use composefs;
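Both `lib.rs` hunks gate the new `design` module behind `#[cfg(doc)]`, so the design text is compiled only when rustdoc builds the crate. A minimal sketch of that pattern, using an inline module instead of a separate `design.rs` file so that it is self-contained (`built_by_rustdoc` is a hypothetical helper added only to make the behaviour observable):

```rust
/// Design notes that exist only when rustdoc builds the crate.
#[cfg(doc)]
pub mod design {
    //! # Design
    //!
    //! Long-form documentation lives here as module-level doc comments.
}

/// Reports whether this build has the `doc` cfg set, which rustdoc does
/// and a regular `cargo build` or `cargo test` does not.
pub fn built_by_rustdoc() -> bool {
    cfg!(doc)
}
```

In a normal build the module is compiled out entirely, so it adds nothing to the binary and cannot be referenced from regular code.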