Commit 3e3fe9f

Make rustdoc hold most documentation
I want composefs to have a website with useful docs. After some reflection, what I think is the least effort and most clear right now is just to move things into the Rust documentation.

So things like the repository format, splitstream definition now live there. The pattern is that we have `#[cfg(doc)] mod foo;` for things that only need to live in docs.

Add information about varlink there too.

Assisted-by: OpenCode (claude-opus-4-6)
Signed-off-by: Colin Walters <walters@verbum.org>
1 parent ad726f3 commit 3e3fe9f

18 files changed

Lines changed: 1152 additions & 987 deletions

README.md

Lines changed: 19 additions & 65 deletions
Original file line numberDiff line numberDiff line change
@@ -34,76 +34,32 @@ and Linux kernel integration, but with the *flexibility* of files
3434
for content — avoiding doubled disk usage, partition table management,
3535
and similar headaches.
3636

37-
### Separation between metadata and data
38-
39-
A key aspect of composefs is its separation of "data" (non-empty regular
40-
files) from "metadata" (everything else: directories, symlinks, permissions,
41-
ownership, etc.).
42-
43-
composefs produces an [EROFS](https://erofs.docs.kernel.org) filesystem
44-
image that contains only metadata. The non-empty data files live in a
45-
separate "backing store" directory. The EROFS image includes
46-
`trusted.overlay.redirect` extended attributes that tell the overlayfs
47-
mount how to find the real underlying files.
48-
49-
### Shared backing store
50-
51-
The primary use case for composefs is versioned, immutable filesystem
52-
trees — container images and bootable host systems — where multiple
53-
images may share parts of their storage.
54-
55-
By storing files content-addressed (named by the hash of their content),
56-
shared files need to be stored only once on disk yet can appear in
57-
multiple mounts. Crucially, these data files are also shared in the
58-
[page cache](https://static.lwn.net/kerneldoc/admin-guide/mm/concepts.html#page-cache),
59-
allowing multiple running container images to reliably share memory.
60-
61-
### Filesystem integrity
62-
63-
composefs supports [fs-verity](https://www.kernel.org/doc/html/latest/filesystems/fsverity.html)
64-
validation of content files. The digest of each content file is stored
65-
in the EROFS image via `trusted.overlay.metacopy` extended attributes,
66-
which overlayfs validates when the file is accessed. This means backing
67-
content cannot be changed (by mistake or by malice) without detection.
68-
69-
You can also enable fs-verity on the image file itself and pass the expected
70-
digest as a mount option. This provides full trust of both data and metadata,
71-
solving a weakness of fs-verity alone (which can only verify file data,
72-
not metadata like permissions, ownership, or directory structure).
37+
composefs separates metadata (directories, permissions, xattrs) from data
38+
(file content). An EROFS image carries only the metadata; data files live in
39+
a content-addressed backing store, shared across images and in the Linux
40+
[page cache](https://static.lwn.net/kerneldoc/admin-guide/mm/concepts.html#page-cache).
41+
Optional [fs-verity](https://www.kernel.org/doc/html/latest/filesystems/fsverity.html)
42+
provides end-to-end integrity verification of both data and metadata.
43+
For design details, see the [crate documentation](https://docs.rs/composefs).
7344

7445
## Use cases
7546

7647
### Container images
7748

78-
For [OCI](https://github.com/opencontainers/image-spec/blob/main/spec.md)
79-
container images, a common approach (used by both Docker and Podman) is
80-
to untar each layer separately and use overlayfs to stitch them together.
81-
composefs improves on this by storing file content in a content-addressed
82-
fashion, allowing sharing between images even when metadata like
83-
timestamps or ownership differs.
84-
85-
Combined with approaches like
86-
[zstd:chunked](https://github.com/containers/storage/pull/775),
87-
this speeds up pulling container images and avoids redundantly
88-
creating files that are already present.
49+
composefs improves on the traditional per-layer overlayfs model for
50+
[OCI](https://github.com/opencontainers/image-spec/blob/main/spec.md)
51+
container images by storing file content in a content-addressed store,
52+
enabling sharing between images and faster pulls via
53+
[zstd:chunked](https://github.com/containers/storage/pull/775).
8954

9055
### Bootable host systems
9156

92-
Anywhere one wants versioned immutable filesystem trees ("images"),
93-
composefs provides compelling advantages. In particular, this project
94-
aims to be the successor to [OSTree](https://github.com/ostreedev/ostree/).
95-
96-
OSTree uses a content-addressed object store, but traditionally checks out
97-
into a regular directory (using hardlinks), which is then bind-mounted as
98-
the rootfs. While OSTree supports enabling fs-verity on files in the store,
99-
nothing protects the checkout directories from modification.
100-
101-
composefs replaces this checkout with a directly-mountable image pointing
102-
into the object store. We can enable fs-verity on the composefs image and
103-
embed its digest in the kernel commandline or a Unified Kernel Image (UKI).
104-
Since composefs generation is reproducible, we can verify the generated
105-
image is correct by comparing its digest to one in the metadata produced
106-
at build time. For more on this, see [this tracking issue](https://github.com/ostreedev/ostree/issues/2867).
57+
composefs aims to succeed [OSTree](https://github.com/ostreedev/ostree/)
58+
by replacing hardlink checkouts with directly-mountable images backed by a
59+
shared object store. Combined with fs-verity and a digest embedded in the
60+
kernel commandline or a UKI, this provides cryptographic verification of
61+
the entire filesystem tree. See [this tracking issue](https://github.com/ostreedev/ostree/issues/2867)
62+
for background.
10763

10864
## Components
10965

@@ -147,9 +103,7 @@ helper that supports `mount -t composefs` syntax directly.
147103

148104
## Documentation
149105

150-
- [Repository format](doc/repository.md)
151-
- [OCI integration](doc/oci.md)
152-
- [Splitstream format](doc/splitstream.md)
106+
- [API and design documentation](https://docs.rs/composefs)
153107
- [Examples README](examples/README.md)
154108

155109
## Status
Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
1+
//! # Booting from a composefs image
2+
//!
3+
//! This document describes how composefs-rs sets up the root filesystem during
4+
//! early boot. It covers the kernel command-line interface, the expected on-disk
5+
//! layout, kernel requirements, and the step-by-step mount sequence performed by
6+
//! `composefs-setup-root`.
7+
//!
8+
//! The target audience is system integrators and OS developers who are packaging a
9+
//! bootable system using composefs. Familiarity with Linux mount namespaces,
10+
//! overlayfs, and fs-verity is assumed.
11+
//!
12+
//! ## Kernel command-line
13+
//!
14+
//! The initramfs code in composefs accepts multiple kernel arguments, so the
15+
//! digest of an image can be pre-computed with more than one algorithm (e.g.
16+
//! SHA-256 and SHA-512). On an installed system, the repository supports only
17+
//! one digest by default today, and the first one found is selected.
18+
//!
19+
//! Additionally, v1 EROFS support is opt-in; again, the first compatible
20+
//! version found is used.
21+
//!
22+
//! ```text
23+
//! composefs.digest=v1-sha256-12:<digest> # V1 EROFS image (preferred; RHEL9-era kernels)
24+
//! composefs.digest=v1-sha512-12:<digest> # V1 EROFS image (SHA-512 variant)
25+
//! composefs.digest=v2-sha512-12:<digest> # V2 EROFS image (explicit form)
26+
//! composefs=<digest> # V2 EROFS image (legacy shorthand)
27+
//! ```
28+
//!
29+
//! The value format is `<version>-<hash>-<lg_blocksize>:<hex_digest>`, where
30+
//! `<version>` is `v1` or `v2`, `<hash>` is `sha256` or `sha512`, and
31+
//! `<lg_blocksize>` is the log2 block size (currently always `12`, i.e. 4096
32+
//! bytes). This mirrors how `meta.json` encodes the algorithm as
33+
//! `fsverity-sha256-12`.
34+
//!
35+
//! `composefs.digest=` is checked first. Multiple entries may appear on the cmdline
36+
//! (one per format/algorithm combination); the initramfs tries each in order and
37+
//! mounts the first image that actually exists in the repository.
38+
//!
39+
//! `composefs=<digest>` is a legacy shorthand equivalent to
40+
//! `composefs.digest=v2-<hash>-12:<digest>` -- the algorithm is inferred from the
41+
//! digest length (64 hex chars -> SHA-256, 128 -> SHA-512). It is checked only when
42+
//! no `composefs.digest=` token matches.
43+
//!
44+
//! **Insecure mode.** Placing `?` immediately after `=` (e.g.
45+
//! `composefs.digest=?v1-sha256-12:<digest>` or `composefs=?<digest>`) makes
46+
//! fs-verity verification optional. The system will boot even when the underlying
47+
//! filesystem does not support fs-verity or the image has no verity metadata
48+
//! attached. This mode exists for development and testing only; it must not be used
49+
//! in production.
50+
//!
51+
//! ## On-disk layout
52+
//!
53+
//! The composefs repository must be present at `/sysroot/composefs` with the
54+
//! standard layout described in the `composefs::repository_format` module.
55+
//!
56+
//! The digest must correspond to a symlink under `images/`.
57+
//!
58+
//! Persistent per-deployment state lives at `/sysroot/state/deploy/<digest>/`,
59+
//! where `<digest>` matches the boot karg digest exactly. The `etc/` and `var/`
60+
//! subdirectories within that directory serve as the upper layers for the
61+
//! corresponding overlayfs mounts.
62+
//!
63+
//! ## Kernel requirements
64+
//!
65+
//! The following kernel features must be available:
66+
//!
67+
//! - **EROFS** filesystem driver (`CONFIG_EROFS_FS`)
68+
//! - **overlayfs** with `metacopy=on` and `redirect_dir=on`
69+
//! (`CONFIG_OVERLAY_FS`, `CONFIG_OVERLAY_FS_METACOPY`, `CONFIG_OVERLAY_FS_REDIRECT_DIR`)
70+
//! - **fs-verity** unless insecure mode is used (`CONFIG_FS_VERITY`)
71+
//! - The modern Linux mount API (`fsopen` / `fsconfig` / `fsmount` / `move_mount`),
72+
//! available since kernel 5.2. Kernel >= 6.15 is required for the atomic root
73+
//! replacement path (the default build). On kernels without `fsconfig_set_fd`
74+
//! support (e.g. RHEL 9 / kernel < 5.15), a loopback device is created
75+
//! automatically by `composefs::mountcompat`.
76+
//!
77+
//! ## Kernel argument
78+
//!
79+
//! The boot karg (`composefs.digest=` or `composefs=`) is the authoritative selector for which image is booted.
80+
//! Without the `?` insecure prefix, every file access through the overlayfs is
81+
//! verified against the object's stored digest by the kernel, combining fs-verity
82+
//! on the data objects with overlayfs `verity=require`.
83+
//!
84+
//! ## Other notes
85+
//!
86+
//! As a workaround for a GPT auto-root issue in systemd
87+
//! ([systemd#35017](https://github.com/systemd/systemd/issues/35017)),
88+
//! `composefs-setup-root` attempts to create `/run/systemd/volatile-root` as a
89+
//! symlink pointing to the real block device before performing any mounts. Failure
90+
//! to do so is non-fatal and does not abort the boot sequence.
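
The karg grammar described in the boot design document above can be sketched as a small parser. This is only an illustration under stated assumptions: `Hash`, `DigestArg`, `parse_token`, and `candidates` are hypothetical names that do not come from composefs-rs, the sketch ignores kernel command-line quoting, and the hex-length check on the explicit form is an assumption rather than documented behavior. It does not check whether an image exists in the repository.

```rust
/// Hash algorithm named in the karg (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Hash {
    Sha256,
    Sha512,
}

impl Hash {
    /// Length of the digest in hex characters.
    fn hex_len(self) -> usize {
        match self {
            Hash::Sha256 => 64,
            Hash::Sha512 => 128,
        }
    }
}

/// One parsed `composefs.digest=` or `composefs=` argument (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DigestArg {
    pub erofs_version: u8,
    pub hash: Hash,
    pub lg_blocksize: u8,
    pub hex_digest: String,
    /// True when the value started with `?` (insecure mode).
    pub insecure: bool,
}

/// Split off the optional `?` insecure-mode prefix.
fn split_insecure(value: &str) -> (bool, &str) {
    match value.strip_prefix('?') {
        Some(rest) => (true, rest),
        None => (false, value),
    }
}

fn is_hex(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Parse a single whitespace-separated cmdline token.
pub fn parse_token(token: &str) -> Option<DigestArg> {
    if let Some(value) = token.strip_prefix("composefs.digest=") {
        // Explicit form: <version>-<hash>-<lg_blocksize>:<hex_digest>
        let (insecure, value) = split_insecure(value);
        let (spec, hex) = value.split_once(':')?;
        let mut fields = spec.split('-');
        let erofs_version = match fields.next()? {
            "v1" => 1,
            "v2" => 2,
            _ => return None,
        };
        let hash = match fields.next()? {
            "sha256" => Hash::Sha256,
            "sha512" => Hash::Sha512,
            _ => return None,
        };
        let lg_blocksize: u8 = fields.next()?.parse().ok()?;
        // Assumption: reject trailing fields and digests of the wrong length.
        if fields.next().is_some() || hex.len() != hash.hex_len() || !is_hex(hex) {
            return None;
        }
        Some(DigestArg { erofs_version, hash, lg_blocksize, hex_digest: hex.to_string(), insecure })
    } else if let Some(value) = token.strip_prefix("composefs=") {
        // Legacy shorthand: always v2, algorithm inferred from the digest length.
        let (insecure, hex) = split_insecure(value);
        let hash = match hex.len() {
            64 => Hash::Sha256,
            128 => Hash::Sha512,
            _ => return None,
        };
        if !is_hex(hex) {
            return None;
        }
        Some(DigestArg { erofs_version: 2, hash, lg_blocksize: 12, hex_digest: hex.to_string(), insecure })
    } else {
        None
    }
}

/// Candidates in the order described above: every `composefs.digest=` entry
/// first (in cmdline order), then any legacy `composefs=` entries.
pub fn candidates(cmdline: &str) -> Vec<DigestArg> {
    let mut out = Vec::new();
    for prefix in ["composefs.digest=", "composefs="] {
        for token in cmdline.split_whitespace() {
            if token.starts_with(prefix) {
                if let Some(arg) = parse_token(token) {
                    out.push(arg);
                }
            }
        }
    }
    out
}
```

A caller would walk the list returned by `candidates` in order and mount the first entry whose image exists under `images/` in the repository; that lookup is deliberately left out of the sketch.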

crates/composefs-boot/src/lib.rs

Lines changed: 3 additions & 0 deletions
@@ -15,6 +15,9 @@ pub mod selabel;
1515
pub mod uki;
1616
pub mod write_boot;
1717

18+
#[cfg(doc)]
19+
pub mod design;
20+
1821
use std::ffi::OsStr;
1922

2023
use anyhow::Result;

crates/composefs-oci/src/design.rs

Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
1+
//! # How to create a composefs from an OCI image
2+
//!
3+
//! This document is incomplete. It only serves to document some decisions we've
4+
//! taken about how to resolve ambiguous situations.
5+
//!
6+
//! # Data precision
7+
//!
8+
//! We currently create a composefs image using the granularity of data as
9+
//! typically appears in OCI tarballs:
10+
//! - atime and ctime are not present (these are actually not physically present
11+
//! in the erofs inode structure at all, either the compact or extended forms)
12+
//! - mtime is set to the mtime in seconds; the sub-seconds value is simply
13+
//! truncated (i.e. we always round down). erofs has an nsec field, but it's not
14+
//! normally present in OCI tarballs. That's down to the fact that the usual
15+
//! tar header only has timestamps in seconds and extended headers are not
16+
//! usually added for this purpose.
17+
//! - we take great care to faithfully represent hardlinks: even though the
18+
//! produced filesystem is read-only and we have data de-duplication via the
19+
//! objects store, we make sure that hardlinks result in an actual shared inode
20+
//! as visible via the `st_ino` and `st_nlink` fields on the mounted filesystem.
21+
//!
22+
//! We apply these precision restrictions also when creating images by scanning the
23+
//! filesystem. For example: even if we get more-accurate timestamp information,
24+
//! we'll truncate it down to whole seconds.
25+
//!
26+
//! # Merging directories
27+
//!
28+
//! This is done according to the OCI spec, with an additional clarification: in
29+
//! case a directory entry is present in multiple layers, we use the tar metadata
30+
//! from the most-derived layer to determine the attributes (owner, permissions,
31+
//! mtime) for the directory.
32+
//!
33+
//! # The root inode
34+
//!
35+
//! The root inode (/) is a difficult case because OCI container layer tars often
36+
//! don't include a root directory entry, and when they do, container runtimes
37+
//! (Podman, Docker) ignore it and use hardcoded defaults. For example, Podman's
38+
//! [containers/storage](https://github.com/containers/storage) uses root:root
39+
//! ownership, mode `0555`, and epoch (0) mtime when extracting layers, but
40+
//! Docker uses `0755`. In general, the metadata for `/` is not defined.
41+
//!
42+
//! Because composefs aims to provide precise, cryptographically verifiable
43+
//! filesystem trees, we resolve this for OCI by copying the metadata from `/usr`
44+
//! to the root directory. The rationale is that `/usr` is always present in
45+
//! standard filesystem layouts and must be defined explicitly in the OCI layers.
46+
//!
47+
//! This is implemented via the `copy_root_metadata_from_usr()` method and the
48+
//! `read_container_root()` convenience function.
49+
//!
50+
//! When building a filesystem from OCI layers programmatically, use
51+
//! `Stat::uninitialized()` to create the initial `FileSystem`. This placeholder
52+
//! has mode `0` (obviously invalid) to make it clear that the root metadata should
53+
//! be set before computing digests - typically by calling
54+
//! `copy_root_metadata_from_usr()` after processing all layers.
55+
//!
56+
//! # Extended attributes (xattrs)
57+
//!
58+
//! When reading a container filesystem from a mounted root (as opposed to
59+
//! processing OCI layer tars directly), host-side xattrs can leak into the
60+
//! image. This is particularly problematic for `security.selinux` labels:
61+
//! if SELinux is enabled at build time, files will have labels like
62+
//! `container_t` that come from the build host, not from the target system's
63+
//! policy.
64+
//!
65+
//! To ensure reproducibility, `read_container_root()` filters xattrs to only
66+
//! include those in an allowlist. Currently this is just `security.capability`,
67+
//! which represents actual file capabilities that should be preserved.
68+
//!
69+
//! SELinux labels are handled separately by `transform_for_boot()`:
70+
//! - If the target filesystem contains a SELinux policy (in `/etc/selinux`),
71+
//! all files are relabeled according to that policy
72+
//! - If no SELinux policy is found, all `security.selinux` xattrs are stripped
73+
//!
74+
//! This ensures that:
75+
//! - Build-time SELinux labels don't leak into non-SELinux targets
76+
//! - SELinux-enabled targets get correct labels from their own policy
77+
//! - Other host xattrs (overlayfs internals, etc.) don't pollute the image
78+
//!
79+
//! See: <https://github.com/containers/storage/pull/1608#issuecomment-1600915185>
80+
//!
81+
//! # The /run directory
82+
//!
83+
//! When processing OCI images via `create_filesystem()`, the `/run` directory
84+
//! is emptied if present. This is a tmpfs at runtime and should always be
85+
//! empty in images. Its mtime is set to match `/usr` for consistency with
86+
//! how root directory metadata is handled.
87+
//!
88+
//! This makes it possible to work around a podman/buildah `RUN --mount` issue:
89+
//! cache mounts can leave incomplete directory entries in OCI tar layers, and
90+
//! directories without explicit tar entries inherit incorrect mtimes. The
91+
//! workaround is to point all such mounts into `/run` and redirect from their
92+
//! final location, e.g. via symlinks into `/run`.
93+
//!
94+
//! ## Container build cache mounts
95+
//!
96+
//! A practical implication of emptying `/run` is that container authors can
97+
//! use it for cache mounts without worrying about polluting the final image.
98+
//!
99+
//! Instead of:
100+
//! ```dockerfile
101+
//! RUN --mount=type=cache,target=/var/cache/dnf dnf install -y ...
102+
//! ```
103+
//!
104+
//! Consider:
105+
//! ```dockerfile
106+
//! RUN rm -rf /var/cache/dnf && ln -sr /run/dnfcache /var/cache/dnf
107+
//! RUN --mount=type=cache,target=/run/dnfcache dnf install -y ...
108+
//! ```
109+
//!
110+
//! This avoids potential mtime inconsistencies in `/var/cache` while still
111+
//! benefiting from build caching.
112+
//!
113+
//! See: <https://github.com/containers/composefs-rs/issues/132>
114+
//!
115+
//! # Emptied directories for boot
116+
//!
117+
//! When preparing a filesystem for boot via `transform_for_boot()`, certain
118+
//! additional directories are emptied because their contents should not be
119+
//! part of the final verified image:
120+
//!
121+
//! - `/boot`: Contains the UKI which embeds the composefs digest, so including
122+
//! it would create a circular dependency
123+
//! - `/sysroot`: Only has content in ostree-container cases, and traversing
124+
//! it for SELinux labeling causes problems
125+
//!
126+
//! These directories are emptied and their mtime is set to match `/usr` for
127+
//! consistency with how the root directory metadata is handled.
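
The xattr allowlist described in the "Extended attributes (xattrs)" section of the design document above can be illustrated with a short sketch. It assumes xattrs are held as a name-to-value map; `XATTR_ALLOWLIST` and `filter_xattrs` are hypothetical names for this example and are not the actual `read_container_root()` implementation.

```rust
use std::collections::BTreeMap;

/// Xattr names kept when reading from a mounted root. Per the text above this
/// is currently only `security.capability`; the constant name is hypothetical.
const XATTR_ALLOWLIST: &[&str] = &["security.capability"];

/// Drop every xattr whose name is not on the allowlist, so that host-side
/// attributes (SELinux labels, overlayfs internals) do not leak into the image.
pub fn filter_xattrs(xattrs: BTreeMap<String, Vec<u8>>) -> BTreeMap<String, Vec<u8>> {
    xattrs
        .into_iter()
        .filter(|(name, _)| XATTR_ALLOWLIST.iter().any(|allowed| *allowed == name.as_str()))
        .collect()
}
```

An allowlist rather than a denylist keeps the result reproducible even when the build host attaches xattrs nobody anticipated. SELinux labels are intentionally absent from the list, since the document says they are handled separately by `transform_for_boot()`.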

crates/composefs-oci/src/lib.rs

Lines changed: 3 additions & 0 deletions
@@ -35,6 +35,9 @@ pub mod tar;
3535
#[doc(hidden)]
3636
pub mod test_util;
3737

38+
#[cfg(doc)]
39+
pub mod design;
40+
3841
// Re-export the composefs crate for consumers who only need composefs-oci
3942
pub use composefs;
4043
