| 1 | +//! # How to create a composefs from an OCI image |
| 2 | +//! |
| 3 | +//! This document is incomplete. It only serves to document some decisions we've |
| 4 | +//! taken about how to resolve ambiguous situations. |
| 5 | +//! |
| 6 | +//! # Data precision |
| 7 | +//! |
| 8 | +//! We currently create a composefs image using the granularity of data as it |
| 9 | +//! typically appears in OCI tarballs: |
| 10 | +//! - atime and ctime are not present (these are actually not physically present |
| 11 | +//! in the erofs inode structure at all, either the compact or extended forms) |
| 12 | +//! - mtime is set to the mtime in seconds; the sub-seconds value is simply |
| 13 | +//! truncated (i.e. we always round down). erofs has an nsec field, but |
| 14 | +//! sub-second precision is not normally present in OCI tarballs: the usual |
| 15 | +//! tar header only has timestamps in seconds and extended headers are not |
| 16 | +//! usually added for this purpose. |
| 17 | +//! - we take great care to faithfully represent hardlinks: even though the |
| 18 | +//! produced filesystem is read-only and we have data de-duplication via the |
| 19 | +//! objects store, we make sure that hardlinks result in an actual shared inode |
| 20 | +//! as visible via the `st_ino` and `st_nlink` fields on the mounted filesystem. |
| 21 | +//! |
| 22 | +//! We also apply these precision restrictions when creating images by scanning the |
| 23 | +//! filesystem. For example: even if we get more-accurate timestamp information, |
| 24 | +//! we'll truncate it to whole seconds. |
| 25 | +//! |
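The truncation rule above can be illustrated with a minimal sketch. The function name `mtime_seconds` is hypothetical and not part of composefs; the handling of pre-epoch timestamps is an assumption that simply extends "always round down" to negative values.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical helper: reduce a timestamp to whole seconds since the Unix
/// epoch, always rounding down (never to the nearest second).
fn mtime_seconds(t: SystemTime) -> i64 {
    match t.duration_since(UNIX_EPOCH) {
        // At or after the epoch: dropping the sub-second part rounds down.
        Ok(d) => d.as_secs() as i64,
        // Before the epoch (assumed behaviour): the error carries the positive
        // distance to the epoch, so a fractional part pushes the result one
        // second further into the past.
        Err(e) => {
            let d: Duration = e.duration();
            let secs = d.as_secs() as i64;
            if d.subsec_nanos() > 0 {
                -secs - 1
            } else {
                -secs
            }
        }
    }
}
```

With this sketch, a timestamp of 5.999999999 seconds after the epoch maps to 5, not 6.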
| 26 | +//! # Merging directories |
| 27 | +//! |
| 28 | +//! This is done according to the OCI spec, with an additional clarification: in |
| 29 | +//! case a directory entry is present in multiple layers, we use the tar metadata |
| 30 | +//! from the most-derived layer to determine the attributes (owner, permissions, |
| 31 | +//! mtime) for the directory. |
| 32 | +//! |
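As a rough illustration of the "most-derived layer wins" rule for directory attributes, here is a simplified sketch. `DirAttrs` and `merge_dir_attrs` are hypothetical names, layers are assumed to be ordered base first, and whiteouts and non-directory entries are deliberately ignored.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified directory attributes taken from a tar entry.
#[derive(Clone, Debug, PartialEq)]
struct DirAttrs {
    uid: u32,
    gid: u32,
    mode: u32,
    mtime: i64,
}

/// Merge per-layer directory attributes. `layers` is assumed to be ordered
/// from the base layer to the most-derived layer, so a later entry for the
/// same path replaces an earlier one.
fn merge_dir_attrs(layers: &[BTreeMap<String, DirAttrs>]) -> BTreeMap<String, DirAttrs> {
    let mut merged = BTreeMap::new();
    for layer in layers {
        for (path, attrs) in layer {
            // `insert` overwrites, which is exactly "most-derived layer wins".
            merged.insert(path.clone(), attrs.clone());
        }
    }
    merged
}
```

A directory that appears only in an earlier layer keeps the attributes from that layer.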
| 33 | +//! # The root inode |
| 34 | +//! |
| 35 | +//! The root inode (/) is a difficult case because OCI container layer tars often |
| 36 | +//! don't include a root directory entry, and when they do, container runtimes |
| 37 | +//! (Podman, Docker) ignore it and use hardcoded defaults. For example, Podman's |
| 38 | +//! [containers/storage](https://github.com/containers/storage) uses root:root |
| 39 | +//! ownership, mode `0555`, and epoch (0) mtime when extracting layers, but |
| 40 | +//! Docker uses `0755`. In general, the metadata for `/` is not defined. |
| 41 | +//! |
| 42 | +//! Because composefs aims to provide precise, cryptographically verifiable |
| 43 | +//! filesystem trees, we solve this for OCI by copying the metadata from `/usr` |
| 44 | +//! to the root directory. The rationale is that `/usr` is always present in |
| 45 | +//! standard filesystem layouts and must be defined explicitly in the OCI layers. |
| 46 | +//! |
| 47 | +//! This is implemented via the `copy_root_metadata_from_usr()` method and the |
| 48 | +//! `read_container_root()` convenience function. |
| 49 | +//! |
| 50 | +//! When building a filesystem from OCI layers programmatically, use |
| 51 | +//! `Stat::uninitialized()` to create the initial `FileSystem`. This placeholder |
| 52 | +//! has mode `0` (obviously invalid) to make it clear that the root metadata should |
| 53 | +//! be set before computing digests, typically by calling |
| 54 | +//! `copy_root_metadata_from_usr()` after processing all layers. |
| 55 | +//! |
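The idea of starting with an obviously invalid placeholder and filling it in from `/usr` can be modelled as follows. This is a sketch with hypothetical types (`Meta`, `Tree`, `fill_root_from_usr`), not the composefs API; in particular, returning an error when `/usr` is missing is an assumption made for this example.

```rust
/// Hypothetical, simplified inode metadata.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Meta {
    uid: u32,
    gid: u32,
    mode: u32,
    mtime: i64,
}

impl Meta {
    /// Placeholder with the invalid mode 0, so that forgetting to set the
    /// root metadata is easy to notice.
    fn placeholder() -> Self {
        Meta { uid: 0, gid: 0, mode: 0, mtime: 0 }
    }
}

/// Hypothetical tree model: only the root and `/usr` metadata are tracked.
struct Tree {
    root: Meta,
    usr: Option<Meta>,
}

/// Copy the metadata of `/usr` onto the root directory. Failing when `/usr`
/// is absent is an assumption of this sketch.
fn fill_root_from_usr(tree: &mut Tree) -> Result<(), &'static str> {
    let usr = tree.usr.ok_or("no /usr directory in the image")?;
    tree.root = usr;
    Ok(())
}
```

In this model the root keeps mode `0` until `fill_root_from_usr` succeeds, which mirrors the ordering described above: process all layers first, then set the root metadata, then compute digests.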
| 56 | +//! # Extended attributes (xattrs) |
| 57 | +//! |
| 58 | +//! When reading a container filesystem from a mounted root (as opposed to |
| 59 | +//! processing OCI layer tars directly), host-side xattrs can leak into the |
| 60 | +//! image. This is particularly problematic for `security.selinux` labels: |
| 61 | +//! if SELinux is enabled at build time, files will have labels like |
| 62 | +//! `container_t` that come from the build host, not from the target system's |
| 63 | +//! policy. |
| 64 | +//! |
| 65 | +//! To ensure reproducibility, `read_container_root()` filters xattrs to only |
| 66 | +//! include those in an allowlist. Currently this is just `security.capability`, |
| 67 | +//! which represents actual file capabilities that should be preserved. |
| 68 | +//! |
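An allowlist filter of this kind can be sketched in a few lines. The names `XATTR_ALLOWLIST` and `filter_xattrs` and the map-based representation are assumptions for illustration; only the single allowlisted name, `security.capability`, comes from the text above.

```rust
use std::collections::BTreeMap;

/// Hypothetical allowlist; per the text it currently holds one entry.
const XATTR_ALLOWLIST: &[&str] = &["security.capability"];

/// Drop every xattr whose name is not on the allowlist.
fn filter_xattrs(xattrs: &mut BTreeMap<String, Vec<u8>>) {
    xattrs.retain(|name, _value| XATTR_ALLOWLIST.contains(&name.as_str()));
}
```

Using an allowlist rather than a denylist means that xattrs nobody anticipated (for example overlayfs internals on the build host) are dropped by default.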
| 69 | +//! SELinux labels are handled separately by `transform_for_boot()`: |
| 70 | +//! - If the target filesystem contains a SELinux policy (in `/etc/selinux`), |
| 71 | +//! all files are relabeled according to that policy |
| 72 | +//! - If no SELinux policy is found, all `security.selinux` xattrs are stripped |
| 73 | +//! |
| 74 | +//! This ensures that: |
| 75 | +//! - Build-time SELinux labels don't leak into non-SELinux targets |
| 76 | +//! - SELinux-enabled targets get correct labels from their own policy |
| 77 | +//! - Other host xattrs (overlayfs internals, etc.) don't pollute the image |
| 78 | +//! |
| 79 | +//! See: <https://github.com/containers/storage/pull/1608#issuecomment-1600915185> |
| 80 | +//! |
| 81 | +//! # The /run directory |
| 82 | +//! |
| 83 | +//! When processing OCI images via `create_filesystem()`, the `/run` directory |
| 84 | +//! is emptied if present. `/run` is a tmpfs at runtime and should always be |
| 85 | +//! empty in images. Its mtime is set to match `/usr` for consistency with |
| 86 | +//! how root directory metadata is handled. |
| 87 | +//! |
| 88 | +//! This makes it possible to work around a podman/buildah `RUN --mount` issue: |
| 89 | +//! cache mounts can leave incomplete directory entries in OCI tar layers |
| 90 | +//! (directories without explicit tar entries inherit incorrect mtimes). The |
| 91 | +//! workaround is to point all such mounts into `/run` and redirect their |
| 92 | +//! final locations there, e.g. via symlinks into `/run`. |
| 93 | +//! |
| 94 | +//! ## Container build cache mounts |
| 95 | +//! |
| 96 | +//! A practical implication of emptying `/run` is that container authors can |
| 97 | +//! use it for cache mounts without worrying about polluting the final image. |
| 98 | +//! |
| 99 | +//! Instead of: |
| 100 | +//! ```dockerfile |
| 101 | +//! RUN --mount=type=cache,target=/var/cache/dnf dnf install -y ... |
| 102 | +//! ``` |
| 103 | +//! |
| 104 | +//! Consider: |
| 105 | +//! ```dockerfile |
| 106 | +//! RUN rm -rf /var/cache/dnf && ln -sr /run/dnfcache /var/cache/dnf |
| 107 | +//! RUN --mount=type=cache,target=/run/dnfcache dnf install -y ... |
| 108 | +//! ``` |
| 109 | +//! |
| 110 | +//! This avoids potential mtime inconsistencies in `/var/cache` while still |
| 111 | +//! benefiting from build caching. |
| 112 | +//! |
| 113 | +//! See: <https://github.com/containers/composefs-rs/issues/132> |
| 114 | +//! |
| 115 | +//! # Emptied directories for boot |
| 116 | +//! |
| 117 | +//! When preparing a filesystem for boot via `transform_for_boot()`, certain |
| 118 | +//! additional directories are emptied because their contents should not be |
| 119 | +//! part of the final verified image: |
| 120 | +//! |
| 121 | +//! - `/boot`: Contains the UKI which embeds the composefs digest, so including |
| 122 | +//! it would create a circular dependency |
| 123 | +//! - `/sysroot`: Only has content in ostree-container cases, and traversing |
| 124 | +//! it for SELinux labeling causes problems |
| 125 | +//! |
| 126 | +//! These directories are emptied and their mtime is set to match `/usr` for |
| 127 | +//! consistency with how the root directory metadata is handled. |
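The "empty the directory and take the mtime from `/usr`" step can be modelled with a small in-memory tree. This is a sketch only: `Dir` and `empty_toplevel_dirs` are hypothetical, regular files are not modelled, and leaving the mtime untouched when `/usr` is absent is an assumption.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory directory: an mtime plus child directories.
#[derive(Debug, Default)]
struct Dir {
    mtime: i64,
    entries: BTreeMap<String, Dir>,
}

/// Empty the named top-level directories (if present) and set their mtime to
/// that of `/usr`. Directories that do not exist are not created.
fn empty_toplevel_dirs(root: &mut Dir, names: &[&str]) {
    // Copy the mtime out first so the shared borrow of `root` ends here.
    let usr_mtime = root.entries.get("usr").map(|d| d.mtime);
    for name in names {
        if let Some(dir) = root.entries.get_mut(*name) {
            dir.entries.clear();
            if let Some(mtime) = usr_mtime {
                dir.mtime = mtime;
            }
        }
    }
}
```

Called with `&["boot", "sysroot"]`, this mirrors the boot case described above; called with `&["run"]`, it mirrors the `/run` handling.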