Commit d56c14c

Add OCI image support: pull, unpack, run, prune, status, policy
Implement the full elfuse OCI image lifecycle as a self-contained `elfuse oci` subcommand. Image distribution never touches Hypervisor.framework, so the subcommand dispatches in main() before any guest setup; only `oci run` enters the VM bring-up path.

- pull / inspect: content-addressable blob store over HTTPS with bearer-token + Basic auth, OCI index walk to the linux/arm64 leaf, parallel blob fetch with HTTP Range resume, offline inspect renderer.
- unpack: tar reader (ustar + PAX x/g records), gzip + system libzstd (decode path), whiteout-aware layer apply, per-image case-sensitive APFS sysroot; cross-volume unpack via copyfile(3) with clone fallback.
- run: clonefile(2) per-run rootfs; Entrypoint / Cmd / Env / WorkingDir and symbolic/numeric User honoured; reuses the shared elfuse_launch bring-up so a dynamic guest runs through the same shim + syscall path.
- lifecycle: prune (--older-than / --keep-bytes), per-layer + ChainID stack snapshot caches, oci status (text + --json), rebuild-cache.
- policy: podman/skopeo-style policy.json + registries.d overlay; loopback-gated --insecure; CLI flags override.

Extract the VM bring-up from main() into core/launch.c (elfuse_launch) so oci run and the positional-ELF main share one path; the host-path resolution now lives in the caller per the guest_bootstrap_prepare split.

zstd is consumed as a system shared library (pkg-config libzstd), mirroring the existing system zlib and libcurl; only cJSON stays vendored.

Adds 25 native test-oci-* unit suites plus an opt-in heavy compat mode.
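The "ChainID stack snapshot caches" mentioned in the lifecycle item presumably key each layer stack by its OCI ChainID. As a rough illustration only, assuming the cache follows the OCI image specification's definition (the commit itself does not show the code, and `chain_ids` is a hypothetical name), the ChainID of each stack prefix can be modelled in Python like this:

```python
import hashlib


def chain_ids(diff_ids):
    """Return the ChainID of every prefix of a layer stack.

    OCI image spec definition:
      ChainID(L0)      = DiffID(L0)
      ChainID(L0..Ln)  = sha256(ChainID(L0..Ln-1) + " " + DiffID(Ln))
    `diff_ids` are strings such as "sha256:<hex>", bottom layer first.
    """
    chain = []
    for diff_id in diff_ids:
        if not chain:
            # The bottom layer's ChainID is simply its DiffID.
            chain.append(diff_id)
        else:
            data = f"{chain[-1]} {diff_id}".encode("ascii")
            chain.append("sha256:" + hashlib.sha256(data).hexdigest())
    return chain
```

Because every ChainID folds in the one below it, two images that share their bottom layers share the leading ChainIDs, which is what makes a per-stack snapshot reusable across images.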
1 parent bde0b37 commit d56c14c

110 files changed

Lines changed: 54197 additions & 23 deletions

.github/workflows/main.yml

Lines changed: 3 additions & 3 deletions
@@ -122,7 +122,7 @@ jobs:
 GNU_OBJCOPY: /opt/homebrew/opt/binutils/bin/objcopy
 HOMEBREW_NO_INSTALL_CLEANUP: 1
 HOMEBREW_NO_AUTO_UPDATE: 1
-BREW_PKGS: binutils
+BREW_PKGS: binutils zstd
 steps:
 - name: Checkout
 uses: actions/checkout@v6
@@ -181,7 +181,7 @@ jobs:
 HOMEBREW_NO_AUTO_UPDATE: 1
 # binutils is needed because make lint depends on the shim_blob.h
 # generated by the assembly + objcopy pipeline.
-BREW_PKGS: binutils llvm
+BREW_PKGS: binutils llvm zstd
 CLANG_TIDY: /opt/homebrew/opt/llvm/bin/clang-tidy
 steps:
 - name: Checkout
@@ -220,7 +220,7 @@ jobs:
 GNU_OBJCOPY: /opt/homebrew/opt/binutils/bin/objcopy
 HOMEBREW_NO_INSTALL_CLEANUP: 1
 HOMEBREW_NO_AUTO_UPDATE: 1
-BREW_PKGS: binutils llvm
+BREW_PKGS: binutils llvm zstd
 LLVM_BIN: /opt/homebrew/opt/llvm/bin
 steps:
 - name: Checkout

.gitignore

Lines changed: 6 additions & 1 deletion
@@ -1,6 +1,11 @@
 build/
 archive/
-externals/
+# externals/ holds downloaded fixtures (kernel, rootfs, packages) that are
+# fetched on demand; tracking them in git would balloon the repo. The
+# vendored cJSON tree is the exception: it ships with the source so the OCI
+# parser builds out of the box. zstd is consumed as a system library.
+externals/*
+!externals/cjson/
 lib/modules/
 *.o
 *.bin
*.bin

Makefile

Lines changed: 265 additions & 2 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 4 additions & 0 deletions
@@ -26,6 +26,10 @@ guest debugging through a built-in GDB RSP stub.
- macOS 13 or newer
- Xcode Command Line Tools, `clang`, `codesign`, and GNU `make`
- GNU `objcopy` from Homebrew `binutils`, or `llvm-objcopy`
- `zstd` library and headers for OCI image support, resolved via
  `pkg-config`: `brew install zstd` (macOS) or `apt-get install libzstd-dev`
  (Linux). The `oci` subcommand decodes zstd-compressed layers; the rest of
  the build links the system `libcurl` and `zlib` that ship with macOS.
- Hypervisor entitlement: `com.apple.security.hypervisor`

For guest test binaries, the project also expects an AArch64 Linux cross

docs/usage.md

Lines changed: 173 additions & 0 deletions
@@ -99,6 +99,179 @@ and memory access, and per-thread inspection. Implementation details, including
the snapshot protocol used to keep Hypervisor.framework register access on the
owning thread, are documented in [internals.md](internals.md).

## Running OCI Images (`elfuse oci run`)

Phase 3 adds a direct-execution path for pulled OCI images:

```sh
elfuse oci run [OPTIONS] IMAGE [ARG...]
```

The subcommand reads the image's runtime block (Entrypoint, Cmd, Env,
WorkingDir, User) and folds in any CLI overrides, then unpacks the image
into the local APFS sysroot volume, clones a per-run rootfs via APFS
`clonefile(2)`, resolves argv[0] against PATH inside the rootfs, and
hands off to the same VM bring-up the legacy positional-ELF `elfuse`
entry uses.

The image must already be pulled. `oci run` does not auto-pull on miss.
The usual workflow is:

```sh
elfuse oci pull alpine:3
elfuse oci run alpine:3 /bin/sh -c 'echo hello from inside'
```

### Options

| Option | Meaning |
|--------|---------|
| `--store DIR` | Override the local store root |
| `--volume DIR` | Override the APFS sysroot volume mount point |
| `--entrypoint PROG` | Replace the image Entrypoint with `PROG` |
| `-e KEY=VAL`, `--env KEY=VAL` | Set or replace one env var (repeatable) |
| `-e KEY`, `--env KEY` | Import `KEY` from the host environ (repeatable) |
| `-w DIR`, `--workdir DIR` | Override image WorkingDir |
| `-u USER[:GROUP]`, `--user USER[:GROUP]` | Override image User; numeric `UID[:GID]` or symbolic `name[:group]` resolved from the rootfs `/etc/passwd` and `/etc/group` (see [User and WorkingDir](#user-and-workingdir)) |
| `--keep` | Keep the per-run cloned rootfs after exit |
| `--name NAME` | Reserved: deterministic clone-dir suffix (ignored today) |

### Argv override matrix

| Image Entrypoint | Image Cmd | CLI ARGV | `--entrypoint` | Result argv |
|--|--|--|--|--|
| set | set | none | none | Entrypoint ++ Cmd |
| set | set | provided | none | Entrypoint ++ CLI ARGV (Cmd dropped) |
| set | none | provided | none | Entrypoint ++ CLI ARGV |
| none | set | none | none | Cmd |
| none | set | provided | none | CLI ARGV (Cmd dropped) |
| set | set | optional | provided | [`--entrypoint`] ++ CLI ARGV |
| none | none | provided | none | CLI ARGV |
| none | none | none | none | `EINVAL` "image has no entrypoint or cmd; pass one on the CLI" |

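The matrix above collapses to a few lines of logic. The following is a minimal Python model of it, not elfuse's C implementation; the function name `resolve_argv` and its parameters are hypothetical. Combinations the matrix does not list (for example Entrypoint set, no Cmd, no CLI ARGV) fall out of the same rule here, which is an assumption.

```python
import errno


def resolve_argv(entrypoint, cmd, cli_argv, entrypoint_override=None):
    """Model of the argv override matrix.

    entrypoint, cmd, cli_argv: lists of strings (empty or None = not set).
    entrypoint_override: the value of --entrypoint, or None.
    """
    if entrypoint_override is not None:
        # --entrypoint replaces the image Entrypoint and drops the image Cmd.
        return [entrypoint_override] + list(cli_argv or [])
    head = list(entrypoint or [])
    # CLI ARGV, when provided, replaces the image Cmd.
    tail = list(cli_argv) if cli_argv else list(cmd or [])
    argv = head + tail
    if not argv:
        raise OSError(errno.EINVAL,
                      "image has no entrypoint or cmd; pass one on the CLI")
    return argv
```

For example, `resolve_argv(["/docker-entrypoint.sh"], ["nginx"], ["sh"])` yields `["/docker-entrypoint.sh", "sh"]`: the CLI arguments replace Cmd but not Entrypoint.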
### Env merge policy

The merged guest env is built in this order:

1. Image `Env` (verbatim, in spec order)
2. Each CLI `-e KEY=VAL` set-or-replaces by key
3. Each CLI `-e KEY` (no `=`) imports the host's value when present, otherwise drops silently
4. `TERM` auto-imported from the host iff the merged env has no `TERM`
5. `PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin` injected iff the merged env has no `PATH`
6. `container=elfuse` injected unconditionally so systemd-style sandbox detection works

CLI `-e DYLD_*=...` overrides are hard-rejected with `EINVAL`: `DYLD_*` is a
macOS-only loader contract with no meaning inside an aarch64-linux guest.
Image-provided `DYLD_*` entries pass through (the guest ignores them).

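The six steps and the `DYLD_*` rule can be sketched as a small Python model. This is illustrative only and makes several assumptions that the text does not settle: `merge_env` is a hypothetical name, duplicate keys in the image `Env` collapse to the last value, steps 2 and 3 run as two separate passes, and only the `KEY=VAL` form of a `DYLD_*` flag is rejected.

```python
import errno

DEFAULT_PATH = "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"


def merge_env(image_env, cli_env, host_env):
    """image_env: list of "KEY=VAL"; cli_env: list of "KEY=VAL" or "KEY";
    host_env: dict of the host environment. Returns a list of "KEY=VAL"."""
    merged = {}  # dicts keep insertion order, so image order is preserved
    # 1. Image Env, in spec order.
    for entry in image_env:
        key, _, val = entry.partition("=")
        merged[key] = val
    # 2. CLI KEY=VAL entries set or replace by key; DYLD_* is rejected.
    for item in cli_env:
        key, sep, val = item.partition("=")
        if not sep:
            continue
        if key.startswith("DYLD_"):
            raise OSError(errno.EINVAL, "DYLD_* overrides are rejected")
        merged[key] = val
    # 3. Bare CLI KEY entries import the host value, or drop silently.
    for item in cli_env:
        if "=" not in item and item in host_env:
            merged[item] = host_env[item]
    # 4. TERM is imported from the host only when nothing set it.
    if "TERM" not in merged and "TERM" in host_env:
        merged["TERM"] = host_env["TERM"]
    # 5. A default PATH is injected only when nothing set it.
    merged.setdefault("PATH", DEFAULT_PATH)
    # 6. container=elfuse is injected unconditionally.
    merged["container"] = "elfuse"
    return [f"{key}={val}" for key, val in merged.items()]
```

Note that step 6 overwrites any `container` value the image or the CLI supplied, which is what "unconditionally" implies.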
### User and WorkingDir

`User` accepts seven shapes: the empty string (no override), a numeric
`UID`, `UID:GID`, a symbolic `name`, `name:group`, `uid:group`, or
`name:gid`. Symbolic forms read `/etc/passwd` and `/etc/group` from
the cloned rootfs. A token made entirely of ASCII digits is always
parsed numerically, even when a same-named account ships in the image
(this matches runc semantics, so an image that happens to carry a
`1234` account does not capture `--user 1234`). When the symbolic
form names an account the unpacked layers do not actually carry,
lookup fails closed; `elfuse` never silently falls back to root.
`--user UID` alone defaults GID to the same value.

`WorkingDir` must be absolute and free of `..` segments. If neither the
image nor the CLI sets it, the guest starts in `/`. The directory is
materialized under the cloned rootfs (`mkdir -p`, mode 0755,
best-effort chown to the resolved uid:gid when `--user` or image User
selects credentials).

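The User shapes and the WorkingDir checks described above can be modelled as follows. This is a hedged Python sketch rather than the actual C code: `parse_user` and `resolve_workdir` are hypothetical names, the passwd and group files are passed in as text, and the GID used for a bare symbolic `name` (taken here from the account's passwd entry) is an assumption the text does not state.

```python
def _is_numeric(token):
    # Only ASCII digits count; such tokens are never looked up as names.
    return token.isascii() and token.isdigit()


def _lookup(table_text, name, field):
    """Return integer column `field` of the colon-separated row named `name`."""
    for line in table_text.splitlines():
        parts = line.split(":")
        if len(parts) > field and parts[0] == name:
            return int(parts[field])
    raise LookupError(f"no such entry: {name}")  # fail closed, never root


def parse_user(spec, passwd_text="", group_text=""):
    """Return (uid, gid), or None for the empty string (no override)."""
    if spec == "":
        return None
    user, sep, group = spec.partition(":")
    if not user or (sep and not group):
        raise ValueError(f"malformed user spec: {spec!r}")
    if _is_numeric(user):
        uid = int(user)
        default_gid = uid  # "--user UID alone defaults GID to the same value"
    else:
        uid = _lookup(passwd_text, user, 2)
        default_gid = _lookup(passwd_text, user, 3)  # assumption: primary GID
    if not sep:
        return uid, default_gid
    gid = int(group) if _is_numeric(group) else _lookup(group_text, group, 2)
    return uid, gid


def resolve_workdir(image_workdir, cli_workdir):
    """CLI overrides image; default is "/"; reject relative paths and ".."."""
    workdir = cli_workdir or image_workdir or "/"
    if not workdir.startswith("/"):
        raise ValueError("WorkingDir must be absolute")
    if ".." in workdir.split("/"):
        raise ValueError("WorkingDir must not contain '..' segments")
    return workdir
```

With a passwd file that carries an account literally named `1234`, `parse_user("1234", ...)` still returns `(1234, 1234)`, which is the runc-compatible behaviour the paragraph describes.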
### Scope guardrails

- Auto-pull on `run` miss -> never; `elfuse oci pull` must run first
- Network policy, `docker run -p`-style port mapping -> later phases
- Live `docker exec`-style attach -> never

### Runtime host-truth surface

`elfuse oci run` runs the guest against a freshly cloned per-run
rootfs and a small set of synthesized host-truth files. The rootfs
is produced by APFS `clonefile(2)` against the unpacked image
layers, so the first guest write to any path triggers copy-on-write
in APFS without touching the original image. The clone is removed at
guest exit unless `--keep` is set; nothing is ever pushed back to
the on-disk image, and concurrent `oci run` invocations against the
same image are isolated.

Three `/etc` files are overwritten in the clone before the guest
starts. Any pre-existing symlink (the common case is
`/etc/resolv.conf -> /run/systemd/resolve/stub-resolv.conf`) is
unlinked first so it does not dangle inside the guest:

| File | Source |
|--|--|
| `/etc/resolv.conf` | `nameserver` lines harvested from `scutil --dns`; falls back to `8.8.8.8` and `1.1.1.1` on any scutil failure |
| `/etc/hosts` | fixed 5-line block: `localhost`, the ip6-loopback aliases, ip6 link-local multicast, and `127.0.0.1 host.elfuse.internal` |
| `/etc/hostname` | literal string `elfuse` |

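To make the `/etc/resolv.conf` row concrete, here is a rough Python sketch of harvesting `nameserver` entries from `scutil --dns` output. It is not the project's implementation: the function names are hypothetical, the regular expression assumes the usual `nameserver[0] : 192.0.2.1` line shape, and both the de-duplication across resolver blocks and the fallback when no entries are found are assumptions.

```python
import re

FALLBACK_SERVERS = ["8.8.8.8", "1.1.1.1"]

# Matches lines such as "  nameserver[0] : 192.168.1.1" in `scutil --dns` output.
_NAMESERVER = re.compile(r"^\s*nameserver\[\d+\]\s*:\s*(\S+)\s*$")


def nameservers_from_scutil(text):
    """Collect nameserver addresses in first-seen order, without duplicates."""
    servers = []
    for line in text.splitlines():
        match = _NAMESERVER.match(line)
        if match and match.group(1) not in servers:
            servers.append(match.group(1))
    return servers


def render_resolv_conf(scutil_output):
    """Build resolv.conf text; pass None when running scutil failed."""
    servers = nameservers_from_scutil(scutil_output or "") or FALLBACK_SERVERS
    return "".join(f"nameserver {server}\n" for server in servers)
```

A caller would obtain `scutil_output` from something like `subprocess.run(["scutil", "--dns"], capture_output=True, text=True)` and pass `None` on a non-zero exit status, so that the fallback servers are written instead.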
The following pseudo-filesystem paths are synthesized by the host-side
openat interceptor and do not need to exist inside the rootfs:

| Path | Behavior |
|--|--|
| `/dev/null`, `/dev/zero`, `/dev/random`, `/dev/urandom`, `/dev/tty` | redirected to the host device of the same name |
| `/dev/full` | reads zero-fill, writes of any non-zero length return `ENOSPC` |
| `/dev/console` | mirrored from the controlling tty when present (macOS reserves the real `/dev/console` for the kernel) |
| other `/dev/*` | `ENOENT` |
| `/proc/cpuinfo`, `/proc/meminfo`, `/proc/version` | derived from host sysctl |
| `/proc/self/{maps,exe,status,stat,comm,statm,cgroup}` | synthesized; `cgroup` reports the canonical `0::/` (elfuse runs outside any cgroup hierarchy) |
| `/proc/sys/kernel/{ostype,osrelease,hostname}` | tracks the cached `uname` fields (`Linux`, `6.17.0-20-generic`, `elfuse`) |

### Libc-adjacent compatibility

`elfuse` does not patch libc-adjacent payload (NSS modules, time-zone
data, locale data, character-set converters, dynamic-linker cache)
inside the guest. Each item below names the contract `elfuse` honors
and the failure mode an image hits when it does not ship the
matching files.

- **`/etc/nsswitch.conf`** is read by the guest's libc, not by
  `elfuse`. Only the `files` and `dns` backends actually function:
  `files` resolves through `/etc/{passwd,group,hosts}` in the cloned
  rootfs, and `dns` resolves through host `getaddrinfo` via the
  synthesized `/etc/resolv.conf`. Backends such as `systemd`, `sss`,
  or `ldap` need their NSS shared object plus a matching daemon,
  neither of which `elfuse` provides.
- **NSS shared objects** (`libnss_systemd.so`, `libnss_sss.so`,
  `libnss_ldap.so`, ...) are `dlopen`'d by guest libc against its own
  loader. `elfuse` never injects NSS modules: they are aarch64-linux
  ELF objects against guest libc, so the macOS host has no way to
  load them, and the guest can only `dlopen` the modules its image
  already carries.
- **tzdata** (`/usr/share/zoneinfo`, `/etc/localtime`, `/etc/timezone`)
  ships with the image. `elfuse` does not transcode macOS
  `/var/db/timezone/zoneinfo` into the tzdata format; if the image is
  missing the needed zone, glibc / musl fall back to UTC. The `TZ`
  environment variable is honored as-is and is not rewritten by the
  Env merge policy.
- **`/usr/lib/locale/locale-archive`** is not regenerated. glibc
  images without a built archive (or the matching `<lang>.UTF-8/`
  directory) fall back to the `C` locale; locale-aware sort / printf
  / strcoll outputs ASCII order. musl images do not use the archive
  and are unaffected.
- **`/usr/lib/<triple>/gconv/`** modules and the `gconv-modules`
  index ship with the image. Missing modules surface as `EILSEQ` from
  `iconv` / glibc's character-set conversion; this most often shows
  up when an image ships a stripped glibc layer.
- **`ld.so.cache`** is not rebuilt. The guest dynamic linker reads
  whatever cache the image carries; missing entries fall through to
  the linker's library-path search, which is the normal slow path.

Common workloads and the symptom-to-workaround mapping:

| Symptom | Trigger | Workaround |
|--|--|--|
| `getaddrinfo` returns `EAI_AGAIN` or an empty result | `/etc/nsswitch.conf` lists a backend (`systemd`, `sss`, ...) that needs a daemon | use a distro whose `nsswitch.conf` is `files dns` (alpine ships this by default; debian needs the file edited) |
| `date`, `strftime` show UTC instead of the expected zone | the image does not contain `/usr/share/zoneinfo/<Zone>` | install tzdata in the image (`apk add tzdata` / `apt install tzdata`), or pass `-e TZ=UTC` to acknowledge UTC |
| `sort`, `printf`, `strcoll` collate in ASCII order | the image is missing `/usr/lib/locale/locale-archive` or the matching `<lang>.UTF-8/` directory | accept the C-locale fallback, run `locale-gen` during the image build, or use a musl-based image (alpine), which does not depend on the archive |

## Guest Compatibility Model

`elfuse` is designed for Linux user-space workloads, not for booting a Linux

externals/cjson/LICENSE

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
Copyright (c) 2009-2017 Dave Gamble and cJSON contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.


externals/cjson/VENDORING.md

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
# Vendored cJSON

This directory contains a vendored copy of [cJSON](https://github.com/DaveGamble/cJSON),
the ultralightweight JSON parser written in ANSI C. cJSON ships as a single
`.c` / `.h` pair and is licensed under the MIT license (see `LICENSE`).

## Why vendored

The OCI work stays hand-rolled C alongside the existing elfuse codebase: no
Go, no Rust, no `cargo` / `go` in the build matrix. cJSON is the smallest
credible JSON dependency that fits that contract; it is self-contained, has no
external dependencies, and compiles cleanly with `clang` and `gcc` on macOS
and Linux.

## Version

Pinned to upstream tag `v1.7.18` (2024-05-13). Fetched with:

```
curl -fsSL -o cJSON.h https://raw.githubusercontent.com/DaveGamble/cJSON/v1.7.18/cJSON.h
curl -fsSL -o cJSON.c https://raw.githubusercontent.com/DaveGamble/cJSON/v1.7.18/cJSON.c
curl -fsSL -o LICENSE https://raw.githubusercontent.com/DaveGamble/cJSON/v1.7.18/LICENSE
```

## Local modifications

None. The files are byte-identical to the upstream tag so future security
updates can be applied by re-running the curl commands above.

## Build integration

The Makefile compiles `cJSON.c` with project warning flags relaxed: cJSON is
third-party code and its style does not match elfuse's `-Wpedantic
-Wmissing-prototypes -Wshadow` posture. Only `src/oci/` translation units
include `externals/cjson/cJSON.h`; the rest of the codebase never sees it.
