Commit 4f74cdd

Add OCI image support: pull, unpack, run, prune, status, policy
Implement the full elfuse OCI image lifecycle as a self-contained `elfuse oci` subcommand. Image distribution never touches Hypervisor.framework, so the subcommand dispatches in main() before any guest setup; only `oci run` enters the VM bring-up path.

- pull / inspect: content-addressable blob store over HTTPS with bearer-token + Basic auth, OCI index walk to the linux/arm64 leaf, parallel blob fetch with HTTP Range resume, offline inspect renderer.
- unpack: tar reader (ustar + PAX x/g records), gzip + system libzstd (decode path), whiteout-aware layer apply, per-image case-sensitive APFS sysroot; cross-volume unpack via copyfile(2) with clone fallback.
- run: clonefile(2) per-run rootfs; Entrypoint / Cmd / Env / WorkingDir and symbolic/numeric User honoured; reuses the shared elfuse_launch bring-up so a dynamic guest runs through the same shim + syscall path.
- lifecycle: prune (--older-than / --keep-bytes), per-layer + ChainID stack snapshot caches, oci status (text + --json), rebuild-cache.
- policy: podman/skopeo-style policy.json + registries.d overlay; loopback-gated --insecure; CLI flags override.

Extract the VM bring-up from main() into core/launch.c (elfuse_launch) so oci run and the positional-ELF main share one path; the host-path resolution now lives in the caller per the guest_bootstrap_prepare split.

zstd and cJSON are consumed as system shared libraries (pkg-config libzstd / libcjson), mirroring the existing system zlib and libcurl; nothing is vendored under externals/.

Adds 25 native test-oci-* unit suites plus an opt-in heavy compat mode.
1 parent bde0b37 commit 4f74cdd

106 files changed

Lines changed: 50697 additions & 23 deletions

.github/workflows/main.yml

Lines changed: 3 additions & 3 deletions
@@ -122,7 +122,7 @@ jobs:
 GNU_OBJCOPY: /opt/homebrew/opt/binutils/bin/objcopy
 HOMEBREW_NO_INSTALL_CLEANUP: 1
 HOMEBREW_NO_AUTO_UPDATE: 1
-BREW_PKGS: binutils
+BREW_PKGS: binutils zstd cjson
 steps:
 - name: Checkout
 uses: actions/checkout@v6
@@ -181,7 +181,7 @@ jobs:
 HOMEBREW_NO_AUTO_UPDATE: 1
 # binutils is needed because make lint depends on the shim_blob.h
 # generated by the assembly + objcopy pipeline.
-BREW_PKGS: binutils llvm
+BREW_PKGS: binutils llvm zstd cjson
 CLANG_TIDY: /opt/homebrew/opt/llvm/bin/clang-tidy
 steps:
 - name: Checkout
@@ -220,7 +220,7 @@ jobs:
 GNU_OBJCOPY: /opt/homebrew/opt/binutils/bin/objcopy
 HOMEBREW_NO_INSTALL_CLEANUP: 1
 HOMEBREW_NO_AUTO_UPDATE: 1
-BREW_PKGS: binutils llvm
+BREW_PKGS: binutils llvm zstd cjson
 LLVM_BIN: /opt/homebrew/opt/llvm/bin
 steps:
 - name: Checkout

.gitignore

Lines changed: 5 additions & 1 deletion
@@ -1,6 +1,10 @@
 build/
 archive/
-externals/
+# externals/ holds downloaded fixtures (kernel, rootfs, packages) that are
+# fetched on demand; tracking them in git would balloon the repo. Nothing
+# under externals/ is vendored now -- cJSON and zstd are both consumed as
+# system libraries via pkg-config.
+externals/*
 lib/modules/
 *.o
 *.bin

Makefile

Lines changed: 263 additions & 2 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 5 additions & 0 deletions
@@ -26,6 +26,11 @@ guest debugging through a built-in GDB RSP stub.
 - macOS 13 or newer
 - Xcode Command Line Tools, `clang`, `codesign`, and GNU `make`
 - GNU `objcopy` from Homebrew `binutils`, or `llvm-objcopy`
+- `zstd` and `cJSON` libraries with headers for OCI image support, resolved
+via `pkg-config`: `brew install zstd cjson` (macOS) or `apt-get install
+libzstd-dev libcjson-dev` (Linux). The `oci` subcommand decodes
+zstd-compressed layers and parses JSON manifests; the rest of the build
+links the system `libcurl` and `zlib` that ship with macOS.
 - Hypervisor entitlement: `com.apple.security.hypervisor`
 
 For guest test binaries, the project also expects an AArch64 Linux cross

docs/usage.md

Lines changed: 173 additions & 0 deletions
@@ -99,6 +99,179 @@ and memory access, and per-thread inspection. Implementation details, including
 the snapshot protocol used to keep Hypervisor.framework register access on the
 owning thread, are documented in [internals.md](internals.md).
 
+## Running OCI Images (`elfuse oci run`)
+
+Phase 3 adds a direct-execution path for pulled OCI images:
+
+```sh
+elfuse oci run [OPTIONS] IMAGE [ARG...]
+```
+
+The subcommand reads the image's runtime block (Entrypoint, Cmd, Env,
+WorkingDir, User) and folds in any CLI overrides, then unpacks the image
+into the local APFS sysroot volume, clones a per-run rootfs via APFS
+`clonefile(2)`, resolves argv[0] against PATH inside the rootfs, and
+hands off to the same VM bring-up the legacy positional-ELF `elfuse`
+entry uses.
+
+The image must already be pulled. `oci run` does not auto-pull on miss.
+The usual workflow is:
+
+```sh
+elfuse oci pull alpine:3
+elfuse oci run alpine:3 /bin/sh -c 'echo hello from inside'
+```
+
+### Options
+
+| Option | Meaning |
+|--------|---------|
+| `--store DIR` | Override the local store root |
+| `--volume DIR` | Override the APFS sysroot volume mount point |
+| `--entrypoint PROG` | Replace the image Entrypoint with `PROG` |
+| `-e KEY=VAL`, `--env KEY=VAL` | Set or replace one env var (repeatable) |
+| `-e KEY`, `--env KEY` | Import `KEY` from the host environ (repeatable) |
+| `-w DIR`, `--workdir DIR` | Override image WorkingDir |
+| `-u USER[:GROUP]`, `--user USER[:GROUP]` | Override image User; numeric `UID[:GID]` or symbolic `name[:group]` resolved from the rootfs `/etc/passwd` and `/etc/group` (see [User and WorkingDir](#user-and-workingdir)) |
+| `--keep` | Keep the per-run cloned rootfs after exit |
+| `--name NAME` | Reserved: deterministic clone-dir suffix (ignored today) |
+
+### Argv override matrix
+
+| Image Entrypoint | Image Cmd | CLI ARGV | `--entrypoint` | Result argv |
+|--|--|--|--|--|
+| set | set | none | none | Entrypoint ++ Cmd |
+| set | set | provided | none | Entrypoint ++ CLI ARGV (Cmd dropped) |
+| set | none | provided | none | Entrypoint ++ CLI ARGV |
+| none | set | none | none | Cmd |
+| none | set | provided | none | CLI ARGV (Cmd dropped) |
+| set | set | optional | provided | [`--entrypoint`] ++ CLI ARGV |
+| none | none | provided | none | CLI ARGV |
+| none | none | none | none | `EINVAL` "image has no entrypoint or cmd; pass one on the CLI" |
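
The override matrix can be read as two independent choices: which head to use (`--entrypoint`, else the image Entrypoint) and which tail to use (CLI ARGV, else the image Cmd when no `--entrypoint` is given). The C sketch below illustrates that reading only; `resolve_argv` and its signature are hypothetical and are not taken from the elfuse sources, and the `-E2BIG` capacity error is an assumption of the sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical helper (not elfuse's actual API). All vectors are
 * NULL-terminated and any of them may be NULL. On success the result is
 * written to out, NULL-terminated, and its length is returned.
 * Returns -EINVAL when nothing is left to run and -E2BIG when out is
 * too small (the latter is an assumption of this sketch). */
static int resolve_argv(const char *const *image_entrypoint,
                        const char *const *image_cmd,
                        const char *const *cli_argv,
                        const char *entrypoint_override,
                        const char **out, size_t out_cap)
{
    size_t n = 0;
    const char *const *tail = NULL;

    if (out_cap == 0)
        return -E2BIG;

    /* Head: --entrypoint replaces the image Entrypoint outright. */
    if (entrypoint_override) {
        if (n + 1 >= out_cap)
            return -E2BIG;
        out[n++] = entrypoint_override;
    } else if (image_entrypoint) {
        for (size_t i = 0; image_entrypoint[i]; i++) {
            if (n + 1 >= out_cap)
                return -E2BIG;
            out[n++] = image_entrypoint[i];
        }
    }

    /* Tail: CLI ARGV wins; image Cmd survives only when neither CLI ARGV
     * nor --entrypoint is given. */
    if (cli_argv && cli_argv[0])
        tail = cli_argv;
    else if (!entrypoint_override)
        tail = image_cmd;

    for (size_t i = 0; tail && tail[i]; i++) {
        if (n + 1 >= out_cap)
            return -E2BIG;
        out[n++] = tail[i];
    }

    out[n] = NULL;
    return n ? (int)n : -EINVAL;
}
```

Passing `NULL` for every input reproduces the last row of the matrix: the function returns `-EINVAL` because there is neither an entrypoint nor a command to run.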
+
+### Env merge policy
+
+The merged guest env is built in this order:
+
+1. Image `Env` (verbatim, in spec order)
+2. Each CLI `-e KEY=VAL` set-or-replaces by key
+3. Each CLI `-e KEY` (no `=`) imports the host's value when present, otherwise drops silently
+4. `TERM` auto-imported from the host iff the merged env has no `TERM`
+5. `PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin` injected iff the merged env has no `PATH`
+6. `container=elfuse` injected unconditionally so systemd-style sandbox detection works
+
+CLI `-e DYLD_*=...` overrides are hard-rejected with `EINVAL`: `DYLD_*` is a
+macOS-only loader contract with no meaning inside an aarch64-linux guest.
+Image-provided `DYLD_*` entries pass through (the guest ignores them).
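
The six merge steps and the `DYLD_*` rule can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the code from this commit: `struct env_set`, `env_find`, `env_put`, `env_merge`, and the fixed table sizes are all hypothetical, image entries without `=` are skipped, and a repeated key keeps its first position when replaced.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ENV_MAX 128
#define ENV_ENTRY_MAX 512
#define DEFAULT_PATH "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

/* Hypothetical container for the merged guest env ("KEY=VAL" strings). */
struct env_set {
    char entries[ENV_MAX][ENV_ENTRY_MAX];
    size_t count;
};

/* Index of the entry whose key is key[0..klen), or -1 when absent. */
static int env_find(const struct env_set *e, const char *key, size_t klen)
{
    for (size_t i = 0; i < e->count; i++)
        if (strncmp(e->entries[i], key, klen) == 0 && e->entries[i][klen] == '=')
            return (int)i;
    return -1;
}

/* Set-or-replace by key; an existing key keeps its position. */
static int env_put(struct env_set *e, const char *key, size_t klen, const char *val)
{
    char buf[ENV_ENTRY_MAX];
    int w = snprintf(buf, sizeof buf, "%.*s=%s", (int)klen, key, val);
    int idx;

    if (w < 0 || (size_t)w >= sizeof buf)
        return -E2BIG;
    idx = env_find(e, key, klen);
    if (idx < 0) {
        if (e->count == ENV_MAX)
            return -ENOMEM;
        idx = (int)e->count++;
    }
    memcpy(e->entries[idx], buf, (size_t)w + 1);
    return 0;
}

static int env_merge(struct env_set *e, const char *const *image_env,
                     const char *const *cli_env)
{
    const char *eq, *val;
    int rc;

    e->count = 0;
    /* 1. Image Env in order; DYLD_* entries pass through untouched.
     *    Entries without '=' are skipped (an assumption of this sketch). */
    for (size_t i = 0; image_env && image_env[i]; i++) {
        if (!(eq = strchr(image_env[i], '=')))
            continue;
        if ((rc = env_put(e, image_env[i], (size_t)(eq - image_env[i]), eq + 1)) < 0)
            return rc;
    }
    /* 2. CLI KEY=VAL sets or replaces by key; DYLD_* is rejected. */
    for (size_t i = 0; cli_env && cli_env[i]; i++) {
        if (!(eq = strchr(cli_env[i], '=')))
            continue;
        if (strncmp(cli_env[i], "DYLD_", 5) == 0)
            return -EINVAL;
        if ((rc = env_put(e, cli_env[i], (size_t)(eq - cli_env[i]), eq + 1)) < 0)
            return rc;
    }
    /* 3. CLI KEY imports the host value, or is dropped silently. */
    for (size_t i = 0; cli_env && cli_env[i]; i++) {
        if (strchr(cli_env[i], '='))
            continue;
        if ((val = getenv(cli_env[i])) &&
            (rc = env_put(e, cli_env[i], strlen(cli_env[i]), val)) < 0)
            return rc;
    }
    /* 4. TERM from the host, 5. default PATH, both only when missing. */
    if (env_find(e, "TERM", 4) < 0 && (val = getenv("TERM")) &&
        (rc = env_put(e, "TERM", 4, val)) < 0)
        return rc;
    if (env_find(e, "PATH", 4) < 0 && (rc = env_put(e, "PATH", 4, DEFAULT_PATH)) < 0)
        return rc;
    /* 6. container=elfuse is injected unconditionally. */
    return env_put(e, "container", 9, "elfuse");
}
```

Running the two CLI passes separately keeps every `KEY=VAL` override ahead of every bare `KEY` import, which matches the order of steps 2 and 3 regardless of how the flags were interleaved on the command line.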
+
+### User and WorkingDir
+
+`User` accepts seven shapes: the empty string (no override), a numeric
+`UID`, `UID:GID`, a symbolic `name`, `name:group`, `uid:group`, or
+`name:gid`. Symbolic forms read `/etc/passwd` and `/etc/group` from
+the cloned rootfs. A token made entirely of ASCII digits is always
+parsed numerically, even when a same-named account ships in the image
+(this matches runc semantics, so an image that happens to carry a
+`1234` account does not capture `--user 1234`). When the symbolic
+form names an account the unpacked layers do not actually carry,
+lookup fails closed; `elfuse` never silently falls back to root.
+`--user UID` alone defaults GID to the same value.
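
The seven `User` shapes reduce to one rule per token: all ASCII digits means numeric, anything else means symbolic. The following C sketch shows only that classification step; the types and function names are hypothetical, the 64-byte name limit is an assumption, and the `/etc/passwd` and `/etc/group` lookups for symbolic tokens are deliberately left out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum tok_kind { TOK_ABSENT = 0, TOK_NUMERIC, TOK_SYMBOLIC };

/* Hypothetical parsed token: id is valid for TOK_NUMERIC, name for
 * TOK_SYMBOLIC (to be resolved later against the cloned rootfs). */
struct user_tok {
    enum tok_kind kind;
    uint32_t id;
    char name[64];
};

struct user_spec {
    struct user_tok user, group;
};

static int parse_tok(const char *s, size_t len, struct user_tok *t)
{
    int all_digits = 1;

    memset(t, 0, sizeof *t);
    if (len == 0)
        return -EINVAL;
    if (len >= sizeof t->name)
        return -ENAMETOOLONG;
    for (size_t i = 0; i < len; i++)
        if (s[i] < '0' || s[i] > '9') {
            all_digits = 0;
            break;
        }
    memcpy(t->name, s, len);
    t->name[len] = '\0';
    if (!all_digits) {
        t->kind = TOK_SYMBOLIC;
        return 0;
    }
    /* All-digit tokens are always numeric, never an account name. */
    errno = 0;
    unsigned long long v = strtoull(t->name, NULL, 10);
    if (errno || v > UINT32_MAX)
        return -ERANGE;
    t->kind = TOK_NUMERIC;
    t->id = (uint32_t)v;
    return 0;
}

static int parse_user(const char *spec, struct user_spec *out)
{
    const char *colon;
    size_t ulen;
    int rc;

    memset(out, 0, sizeof *out);
    if (!spec || !*spec)
        return 0; /* empty string: no override */
    colon = strchr(spec, ':');
    ulen = colon ? (size_t)(colon - spec) : strlen(spec);
    if ((rc = parse_tok(spec, ulen, &out->user)) < 0)
        return rc;
    if (colon)
        return parse_tok(colon + 1, strlen(colon + 1), &out->group);
    /* A bare numeric UID defaults the GID to the same value. A bare
     * symbolic name leaves the group absent in this sketch. */
    if (out->user.kind == TOK_NUMERIC)
        out->group = out->user;
    return 0;
}
```

Keeping classification separate from resolution makes the fail-closed rule easy to enforce: a `TOK_SYMBOLIC` token that cannot be found in the rootfs is an error, and there is no code path that substitutes uid 0.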
+
+`WorkingDir` must be absolute and free of `..` segments. If neither the
+image nor the CLI sets it, the guest starts in `/`. The directory is
+materialized under the cloned rootfs (`mkdir -p`, mode 0755, best-
+effort chown to the resolved uid:gid when `--user` or image User
+selects credentials).
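
The `WorkingDir` rule (absolute, no `..` segment, default `/`) can be checked without touching the filesystem. The C sketch below is illustrative only; `workdir_is_valid` and `effective_workdir` are hypothetical names, and treating an empty string as "not set" is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* True when path is absolute and contains no ".." path segment.
 * Names that merely contain dots (for example "/a/..b") are accepted. */
static bool workdir_is_valid(const char *path)
{
    const char *p = path;

    if (!path || path[0] != '/')
        return false;
    while (*p) {
        while (*p == '/')
            p++; /* skip separator runs */
        const char *seg = p;
        while (*p && *p != '/')
            p++;
        if (p - seg == 2 && seg[0] == '.' && seg[1] == '.')
            return false;
    }
    return true;
}

/* CLI overrides the image; "/" is used when neither is set. Returns
 * NULL when the selected directory fails validation. */
static const char *effective_workdir(const char *cli, const char *image)
{
    const char *wd = (cli && *cli) ? cli : (image && *image) ? image : "/";

    return workdir_is_valid(wd) ? wd : NULL;
}
```

Rejecting `..` lexically, before the path is joined to the clone directory, means the later `mkdir -p` can never be steered outside the per-run rootfs by the image config or the CLI.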
+
+### Scope guardrails
+
+- Auto-pull on `run` miss -> never; `elfuse oci pull` must run first
+- Network policy, `docker run -p`-style port mapping -> later phases
+- Live `docker exec`-style attach -> never
+
+### Runtime host-truth surface
+
+`elfuse oci run` runs the guest against a freshly cloned per-run
+rootfs and a small set of synthesized host-truth files. The rootfs
+is produced by APFS `clonefile(2)` against the unpacked image
+layers, so the first guest write to any path triggers copy-on-write
+in APFS without touching the original image. The clone is removed at
+guest exit unless `--keep` is set; nothing is ever pushed back to
+the on-disk image, and concurrent `oci run` invocations against the
+same image are isolated.
+
+Three `/etc` files are overwritten in the clone before the guest
+starts. Any pre-existing symlink (the common case is
+`/etc/resolv.conf -> /run/systemd/resolve/stub-resolv.conf`) is
+unlinked first so it does not dangle inside the guest:
+
+| File | Source |
+|--|--|
+| `/etc/resolv.conf` | `nameserver` lines harvested from `scutil --dns`; falls back to `8.8.8.8` and `1.1.1.1` on any scutil failure |
+| `/etc/hosts` | fixed 5-line block: `localhost`, the ip6-loopback aliases, ip6 link-local multicast, and `127.0.0.1 host.elfuse.internal` |
+| `/etc/hostname` | literal string `elfuse` |
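
As an illustration of the `/etc/resolv.conf` row, the C sketch below turns captured `scutil --dns` text into resolv.conf content. It assumes resolver entries are printed as lines of the form `nameserver[0] : 192.168.1.1`; the function name, the duplicate suppression, and the buffer sizes are assumptions of this sketch rather than details taken from the commit.

```c
#include <stdio.h>
#include <string.h>

/* Builds resolv.conf text in out from scutil --dns output and returns
 * the number of nameserver lines written. Falls back to 8.8.8.8 and
 * 1.1.1.1 when no nameserver line is found. */
static int build_resolv_conf(const char *scutil_out, char *out, size_t cap)
{
    static const char fallback[] = "nameserver 8.8.8.8\nnameserver 1.1.1.1\n";
    size_t used = 0;
    int count = 0;

    if (cap == 0)
        return 0;
    out[0] = '\0';

    for (const char *line = scutil_out; line && *line;) {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);
        char buf[160], addr[64], entry[80];

        if (len < sizeof buf) {
            memcpy(buf, line, len);
            buf[len] = '\0';
            /* %*d consumes the resolver index without storing it. */
            if (sscanf(buf, " nameserver[%*d] : %63s", addr) == 1) {
                int w = snprintf(entry, sizeof entry, "nameserver %s\n", addr);
                /* The same server is often listed under several
                 * resolvers; emit it once (sketch assumption). */
                if (w > 0 && (size_t)w < sizeof entry && !strstr(out, entry)) {
                    if (used + (size_t)w >= cap)
                        break; /* out of room: keep what fits */
                    memcpy(out + used, entry, (size_t)w + 1);
                    used += (size_t)w;
                    count++;
                }
            }
        }
        line = eol ? eol + 1 : NULL;
    }

    if (count == 0) {
        if (sizeof fallback > cap)
            return 0;
        memcpy(out, fallback, sizeof fallback);
        count = 2;
    }
    return count;
}
```

Because the fallback is applied whenever zero nameservers were collected, an empty capture (for example after a failed `scutil` invocation) and an output with no resolver entries are handled by the same branch.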
+
+The following pseudo-filesystem paths are synthesized by the host-side
+openat interceptor and do not need to exist inside the rootfs:
+
+| Path | Behavior |
+|--|--|
+| `/dev/null`, `/dev/zero`, `/dev/random`, `/dev/urandom`, `/dev/tty` | redirected to the host device of the same name |
+| `/dev/full` | reads zero-fill, writes of any non-zero length return `ENOSPC` |
+| `/dev/console` | mirrored from the controlling tty when present (macOS reserves the real `/dev/console` for the kernel) |
+| other `/dev/*` | `ENOENT` |
+| `/proc/cpuinfo`, `/proc/meminfo`, `/proc/version` | derived from host sysctl |
+| `/proc/self/{maps,exe,status,stat,comm,statm,cgroup}` | synthesized; `cgroup` reports the canonical `0::/` (elfuse runs outside any cgroup hierarchy) |
+| `/proc/sys/kernel/{ostype,osrelease,hostname}` | tracks the cached `uname` fields (`Linux`, `6.17.0-20-generic`, `elfuse`) |
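
The `/dev` rows of the table amount to a small routing decision plus one emulated device. The C sketch below shows one way to express that; the enum, the function names, and the negative-errno return convention are assumptions for illustration and do not describe the interceptor's real interface.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical routing outcomes for a guest path. */
enum dev_route {
    DEV_NOT_DEV, /* not under /dev/: handled elsewhere */
    DEV_HOST,    /* open the host device of the same name */
    DEV_FULL,    /* emulated /dev/full */
    DEV_CONSOLE, /* mirror the controlling tty when present */
    DEV_ENOENT   /* any other /dev path */
};

static enum dev_route classify_dev_path(const char *path)
{
    static const char *const host_backed[] = {
        "/dev/null", "/dev/zero", "/dev/random", "/dev/urandom", "/dev/tty",
    };

    if (strncmp(path, "/dev/", 5) != 0)
        return DEV_NOT_DEV;
    for (size_t i = 0; i < sizeof host_backed / sizeof host_backed[0]; i++)
        if (strcmp(path, host_backed[i]) == 0)
            return DEV_HOST;
    if (strcmp(path, "/dev/full") == 0)
        return DEV_FULL;
    if (strcmp(path, "/dev/console") == 0)
        return DEV_CONSOLE;
    return DEV_ENOENT;
}

/* /dev/full read: fill the buffer with zero bytes. */
static ssize_t dev_full_read(void *buf, size_t len)
{
    memset(buf, 0, len);
    return (ssize_t)len;
}

/* /dev/full write: any non-zero length fails with ENOSPC. Returning a
 * negative errno mirrors the Linux syscall ABI (sketch assumption). */
static ssize_t dev_full_write(const void *buf, size_t len)
{
    (void)buf;
    return len == 0 ? 0 : -ENOSPC;
}
```

Emulating `/dev/full` on the host side is necessary because macOS has no such device node, whereas the five host-backed names exist on both systems and can simply be redirected.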
+
+### Libc-adjacent compatibility
+
+`elfuse` does not patch libc-adjacent payload (NSS modules, time-zone
+data, locale data, character-set converters, dynamic-linker cache)
+inside the guest. Each item below names the contract `elfuse` honors
+and the failure mode an image hits when it does not ship the
+matching files.
+
+- **`/etc/nsswitch.conf`** is read by the guest's libc, not by
+`elfuse`. Only the `files` and `dns` backends actually function:
+`files` resolves through `/etc/{passwd,group,hosts}` in the cloned
+rootfs, and `dns` resolves through host `getaddrinfo` via the
+synthesized `/etc/resolv.conf`. Backends such as `systemd`, `sss`,
+or `ldap` need their NSS shared object plus a matching daemon,
+neither of which `elfuse` provides.
+- **NSS shared objects** (`libnss_systemd.so`, `libnss_sss.so`,
+`libnss_ldap.so`, ...) are `dlopen`'d by guest libc against its own
+loader. `elfuse` never injects NSS modules: they are aarch64-linux
+ELF objects against guest libc, so the macOS host has no way to
+load them, and the guest can only `dlopen` the modules its image
+already carries.
+- **tzdata** (`/usr/share/zoneinfo`, `/etc/localtime`, `/etc/timezone`)
+ships with the image. `elfuse` does not transcode macOS
+`/var/db/timezone/zoneinfo` into the tzdata format; if the image is
+missing the needed zone, glibc / musl fall back to UTC. The `TZ`
+environment variable is honored as-is and is not rewritten by the
+Env merge policy.
+- **`/usr/lib/locale/locale-archive`** is not regenerated. glibc
+images without a built archive (or the matching `<lang>.UTF-8/`
+directory) fall back to the `C` locale; locale-aware sort / printf
+/ strcoll outputs ASCII order. musl images do not use the archive
+and are unaffected.
+- **`/usr/lib/<triple>/gconv/`** modules and the `gconv-modules`
+index ship with the image. Missing modules surface as `EILSEQ` from
+`iconv` / glibc's character-set conversion; this most often shows
+up when an image ships a stripped glibc layer.
+- **`ld.so.cache`** is not rebuilt. The guest dynamic linker reads
+whatever cache the image carries; missing entries fall through to
+the linker's library-path search, which is the normal slow path.
+
+Common workloads and the symptom-to-workaround mapping:
+
+| Symptom | Trigger | Workaround |
+|--|--|--|
+| `getaddrinfo` returns `EAI_AGAIN` or an empty result | `/etc/nsswitch.conf` lists a backend (`systemd`, `sss`, ...) that needs a daemon | use a distro whose `nsswitch.conf` is `files dns` (alpine ships this by default; debian needs the file edited) |
+| `date`, `strftime` show UTC instead of the expected zone | the image does not contain `/usr/share/zoneinfo/<Zone>` | install tzdata in the image (`apk add tzdata` / `apt install tzdata`), or pass `-e TZ=UTC` to acknowledge UTC |
+| `sort`, `printf`, `strcoll` collate in ASCII order | the image is missing `/usr/lib/locale/locale-archive` or the matching `<lang>.UTF-8/` directory | accept the C-locale fallback, run `locale-gen` during the image build, or use a musl-based image (alpine), which does not depend on the archive |
+
 ## Guest Compatibility Model
 
 `elfuse` is designed for Linux user-space workloads, not for booting a Linux

mk/analysis.mk

Lines changed: 6 additions & 2 deletions
@@ -14,10 +14,14 @@ SHELL_SCRIPTS := $(shell git ls-files --cached --others --exclude-standard \
 PYTHON_FORMAT_FILES := $(shell git ls-files --cached --others --exclude-standard \
 -- '*.py')
 
-## Run clang-tidy on all source files
+## Run clang-tidy on all source files. ZSTD_CFLAGS comes from the parent
+## Makefile (pkg-config libzstd) so src/oci/decompress.c, which is the only
+## translation unit that #includes <zstd.h>, can resolve the header during
+## analysis.
 lint: $(BUILD_DIR)/shim_blob.h $(BUILD_DIR)/version.h
 @echo " TIDY src/"
-$(Q)$(CLANG_TIDY) $(SRCS) -- $(CFLAGS) -Isrc -I$(BUILD_DIR)
+$(Q)$(CLANG_TIDY) $(SRCS) -- $(CFLAGS) -Isrc -I$(BUILD_DIR) \
+$(ZSTD_CFLAGS)
 
 ## Run clang static analyzer (scan-build)
 analyze:

mk/config.mk

Lines changed: 11 additions & 1 deletion
@@ -16,7 +16,17 @@ endif
 
 # Exclude native macOS test files from cross-compilation
 NATIVE_TESTS := tests/test-multi-vcpu.c tests/test-rwx.c \
-tests/test-tlbi-encoder-host.c
+tests/test-tlbi-encoder-host.c \
+tests/test-oci-ref.c \
+tests/test-oci-digest.c tests/test-oci-blob-store.c \
+tests/test-oci-manifest.c tests/test-oci-fetch.c \
+tests/test-oci-store.c tests/test-oci-pull.c \
+tests/test-oci-inspect.c tests/test-oci-tar.c \
+tests/test-oci-decompress.c tests/test-oci-meta.c \
+tests/test-oci-layer-apply.c tests/test-oci-volume.c \
+tests/test-oci-clone.c tests/test-oci-unpack.c \
+tests/test-oci-runspec.c tests/test-oci-path-resolve.c \
+tests/test-oci-run.c
 SPECIAL_TEST_SRCS := tests/test-lowbase-mem.c
 SPECIAL_TEST_BINS := $(BUILD_DIR)/test-lowbase-mem-200000 $(BUILD_DIR)/test-lowbase-mem-300000
 