Commit 83bb030

Merge pull request #3 from pnc/kernel-panic
Fix kernel panic after unattended kernel upgrade
2 parents 3b7c6a5 + 5c35788 commit 83bb030

5 files changed

Lines changed: 248 additions & 21 deletions

CLAUDE.md

Lines changed: 21 additions & 2 deletions
@@ -4,8 +4,9 @@
44

55
## Development workflow
66

7-
Before running the e2e tests, ensure the test prerequisites from
8-
HACKING.md are installed.
7+
The e2e tests run inside this VM. If `uv` is not yet installed, install
8+
it first (see Tooling policy below). Ensure the test prerequisites from
9+
HACKING.md are also installed.
910

1011
Always run the test suite before committing:
1112

@@ -15,6 +16,24 @@ uv run pytest tests/test_e2e.py -v -s
1516

1617
The test boots the VM end-to-end (takes ~90s without KVM) and verifies `curl https://pypi.org` works through mitmproxy. Do not commit if this fails.
1718

19+
### Fixing CI failures
20+
21+
When a test fails in CI but passes locally, **reproduce the failure locally
22+
before applying a fix.** This VM has KVM, but CI may not — one known
23+
divergence is the QEMU CPU model (`-cpu host` with KVM vs `-cpu max` with
24+
TCG). To match CI's TCG environment:
25+
26+
```bash
27+
QEMU_ACCEL=tcg uv run pytest tests/test_e2e.py::test_that_failed -v -s
28+
```
29+
30+
The workflow is:
31+
32+
1. **Reproduce** — run the failing test under CI-like conditions and confirm it fails.
33+
2. **Fix** — apply the change.
34+
3. **Verify** — re-run under the same conditions and confirm it passes.
35+
4. **Full suite** — run the complete test suite to check for regressions.
36+
1837
The full suite including the network isolation tests can take 5+ minutes under TCG emulation. TCG is slower than KVM but not *that* slow — if cloud-init status is unchanged for more than a minute, check the console log and process list rather than assuming it's just slow. A dead QEMU process or OOM kill is more likely than TCG being the bottleneck.
1938

2039
Launch the test with `Bash` using `run_in_background: true`, then immediately attach a `Monitor` to tail the output file with a progress filter. This keeps the conversation unblocked while streaming results:

allowlist.txt

Lines changed: 15 additions & 8 deletions
@@ -59,20 +59,27 @@ GET https://api.anthropic.com/api/hello
5959
# astral.sh; binary downloads come from releases.astral.sh (or GitHub
6060
# release assets as a fallback). URLs vary by version and platform.
6161
GET https://astral.sh/uv/install.sh
62+
GET https://releases.astral.sh/installers/uv/*
6263
GET https://releases.astral.sh/github/uv/releases/*
6364
GET https://github.com/astral-sh/uv/releases/*
6465
GET https://release-assets.githubusercontent.com/github-production-release-asset/*
6566

66-
# ── Docker Hub ────────────────────────────────────────────────────
67-
# Registry API — paths vary by image name, tag, and sha256 digest
68-
# (e.g. /v2/library/hello-world/manifests/latest). Scoped to /v2/.
69-
GET https://registry-1.docker.io/v2/*
70-
# Auth tokens — the registry returns 401 with a token URL whose
71-
# query parameters vary per request (scope, service, etc.).
67+
# ── Docker Hub (hello-world only) ─────────────────────────────────
68+
# Scoped to the library/hello-world image used by the e2e test.
69+
# To pull other images, add their specific paths here.
70+
# /v2/ (bare) is Docker's registry version check — required before
71+
# any image pull.
72+
GET https://registry-1.docker.io/v2/
73+
GET https://registry-1.docker.io/v2/library/hello-world/*
74+
# Auth tokens: the token URL's query parameters (scope, service) vary
# per request, so this rule cannot be path-scoped to hello-world.
7275
GET https://auth.docker.io/token*
73-
# Blob storage — the registry redirects layer downloads to this
74-
# Cloudflare R2 bucket. Paths contain per-blob sha256 digests.
76+
# Blob storage — the registry redirects layer downloads to either
77+
# a Cloudflare R2 bucket or CloudFront CDN. Blob paths contain
78+
# sha256 digests that can't be scoped per-image, but the registry
79+
# only returns redirect URLs for layers belonging to images the
80+
# client already resolved via the scoped manifest rules above.
7581
GET https://docker-images-prod.6aa30f8b08e16409b46e0173d6de2f56.r2.cloudflarestorage.com/registry-v2/*
82+
GET https://production.cloudfront.docker.com/registry-v2/*
7683

7784
# ── Debian cloud images (nested VM testing only) ──────────────────
7885
# Only needed when running the e2e test suite inside a VM (i.e. the
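
A minimal sketch of how rules like the Docker Hub entries above might be evaluated, assuming each rule has the form `METHOD URL-pattern` and that `*` matches any run of characters. The real matching logic lives in filter.py, which this commit does not show, so `rule_matches` and `is_allowed` are hypothetical names and the semantics are an assumption.

```python
import re

# Two of the rules added in this commit, copied verbatim from allowlist.txt.
RULES = [
    "GET https://registry-1.docker.io/v2/",
    "GET https://registry-1.docker.io/v2/library/hello-world/*",
]


def rule_matches(rule: str, method: str, url: str) -> bool:
    """Return True if `method url` satisfies one allowlist rule.

    Assumption: `*` is the only wildcard and matches any characters,
    including `/`. Everything else is compared literally.
    """
    rule_method, _, pattern = rule.partition(" ")
    if rule_method != method:
        return False
    # Escape the literal parts and join them with `.*` for each `*`.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, url) is not None


def is_allowed(method: str, url: str, rules: list[str] = RULES) -> bool:
    """Return True if any rule admits the request."""
    return any(rule_matches(rule, method, url) for rule in rules)
```

Under these assumed semantics, the bare `/v2/` version check and hello-world manifests pass, while a manifest request for any other image is rejected, which is the narrowing the hunk describes.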

cloud-init/user-data

Lines changed: 48 additions & 8 deletions
@@ -21,12 +21,18 @@ bootcmd:
2121
wget -qO /usr/local/share/ca-certificates/mitmproxy.crt http://mitm.it/cert/pem
2222
update-ca-certificates
2323
fi
24-
# Disable initramfs rebuilds — this is an ephemeral VM that is never
25-
# rebooted. The generic kernel's mkinitramfs takes 2+ minutes under
26-
# TCG because it copies hundreds of driver modules.
27-
- dpkg-divert --local --rename --add /usr/sbin/update-initramfs
28-
- ln -sf /bin/true /usr/sbin/update-initramfs
29-
24+
# Suppress initramfs rebuilds during first-boot provisioning only.
25+
# Package installs (docker.io, etc.) trigger the initramfs-tools dpkg
26+
# hook, and a full rebuild takes minutes under TCG emulation. The
27+
# diversion is undone in runcmd so that future kernel upgrades (via
28+
# unattended-upgrades) generate a working initramfs.
29+
# Guard: boot-finished is written at the very end of cloud-init's final
30+
# stage, so it won't exist during first boot but will on all subsequent boots.
31+
- |
32+
if [ ! -f /var/lib/cloud/instance/boot-finished ]; then
33+
dpkg-divert --local --rename --add /usr/sbin/update-initramfs 2>/dev/null
34+
ln -sf /bin/true /usr/sbin/update-initramfs
35+
fi
3036
write_files:
3137
- path: /etc/apt/apt.conf.d/90proxy
3238
content: |
@@ -64,6 +70,13 @@ write_files:
6470
Environment="HTTPS_PROXY=http://__HOST_IP__:__PROXY_PORT__"
6571
Environment="NO_PROXY=localhost,127.0.0.1,__HOST_IP__"
6672

73+
# The default MODULES=most copies hundreds of bare-metal drivers into
74+
# the initramfs. Under TCG emulation this takes minutes. MODULES=dep
75+
# limits it to modules for detected hardware (virtio), cutting the
76+
# rebuild from ~5 min to seconds.
77+
- path: /etc/initramfs-tools/conf.d/vm-modules-dep
78+
content: |
79+
MODULES=dep
6780
- path: /etc/systemd/system/mnt-9p.mount
6881
content: |
6982
[Unit]
@@ -184,6 +197,14 @@ packages:
184197
- docker.io
185198

186199
runcmd:
200+
# Undo the update-initramfs diversion applied during first-boot
201+
# provisioning (see bootcmd). From this point on, kernel upgrades
202+
# will generate a proper initramfs.
203+
- |
204+
if dpkg-divert --list /usr/sbin/update-initramfs 2>/dev/null | grep -q diversion; then
205+
rm -f /usr/sbin/update-initramfs
206+
dpkg-divert --local --rename --remove /usr/sbin/update-initramfs
207+
fi
187208
- mkdir -p /mnt/9p /home/vm/shared
188209
- systemctl daemon-reload
189210
- systemctl enable --now mnt-9p.mount
@@ -200,8 +221,27 @@ runcmd:
200221
# from /etc/profile.d/proxy.sh). --no-modify-path because proxy.sh
201222
# already adds ~/.local/bin to PATH. Binary lands in /home/vm/.local/bin/.
202223
- su - vm -c 'curl -LsSf https://astral.sh/uv/install.sh | sh -s -- --no-modify-path'
203-
# Install Claude Code CLI.
204-
- su - vm -c 'curl -fsSL https://claude.ai/install.sh | bash'
224+
# Install Claude Code CLI. The official install script runs
225+
# `claude install` after downloading, which maps ~70 GB of virtual
226+
# memory. Under TCG emulation this either triggers an invalid-opcode
227+
# trap (qemu64 lacks the required instructions) or takes so long that
228+
# cloud-init times out. Download the binary directly instead.
229+
- |
230+
su - vm -c '
231+
set -e
232+
DOWNLOAD_BASE="https://downloads.claude.ai/claude-code-releases"
233+
case "$(uname -m)" in
234+
x86_64|amd64) platform="linux-x64" ;;
235+
aarch64|arm64) platform="linux-arm64" ;;
236+
*) echo "Unsupported arch: $(uname -m)" >&2; exit 1 ;;
237+
esac
238+
version=$(curl -fsSL "$DOWNLOAD_BASE/latest")
239+
mkdir -p ~/.local/share/claude/versions ~/.local/bin
240+
curl -fsSL -o ~/.local/share/claude/versions/"$version" \
241+
"$DOWNLOAD_BASE/$version/$platform/claude"
242+
chmod +x ~/.local/share/claude/versions/"$version"
243+
ln -sf ~/.local/share/claude/versions/"$version" ~/.local/bin/claude
244+
'
205245
# Propagate the host user's git identity into the VM so commits
206246
# made inside the guest have the correct author. The placeholders
207247
# are substituted by vm.py from `git config --global`; if the host
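
The direct-download step in the runcmd hunk above maps `uname -m` output to a platform directory and builds the binary URL from it. A minimal Python sketch of that mapping follows, mirroring the shell `case` statement. `claude_binary_url` is a hypothetical helper, not part of the repository. The version string is passed in here; in the script it is fetched from `$DOWNLOAD_BASE/latest`.

```python
DOWNLOAD_BASE = "https://downloads.claude.ai/claude-code-releases"

# Same mapping as the shell `case "$(uname -m)"` in cloud-init/user-data.
_PLATFORMS = {
    "x86_64": "linux-x64",
    "amd64": "linux-x64",
    "aarch64": "linux-arm64",
    "arm64": "linux-arm64",
}


def claude_binary_url(machine: str, version: str, base: str = DOWNLOAD_BASE) -> str:
    """Build the download URL for a given machine type and release version."""
    try:
        platform = _PLATFORMS[machine]
    except KeyError:
        # Matches the script's behaviour of failing on unknown architectures.
        raise ValueError(f"Unsupported arch: {machine}") from None
    return f"{base}/{version}/{platform}/claude"
```

For example, `claude_binary_url("aarch64", "1.2.3")` (with `1.2.3` as a placeholder version) yields a URL ending in `/1.2.3/linux-arm64/claude`.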

tests/test_e2e.py

Lines changed: 162 additions & 1 deletion
@@ -181,7 +181,7 @@ def running_vm():
181181
# vm.py start runs mitmproxy in the background and QEMU in the foreground.
182182
# Both inherit our file handles, so their output lands in console.log.
183183
vm_proc = subprocess.Popen(
184-
[sys.executable, str(VM_PY), "start", "--memory", "512M",
184+
[sys.executable, str(VM_PY), "start", "--memory", "2G",
185185
"--ssh-port", str(TEST_SSH_PORT),
186186
"--proxy-port", str(TEST_PROXY_PORT),
187187
"--extra-user-data", str(REPO / "tests" / "nmap.yaml")],
@@ -332,6 +332,45 @@ def test_docker_hello_world(running_vm):
332332
)
333333

334334

335+
def test_uv_installed(running_vm):
336+
"""uv should be installed and functional after cloud-init provisioning."""
337+
_progress("Checking uv installation…")
338+
r = _vm_ssh("bash -lc 'uv --version'", timeout=30)
339+
assert r.returncode == 0, (
340+
f"uv not installed or not on PATH (rc={r.returncode}):\n"
341+
f"stdout: {r.stdout[:500]}\nstderr: {r.stderr[:500]}"
342+
)
343+
assert "uv" in r.stdout, f"Unexpected uv --version output: {r.stdout}"
344+
345+
346+
def test_claude_code_installed(running_vm):
347+
"""Claude Code CLI should be installed and functional after cloud-init provisioning."""
348+
_progress("Checking Claude Code installation…")
349+
r = _vm_ssh("bash -lc 'claude --version'", timeout=30)
350+
if r.returncode != 0:
351+
diag = _vm_ssh(
352+
"bash -lc '"
353+
"echo \"=== binary ===\"; ls -la ~/.local/bin/claude 2>&1; "
354+
"echo \"=== versions ===\"; ls ~/.local/share/claude/versions/ 2>&1; "
355+
"echo \"=== file ===\"; file $(readlink -f ~/.local/bin/claude) 2>&1; "
356+
"echo \"=== ldd ===\"; ldd $(readlink -f ~/.local/bin/claude) 2>&1; "
357+
"echo \"=== dmesg ===\"; sudo dmesg 2>&1 | tail -20; "
358+
"echo \"=== free ===\"; free -h 2>&1; "
359+
"echo \"=== PATH ===\"; echo PATH=$PATH'",
360+
timeout=10,
361+
)
362+
pytest.fail(
363+
f"claude not installed or not on PATH (rc={r.returncode}):\n"
364+
f"stderr: {r.stderr[:500]}\n"
365+
f"diagnostics:\n{diag.stdout[:2000]}"
366+
)
367+
output = (r.stdout + r.stderr).lower()
368+
assert "claude" in output, (
369+
f"Unexpected claude --version output:\n"
370+
f"stdout: {r.stdout!r}\nstderr: {r.stderr!r}"
371+
)
372+
373+
335374
def test_blocked_domain(running_vm):
336375
"""Requests to domains not in filter.py's allowlist should be blocked with 403."""
337376
result = _vm_ssh(
@@ -676,3 +715,125 @@ def test_guest_cannot_modify_host_allowlist(running_vm):
676715
_vm_ssh(f"rm -f ~/shared/{marker} 2>/dev/null; true", timeout=10)
677716
# Safety net: restore original content in case the test failed
678717
allowlist_path.write_text(original_content)
718+
719+
720+
# ---------------------------------------------------------------------------
721+
# Kernel upgrade + reboot
722+
# ---------------------------------------------------------------------------
723+
724+
725+
def test_kernel_install_and_reboot(running_vm):
726+
"""Installing a new kernel and rebooting must not kernel panic.
727+
728+
The base cloud-init config once diverted update-initramfs to /bin/true
729+
to speed up provisioning (~2 min saved under TCG emulation). This was
730+
safe under the assumption that the VM was ephemeral and never rebooted.
731+
In practice, Debian's unattended-upgrades installs kernel security
732+
updates on a daily timer. Because update-initramfs was a no-op, the
733+
new kernel shipped without an initramfs. GRUB's update-grub still picked
734+
up the new vmlinuz and made it the default boot entry — but with no
735+
initrd line. On next boot the kernel couldn't load the virtio_blk
736+
module (it lives in the initramfs, not built-in), so the root disk was
737+
invisible and the kernel panicked:
738+
739+
VFS: Cannot open root device "PARTUUID=..." or unknown-block(0,0)
740+
Kernel panic - not syncing: VFS: Unable to mount root fs
741+
742+
This test reproduces that scenario end-to-end: install a second kernel
743+
flavor, set GRUB to boot it, and reboot. If update-initramfs is broken,
744+
the VM kernel-panics and SSH never comes back.
745+
746+
Placed last because it reboots the VM.
747+
"""
748+
# Detect guest architecture to pick the right cloud kernel package.
749+
r = _vm_ssh("dpkg --print-architecture", timeout=10)
750+
assert r.returncode == 0
751+
arch = r.stdout.strip()
752+
cloud_pkg = f"linux-image-cloud-{arch}"
753+
754+
_progress(f"Installing {cloud_pkg}…")
755+
r = _vm_ssh(
756+
f"bash -lc 'sudo apt-get install -y -qq {cloud_pkg} 2>&1'",
757+
timeout=600,
758+
)
759+
assert r.returncode == 0, (
760+
f"Kernel install failed (rc={r.returncode}):\n"
761+
f"{r.stdout[-2000:]}\n{r.stderr[-2000:]}"
762+
)
763+
764+
# Find the newly installed cloud kernel version.
765+
r = _vm_ssh(f"ls /boot/vmlinuz-*-cloud-{arch}", timeout=10)
766+
assert r.returncode == 0, f"No cloud kernel found in /boot:\n{r.stderr}"
767+
cloud_vmlinuz = r.stdout.strip().splitlines()[-1].strip()
768+
cloud_version = cloud_vmlinuz.rsplit("/", 1)[-1].removeprefix("vmlinuz-")
769+
_progress(f"Installed cloud kernel: {cloud_version}")
770+
771+
# Verify the initramfs was created for it.
772+
r = _vm_ssh(f"test -f /boot/initrd.img-{cloud_version}", timeout=10)
773+
assert r.returncode == 0, (
774+
f"initrd.img-{cloud_version} was not created.\n"
775+
"update-initramfs is likely diverted to /bin/true."
776+
)
777+
778+
# Set GRUB to boot the cloud kernel by default.
779+
# Read the root filesystem UUID from the running VM rather than
780+
# hardcoding a PARTUUID that is specific to one image build.
781+
r = _vm_ssh(
782+
"sudo grub-probe --target=fs_uuid /",
783+
timeout=10,
784+
)
785+
assert r.returncode == 0, f"Cannot determine root FS UUID:\n{r.stderr}"
786+
root_uuid = r.stdout.strip()
787+
grub_entry = f"gnulinux-advanced-{root_uuid}>gnulinux-{cloud_version}-advanced-{root_uuid}"
788+
_vm_ssh(
789+
f"sudo grub-set-default '{grub_entry}' 2>&1",
790+
timeout=10,
791+
)
792+
_vm_ssh("sudo update-grub 2>&1", timeout=60)
793+
794+
# Verify GRUB config has an initrd line for the cloud kernel.
795+
r = _vm_ssh("cat /boot/grub/grub.cfg", timeout=10)
796+
assert f"initrd\t/boot/initrd.img-{cloud_version}" in r.stdout, (
797+
f"GRUB config missing initrd for {cloud_version}."
798+
)
799+
800+
_progress("Rebooting into cloud kernel…")
801+
_vm_ssh("sudo reboot", timeout=10)
802+
803+
# Wait for SSH to go down.
804+
time.sleep(10)
805+
806+
# Wait for SSH to come back — if the kernel panicked, it never will.
807+
deadline = time.monotonic() + BOOT_TIMEOUT
808+
attempt = 0
809+
while time.monotonic() < deadline:
810+
if running_vm.poll() is not None:
811+
_dump_logs()
812+
pytest.fail(
813+
"QEMU exited during reboot — likely kernel panic.\n"
814+
"Check console log above."
815+
)
816+
attempt += 1
817+
remaining = int(deadline - time.monotonic())
818+
_progress(f"Post-reboot SSH probe #{attempt} ({remaining}s remaining)…")
819+
try:
820+
r = _vm_ssh("true", timeout=10)
821+
if r.returncode == 0:
822+
_progress(f"VM back after reboot ({attempt} probe(s))")
823+
break
824+
except subprocess.TimeoutExpired:
825+
pass
826+
time.sleep(SSH_POLL_INTERVAL)
827+
else:
828+
_dump_logs()
829+
pytest.fail(
830+
f"VM did not come back after reboot within {BOOT_TIMEOUT}s.\n"
831+
"Likely kernel panic due to missing initramfs."
832+
)
833+
834+
# Confirm we're running the new kernel.
835+
r = _vm_ssh("uname -r", timeout=10)
836+
_progress(f"Running kernel after reboot: {r.stdout.strip()}")
837+
assert "cloud" in r.stdout, (
838+
f"Expected to boot cloud kernel, got: {r.stdout.strip()}"
839+
)
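
The post-reboot wait in `test_kernel_install_and_reboot` is a deadline-bounded polling loop. A minimal, self-contained sketch of that pattern follows. `wait_until` is a hypothetical helper, not something in tests/test_e2e.py. The clock and sleep functions are injectable so the loop can be exercised without real delays.

```python
import time
from typing import Callable


def wait_until(
    probe: Callable[[], bool],
    timeout: float,
    interval: float,
    *,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `probe` until it returns True or `timeout` seconds elapse.

    Returns the number of attempts made on success and raises
    TimeoutError if the deadline passes first.
    """
    deadline = clock() + timeout
    attempt = 0
    while clock() < deadline:
        attempt += 1
        if probe():
            return attempt
        sleep(interval)
    raise TimeoutError(f"probe did not succeed within {timeout}s ({attempt} attempts)")
```

A monotonic clock is used, as in the test, so wall-clock adjustments on the host cannot shorten or extend the deadline.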

vm.py

Lines changed: 2 additions & 2 deletions
@@ -166,7 +166,7 @@ def machine_args(self) -> list[str]:
166166
if self.arch == Arch.ARM64:
167167
cpu = "host" if self._accel == "hvf" else "cortex-a57"
168168
return ["-machine", f"virt,accel={self._accel}", "-cpu", cpu]
169-
cpu = "host" if self._accel == "hvf" else "qemu64"
169+
cpu = "host" if self._accel == "hvf" else "max"
170170
return ["-machine", f"q35,accel={self._accel}", "-cpu", cpu]
171171

172172
def prepare_efi(self, state_dir: Path) -> tuple[Path, Path]:
@@ -200,7 +200,7 @@ def machine_args(self) -> list[str]:
200200
if self.arch == Arch.ARM64:
201201
cpu = "host" if self._accel == "kvm" else "cortex-a57"
202202
return ["-machine", f"virt,accel={self._accel}", "-cpu", cpu]
203-
cpu = "host" if self._accel == "kvm" else "qemu64"
203+
cpu = "host" if self._accel == "kvm" else "max"
204204
return ["-machine", f"q35,accel={self._accel}", "-cpu", cpu]
205205

206206
def prepare_efi(self, state_dir: Path) -> tuple[Path, Path]:
