fix(ungrub): replicate GRUB EFI core to all boot-pool members#2681
fix(ungrub): replicate GRUB EFI core to all boot-pool members#2681elibosley wants to merge 3 commits into
Conversation
|
Warning Review limit reached
More reviews will be available in 54 minutes and 24 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. WalkthroughAdds shared EFI core copy helpers, uses them during ChangesEFI core redistribution
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ungrub/mkbootable`:
- Around line 111-129: The sync loop in mkbootable currently logs mapping,
mount, copy, and flush failures but always exits successfully; update the sync
path so it tracks whether any ESP update failed and returns a non-zero status if
any member was skipped or not written. Use the existing sync_efi_to_members
logic around member_to_esp, mount, cp, and sync -f to set a failure flag or
propagate errors, and ensure the function’s final return reflects partial
failure instead of success.
- Around line 156-166: The install flow in mkbootable is losing failures from
both staging and sync: if the BOOTX64.EFI copy into core_img fails, an empty
temp file still gets passed to sync_efi_to_members(), and the final rm -f can
overwrite a non-zero sync exit status. Update the mkbootable staging/sync block
to check the cp result before calling sync_efi_to_members, and preserve the sync
return code by running cleanup without masking it so the install fails when EFI
propagation fails.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
a365fcd to
d3dfd17
Compare
ebb2bda to
57c4332
Compare
install_grub installed the core to only the target device's ESP, while grub-install rebuilt the shared modules under /boot/grub. On a mirror that left every other member's ESP with a stale core that no longer matched the rebuilt modules, so those members failed to boot with "symbol 'grub_memcpy' not found" -- silently destroying boot redundancy. - install_grub now copies the freshly built core to every other member's ESP (no-op for a single-device pool). - new "mkbootable sync <good-device>" op re-distributes the core from a known good member to all others, to repair members that already drifted (eg a device added via a bare "zpool attach"). install_efi_core writes safely: - reuse the ESP's existing mount if it has one, else mount it on a temp dir (never stack a second mount on one FAT -- two rw mounts corrupt each other). - stage as BOOTX64.EFI.new then mv over the live file; mv on one filesystem is rename(2), an atomic replace, so an interrupted write can't leave a torn or missing core. The core uses a dynamic ($root) prefix so the same image is valid on every member. Split into member_esp / install_efi_core / replicate_grub. Both the reuse-mount and self-mount paths verified on a loop device.
57c4332 to
01afe8c
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ungrub/mkbootable`:
- Line 235: The BOOTX64.EFI replication path in replicate_grub currently assumes
the source file exists and is non-empty, so add an upfront [[ -s
"$BOOT/efi/EFI/BOOT/BOOTX64.EFI" ]] validation before any target iteration; if
the check fails, make sync abort immediately and perform the existing
cleanup/error handling so no targets are processed with an invalid source core.
- Around line 131-132: The ESP skip check in replicate_grub currently compares
$esp and $EFI_PART as raw path strings, so alias paths can bypass the guard and
let sync target the source mount. Update the logic around member_esp and the
existing skip condition to compare canonical device identity instead, using
readlink -f (or an equivalent MAJ:MIN block-device comparison) for both $esp and
$EFI_PART before deciding to continue.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
| log "sync: cannot mount source ESP $EFI_PART" | ||
| exit 1 | ||
| fi | ||
| replicate_grub "$BOOT/efi/EFI/BOOT/BOOTX64.EFI" |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Validate the source core before redistributing it.
If the mounted source ESP lacks EFI/BOOT/BOOTX64.EFI or it is empty, sync should fail before iterating targets. Add a [[ -s "$BOOT/efi/EFI/BOOT/BOOTX64.EFI" ]] check with cleanup.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ungrub/mkbootable` at line 235, The BOOTX64.EFI replication path in
replicate_grub currently assumes the source file exists and is non-empty, so add
an upfront [[ -s "$BOOT/efi/EFI/BOOT/BOOTX64.EFI" ]] validation before any
target iteration; if the check fails, make sync abort immediately and perform
the existing cleanup/error handling so no targets are processed with an invalid
source core.
…y copy Replace the file-copy replicate_grub/install_efi_core with install_grub_all_members, which re-runs grub-install for both x86_64-efi and i386-pc on every boot-pool member. The EFI-only copy left the legacy/BIOS boot blocks to drift: install_grub runs grub-install --target=i386-pc on only the added disk, so a pre-existing member's core.img (BIOS Boot Partition) and boot.img (MBR) go stale against the rebuilt i386-pc modules and brick that member under CSM/legacy boot with "symbol 'grub_memcpy' not found" -- the same failure the EFI fix addresses, on the path it didn't cover. grub-install is the canonical mechanism (cf. OpenZFS "repeat grub-install for each disk"): it writes core.img to each member's BIOS Boot Partition and boot.img to each MBR while preserving the partition table, which a raw copy cannot guarantee. The ESP mount-reuse guard is kept for the UEFI install. "mkbootable sync" no longer needs a <good-device> arg -- it reinstalls from the current shared modules on all members.
…mber install_grub_all_members logged per-member failures but always returned 0, so mkbootable add/sync could report success while a member was skipped or failed to update -- silently leaving behind the exact stale-core drift this fix is meant to prevent. Now it counts skipped/failed members, logs a "N of M member(s) failed" summary, and returns non-zero if any member did not get a fresh core (or the pool enumerated zero members). install_grub and add_device propagate that status, and the operation dispatch exits with it, so a degraded remediation surfaces as a failure to the caller instead of a false success.
Problem
mkbootablewrites the GRUB EFI core to only the target device's ESP, whilegrub-installrefreshes the shared modules under/boot/grub(which lives on the boot pool and is seen by every member). On a mirror, that means a grub-install for one disk silently desyncs every other member's core from the freshly-rebuilt modules.The result: the un-refreshed member boots its old core against the new shared modules, the GRUB core/module ABI doesn't match, and it dies at the bootloader with:
The pool still boots (firmware falls through to a good member), so the loss of boot redundancy is silent — the "mirror" can't actually boot if the good drive fails.
Seen in the wild on a 2-way NVMe boot mirror: the member added later carried a stale core and would not boot, while a grub refresh on the original member rebuilt the shared modules and finished off the stale one.
Fix
install_grubnow stashes the freshly built core and replicates it to every boot-pool member's ESP after grub-install. No-op for a single-device pool; keeps all mirror members in lockstep on every add / replace / refresh.mkbootable syncop re-distributes the canonical core (/boot/grub/x86_64-efi/core.efi, which is built in lockstep with the shared modules) to all members — to repair systems whose members already drifted (e.g. a device added via a barezpool attach).member_to_espmaps each member partition (…p3) to its ESP (…p2) vialsblk pkname, handling bothnvmeXnYpNandsdXNnaming.The distributed core is safe to copy to every member because it uses a dynamic
($root)prefix and is rebuilt together with the modules it loads.Validation
bash -nclean (the two SC2174 shellcheck warnings are pre-existing).flashmirror resolved both members to the correct ESPs./boot/grub/x86_64-efi/core.efiis byte-identical to the working member'sBOOTX64.EFI.Scope / follow-up
Covers UEFI (
BOOTX64.EFI). Sibling disks' BIOS boot code (grub-install --target=i386-pcto p1/MBR) has the same drift exposure and is a sensible follow-up; not addressed here.Refs Linear DST-54.
Summary by CodeRabbit
mkbootable syncoperation to copy the canonical GRUB UEFI boot file to all boot-pool member disks.