Skip to content

fix(ungrub): replicate GRUB EFI core to all boot-pool members#2681

Open
elibosley wants to merge 3 commits into
masterfrom
fix/sync-bootx64-to-all-mirror-members
Open

fix(ungrub): replicate GRUB EFI core to all boot-pool members#2681
elibosley wants to merge 3 commits into
masterfrom
fix/sync-bootx64-to-all-mirror-members

Conversation

@elibosley

@elibosley elibosley commented Jun 24, 2026

Copy link
Copy Markdown
Member

Problem

mkbootable writes the GRUB EFI core to only the target device's ESP, while grub-install refreshes the shared modules under /boot/grub (which lives on the boot pool and is seen by every member). On a mirror, that means a grub-install for one disk silently desyncs every other member's core from the freshly-rebuilt modules.

The result: the un-refreshed member boots its old core against the new shared modules, the GRUB core/module ABI doesn't match, and it dies at the bootloader with:

error: symbol 'grub_memcpy' not found

The pool still boots (firmware falls through to a good member), so the loss of boot redundancy is silent — the "mirror" can't actually boot if the good drive fails.

Seen in the wild on a 2-way NVMe boot mirror: the member added later carried a stale core and would not boot, while a grub refresh on the original member rebuilt the shared modules and finished off the stale one.

Fix

  • install_grub now stashes the freshly built core and replicates it to every boot-pool member's ESP after grub-install. No-op for a single-device pool; keeps all mirror members in lockstep on every add / replace / refresh.
  • New mkbootable sync op re-distributes the canonical core (/boot/grub/x86_64-efi/core.efi, which is built in lockstep with the shared modules) to all members — to repair systems whose members already drifted (e.g. a device added via a bare zpool attach).
  • member_to_esp maps each member partition (…p3) to its ESP (…p2) via lsblk pkname, handling both nvmeXnYpN and sdXN naming.

The distributed core is safe to copy to every member because it uses a dynamic ($root) prefix and is rebuilt together with the modules it loads.

Validation

  • bash -n clean (the two SC2174 shellcheck warnings are pre-existing).
  • Device→ESP enumeration dry-run against a live NVMe flash mirror resolved both members to the correct ESPs.
  • Confirmed /boot/grub/x86_64-efi/core.efi is byte-identical to the working member's BOOTX64.EFI.

Scope / follow-up

Covers UEFI (BOOTX64.EFI). Sibling disks' BIOS boot code (grub-install --target=i386-pc to p1/MBR) has the same drift exposure and is a sensible follow-up; not addressed here.

Refs Linear DST-54.

Summary by CodeRabbit

  • New Features
    • Added a new mkbootable sync operation to copy the canonical GRUB UEFI boot file to all boot-pool member disks.
    • Enhanced boot setup so updated boot files are automatically propagated to every relevant EFI System Partition.
  • Bug Fixes
    • Improved robustness with safer handling for unmapped partitions, non-block devices, and mount/install failures, including per-disk logging and skip behavior when appropriate.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

@elibosley, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 54 minutes and 24 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 67e718fd-3112-4d74-b989-fab211ab1262

📥 Commits

Reviewing files that changed from the base of the PR and between 57c4332 and 769bab1.

📒 Files selected for processing (1)
  • ungrub/mkbootable

Walkthrough

Adds shared EFI core copy helpers, uses them during install_grub(), and adds a standalone sync operation that redistributes BOOTX64.EFI from a mounted source ESP.

Changes

EFI core redistribution

Layer / File(s) Summary
ESP mapping and copy routine
ungrub/mkbootable
member_esp() derives each member disk's ESP path, install_efi_core() writes BOOTX64.EFI to a target ESP through either a mounted path or mtools staging, and replicate_grub() enumerates pool members, skips the source ESP and non-block devices, and logs per-target failures.
Install-time replication
ungrub/mkbootable
install_grub() calls replicate_grub() with the freshly built BOOTX64.EFI so other member ESPs are refreshed after installation.
Standalone sync command
ungrub/mkbootable
sync_device() mounts the source ESP read-only, replicates its EFI/BOOT/BOOTX64.EFI to other member ESPs, unmounts, and the dispatcher routes the sync operation to it.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

A bunny hopped through EFI light,
and copied boots from disk to disk just right.
One sync, one hop, one tidy cheer,
now every ESP keeps pace sincere. 🐇

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main change: synchronizing the GRUB EFI core across all boot-pool members.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/sync-bootx64-to-all-mirror-members

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ungrub/mkbootable`:
- Around line 111-129: The sync loop in mkbootable currently logs mapping,
mount, copy, and flush failures but always exits successfully; update the sync
path so it tracks whether any ESP update failed and returns a non-zero status if
any member was skipped or not written. Use the existing sync_efi_to_members
logic around member_to_esp, mount, cp, and sync -f to set a failure flag or
propagate errors, and ensure the function’s final return reflects partial
failure instead of success.
- Around line 156-166: The install flow in mkbootable is losing failures from
both staging and sync: if the BOOTX64.EFI copy into core_img fails, an empty
temp file still gets passed to sync_efi_to_members(), and the final rm -f can
overwrite a non-zero sync exit status. Update the mkbootable staging/sync block
to check the cp result before calling sync_efi_to_members, and preserve the sync
return code by running cleanup without masking it so the install fails when EFI
propagation fails.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a9aedb71-1607-487e-a4f0-f1c68a6c1687

📥 Commits

Reviewing files that changed from the base of the PR and between bccf49e and a365fcd.

📒 Files selected for processing (1)
  • ungrub/mkbootable

Comment thread ungrub/mkbootable Outdated
Comment thread ungrub/mkbootable Outdated
@elibosley elibosley force-pushed the fix/sync-bootx64-to-all-mirror-members branch from a365fcd to d3dfd17 Compare June 24, 2026 20:35
@elibosley elibosley changed the title fix(ungrub): sync BOOTX64.EFI to all boot-pool members fix(ungrub): replicate GRUB EFI core to all boot-pool members Jun 24, 2026
@elibosley elibosley force-pushed the fix/sync-bootx64-to-all-mirror-members branch 3 times, most recently from ebb2bda to 57c4332 Compare June 24, 2026 23:05
install_grub installed the core to only the target device's ESP, while
grub-install rebuilt the shared modules under /boot/grub. On a mirror that
left every other member's ESP with a stale core that no longer matched the
rebuilt modules, so those members failed to boot with
"symbol 'grub_memcpy' not found" -- silently destroying boot redundancy.

- install_grub now copies the freshly built core to every other member's ESP
  (no-op for a single-device pool).
- new "mkbootable sync <good-device>" op re-distributes the core from a known
  good member to all others, to repair members that already drifted (eg a
  device added via a bare "zpool attach").

install_efi_core writes safely:
- reuse the ESP's existing mount if it has one, else mount it on a temp dir
  (never stack a second mount on one FAT -- two rw mounts corrupt each other).
- stage as BOOTX64.EFI.new then mv over the live file; mv on one filesystem is
  rename(2), an atomic replace, so an interrupted write can't leave a torn or
  missing core.

The core uses a dynamic ($root) prefix so the same image is valid on every
member. Split into member_esp / install_efi_core / replicate_grub. Both the
reuse-mount and self-mount paths verified on a loop device.
@elibosley elibosley force-pushed the fix/sync-bootx64-to-all-mirror-members branch from 57c4332 to 01afe8c Compare June 24, 2026 23:09

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ungrub/mkbootable`:
- Line 235: The BOOTX64.EFI replication path in replicate_grub currently assumes
the source file exists and is non-empty, so add an upfront [[ -s
"$BOOT/efi/EFI/BOOT/BOOTX64.EFI" ]] validation before any target iteration; if
the check fails, make sync abort immediately and perform the existing
cleanup/error handling so no targets are processed with an invalid source core.
- Around line 131-132: The ESP skip check in replicate_grub currently compares
$esp and $EFI_PART as raw path strings, so alias paths can bypass the guard and
let sync target the source mount. Update the logic around member_esp and the
existing skip condition to compare canonical device identity instead, using
readlink -f (or an equivalent MAJ:MIN block-device comparison) for both $esp and
$EFI_PART before deciding to continue.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 99694f4a-3c83-4c50-bd1d-e17860fa1796

📥 Commits

Reviewing files that changed from the base of the PR and between a365fcd and 57c4332.

📒 Files selected for processing (1)
  • ungrub/mkbootable

Comment thread ungrub/mkbootable Outdated
Comment thread ungrub/mkbootable Outdated
log "sync: cannot mount source ESP $EFI_PART"
exit 1
fi
replicate_grub "$BOOT/efi/EFI/BOOT/BOOTX64.EFI"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Validate the source core before redistributing it.

If the mounted source ESP lacks EFI/BOOT/BOOTX64.EFI or it is empty, sync should fail before iterating targets. Add a [[ -s "$BOOT/efi/EFI/BOOT/BOOTX64.EFI" ]] check with cleanup.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ungrub/mkbootable` at line 235, The BOOTX64.EFI replication path in
replicate_grub currently assumes the source file exists and is non-empty, so add
an upfront [[ -s "$BOOT/efi/EFI/BOOT/BOOTX64.EFI" ]] validation before any
target iteration; if the check fails, make sync abort immediately and perform
the existing cleanup/error handling so no targets are processed with an invalid
source core.

…y copy

Replace the file-copy replicate_grub/install_efi_core with
install_grub_all_members, which re-runs grub-install for both x86_64-efi
and i386-pc on every boot-pool member.

The EFI-only copy left the legacy/BIOS boot blocks to drift: install_grub
runs grub-install --target=i386-pc on only the added disk, so a pre-existing
member's core.img (BIOS Boot Partition) and boot.img (MBR) go stale against
the rebuilt i386-pc modules and brick that member under CSM/legacy boot with
"symbol 'grub_memcpy' not found" -- the same failure the EFI fix addresses,
on the path it didn't cover.

grub-install is the canonical mechanism (cf. OpenZFS "repeat grub-install for
each disk"): it writes core.img to each member's BIOS Boot Partition and
boot.img to each MBR while preserving the partition table, which a raw copy
cannot guarantee. The ESP mount-reuse guard is kept for the UEFI install.

"mkbootable sync" no longer needs a <good-device> arg -- it reinstalls from
the current shared modules on all members.
…mber

install_grub_all_members logged per-member failures but always returned 0, so
mkbootable add/sync could report success while a member was skipped or failed
to update -- silently leaving behind the exact stale-core drift this fix is
meant to prevent.

Now it counts skipped/failed members, logs a "N of M member(s) failed" summary,
and returns non-zero if any member did not get a fresh core (or the pool
enumerated zero members). install_grub and add_device propagate that status,
and the operation dispatch exits with it, so a degraded remediation surfaces as
a failure to the caller instead of a false success.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant