
RFD 0191: bhyve live migration #168

Draft
nwilkens wants to merge 8 commits into master from rfd-0191



@nwilkens
Member

Summary

  • Architecture design for live migration of bhyve VMs in Triton
  • Extends the existing VMAPI migration control-plane with bhyve-specific hypervisor hooks
  • Covers CPU compatibility, memory transfer, vCPU state, device serialization, and network cutover

Renumbered from RFD 190 to 191 to avoid conflict with existing rfd-190 branch.

Replaces #167.

🤖 Generated with Claude Code

nwilkens and others added 8 commits April 15, 2026 12:14
- Add a "Hard Requirements" section that enumerates, by state category
  (CPU, interrupt controllers, memory, devices, clocks, storage,
  network, guest-visible identity), what has to move for live
  migration to work and why, plus what explicitly does not move and
  what invariants the hypervisor must enforce.
- Add a "CPU And Platform Compatibility" section discussing how other
  hypervisors (VMware EVC, QEMU/libvirt, Hyper-V, Nutanix AHV, Oxide
  Propolis) handle the same problem, four candidate approaches for
  Triton, a recommended combination (per-VM baseline + preamble
  validation), and platform-image ABI compatibility.
- Restructure "The CN-Local Migration Data Plane" to state the
  component's requirements without committing to a deployment home;
  leave the packaging decision (dedicated service / cn-agent /
  something else) explicitly open for community input.
- Soften threat-model language: the data plane should be secured as
  a matter of principle rather than assuming a specific network
  posture.
- Drop the phased "Implementation Plan" in favor of a single
  paragraph noting that scheduling is a planning exercise, not an
  architectural decision.
- Reframe "Alternatives Considered" to explain why each alternative
  is less useful than the proposed direction, rather than labeling
  each as "rejected".
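The "per-VM baseline + preamble validation" combination mentioned above could be sketched roughly as follows. This is an illustration only, not the RFD's design: the `Preamble` type, its fields, the feature-name strings, and the integer ABI version are all hypothetical, and the sketch assumes a baseline can be modeled as a set of feature names that the destination host must fully cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preamble:
    """Hypothetical preamble sent by the source before any state moves."""
    vm_uuid: str
    cpu_baseline: frozenset  # feature names the guest was started with (assumed shape)
    platform_abi: int        # hypothetical platform-image ABI version


def validate_preamble(preamble, host_features, host_abi_versions):
    """Return the reasons to refuse the migration; an empty list means accept."""
    problems = []
    # The destination must offer every feature in the VM's baseline.
    missing = preamble.cpu_baseline - host_features
    if missing:
        problems.append("missing CPU features: " + ", ".join(sorted(missing)))
    # The destination must understand the source's serialization ABI.
    if preamble.platform_abi not in host_abi_versions:
        problems.append(f"unsupported platform ABI: {preamble.platform_abi}")
    return problems
```

Running the check before any memory or device state is transferred means an incompatible destination is refused while the guest is still running untouched on the source.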

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Refinements to the "CPU and per-vCPU state" subsection of Hard
Requirements, based on a careful read of real working implementations:

- Remove CR8 from the control register list.  CR8's guest-visible
  value is the LAPIC Task Priority Register; it moves with the
  LAPIC, not the control registers.
- Add EFER explicitly to the control register group with a note
  that while it is architecturally an MSR, its role in gating
  long-mode means it must transfer alongside the control registers
  and be excluded from any enumeration-based MSR list to avoid
  double-restoration.
- Add IA32_DEBUGCTL to the debug registers group for the same
  architectural reason: it is an MSR, but belongs with the debug
  register state functionally.
- Replace the vague "XSAVE area of up to several kilobytes" with a
  hard requirement that implementations query the required buffer
  size at runtime (via VM_DESC_FPU_AREA or equivalent) rather than
  hardcode a size.  The area grows with each new state component
  (AMX tiledata alone is 8 KiB) and static buffers silently
  truncate on future microarchitectures.
- Note that the MSR list is kernel-authoritative, not userspace-
  authoritative.  Enumerating exhaustively from the kernel's MSR
  data class absorbs kernel additions without userspace ABI churn;
  hardcoded userspace lists silently drop new MSRs.
- Extend per-vCPU run state to call out the SIPI vector explicitly:
  run state and SIPI vector must migrate as a pair, or APs that
  received INIT-but-not-SIPI will never start.
- Split the old "pending injected events" bullet into two pieces:
  pending events (NMI, ExtInt, Exception, IntInfo via VDC_VMM_ARCH
  VAI_PEND_* fields) and the interrupt shadow
  (VM_REG_GUEST_INTR_SHADOW).  Both are distinct and both must
  migrate to avoid dropping signals or double-delivering interrupts
  one instruction early.
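One way to picture the EFER and IA32_DEBUGCTL points above: take whatever MSR set the kernel enumerates and route the MSRs that travel with another state group out of the generic list, so that nothing is restored twice. The sketch below assumes the enumeration arrives as an `{index: value}` map; the group names and the `split_msrs` helper are hypothetical. Only the two MSR indices are architectural constants.

```python
IA32_EFER = 0xC0000080   # architectural MSR index
IA32_DEBUGCTL = 0x1D9    # architectural MSR index

# MSRs that move with another state group and must stay out of the generic list.
CARRIED_ELSEWHERE = {
    IA32_EFER: "control_registers",
    IA32_DEBUGCTL: "debug_registers",
}


def split_msrs(kernel_msrs):
    """Split a kernel-enumerated {index: value} map into per-group payloads.

    Any index not named in CARRIED_ELSEWHERE lands in the generic "msrs"
    group, so MSRs added by a newer kernel are carried without code changes.
    """
    groups = {"msrs": {}, "control_registers": {}, "debug_registers": {}}
    for index, value in kernel_msrs.items():
        groups[CARRIED_ELSEWHERE.get(index, "msrs")][index] = value
    return groups
```

Because the exclusion list is short and the default is "carry it", the userspace side only has to change when an MSR needs special ordering, not whenever the kernel learns a new one.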

Also remove a stray em-dash introduced in the pending-events
rewrite, keeping the document em-dash-free.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RFD 190 is already claimed by the "Deprecating NodeJS for Rust" RFD
on the rfd-190 branch. Renumber our bhyve Live Migration Architecture
RFD to 191 to avoid the conflict.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a new top-level "Memory Transfer And Convergence Policy"
section as a structural peer to "Storage Transfer".  The prior
draft covered what RAM must move (§ Guest physical memory) and the
hypervisor-side dirty-tracking requirement (§ 5. Dirty-page
tracking) but left the policy layer unspecified: when does the
agent decide pre-copy has done enough, what guarantee do operators
get about the resulting downtime, and what happens when that
guarantee can't be met.

Central recommendation: express the exit criterion as a target
post-pause downtime in milliseconds.  Each pass measures its own
effective transfer rate; the exit threshold for pass N is
`downtime_budget_ms × observed_bw_pages_per_ms`.  Adapts to
network conditions and guest behavior by construction.
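
As a worked example of that threshold, the sketch below computes the exit criterion for one pass. The function name and the zero-elapsed guard are assumptions made for illustration; the formula is the one quoted above.

```python
def exit_threshold_pages(downtime_budget_ms, pages_sent, elapsed_ms):
    """Largest dirty-page count that can still be sent within the budget."""
    if elapsed_ms <= 0:
        return 0  # no usable rate measurement; never exit on this pass
    observed_bw_pages_per_ms = pages_sent / elapsed_ms
    return int(downtime_budget_ms * observed_bw_pages_per_ms)


# A pass that moved 1,000,000 pages in 4,000 ms ran at 250 pages/ms, so a
# 300 ms budget allows pausing once at most 75,000 pages remain dirty.
print(exit_threshold_pages(300, 1_000_000, 4_000))  # 75000
```

If the link slows down on the next pass, the same budget yields a smaller threshold, which is how the criterion tracks network conditions without a separately tuned page count.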

Also covers:
- short fixed cooldown between passes (avoid trivial-convergence
  races);
- hard safety ceiling on pass count;
- SLA-miss event when the ceiling fires, so operators distinguish
  cleanly-converged from ran-to-ceiling migrations;
- best-effort default, with strict mode as a per-migration option
  driven from the VMAPI request (out of scope for the first
  release);
- per-deployment budget with per-migration override.
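
The policy items listed above can be combined into a single pre-copy loop, sketched below under stated assumptions. The `send_pass` and `dirty_pages` hooks, the result dictionary, and the default ceiling and cooldown values are all hypothetical; the commit text does not fix any of them, nor the exact point at which the cooldown is taken.

```python
import time


def run_precopy(send_pass, dirty_pages, downtime_budget_ms,
                max_passes=30, cooldown_s=0.05, strict=False):
    """Run pre-copy passes until converged or the pass ceiling fires.

    send_pass()   -> (pages_sent, elapsed_ms) for one pass (hypothetical hook)
    dirty_pages() -> pages currently dirty after that pass (hypothetical hook)
    """
    for n in range(1, max_passes + 1):
        pages_sent, elapsed_ms = send_pass()
        rate = pages_sent / elapsed_ms if elapsed_ms > 0 else 0.0
        # Exit when the remaining dirty set fits inside the downtime budget.
        if dirty_pages() <= downtime_budget_ms * rate:
            return {"outcome": "converged", "passes": n, "sla_miss": False}
        if n < max_passes:
            time.sleep(cooldown_s)  # short fixed cooldown between passes
    # Ceiling fired: flag the SLA miss so operators can tell the cases apart.
    # Best-effort pauses anyway; strict mode gives up instead.
    outcome = "aborted" if strict else "paused_over_budget"
    return {"outcome": outcome, "passes": max_passes, "sla_miss": True}
```

A per-deployment default with a per-migration override would then amount to choosing which `downtime_budget_ms` and `strict` values the caller passes in.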

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add an overlay-case bullet to the Network subsection under Hard
Requirements, and a short callout on why the invariant matters for
sub-second cutover.  Describes the shape of the requirement (peer
caches must point at the destination within the cutover budget) and
why a guest-issued GARP cannot bootstrap its own invalidation on a
fabric network.  Stays above the mechanism: does not prescribe which
component drives invalidation, what API it uses, or how propagation
is measured.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>