
[multicast] connect MGD and DDM to Omicron #10346

Open
zeeshanlakhani wants to merge 9 commits into zl/multicast-m2p-forwarding from zl/multicast-mgd-ddm

Collaborator

@zeeshanlakhani zeeshanlakhani commented Apr 30, 2026

This PR stacks atop #10070 and inherits the multicast-to-physical (M2P) underlay forwarding and VMM-keyed instance subscription endpoints.

This also builds on and integrates #10381.

Here, we wire MGD (MRIB programming) and DDM (live peer topology for sled-to-switch-port resolution) into the multicast reconciler RPW. The reconciler resolves sled-to-port mapping via DDM peers (primary, live source) and falls back to inventory + DPD backplane when DDM is unavailable. MRIB routes are advertised through MGD and withdrawn when no "Joined" members remain.

Multicast is instance networking under the planned migration of system-level networking from Nexus RPWs to sled-agent reconcilers (omicron#10167).

Sled-side underlay NIC filter programming

  • set_mcast_m2p / clear_mcast_m2p in the OPTE port manager hold UDP sockets joined to the underlay multicast group on each underlay NIC. Joining the group on a held socket triggers mac_multicast_add in the kernel, which programs the per-NIC multicast MAC filter so cxgbe delivers frames to xde. Workaround for opte#908.
  • Eager rehydration at sled-agent startup reopens those filter sockets for M2P entries that survive in xde across a restart. Initial-join and rehydration failures both roll back the underlying M2P entry so convergence retries on the next pass instead of silently dropping the group.
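
The rollback behaviour described above can be sketched roughly as follows. This is a minimal model, not the port manager's actual code: `McastState`, `FilterSocket`, and the injected `join` closure are hypothetical names used only for illustration, and the real join goes through a UDP socket and `IPV6_JOIN_GROUP` rather than a closure.

```rust
use std::collections::BTreeMap;
use std::net::{IpAddr, Ipv6Addr};

/// Hypothetical stand-in for a UDP socket held open on one underlay NIC.
/// Dropping it models closing the socket, which leaves the group.
#[derive(Debug)]
struct FilterSocket {
    nic: String,
}

/// Hypothetical model of the port manager's multicast state.
#[derive(Debug, Default)]
struct McastState {
    /// Stand-in for the xde M2P table: overlay group -> underlay group.
    m2p: BTreeMap<IpAddr, Ipv6Addr>,
    /// Held filter sockets per underlay group, one per underlay NIC.
    filters: BTreeMap<Ipv6Addr, Vec<FilterSocket>>,
}

impl McastState {
    /// Install the M2P entry, then join the underlay group on every NIC.
    /// `join` is injected (an assumption of this sketch) so the failure
    /// path can be exercised without real sockets.
    fn set_mcast_m2p<J>(
        &mut self,
        overlay: IpAddr,
        underlay: Ipv6Addr,
        nics: &[&str],
        mut join: J,
    ) -> Result<(), String>
    where
        J: FnMut(&str, Ipv6Addr) -> Result<FilterSocket, String>,
    {
        self.m2p.insert(overlay, underlay);
        let mut held = Vec::with_capacity(nics.len());
        for &nic in nics {
            match join(nic, underlay) {
                Ok(sock) => held.push(sock),
                Err(e) => {
                    // Roll back the M2P entry. Sockets already in `held`
                    // are dropped on return, so no partial join survives.
                    self.m2p.remove(&overlay);
                    return Err(format!("join on {} failed: {}", nic, e));
                }
            }
        }
        self.filters.insert(underlay, held);
        Ok(())
    }
}
```

With this shape, a failed join leaves neither an M2P entry nor held sockets behind, so a later reconciler pass starts from a clean slate for that group.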

Switch-zone integration

  • New MulticastSwitchZoneClient fans out per-switch MGD and DDM clients, discovered via internal DNS SRV records. The reconciler uses it for MRIB writes and live peer queries (consuming the ddm-admin-client GET /peers endpoint that returns if_name / port info per peer).
  • ServiceName::Ddm registered in internal DNS via host_zone_switch (now takes a ddm_port) so cross-sled consumers can discover ddmd in switch zones. RSS, the test starter, and overridables_for_test thread the new port through. The multicast reconciler is the first cross-sled consumer; previously, all DdmAdminClient callers were sled-local via DdmAdminClient::localhost.
  • Resolver helper preserves SRV target names alongside resolved sockets, enabling per-target correlation when multiple switch zones share an address but differ by port.
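
To illustrate why keeping the SRV target name matters, here is a small sketch. `ResolvedTarget` and both index functions are hypothetical names, not the resolver helper's actual API; the point is only that an address-keyed index collapses entries that a target-keyed index keeps apart.

```rust
use std::collections::BTreeMap;
use std::net::{Ipv6Addr, SocketAddrV6};

/// Hypothetical pairing of an SRV target name with its resolved socket.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ResolvedTarget {
    target: String,
    addr: SocketAddrV6,
}

/// Index by SRV target name: one entry per switch zone, even when
/// several zones resolve to the same IP address.
fn index_by_target(resolved: &[ResolvedTarget]) -> BTreeMap<&str, SocketAddrV6> {
    resolved.iter().map(|r| (r.target.as_str(), r.addr)).collect()
}

/// Index by IP address only: entries sharing an address overwrite each
/// other, which is the ambiguity that preserving target names avoids.
fn index_by_ip(resolved: &[ResolvedTarget]) -> BTreeMap<Ipv6Addr, SocketAddrV6> {
    resolved.iter().map(|r| (*r.addr.ip(), r.addr)).collect()
}
```

Two switch zones that both resolve to `::1` on different ports, as can happen in a test environment, produce two entries in the first index and only one in the second.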

Note: the first reconciler pass after upgrade publishes one new _ddm._tcp SRV record per switch zone, causing a one-time DNS generation bump.

VMM-scoped multicast subscriptions

  • v41 (VERSION_MCAST_M2P_FORWARDING) keeps the VMM-keyed PUT/DELETE /vmms/{propolis_id}/multicast-group shape and threads propolis_id end-to-end through the sled-agent. instance_manager dispatches by propolis_id directly via self.jobs.get(&propolis_id), removing the live-migration ambiguity where source and target VMMs for the same instance could both hold entries keyed by instance_id.
  • v7 endpoints remain on the trait as deprecated shims that delegate to the v41 handler with the path's propolis_id. No propolis-to-instance lookup is performed.
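
A rough sketch of the dispatch-by-`propolis_id` idea follows. The types here (`PropolisId`, `InstanceId`, `Job`, `InstanceManager`) are simplified, hypothetical stand-ins for the sled-agent's own; only the `jobs.get(&propolis_id)`-style lookup reflects what the description states.

```rust
use std::collections::BTreeMap;

// Hypothetical simplified ID types; the real ones are typed UUIDs.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct PropolisId(u128);
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct InstanceId(u128);

/// Hypothetical per-VMM job record.
#[derive(Debug)]
struct Job {
    instance_id: InstanceId,
    groups: Vec<String>,
}

/// Hypothetical manager keyed by VMM rather than by instance.
#[derive(Debug, Default)]
struct InstanceManager {
    jobs: BTreeMap<PropolisId, Job>,
}

impl InstanceManager {
    /// Subscribe exactly one VMM to a group. During live migration the
    /// source and target VMMs share an instance ID but have distinct
    /// Propolis IDs, so this lookup is unambiguous.
    fn join_multicast_group(
        &mut self,
        propolis_id: PropolisId,
        group: &str,
    ) -> Result<(), String> {
        let job = self
            .jobs
            .get_mut(&propolis_id)
            .ok_or_else(|| format!("no VMM {:?}", propolis_id))?;
        if !job.groups.iter().any(|g| g == group) {
            job.groups.push(group.to_string());
        }
        Ok(())
    }
}
```

Because the key is the VMM, a join aimed at the migration target never lands on the source, and a request for an unknown VMM fails instead of guessing.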

Per-pass sled-to-port resolution

Delivers the design captured in the prior TODO: prefer DDM's authoritative view of sled-to-port reachability over inventory, with inventory as cross-validation rather than the primary input.

  • Replaces the previous TTL'd sled-mapping cache with a single-pass amortization built once at the top of the member reconciler pass and threaded through the per-pass reconciler context.
  • DDM peer topology is the primary source. Inventory + DPD backplane is the fallback and supplements partial DDM coverage (per-sled gap-fill) rather than being all-or-nothing.
  • Parsed peer port IDs are cross-validated against the DPD backplane map.
  • Sequential per-switch fallback for shared-state DPD reads (backplane map, underlay group fetch), so a single unhealthy switch can't fail the whole read.
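
The resolution order above might look something like the following sketch. All names are hypothetical, IDs are simplified to integers and strings, and one behaviour is an assumption rather than something the description states: a DDM-reported port that fails backplane cross-validation is discarded and the inventory entry fills the gap.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical simplified types for illustration.
type SledId = u32;
type PortId = String;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Source {
    Ddm,
    Inventory,
}

/// Build the per-pass sled-to-port map once: DDM first, validated
/// against the backplane map, with per-sled inventory gap-fill.
fn resolve_sled_ports(
    sleds: &[SledId],
    ddm: &BTreeMap<SledId, PortId>,
    inventory: &BTreeMap<SledId, PortId>,
    backplane: &BTreeSet<PortId>,
) -> BTreeMap<SledId, (PortId, Source)> {
    let mut out = BTreeMap::new();
    for &sled in sleds {
        let pick = ddm
            .get(&sled)
            // Assumption: DDM ports unknown to the backplane are dropped.
            .filter(|port| backplane.contains(*port))
            .map(|port| (port.clone(), Source::Ddm))
            .or_else(|| {
                inventory
                    .get(&sled)
                    .map(|port| (port.clone(), Source::Inventory))
            });
        if let Some(entry) = pick {
            out.insert(sled, entry);
        }
    }
    out
}

/// Try each switch in order and return the first successful read, so a
/// single unhealthy switch does not fail a shared-state read.
fn read_from_any_switch<T, E>(
    switches: &[&str],
    mut read: impl FnMut(&str) -> Result<T, E>,
) -> Result<T, Vec<E>> {
    let mut errors = Vec::new();
    for &switch in switches {
        match read(switch) {
            Ok(value) => return Ok(value),
            Err(e) => errors.push(e),
        }
    }
    Err(errors)
}
```

Recording the `Source` next to each port is one way to make DDM-vs-inventory drift visible in logs or tests; whether the reconciler tracks it this way is not stated in the description.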

Saga and RPW interaction

  • Saga state guard widened: the DPD-ensure saga accepts "Active" as well as "Creating" so crash-recovery re-execution doesn't roll back already-applied DPD state.
  • instance_stop detaches multicast members and activates the reconciler only after sled-agent acknowledges the Stop request, avoiding M2P / forwarding teardown for a still-running guest if the Stop request fails.
  • MRIB advertisement is gated on a dpd_synced flag rather than running unconditionally after the DPD match arm, so a DPD failure no longer leaves a route advertised via DDM with no programmed forwarding state.
  • MulticastGroupReconcilerConfig gains group_concurrency_limit (default 16) and member_concurrency_limit (default 32) to bound the per-pass fan-out of the reconciler's buffer_unordered streams.
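
The saga guard and the `dpd_synced` gate can be pictured with the sketch below. The enum and function names are hypothetical (the `Deleting` variant in particular is only there to give the guard something to reject), and the choice to withdraw whenever no "Joined" members remain, regardless of DPD state, is an assumption of this sketch.

```rust
/// Hypothetical subset of group states, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GroupState {
    Creating,
    Active,
    Deleting,
}

/// Widened guard: a re-executed DPD-ensure step accepts "Active" as
/// well as "Creating", so crash recovery does not undo applied state.
fn dpd_ensure_allowed(state: GroupState) -> bool {
    matches!(state, GroupState::Creating | GroupState::Active)
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MribAction {
    Advertise,
    Withdraw,
    Skip,
}

/// Gate MRIB advertisement on DPD having been programmed this pass.
fn mrib_action(dpd_synced: bool, joined_members: usize) -> MribAction {
    if joined_members == 0 {
        // Assumption: withdrawal does not depend on DPD state.
        MribAction::Withdraw
    } else if dpd_synced {
        MribAction::Advertise
    } else {
        // DPD failed: never advertise a route with no forwarding state.
        MribAction::Skip
    }
}
```

The key property is the last arm: when DPD programming fails, the pass produces no advertisement and leaves it to a later pass to retry.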

Tests and simulation

  • Integration coverage for MRIB programming, DDM-vs-inventory drift, saga idempotent crash-recovery, per-switch invariant checks, and underlay MAC filter lifecycle.
  • populate_ddm_peers synthesizes DDM peer topology from datastore + inventory so tests exercise the production primary path instead of the inventory fallback an empty DdmInstance would otherwise force. It rebuilds the peer map every call: the previous sled-id-keyed cache could reuse stale sp_slot-derived port names if inventory shape changed within the same sled set.
  • Sim alignments:
    • v7 multicast endpoints fall through to trait defaults instead of overriding with unimplemented!().
    • VMM existence check on join / leave restored.
    • sim_force_remove tolerates the same Object-not-present race that sim_poke already handles.

@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-mgd-ddm branch 5 times, most recently from ca48d80 to 15e64aa Compare May 1, 2026 15:59
Wires MGD (MRIB programming) and DDM (live peer topology for
sled-to-switch-port resolution) into the multicast reconciler RPW. The
reconciler resolves sled-to-port mapping via DDM peers (primary, live
source) and falls back to inventory + DPD backplane when DDM is
unavailable. MRIB routes are advertised through MGD and withdrawn when
no "Joined" members remain.

Multicast is *instance networking* under the planned migration of
system-level networking from Nexus RPWs to sled-agent reconcilers
([omicron#10167](#10167)).

### Sled-side underlay NIC filter programming

- `set_mcast_m2p` / `clear_mcast_m2p` in the OPTE port manager hold UDP
  sockets joined to the underlay multicast group on each underlay NIC.
  Joining the group on a held socket triggers `mac_multicast_add` in the
  kernel, which programs the per-NIC multicast MAC filter so cxgbe
  delivers frames to xde. Workaround for opte#908.
- Eager rehydration at sled-agent startup reopens those filter sockets
  for M2P entries that survive in xde across a restart. Rehydration
  failures clear the surviving M2P entry so convergence retries on the
  next pass instead of black-holing the group.

### Switch-zone integration

- New `MulticastSwitchZoneClient` fans out per-switch MGD and DDM
  clients, discovered via internal DNS SRV records. The reconciler uses
  it for MRIB writes and live peer queries (consuming the
  ddm-admin-client `GET /peers` endpoint that returns `if_name` / port
  info per peer).
- `ServiceName::Ddm` registered in internal DNS via `host_zone_switch`
  (now takes a `ddm_port`) so cross-sled consumers can discover `ddmd`
  in switch zones. RSS, the test starter, and `overridables_for_test`
  thread the new port through. The multicast reconciler is the first
  cross-sled consumer; previously, all `DdmAdminClient` callers were
  sled-local via `DdmAdminClient::localhost`.
- Resolver helper preserves SRV target names alongside resolved sockets,
  enabling per-target correlation when multiple switch zones share an
  address but differ by port.

*Note*: the first reconciler pass after upgrade publishes one new 
`_ddm._tcp` SRV record per switch zone, causing a one-time DNS generation 
bump.

### Instance-scoped multicast subscriptions

- v36 (`VERSION_MCAST_M2P_FORWARDING`) introduces
  `PUT/DELETE /instances/{instance_id}/multicast-group`, replacing the
  earlier VMM-keyed `/vmms/{propolis_id}/multicast-group` shape.
  Sled-agent resolves the active VMM under its instance-state lock and
  dispatches to OPTE atomically, eliminating a Nexus-side lookup-vs-call
  race where a migration commit could land subscriptions on a stale
  propolis.
- v7 endpoints remain on the trait as deprecated shims that perform the
  propolis-to-instance lookup and delegate to the new handler.
- Nexus drops `cached_propolis_id` and `lookup_propolis_id` plumbing
  through the reconciler entirely. `subscribe_vmm` / `unsubscribe_vmm`
  become `subscribe_instance` / `unsubscribe_instance`. 

### Per-pass sled-to-port resolution

Delivers the design captured in the prior TODO: prefer DDM's
authoritative view of sled-to-port reachability over inventory, with
inventory as cross-validation rather than the primary input.

- Replaces the previous TTL'd sled-mapping cache with a single-pass
  amortization built once at the top of the member reconciler pass and
  threaded through the per-pass reconciler context.
- DDM peer topology is the primary source. Inventory + DPD backplane is
  the fallback and supplements partial DDM coverage (per-sled gap-fill)
  rather than being all-or-nothing.
- Parsed peer port IDs are cross-validated against the DPD backplane
  map.
- Sequential per-switch fallback for shared-state DPD reads (backplane
  map, underlay group fetch), so a single unhealthy switch can't fail
  the whole read.

### Saga and RPW interaction

- Saga state guard widened: the DPD-ensure saga accepts "Active" as
  well as "Creating" so crash-recovery re-execution doesn't roll back
  already-applied DPD state.
- `instance_stop` detaches multicast members and activates the
  reconciler only after sled-agent acknowledges the Stop request,
  avoiding M2P / forwarding teardown for a still-running guest if Stop
  fails.

### Test updates

- Integration coverage for MRIB programming, DDM-vs-inventory drift,
  saga idempotent crash-recovery, per-switch invariant checks, and
  underlay MAC filter lifecycle.
- New `populate_ddm_peers` test helper synthesizes DDM peer topology
  from datastore + inventory so tests exercise the production primary
  path instead of the inventory fallback that an empty `DdmInstance`
  would otherwise force. Cache keyed on the in-service sled-set so
  multi-sled fixtures rebuild on sled transitions.
@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-mgd-ddm branch from 15e64aa to 92d912b Compare May 1, 2026 16:06
@zeeshanlakhani zeeshanlakhani self-assigned this May 2, 2026
@zeeshanlakhani zeeshanlakhani requested review from internet-diglett and jgallagher and removed request for jgallagher May 2, 2026 08:16
@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-m2p-forwarding branch 3 times, most recently from 44e7675 to ef44f19 Compare May 7, 2026 12:05
@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-mgd-ddm branch from 5b2a9a0 to 76d0d3f Compare May 26, 2026 08:48
Final, pre-review pass on this work. It stacks atop #10070 and inherits
the multicast-to-physical (M2P) underlay forwarding and VMM-keyed instance
subscription endpoints.

This also builds on and integrates #10381.

On top of these foundations, this work includes the final pass on mgd-ddmd
integration:

* Reconciler correctness:
  * `set_mcast_m2p` rolls back the xde M2P entry on per-NIC join
    failure, so the reconciler converges on a retry instead of
    leaving stale state pointing at the wrong underlay address.
  * `propolis_id` is threaded end-to-end through the sled-agent
    multicast endpoints to deal with live migration ambiguity.
  * MRIB advertisement is gated on a flag rather than running unconditionally
    after the DPD match arm, so that a DPD failure no longer leaves
    a route advertised via DDM with no programmed forwarding state.

* OPTE hardening (illumos-utils):
  * M2P entries upserted into a `BTreeMap<IpAddr, MulticastUnderlay>`
    rather than a `Vec` on the non-illumos mock, eliminating duplicate-key
    corner cases the production map already avoided.
  * `MulticastFilterMap` encapsulates the per-NIC filter socket and
    refcount state previously open-coded inside `PortManagerInner`,
    concentrating the "join socket per underlay group per NIC"
    invariant into a single type.
  * `underlay_nics` typed as `&[AddrObject]` rather than `&[String]`.
  * Per-NIC `IPV6_JOIN_GROUP` calls converted from `libc::setsockopt` to
    `nix::sys::socket::setsockopt` for the typed binding.

* Sled-agent (real and sim):
  * Sim v7 multicast endpoints fall through to the trait defaults
    instead of overriding with just `unimplemented!()`, matching how
    other versioned endpoints behave in the sim.
  * Sim VMM existence check on join/leave restored.

* Configuration:
  * `MulticastGroupReconcilerConfig` gains a `group_concurrency_limit`
    and a `member_concurrency_limit`, bounding the per-pass fan-out of the
    RPW's `buffer_unordered` streams.

* Test infra:
  * `populate_ddm_peers` no longer caches the peer map. The previous
    cache was keyed by sled-id set, but the synthesized port names
    embedded each sled's `sp_slot` from inventory, so cache reuse
    within the same sled set could produce stale port mappings.

* Documentation cleanup across the RPW, sled-agent multicast paths, and the
  new(er) sled-agent types module.
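
As a small illustration of the mock-side change from `Vec` to `BTreeMap`, the sketch below shows why an upsert into a map cannot produce duplicate keys. `MockM2p` and the single-field `MulticastUnderlay` are hypothetical shapes chosen for the example, not the illumos-utils definitions.

```rust
use std::collections::BTreeMap;
use std::net::{IpAddr, Ipv6Addr};

/// Hypothetical shape; the real type may carry more than an address.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MulticastUnderlay(Ipv6Addr);

/// Hypothetical mock M2P table keyed by overlay group address.
#[derive(Debug, Default)]
struct MockM2p {
    entries: BTreeMap<IpAddr, MulticastUnderlay>,
}

impl MockM2p {
    /// Upsert: a second `set` for the same group replaces the first
    /// entry and returns the old value, instead of appending a duplicate.
    fn set(
        &mut self,
        group: IpAddr,
        underlay: MulticastUnderlay,
    ) -> Option<MulticastUnderlay> {
        self.entries.insert(group, underlay)
    }

    /// Remove the entry for `group`, reporting whether one existed.
    fn clear(&mut self, group: &IpAddr) -> bool {
        self.entries.remove(group).is_some()
    }
}
```

With a `Vec`, two `set` calls for one group would need an explicit search to avoid leaving two entries, and a single `clear` could leave one behind; the map makes both cases structurally impossible.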
@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-mgd-ddm branch from 76d0d3f to 0e64f2d Compare May 26, 2026 10:02