[multicast] connect MGD and DDM to Omicron #10346
Open
zeeshanlakhani wants to merge 9 commits into
Wires MGD (MRIB programming) and DDM (live peer topology for sled-to-switch-port resolution) into the multicast reconciler RPW. The reconciler resolves sled-to-port mapping via DDM peers (primary, live source) and falls back to inventory + DPD backplane when DDM is unavailable. MRIB routes are advertised through MGD and withdrawn when no "Joined" members remain.

Multicast is *instance networking* under the planned migration of system-level networking from Nexus RPWs to sled-agent reconcilers ([omicron#10167](#10167)).

### Sled-side underlay NIC filter programming

- `set_mcast_m2p` / `clear_mcast_m2p` in the OPTE port manager hold UDP sockets joined to the underlay multicast group on each underlay NIC. Joining the group on a held socket triggers `mac_multicast_add` in the kernel, which programs the per-NIC multicast MAC filter so cxgbe delivers frames to xde. Workaround for opte#908.
- Eager rehydration at sled-agent startup reopens those filter sockets for M2P entries that survive in xde across a restart. Rehydration failures clear the surviving M2P entry so convergence retries on the next pass instead of black-holing the group.

### Switch-zone integration

- New `MulticastSwitchZoneClient` fans out per-switch MGD and DDM clients, discovered via internal DNS SRV records. The reconciler uses it for MRIB writes and live peer queries (consuming the ddm-admin-client `GET /peers` endpoint that returns `if_name` / port info per peer).
- `ServiceName::Ddm` registered in internal DNS via `host_zone_switch` (now takes a `ddm_port`) so cross-sled consumers can discover `ddmd` in switch zones. RSS, the test starter, and `overridables_for_test` thread the new port through. The multicast reconciler is the first cross-sled consumer; previously, all `DdmAdminClient` callers were sled-local via `DdmAdminClient::localhost`.
- Resolver helper preserves SRV target names alongside resolved sockets, enabling per-target correlation when multiple switch zones share an address but differ by port.

*Note*: the first reconciler pass after upgrade publishes one new `_ddm._tcp` SRV record per switch zone, causing a one-time DNS generation bump.

### Instance-scoped multicast subscriptions

- v36 (`VERSION_MCAST_M2P_FORWARDING`) introduces `PUT/DELETE /instances/{instance_id}/multicast-group`, replacing the earlier VMM-keyed `/vmms/{propolis_id}/multicast-group` shape. Sled-agent resolves the active VMM under its instance-state lock and dispatches to OPTE atomically, eliminating a Nexus-side lookup-vs-call race where a migration commit could land subscriptions on a stale propolis.
- v7 endpoints remain on the trait as deprecated shims that perform the propolis-to-instance lookup and delegate to the new handler.
- Nexus drops `cached_propolis_id` and `lookup_propolis_id` plumbing through the reconciler entirely. `subscribe_vmm` / `unsubscribe_vmm` become `subscribe_instance` / `unsubscribe_instance`.

### Per-pass sled-to-port resolution

Delivers the design captured in the prior TODO: prefer DDM's authoritative view of sled-to-port reachability over inventory, with inventory as cross-validation rather than the primary input.

- Replaces the previous TTL'd sled-mapping cache with a single-pass amortization built once at the top of the member reconciler pass and threaded through the per-pass reconciler context.
- DDM peer topology is the primary source. Inventory + DPD backplane is the fallback and supplements partial DDM coverage (per-sled gap-fill) rather than being all-or-nothing.
- Parsed peer port IDs are cross-validated against the DPD backplane map.
- Sequential per-switch fallback for shared-state DPD reads (backplane map, underlay group fetch), so a single unhealthy switch can't fail the whole read.
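The resolution order described above (DDM first, inventory as per-sled gap-fill, backplane map as the validity check) could be sketched roughly as follows. This is an illustration only, not the reconciler's code: `SledId`, `PortId`, `Source`, and `resolve_ports` are hypothetical names, and filtering the inventory entries against the backplane map is an assumption made for the sketch.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-ins for the real sled and switch-port identifiers.
type SledId = u32;
type PortId = String;

/// Where a resolved mapping came from.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Source {
    Ddm,
    Inventory,
}

/// Build the sled-to-port map once, at the top of a reconciler pass.
///
/// `ddm` is the live view (possibly partial or empty), `inventory` is the
/// fallback view, and `backplane` is the set of port IDs known to DPD.
fn resolve_ports(
    ddm: &BTreeMap<SledId, PortId>,
    inventory: &BTreeMap<SledId, PortId>,
    backplane: &BTreeSet<PortId>,
) -> BTreeMap<SledId, (PortId, Source)> {
    let mut resolved = BTreeMap::new();

    // Primary source: DDM peers, kept only when the backplane map
    // confirms that the parsed port exists.
    for (sled, port) in ddm {
        if backplane.contains(port) {
            resolved.insert(*sled, (port.clone(), Source::Ddm));
        }
    }

    // Per-sled gap-fill: inventory supplies only the sleds that DDM did
    // not cover, so partial DDM coverage is not all-or-nothing.
    // (Assumption: inventory ports are validated the same way.)
    for (sled, port) in inventory {
        if backplane.contains(port) {
            resolved
                .entry(*sled)
                .or_insert_with(|| (port.clone(), Source::Inventory));
        }
    }

    resolved
}
```

Because the map is rebuilt on every pass, there is no TTL to tune and no stale entry to invalidate; the cost is one DDM query and one backplane read per pass.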
### Saga and RPW interaction

- Saga state guard widened: the DPD-ensure saga accepts "Active" as well as "Creating" so crash-recovery re-execution doesn't roll back already-applied DPD state.
- `instance_stop` detaches multicast members and activates the reconciler only after sled-agent acknowledges the Stop request, avoiding M2P / forwarding teardown for a still-running guest if Stop fails.

### Test updates

- Integration coverage for MRIB programming, DDM-vs-inventory drift, saga idempotent crash-recovery, per-switch invariant checks, and underlay MAC filter lifecycle.
- New `populate_ddm_peers` test helper synthesizes DDM peer topology from datastore + inventory so tests exercise the production primary path instead of the inventory fallback that an empty `DdmInstance` would otherwise force. Cache keyed on the in-service sled-set so multi-sled fixtures rebuild on sled transitions.
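The widened saga state guard from "Saga and RPW interaction" amounts to accepting two states instead of one. A minimal sketch, assuming a simplified state enum: `GroupState`, its `Deleting` variant, and `dpd_ensure_allowed` are hypothetical names, and the real state set may differ.

```rust
/// Hypothetical subset of multicast group states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GroupState {
    Creating,
    Active,
    Deleting,
}

/// Returns true if the DPD-ensure step may run for a group in `state`.
///
/// Accepting `Active` lets a saga that is re-executed after a crash
/// re-apply its DPD step instead of failing and unwinding state that
/// was already programmed.
fn dpd_ensure_allowed(state: GroupState) -> bool {
    matches!(state, GroupState::Creating | GroupState::Active)
}
```

The guard stays safe under re-execution only if the DPD-ensure step itself is idempotent, which is what the "saga idempotent crash-recovery" integration coverage exercises.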
…nto m2p-forwarding
…actual opte-api check)
Final, pre-review pass on this work. It stacks atop #10070 and inherits the multicast-to-physical (M2P) underlay forwarding and VMM-keyed instance subscription endpoints. This also builds on and integrates #10381.

Above these foundations, this work includes the final pass on mgd-ddmd integration:

* Reconciler correctness:
  * `set_mcast_m2p` rolls back the xde M2P entry on per-NIC join failure, so the reconciler converges on a retry instead of leaving stale state pointing at the wrong underlay address.
  * `propolis_id` is threaded end-to-end through the sled-agent multicast endpoints to deal with live migration ambiguity.
  * MRIB advertisement is gated on a flag rather than running unconditionally after the DPD match arm, so that a DPD failure no longer leaves a route advertised via DDM with no programmed forwarding state.
* OPTE hardening (illumos-utils):
  * M2P entries are upserted into a `BTreeMap<IpAddr, MulticastUnderlay>` rather than a `Vec` on the non-illumos mock, eliminating duplicate-key corner cases the production map already avoided.
  * `MulticastFilterMap` encapsulates the per-NIC filter socket and refcount state previously open-coded inside `PortManagerInner`, concentrating the "join socket per underlay group per NIC" invariant into a single type.
  * `underlay_nics` is typed as `&[AddrObject]` rather than `&[String]`.
  * Per-NIC `IPV6_JOIN_GROUP` calls are converted from `libc::setsockopt` to `nix::sys::socket::setsockopt` for the typed bind.
* Sled-agent (real and sim):
  * Sim v7 multicast endpoints fall through to the trait defaults instead of overriding with just `unimplemented!()`, matching how other versioned endpoints behave in the sim.
  * Sim VMM existence check on join/leave restored.
* Configuration:
  * `MulticastGroupReconcilerConfig` gains a `group_concurrency_limit` and a `member_concurrency_limit` bounding the per-pass fan-out of the RPW's `buffer_unordered` streams.
* Test infra:
  * `populate_ddm_peers` no longer caches the peer map. The previous cache was keyed by sled-id set, but the synthesized port names embedded each sled's `sp_slot` from inventory, so cache reuse within the same sled set could produce stale port mappings.
* Documentation cleanup across the RPW, sled-agent multicast paths, and the new(er) sled-agent types module.
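To illustrate the "join socket per underlay group per NIC" invariant and the all-or-nothing behaviour on join failure, here is a rough sketch using only the standard library. It is not the `MulticastFilterMap` from this PR: `FilterMap`, `NicJoin`, `acquire`, `release`, and the injected `join` closure are hypothetical, and real sockets are replaced by a placeholder handle so the logic can be read in isolation.

```rust
use std::collections::BTreeMap;
use std::net::Ipv6Addr;

/// Hypothetical handle standing in for a UDP socket joined to a group on
/// one NIC. In a real implementation, dropping it would leave the group.
#[derive(Debug)]
struct NicJoin {
    nic: String,
}

/// Refcounted join state for one underlay group.
#[derive(Debug)]
struct Entry {
    refs: usize,
    handles: Vec<NicJoin>,
}

/// Sketch of a filter map: one set of per-NIC joins per underlay group.
#[derive(Debug, Default)]
struct FilterMap {
    groups: BTreeMap<Ipv6Addr, Entry>,
}

impl FilterMap {
    /// Take a reference on `group`, joining on every NIC on first use.
    /// If any join fails, nothing is recorded, so the caller can roll
    /// back its own state (for example an M2P entry) and retry later.
    fn acquire<F>(
        &mut self,
        group: Ipv6Addr,
        nics: &[&str],
        mut join: F,
    ) -> Result<(), String>
    where
        F: FnMut(&str, Ipv6Addr) -> Result<NicJoin, String>,
    {
        if let Some(entry) = self.groups.get_mut(&group) {
            entry.refs += 1;
            return Ok(());
        }
        let mut handles = Vec::with_capacity(nics.len());
        for &nic in nics {
            // On error, `?` returns early and `handles` is dropped, which
            // would close any joins already made in a real implementation.
            handles.push(join(nic, group)?);
        }
        self.groups.insert(group, Entry { refs: 1, handles });
        Ok(())
    }

    /// Drop one reference; the joins go away with the last reference.
    fn release(&mut self, group: Ipv6Addr) {
        if let Some(entry) = self.groups.get_mut(&group) {
            entry.refs -= 1;
            if entry.refs == 0 {
                self.groups.remove(&group);
            }
        }
    }
}
```

Keeping the refcount and the handles in one entry means a group can never be recorded as joined on only some NICs, which is the state the rollback in `set_mcast_m2p` is meant to avoid.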
This PR stacks atop #10070 and inherits the multicast-to-physical (M2P) underlay forwarding and VMM-keyed instance subscription endpoints.
This also builds on and integrates #10381.
Here, we wire MGD (MRIB programming) and DDM (live peer topology for sled-to-switch-port resolution) into the multicast reconciler RPW. The reconciler resolves sled-to-port mapping via DDM peers (primary, live source) and falls back to inventory + DPD backplane when DDM is unavailable. MRIB routes are advertised through MGD and withdrawn when no "Joined" members remain.
Multicast is instance networking under the planned migration of system-level networking from Nexus RPWs to sled-agent reconcilers (omicron#10167).
### Sled-side underlay NIC filter programming

- `set_mcast_m2p` / `clear_mcast_m2p` in the OPTE port manager hold UDP sockets joined to the underlay multicast group on each underlay NIC. Joining the group on a held socket triggers `mac_multicast_add` in the kernel, which programs the per-NIC multicast MAC filter so cxgbe delivers frames to xde. Workaround for opte#908.

### Switch-zone integration

- `MulticastSwitchZoneClient` fans out per-switch MGD and DDM clients, discovered via internal DNS SRV records. The reconciler uses it for MRIB writes and live peer queries (consuming the ddm-admin-client `GET /peers` endpoint that returns `if_name` / port info per peer).
- `ServiceName::Ddm` registered in internal DNS via `host_zone_switch` (now takes a `ddm_port`) so cross-sled consumers can discover `ddmd` in switch zones. RSS, the test starter, and `overridables_for_test` thread the new port through. The multicast reconciler is the first cross-sled consumer; previously, all `DdmAdminClient` callers were sled-local via `DdmAdminClient::localhost`.

*Note*: the first reconciler pass after upgrade publishes one new `_ddm._tcp` SRV record per switch zone, causing a one-time DNS generation bump.

### VMM-scoped multicast subscriptions

- … (`VERSION_MCAST_M2P_FORWARDING`) keeps the VMM-keyed `PUT/DELETE /vmms/{propolis_id}/multicast-group` shape and threads `propolis_id` end-to-end through the sled-agent.
- `instance_manager` dispatches by `propolis_id` directly via `self.jobs.get(&propolis_id)`, removing the live-migration ambiguity where source and target VMMs for the same instance could both hold entries keyed by `instance_id`.
- … `propolis_id`. No propolis-to-instance lookup is performed.

### Per-pass sled-to-port resolution
Delivers the design captured in the prior TODO: prefer DDM's authoritative view of sled-to-port reachability over inventory, with inventory as cross-validation rather than the primary input.
### Saga and RPW interaction

- `instance_stop` detaches multicast members and activates the reconciler only after sled-agent acknowledges the Stop request, avoiding M2P / forwarding teardown for a still-running guest if "Stop" fails.
- MRIB advertisement is gated on a `dpd_synced` flag rather than running unconditionally after the DPD match arm, so a DPD failure no longer leaves a route advertised via DDM with no programmed forwarding state.
- `MulticastGroupReconcilerConfig` gains `group_concurrency_limit` (default 16) and `member_concurrency_limit` (default 32) to bound the per-pass fan-out of the reconciler's `buffer_unordered` streams.

### Tests and simulation
- `populate_ddm_peers` synthesizes DDM peer topology from datastore + inventory so tests exercise the production primary path instead of the inventory fallback an empty `DdmInstance` would otherwise force. It rebuilds the peer map on every call: the previous sled-id-keyed cache could reuse stale `sp_slot`-derived port names if the inventory shape changed within the same sled set.
- Sim v7 multicast endpoints fall through to the trait defaults instead of overriding with `unimplemented!()`.
- Sim VMM existence check on `join/leave` restored.
- `sim_force_remove` tolerates the same `Object`-not-present race that `sim_poke` already handles.
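The `dpd_synced` gating listed under "Saga and RPW interaction" can be pictured as a small decision function. This is a sketch under stated assumptions, not the reconciler's control flow: `MribAction` and `mrib_action` are hypothetical names, and treating withdrawal as independent of the DPD result is an assumption made here for illustration.

```rust
/// Hypothetical outcome of the MRIB step for one group.
#[derive(Debug, PartialEq, Eq)]
enum MribAction {
    Advertise,
    Withdraw,
    Skip,
}

/// Decide what to do with a group's MRIB route after the DPD step.
fn mrib_action(dpd_result: Result<(), String>, joined_members: usize) -> MribAction {
    // Set only when the DPD arm succeeded; a failure leaves it false.
    let dpd_synced = dpd_result.is_ok();

    if joined_members == 0 {
        // No "Joined" members remain, so the route is withdrawn.
        // (Assumption: withdrawal does not depend on `dpd_synced`.)
        MribAction::Withdraw
    } else if dpd_synced {
        MribAction::Advertise
    } else {
        // Forwarding state was not programmed, so advertising would
        // attract traffic that cannot be delivered.
        MribAction::Skip
    }
}
```

A skipped group is simply picked up again on the next reconciler pass, once DPD programming succeeds.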