Prior to this PR, we had two different implementations that build up a
`HashMap<SwitchSlot, MgdClient>` via the same process:
1. Call `switch_zone_address_mappings()` to get a `HashMap<SwitchSlot,
Ipv6Addr>` via a couple different DNS lookups + a query to each switch
zone's MGS
2. Look up MGD in DNS
3. Pair up the IP addresses from 1 and 2 to build a `HashMap<SwitchSlot,
MgdClient>` using the ports from 2; for any IPs in 1 that we didn't find
in 2, construct a client pointed to the IP from 1 with the hardcoded
`MGD_PORT`
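For illustration, the pairing in step 3 can be sketched roughly as below. This is a simplified sketch, not the code this PR removes: `SwitchSlot` and `MgdClient` are hypothetical stand-ins for the real types, `pair_up_by_ip` is an invented name, and the `MGD_PORT` value is a placeholder.

```rust
use std::collections::HashMap;
use std::net::{Ipv6Addr, SocketAddrV6};

// Hypothetical stand-in for the real `SwitchSlot` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum SwitchSlot {
    Switch0,
    Switch1,
}

// Hypothetical stand-in for the real `MgdClient`; it only records the target address.
#[derive(Debug, PartialEq, Eq)]
struct MgdClient {
    addr: SocketAddrV6,
}

// Placeholder value (assumption); the real constant is defined elsewhere.
const MGD_PORT: u16 = 4676;

// Step 3: match each slot's IP (from step 1) against the MGD DNS results
// (from step 2). Use the DNS port when an IP matches, else fall back to
// the hardcoded `MGD_PORT`.
fn pair_up_by_ip(
    slot_to_ip: &HashMap<SwitchSlot, Ipv6Addr>,
    mgd_dns: &[SocketAddrV6],
) -> HashMap<SwitchSlot, MgdClient> {
    slot_to_ip
        .iter()
        .map(|(slot, ip)| {
            let port = mgd_dns
                .iter()
                .find(|a| a.ip() == ip)
                .map(|a| a.port())
                .unwrap_or(MGD_PORT);
            (*slot, MgdClient { addr: SocketAddrV6::new(*ip, port, 0, 0) })
        })
        .collect()
}
```

In this sketch, if both slots map to `::1` (as in tests) and DNS returns two entries with different ports, both slots end up with the port of the first matching entry, because the IP alone cannot tell the two apart.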
I believe the history here is that we used to not have step 2 at all,
and we just slapped `MGD_PORT` on all the IPs from 1 (back when MGD
didn't have its own DNS entries). And all of this predates being able to
ask MGD itself which switch slot it is (internally, it asks MGS over
`::1` within its own switch zone).
This PR makes two changes: it squishes us down to a single
implementation, and changes the mechanics of that implementation to:
1. Look up MGD in DNS
2. For each entry returned, construct an `MgdClient` and ask it for its
switch slot
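A rough sketch of the new mechanics, under stated assumptions: `SwitchSlot` and `MgdClient` are hypothetical stand-ins for the real types, `mgd_clients_by_slot` is an invented name, and the "ask MGD for its switch slot" call is modeled as a caller-supplied closure because the real client API is not shown here. Failing the whole lookup on the first error is also an assumption; the actual implementation may handle per-entry errors differently.

```rust
use std::collections::HashMap;
use std::net::SocketAddrV6;

// Hypothetical stand-in for the real `SwitchSlot` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum SwitchSlot {
    Switch0,
    Switch1,
}

// Hypothetical stand-in for the real `MgdClient`; it only records the target address.
#[derive(Debug, PartialEq, Eq)]
struct MgdClient {
    addr: SocketAddrV6,
}

// For each MGD address returned by DNS (step 1), build a client and ask it
// which switch slot it is in (step 2). `ask_slot` models that query.
fn mgd_clients_by_slot<F>(
    mgd_dns: &[SocketAddrV6],
    ask_slot: F,
) -> Result<HashMap<SwitchSlot, MgdClient>, String>
where
    F: Fn(&MgdClient) -> Result<SwitchSlot, String>,
{
    let mut clients = HashMap::new();
    for addr in mgd_dns {
        // IP and port both come straight from DNS; nothing is hardcoded.
        let client = MgdClient { addr: *addr };
        // Assumption: propagate the first failure rather than skipping the entry.
        let slot = ask_slot(&client)?;
        clients.insert(slot, client);
    }
    Ok(clients)
}
```

Because each client keeps the exact address from DNS, two MGD instances that both listen on `::1` with different ports still map to distinct slots in this sketch.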
This potentially introduces a new failure mode: if MGD doesn't know its
own switch slot but we would've been able to ask MGS at the same IP for
the switch slot, the prior implementation would've worked and this PR
won't. But that failure mode ought to only be possible if MGD itself is
buggy: it asking MGS over localhost for the switch slot should, in
general, be more successful than Nexus finding MGS in DNS then asking
MGS for the switch slot, since there are a lot more moving pieces in
that path.
But I think the upsides are enough to make up for that risk - we
streamline the process, reduce duplication, the new implementation has
fewer places to fail transiently overall, and we always use port numbers
from DNS (which should make this more reliable in tests - in tests we
basically never want to assume the fixed `MGD_PORT`, and pairing up IP
addresses also doesn't usually work since they're all `::1`).
I'd like to remove `switch_zone_address_mappings()` altogether, but
its two remaining uses are both blocked: lldp by #10361, and building
up the scrimlet clients by #10167.
dev-tools/omdb/tests/successes.out: 2 additions & 2 deletions
@@ -626,7 +626,7 @@ task: "bfd_manager"
 configured period: every <REDACTED_DURATION>s
 last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
 started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
-last completion reported error: failed to resolve addresses for Dendrite services: proto error: no records found for Query { name: Name("_dendrite._tcp.control-plane.oxide.internal."), query_type: SRV, query_class: IN }
+last completion reported error: proto error: no records found for Query { name: Name("_mgd._tcp.control-plane.oxide.internal."), query_type: SRV, query_class: IN }
 
 task: "blueprint_planner"
 configured period: every <REDACTED_DURATION>m
@@ -1309,7 +1309,7 @@ task: "bfd_manager"
 configured period: every <REDACTED_DURATION>s
 last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
 started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
-last completion reported error: failed to resolve addresses for Dendrite services: proto error: no records found for Query { name: Name("_dendrite._tcp.control-plane.oxide.internal."), query_type: SRV, query_class: IN }
+last completion reported error: proto error: no records found for Query { name: Name("_mgd._tcp.control-plane.oxide.internal."), query_type: SRV, query_class: IN }