[cfgmgrd] Add Monitor Link Group feature#4523
Open
srodd-nexthop wants to merge 12 commits into
Open
Conversation
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
This was referenced Apr 27, 2026
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
049c3d8 to
badc6b2
Compare
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Open
6 tasks
e311841 to
c6f660a
Compare
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Adds the Monitor Link Group feature to cfgmgr. A monitor link group tracks a set of uplink interfaces and automatically forces downlink interfaces admin-down when operational uplink count drops below the configured min-uplinks threshold, preventing traffic black-holing when upstream connectivity is lost. Design: - monitorlinkgroupmgr (new, runs inside intfmgrd): Subscribes to CONFIG_DB:MONITOR_LINK_GROUP and STATE_DB PORT/LAG tables. Maintains per-group state (up/down/pending) derived from the min-uplinks threshold. Writes STATE_DB:MONITOR_LINK_GROUP_STATE (group visibility) and STATE_DB:MONITOR_LINK_GROUP_MEMBER (per-port force-down signal consumed by portmgrd/teammgrd). Uses a SelectableTimer per group for the link-up-delay, which defers recovery to suppress churn from short uplink flaps. On initial group creation the delay is skipped if uplinks already meet the threshold. A port may serve as uplink in one group and downlink in another, enabling chain topologies where a force-down cascades transitively through a stack of groups. - portmgr: subscribes to STATE_DB:MONITOR_LINK_GROUP_MEMBER. applyEffectiveAdminStatus() combines CONFIG_DB admin_status with the monitor-link member state; the result is what actually reaches the kernel netdev. Both CONFIG_DB and STATE_DB change paths call the same helper. - teammgr: same pattern for PortChannel interfaces. LAG identity is confirmed via CONFIG_DB lookup rather than name-prefix matching. - portmgrd / teammgrd: pass STATE_MONITOR_LINK_GROUP_MEMBER_TABLE as a subscribed TableConnector so the consumer loop wakes on changes. - tests/test_monitorlinkgroupmgr.py: VS integration tests covering group lifecycle, uplink-down force-down and recovery, link-up-delay pending state, config admin_status interaction, and cross-group cascade topology. Companion PRs: sonic-yang-models (YANG model), sonic-utilities (show/config commands), SONiC/SONiC (HLD). Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
Prioritize netdev_oper_status over oper_status when both fields are present in a STATE_DB notification. netdev_oper_status is authoritative for Ethernet interfaces; oper_status is used as fallback for PortChannels which do not have a netdev_oper_status field. Previously both fields triggered independent updateMonitorLinkInterfaceState calls within the same notification, which could cause redundant state transitions. Now only one call is made per notification using the correct field for the interface type. Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
Bug fixes: - handleLinkupDelayExpired else-branch: STATE_DB stays "pending" after timer fires with insufficient uplinks; add writeMonitorLinkGroupStateToDb - doPortTableTask: DEL on STATE_PORT/LAG_TABLE ignored, causing ghost uplink counts; treat port removal as link-down - removeInterfaceFromGroup: called updateMonitorLinkGroupState on each removal causing intermediate state evaluations during interface list swap; deferred to end of updateGroupInterfaceLists - removeInterfaceFromGroup: role membership check tested any-role instead of the specified link_type; now role-specific - getInterfaceOperState: used PORTCHANNEL_PREFIX name check; replaced with try-port-then-lag table lookup Type / cleanup fixes: - down_group_count: declared as int, changed to uint32_t - doMonitorLinkGroupTask: retry return path was dead code; converted to void and removed the conditional wrapper in doTask - startLinkupDelayTimer: SelectableTimer constructed with 1s placeholder; use actual delay_seconds at construction time - PortMgr m_cfgLagMemberTable: initialized but never used; removed - doTask(SelectableTimer&): O(N) scan to find owning group; replaced with m_timerToGroup reverse map for O(1) lookup Refactor: - TeamMgr doLagTask and doMonitorLinkGroupMemberTask inlined the same monitor-link + config admin_status merge logic twice; extracted into applyEffectiveLagAdminStatus() helper, matching PortMgr's pattern Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
…ycle detection
Two HLD review action items wrapped together to land before opening the swss PR:
R-4 (terminology rename) — reviewers (Judson Wilson @ Nvidia, seconded by Sudheer
and Eswaran) flagged that the "uplink/downlink" wording is confusing because the
same port can be uplink in one group and downlink in another (cross-role). Rename:
MonitorLinkGroupInfo::
uplink_interfaces -> monitored_interfaces
downlink_interfaces -> managed_interfaces
uplink_up_count -> monitored_up_count
min_uplinks -> min_monitored_links
MonitorLinkInterfaceInfo::
uplink_groups -> monitored_groups
downlink_groups -> managed_groups
Method:
updateDownlinkInterfacesForGroupStateChange -> updateManagedInterfacesForGroupStateChange
CONFIG_DB field names:
"uplinks" -> "monitored-links"
"downlinks" -> "managed-links"
"min-uplinks" -> "min-monitored-links"
STATE_DB field names (MONITOR_LINK_GROUP_STATE):
"uplinks" -> "monitored-links"
"downlinks" -> "managed-links"
Internal link_type literal: "uplink"->"monitored", "downlink"->"managed".
R-6 (cycle detection) — reviewers (Anders Linn, seconded by Sudheer) flagged that
chained groups with cross-referenced monitored/managed sets can deadlock. New
MonitorLinkGroupMgr::wouldCreateDependencyCycle runs a DFS from the proposed
group: edge G -> H exists iff some port appears in G.monitored AND H.managed
(G's recovery waits on H). A cycle implies a no-recovery deadlock; the daemon
rejects the SET (no STATE_DB write, SWSS_LOG_ERROR). The existing m_monitorLinkGroups
is acyclic by induction (this check runs on every SET), so any newly introduced
cycle must involve the proposed group — DFS starts there and back-edges signal a
cycle. Helper parseInterfaceList factored out so the check and updateGroupInterfaceLists
share the same parsing.
VS test file updated for the rename (5 existing test cases). The R-6 VS tests
have been moved to sonic-mgmt-mlg (cycle detection is end-to-end-testable; VS
is sufficient for R-4 coverage).
Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
R-A from the HLD review action items. Operators today can read MLG state
(UP/DOWN/PENDING) but not "when did this last change?", "how far into the
link-up-delay are we?", or "is this flapping?" without grepping syslog.
This patch surfaces that in STATE_DB so `show monitor-link` (companion PR)
can render it directly.
MonitorLinkGroupInfo gets seven new fields:
bool has_state_change // false until first transition observed
std::string last_state_change_from/to // "up" | "down" | "pending"
std::string last_state_change_reason // human-readable
system_clock::time_point last_state_change_time
system_clock::time_point pending_start_time
uint64_t up_to_down_count // direct UP -> DOWN only
uint64_t down_to_up_count // any -> UP
Two new methods:
recordStateTransition(group, from, to, reason)
Sets last_state_change_*, sets pending_start_time when entering PENDING,
bumps counters per the rules above, emits SWSS_LOG_NOTICE. No-op when
from == to.
restoreGroupCountersFromStateDb(group)
Reload up/down counters from STATE_DB on first sighting of a group in
this process, so the counters survive an intfmgrd restart without
going non-monotonic from a telemetry/SNMP consumer's perspective.
Call sites wired up:
updateMonitorLinkGroupState: DOWN -> PENDING, DOWN -> UP, PENDING -> DOWN
(cancel), UP -> DOWN. The !should_be_up branch is restructured to make
the (is_pending && !current_state) vs current_state cases mutually
exclusive; adds a defensive SWSS_LOG_ERROR self-heal for the impossible
state where both is_up and pending_up are true.
handleLinkupDelayExpired: PENDING -> UP (timer expired) and PENDING ->
DOWN (threshold lost late).
handleGroupDelayChange: PENDING -> UP via delay -> 0 and via reduced
delay already elapsed.
handleMonitorLinkGroupSet: calls restoreGroupCountersFromStateDb on
is_new_group right after updateGroupConfiguration.
writeMonitorLinkGroupStateToDb emits new STATE_DB fields on
MONITOR_LINK_GROUP_STATE:
last_state_change_time (epoch seconds; only when has_state_change)
last_state_change_from / _to / _reason (same condition)
pending_start_time (epoch seconds; only when state == "pending")
up_to_down_count, down_to_up_count (always)
An explicit hdel of pending_start_time happens on every non-pending write
so external consumers never see stale data after the timer fires.
Reason text uses the renamed terminology, e.g.:
"monitored N/M >= min-monitored-links K, starting link-up-delay"
"link-up-delay expired (monitored N/M >= min-monitored-links K)"
All new STATE_DB fields are optional; groups that never transitioned
render exactly as before, preserving backward compat with the existing
show CLI parser.
Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
…, drop reason field
Two design simplifications on top of PR-A's transition-tracking work.
A. Collapse up_to_down_count + down_to_up_count into a single total_transitions.
Given the rules of the original asymmetric scheme:
d2u: bumped on DOWN->UP and PENDING->UP
u2d: bumped on UP->DOWN only
the invariant d2u - u2d in {0, 1} held in all reachable states (d2u == u2d
when currently DOWN or PENDING; d2u == u2d + 1 when currently UP). The two
fields carried one bit of information beyond last_state_change_to + state.
The replacement single counter is incremented on every recorded transition
(DOWN->PENDING, PENDING->UP, PENDING->DOWN, DOWN->UP, UP->DOWN). A full
UP->DOWN->UP cycle bumps it by 2. PENDING->DOWN now counts as a transition
too -- previously it was a "near-miss-that-failed" silently excluded from
u2d, but distinguishing it from a regular UP->DOWN is already done by
last_state_change_from so the counter doesn't need to encode it.
B. Drop last_state_change_reason.
With last_state_change_{from,to} present, reason is derivable: to=='down'
always means threshold lost, to=='up' or 'pending' always means threshold
met. The only diagnostic detail the free-text reason carried was the
snapshot count "monitored N/M" at transition time and the path
(timer-expired vs delay-reduced for PENDING->UP). Neither answers the
HLD review's core asks ("when did this last change?" / "is it flapping?")
and both can be inferred from the pre-existing SWSS_LOG_NOTICE lines if
needed.
This drops the std::ostringstream reason; reason << ...; block at all 8
recordStateTransition call sites (5 in updateMonitorLinkGroupState, 2 in
handleLinkupDelayExpired, 1 in handleGroupDelayChange wait... actually 2 in
handleGroupDelayChange after counting). Net ~50 lines removed from the cpp.
STATE_DB schema change on MONITOR_LINK_GROUP_STATE:
removed: up_to_down_count, down_to_up_count, last_state_change_reason
added: total_transitions
All three removed fields were introduced in the prior PR-A commit
(239e99c9), so no operator workflow depended on them yet.
Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
…nsitional shim STATE_MONITOR_LINK_GROUP_STATE_TABLE_NAME and STATE_MONITOR_LINK_GROUP_MEMBER_TABLE_NAME are added to sonic-swss-common/common/schema.h by sonic-net/sonic-swss-common#1181. Until that PR merges and the swss-common artifact pulled by sonic-swss CI carries those macros, this PR's build fails with 'STATE_MONITOR_LINK_GROUP_MEMBER_TABLE_NAME not declared in scope' in cfgmgr/teammgr*, portmgr*, and intfmgrd. Add #ifndef-guarded fallback definitions in cfgmgr/monitorlinkgroupmgr.h and include that header from the four cfgmgr files that reference the macros (portmgr.cpp, portmgrd.cpp, teammgr.cpp, teammgrd.cpp; the MLG manager already includes its own header). Once sonic-swss-common#1181 merges and schema.h carries these macros, the #ifndef guards skip the fallback and the upstream definitions take over. The shim can then be removed as a cleanup, but the #include lines stay since the files legitimately consume MLG-MEMBER state. Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
The standalone sonic-swss CI uses a published sonic-swss-common artifact that does not yet contain the CFG_ macro (added by buildimage#27004 via YANG schema generation) or the STATE_ macros (added by swss-common#1181). Add CFG_MONITOR_LINK_GROUP_TABLE_NAME to the existing #ifndef-guarded shim block so the build links cleanly until both dependencies merge. Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
d284807 to
5cb0148
Compare
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
The original doTask paths initialise admin_status to DEFAULT_ADMIN_STATUS_STR
("down") when the port/LAG first appears in CONFIG_DB without an explicit
admin_status. The new applyEffective[Lag]AdminStatus helpers re-read CONFIG_DB
and previously defaulted to up when the field was absent, which silently
flipped a fresh port from down to up. The portmgr_ut DoTask test catches
exactly this case (sets only speed/index, expects 'ip link set Ethernet0
down'). Switch both helpers to default to false to match DEFAULT_ADMIN_STATUS_STR.
Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
chrono::duration<int64_t>::count() returns long long on 32-bit ARM where long is 32-bit. Switch the two SWSS_LOG_INFO sites that print elapsed seconds to %lld with an explicit cast so -Werror=format doesn't trip the BuildArm armhf job. Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…linkgroupmgr The monitorlinkgroupmgr daemon prefers netdev_oper_status over oper_status when reading port state from STATE_DB:PORT_TABLE — netdev_oper_status is the authoritative kernel-netdev signal for Ethernet ports (see comment in doPortTableTask). Writing only oper_status is silently ignored if portsyncd in the dvs has already populated netdev_oper_status. This caused the 5 test_monitorlinkgroupmgr cases to fail in CI: with the real portsyncd writing netdev_oper_status=down at dvs boot, our test writes of oper_status=up were dropped, so the cfgmgr never saw the monitored interface come up and the group state machine never transitioned. Fix: write both netdev_oper_status and oper_status in set_port_oper. The netdev_oper_status write is what the cfgmgr will actually consume; the oper_status write keeps the entry self-consistent. Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Retrigger Test vstest + TestAsan vstest stages: previous run hit test_acl_mark.py flake plus Azure agent OOM (exit 137, 70 Killed pytest processes). All Build stages (amd64, BuildAsan, BuildArm arm64/armhf, BuildTrixie) pass on this head; only vstest stages need a healthier Azure agent. Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What I did
Implement the Monitor Link Group feature in cfgmgr. A monitor link group tracks a set of uplink interfaces and automatically forces downlink interfaces admin-down when the operational uplink count drops below the configured
min-uplinksthreshold, preventing traffic black-holing when upstream connectivity is lost.Files changed:
cfgmgr/monitorlinkgroupmgr.cppcfgmgr/monitorlinkgroupmgr.hcfgmgr/intfmgrd.cppcfgmgr/portmgr.cpp+.hcfgmgr/portmgrd.cppcfgmgr/teammgr.cpp+.hcfgmgr/teammgrd.cpptests/mock_tests/portmgr_ut.cpptests/test_monitorlinkgroupmgr.pyWhy I did it
Prevents traffic black-holing in multi-homed server, aggregation, and border leaf deployments. When uplinks fail, servers connected to downlinks continue forwarding into a black hole. Monitor Link Group takes the downlinks admin-down to signal the failure and trigger failover.
Design
MonitorLinkGroupMgr runs as a sibling
Orchinsideintfmgrd. It subscribes toCONFIG_DB:MONITOR_LINK_GROUPandSTATE_DB:PORT_TABLE/LAG_TABLE. The group state machine has three states:force_downforce_downallow_upOn initial group creation the link-up-delay is skipped if uplinks already meet the threshold.
Enforcement split: MonitorLinkGroupMgr writes
allow_up/force_downper-interface toSTATE_DB:MONITOR_LINK_GROUP_MEMBER. PortMgr and TeamMgr subscribe and merge this with the configuredadmin_statusbefore applying the effective state to the kernel netdev. MonitorLinkGroupMgr never callsip link setdirectly.Multi-group membership: an interface can be uplink in one group and downlink in another simultaneously.
force_downwins if any group forces it down, enabling transitive cascade topologies.LAG identity is confirmed via CONFIG_DB lookup rather than name-prefix matching.
How I verified it
tests/test_monitorlinkgroupmgr.pyVS integration tests: group lifecycle, uplink-down force-down and recovery, link-up-delay PENDING state, config admin_status override, cross-group cascade topologyCompanion PRs
show monitor-linkCLIHLD: sonic-net/SONiC#2308
Yang Changes: sonic-net/sonic-buildimage#27004
Show: sonic-net/sonic-utilities#4497
swss-common: sonic-net/sonic-swss-common#1181