Skip to content

[cfgmgrd] Add Monitor Link Group feature#4523

Open
srodd-nexthop wants to merge 12 commits into
sonic-net:masterfrom
nexthop-ai:srodd.monitor-link-group.v2
Open

[cfgmgrd] Add Monitor Link Group feature#4523
srodd-nexthop wants to merge 12 commits into
sonic-net:masterfrom
nexthop-ai:srodd.monitor-link-group.v2

Conversation

@srodd-nexthop
Copy link
Copy Markdown

@srodd-nexthop srodd-nexthop commented Apr 27, 2026

What I did

Implement the Monitor Link Group feature in cfgmgr. A monitor link group tracks a set of uplink interfaces and automatically forces downlink interfaces admin-down when the operational uplink count drops below the configured min-uplinks threshold, preventing traffic black-holing when upstream connectivity is lost.

Files changed:

File Change
cfgmgr/monitorlinkgroupmgr.cpp New — core state machine
cfgmgr/monitorlinkgroupmgr.h New — class definition
cfgmgr/intfmgrd.cpp Register MonitorLinkGroupMgr in the Select loop
cfgmgr/portmgr.cpp + .h applyEffectiveAdminStatus, doMonitorLinkGroupMemberTask
cfgmgr/portmgrd.cpp Subscribe to STATE_MONITOR_LINK_GROUP_MEMBER_TABLE
cfgmgr/teammgr.cpp + .h Same pattern for PortChannel interfaces
cfgmgr/teammgrd.cpp Subscribe to STATE_MONITOR_LINK_GROUP_MEMBER_TABLE
tests/mock_tests/portmgr_ut.cpp Unit test update for new PortMgr constructor parameter
tests/test_monitorlinkgroupmgr.py New VS integration tests

Why I did it

Prevents traffic black-holing in multi-homed server, aggregation, and border leaf deployments. When uplinks fail, servers connected to downlinks continue forwarding into a black hole. Monitor Link Group takes the downlinks admin-down to signal the failure and trigger failover.

Design

MonitorLinkGroupMgr runs as a sibling Orch inside intfmgrd. It subscribes to CONFIG_DB:MONITOR_LINK_GROUP and STATE_DB:PORT_TABLE/LAG_TABLE. The group state machine has three states:

  • DOWN: uplink count < min-uplinks; downlinks receive force_down
  • PENDING: uplink count >= min-uplinks AND link-up-delay > 0; timer running; downlinks remain force_down
  • UP: threshold met and timer elapsed (or link-up-delay = 0); downlinks receive allow_up

On initial group creation the link-up-delay is skipped if uplinks already meet the threshold.

Enforcement split: MonitorLinkGroupMgr writes allow_up / force_down per-interface to STATE_DB:MONITOR_LINK_GROUP_MEMBER. PortMgr and TeamMgr subscribe and merge this with the configured admin_status before applying the effective state to the kernel netdev. MonitorLinkGroupMgr never calls ip link set directly.

Multi-group membership: an interface can be uplink in one group and downlink in another simultaneously. force_down wins if any group forces it down, enabling transitive cascade topologies.

LAG identity is confirmed via CONFIG_DB lookup rather than name-prefix matching.

How I verified it

  • tests/test_monitorlinkgroupmgr.py VS integration tests: group lifecycle, uplink-down force-down and recovery, link-up-delay PENDING state, config admin_status override, cross-group cascade topology
  • intfmgrd, portmgrd, and teammgrd build cleanly

Companion PRs

  • sonic-swss-common: STATE_DB table name macros
  • sonic-yang-models: YANG model
  • sonic-utilities: show monitor-link CLI
  • SONiC/SONiC: HLD

HLD: sonic-net/SONiC#2308
Yang Changes: sonic-net/sonic-buildimage#27004
Show: sonic-net/sonic-utilities#4497
swss-common: sonic-net/sonic-swss-common#1181

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@srodd-nexthop srodd-nexthop force-pushed the srodd.monitor-link-group.v2 branch from 049c3d8 to badc6b2 Compare April 28, 2026 08:00
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Adds the Monitor Link Group feature to cfgmgr.  A monitor link group
tracks a set of uplink interfaces and automatically forces downlink
interfaces admin-down when operational uplink count drops below the
configured min-uplinks threshold, preventing traffic black-holing when
upstream connectivity is lost.

Design:

- monitorlinkgroupmgr (new, runs inside intfmgrd):
  Subscribes to CONFIG_DB:MONITOR_LINK_GROUP and STATE_DB PORT/LAG
  tables.  Maintains per-group state (up/down/pending) derived from
  the min-uplinks threshold.  Writes STATE_DB:MONITOR_LINK_GROUP_STATE
  (group visibility) and STATE_DB:MONITOR_LINK_GROUP_MEMBER (per-port
  force-down signal consumed by portmgrd/teammgrd).  Uses a
  SelectableTimer per group for the link-up-delay, which defers
  recovery to suppress churn from short uplink flaps.  On initial
  group creation the delay is skipped if uplinks already meet the
  threshold.  A port may serve as uplink in one group and downlink in
  another, enabling chain topologies where a force-down cascades
  transitively through a stack of groups.

- portmgr: subscribes to STATE_DB:MONITOR_LINK_GROUP_MEMBER.
  applyEffectiveAdminStatus() combines CONFIG_DB admin_status with the
  monitor-link member state; the result is what actually reaches the
  kernel netdev.  Both CONFIG_DB and STATE_DB change paths call the
  same helper.

- teammgr: same pattern for PortChannel interfaces.  LAG identity is
  confirmed via CONFIG_DB lookup rather than name-prefix matching.

- portmgrd / teammgrd: pass STATE_MONITOR_LINK_GROUP_MEMBER_TABLE as
  a subscribed TableConnector so the consumer loop wakes on changes.

- tests/test_monitorlinkgroupmgr.py: VS integration tests covering
  group lifecycle, uplink-down force-down and recovery, link-up-delay
  pending state, config admin_status interaction, and cross-group
  cascade topology.

Companion PRs: sonic-yang-models (YANG model), sonic-utilities
(show/config commands), SONiC/SONiC (HLD).

Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
Prioritize netdev_oper_status over oper_status when both fields are
present in a STATE_DB notification. netdev_oper_status is authoritative
for Ethernet interfaces; oper_status is used as fallback for PortChannels
which do not have a netdev_oper_status field.

Previously both fields triggered independent updateMonitorLinkInterfaceState
calls within the same notification, which could cause redundant state
transitions. Now only one call is made per notification using the correct
field for the interface type.

Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
Bug fixes:
- handleLinkupDelayExpired else-branch: STATE_DB stays "pending" after
  timer fires with insufficient uplinks; add writeMonitorLinkGroupStateToDb
- doPortTableTask: DEL on STATE_PORT/LAG_TABLE ignored, causing ghost
  uplink counts; treat port removal as link-down
- removeInterfaceFromGroup: called updateMonitorLinkGroupState on each
  removal causing intermediate state evaluations during interface list
  swap; deferred to end of updateGroupInterfaceLists
- removeInterfaceFromGroup: role membership check tested any-role instead
  of the specified link_type; now role-specific
- getInterfaceOperState: used PORTCHANNEL_PREFIX name check; replaced
  with try-port-then-lag table lookup

Type / cleanup fixes:
- down_group_count: declared as int, changed to uint32_t
- doMonitorLinkGroupTask: retry return path was dead code; converted to
  void and removed the conditional wrapper in doTask
- startLinkupDelayTimer: SelectableTimer constructed with 1s placeholder;
  use actual delay_seconds at construction time
- PortMgr m_cfgLagMemberTable: initialized but never used; removed
- doTask(SelectableTimer&): O(N) scan to find owning group; replaced with
  m_timerToGroup reverse map for O(1) lookup

Refactor:
- TeamMgr doLagTask and doMonitorLinkGroupMemberTask inlined the same
  monitor-link + config admin_status merge logic twice; extracted into
  applyEffectiveLagAdminStatus() helper, matching PortMgr's pattern

Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
…ycle detection

Two HLD review action items wrapped together to land before opening the swss PR:

R-4 (terminology rename) — reviewers (Judson Wilson @ Nvidia, seconded by Sudheer
and Eswaran) flagged that the "uplink/downlink" wording is confusing because the
same port can be uplink in one group and downlink in another (cross-role). Rename:

  MonitorLinkGroupInfo::
    uplink_interfaces        -> monitored_interfaces
    downlink_interfaces      -> managed_interfaces
    uplink_up_count          -> monitored_up_count
    min_uplinks              -> min_monitored_links
  MonitorLinkInterfaceInfo::
    uplink_groups            -> monitored_groups
    downlink_groups          -> managed_groups
  Method:
    updateDownlinkInterfacesForGroupStateChange -> updateManagedInterfacesForGroupStateChange
  CONFIG_DB field names:
    "uplinks"      -> "monitored-links"
    "downlinks"    -> "managed-links"
    "min-uplinks"  -> "min-monitored-links"
  STATE_DB field names (MONITOR_LINK_GROUP_STATE):
    "uplinks"      -> "monitored-links"
    "downlinks"    -> "managed-links"
  Internal link_type literal: "uplink"->"monitored", "downlink"->"managed".

R-6 (cycle detection) — reviewers (Anders Linn, seconded by Sudheer) flagged that
chained groups with cross-referenced monitored/managed sets can deadlock. New
MonitorLinkGroupMgr::wouldCreateDependencyCycle runs a DFS from the proposed
group: edge G -> H exists iff some port appears in G.monitored AND H.managed
(G's recovery waits on H). A cycle implies a no-recovery deadlock; the daemon
rejects the SET (no STATE_DB write, SWSS_LOG_ERROR). The existing m_monitorLinkGroups
is acyclic by induction (this check runs on every SET), so any newly introduced
cycle must involve the proposed group — DFS starts there and back-edges signal a
cycle. Helper parseInterfaceList factored out so the check and updateGroupInterfaceLists
share the same parsing.

VS test file updated for the rename (5 existing test cases). The R-6 VS tests
have been moved to sonic-mgmt-mlg (cycle detection is end-to-end-testable; VS
is sufficient for R-4 coverage).

Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
R-A from the HLD review action items. Operators today can read MLG state
(UP/DOWN/PENDING) but not "when did this last change?", "how far into the
link-up-delay are we?", or "is this flapping?" without grepping syslog.
This patch surfaces that in STATE_DB so `show monitor-link` (companion PR)
can render it directly.

MonitorLinkGroupInfo gets seven new fields:
  bool          has_state_change             // false until first transition observed
  std::string   last_state_change_from/to    // "up" | "down" | "pending"
  std::string   last_state_change_reason     // human-readable
  system_clock::time_point last_state_change_time
  system_clock::time_point pending_start_time
  uint64_t      up_to_down_count             // direct UP -> DOWN only
  uint64_t      down_to_up_count             // any -> UP

Two new methods:
  recordStateTransition(group, from, to, reason)
    Sets last_state_change_*, sets pending_start_time when entering PENDING,
    bumps counters per the rules above, emits SWSS_LOG_NOTICE. No-op when
    from == to.
  restoreGroupCountersFromStateDb(group)
    Reload up/down counters from STATE_DB on first sighting of a group in
    this process, so the counters survive an intfmgrd restart without
    going non-monotonic from a telemetry/SNMP consumer's perspective.

Call sites wired up:
  updateMonitorLinkGroupState: DOWN -> PENDING, DOWN -> UP, PENDING -> DOWN
    (cancel), UP -> DOWN. The !should_be_up branch is restructured to make
    the (is_pending && !current_state) vs current_state cases mutually
    exclusive; adds a defensive SWSS_LOG_ERROR self-heal for the impossible
    state where both is_up and pending_up are true.
  handleLinkupDelayExpired: PENDING -> UP (timer expired) and PENDING ->
    DOWN (threshold lost late).
  handleGroupDelayChange: PENDING -> UP via delay -> 0 and via reduced
    delay already elapsed.
  handleMonitorLinkGroupSet: calls restoreGroupCountersFromStateDb on
    is_new_group right after updateGroupConfiguration.

writeMonitorLinkGroupStateToDb emits new STATE_DB fields on
MONITOR_LINK_GROUP_STATE:
  last_state_change_time (epoch seconds; only when has_state_change)
  last_state_change_from / _to / _reason (same condition)
  pending_start_time (epoch seconds; only when state == "pending")
  up_to_down_count, down_to_up_count (always)

An explicit hdel of pending_start_time happens on every non-pending write
so external consumers never see stale data after the timer fires.

Reason text uses the renamed terminology, e.g.:
  "monitored N/M >= min-monitored-links K, starting link-up-delay"
  "link-up-delay expired (monitored N/M >= min-monitored-links K)"

All new STATE_DB fields are optional; groups that never transitioned
render exactly as before, preserving backward compat with the existing
show CLI parser.

Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
…, drop reason field

Two design simplifications on top of PR-A's transition-tracking work.

A. Collapse up_to_down_count + down_to_up_count into a single total_transitions.

   Given the rules of the original asymmetric scheme:
     d2u: bumped on DOWN->UP and PENDING->UP
     u2d: bumped on UP->DOWN only
   the invariant d2u - u2d in {0, 1} held in all reachable states (d2u == u2d
   when currently DOWN or PENDING; d2u == u2d + 1 when currently UP). The two
   fields carried one bit of information beyond last_state_change_to + state.

   The replacement single counter is incremented on every recorded transition
   (DOWN->PENDING, PENDING->UP, PENDING->DOWN, DOWN->UP, UP->DOWN). A full
   UP->DOWN->UP cycle bumps it by 2. PENDING->DOWN now counts as a transition
   too -- previously it was a "near-miss-that-failed" silently excluded from
   u2d, but distinguishing it from a regular UP->DOWN is already done by
   last_state_change_from so the counter doesn't need to encode it.

B. Drop last_state_change_reason.

   With last_state_change_{from,to} present, reason is derivable: to=='down'
   always means threshold lost, to=='up' or 'pending' always means threshold
   met. The only diagnostic detail the free-text reason carried was the
   snapshot count "monitored N/M" at transition time and the path
   (timer-expired vs delay-reduced for PENDING->UP). Neither answers the
   HLD review's core asks ("when did this last change?" / "is it flapping?")
   and both can be inferred from the pre-existing SWSS_LOG_NOTICE lines if
   needed.

   This drops the std::ostringstream reason; reason << ...; block at all 8
   recordStateTransition call sites (5 in updateMonitorLinkGroupState, 2 in
   handleLinkupDelayExpired, 1 in handleGroupDelayChange wait... actually 2 in
   handleGroupDelayChange after counting). Net ~50 lines removed from the cpp.

STATE_DB schema change on MONITOR_LINK_GROUP_STATE:
  removed: up_to_down_count, down_to_up_count, last_state_change_reason
  added:   total_transitions

All three removed fields were introduced in the prior PR-A commit
(239e99c9), so no operator workflow depended on them yet.

Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
…nsitional shim

STATE_MONITOR_LINK_GROUP_STATE_TABLE_NAME and STATE_MONITOR_LINK_GROUP_MEMBER_TABLE_NAME
are added to sonic-swss-common/common/schema.h by
sonic-net/sonic-swss-common#1181. Until that PR merges and the
swss-common artifact pulled by sonic-swss CI carries those macros,
this PR's build fails with 'STATE_MONITOR_LINK_GROUP_MEMBER_TABLE_NAME
not declared in scope' in cfgmgr/teammgr*, portmgr*, and intfmgrd.

Add #ifndef-guarded fallback definitions in cfgmgr/monitorlinkgroupmgr.h
and include that header from the four cfgmgr files that reference the
macros (portmgr.cpp, portmgrd.cpp, teammgr.cpp, teammgrd.cpp; the MLG
manager already includes its own header).

Once sonic-swss-common#1181 merges and schema.h carries these macros,
the #ifndef guards skip the fallback and the upstream definitions take
over. The shim can then be removed as a cleanup, but the #include lines
stay since the files legitimately consume MLG-MEMBER state.

Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
The standalone sonic-swss CI uses a published sonic-swss-common artifact
that does not yet contain the CFG_ macro (added by buildimage#27004 via
YANG schema generation) or the STATE_ macros (added by swss-common#1181).
Add CFG_MONITOR_LINK_GROUP_TABLE_NAME to the existing #ifndef-guarded
shim block so the build links cleanly until both dependencies merge.

Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
@srodd-nexthop srodd-nexthop force-pushed the srodd.monitor-link-group.v2 branch from d284807 to 5cb0148 Compare May 14, 2026 09:19
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

The original doTask paths initialise admin_status to DEFAULT_ADMIN_STATUS_STR
("down") when the port/LAG first appears in CONFIG_DB without an explicit
admin_status. The new applyEffective[Lag]AdminStatus helpers re-read CONFIG_DB
and previously defaulted to up when the field was absent, which silently
flipped a fresh port from down to up. The portmgr_ut DoTask test catches
exactly this case (sets only speed/index, expects 'ip link set Ethernet0
down'). Switch both helpers to default to false to match DEFAULT_ADMIN_STATUS_STR.

Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

chrono::duration<int64_t>::count() returns long long on 32-bit ARM
where long is 32-bit. Switch the two SWSS_LOG_INFO sites that print
elapsed seconds to %lld with an explicit cast so -Werror=format doesn't
trip the BuildArm armhf job.

Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

…linkgroupmgr

The monitorlinkgroupmgr daemon prefers netdev_oper_status over oper_status
when reading port state from STATE_DB:PORT_TABLE — netdev_oper_status is
the authoritative kernel-netdev signal for Ethernet ports (see comment in
doPortTableTask). Writing only oper_status is silently ignored if portsyncd
in the dvs has already populated netdev_oper_status.

This caused the 5 test_monitorlinkgroupmgr cases to fail in CI: with the
real portsyncd writing netdev_oper_status=down at dvs boot, our test writes
of oper_status=up were dropped, so the cfgmgr never saw the monitored
interface come up and the group state machine never transitioned.

Fix: write both netdev_oper_status and oper_status in set_port_oper. The
netdev_oper_status write is what the cfgmgr will actually consume; the
oper_status write keeps the entry self-consistent.

Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Retrigger Test vstest + TestAsan vstest stages: previous run hit
test_acl_mark.py flake plus Azure agent OOM (exit 137, 70 Killed pytest
processes). All Build stages (amd64, BuildAsan, BuildArm arm64/armhf,
BuildTrixie) pass on this head; only vstest stages need a healthier
Azure agent.

Signed-off-by: Satishkumar Rodd <srodd@nexthop.ai>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants