Skip to content

Add state duration to DDM FSM#689

Open
taspelund wants to merge 10 commits into
mainfrom
trey/ddm-track-session-duration
Open

Add state duration to DDM FSM#689
taspelund wants to merge 10 commits into
mainfrom
trey/ddm-track-session-duration

Conversation

@taspelund
Copy link
Copy Markdown
Contributor

@taspelund taspelund commented Mar 30, 2026

At its heart, this is a re-implementation of #180 which has been updated for the modern codebase and offers some cleanup in addition to timestamps.

First and foremost, this PR adds duration tracking for DDM FSM states. It does so by updating the FSM state enum variants to hold a std::time::Instant representing the timestamp when the FSM entered that state, and calculating a Duration using .elapsed() when the FSM state is queried (i.e. when ddmadm get-peers is called).

Part of this work was to cleanup the implementation of peer tracking: namely getting rid of the Db.peers table and instead tracking a peer in each FSM. Since the FSMs are instantiated per interface, this naturally gives us the interface-to-peer mapping structure so we don't actually need to maintain a separate table for the peers. Instead, we update/query the peers directly from the FSMs.

There are also a couple quality of life improvements: putting the state duration in a separate column and sorting the output of ddmadm get-peers.

Fixes: #54

@taspelund taspelund requested a review from rcgoodfellow March 30, 2026 15:58
@taspelund taspelund self-assigned this Mar 30, 2026
@taspelund taspelund added ddm Delay Driven Multipath rust Pull requests that update rust code labels Mar 30, 2026
@taspelund
Copy link
Copy Markdown
Contributor Author

I've manually tested this in a4x2 and proven that everything still works as expected. I've also manually verified that ddmadm expire-peer still works as expected as well.

The open question that I still have here: is PeerStatus::Expired worth keeping or should I remove it?
We do not maintain state for Expired peers (setting the Option to None when peer expires) and right now it's only really referenced in a From impl. I think we can safely remove it, but I didn't want to yank it out without discussion in case there were plans for it.

@taspelund
Copy link
Copy Markdown
Contributor Author

One other thing: I am also happy to add an FSM history buffer for DDM that mirrors what we have in BGP, especially if we don't believe that this PR fully addresses the visibility issue described in #54

taspelund added 4 commits May 18, 2026 13:56
Db.peers was only used as a cache for reporting peers to ddm-api clients
calling GET /peers, and the contents of the cache were a bit misleading.
For example, there is no such thing as a peer state in DDM. DDM FSMs
have states, but they are instantiated per interface not per peer. Not
only does the state reported by GET /peers not actually map to a peer,
but it doesn't even directly map to the FSM of the peer's interface.
Instead, that state was populated by manual db updates scattered
throughout the FSM implementation. This removes Db.peers and introduces
a new shared InterfaceState struct that consolidates peer/FSM info for a
given interface. GET /peers now reads from each InterfaceState and
passes along any peers it finds along with interface state.

Also fixes a pre-existing bug in oxstats where interface names
were always empty due to reading from a stale SmContext clone.
After adding PeerIdentity, the peer_address field of SessionStats is now
redundant (and also isn't a stat) so this removes it. The peer address
is tracked in PeerIdentity, so we can just rely on that field behind a
single mutex instead of locking/juggling both stats and identity.
Also adds a couple helper methods for transitioning FSM state and
updating the peer state.
@taspelund taspelund force-pushed the trey/ddm-track-session-duration branch from 67c8745 to e5795f3 Compare May 18, 2026 19:57
@taspelund
Copy link
Copy Markdown
Contributor Author

rebased atop main after #724

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
Comment thread ddm/src/admin.rs Outdated
Copy link
Copy Markdown
Collaborator

@rcgoodfellow rcgoodfellow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @taspelund comments follow.

Something I'd like to make sure we test with this is ddmd restart through svcadm restart mg-ddm to make sure that all the peers come back up how we expect. That was the original use case for the peer database. When we started putting interfaces in SMF that may completely obviate the peers part of the database, but I'd like to be sure with ddm-restart testing. The trio/quartet tests may actually cover this already, but I'd like to be sure.

Comment thread ddm/src/oxstats.rs Outdated
Comment thread ddm/src/sm.rs Outdated
Comment thread ddmadm/src/main.rs Outdated
taspelund added 4 commits May 19, 2026 19:54
Drop the local seconds-precision format_duration helper in favor of the
shared helper already used by mgadm for BGP FSM durations, so ddmadm and
mgadm display durations consistently.

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
Encapsulate the paired writes to if_index and if_name behind a single
method on InterfaceState so the two fields stay in sync at the one site
that updates them.

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
The produce() loop locked peer.iface.if_name twelve times per peer per
scrape. Take the lock once at the top of the loop body and reuse the
cloned String for each sample.

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
peer_status() internally locks fsm_state and last_fsm_state_change. Call
it before acquiring if_index and peer_identity in get_peers so it remains
deadlock-safe if it ever takes additional InterfaceState mutexes.

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
@taspelund
Copy link
Copy Markdown
Contributor Author

@rcgoodfellow your feedback has been integrated. I will run this through a4x2 to explicitly test out the smf restart

Resolves conflicts from main's split of ddm/src/sm.rs into sm/mod.rs and
sm/state.rs, and ddm/src/discovery.rs into discovery/mod.rs and
discovery/runtime.rs. The branch's InterfaceState/FsmState/PeerIdentity
types and the iface field on SmContext now live in sm/mod.rs alongside
the other shared definitions; the state-machine transition calls and the
discovery::handler signature change move into the new submodules.

Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ddm Delay Driven Multipath rust Pull requests that update rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ddm: track peer establishment time

2 participants