Add state duration to DDM FSM#689
Conversation
|
I've manually tested this in a4x2 and proven that everything still works as expected. I've also manually verified that The open question that I still have here: is |
|
One other thing: I am also happy to add an FSM history buffer for DDM that mirrors what we have in BGP, especially if we don't believe that this PR fully addresses the visibility issue described in #54 |
Db.peers was only used as a cache for reporting peers to ddm-api clients calling GET /peers, and the contents of the cache were a bit misleading. For example, there is no such thing as a peer state in DDM. DDM FSMs have states, but they are instantiated per interface not per peer. Not only does the state reported by GET /peers not actually map to a peer, but it doesn't even directly map to the FSM of the peer's interface. Instead, that state was populated by manual db updates scattered throughout the FSM implementation. This removes Db.peers and introduces a new shared InterfaceState struct that consolidates peer/FSM info for a given interface. GET /peers now reads from each InterfaceState and passes along any peers it finds along with interface state. Also fixes a pre-existing bug in oxstats where interface names were always empty due to reading from a stale SmContext clone.
After adding PeerIdentity, the peer_address field of SessionStats is now redundant (and also isn't a stat) so this removes it. The peer address is tracked in PeerIdentity, so we can just rely on that field behind a single mutex instead of locking/juggling both stats and identity. Also adds a couple helper methods for transitioning FSM state and updating the peer state.
67c8745 to
e5795f3
Compare
|
rebased atop main after #724 |
rcgoodfellow
left a comment
There was a problem hiding this comment.
Thanks @taspelund comments follow.
Something I'd like to make sure we test with this is ddmd restart through svcadm restart mg-ddm to make sure that all the peers come back up how we expect. That was the original use case for the peer database. When we started putting interfaces in SMF that may completely obviate the peers part of the database, but I'd like to be sure with ddm-restart testing. The trio/quartet tests may actually cover this already, but I'd like to be sure.
Drop the local seconds-precision format_duration helper in favor of the shared helper already used by mgadm for BGP FSM durations, so ddmadm and mgadm display durations consistently. Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
Encapsulate the paired writes to if_index and if_name behind a single method on InterfaceState so the two fields stay in sync at the one site that updates them. Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
The produce() loop locked peer.iface.if_name twelve times per peer per scrape. Take the lock once at the top of the loop body and reuse the cloned String for each sample. Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
peer_status() internally locks fsm_state and last_fsm_state_change. Call it before acquiring if_index and peer_identity in get_peers so it remains deadlock-safe if it ever takes additional InterfaceState mutexes. Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
|
@rcgoodfellow your feedback has been integrated. I will run this through a4x2 to explicitly test out the smf restart |
Resolves conflicts from main's split of ddm/src/sm.rs into sm/mod.rs and sm/state.rs, and ddm/src/discovery.rs into discovery/mod.rs and discovery/runtime.rs. The branch's InterfaceState/FsmState/PeerIdentity types and the iface field on SmContext now live in sm/mod.rs alongside the other shared definitions; the state-machine transition calls and the discovery::handler signature change move into the new submodules. Signed-off-by: Trey Aspelund <trey@oxidecomputer.com>
At its heart, this is a re-implementation of #180 which has been updated for the modern codebase and offers some cleanup in addition to timestamps.
First and foremost, this PR adds duration tracking for DDM FSM states. It does so by updating the FSM state enum variants to hold a std::time::Instant representing the timestamp when the FSM entered that state, and calculating a Duration using .elapsed() when the FSM state is queried (i.e. when
ddmadm get-peersis called).Part of this work was to cleanup the implementation of peer tracking: namely getting rid of the Db.peers table and instead tracking a peer in each FSM. Since the FSMs are instantiated per interface, this naturally gives us the interface-to-peer mapping structure so we don't actually need to maintain a separate table for the peers. Instead, we update/query the peers directly from the FSMs.
There are also a couple quality of life improvements: putting the state duration in a separate column and sorting the output of
ddmadm get-peers.Fixes: #54