Refactor connection state machine.#942
Conversation
A rather high diff PR. But, the recommended sequence - connection_manager.go o Splits out the connection states o Generates a plan for connection sequence for each state - room.go -> removed the initial connection loop. This has moved into `join()` in engine.go. That function is used for full reconnect also. - engine.go -> clean up connection stuff and moved to connection_manager.go. Each stage asks connection maanger for a plan and gets ordered list of servers to try with backoffs, validation timeout etc. Ensure context is threaded through and proper deadlines are set. Please note that there is some repetition of code - in building plans for different state in connection_manager.go - between `join()` and `resumeConnection()` when executing the plan in engine.go But, kept them for easier readability. Can consolidate it later once it is stable. Once again requested Claude for tests and it wrote some exhaustive tests. @milos-lk I probably removed/reworked some stuff you did just last week. Sorry. Felt that the whole things was getting entangled and this is an attempt to hopefully make it cleaner. For manual testing, published video from livekit-cli and had code in there to simulate migration (Go SDK does not support migration, so this also becomes a full reconnect) or node failure or leave request full reconnect every 10 seconds and ensured that it reconnected and published (found an issue with dynacast while doing this which I fixed in #941)
|
no worries - my change from the last week allowed us to ship that fixed region failover logic in some of services. I am going to take a look to check what this buys us. |
|
Since ConnectionState() is now derived from the connection manager, what moves the state machine to Disconnected on a client-initiated leave? setDisconnected() seems to only be called from OnLeave on a server LeaveRequest_DISCONNECT, but room.Disconnect() → cleanup() → engine.Close() never touches the connection manager, so after a local disconnect, ConnectionState() would keep returning Connected? Same question for the give-up path: when the reconnect loop exhausts retries and fires OnDisconnected, the state stays Reconnecting? Would it make sense to call setDisconnected() from engine.Close()? |
Good catch! I did that last as I saw there was another connection state and tried to re-use. Will fix the disconnected bit. |
| } | ||
|
|
||
| // if already resuming, do not take settings that are nil as some internal paths might do a resume without regions | ||
| if c.state == connectionManagerStateResuming { |
There was a problem hiding this comment.
dead code? resuming is != connectionManagerStateConnected - so the check above returns first?
There was a problem hiding this comment.
Good catch, I need to move that up before the previous check. There are cases where the resume happens multiple times. It can happen internally and it can also come from server. So, want to latch the server given list if available.
|
|
||
| c.regionSettings = utils.CloneProto(regionSettigs) | ||
| c.updateState(connectionManagerStateReconnecting) | ||
| return true |
There was a problem hiding this comment.
What prevents a leave: RECONNECT arriving during the initial join (state still Initial - the read loop starts in OnJoinResponse, before setConnected) from spawning the reconnect goroutine here and racing the in-flight join()? The old code gated handleDisconnect on hasConnected. setResuming has a state guard but setReconnecting returns true unconditionally.
There was a problem hiding this comment.
There could be a small window there. Will look at plugging that.
| } | ||
| } | ||
|
|
||
| if !e.reconnecting.CompareAndSwap(false, true) { |
There was a problem hiding this comment.
If the goroutine is between setResumed (state=Connected) and its defer e.reconnecting.Store(false) when a new transport-close arrives, setResuming succeeds, this CAS fails, and the goroutine exits, leaving state=Resuming with no worker. What re-arms recovery at that point? setResuming will refuse every subsequent resume-type event (state != Connected), and a leave: RECONNECT needs a signal connection we no longer have.
A rather high diff PR. But, the recommended sequence
join()in engine.go. That function is used for full reconnect also.Please note that there is some repetition of code
join()andresumeConnection()when executing the plan in engine.go But, kept them for easier readability. Can consolidate it later once it is stable.Once again requested Claude for tests and it wrote some exhaustive tests.
@milos-lk I probably removed/reworked some stuff you did just last week. Sorry. Felt that the whole things was getting entangled and this is an attempt to hopefully make it cleaner.
For manual testing, published video from livekit-cli and had code in there to simulate migration (Go SDK does not support migration, so this also becomes a full reconnect) or node failure or leave request full reconnect every 10 seconds and ensured that it reconnected and published (found an issue with dynacast while doing this which I fixed in #941)