Refactor connection state machine. by boks1971 · Pull Request #942 · livekit/server-sdk-go

boks1971 · 2026-07-02T12:17:08Z

A rather high diff PR. But, the recommended sequence

connection_manager.go -> Splits out the connection states o Generates a plan for connection sequence for each state
room.go -> removed the initial connection loop. This has moved into join() in engine.go. That function is used for full reconnect also.
engine.go -> clean up connection stuff and moved to connection_manager.go. Each stage asks connection manager for a plan and gets ordered list of servers to try with backoffs, validation timeout etc. Ensure context is threaded through and proper deadlines are set.

Please note that there is some repetition of code

in building plans for different state in connection_manager.go
between join() and resumeConnection() when executing the plan in engine.go But, kept them for easier readability. Can consolidate it later once it is stable.

Once again requested Claude for tests and it wrote some exhaustive tests.

@milos-lk I probably removed/reworked some stuff you did just last week. Sorry. Felt that the whole things was getting entangled and this is an attempt to hopefully make it cleaner.

For manual testing, published video from livekit-cli and had code in there to simulate migration (Go SDK does not support migration, so this also becomes a full reconnect) or node failure or leave request full reconnect every 10 seconds and ensured that it reconnected and published (found an issue with dynacast while doing this which I fixed in #941)

@milos-lk

A rather high diff PR. But, the recommended sequence - connection_manager.go o Splits out the connection states o Generates a plan for connection sequence for each state - room.go -> removed the initial connection loop. This has moved into `join()` in engine.go. That function is used for full reconnect also. - engine.go -> clean up connection stuff and moved to connection_manager.go. Each stage asks connection maanger for a plan and gets ordered list of servers to try with backoffs, validation timeout etc. Ensure context is threaded through and proper deadlines are set. Please note that there is some repetition of code - in building plans for different state in connection_manager.go - between `join()` and `resumeConnection()` when executing the plan in engine.go But, kept them for easier readability. Can consolidate it later once it is stable. Once again requested Claude for tests and it wrote some exhaustive tests. @milos-lk I probably removed/reworked some stuff you did just last week. Sorry. Felt that the whole things was getting entangled and this is an attempt to hopefully make it cleaner. For manual testing, published video from livekit-cli and had code in there to simulate migration (Go SDK does not support migration, so this also becomes a full reconnect) or node failure or leave request full reconnect every 10 seconds and ensured that it reconnected and published (found an issue with dynacast while doing this which I fixed in #941)

milos-lk · 2026-07-02T13:32:31Z

no worries - my change from the last week allowed us to ship that fixed region failover logic in some of services. I am going to take a look to check what this buys us.

milos-lk · 2026-07-02T13:59:37Z

Since ConnectionState() is now derived from the connection manager, what moves the state machine to Disconnected on a client-initiated leave? setDisconnected() seems to only be called from OnLeave on a server LeaveRequest_DISCONNECT, but room.Disconnect() → cleanup() → engine.Close() never touches the connection manager, so after a local disconnect, ConnectionState() would keep returning Connected? Same question for the give-up path: when the reconnect loop exhausts retries and fires OnDisconnected, the state stays Reconnecting? Would it make sense to call setDisconnected() from engine.Close()?

boks1971 · 2026-07-02T14:05:47Z

Since ConnectionState() is now derived from the connection manager, what moves the state machine to Disconnected on a client-initiated leave? setDisconnected() seems to only be called from OnLeave on a server LeaveRequest_DISCONNECT, but room.Disconnect() → cleanup() → engine.Close() never touches the connection manager, so after a local disconnect, ConnectionState() would keep returning Connected? Same question for the give-up path: when the reconnect loop exhausts retries and fires OnDisconnected, the state stays Reconnecting? Would it make sense to call setDisconnected() from engine.Close()?

Good catch! I did that last as I saw there was another connection state and tried to re-use. Will fix the disconnected bit.

milos-lk · 2026-07-02T14:09:32Z

+	}
+
+	// if already resuming, do not take settings that are nil as some internal paths might do a resume without regions
+	if c.state == connectionManagerStateResuming {


dead code? resuming is != connectionManagerStateConnected - so the check above returns first?

Good catch, I need to move that up before the previous check. There are cases where the resume happens multiple times. It can happen internally and it can also come from server. So, want to latch the server given list if available.

milos-lk · 2026-07-02T14:17:53Z

+
+	c.regionSettings = utils.CloneProto(regionSettigs)
+	c.updateState(connectionManagerStateReconnecting)
+	return true


What prevents a leave: RECONNECT arriving during the initial join (state still Initial - the read loop starts in OnJoinResponse, before setConnected) from spawning the reconnect goroutine here and racing the in-flight join()? The old code gated handleDisconnect on hasConnected. setResuming has a state guard but setReconnecting returns true unconditionally.

There could be a small window there. Will look at plugging that.

milos-lk · 2026-07-02T15:08:32Z

+		}
+	}
+
+	if !e.reconnecting.CompareAndSwap(false, true) {


If the goroutine is between setResumed (state=Connected) and its defer e.reconnecting.Store(false) when a new transport-close arrives, setResuming succeeds, this CAS fails, and the goroutine exits, leaving state=Resuming with no worker. What re-arms recovery at that point? setResuming will refuse every subsequent resume-type event (state != Connected), and a leave: RECONNECT needs a signal connection we no longer have.

boks1971 requested a review from a team as a code owner July 2, 2026 12:17

github-actions Bot added the slack-ts:cs-devs:1782994640.315439 label Jul 2, 2026

boks1971 added 3 commits July 2, 2026 17:50

clean up

d0b5537

static check

01db756

static check

2e281d0

milos-lk reviewed Jul 2, 2026

View reviewed changes

move disconnected to close

5134d6d

milos-lk reviewed Jul 2, 2026

View reviewed changes

tackle re-resume with settings

ce58e44

milos-lk reviewed Jul 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Refactor connection state machine.#942

Refactor connection state machine.#942
boks1971 wants to merge 6 commits into
mainfrom
raja_connection_manager

boks1971 commented Jul 2, 2026 •

edited

Loading

Uh oh!

milos-lk commented Jul 2, 2026

Uh oh!

milos-lk commented Jul 2, 2026

Uh oh!

boks1971 commented Jul 2, 2026

Uh oh!

milos-lk Jul 2, 2026

Uh oh!

boks1971 Jul 2, 2026

Uh oh!

milos-lk Jul 2, 2026

Uh oh!

boks1971 Jul 2, 2026

Uh oh!

milos-lk Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

boks1971 commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

milos-lk commented Jul 2, 2026

Uh oh!

milos-lk commented Jul 2, 2026

Uh oh!

boks1971 commented Jul 2, 2026

Uh oh!

milos-lk Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

boks1971 Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

milos-lk Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

boks1971 Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

milos-lk Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

boks1971 commented Jul 2, 2026 •

edited

Loading