Skip to content

Refactor connection state machine.#942

Open
boks1971 wants to merge 6 commits into
mainfrom
raja_connection_manager
Open

Refactor connection state machine.#942
boks1971 wants to merge 6 commits into
mainfrom
raja_connection_manager

Conversation

@boks1971

@boks1971 boks1971 commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

A rather high diff PR. But, the recommended sequence

  • connection_manager.go -> Splits out the connection states o Generates a plan for connection sequence for each state
  • room.go -> removed the initial connection loop. This has moved into join() in engine.go. That function is used for full reconnect also.
  • engine.go -> clean up connection stuff and moved to connection_manager.go. Each stage asks connection manager for a plan and gets ordered list of servers to try with backoffs, validation timeout etc. Ensure context is threaded through and proper deadlines are set.

Please note that there is some repetition of code

  • in building plans for different state in connection_manager.go
  • between join() and resumeConnection() when executing the plan in engine.go But, kept them for easier readability. Can consolidate it later once it is stable.

Once again requested Claude for tests and it wrote some exhaustive tests.

@milos-lk I probably removed/reworked some stuff you did just last week. Sorry. Felt that the whole things was getting entangled and this is an attempt to hopefully make it cleaner.

For manual testing, published video from livekit-cli and had code in there to simulate migration (Go SDK does not support migration, so this also becomes a full reconnect) or node failure or leave request full reconnect every 10 seconds and ensured that it reconnected and published (found an issue with dynacast while doing this which I fixed in #941)

A rather high diff PR. But, the recommended sequence
- connection_manager.go
  o Splits out the connection states
  o Generates a plan for connection sequence for each state
- room.go -> removed the initial connection loop. This has moved into
  `join()` in engine.go. That function is used for full reconnect also.
- engine.go -> clean up connection stuff and moved to
  connection_manager.go. Each stage asks connection maanger for a plan
  and gets ordered list of servers to try with backoffs, validation
  timeout etc. Ensure context is threaded through and proper deadlines
  are set.

Please note that there is some repetition of code
- in building plans for different state in connection_manager.go
- between `join()` and `resumeConnection()` when executing the plan in
  engine.go
But, kept them for easier readability. Can consolidate it later once it
is stable.

Once again requested Claude for tests and it wrote some exhaustive
tests.

@milos-lk I probably removed/reworked some stuff you did just last week.
Sorry. Felt that the whole things was getting entangled and this is an
attempt to hopefully make it cleaner.

For manual testing, published video from livekit-cli and had code in
there to simulate migration (Go SDK does not support migration, so this
also becomes a full reconnect) or node failure or leave request full
reconnect every 10 seconds and ensured that it reconnected and published
(found an issue with dynacast while doing this which I fixed in #941)
@milos-lk

milos-lk commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

no worries - my change from the last week allowed us to ship that fixed region failover logic in some of services. I am going to take a look to check what this buys us.

@milos-lk

milos-lk commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Since ConnectionState() is now derived from the connection manager, what moves the state machine to Disconnected on a client-initiated leave? setDisconnected() seems to only be called from OnLeave on a server LeaveRequest_DISCONNECT, but room.Disconnect() → cleanup() → engine.Close() never touches the connection manager, so after a local disconnect, ConnectionState() would keep returning Connected? Same question for the give-up path: when the reconnect loop exhausts retries and fires OnDisconnected, the state stays Reconnecting? Would it make sense to call setDisconnected() from engine.Close()?

@boks1971

boks1971 commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

Since ConnectionState() is now derived from the connection manager, what moves the state machine to Disconnected on a client-initiated leave? setDisconnected() seems to only be called from OnLeave on a server LeaveRequest_DISCONNECT, but room.Disconnect() → cleanup() → engine.Close() never touches the connection manager, so after a local disconnect, ConnectionState() would keep returning Connected? Same question for the give-up path: when the reconnect loop exhausts retries and fires OnDisconnected, the state stays Reconnecting? Would it make sense to call setDisconnected() from engine.Close()?

Good catch! I did that last as I saw there was another connection state and tried to re-use. Will fix the disconnected bit.

Comment thread connection_manager.go
}

// if already resuming, do not take settings that are nil as some internal paths might do a resume without regions
if c.state == connectionManagerStateResuming {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dead code? resuming is != connectionManagerStateConnected - so the check above returns first?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I need to move that up before the previous check. There are cases where the resume happens multiple times. It can happen internally and it can also come from server. So, want to latch the server given list if available.

Comment thread connection_manager.go

c.regionSettings = utils.CloneProto(regionSettigs)
c.updateState(connectionManagerStateReconnecting)
return true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What prevents a leave: RECONNECT arriving during the initial join (state still Initial - the read loop starts in OnJoinResponse, before setConnected) from spawning the reconnect goroutine here and racing the in-flight join()? The old code gated handleDisconnect on hasConnected. setResuming has a state guard but setReconnecting returns true unconditionally.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There could be a small window there. Will look at plugging that.

Comment thread engine.go
}
}

if !e.reconnecting.CompareAndSwap(false, true) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the goroutine is between setResumed (state=Connected) and its defer e.reconnecting.Store(false) when a new transport-close arrives, setResuming succeeds, this CAS fails, and the goroutine exits, leaving state=Resuming with no worker. What re-arms recovery at that point? setResuming will refuse every subsequent resume-type event (state != Connected), and a leave: RECONNECT needs a signal connection we no longer have.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants