Reconnect control stream on network errors instead of terminating#1201
Open
Etschmia wants to merge 3 commits intolichess-bot-devs:masterfrom
Open
Reconnect control stream on network errors instead of terminating#1201Etschmia wants to merge 3 commits intolichess-bot-devs:masterfrom
Etschmia wants to merge 3 commits intolichess-bot-devs:masterfrom
Conversation
When watch_control_stream catches an exception (e.g. read timeout or TCP CLOSE_WAIT without clean EOF), it previously broke out of the reconnect loop and sent a "terminated" event, triggering a full 10-second bot restart. Any transient network error therefore took the bot offline for 10+ seconds. Now the exception is logged at WARNING level and the loop continues immediately (with a 1-second backoff), so the control stream reconnects without interrupting active games or requiring a process restart. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Member
|
Can you fix the ruff errors? |
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Collaborator
|
I've let this run for almost three days on my computer, and the exception path was never taken. This PR seems fine, but I don't have a way to test it with actual lichess sessions. |
MarkZH
reviewed
Apr 30, 2026
| except Exception: | ||
| error = traceback.format_exc() | ||
| break | ||
| if stop.terminated: |
Collaborator
There was a problem hiding this comment.
This branch seems redundant with the while-loop condition. Can it be removed?
MarkZH
reviewed
Apr 30, 2026
| time.sleep(1) | ||
|
|
||
| control_queue.put_nowait({"type": "terminated", "error": error}) | ||
| control_queue.put_nowait({"type": "terminated", "error": None}) |
Collaborator
There was a problem hiding this comment.
If the error field is always None, then there's no point in sending it. Remove , "error": None and the logger.debug() call in line 382.
The `if stop.terminated: break` inside the except branch is redundant — the surrounding `while not stop.terminated` already exits on the next iteration. Remove it. The `error` field in the "terminated" message is always None now that exceptions are handled by reconnecting, so drop the field and the matching `logger.debug` that read it.
Author
|
Thanks for testing this! On my host the stuck-CLOSE_WAIT exception happened ~once a week — it likely needs flaky network conditions to surface, so a clean test environment won't see it. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
watch_control_streamhandles all exceptions withbreak, which exits the reconnect loop, sends{"type": "terminated"}to the control queue, and triggers a full bot restart (including a 10-second sleep in the outer restart loop). Any transient network error — including a read timeout on a TCP connection stuck in CLOSE_WAIT — therefore takes the bot offline for 10+ seconds.Related discussion: #1200
Fix
On exception, log at WARNING level and
continuethe loop (1-second backoff) instead ofbreak. The stream reconnects immediately without interrupting active game processes or requiring a full restart.Only
stop.terminated(SIGINT) causes the loop to exit and send"terminated", which is the intended shutdown path.Before / After
Notes
timeandtracebackare already imported at the top oflichess_bot.pyerrorfield in the"terminated"message was only ever populated on the exception path; it is now alwaysNonesince exceptions are handled by reconnectingwatch_control_streamis exercised through the fake Lichess server which closes streams cleanly (no exception path hit)