Skip to content

Reconnect control stream on network errors instead of terminating#1201

Open
Etschmia wants to merge 3 commits intolichess-bot-devs:masterfrom
Etschmia:fix/control-stream-reconnect
Open

Reconnect control stream on network errors instead of terminating#1201
Etschmia wants to merge 3 commits intolichess-bot-devs:masterfrom
Etschmia:fix/control-stream-reconnect

Conversation

@Etschmia
Copy link
Copy Markdown

Problem

watch_control_stream handles all exceptions with break, which exits the reconnect loop, sends {"type": "terminated"} to the control queue, and triggers a full bot restart (including a 10-second sleep in the outer restart loop). Any transient network error — including a read timeout on a TCP connection stuck in CLOSE_WAIT — therefore takes the bot offline for 10+ seconds.

Related discussion: #1200

Fix

On exception, log at WARNING level and continue the loop (1-second backoff) instead of break. The stream reconnects immediately without interrupting active game processes or requiring a full restart.

Only stop.terminated (SIGINT) causes the loop to exit and send "terminated", which is the intended shutdown path.

Before / After

# before
except Exception:
    error = traceback.format_exc()
    break                            # kills reconnect loop → full restart

# after
except Exception:
    if stop.terminated:
        break
    logger.warning(f"Control stream error, reconnecting:\n{traceback.format_exc()}")
    time.sleep(1)                    # brief backoff, then reconnect

Notes

  • time and traceback are already imported at the top of lichess_bot.py
  • The error field in the "terminated" message was only ever populated on the exception path; it is now always None since exceptions are handled by reconnecting
  • Existing tests pass unchanged since watch_control_stream is exercised through the fake Lichess server which closes streams cleanly (no exception path hit)

When watch_control_stream catches an exception (e.g. read timeout or
TCP CLOSE_WAIT without clean EOF), it previously broke out of the
reconnect loop and sent a "terminated" event, triggering a full
10-second bot restart. Any transient network error therefore took the
bot offline for 10+ seconds.

Now the exception is logged at WARNING level and the loop continues
immediately (with a 1-second backoff), so the control stream reconnects
without interrupting active games or requiring a process restart.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@AttackingOrDefending
Copy link
Copy Markdown
Member

Can you fix the ruff errors?

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@MarkZH
Copy link
Copy Markdown
Collaborator

MarkZH commented Apr 30, 2026

I've let this run for almost three days on my computer, and the exception path was never taken. This PR seems fine, but I don't have a way to test it with actual lichess sessions.

Comment thread lib/lichess_bot.py Outdated
except Exception:
error = traceback.format_exc()
break
if stop.terminated:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch seems redundant with the while-loop condition. Can it be removed?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, removed in 66b5167.

Comment thread lib/lichess_bot.py Outdated
time.sleep(1)

control_queue.put_nowait({"type": "terminated", "error": error})
control_queue.put_nowait({"type": "terminated", "error": None})
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the error field is always None, then there's no point in sending it. Remove , "error": None and the logger.debug() call in line 382.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, removed both in 66b5167.

The `if stop.terminated: break` inside the except branch is redundant —
the surrounding `while not stop.terminated` already exits on the next
iteration. Remove it.

The `error` field in the "terminated" message is always None now that
exceptions are handled by reconnecting, so drop the field and the
matching `logger.debug` that read it.
@Etschmia
Copy link
Copy Markdown
Author

Etschmia commented May 1, 2026

Thanks for testing this! On my host the stuck-CLOSE_WAIT exception happened ~once a week — it likely needs flaky network conditions to surface, so a clean test environment won't see it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants