fix(github-sync): retry transient network errors during GraphQL batch #144

Merged
newstler merged 1 commit into main from fix/github-sync-transient-errors on Apr 15, 2026

@newstler
Owner

Summary

Fixes a production EOFError ("end of file reached") raised from UpdateGithubDataJob in User::GithubSyncable#github_graphql_request.

Root cause

GitHub's GraphQL endpoint occasionally closes connections mid-response, surfacing in Ruby as EOFError (and siblings like Errno::ECONNRESET, OpenSSL::SSL::SSLError). The retry loop in github_graphql_request only rescued Net::OpenTimeout / Net::ReadTimeout, so any connection reset escaped the rescue and failed the entire batch of users without retrying. The UpdateGithubDataJob would bail out partway through, leaving users un-synced.

Fix

Extracted the set of transient network errors into a TRANSIENT_NETWORK_ERRORS constant and rescued them uniformly in the retry loop, so they trigger the existing backoff/retry (2 attempts, 2s sleep) just as timeouts already did. This matches the user's request: don't skip the error; make it work by actually retrying.
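A minimal sketch of the retry pattern described above, not the PR's actual diff. The constant name and the "2 attempts, 2s sleep" figures come from the description; the list of classes is assembled from those named in this PR, and the helper name `with_transient_retry`, its keyword arguments, and the injectable `sleeper` are hypothetical. Unlike the real method, which reportedly returns a network error once retries are exhausted, this sketch simply re-raises.

```ruby
require "net/http"
require "openssl"

# Assumed contents: the classes named in this PR. The real constant may differ.
TRANSIENT_NETWORK_ERRORS = [
  Net::OpenTimeout,
  Net::ReadTimeout,
  EOFError,
  Errno::ECONNRESET,
  OpenSSL::SSL::SSLError
].freeze

# Hypothetical helper: runs the block, retrying on transient network errors.
# `sleeper` is injectable so callers (and tests) can skip the real delay.
def with_transient_retry(max_attempts: 2, delay: 2, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    attempts += 1
    yield
  rescue *TRANSIENT_NETWORK_ERRORS
    # Out of attempts: let the error propagate to the caller.
    raise if attempts >= max_attempts
    sleeper.call(delay)
    retry
  end
end
```

Splatting the constant into `rescue` keeps the list of retryable classes in one place, so adding another transient error later means editing only the constant.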

Test plan

  • New test/models/concerns/user/github_syncable_test.rb with two cases:
    • batch_sync_github_data! retries on EOFError and succeeds on the second attempt
    • batch_sync_github_data! returns a network error after exhausting retries
  • Confirmed both tests fail with the production error when the old rescue clause is restored.
  • rails test — full suite (334 tests) passes.
  • rubocop clean.
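The two cases in the test plan could look roughly like the following Minitest sketch. It exercises a hypothetical stand-in (`fetch_with_retry`) rather than the real `batch_sync_github_data!`, whose signature and return shape are not shown in this PR; the `{ error: ... }` hash is an assumed format, and the sleep between attempts is omitted to keep the tests fast.

```ruby
require "minitest/autorun"

# Hypothetical stand-in for the concern's retry loop; not the PR's code.
TRANSIENT_ERRORS = [EOFError, Errno::ECONNRESET].freeze

def fetch_with_retry(attempts: 2)
  tries = 0
  begin
    tries += 1
    yield
  rescue *TRANSIENT_ERRORS => e
    retry if tries < attempts
    # Assumed error shape once retries are exhausted.
    { error: "network error: #{e.class}" }
  end
end

class RetrySketchTest < Minitest::Test
  def test_retries_on_eof_error_and_succeeds_on_second_attempt
    calls = 0
    result = fetch_with_retry do
      calls += 1
      raise EOFError, "end of file reached" if calls == 1
      { data: "ok" }
    end
    assert_equal({ data: "ok" }, result)
    assert_equal 2, calls
  end

  def test_returns_network_error_after_exhausting_retries
    calls = 0
    result = fetch_with_retry do
      calls += 1
      raise EOFError, "end of file reached"
    end
    assert_equal 2, calls
    assert_match(/network error/, result[:error])
  end
end
```

Counting calls inside the block is what distinguishes a real retry from a swallowed error: the first test passes only if the block ran twice.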

🤖 Generated with Claude Code

GitHub's GraphQL endpoint occasionally closes connections mid-response,
raising EOFError (and siblings like Errno::ECONNRESET / OpenSSL::SSL::SSLError)
out of Net::HTTP. The retry loop in User::GithubSyncable#github_graphql_request
only caught Net::OpenTimeout/Net::ReadTimeout, so any connection reset
escaped the rescue and failed the entire batch of users without retrying —
observed in production as EOFError "end of file reached" from
UpdateGithubDataJob.

Extract the full set of transient network error classes into
TRANSIENT_NETWORK_ERRORS and rescue them uniformly so they trigger the
existing backoff/retry instead of aborting the batch.

Added regression tests covering both paths: retry-then-succeed and
retries-exhausted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@newstler merged commit 7c6b8d7 into main on Apr 15, 2026
5 checks passed