Skip to content

fix(clickhouse): retry reads on transient connection errors#4159

Draft
claude[bot] wants to merge 2 commits into
mainfrom
fix/clickhouse-read-connection-retry
Draft

fix(clickhouse): retry reads on transient connection errors#4159
claude[bot] wants to merge 2 commits into
mainfrom
fix/clickhouse-read-connection-retry

Conversation

@claude

@claude claude Bot commented Jul 5, 2026

Copy link
Copy Markdown
Contributor

Requested by Dan Sutton · Slack thread

✅ Checklist

  • I have followed every step in the contributing guide
  • The PR title follows the convention.
  • I ran and tested the code works

Testing

Added client.retry.test.ts — 13 unit tests (no Docker required) covering:

  • retry-then-success on a transient connection error
  • bounded give-up after the configured number of attempts
  • non-retryable server errors (ClickHouseError) failing fast without retry
  • the isRetryableConnectionError classifier (socket code / message matching, cause unwrapping)

Changelog

Before: The ClickHouse read helpers (query, queryWithStats, queryFast) executed the underlying client.query exactly once. If the call failed with a transient connection error — most often a dead keep-alive socket surfacing as ECONNRESET on first use of a pooled connection, but also EPIPE / ETIMEDOUT / socket hang up — the read failed immediately, even though a retry on a fresh socket would have succeeded. Separately, the keepAlive.idleSocketTtl config was silently ignored because @clickhouse/client expects the snake_case key idle_socket_ttl.

After: Reads transparently retry on transient connection errors with a small bounded backoff (default 3 attempts, full-jitter exponential). Server-side errors (e.g. SQL errors / ClickHouseError) are never retried and still fail fast. The idle-socket TTL is now honored.

How: Added isRetryableConnectionError (classifies by socket code / message, unwraps cause, excludes ClickHouseError) and a queryWithConnectionRetry wrapper around the three read client.query calls in internal-packages/clickhouse/src/client/client.ts. Retry attempts/backoff are configurable via a new ClickhouseConfig.connectionRetry option. Fixed the keep-alive mapping to idle_socket_ttl (confirmed against @clickhouse/client@1.12.1 types).


Screenshots

N/A

💯

ClickHouse reads (query, queryWithStats, queryFast) ran the underlying
client.query once and, on any failure, returned a QueryError with no
retry. Transient connection errors — most commonly a dead keep-alive
socket surfacing as ECONNRESET on the first use of a pooled connection,
but also EPIPE/ETIMEDOUT/"socket hang up" — are safe to retry and
should not fail the request.

- Add isRetryableConnectionError to classify transient connection
  errors by socket code or message substring, unwrapping a nested
  cause. Server-side ClickHouseErrors (e.g. SQL errors) are never
  classified as retryable.
- Wrap the read client.query calls in a bounded retry (default 3
  attempts, full-jitter exponential backoff). Attempt count and delays
  are configurable via ClickhouseConfig.connectionRetry so callers and
  tests can tune or disable the backoff.
- Fix the keep-alive config mapping: @clickhouse/client expects
  snake_case idle_socket_ttl, so the camelCase idleSocketTtl was being
  silently dropped and the idle-socket TTL never honored.
- Add unit tests covering retry-then-success, bounded give-up,
  non-retryable server errors, and error classification.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01B5mP4ULoVXdrAWmiMQh5bG
@changeset-bot

changeset-bot Bot commented Jul 5, 2026

Copy link
Copy Markdown

⚠️ No Changeset found

Latest commit: 3a43479

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@claude claude Bot requested a review from d-cs July 5, 2026 10:50
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01B5mP4ULoVXdrAWmiMQh5bG
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant