Make RedisTimeSeries metrics export resilient to transient errors by lerman25 · Pull Request #541 · redis-performance/redisbench-admin

lerman25 · 2026-05-18T10:38:19Z

Summary

Make the post-benchmark RedisTimeSeries metrics push resilient to transient transport-level errors (e.g. TCP "Connection reset by peer" / Errno 104) from the RTS sidecar.

We have been observing weekly CI runs (RediSearch Weekly Flow) fail repeatedly with stack traces like:

redis.exceptions.ConnectionError: Connection closed by server / Connection reset by peer
  at exporter_create_ts -> rts.exists(...)
  at push_data_to_redistimeseries
  at export_redis_metrics
  at run_remote_command_logic

The benchmarks themselves complete successfully, but the post-run metrics export raises and the whole run is marked failed. Example: https://github.com/RediSearch/RediSearch/actions/runs/26001993346/job/76427288314 (and ~8 consecutive weekly runs before it).

Changes

Three coordinated changes:

1. Resilient rts client (run_remote/run_remote.py)

Replace the no-op retry_on_timeout=True (which does nothing without an explicit Retry object attached) with a real redis-py retry policy plus reasonable socket timeouts:

rts_retry = Retry(ExponentialBackoff(cap=10, base=1), retries=5)
rts = redis.Redis(
    ...,
    socket_timeout=30,
    socket_connect_timeout=10,
    health_check_interval=30,
    retry=rts_retry,
    retry_on_error=[redis.exceptions.ConnectionError, redis.exceptions.TimeoutError],
)

redis-py now transparently reconnects across transient ConnectionError / TimeoutError events, which covers the vast majority of the RTS-side TCP resets observed in CI.

2. Opt-in non-fatal export (run_remote/run_remote.py + run/args.py)

Once the retry layer is exhausted, both ConnectionError and TimeoutError are caught (was: ConnectionError only) and the fatal raise Exception(...) is gated behind a new opt-in flag:

--continue-on-redistimeseries-export-error      (or CONTINUE_ON_RTS_EXPORT_ERROR=1)

Default behavior is unchanged -- the call still raises -- so this is fully backward compatible. CI pipelines that want best-effort metrics export (benchmark success should not depend on sidecar uptime) can opt in.

The flag is added to common_run_args so it is inherited by every entrypoint that already accepts --push_results_redistimeseries.

3. Per-timeseries fault tolerance (utils/remote.py)

push_data_to_redistimeseries already swallowed TimeoutError per-timeseries; it now also swallows ConnectionError so a single transient hiccup mid-batch only increments the error counter instead of aborting the whole dict (which currently short-circuits all remaining timeseries via the raise in exporter_create_ts).

Verification

python3 -m py_compile on all three edited files.
Retry / ExponentialBackoff imports verified against redis-py 6.4.0 (well inside the redis = "^4.2.2" range in pyproject.toml).
Default code path unchanged -- existing callers see identical behavior unless they pass the new flag.

Follow-up (downstream)

Once this lands and a new redisbench-admin is cut, the consuming repos (e.g. RediSearch tests/benchmarks/requirements.txt + .github/workflows/benchmark-runner.yml) can bump the version pin and pass --continue-on-redistimeseries-export-error to stop letting RTS-sidecar flakiness fail the entire weekly benchmark run.

Co-authored by Augment Code

The post-benchmark push to the RedisTimeSeries sidecar has been failing intermittently with 'Connection reset by peer' (Errno 104) and similar transport-level errors. The benchmarks themselves complete successfully, but the metrics export raises and the whole run is marked as failed. This makes the export resilient in three coordinated ways: 1. run_remote/run_remote.py - the 'rts' client used for the post-test metrics push now configures an explicit redis-py Retry policy with exponential backoff (5 retries, ExponentialBackoff(cap=10, base=1)) and reasonable socket timeouts, retrying on ConnectionError and TimeoutError. The previous 'retry_on_timeout=True' alone is a no-op without a Retry object attached, so the client effectively had zero retries. 2. run_remote/run_remote.py - the post-test exception handler now also catches TimeoutError (both are surfaced once the retry layer is exhausted) and gates the fatal 'raise Exception(...)' behind a new opt-in flag '--continue-on-redistimeseries-export-error' (also settable via CONTINUE_ON_RTS_EXPORT_ERROR=1). Default behaviour is unchanged - the call still raises - so this is backward compatible. 3. utils/remote.py - push_data_to_redistimeseries already swallowed TimeoutError per timeseries; it now also swallows ConnectionError so a single transient hiccup mid-batch only increments the error counter instead of aborting the whole dict. The added CLI flag lives on common_run_args so it is inherited by all sub-commands that already accept --push_results_redistimeseries.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make RedisTimeSeries metrics export resilient to transient errors#541

Make RedisTimeSeries metrics export resilient to transient errors#541
lerman25 wants to merge 1 commit into
redis-performance:masterfrom
lerman25:mod-resilient-rts-export

lerman25 commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lerman25 commented May 18, 2026

Summary

Changes

1. Resilient rts client (run_remote/run_remote.py)

2. Opt-in non-fatal export (run_remote/run_remote.py + run/args.py)

3. Per-timeseries fault tolerance (utils/remote.py)

Verification

Follow-up (downstream)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant