Make RedisTimeSeries metrics export resilient to transient errors#541
Draft
lerman25 wants to merge 1 commit into
Draft
Make RedisTimeSeries metrics export resilient to transient errors#541lerman25 wants to merge 1 commit into
lerman25 wants to merge 1 commit into
Conversation
The post-benchmark push to the RedisTimeSeries sidecar has been failing intermittently with 'Connection reset by peer' (Errno 104) and similar transport-level errors. The benchmarks themselves complete successfully, but the metrics export raises and the whole run is marked as failed. This makes the export resilient in three coordinated ways: 1. run_remote/run_remote.py - the 'rts' client used for the post-test metrics push now configures an explicit redis-py Retry policy with exponential backoff (5 retries, ExponentialBackoff(cap=10, base=1)) and reasonable socket timeouts, retrying on ConnectionError and TimeoutError. The previous 'retry_on_timeout=True' alone is a no-op without a Retry object attached, so the client effectively had zero retries. 2. run_remote/run_remote.py - the post-test exception handler now also catches TimeoutError (both are surfaced once the retry layer is exhausted) and gates the fatal 'raise Exception(...)' behind a new opt-in flag '--continue-on-redistimeseries-export-error' (also settable via CONTINUE_ON_RTS_EXPORT_ERROR=1). Default behaviour is unchanged - the call still raises - so this is backward compatible. 3. utils/remote.py - push_data_to_redistimeseries already swallowed TimeoutError per timeseries; it now also swallows ConnectionError so a single transient hiccup mid-batch only increments the error counter instead of aborting the whole dict. The added CLI flag lives on common_run_args so it is inherited by all sub-commands that already accept --push_results_redistimeseries.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Make the post-benchmark RedisTimeSeries metrics push resilient to transient transport-level errors (e.g. TCP "Connection reset by peer" / Errno 104) from the RTS sidecar.
We have been observing weekly CI runs (RediSearch Weekly Flow) fail repeatedly with stack traces like:
The benchmarks themselves complete successfully, but the post-run metrics export raises and the whole run is marked failed. Example: https://github.com/RediSearch/RediSearch/actions/runs/26001993346/job/76427288314 (and ~8 consecutive weekly runs before it).
Changes
Three coordinated changes:
1. Resilient rts client (run_remote/run_remote.py)
Replace the no-op retry_on_timeout=True (which does nothing without an explicit Retry object attached) with a real redis-py retry policy plus reasonable socket timeouts:
redis-py now transparently reconnects across transient ConnectionError / TimeoutError events, which covers the vast majority of the RTS-side TCP resets observed in CI.
2. Opt-in non-fatal export (run_remote/run_remote.py + run/args.py)
Once the retry layer is exhausted, both ConnectionError and TimeoutError are caught (was: ConnectionError only) and the fatal raise Exception(...) is gated behind a new opt-in flag:
Default behavior is unchanged -- the call still raises -- so this is fully backward compatible. CI pipelines that want best-effort metrics export (benchmark success should not depend on sidecar uptime) can opt in.
The flag is added to common_run_args so it is inherited by every entrypoint that already accepts --push_results_redistimeseries.
3. Per-timeseries fault tolerance (utils/remote.py)
push_data_to_redistimeseries already swallowed TimeoutError per-timeseries; it now also swallows ConnectionError so a single transient hiccup mid-batch only increments the error counter instead of aborting the whole dict (which currently short-circuits all remaining timeseries via the raise in exporter_create_ts).
Verification
Follow-up (downstream)
Once this lands and a new redisbench-admin is cut, the consuming repos (e.g. RediSearch tests/benchmarks/requirements.txt + .github/workflows/benchmark-runner.yml) can bump the version pin and pass --continue-on-redistimeseries-export-error to stop letting RTS-sidecar flakiness fail the entire weekly benchmark run.
Co-authored by Augment Code