Fix connect attempt retries #13102
Conversation
Now this PR is ready. I removed the dependency on #13083; we can do that independently.
Pull request overview
This PR fixes origin connect-attempt retry behavior so it consistently honors the active HostDBInfo health state (UP / SUSPECT / DOWN), and updates AuTest replay coverage to validate the corrected single-host and round-robin behavior.
Changes:
- Add `HttpTransact::origin_server_connect_attempts_max_retries(State*)` and refactor server connect failure handling to use state-aware retry limits.
- Fix round-robin selection to skip DOWN targets while allowing SUSPECT targets (`ResolveInfo::select_next_rr(now, fail_window)`), and update HostDB failure timestamping.
- Add/adjust gold tests to cover single-record and RR retry behavior and updated failure-count thresholds.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `src/proxy/http/HttpTransact.cc` | Adds state-aware retry helper; refactors retry path and RR switching logic. |
| `include/proxy/http/HttpTransact.h` | Declares new helper and updates retry helper signature. |
| `src/proxy/http/HttpSM.cc` | Updates HostDB failure timestamping (`ts_clock::now()`); adjusts DOWN threshold logic. |
| `src/proxy/http/HttpConfig.cc` | Adds a config Warning related to RR + down-server retry configuration. |
| `src/iocore/hostdb/HostDB.cc` | Updates RR selection to be time/window aware and skip DOWN targets. |
| `include/iocore/hostdb/HostDBProcessor.h` | Updates `ResolveInfo::select_next_rr` signature to accept `(now, fail_window)`. |
| `tests/gold_tests/dns/replay/connect_attempts_single_max_retries.replay.yaml` | New replay to cover single-DNS-record retry/down-server behavior. |
| `tests/gold_tests/dns/replay/connect_attempts_rr_retries.replay.yaml` | Updates RR retry scenario to match corrected behavior. |
| `tests/gold_tests/dns/gold/connect_attempts_single_max_retries_error_log.gold` | New gold output for the single-record replay. |
| `tests/gold_tests/dns/gold/connect_attempts_rr_retries_error_log.gold` | Updates gold output to reflect corrected RR behavior. |
| `tests/gold_tests/dns/gold/connect_attempts_rr_max_retries_error_log.gold` | Updates gold output for new fail-count threshold. |
| `tests/gold_tests/dns/connect_attempts.test.py` | Registers the new replay test. |
| `tests/gold_tests/autest-site/verifier_client.test.ext` | Adds `poll_timeout` support to verifier-client process config for replays. |
```cpp
unsigned max_connect_retries = s->txn_conf->connect_attempts_max_retries;
TxnDbg(dbg_ctl_http_trans, "max_connect_retries: %d s->current.retry_attempts: %d", max_connect_retries,
       s->current.retry_attempts.get());

if (is_request_retryable(s) && s->current.retry_attempts.get() < max_connect_retries &&
    !HttpTransact::is_response_valid(s, &s->hdr_info.server_response)) {
  // If this is a round robin DNS entry & we've tried the configured
  // number of times, we should try another node
  if (ResolveInfo::OS_Addr::TRY_CLIENT == s->dns_info.os_addr_style) {
    // attempt was based on client supplied server address. Try again using HostDB.
    // Allow DNS attempt
    s->dns_info.resolved_p = false;
    // See if we can get data from HostDB for this.
    s->dns_info.os_addr_style = ResolveInfo::OS_Addr::TRY_HOSTDB;
    // Force host resolution to have the same family as the client.
    // Because this is a transparent connection, we can't switch address
    // families - that is locked in by the client source address.
    ats_force_order_by_family(s->current.server->dst_addr.family(), s->my_txn_conf().host_res_data.order);
    return CallOSDNSLookup(s);
  } else if (ResolveInfo::OS_Addr::USE_API == s->dns_info.os_addr_style && !s->api_server_addr_set_retried) {
    // Plugin set the server address via TSHttpTxnServerAddrSet(). Clear resolution
    // state to allow the OS_DNS hook to be called again, giving the plugin a chance
    // to set a different server address for retry (issue #12611).
    // Only retry once to avoid infinite loops if the plugin keeps setting failing addresses.
    s->api_server_addr_set_retried = true;
    s->dns_info.resolved_p = false;
    s->dns_info.os_addr_style = ResolveInfo::OS_Addr::TRY_DEFAULT;
    // Clear the server request so it can be rebuilt for the new destination
    s->hdr_info.server_request.destroy();
    TxnDbg(dbg_ctl_http_trans, "Retrying with plugin-set address, returning to OS_DNS hook");
    return CallOSDNSLookup(s);
  } else {
    if ((s->txn_conf->connect_attempts_rr_retries > 0) &&
        ((s->current.retry_attempts.get() + 1) % s->txn_conf->connect_attempts_rr_retries == 0)) {
      s->dns_info.select_next_rr();
    }
    retry_server_connection_not_open(s, s->current.state, max_connect_retries);
    TxnDbg(dbg_ctl_http_trans, "Error. Retrying...");
    s->next_action = how_to_open_connection(s);
  }
} else {
  // Bail out if the request is not retryable, the global retry cap is reached, or we already have a usable response.
  if (!is_request_retryable(s) || s->current.retry_attempts.get() >= max_connect_retries ||
      HttpTransact::is_response_valid(s, &s->hdr_info.server_response)) {
    TxnDbg(dbg_ctl_http_trans, "Error. No more retries. %d/%d", s->current.retry_attempts.get(), max_connect_retries);
```
The retry cap here is initialized from connect_attempts_max_retries unconditionally, but the PR description says retry limits should be state-aware (UP/SUSPECT/DOWN). Using the UP config for the early bailout check can allow extra retries (or plugin/client-address re-resolution retries) beyond the intended per-host cap when the active HostDBInfo is SUSPECT/DOWN. Consider initializing max_connect_retries using origin_server_connect_attempts_max_retries(s) (with a deliberate fallback for non-HostDB targets, if needed).
Here, we need to use connect_attempts_max_retries unconditionally to give the round-robin case a chance, since it can select a new HostDBInfo in UP state. The state-aware limit is eventually set by line 4077:

```cpp
// The active target (HostDB) may be in SUSPECT state, so re-evaluate the retry limit.
max_connect_retries = origin_server_connect_attempts_max_retries(s);
```
Copilot comments are addressed.
Summary
We have three configs for connect attempt retries; prior to this change, none of them was implemented correctly. This PR makes the retry path honor the HostDBInfo state across the board.

- `connect_attempts_max_retries`
- `connect_attempts_max_retries_down_server`
- `connect_attempts_rr_retries`

Changes
- New helper `HttpTransact::origin_server_connect_attempts_max_retries(State*)` returns the retry limit based on the active `HostDBInfo::State` (`connect_attempts_max_retries` or `connect_attempts_max_retries_down_server`).
- `HttpTransact::handle_response_from_server` is refactored and the retry logic is fixed.
- Fix round-robin logic in `ResolveInfo::select_next_rr(now, fail_window)`.
- `HttpSM::mark_host_failure` now uses `origin_server_connect_attempts_max_retries(s) + 1` as the DOWN threshold (it was incorrectly using `connect_attempts_rr_retries`).
- `HttpSM::do_hostdb_update_if_necessary` records failure time via `ts_clock::now()` instead of `client_request_time`, so multiple connect attempts in one transaction get distinct timestamps.

Tests
- `connect_attempts_single_max_retries.replay.yaml` exercises `connect_attempts_max_retries` and `connect_attempts_max_retries_down_server` for a single DNS record (no round-robin case).

Dependency
This PR is in draft until the PRs below are merged.

- Cleanup serving stale while origin server down #13083 (update: this dependency is gone)
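For reference, the three retry settings discussed in the summary are exposed as `proxy.config.http.*` records. A minimal sketch assuming the current ATS `records.yaml` layout; the key names and values shown are illustrative and should be verified against your version's documentation:

```yaml
records:
  http:
    # Retry cap against a target that is not marked DOWN.
    connect_attempts_max_retries: 3
    # Tighter cap once the target has been marked DOWN.
    connect_attempts_max_retries_down_server: 1
    # Advance to the next round-robin target after this many retries.
    connect_attempts_rr_retries: 2
```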