MANTA-5525 Reduce SRV no-records retry TTL from 3600s to 60s #161
Open
When an SRV lookup returns NXDOMAIN or NODATA, the resolver correctly falls through to A/AAAA resolution but schedules the next SRV retry 3600 seconds later. If A record resolution also experiences a transient failure during this window, the connection pool enters a "failed" state from which it cannot recover until the SRV timer expires, which can take up to one hour.

In deployments where services register only A records and not SRV records, this creates a persistent vulnerability: every resolver startup opens a one-hour window during which a brief DNS disruption can cause a complete pool outage requiring a service restart to recover.

This change introduces a configurable `srvTTL` constructor option (default: 60 seconds) that controls the retry interval when SRV returns NXDOMAIN or NODATA. The NOTIMP case, where the nameserver does not support SRV queries, retains the existing 3600-second TTL. SOA TTL from the DNS response continues to take precedence when available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
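The retry-interval rules described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the function name `srvRetryDelay`, the `soaTtl` field, and the constant names are hypothetical, and it assumes the SOA TTL only applies to the NXDOMAIN/NODATA cases. Only the `srvTTL` option name and the 60s/3600s values come from the description.

```javascript
// Hypothetical constants; the names are illustrative, not cueball's.
const SRV_NOT_SUPPORTED_TTL = 3600; // NOTIMP: nameserver cannot answer SRV
const DEFAULT_SRV_TTL = 60;         // new default for NXDOMAIN / NODATA

/*
 * Pick the delay, in seconds, before the next SRV retry.
 *   code        - 'NXDOMAIN', 'NODATA' or 'NOTIMP'
 *   opts.srvTTL - configured retry interval (defaults to 60)
 *   opts.soaTtl - TTL taken from an SOA record in the response, if any
 *                 (assumed field name)
 */
function srvRetryDelay(code, opts) {
    opts = opts || {};
    const srvTTL = (opts.srvTTL !== undefined) ?
        opts.srvTTL : DEFAULT_SRV_TTL;

    if (code === 'NOTIMP') {
        // SRV is unsupported here, so retrying often is pointless.
        return SRV_NOT_SUPPORTED_TTL;
    }
    if (code === 'NXDOMAIN' || code === 'NODATA') {
        // An SOA TTL in the response takes precedence when available.
        if (typeof opts.soaTtl === 'number') {
            return opts.soaTtl;
        }
        return srvTTL;
    }
    throw new Error('unexpected SRV result code: ' + code);
}
```

For example, `srvRetryDelay('NODATA', { srvTTL: 30 })` would use the configured 30 seconds, while `srvRetryDelay('NOTIMP', { srvTTL: 30 })` would still wait the full 3600 seconds.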
travispaul
reviewed
May 1, 2026
danmcd
previously approved these changes
May 4, 2026
danmcd
left a comment
Too-narrow-code aside this makes sense.
danmcd
reviewed
May 4, 2026
danmcd
left a comment
Comment reformatting is great, thank you! Code one has me confused, maybe tabs vs. spaces, or some other indentation subtlety I'm missing?
danmcd
approved these changes
May 8, 2026
danmcd
left a comment
Nothing that's really blocking at this point. As with others make sure Travis has a look or has his issues resolved.
The previous default of 3600s (1 hour) was chosen with the assumption that "there are probably no SRV records to be had" so retrying frequently was considered wasteful.
This is not bad by itself, but services like authcache don't have _http._tcp SRV records and rely only on A records for discovery. On the initial connection this works fine: cueball queries SRV, gets no records, immediately falls through to A records, and connects.
The problem occurs when the existing A record connections are lost (binder restart, ZK registrar stall, network hiccup). Cueball needs to rediscover the service, but its state machine always restarts from the SRV step. The SRV retry timer set during the previous cycle says "don't retry SRV for another N seconds." Until that timer expires, the pool sits in sleep state and cannot restart the SRV → AAAA → A discovery chain. With the 3600s default, this means up to one hour where the pool is in "failed" state and all requests are rejected.
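To make the size of that window concrete, here is a small arithmetic sketch. The function `secondsUntilSrvRetry` and its parameters are hypothetical and only model the timer described above; they are not part of cueball.

```javascript
/*
 * Seconds the pool must still wait before it may restart discovery,
 * given when the last SRV lookup happened and the SRV retry TTL.
 * All arguments are in seconds; the names are illustrative only.
 */
function secondsUntilSrvRetry(srvTtl, lastSrvLookupAt, now) {
    return Math.max(0, lastSrvLookupAt + srvTtl - now);
}

// SRV lookup at t=0 returned no records; A record connections
// are lost 10 seconds later.
const waitOld = secondsUntilSrvRetry(3600, 0, 10); // 3590 seconds
const waitNew = secondsUntilSrvRetry(60, 0, 10);   // 50 seconds
```

Under these assumptions, the same disruption leaves the pool waiting almost an hour with the old default and under a minute with the new one.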