MANTA-5525 Reduce SRV no-records retry TTL from 3600s to 60s #161

Open
cneira wants to merge 3 commits into master from accesskey-per-bucket

@cneira cneira commented Apr 28, 2026

The previous default of 3600s (1 hour) was chosen with the assumption that "there are probably no SRV records to be had" so retrying frequently was considered wasteful.

That assumption is reasonable on its own, but services like authcache have no _http._tcp SRV records and rely only on A records for discovery. The initial connection works fine: cueball queries SRV, gets no records, immediately falls through to A records, and connects.

The problem occurs when the existing A record connections are lost (binder restart, ZK registrar stall, network hiccup). Cueball needs to rediscover the service, but its state machine always restarts from the SRV step. The SRV retry timer set during the previous cycle says "don't retry SRV for another N seconds." Until that timer expires, the pool sits in the sleep state and cannot restart the SRV → AAAA → A discovery chain. With the 3600s default, the pool can stay in the "failed" state, rejecting all requests, for up to one hour.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
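
To make the worst case in the description above concrete, here is a minimal sketch of the timing. It is a simplified model and not cueball's resolver code: the function name and its parameters are assumptions made for illustration, and all times are in seconds.

```javascript
'use strict';

// Hypothetical model of the stall described above (not cueball's code).
//
// srvCheckedAt:      time at which the last SRV lookup returned no records
// srvRetrySeconds:   how long the resolver waits before retrying SRV
// connectionsLostAt: time at which the existing A record connections died
//
// Returns how long the pool would wait before it can restart the
// SRV -> AAAA -> A discovery chain.
function secondsUntilRediscovery(srvCheckedAt, srvRetrySeconds,
    connectionsLostAt) {
    const retryAt = srvCheckedAt + srvRetrySeconds;
    // If the retry timer has already expired, rediscovery can start at once.
    return Math.max(0, retryAt - connectionsLostAt);
}

module.exports = { secondsUntilRediscovery };
```

Under this model, losing connections 10 seconds after the SRV lookup means a wait of 3590 seconds with the old 3600s default, and 50 seconds with a 60s retry interval.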

When an SRV lookup returns NXDOMAIN or NODATA, the resolver
correctly falls through to A/AAAA resolution but schedules the
next SRV retry 3600 seconds later.  If A record resolution also
experiences a transient failure during this window, the connection
pool enters a "failed" state from which it cannot recover until
the SRV timer expires — up to one hour.

In deployments where services register only A records and not SRV
records, this creates a persistent vulnerability: every resolver
startup opens a one-hour window during which a brief DNS disruption
can cause a complete pool outage requiring a service restart to
recover.

This change introduces a configurable `srvTTL` constructor option
(default: 60 seconds) that controls the retry interval when SRV
returns NXDOMAIN or NODATA.  The NOTIMP case, where the nameserver
does not support SRV queries, retains the existing 3600-second
TTL.  SOA TTL from the DNS response continues to take precedence
when available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
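
The retry-interval rules in the commit message above can be sketched as a small selection function. This is an illustration only, not the code in lib/resolver.js: the function name, the outcome strings, and the `soaTTL` field are hypothetical, and it assumes the SOA TTL override applies only to the NXDOMAIN and NODATA cases. Only the `srvTTL` option name, its 60-second default, and the 3600-second NOTIMP value come from the commit message.

```javascript
'use strict';

// Values taken from the commit message; constant names are hypothetical.
const NOTIMP_TTL_SECONDS = 3600;
const DEFAULT_SRV_TTL_SECONDS = 60;

// Pick how many seconds to wait before the next SRV lookup.
//
// outcome:     'NXDOMAIN', 'NODATA' or 'NOTIMP' (hypothetical labels)
// opts.srvTTL: configured retry interval for the no-records cases
// opts.soaTTL: TTL from the SOA record in the DNS response, if any
function srvRetrySeconds(outcome, opts) {
    opts = opts || {};
    const srvTTL = (typeof opts.srvTTL === 'number') ?
        opts.srvTTL : DEFAULT_SRV_TTL_SECONDS;

    switch (outcome) {
    case 'NOTIMP':
        // The nameserver does not support SRV queries: keep the long TTL.
        return NOTIMP_TTL_SECONDS;
    case 'NXDOMAIN':
    case 'NODATA':
        // Assumption: an SOA TTL in the response wins over the option.
        if (typeof opts.soaTTL === 'number') {
            return opts.soaTTL;
        }
        return srvTTL;
    default:
        throw new Error('unhandled SRV outcome: ' + outcome);
    }
}

module.exports = { srvRetrySeconds };
```

With no options, an NXDOMAIN result yields 60 seconds under this sketch. Passing `{ srvTTL: 30 }` yields 30, and a response carrying an SOA TTL of 300 yields 300 regardless of `srvTTL`.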
@cneira cneira requested a review from a team April 28, 2026 17:38
Member

@travispaul travispaul left a comment

The only minor nits I can add are a version bump and a changelog.adoc update.

danmcd previously approved these changes May 4, 2026
@danmcd danmcd left a comment

Too-narrow code aside, this makes sense.

@danmcd danmcd left a comment

The comment reformatting is great, thank you! The code one has me confused: maybe tabs vs. spaces, or some other indentation subtlety I'm missing?

Comment thread lib/resolver.js
Comment thread lib/resolver.js
@danmcd danmcd left a comment

Nothing that's really blocking at this point. As with the others, make sure Travis has a look or has his issues resolved.

Comment thread lib/resolver.js

3 participants