
Handle Retry-After header for 429 and 503 in gateway prober #1475

Open
codaMW wants to merge 2 commits into payjoin:master from codaMW:fix/retry-after-gateway-prober

Conversation


@codaMW codaMW commented Apr 12, 2026

Closes #1474

When a gateway returns 429 Too Many Requests or 503 Service Unavailable,
the Retry-After response header specifies how long to wait before
retrying. Previously these responses fell back to static TTL defaults
(http_4xx = 1 week, http_5xx = 5 seconds), ignoring the
server-specified duration entirely.

This PR adds a retry_after_or helper that parses the Retry-After
header value as seconds and uses it as the cache TTL, falling back to
the configured default when the header is absent.

Pull Request Checklist

  • I have disclosed my use of AI in the body of this PR.
  • I have read CONTRIBUTING.md and rebased my branch to produce hygienic commits.

Disclosure: I consulted Claude to understand the codebase structure,
but the solution was authored by myself.

Contributor

@caarloshenriq caarloshenriq left a comment


cACK 1a85d7b

Reviewed the retry_after_or helper and the new branches for 429 and 503.
The logic is correct; the specific status checks are ordered before the
generic is_client_error() / is_server_error() catch-alls.

Note: Retry-After as an HTTP-date (per RFC 7231) is not handled and
silently falls back to the configured default, which is acceptable given the issue scope.

Added tests to verify the behavior:

  • 429 with Retry-After header respects the server-specified TTL
  • 503 with Retry-After header uses the parsed duration
  • 429 without Retry-After falls back to http_4xx default
    #[tokio::test(start_paused = true)]
    async fn test_mock_429_retry_after() {
        let mut server = Server::new_async().await;
        let url = GatewayUri::from_str(&server.url()).expect("must be able to parse mock server URL");
        let prober = Prober::default();

        let retry_after_secs = 120u64;
        let mock_429 = server
            .mock("GET", RFC_9540_GATEWAY_PATH)
            .match_query(mockito::Matcher::Regex("^allowed_purposes$".into()))
            .with_status(429)
            .with_header("Retry-After", &retry_after_secs.to_string())
            .create();

        let status = prober.check_opt_in(&url).await.expect("probing must succeed");
        mock_429.assert();
        assert!(!status.bip77_allowed, "rate-limited gateway should not be considered opt-in");

        // Advance time by less than Retry-After — should use cached result (no new request)
        tokio::time::advance(Duration::from_secs(retry_after_secs - 1)).await;
        let status = prober.check_opt_in(&url).await.expect("cached probe must succeed");
        assert!(!status.bip77_allowed, "result should still be cached before Retry-After expires");

        // Advance past Retry-After — cache should be expired, new request would be sent
        tokio::time::advance(Duration::from_secs(2)).await;
        let mock_429_again = server
            .mock("GET", RFC_9540_GATEWAY_PATH)
            .match_query(mockito::Matcher::Regex("^allowed_purposes$".into()))
            .with_status(429)
            .with_header("Retry-After", &retry_after_secs.to_string())
            .create();
        let _ = prober.check_opt_in(&url).await;
        mock_429_again.assert();
    }

    #[tokio::test]
    async fn test_mock_503_retry_after() {
        let mut server = Server::new_async().await;
        let url = GatewayUri::from_str(&server.url()).expect("must be able to parse mock server URL");
        let prober = Prober::default();

        let mock_503 = server
            .mock("GET", RFC_9540_GATEWAY_PATH)
            .match_query(mockito::Matcher::Regex("^allowed_purposes$".into()))
            .with_status(503)
            .with_header("Retry-After", "30")
            .create();

        let status = prober.check_opt_in(&url).await.expect("probing must succeed");
        mock_503.assert();
        assert!(!status.bip77_allowed, "unavailable gateway should not be considered opt-in");
    }

    #[tokio::test]
    async fn test_mock_429_no_retry_after_uses_default() {
        let mut server = Server::new_async().await;
        let url = GatewayUri::from_str(&server.url()).expect("must be able to parse mock server URL");
        let prober = Prober::default();

        // No Retry-After header — should fall back to http_4xx TTL (1 week)
        let mock_429 = server
            .mock("GET", RFC_9540_GATEWAY_PATH)
            .match_query(mockito::Matcher::Regex("^allowed_purposes$".into()))
            .with_status(429)
            .create();

        let status = prober.check_opt_in(&url).await.expect("probing must succeed");
        mock_429.assert();
        assert!(!status.bip77_allowed, "rate-limited gateway without Retry-After should not be opt-in");

        // Should be cached — no new request sent
        let status = prober.check_opt_in(&url).await.expect("cached probe must succeed");
        assert!(!status.bip77_allowed, "result should be cached with default TTL");
    }

Contributor

@nothingmuch nothingmuch left a comment


Apart from the missing tests (thanks @caarloshenriq!), this makes an existing DoS concern (one that can be addressed by rate limiting) potentially more serious, since it gives the attacker control over delays.

A DoS-resistant version of this would need to maintain a larger set of known gateways with potentially indefinite opt-out status. This could be done by flushing long-lived opt-outs to persistent storage, so that they are not just stored in an in-memory hashmap.

Probing is regularly done on the open internet, including much more malicious probing. I have not seen the Retry-After header or 429 status code actually used anywhere that I can remember... But arguably, if a server explicitly asks the relay not to probe it until some time, that should be respected.

I took note of it since that's the theoretically correct behavior, but I'm not really sure it's desired behavior, so perhaps I shouldn't have noted it as a TODO comment (which is also why an issue was never opened).

The first question is how often does this happen? And assuming it's common enough, at what complexity cost should we support it?

}

/// Probes a target gateway by attempting to send a GET request.
fn retry_after_or<B>(res: &hyper::Response<B>, default: Duration) -> Duration {
Contributor


docstring of probe was displaced by this function

res.headers()
.get(hyper::header::RETRY_AFTER)
.and_then(|v| v.to_str().ok())
.and_then(|s| s.parse::<u64>().ok())
Contributor


https://www.rfc-editor.org/rfc/rfc7231#section-7.1.3

this can also be an HTTP date in theory but that would add significant complexity

} else if status == hyper::StatusCode::GATEWAY_TIMEOUT {
ttls.http_504_gateway_timeout
} else if status == hyper::StatusCode::TOO_MANY_REQUESTS {
Self::retry_after_or(res, ttls.http_4xx)
Contributor


this introduces a DoS vulnerability

An attacker can send requests to KnownGateways.capacity-many URLs under their control, and specify an arbitrarily large number here. If done repeatedly as honest gateways' entries expire (by simply continuing such probing requests aggressively), the server will be disabled for a duration that the attacker simply specifies.

Contributor


Per my main review comment, "introduces" is overstating things, since a similar attack already exists; what this changes is that the attacker no longer needs to do any maintenance after the longest honest TTL in the DB expires.

Author


On the displaced docstring:

Fixed: docstring is now on probe where it belongs.

On the HTTP-date format:

Acknowledged: only seconds format is handled, HTTP-date silently falls back to default. Acceptable for now given the scope.

On the DoS concern:

Fixed: Retry-After is now capped at the configured default with `.map(|d| d.min(default))`. An attacker-controlled gateway can no longer specify delays beyond what the config already allows.

Author

codaMW commented Apr 14, 2026

Thanks @nothingmuch and @caarloshenriq for the thorough review.
On the DoS concern, I'll cap the parsed Retry-After at the configured default so it can only shorten the TTL, never extend beyond what the config allows. That removes attacker control over arbitrary delays.
I'll also fix the displaced docstring and add the tests @caarloshenriq provided.
On the broader question of whether this feature is worth supporting, I'll defer to your judgment. If the consensus is that the TODO was aspirational and the complexity cost isn't justified, I'm happy to close this.

@codaMW codaMW force-pushed the fix/retry-after-gateway-prober branch from 1a85d7b to c44157f on April 14, 2026 00:28
Author

codaMW commented Apr 14, 2026

Updated: fixed displaced docstring, capped Retry-After at configured default to address DoS concern, and added 3 tests covering 429 with header, 503 with header, and 429 without header.

Contributor

@nothingmuch nothingmuch left a comment


Thanks @nothingmuch and @caarloshenriq for the thorough review. On the DoS concern, I'll cap the parsed Retry-After at the configured default so it can only shorten the TTL, never extend beyond what the config allows. That removes attacker control over arbitrary delays.

Yep, that's a reasonable compromise. Concept ACK. See below for rationale/context for my suggestion.

Please add Co-authored-by crediting @caarloshenriq for the tests.

As usual I defer to @spacebear21 on whether or not the maintenance burden is worth it. For now this code would mostly handle what non-mailroom URLs respond with (GET /.well-known/ohttp-gateway?allowed_purposes seems more likely to result in 404 or 200 than 429 or 503).

Either here or in a followup PR, the default capacity and default TTLs should be revisited. The capacity is very small (chosen when there was only payjo.in). The 4xx default ttl is long (perhaps too long), and 5xx default is too short.

See #941 and related issues for various discussions of rate limiting, because when rate limiting is actually implemented, appropriate values for all of these can be chosen without needing to add any configuration knobs. If no more than r requests per second can be made, then setting capacity = r * max_ttl would mean that capacity is never the limiting constraint.

Note that a global rate limit doesn't fix the DoS concern either. What that can do is reduce the complexity of the attack surface, since with a global rate limit shared by all clients an attacker can just flood with requests and honest users will be rate limited too, but by ensuring the known gateways capacity is never the limiting constraint the attack has only one "mode" or "phase" for lack of a better term.

What we then would need to ensure is that honest users can get some fair share of that global rate limit.

Perhaps that can be done heuristically, for example using IPs and asmap to attempt to have separate rate limits for each, ensuring that an attacker can only rate-limit users with which it shares an AS.

A more comprehensive solution would be to issue anonymous credentials for enforcing a rate limit for each user separately. To deter sybil attacks on this mechanism, these should be issued in exchange for ownership proofs on sufficiently old, non-dust UTXOs, hashcash, a small lightning payment, ... This is alluded to in BIP77 with respect to the directory but potentially applies to the relay as well, and has not been specified yet.

Anyway, hopefully this clears up the long-term plan for how to fully address the more fundamental DoS concern.

.and_then(|v| v.to_str().ok())
.and_then(|s| s.parse::<u64>().ok())
.map(Duration::from_secs)
.map(|d| d.min(default))
Contributor


Suggested change
.map(|d| d.min(default))
.map(|d| d.min(MAX_TTL))

see discussion in the main review comment. This could also perhaps be defined as a constant factor multiplying default; it doesn't need to strictly reduce, we just need to know that it's a reasonable bound.

for 503 we should allow more than 5 seconds (the default).

for 4xx we should probably lower the default of 1 week; I think I was overly concerned with avoiding DoS of upstream servers and not enough with avoiding DoS of the relay itself

MAX_TTL on the order of a day probably makes sense?

Author


Thanks for the detailed context and Concept ACK. I'll define a MAX_TTL constant (1 day) and cap against that instead of default. That handles the case where http_5xx default (5s) is too short for a legitimate 503 with a longer Retry-After.
Will also add the Co-authored-by credit for @caarloshenriq.

When a gateway returns 429 Too Many Requests or 503 Service
Unavailable, the Retry-After response header specifies how long
to wait before retrying. Previously these responses fell back to
static TTL defaults, ignoring the server-specified duration.

Add a retry_after_or helper that parses the Retry-After header
value as seconds and uses it as the cache TTL, capped at
MAX_RETRY_AFTER_TTL (24 hours) to prevent attacker-controlled
delays. Falls back to the configured default when absent.

Add tests covering 429 with header, 503 with header, and 429
without header fallback to default TTL.

Closes payjoin#1474

Co-authored-by: caarloshenriq
@codaMW codaMW force-pushed the fix/retry-after-gateway-prober branch from c44157f to b06390b on April 14, 2026 02:20
Author

codaMW commented Apr 14, 2026

Updated:

Replaced .min(default) with .min(MAX_RETRY_AFTER_TTL) where MAX_RETRY_AFTER_TTL = 24 hours — this allows legitimate longer 503 delays while bounding attacker-controlled values
Added Co-authored-by credit for @caarloshenriq

Thanks @nothingmuch for the guidance.

@caarloshenriq
Contributor

Added Co-authored-by credit for @caarloshenriq

Thanks for the co-author credit! Just a note for future reference: the full format GitHub recognizes is:

Co-authored-by: NAME <NAME@EXAMPLE.COM>

for more details, you can read this doc

Author

codaMW commented Apr 14, 2026

Thanks for flagging that @caarloshenriq! I'll update the co-author line to include your email in the correct format. Could you share the email address you'd like me to use so I can update accordingly? Thank you.

Successfully merging this pull request may close these issues.

payjoin-mailroom: respect Retry-After header for 429 and 503 responses in gateway prober