Handle Retry-After header for 429 and 503 in gateway prober#1475
Handle Retry-After header for 429 and 503 in gateway prober#1475codaMW wants to merge 2 commits intopayjoin:masterfrom
Conversation
caarloshenriq
left a comment
There was a problem hiding this comment.
cACK 1a85d7b
Reviewed the retry_after_or helper and the new branches for 429 and 503.
Logic is correct, specific status checks are ordered before the generic
is_client_error() / is_server_error() catches.
Note: Retry-After as an HTTP-date (per RFC 7231) is not handled, but
silently falls back to the configured default, acceptable given the issue scope.
Added tests to verify the behavior:
- 429 with
Retry-Afterheader respects the server-specified TTL - 503 with
Retry-Afterheader uses the parsed duration - 429 without
Retry-Afterfalls back tohttp_4xxdefault
#[tokio::test(start_paused = true)]
async fn test_mock_429_retry_after() {
let mut server = Server::new_async().await;
let url = GatewayUri::from_str(&server.url()).expect("must be able to parse mock server URL");
let prober = Prober::default();
let retry_after_secs = 120u64;
let mock_429 = server
.mock("GET", RFC_9540_GATEWAY_PATH)
.match_query(mockito::Matcher::Regex("^allowed_purposes$".into()))
.with_status(429)
.with_header("Retry-After", &retry_after_secs.to_string())
.create();
let status = prober.check_opt_in(&url).await.expect("probing must succeed");
mock_429.assert();
assert!(!status.bip77_allowed, "rate-limited gateway should not be considered opt-in");
// Advance time by less than Retry-After — should use cached result (no new request)
tokio::time::advance(Duration::from_secs(retry_after_secs - 1)).await;
let status = prober.check_opt_in(&url).await.expect("cached probe must succeed");
assert!(!status.bip77_allowed, "result should still be cached before Retry-After expires");
// Advance past Retry-After — cache should be expired, new request would be sent
tokio::time::advance(Duration::from_secs(2)).await;
let mock_429_again = server
.mock("GET", RFC_9540_GATEWAY_PATH)
.match_query(mockito::Matcher::Regex("^allowed_purposes$".into()))
.with_status(429)
.with_header("Retry-After", &retry_after_secs.to_string())
.create();
let _ = prober.check_opt_in(&url).await;
mock_429_again.assert();
}
#[tokio::test]
async fn test_mock_503_retry_after() {
let mut server = Server::new_async().await;
let url = GatewayUri::from_str(&server.url()).expect("must be able to parse mock server URL");
let prober = Prober::default();
let mock_503 = server
.mock("GET", RFC_9540_GATEWAY_PATH)
.match_query(mockito::Matcher::Regex("^allowed_purposes$".into()))
.with_status(503)
.with_header("Retry-After", "30")
.create();
let status = prober.check_opt_in(&url).await.expect("probing must succeed");
mock_503.assert();
assert!(!status.bip77_allowed, "unavailable gateway should not be considered opt-in");
}
#[tokio::test]
async fn test_mock_429_no_retry_after_uses_default() {
let mut server = Server::new_async().await;
let url = GatewayUri::from_str(&server.url()).expect("must be able to parse mock server URL");
let prober = Prober::default();
// No Retry-After header — should fall back to http_4xx TTL (1 week)
let mock_429 = server
.mock("GET", RFC_9540_GATEWAY_PATH)
.match_query(mockito::Matcher::Regex("^allowed_purposes$".into()))
.with_status(429)
.create();
let status = prober.check_opt_in(&url).await.expect("probing must succeed");
mock_429.assert();
assert!(!status.bip77_allowed, "rate-limited gateway without Retry-After should not be opt-in");
// Should be cached — no new request sent
let status = prober.check_opt_in(&url).await.expect("cached probe must succeed");
assert!(!status.bip77_allowed, "result should be cached with default TTL");
}There was a problem hiding this comment.
Apart from the missing tests (thanks @caarloshenriq!), this makes an existing DoS concern which can be addressed by rate limiting potentially more serious since it gives the attacker control over delays.
A DoS resistant version of this would need to be able to maintain a larger set of known gateways with potentially indefinite opt-out status. This could be done by flushing long lived opt-outs to persistent storage, so that they are not just stored in an in memory hashmap.
Probing is regularly done on the open internet, including much more malicious probing. I have not seen the retry-after header or 429 status code actually used anywhere that i can remember... But arguably if a server explicitly asks the relay not to probe it until some time, that should be respected.
I took note of it since that's the theoretically correct behavior but i'm not really sure it's desired behavior, so perhaps i shouldn't have noted it as a TODO comment (which is also why an issue was never opened)
The first question is how often does this happen? And assuming it's common enough, at what complexity cost should we support it?
| } | ||
|
|
||
| /// Probes a target gateway by attempting to send a GET request. | ||
| fn retry_after_or<B>(res: &hyper::Response<B>, default: Duration) -> Duration { |
There was a problem hiding this comment.
docstring of probe was displaced by this function
| res.headers() | ||
| .get(hyper::header::RETRY_AFTER) | ||
| .and_then(|v| v.to_str().ok()) | ||
| .and_then(|s| s.parse::<u64>().ok()) |
There was a problem hiding this comment.
https://www.rfc-editor.org/rfc/rfc7231#section-7.1.3
this can also be an HTTP date in theory but that would add significant complexity
| } else if status == hyper::StatusCode::GATEWAY_TIMEOUT { | ||
| ttls.http_504_gateway_timeout | ||
| } else if status == hyper::StatusCode::TOO_MANY_REQUESTS { | ||
| Self::retry_after_or(res, ttls.http_4xx) |
There was a problem hiding this comment.
this introduces a DoS vulnerability
an attacker can send requests to KnownGateways.capacity many URLs under their control, and specify an arbitrarily large number here. if done repeatedly as honest gateways' entries expire (by simply continuing such probing requests aggressively) then the server will be disabled for a duration that the attacker simply specifies.
There was a problem hiding this comment.
per my main review comment, "introduces" overstating things, since a similar attack already exists, what this changes is that the attacker no longer needs to do any maintenance after the longest honest ttl in the DB expires
There was a problem hiding this comment.
On the displaced docstring:
Fixed: docstring is now on
probewhere it belongs.
On the HTTP-date format:
Acknowledged: only seconds format is handled, HTTP-date silently falls back to default. Acceptable for now given the scope.
On the DoS concern:
Fixed: Retry-After
is now capped at the configured default with.map(|d| d.min(default))`. An attacker-controlled gateway can no longer specify delays beyond what the config already allows.
|
Thanks @nothingmuch and @caarloshenriq for the thorough review. |
1a85d7b to
c44157f
Compare
|
nothingmuch
left a comment
There was a problem hiding this comment.
Thanks @nothingmuch and @caarloshenriq for the thorough review. On the DoS concern, I'll cap the parsed Retry-After at the configured default so it can only shorten the TTL, never extend beyond what the config allows. That removes attacker control over arbitrary delays.
Yep, that's a reasonable compromise. Concept ACK. See below for rationale/context for my suggestion.
Please add Co-authored-by crediting @caarloshenriq for the tests.
As usual I defer to @spacebear21 on whether or not the maintenance burden is worth it. For now this code would mostly handle what non mailroom URLs respond with that (GET /.well-known/ohttp-gateway?allowed_purposes seems more likely to result in 404 or 200 than 429 or 503).
Either here or in a followup PR, the default capacity and default TTLs should be revisited. The capacity is very small (chosen when there was only payjo.in). The 4xx default ttl is long (perhaps too long), and 5xx default is too short.
See #941 and related issues for various discussions rate limiting, because when rate limiting is actually implemented appropriate values for all of these can be chosen without needing to add any configuration knobs. If no more than r requests per second can be made, then setting capacity = r * max_ttl would mean that it would never be a limiting constraint.
Note that a global rate limit doesn't fix the DoS concern either. What that can do is reduce the complexity of the attack surface, since with a global rate limit shared by all clients an attacker can just flood with requests and honest users will be rate limited too, but by ensuring the known gateways capacity is never the limiting constraint the attack has only one "mode" or "phase" for lack of a better term.
What we then would need to ensure is that honest users can get some fair share of that global rate limit.
Perhaps that can be done heuristically, for example using IPs and asmap to attempt to have separate rate limits for each, ensuring that an attacker can only rate limits user with which it shares an AS.
A more comprehensive solution would be to issue anonymous credentials for enforcing a rate limit for each user separately. To deter sybil attack on this mechanism, these should be issued in exchange for ownership proofs on sufficiently old, non-dust UTXO, hashcash, a small lightning payment, ... This is alluded to in BIP77 with respect to the directory but potentially applies to the relay as well, and has not been specified yet.
Anyway hopefully this clears up the long term plan for how to fully address the more fundamental DoS concern.
| .and_then(|v| v.to_str().ok()) | ||
| .and_then(|s| s.parse::<u64>().ok()) | ||
| .map(Duration::from_secs) | ||
| .map(|d| d.min(default)) |
There was a problem hiding this comment.
| .map(|d| d.min(default)) | |
| .map(|d| d.min(MAX_TTL)) |
see discussion in the main review comment. this could also perhaps be defined as constant factor multiplying default, it doesn't need to strictly reduce we just need to know that it's a reasonable bound.
for 503 we should allow more than 5 seconds (the default).
for 4xx we should probably lower the default of 1 week, i think i was overly concerned with avoiding dos of upstream servers and not enough with avoiding dos of the relay itself
MAX_TTL on the order of a day probably makes sense?
There was a problem hiding this comment.
Thanks for the detailed context and Concept ACK. I'll define a MAX_TTL constant (1 day) and cap against that instead of default. That handles the case where http_5xx default (5s) is too short for a legitimate 503 with a longer Retry-After.
Will also add the Co-authored-by credit for @caarloshenriq.
When a gateway returns 429 Too Many Requests or 503 Service Unavailable, the Retry-After response header specifies how long to wait before retrying. Previously these responses fell back to static TTL defaults, ignoring the server-specified duration. Add a retry_after_or helper that parses the Retry-After header value as seconds and uses it as the cache TTL, capped at MAX_RETRY_AFTER_TTL (24 hours) to prevent attacker-controlled delays. Falls back to the configured default when absent. Add tests covering 429 with header, 503 with header, and 429 without header fallback to default TTL. Closes payjoin#1474 Co-authored-by: caarloshenriq
c44157f to
b06390b
Compare
|
Updated: Replaced .min(default) with .min(MAX_RETRY_AFTER_TTL) where MAX_RETRY_AFTER_TTL = 24 hours — this allows legitimate longer 503 delays while bounding attacker-controlled values Thanks @nothingmuch for the guidance. |
Thanks for the co-author credit! Just a note for future reference, the full format GitHub recognizes is: Co-authored-by: NAME NAME@EXAMPLE.COM for more details, you can read this doc |
|
Thanks for flagging that @caarloshenriq! I'll update the co-author line to include your email in the correct format. Could you share the email address you'd like me to use so i can update accordingly? Thank you. |
Closes #1474
When a gateway returns 429 Too Many Requests or 503 Service Unavailable,
the
Retry-Afterresponse header specifies how long to wait beforeretrying. Previously these responses fell back to static TTL defaults
(
http_4xx= 1 week,http_5xx= 5 seconds), ignoring theserver-specified duration entirely.
This PR adds a
retry_after_orhelper that parses theRetry-Afterheader value as seconds and uses it as the cache TTL, falling back to
the configured default when the header is absent.
Pull Request Checklist
Disclosure: I consulted Claude to understand the codebase structure,
but the solution was authored by myself.