feat(upload): Retry upstream requests#5975
Conversation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch `UploadRequest` to intercept upstream status errors so 502-504 responses (and other transport failures) trigger the retry path. The PATCH body is now wrapped in `TakeOnce`/`RetryableStream`, allowing a retry as long as the stream has not been polled. Once the body has been streamed, `retry()` returns false so we surface the original error instead of attempting a replay we cannot fulfill. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| Upload { | ||
| location: SignedLocation, | ||
| stream: Option<BoundedStream<MeteredStream<ByteStream>>>, | ||
| stream: TakeOnce<BoundedStream<MeteredStream<ByteStream>>>, |
There was a problem hiding this comment.
This type is already in use by the objectstore service to enable retries.
|
|
||
| fn intercept_status_errors(&self) -> bool { | ||
| false // same as ForwardRequest | ||
| true |
There was a problem hiding this comment.
Necessary to trigger retries on 503.
| { | ||
| let Some(body) = stream.take() else { | ||
| relay_log::error!("upload request was retried or never initialized"); | ||
| let Some(body) = RetryableStream::new(stream.clone()) else { |
There was a problem hiding this comment.
This type is already in use by the objectstore service to enable retries.
| ) | ||
|
|
||
|
|
||
| def test_post_retries(mini_sentry, relay, project_config): |
There was a problem hiding this comment.
I initially also had a test_patch_retries, but retrying PATCH requests only works on network errors, which are harder to simulate.
| fn intercept_status_errors(&self) -> bool { | ||
| false // same as ForwardRequest | ||
| true | ||
| } |
There was a problem hiding this comment.
Bug: Changing intercept_status_errors to true causes upstream 4xx/5xx errors to be incorrectly surfaced as 500 Internal Server Error, masking the original error from the client.
Severity: HIGH
Suggested Fix
Update the IntoResponse implementation for UpstreamRequestError to handle the ResponseError and RateLimited variants specifically. Extract the original status code from the wrapped response and use it to generate the final response, ensuring that client-side errors are propagated correctly instead of being converted to a generic 500 error.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: relay-server/src/services/upload.rs#L582-L584
Potential issue: Enabling `intercept_status_errors` in `UploadRequest` causes the
`transform_response` method to convert all non-2xx upstream responses into
`UpstreamRequestError` variants. When this error propagates to the upload endpoint, the
`IntoResponse` implementation for `UpstreamRequestError` lacks specific handlers for
`ResponseError` or `RateLimited`. Consequently, it falls back to a generic case,
returning a `500 Internal Server Error`. This masks the original client or upstream
error (e.g., 400 Bad Request, 413 Payload Too Large, 429 Too Many Requests), preventing
proper error handling and reporting to the end user.
Retry requests from the
Uploadservice to the upstream relay, for example on 503s. Streaming PATCH requests can only be retried if no part of the body has been sent.ref: INGEST-910