Skip to content

feat(upload): Retry upstream requests#5975

Open
jjbayer wants to merge 6 commits into
masterfrom
upload-retries
Open

feat(upload): Retry upstream requests#5975
jjbayer wants to merge 6 commits into
masterfrom
upload-retries

Conversation

@jjbayer
Copy link
Copy Markdown
Member

@jjbayer jjbayer commented May 11, 2026

Retry requests from the Upload service to the upstream relay, for example on 503s. Streaming PATCH requests can only be retried if no part of the body has been sent.

ref: INGEST-910

jjbayer and others added 3 commits May 11, 2026 14:40
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch `UploadRequest` to intercept upstream status errors so 502-504
responses (and other transport failures) trigger the retry path. The
PATCH body is now wrapped in `TakeOnce`/`RetryableStream`, allowing a
retry as long as the stream has not been polled. Once the body has been
streamed, `retry()` returns false so we surface the original error
instead of attempting a replay we cannot fulfill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@linear-code
Copy link
Copy Markdown

linear-code Bot commented May 11, 2026

INGEST-910

Upload {
location: SignedLocation,
stream: Option<BoundedStream<MeteredStream<ByteStream>>>,
stream: TakeOnce<BoundedStream<MeteredStream<ByteStream>>>,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type is already in use by the objectstore service to enable retries.


fn intercept_status_errors(&self) -> bool {
false // same as ForwardRequest
true
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Necessary to trigger retries on 503.

{
let Some(body) = stream.take() else {
relay_log::error!("upload request was retried or never initialized");
let Some(body) = RetryableStream::new(stream.clone()) else {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type is already in use by the objectstore service to enable retries.

)


def test_post_retries(mini_sentry, relay, project_config):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially also had a test_patch_retries, but retrying PATCH requests only works on network errors, which are harder to simulate.

@jjbayer jjbayer marked this pull request as ready for review May 11, 2026 14:44
@jjbayer jjbayer requested a review from a team as a code owner May 11, 2026 14:44
Comment thread tests/integration/test_upload.py Outdated
Comment on lines 582 to 584
fn intercept_status_errors(&self) -> bool {
false // same as ForwardRequest
true
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Changing intercept_status_errors to true causes upstream 4xx/5xx errors to be incorrectly surfaced as 500 Internal Server Error, masking the original error from the client.
Severity: HIGH

Suggested Fix

Update the IntoResponse implementation for UpstreamRequestError to handle the ResponseError and RateLimited variants specifically. Extract the original status code from the wrapped response and use it to generate the final response, ensuring that client-side errors are propagated correctly instead of being converted to a generic 500 error.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: relay-server/src/services/upload.rs#L582-L584

Potential issue: Enabling `intercept_status_errors` in `UploadRequest` causes the
`transform_response` method to convert all non-2xx upstream responses into
`UpstreamRequestError` variants. When this error propagates to the upload endpoint, the
`IntoResponse` implementation for `UpstreamRequestError` lacks specific handlers for
`ResponseError` or `RateLimited`. Consequently, it falls back to a generic case,
returning a `500 Internal Server Error`. This masks the original client or upstream
error (e.g., 400 Bad Request, 413 Payload Too Large, 429 Too Many Requests), preventing
proper error handling and reporting to the end user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant