
fix(cleanup): retry transient S3 errors during batch deletes #7

Open: jan-exa wants to merge 1 commit into main from add-delete-batch-retry


jan-exa commented May 28, 2026

Summary

When deleting millions of files during cleanup (e.g. 35.9M deletion vectors in atlas_v3.lance), S3 occasionally returns transient InternalError responses. Previously, a single transient error immediately aborted the entire cleanup run, wasting the ~5 hours spent re-reading manifests.

This PR adds delete_batch_with_retry(), which wraps each batch deletion with up to 3 retries and exponential backoff (5s, 10s, 20s). Only transient S3 errors are retried:

  • InternalError — S3 transient internal failure
  • 503 / SlowDown — S3 throttling
  • ServiceUnavailable — S3 temporarily unavailable
  • RequestTimeout — S3 request timeout

Permanent errors (AccessDenied, NoSuchBucket, etc.) propagate immediately.
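
The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `delete_batch` callable, the `is_transient` helper, the `TRANSIENT_MARKERS` tuple, and the injectable `sleep` parameter are all assumptions made for the example, and substring matching on the error message is only one plausible way to classify errors.

```python
import time
from typing import Callable, Sequence

# Substrings treated as transient, taken from the list in the summary.
# Matching on message text is an assumption of this sketch.
TRANSIENT_MARKERS = (
    "InternalError",
    "503",
    "SlowDown",
    "ServiceUnavailable",
    "RequestTimeout",
)


def is_transient(error: Exception) -> bool:
    """Return True if the error message contains a transient S3 marker."""
    message = str(error)
    return any(marker in message for marker in TRANSIENT_MARKERS)


def delete_batch_with_retry(
    delete_batch: Callable[[Sequence[str]], None],  # hypothetical batch-delete callable
    paths: Sequence[str],
    max_retries: int = 3,
    base_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> None:
    """Delete one batch, retrying transient failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            delete_batch(paths)
            return
        except Exception as error:
            # Permanent errors propagate immediately; transient errors
            # propagate once all retries have been used.
            if not is_transient(error) or attempt == max_retries:
                raise
            # Delay doubles each time: 5s, 10s, 20s with the defaults.
            sleep(base_delay_s * 2 ** attempt)
```

One caveat with this style of matching: a bare `"503"` substring would also match unrelated text, such as a file path that happens to contain those digits, so a real implementation may want to inspect a structured status code where one is available.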

Review & Testing Checklist for Human

  • Verify the transient error string matching covers real S3 error messages (check against actual errors in Databricks stderr logs)
  • Verify the retry backoff timing (5s, 10s, 20s) is appropriate: not so aggressive that it worsens throttling, and not so slow that it wastes time
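
As an aid for reviewing the backoff timing, the schedule and its worst-case cost per batch can be worked out with a short calculation. The function name `backoff_schedule` and its parameters are hypothetical, chosen for this sketch; only the 3 retries and the 5s base delay come from the PR description.

```python
from typing import List


def backoff_schedule(max_retries: int = 3, base_delay_s: float = 5.0) -> List[float]:
    """Return the wait before each retry, doubling from the base delay."""
    return [base_delay_s * 2 ** attempt for attempt in range(max_retries)]


schedule = backoff_schedule()       # [5.0, 10.0, 20.0]
worst_case_wait_s = sum(schedule)   # 35.0 seconds of added waiting per failing batch
```

Under these assumptions, a batch that exhausts every retry adds 35 seconds of waiting before the error propagates, which is small next to the hours lost when a run aborts.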

Notes

In our atlas_v3.lance cleanup, attempt 0 deleted 3.14M files before hitting an S3 InternalError (6.1h wasted), and attempt 1 deleted 775K files before hitting the same error (5.4h wasted). With this fix, each of those transient errors would be retried up to 3 times with backoff instead of aborting the entire run.

Link to Devin session: https://app.devin.ai/sessions/bb810ab5769542e0a12b08c2505d2ae1
Requested by: @jan-exa

When deleting millions of files, S3 occasionally returns InternalError
or 503 responses. Previously, a single transient error aborted the
entire cleanup run, wasting hours of manifest reading.

Now delete_batch_with_retry wraps each batch deletion with up to 3
retries and exponential backoff (5s, 10s, 20s). Only transient errors
(InternalError, 503, SlowDown, ServiceUnavailable, RequestTimeout) are
retried; permanent errors propagate immediately.

Co-Authored-By: Jan van der Vegt <jan@exa.ai>
@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.


@github-actions github-actions Bot added the bug Something isn't working label May 28, 2026