Skip to content

feat(backend): S3 garbage collection#6543

Open
maverbiest wants to merge 25 commits into
mainfrom
s3-garbage-collection
Open

feat(backend): S3 garbage collection#6543
maverbiest wants to merge 25 commits into
mainfrom
s3-garbage-collection

Conversation

@maverbiest

@maverbiest maverbiest commented May 31, 2026

Copy link
Copy Markdown
Contributor

resolves #3935

(alternative to #6323)

This PR introduces a scheduled task in the backend that finds and deletes orphaned files on S3.

Orphan files are defined as files that have been uploaded by users or the prepocessing pipeline that are not referenced by any sequence submission in the database. Files that are referenced only by processed_data entries generated by pipeline version older than the current pipeline version are also considered orphans. Orphan files older than loculus.s3.orphan-file-max-age-days (currently set to 7 in application.properties) are deleted.

The main difference with #6323 is that this PR does not introduce a submitted_at column to the files table to identify orphan files. Instead, the unprocessed_data and processed_data jsonb columns are scanned directly for file references. This avoids the bookkeeping that would be needed to ensure the submitted_at column stays in sync with the actual state in the sequence_entries tables (especially regarding deletions, multiple sequence entries referencing the same S3 file etc.).

Manual testing

Directly after I upload a file to S3, I can stat it in the preview deployment:

➜  loculus git:(s3-garbage-collection) kubectl exec -n "$NS" "$POD" -- mc stat "local/loculus-preview-private/files/c49bd8da-6661-4063-8718-2d6703a6c8f9"
Name      : c49bd8da-6661-4063-8718-2d6703a6c8f9
Date      : 2026-06-09 12:17:01 UTC
Size      : 20 B
ETag      : 415ab09be58b32f2bd922650fa08e3d4
Type      : file
Metadata  :
  Content-Type: binary/octet-stream

5 minutes later, after I see that the GC task has run, the file is no longer there:

➜  loculus git:(s3-garbage-collection) kubectl exec -n "$NS" "$POD" -- mc stat "local/loculus-preview-private/files/c49bd8da-6661-4063-8718-2d6703a6c8f9"
mc: <ERROR> Unable to stat `local/loculus-preview-private/files/c49bd8da-6661-4063-8718-2d6703a6c8f9`. Object does not exist.
command terminated with exit code 1

Open issues/weirdness

I currently see these logs pop up in the backend saying >70 files are deleted by the GC even though I only upload one file (or even no files). What is being deleted from S3 then??

2026-06-09 12:20:39,598 INFO [scheduling-1] [] [] - org.loculus.backend.service.submission.S3GarbageCollectionTask: Deleted 80 orphans that were not referenced by a submission after 0 days

Separately, I'm also seeing S3 errors in the backend logs, not sure why/if it's related:

2026-06-09 12:24:33,991 ERROR [http-nio-8079-exec-3] [0d016c50-67d4-4f94-8b79-3c8c8666859e] [cchf] - org.loculus.backend.controller.ExceptionHandler: Caught unexpected exception: Unexpected S3 error: NoSuchKey
java.lang.RuntimeException: Unexpected S3 error: NoSuchKey

UPDATE

Both of the above may be caused by GC running while a batch of sequences is in the middle of being preprocessed: detecting EMBL flatfiles produced for CCHF by the prepro pipeline as orphans because I put the orphan threshold at 0 for testing:

  • CCHF has create flatfiles: true
  • prepro creates and uploads a flatfiles for a batch to S3
  • GC runs; identifies the flatfiles as orphans (they are not yet referenced by entries in processed_data), deletes them -> logs that it deleted a bunch of files
  • prepro attempts to set the files to public that no longer exist -> logs the NoSuchKey error

This shouldn't be an issue with a more realistic orphan threshold, but it's worth keeping in mind. We could consider hardcoding a minimum value for the orphan threshold, and having the GC task use max(configured_value, minimum_value)

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://s3-garbage-collection.loculus.org

@claude claude Bot added the backend related to the loculus backend component label May 31, 2026
@claude

claude Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

This PR may be related to: #3935 (Build a garbage collector for file sharing)

@anna-parker

Copy link
Copy Markdown
Contributor

I believe the changes here: #6323 will still be required to ensure submissions that were discarded on the website are properly deleted

@maverbiest maverbiest force-pushed the s3-garbage-collection branch from ff145ac to 1eb50e1 Compare June 5, 2026 13:47
@maverbiest maverbiest marked this pull request as ready for review June 8, 2026 13:18
@claude

claude Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Claude finished @maverbiest's task in 2m 23s —— View job


Code Review

  • Read and analyze all changed files
  • Review code quality, security, performance, testing, and documentation
  • Post findings

Good implementation overall. The approach of scanning unprocessed_data/processed_data directly to find orphans rather than adding a submitted_at bookkeeping column is sound. The SQL query is well-constructed and the test coverage for the orphan-detection logic is thorough. Two bugs to fix before merge:

Bugs (inline comments posted):

  1. Missing space in audit log string (S3GarbageCollectionTask.kt:58–59): "referenced by a" + "submission" concatenates to "referenced by asubmission".
  2. Incorrect deletion count in log.info (S3GarbageCollectionTask.kt:53): Reports orphans.size deleted even when some failed — should be orphans.size - deleteFailures.

Minor notes:

  • The error-handling catch block (lines 45–48) has a subtle but benign retry behaviour: if S3 deletion succeeds but deleteFileEntry throws, the DB entry survives and the next GC run will call deleteObject again — AWS S3's deleteObject is idempotent, so this self-heals. Worth a comment.
  • No index exists on files.upload_requested_at. For large files tables the WHERE upload_requested_at < ? filter will full-scan. Probably fine for a daily task at current scale, but worth a migration if the table grows.
  • The failure path (deleteFailures > 0) has no test coverage. A test that makes s3Service.deleteFile throw for one file and verifies the other orphan is still cleaned up would improve confidence in the error path.

@maverbiest

maverbiest commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

I believe the changes here: #6323 will still be required to ensure submissions that were discarded on the website are properly deleted

Looking for file references in the unprocessed_data and processed_data jsonb columns should take care of this, I think! (sorry @anna-parker not sure what state this PR was in when you commented)

@maverbiest maverbiest requested review from anna-parker and tombch June 8, 2026 13:56
@maverbiest maverbiest added the preview Triggers a deployment to argocd label Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend related to the loculus backend component preview Triggers a deployment to argocd update_db_schema

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Build a garbage collector for file sharing

3 participants