feat(backend): S3 garbage collection #6543
This PR may be related to: #3935 (Build a garbage collector for file sharing)
I believe the changes in #6323 will still be required to ensure that submissions discarded on the website are properly deleted.
Claude finished @maverbiest's task in 2m 23s
Code Review
Good implementation overall. The approach of scanning
Bugs (inline comments posted):
Minor notes:
Looking for file references in the
resolves #3935
(alternative to #6323)
This PR introduces a scheduled task in the backend that finds and deletes orphaned files on S3.
Orphan files are defined as files uploaded by users or by the preprocessing pipeline that are not referenced by any sequence submission in the database. Files that are referenced only by `processed_data` entries generated by a pipeline version older than the current pipeline version are also considered orphans. Orphan files older than `loculus.s3.orphan-file-max-age-days` (currently set to 7 in `application.properties`) are deleted.
The main difference with #6323 is that this PR does not introduce a `submitted_at` column to the `files` table to identify orphan files. Instead, the `unprocessed_data` and `processed_data` jsonb columns are scanned directly for file references. This avoids the bookkeeping that would be needed to keep the `submitted_at` column in sync with the actual state in the sequence_entries tables (especially regarding deletions, multiple sequence entries referencing the same S3 file, etc.).
Manual testing
Directly after I upload a file to S3, I can stat it in the preview deployment:
5 minutes later, after I see that the GC task has run, the file is no longer there:
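As a rough illustration of the orphan rule from the description, here is a minimal Python sketch. It is not the backend code from this PR: `StoredFile`, `ProcessedEntry`, and `find_deletable_orphans` are hypothetical names, and the file references are assumed to have already been extracted from the `unprocessed_data` and `processed_data` columns.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class StoredFile:
    """Hypothetical view of one row in the files table."""
    file_id: str
    uploaded_at: datetime


@dataclass(frozen=True)
class ProcessedEntry:
    """Hypothetical view of file references found in one processed_data entry."""
    pipeline_version: int
    file_ids: frozenset[str]


def find_deletable_orphans(
    stored_files: list[StoredFile],
    unprocessed_refs: set[str],
    processed_entries: list[ProcessedEntry],
    current_pipeline_version: int,
    max_age_days: int,
    now: datetime,
) -> list[str]:
    """Return IDs of files that are unreferenced and older than the threshold."""
    # References from unprocessed data always keep a file alive.
    referenced = set(unprocessed_refs)
    # References from processed data only count if the entry was generated by
    # the current pipeline version (assumption: no newer versions exist).
    for entry in processed_entries:
        if entry.pipeline_version >= current_pipeline_version:
            referenced |= entry.file_ids
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        f.file_id
        for f in stored_files
        if f.file_id not in referenced and f.uploaded_at < cutoff
    )
```

Because the references are recomputed on every run, a file shared by several sequence entries stays alive until the last reference disappears, with no extra column to keep in sync.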
Open issues/weirdness
I currently see these logs pop up in the backend saying >70 files are deleted by the GC even though I only upload one file (or even no files). What is being deleted from S3 then??
Separately, I'm also seeing S3 errors in the backend logs, not sure why/if it's related:
UPDATE
Both of the above may be caused by the GC running while a batch of sequences is in the middle of being preprocessed. Because I set the orphan threshold to 0 for testing, EMBL flatfiles produced for CCHF by the prepro pipeline are detected as orphans:
`NoSuchKey` error
This shouldn't be an issue with a more realistic orphan threshold, but it's worth keeping in mind. We could consider hardcoding a minimum value for the orphan threshold and having the GC task use `max(configured_value, minimum_value)`.
PR Checklist
🚀 Preview: https://s3-garbage-collection.loculus.org
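The minimum-threshold idea raised in the update above could look roughly like the following Python sketch. The one-day floor and the names `MIN_ORPHAN_AGE` and `effective_orphan_age` are assumptions for illustration only; the PR does not specify a minimum value.

```python
from datetime import timedelta

# Hypothetical hardcoded floor; the actual value would need to exceed the
# longest time a file can exist before a sequence entry references it.
MIN_ORPHAN_AGE = timedelta(days=1)


def effective_orphan_age(
    configured_days: float, minimum: timedelta = MIN_ORPHAN_AGE
) -> timedelta:
    """Clamp the configured orphan age so it never drops below the floor."""
    return max(timedelta(days=configured_days), minimum)
```

With this clamp, a test configuration of 0 days would still leave files untouched for the floor period, so files written during an in-flight preprocessing batch would not be collected.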