✨(storage) implement tiered storage by sylvinus · Pull Request #486 · suitenumerique/messages

sylvinus · 2026-01-16T11:09:29Z

This allows offloading blobs to S3-compatible object storage, making Postgres much lighter. The design targets storing ~1B emails on a single instance.

Fixes #185.

Summary by CodeRabbit

Release Notes

New Features
- Blob offloading to S3-compatible object storage with automatic hourly scheduling based on age and size thresholds
- Optional AES-256-GCM encryption for blob data at rest with configurable key rotation
- Automated garbage collection system for unreferenced blobs
- Enhanced admin interface displaying blob storage location and encryption key assignment
Documentation
- New tiered storage architecture guide and deployment configuration instructions
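
As a rough illustration of the "age and size thresholds" mentioned above, the selection rule could look like the sketch below. The threshold names and values, the `BlobInfo` type, and the choice to require both conditions at once are assumptions made for this example; the PR's actual settings are not shown in this summary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; the real setting names and defaults are not shown here.
MIN_AGE = timedelta(days=7)
MIN_SIZE_BYTES = 16 * 1024


@dataclass
class BlobInfo:
    created_at: datetime
    size: int
    in_postgres: bool


def should_offload(blob: BlobInfo, now: datetime) -> bool:
    """Return True when a blob is still in Postgres and is both old and large enough."""
    if not blob.in_postgres:
        return False  # already offloaded, nothing to do
    old_enough = now - blob.created_at >= MIN_AGE
    big_enough = blob.size >= MIN_SIZE_BYTES
    # Assumption: both thresholds must be met; the PR may combine them differently.
    return old_enough and big_enough
```

An hourly task would then apply such a predicate (or the equivalent database filter) to pick candidates for offloading.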

coderabbitai · 2026-01-16T11:09:42Z

📝 Walkthrough

Adds S3-backed tiered blob storage with AES-GCM encryption, schema/migration, GC with reservations, offload/verify/restore tooling, API/MDA refactors, admin updates, Redis-only coalescer, importer Range reads, settings/env/docs/CI updates, Celery beat, and extensive tests.

Changes

Tiered Blob Storage, Encryption, GC, and Integrations

| Layer / File(s) | Summary |
|---|---|
| Data contracts and schema `src/backend/core/enums.py`, `src/backend/core/migrations/...`, `src/backend/core/models.py` | Adds compression parser and storage enum; migrates Blob for `storage_location`/`encryption_key_id` and nullable `raw_content`; introduces `MailboxBlob`; sets PROTECT FKs and partial index. |
| Core storage/encryption and GC services `src/backend/core/services/tiered_storage.py`, `src/backend/core/services/blob_gc.py` | Implements AEAD encrypt/decrypt, upload/download, rotate/restore, orphan cleanup; Redis candidate set, GC draining, and upload reservation helpers. |
| Signals `src/backend/core/signals.py` | Post-delete handlers schedule related blobs for GC; removes pre-delete eager deletion. |
| Model logic `src/backend/core/models.py`, `src/backend/core/utils.py` | `create_blob` now dedups, compresses, encrypts; adds `is_referenced`/`user_can_access`; updates `get_content`; `JSONValue.to_python` hardens error handling. |
| API, serializers, metrics, MDA `src/backend/core/api/...`, `src/backend/core/mda/...`, `src/backend/core/factories.py` | Upload size-cap and reservation flow; GC scheduling on template updates; metrics draft path; MDA compose-and-sign, atomic Blob+Message, attachment provenance. |
| Tasks and commands `src/backend/core/services/tiered_storage_tasks.py`, `.../management/commands/*` | Hourly offload loop and per-blob worker; `re_store_blobs` rotation/restore; `verify_blobs` audit; `delete_orphan_attachments`. |
| Search/importer `src/backend/core/services/search/*`, `src/backend/core/services/importer/eml_tasks.py` | Coalescer now Redis-only; search flag check tightened; importer uses bounded S3 Range read. |
| Admin/UI `src/backend/core/admin.py`, `src/backend/core/templates/.../blob/change_form.html` | Admin restricted to superusers; Blob listing adds storage/encryption; adds download action. |
| Settings/docs/CI `src/backend/messages/settings.py`, `env.d/...`, `docs/*`, `.github/workflows/messages.yml`, `Makefile`, `src/backend/messages/celery_app.py`, `src/backend/core/apps.py`, `src/backend/core/checks.py` | Adds message-blobs storage and blob settings; system checks; tiered-storage docs; env defaults; CI bucket creation; beat schedule; app imports checks. |
| Tests/fixtures/markers `src/backend/core/tests/**`, `src/backend/pyproject.toml` | Adds Redis/bucket fixtures and marker; comprehensive tests for storage, GC, tasks, commands, API/MDA, metrics, search, and signals; adjusts factories usage. |

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant API
  participant DB
  participant GC as GC/Reservations
  participant Tiered as TieredStorageService
  participant S3 as Object Storage

  Client->>API: POST /blob/upload
  API->>GC: upload_and_reserve_blob(mailbox, content)
  GC->>DB: Create Blob + MailboxBlob (atomic)
  API-->>Client: blobId,size,sha256

  Note over API,Tiered: Later (hourly offload)
  API->>Tiered: offload_one_blob(id)
  Tiered->>S3: PUT blobs/{key_id}/{shard}/{sha}
  Tiered->>DB: Update location/encryption, clear raw_content

  Client->>API: GET /blob/{id}
  API->>DB: user_can_access(user, blob.id)
  alt Offloaded
    API->>Tiered: download_blob(id)
    Tiered->>S3: GET object
    Tiered-->>API: decrypted bytes
  else In Postgres
    API->>DB: read raw_content and decrypt
  end
  API-->>Client: content
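
The object key shape `blobs/{key_id}/{shard}/{sha}` from the diagram can be sketched as follows. This is only an illustration: the function name, the use of the first two hex characters of the digest as the shard, and hashing the given bytes directly are assumptions, not details confirmed by the walkthrough.

```python
import hashlib


def storage_key(content: bytes, key_id: int, shard_len: int = 2) -> str:
    """Build a key shaped like blobs/{key_id}/{shard}/{sha}.

    Assumption: the shard is the first `shard_len` hex characters of the
    SHA-256 digest, which spreads objects evenly across prefixes.
    """
    sha = hashlib.sha256(content).hexdigest()
    return f"blobs/{key_id}/{sha[:shard_len]}/{sha}"
```

Deriving the key from the content hash means identical content maps to the same object under a given key id, which fits the deduplication mentioned in the walkthrough.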

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

suitenumerique/messages#485: Overlaps draft attachment forwarding/provenance logic in core/mda/draft.py.
suitenumerique/messages#507: Touches MDA inbound/outbound and models near these refactors.
suitenumerique/messages#556: Aligns Makefile/workflow/object-storage setup with similar bucket provisioning changes.

Suggested reviewers

jbpenrath
sdemagny


coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@src/backend/core/management/commands/verify_tiered_storage.py`:
- Around line 466-495: The current flow writes the newly encrypted object via
self.service.storage.save(storage_key, ...) before updating
blob.encryption_key_id inside transaction.atomic(), risking storage/DB
inconsistency if the DB update fails; instead, write the new encrypted bytes to
a temporary object (e.g. derive a temp key from storage_key and new_key_id)
using self.service.storage.save(temp_key, ContentFile(encrypted)), then perform
the DB update inside transaction.atomic() (update blob.encryption_key_id and
save), and only after the transaction succeeds atomically remove/rename the temp
object to the final storage_key (or copy temp→final and delete temp) so storage
and DB remain consistent; reference symbols: self.service.storage.save,
storage_key, temp_key (create), self.service.encrypt, blob.encryption_key_id,
transaction.atomic.
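
The ordering suggested above (temp object, then DB update, then promotion) can be sketched with an in-memory stand-in for the object store. `MemoryStorage`, `rotate_blob`, the temp key naming, and the `update_db` callback are hypothetical names for this example; the callback stands in for the `transaction.atomic()` block.

```python
class MemoryStorage:
    """In-memory stand-in for an object store (hypothetical, for illustration only)."""

    def __init__(self):
        self.objects = {}

    def save(self, key, data):
        self.objects[key] = data

    def delete(self, key):
        self.objects.pop(key, None)


def rotate_blob(storage, blob, storage_key, new_key_id, encrypted, update_db):
    """Write to a temp key, update the DB, then promote the temp object."""
    temp_key = f"{storage_key}.tmp-{new_key_id}"  # assumed naming scheme
    storage.save(temp_key, encrypted)
    try:
        update_db(blob, new_key_id)  # stands in for the transaction.atomic() block
    except Exception:
        storage.delete(temp_key)  # DB unchanged, so drop the temp object
        raise
    # Promote only after the DB update succeeded: copy temp to final, then delete temp.
    storage.save(storage_key, storage.objects[temp_key])
    storage.delete(temp_key)
```

Note that the promotion step can still fail after the DB commit, so a real implementation would likely need a retry or repair path for leftover temp objects.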

In `@src/backend/core/services/tiered_storage.py`:
- Around line 31-43: In __init__, the enabled gate currently checks for an
OPTIONS.endpoint_url which wrongly disables valid S3 setups; instead set
self.enabled based on presence of the "message-blobs" storage config itself
(e.g. check that settings.STORAGES contains a non-empty "message-blobs" entry).
Update the assignment to self.enabled to use
settings.STORAGES.get("message-blobs") (or "message-blobs" in settings.STORAGES
and truthy) rather than digging for OPTIONS.endpoint_url so AWS S3 configs
without endpoint_url remain enabled.
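
A minimal sketch of the suggested gate, written as a standalone function over a plain dict so it can run without Django. The function name is hypothetical, and the example configs below only illustrate the two cases discussed above.

```python
def tiered_storage_enabled(storages: dict) -> bool:
    """Enabled when a non-empty "message-blobs" entry exists, regardless of endpoint_url."""
    return bool(storages.get("message-blobs"))


# Illustrative configs: AWS S3 typically needs no endpoint_url, while
# S3-compatible services usually set one.
aws_config = {
    "message-blobs": {
        "BACKEND": "storages.backends.s3.S3Storage",
        "OPTIONS": {"bucket_name": "example-bucket"},
    }
}
compatible_config = {
    "message-blobs": {
        "BACKEND": "storages.backends.s3.S3Storage",
        "OPTIONS": {"bucket_name": "example-bucket", "endpoint_url": "http://localhost:9000"},
    }
}
```

With this check both configs enable tiered storage, whereas a gate on `OPTIONS.endpoint_url` would reject the first one.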

🧹 Nitpick comments (3)

src/backend/core/services/tiered_storage_tasks.py (1)
68-133: Consider adding retry for transient failures.

The task handles lock contention gracefully by returning "locked" status, but transient failures (network issues, temporary S3 unavailability) at line 131 are logged and returned as errors without retry. The periodic offload_blobs_task will eventually re-queue these blobs, but adding explicit retry behavior for transient exceptions (e.g., ConnectionError, Timeout) could improve reliability.
💡 Optional: Add retry for transient failures
-@celery_app.task(bind=True)
+@celery_app.task(bind=True, autoretry_for=(ConnectionError, TimeoutError), retry_backoff=True, max_retries=3)
 def offload_single_blob_task(self, blob_id: str) -> Dict[str, Any]:
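
For readers unfamiliar with these Celery options, the behavior they request (retry only on selected transient exceptions, with exponentially growing delays, up to a retry limit) can be approximated in plain Python as below. This is a simplified stand-in, not Celery's implementation: the helper name is hypothetical and it omits the jitter Celery adds by default.

```python
import time


def call_with_retry(func, retries=3, base_delay=1.0, sleep=time.sleep,
                    transient=(ConnectionError, TimeoutError)):
    """Call func, retrying on transient errors with exponential backoff."""
    attempt = 0
    while True:
        try:
            return func()
        except transient:
            if attempt >= retries:
                raise  # retry budget exhausted, surface the error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
            attempt += 1
```

With `retries=3` the function is called at most four times: the initial attempt plus three retries. Exceptions outside the `transient` tuple propagate immediately.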
src/backend/core/models.py (1)
1536-1557: Enforce storage_location/raw_content invariants at the DB layer.
With raw_content now nullable, inconsistent states (e.g., OBJECT_STORAGE + non-null content)
become possible and will surface as runtime errors in get_content. A check constraint makes
the invariant explicit and avoids silent drift. This will require a migration.
♻️ Proposed constraint
         constraints = [
             models.CheckConstraint(
                 check=(
                     models.Q(mailbox__isnull=False) | models.Q(maildomain__isnull=False)
                 ),
                 name="blob_has_owner",
             ),
+            models.CheckConstraint(
+                check=(
+                    models.Q(
+                        storage_location=BlobStorageLocationChoices.POSTGRES,
+                        raw_content__isnull=False,
+                    )
+                    | models.Q(
+                        storage_location=BlobStorageLocationChoices.OBJECT_STORAGE,
+                        raw_content__isnull=True,
+                    )
+                ),
+                name="blob_storage_location_matches_content",
+            ),
         ]
As per coding guidelines, enforce data integrity with model constraints.

Also applies to: 1583-1589
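
The invariant encoded by the proposed constraint can be restated as a small predicate, which makes the accepted and rejected states easy to enumerate. The string values below are hypothetical placeholders for the `BlobStorageLocationChoices` members, whose real values are not shown in this review.

```python
from typing import Optional

# Hypothetical stand-ins for the BlobStorageLocationChoices values.
POSTGRES = "postgres"
OBJECT_STORAGE = "object_storage"


def storage_state_is_valid(storage_location: str, raw_content: Optional[bytes]) -> bool:
    """Mirror the check constraint: content lives in exactly one place."""
    if storage_location == POSTGRES:
        return raw_content is not None  # content must be present in the row
    if storage_location == OBJECT_STORAGE:
        return raw_content is None  # row must not keep a copy
    return False  # unknown location
```

Only two of the four combinations pass, matching the two `Q` branches in the proposed constraint.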
src/backend/core/services/tiered_storage.py (1)
244-281: Guard against orphan-delete races and capture delete errors.
There’s a TOCTOU window between the reference count (lines 259-263) and the deletion (lines 274-275);
a concurrent offload could add a reference after the count and still have its object deleted.
Consider an advisory lock keyed by SHA256 or a transactional guard around the check+delete.

Also, capture the storage deletion exception to Sentry so cleanup failures are observable.
♻️ Suggested Sentry capture
 from cryptography.fernet import Fernet
+from sentry_sdk import capture_exception
@@
-        except Exception as e:  # pylint: disable=broad-except
-            logger.warning("Failed to delete blob from storage %s: %s", key, e)
+        except Exception as exc:  # pylint: disable=broad-except
+            capture_exception(exc)
+            logger.warning("Failed to delete blob from storage %s: %s", key, exc)
             return False
As per coding guidelines, capture and report exceptions to Sentry.
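
One way to key an advisory lock by SHA256, as suggested above, is to fold the digest into the signed 64-bit integer that PostgreSQL's `pg_advisory_xact_lock(bigint)` accepts. The helper name and the choice of the first 8 digest bytes are assumptions for this sketch.

```python
def advisory_lock_key(sha256_hex: str) -> int:
    """Map a SHA-256 hex digest to a signed 64-bit advisory lock key.

    Assumption: the first 8 bytes of the digest are enough to keep
    unrelated blobs from contending on the same lock in practice.
    """
    digest = bytes.fromhex(sha256_hex)
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


# Possible use inside a transaction (not executed here):
#   cursor.execute("SELECT pg_advisory_xact_lock(%s)", [advisory_lock_key(sha)])
#   ... re-check the reference count, then delete the object ...
```

For the lock to close the race, the offload path would have to take the same lock before adding a reference; a collision between two digests only causes extra waiting, not incorrect deletion.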

📜 Review details

Configuration used: defaults

Review profile: CHILL


📥 Commits

Reviewing files that changed from the base of the PR and between ec4c7ef and 8858822.

📒 Files selected for processing (22)

compose.yaml
env.d/development/backend.defaults
src/backend/core/api/viewsets/config.py
src/backend/core/enums.py
src/backend/core/management/commands/verify_tiered_storage.py
src/backend/core/migrations/0014_blob_encryption_key_id_blob_storage_location_and_more.py
src/backend/core/models.py
src/backend/core/services/search/search.py
src/backend/core/services/tiered_storage.py
src/backend/core/services/tiered_storage_tasks.py
src/backend/core/signals.py
src/backend/core/tests/commands/__init__.py
src/backend/core/tests/commands/test_verify_tiered_storage.py
src/backend/core/tests/conftest.py
src/backend/core/tests/services/__init__.py
src/backend/core/tests/services/test_tiered_storage.py
src/backend/core/tests/tasks/__init__.py
src/backend/core/tests/tasks/test_task_send_message.py
src/backend/core/tests/tasks/test_tiered_storage_tasks.py
src/backend/core/utils.py
src/backend/messages/celery_app.py
src/backend/messages/settings.py

🧰 Additional context used

📓 Path-based instructions (6)

src/backend/**/*.py