fix(BA-6635): retry image rescan result commits#12449
Conversation
There was a problem hiding this comment.
Pull request overview
This PR makes the container registry image rescan commit phase resilient to transient PostgreSQL serialization conflicts (SQLSTATE 40001). It extracts the DB-mutating portion of BaseContainerRegistry.commit_rescan_result() into a new _commit_rescan_result_once() helper wrapped with a Resilience retry policy, so a failed commit is retried rather than aborting the whole rescan. Crucially, the helper now copies the pending updates per attempt (pending_updates = dict(original_updates)) instead of mutating the shared all_updates ContextVar dict, so retries see the full update set again. Progress messages are collected during the transaction and only emitted (logs + ProgressReporter) after a successful commit.
Changes:
- Added a module-level
commit_rescan_result_resilience(RetryPolicy,max_retries=10, fixed 0.1s delay,BackendAIErrornon-retryable), matching the repository-layer retry shape. - Refactored
commit_rescan_result()into an outer method that emits progress post-commit plus a retried_commit_rescan_result_once()that performs all DB work on a per-attempt copy of updates. - Added a unit test that simulates a
DBAPIErrorserialization failure on first flush and verifies the retry succeeds without losing pending updates.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
src/ai/backend/manager/container_registry/base.py |
Adds retry resilience policy and splits commit logic into a retried helper that copies updates per attempt and defers progress emission until after a successful commit. |
tests/unit/manager/test_container_registry_base.py |
New unit test exercising serialization-failure retry of commit_rescan_result() via a fake DB/session, asserting two attempts and preserved update state. |
changes/12449.fix.md |
Towncrier changelog entry describing the retry-on-serialization-conflict fix. |
I verified the key correctness aspects: the public commit_rescan_result() signature is unchanged so both call sites (rescan_single_registry, scan_single_ref) remain valid; the shared all_updates dict is no longer mutated (only the per-attempt shallow copy is, and inner update dicts are never mutated, so the copy is safe); the retried scope includes both session.flush() and the implicit commit on begin_session exit, so commit-time serialization failures are also retried; progress events are rebuilt each attempt and emitted exactly once on success, preserving the original ordering and reporter-increment behavior; and _TestRegistry implements the only @abstractmethod (fetch_repositories). I found no concrete defects to flag.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| image_identifiers: list[tuple[str, str]], | ||
| ) -> tuple[list[ImageData], list[tuple[str, str]]]: | ||
| scanned_images: list[ImageData] = [] | ||
| progress_events: list[tuple[str, str]] = [] |
There was a problem hiding this comment.
Aggregating the logs doesn't seem like a good approach.
Wouldn't it be better to handle logging and reporter.update separately?
Even if we decide to aggregate logs, collecting raw strings such as "warning" and "info" and then dispatching them through a switch statement seems quite fragile.
There was a problem hiding this comment.
Since retry logic has been implemented, it seems more appropriate to aggregate the logs rather than log them each time, and display them all at once when the operation succeeds. I’ve changed "warning" and "info" to enums for now.
Replace the (level, message) tuple buffer with a dedicated `_RescanProgressEvent` dataclass and reuse the shared `LogLevel` enum instead of raw "info"/"warning" strings, per review feedback. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Resolved by #12464 |
Summary
Test plan
Resolves BA-6635