Skip to content

fix(BA-6635): retry image rescan result commits#12449

Closed
seedspirit wants to merge 4 commits into
mainfrom
fix/BA-6635
Closed

fix(BA-6635): retry image rescan result commits#12449
seedspirit wants to merge 4 commits into
mainfrom
fix/BA-6635

Conversation

@seedspirit

Copy link
Copy Markdown
Contributor

Summary

  • Retry the container registry image rescan commit phase with the repository resilience retry decorator.
  • Preserve pending image updates across retry attempts and emit progress only after a successful DB commit.
  • Add a unit test covering serialization failure retry during rescan result commit.

Test plan

  • pants fmt ::
  • pants fix ::
  • pants lint --changed-since=origin/main
  • pants fmt --changed-since=HEAD~1
  • pants fix --changed-since=HEAD~1
  • pants lint --changed-since=HEAD~1
  • pants check --changed-since=HEAD~1
  • pants test --changed-since=HEAD~1

Resolves BA-6635

@github-actions github-actions Bot added size:L 100~500 LoC comp:manager Related to Manager component labels Jun 30, 2026
@seedspirit seedspirit marked this pull request as ready for review June 30, 2026 07:39
@seedspirit seedspirit requested a review from a team as a code owner June 30, 2026 07:39
Copilot AI review requested due to automatic review settings June 30, 2026 07:39

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes the container registry image rescan commit phase resilient to transient PostgreSQL serialization conflicts (SQLSTATE 40001). It extracts the DB-mutating portion of BaseContainerRegistry.commit_rescan_result() into a new _commit_rescan_result_once() helper wrapped with a Resilience retry policy, so a failed commit is retried rather than aborting the whole rescan. Crucially, the helper now copies the pending updates per attempt (pending_updates = dict(original_updates)) instead of mutating the shared all_updates ContextVar dict, so retries see the full update set again. Progress messages are collected during the transaction and only emitted (logs + ProgressReporter) after a successful commit.

Changes:

  • Added a module-level commit_rescan_result_resilience (RetryPolicy, max_retries=10, fixed 0.1s delay, BackendAIError non-retryable), matching the repository-layer retry shape.
  • Refactored commit_rescan_result() into an outer method that emits progress post-commit plus a retried _commit_rescan_result_once() that performs all DB work on a per-attempt copy of updates.
  • Added a unit test that simulates a DBAPIError serialization failure on first flush and verifies the retry succeeds without losing pending updates.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
src/ai/backend/manager/container_registry/base.py Adds retry resilience policy and splits commit logic into a retried helper that copies updates per attempt and defers progress emission until after a successful commit.
tests/unit/manager/test_container_registry_base.py New unit test exercising serialization-failure retry of commit_rescan_result() via a fake DB/session, asserting two attempts and preserved update state.
changes/12449.fix.md Towncrier changelog entry describing the retry-on-serialization-conflict fix.

I verified the key correctness aspects: the public commit_rescan_result() signature is unchanged so both call sites (rescan_single_registry, scan_single_ref) remain valid; the shared all_updates dict is no longer mutated (only the per-attempt shallow copy is, and inner update dicts are never mutated, so the copy is safe); the retried scope includes both session.flush() and the implicit commit on begin_session exit, so commit-time serialization failures are also retried; progress events are rebuilt each attempt and emitted exactly once on success, preserving the original ordering and reporter-increment behavior; and _TestRegistry implements the only @abstractmethod (fetch_repositories). I found no concrete defects to flag.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@seedspirit seedspirit requested a review from HyeockJinKim June 30, 2026 08:46
@seedspirit seedspirit added this to the 26.4 milestone Jun 30, 2026
Comment thread src/ai/backend/manager/container_registry/base.py Outdated
image_identifiers: list[tuple[str, str]],
) -> tuple[list[ImageData], list[tuple[str, str]]]:
scanned_images: list[ImageData] = []
progress_events: list[tuple[str, str]] = []

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aggregating the logs doesn't seem like a good approach.

Wouldn't it be better to handle logging and reporter.update separately?

Even if we decide to aggregate logs, collecting raw strings such as "warning" and "info" and then dispatching them through a switch statement seems quite fragile.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since retry logic has been implemented, it seems more appropriate to aggregate the logs rather than log them each time, and display them all at once when the operation succeeds. I’ve changed "warning" and "info" to enums for now.

seedspirit and others added 2 commits July 1, 2026 12:07
Replace the (level, message) tuple buffer with a dedicated
`_RescanProgressEvent` dataclass and reuse the shared `LogLevel` enum
instead of raw "info"/"warning" strings, per review feedback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@seedspirit

seedspirit commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

Resolved by #12464

@seedspirit seedspirit closed this Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:manager Related to Manager component size:L 100~500 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants