fix(controller): retry the auto-snapshot RD bookkeeping stamp on update conflict#173
Conversation
…te conflict The RD object the auto-snapshot tick stamps (NextAutoId + last-at annotation) comes from Tick's List, and the Snapshot create right before the stamp wakes reconcilers that update the RD concurrently — so the stamp's bare Update lands on a stale resourceVersion and 409s. Tick swallows per-RD errors, so the bookkeeping was silently dropped: the next tick re-derived the SAME id every interval, and only createAutoSnapshot's AlreadyExists guard kept the loop idempotent. The race fires deterministically in the L-integration stack (TestGroupG/AutoSnapshotPeriodicTick red on current runners). Stamp now re-reads the RD fresh inside retry.RetryOnConflict, so a racing reconciler write no longer costs the tick its bookkeeping. Regression: TestAutoSnapshotStampSurvivesRDUpdateConflict (L1, fail-on-bug proven — interceptor 409s the first RD Update; pre-fix the stamp is dropped with the exact integration-test signature, post-fix the retry lands it). TestGroupG/AutoSnapshotPeriodicTick goes from red 3/3 to green 3/3 locally. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThis PR adds retry-on-conflict handling to ChangesAutosnapshot RD Update Conflict Handling
Estimated code review effort: 3 (Moderate) | ~20 minutes CI-Scaled Test Timeouts
Sequence Diagram(s)sequenceDiagram
participant Tick as AutoSnapshotRunnable.Tick
participant Stamp as stampRDAfterCreate
participant API as Kubernetes API
Tick->>Stamp: invoke after snapshot creation
Stamp->>API: Get ResourceDefinition
API-->>Stamp: return RD
Stamp->>Stamp: set NextAutoId, last-at annotation
Stamp->>API: Update RD
API-->>Stamp: 409 Conflict
Stamp->>API: Get ResourceDefinition (retry)
API-->>Stamp: return fresh RD
Stamp->>Stamp: reapply NextAutoId, last-at
Stamp->>API: Update RD
API-->>Stamp: success
Related PRs: None mentioned in provided context. Suggested labels: bug, tests Suggested reviewers: None specified. 🐰 A conflict came, the RD retried, ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces retry logic on conflict when updating the ResourceDefinition in stampRDAfterCreate to handle concurrent updates gracefully, and adds a corresponding unit test to verify this behavior. The feedback suggests updating the passed-in rd pointer with the fresh object upon a successful update to prevent potential bugs from stale pointers in the caller.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| // Bare error: RetryOnConflict matches on the apierrors type. | ||
| return r.Client.Update(ctx, &fresh) |
There was a problem hiding this comment.
In the previous implementation, stampRDAfterCreate modified the passed-in rd object in-place. With the new retry logic, rd is left unmodified because the updates are applied to a local fresh variable. Although the current caller does not rely on the updated fields of rd after this call, leaving the passed-in pointer stale can easily lead to subtle bugs in the future if the caller or subsequent logic is extended to use rd.
To prevent this, update the dereferenced rd pointer with the fresh object upon a successful API update.
// Bare error: RetryOnConflict matches on the apierrors type.
updateErr := r.Client.Update(ctx, &fresh)
if updateErr == nil {
*rd = fresh
}
return updateErr…flake The per-group convergence constants (30s) were tuned on dev machines; GitHub-hosted runners under a full suite run are several times slower, so a random group blows its budget and the Integration lane rotate-flakes (GroupI's ResourceConnectionPathCreate one run, GroupJ's CSICreateVolumeFromEmpty the next — both pass locally 3/3). Eventually carries only positive-convergence asserts, so green runs pay nothing for the stretch; only genuinely failing runs report slower, still capped by the job-level -timeout. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
What
The auto-snapshot tick's RD bookkeeping stamp (
NextAutoId+last-atannotation) did a bareUpdateon the RD object taken from Tick'sList. The Snapshot create right before the stamp wakes reconcilers that update the RD concurrently, so the stamp lands on a staleresourceVersionand 409s.Tickswallows per-RD errors → the bookkeeping is silently dropped: the next tick re-derives the SAME id every interval, and onlycreateAutoSnapshot'sAlreadyExistsguard keeps the loop idempotent (wasting a tick per interval and spamming conflict errors on a busy cluster).The stamp now re-reads the RD fresh inside
retry.RetryOnConflict.Why now
TestGroupG/AutoSnapshotPeriodicTickfails deterministically (3/3) on current runners — it is what's red on PR #172's Integration lane, and it reproduces on main's store code unchanged, i.e. it's a latent main bug surfaced by runner timing, not a regression from any open PR.Testing
TestAutoSnapshotStampSurvivesRDUpdateConflict(L1, fail-on-bug proven): interceptor 409s the first RD Update; pre-fix the stamp is dropped with the exact integration-test signature (NextAutoId got "" want "2",last-atunset), post-fix the retry lands it.TestGroupG/AutoSnapshotPeriodicTick: red 3/3 → green 3/3 locally (envtest).internal/controllergreen, lint 0 issues.Note: not a CLI-verb change — L1 + the existing L-integration cell are the appropriate tiers.
Summary by CodeRabbit