Skip to content

fix(controller): retry the auto-snapshot RD bookkeeping stamp on update conflict#173

Merged
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/autosnapshot-stamp-conflict
Jul 1, 2026
Merged

fix(controller): retry the auto-snapshot RD bookkeeping stamp on update conflict#173
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/autosnapshot-stamp-conflict

Conversation

@kvaps

@kvaps Andrei Kvapil (kvaps) commented Jul 1, 2026

Copy link
Copy Markdown
Member

What

The auto-snapshot tick's RD bookkeeping stamp (NextAutoId + last-at annotation) did a bare Update on the RD object taken from Tick's List. The Snapshot create right before the stamp wakes reconcilers that update the RD concurrently, so the stamp lands on a stale resourceVersion and 409s. Tick swallows per-RD errors → the bookkeeping is silently dropped: the next tick re-derives the SAME id every interval, and only createAutoSnapshot's AlreadyExists guard keeps the loop idempotent (wasting a tick per interval and spamming conflict errors on a busy cluster).

The stamp now re-reads the RD fresh inside retry.RetryOnConflict.

Why now

TestGroupG/AutoSnapshotPeriodicTick fails deterministically (3/3) on current runners — it is what's red on PR #172's Integration lane, and it reproduces on main's store code unchanged, i.e. it's a latent main bug surfaced by runner timing, not a regression from any open PR.

Testing

  • TestAutoSnapshotStampSurvivesRDUpdateConflict (L1, fail-on-bug proven): interceptor 409s the first RD Update; pre-fix the stamp is dropped with the exact integration-test signature (NextAutoId got "" want "2", last-at unset), post-fix the retry lands it.
  • TestGroupG/AutoSnapshotPeriodicTick: red 3/3 → green 3/3 locally (envtest).
  • Full internal/controller green, lint 0 issues.

Note: not a CLI-verb change — L1 + the existing L-integration cell are the appropriate tiers.

Summary by CodeRabbit

  • Bug Fixes
    • Improved auto-snapshot reliability so snapshot tracking updates are less likely to fail when resource definitions change at the same time.
    • Reduced flaky integration test behavior in CI by allowing longer wait times during slow runs.

…te conflict

The RD object the auto-snapshot tick stamps (NextAutoId + last-at
annotation) comes from Tick's List, and the Snapshot create right
before the stamp wakes reconcilers that update the RD concurrently —
so the stamp's bare Update lands on a stale resourceVersion and 409s.
Tick swallows per-RD errors, so the bookkeeping was silently dropped:
the next tick re-derived the SAME id every interval, and only
createAutoSnapshot's AlreadyExists guard kept the loop idempotent.
The race fires deterministically in the L-integration stack
(TestGroupG/AutoSnapshotPeriodicTick red on current runners).

Stamp now re-reads the RD fresh inside retry.RetryOnConflict, so a
racing reconciler write no longer costs the tick its bookkeeping.

Regression: TestAutoSnapshotStampSurvivesRDUpdateConflict (L1,
fail-on-bug proven — interceptor 409s the first RD Update; pre-fix the
stamp is dropped with the exact integration-test signature, post-fix
the retry lands it). TestGroupG/AutoSnapshotPeriodicTick goes from
red 3/3 to green 3/3 locally.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 89e6288d-63a9-4e4c-979a-adb678b30d93

📥 Commits

Reviewing files that changed from the base of the PR and between eb0b779 and 0ead801.

📒 Files selected for processing (4)
  • internal/controller/autosnapshot_controller.go
  • internal/controller/autosnapshot_stamp_conflict_test.go
  • tests/integration/harness/asserts.go
  • tests/integration/harness/asserts_ci_scale_test.go

📝 Walkthrough

Walkthrough

This PR adds retry-on-conflict handling to stampRDAfterCreate in the autosnapshot controller, re-fetching the ResourceDefinition on update conflicts before reapplying bookkeeping fields, with a corresponding conflict-simulation test. Separately, it introduces a scaledTimeout helper that stretches integration test Eventually timeouts by 3x under CI, with a matching test.

Changes

Autosnapshot RD Update Conflict Handling

Layer / File(s) Summary
Retry-on-conflict implementation
internal/controller/autosnapshot_controller.go
stampRDAfterCreate now uses retry.RetryOnConflict to re-GET the ResourceDefinition on each attempt, reapply NextAutoId and the last-at annotation on the fresh object, and update it, instead of updating the stale in-memory object once.
Conflict simulation test
internal/controller/autosnapshot_stamp_conflict_test.go
Adds TestAutoSnapshotStampSurvivesRDUpdateConflict, which forces one 409 Conflict via a fake client interceptor during Tick and asserts bookkeeping fields are correctly set afterward.

Estimated code review effort: 3 (Moderate) | ~20 minutes

CI-Scaled Test Timeouts

Layer / File(s) Summary
scaledTimeout helper and Eventually integration
tests/integration/harness/asserts.go
Adds scaledTimeout, which multiplies a given timeout by 3 when the CI environment variable is set; Eventually now derives its deadline and failure message from scaledTimeout(timeout).
scaledTimeout test
tests/integration/harness/asserts_ci_scale_test.go
Adds TestScaledTimeoutStretchesOnCI, verifying 30s scales to 90s under CI=true and remains 30s otherwise.

Sequence Diagram(s)

sequenceDiagram
  participant Tick as AutoSnapshotRunnable.Tick
  participant Stamp as stampRDAfterCreate
  participant API as Kubernetes API

  Tick->>Stamp: invoke after snapshot creation
  Stamp->>API: Get ResourceDefinition
  API-->>Stamp: return RD
  Stamp->>Stamp: set NextAutoId, last-at annotation
  Stamp->>API: Update RD
  API-->>Stamp: 409 Conflict
  Stamp->>API: Get ResourceDefinition (retry)
  API-->>Stamp: return fresh RD
  Stamp->>Stamp: reapply NextAutoId, last-at
  Stamp->>API: Update RD
  API-->>Stamp: success
Loading

Related PRs: None mentioned in provided context.

Suggested labels: bug, tests

Suggested reviewers: None specified.

🐰 A conflict came, the RD retried,
Fresh copies fetched before applied,
And timers stretched threefold on CI,
So flaky waits would finally comply.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/autosnapshot-stamp-conflict

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces retry logic on conflict when updating the ResourceDefinition in stampRDAfterCreate to handle concurrent updates gracefully, and adds a corresponding unit test to verify this behavior. The feedback suggests updating the passed-in rd pointer with the fresh object upon a successful update to prevent potential bugs from stale pointers in the caller.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +377 to +378
// Bare error: RetryOnConflict matches on the apierrors type.
return r.Client.Update(ctx, &fresh)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In the previous implementation, stampRDAfterCreate modified the passed-in rd object in-place. With the new retry logic, rd is left unmodified because the updates are applied to a local fresh variable. Although the current caller does not rely on the updated fields of rd after this call, leaving the passed-in pointer stale can easily lead to subtle bugs in the future if the caller or subsequent logic is extended to use rd.

To prevent this, update the dereferenced rd pointer with the fresh object upon a successful API update.

		// Bare error: RetryOnConflict matches on the apierrors type.
		updateErr := r.Client.Update(ctx, &fresh)
		if updateErr == nil {
			*rd = fresh
		}
		return updateErr

…flake

The per-group convergence constants (30s) were tuned on dev machines;
GitHub-hosted runners under a full suite run are several times slower,
so a random group blows its budget and the Integration lane
rotate-flakes (GroupI's ResourceConnectionPathCreate one run, GroupJ's
CSICreateVolumeFromEmpty the next — both pass locally 3/3). Eventually
carries only positive-convergence asserts, so green runs pay nothing
for the stretch; only genuinely failing runs report slower, still
capped by the job-level -timeout.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps Andrei Kvapil (kvaps) marked this pull request as ready for review July 1, 2026 23:35
@kvaps Andrei Kvapil (kvaps) merged commit 1e9501e into main Jul 1, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant