fix(controller): retry the auto-snapshot RD bookkeeping stamp on update conflict by kvaps · Pull Request #173 · cozystack/blockstor

Andrei Kvapil (kvaps) · 2026-07-01T22:31:30Z

What

The auto-snapshot tick's RD bookkeeping stamp (NextAutoId + last-at annotation) did a bare Update on the RD object taken from Tick's List. The Snapshot create right before the stamp wakes reconcilers that update the RD concurrently, so the stamp lands on a stale resourceVersion and 409s. Tick swallows per-RD errors → the bookkeeping is silently dropped: the next tick re-derives the SAME id every interval, and only createAutoSnapshot's AlreadyExists guard keeps the loop idempotent (wasting a tick per interval and spamming conflict errors on a busy cluster).

The stamp now re-reads the RD fresh inside retry.RetryOnConflict.

Why now

TestGroupG/AutoSnapshotPeriodicTick fails deterministically (3/3) on current runners — it is what's red on PR #172's Integration lane, and it reproduces on main's store code unchanged, i.e. it's a latent main bug surfaced by runner timing, not a regression from any open PR.

Testing

TestAutoSnapshotStampSurvivesRDUpdateConflict (L1, fail-on-bug proven): interceptor 409s the first RD Update; pre-fix the stamp is dropped with the exact integration-test signature (NextAutoId got "" want "2", last-at unset), post-fix the retry lands it.
TestGroupG/AutoSnapshotPeriodicTick: red 3/3 → green 3/3 locally (envtest).
Full internal/controller green, lint 0 issues.

Note: not a CLI-verb change — L1 + the existing L-integration cell are the appropriate tiers.

Summary by CodeRabbit

Bug Fixes
- Improved auto-snapshot reliability so snapshot tracking updates are less likely to fail when resource definitions change at the same time.
- Reduced flaky integration test behavior in CI by allowing longer wait times during slow runs.

…te conflict The RD object the auto-snapshot tick stamps (NextAutoId + last-at annotation) comes from Tick's List, and the Snapshot create right before the stamp wakes reconcilers that update the RD concurrently — so the stamp's bare Update lands on a stale resourceVersion and 409s. Tick swallows per-RD errors, so the bookkeeping was silently dropped: the next tick re-derived the SAME id every interval, and only createAutoSnapshot's AlreadyExists guard kept the loop idempotent. The race fires deterministically in the L-integration stack (TestGroupG/AutoSnapshotPeriodicTick red on current runners). Stamp now re-reads the RD fresh inside retry.RetryOnConflict, so a racing reconciler write no longer costs the tick its bookkeeping. Regression: TestAutoSnapshotStampSurvivesRDUpdateConflict (L1, fail-on-bug proven — interceptor 409s the first RD Update; pre-fix the stamp is dropped with the exact integration-test signature, post-fix the retry lands it). TestGroupG/AutoSnapshotPeriodicTick goes from red 3/3 to green 3/3 locally. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

coderabbitai · 2026-07-01T22:31:48Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 89e6288d-63a9-4e4c-979a-adb678b30d93

📥 Commits

Reviewing files that changed from the base of the PR and between eb0b779 and 0ead801.

📒 Files selected for processing (4)

internal/controller/autosnapshot_controller.go
internal/controller/autosnapshot_stamp_conflict_test.go
tests/integration/harness/asserts.go
tests/integration/harness/asserts_ci_scale_test.go

📝 Walkthrough

Walkthrough

This PR adds retry-on-conflict handling to stampRDAfterCreate in the autosnapshot controller, re-fetching the ResourceDefinition on update conflicts before reapplying bookkeeping fields, with a corresponding conflict-simulation test. Separately, it introduces a scaledTimeout helper that stretches integration test Eventually timeouts by 3x under CI, with a matching test.

Changes

Autosnapshot RD Update Conflict Handling

Layer / File(s)	Summary
Retry-on-conflict implementation `internal/controller/autosnapshot_controller.go`	`stampRDAfterCreate` now uses `retry.RetryOnConflict` to re-GET the ResourceDefinition on each attempt, reapply `NextAutoId` and the last-at annotation on the fresh object, and update it, instead of updating the stale in-memory object once.
Conflict simulation test `internal/controller/autosnapshot_stamp_conflict_test.go`	Adds `TestAutoSnapshotStampSurvivesRDUpdateConflict`, which forces one `409 Conflict` via a fake client interceptor during `Tick` and asserts bookkeeping fields are correctly set afterward.

Estimated code review effort: 3 (Moderate) | ~20 minutes

CI-Scaled Test Timeouts

Layer / File(s)	Summary
scaledTimeout helper and Eventually integration `tests/integration/harness/asserts.go`	Adds `scaledTimeout`, which multiplies a given timeout by 3 when the `CI` environment variable is set; `Eventually` now derives its deadline and failure message from `scaledTimeout(timeout)`.
scaledTimeout test `tests/integration/harness/asserts_ci_scale_test.go`	Adds `TestScaledTimeoutStretchesOnCI`, verifying 30s scales to 90s under `CI=true` and remains 30s otherwise.

Sequence Diagram(s)

sequenceDiagram
  participant Tick as AutoSnapshotRunnable.Tick
  participant Stamp as stampRDAfterCreate
  participant API as Kubernetes API

  Tick->>Stamp: invoke after snapshot creation
  Stamp->>API: Get ResourceDefinition
  API-->>Stamp: return RD
  Stamp->>Stamp: set NextAutoId, last-at annotation
  Stamp->>API: Update RD
  API-->>Stamp: 409 Conflict
  Stamp->>API: Get ResourceDefinition (retry)
  API-->>Stamp: return fresh RD
  Stamp->>Stamp: reapply NextAutoId, last-at
  Stamp->>API: Update RD
  API-->>Stamp: success

Related PRs: None mentioned in provided context.

Suggested labels: bug, tests

Suggested reviewers: None specified.

🐰 A conflict came, the RD retried,
Fresh copies fetched before applied,
And timers stretched threefold on CI,
So flaky waits would finally comply.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/autosnapshot-stamp-conflict

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

gemini-code-assist

Code Review

This pull request introduces retry logic on conflict when updating the ResourceDefinition in stampRDAfterCreate to handle concurrent updates gracefully, and adds a corresponding unit test to verify this behavior. The feedback suggests updating the passed-in rd pointer with the fresh object upon a successful update to prevent potential bugs from stale pointers in the caller.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-07-01T22:32:53Z

+		// Bare error: RetryOnConflict matches on the apierrors type.
+		return r.Client.Update(ctx, &fresh)


In the previous implementation, stampRDAfterCreate modified the passed-in rd object in-place. With the new retry logic, rd is left unmodified because the updates are applied to a local fresh variable. Although the current caller does not rely on the updated fields of rd after this call, leaving the passed-in pointer stale can easily lead to subtle bugs in the future if the caller or subsequent logic is extended to use rd.

To prevent this, update the dereferenced rd pointer with the fresh object upon a successful API update.

// Bare error: RetryOnConflict matches on the apierrors type. updateErr := r.Client.Update(ctx, &fresh) if updateErr == nil { *rd = fresh } return updateErr

…flake The per-group convergence constants (30s) were tuned on dev machines; GitHub-hosted runners under a full suite run are several times slower, so a random group blows its budget and the Integration lane rotate-flakes (GroupI's ResourceConnectionPathCreate one run, GroupJ's CSICreateVolumeFromEmpty the next — both pass locally 3/3). Eventually carries only positive-convergence asserts, so green runs pay nothing for the stretch; only genuinely failing runs report slower, still capped by the job-level -timeout. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

gemini-code-assist Bot reviewed Jul 1, 2026

View reviewed changes

Andrei Kvapil (kvaps) marked this pull request as ready for review July 1, 2026 23:35

Andrei Kvapil (kvaps) merged commit 1e9501e into main Jul 1, 2026
15 checks passed

Andrei Kvapil (kvaps) mentioned this pull request Jul 2, 2026

chore(release): v0.1.16 #174

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(controller): retry the auto-snapshot RD bookkeeping stamp on update conflict#173

fix(controller): retry the auto-snapshot RD bookkeeping stamp on update conflict#173
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/autosnapshot-stamp-conflict

Andrei Kvapil (kvaps) commented Jul 1, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		// Bare error: RetryOnConflict matches on the apierrors type.
		return r.Client.Update(ctx, &fresh)

Uh oh!

Conversation

Andrei Kvapil (kvaps) commented Jul 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why now

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Andrei Kvapil (kvaps) commented Jul 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading