Skip to content

[FLINK-39967] Fix backoffLimit=-1 to mean unlimited retries in FlinkStateSnapshot#1084

Merged
gyfora merged 3 commits into
apache:mainfrom
vsantwana:FLINK-snapshot-backoff-bug-test
Jun 22, 2026
Merged

[FLINK-39967] Fix backoffLimit=-1 to mean unlimited retries in FlinkStateSnapshot#1084
gyfora merged 3 commits into
apache:mainfrom
vsantwana:FLINK-snapshot-backoff-bug-test

Conversation

@vsantwana

@vsantwana vsantwana commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

What is the purpose of the change

FlinkStateSnapshotSpec.backoffLimit defaults to -1, documented as meaning "unlimited" retries. However, the retry decision in FlinkStateSnapshotController did a plain numeric comparison failures > backoffLimit. With the default -1, after the first failure failures is 1, so 1 > -1 is always true and the snapshot was immediately marked failed with no retry - the exact opposite of the documented behavior.

This PR fixes the comparison so that a negative backoffLimit (the -1 sentinel) correctly means "retry indefinitely", while 0 means no retries and N means up to N retries.

Brief change log

  • Treat backoffLimit < 0 as "unlimited retries" in FlinkStateSnapshotController, so the documented -1 default actually retries indefinitely instead of failing on the first error.
  • Added a test proving the bug and verifying the fix (snapshot with default backoffLimit=-1 is retried after a failure rather than immediately failing).

Verifying this change

This change added tests and can be verified as follows:

  • FlinkStateSnapshotControllerTest covers a snapshot failure with the default backoffLimit=-1 and asserts the snapshot is rescheduled for retry (with exponential backoff) instead of being marked failed with no retry.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: yes (snapshot reconciler retry handling)

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@vsantwana vsantwana force-pushed the FLINK-snapshot-backoff-bug-test branch from d25b9df to 8e18ecb Compare April 8, 2026 11:22
@gyfora

gyfora commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Can you please attach a JIRA to this PR?

vsantwana and others added 3 commits June 22, 2026 18:10
…eSnapshot

The default backoffLimit of -1 (documented as "unlimited retries") causes
snapshots to immediately fail after the first error. The check
`failures > backoffLimit` evaluates to `1 > -1` which is always true,
so no retries ever happen with the default configuration.

This test is expected to fail until the bug is fixed.
The spec documents backoffLimit=-1 (the default) as "unlimited retries",
but the controller compared `failures > backoffLimit` without handling the
-1 sentinel. After the first failure (failures=1), `1 > -1` was true, so the
snapshot was marked FAILED with withNoRetry() -- the opposite of unlimited.

Guard the no-retry branch with `backoffLimit >= 0` so:
  -1  -> retries indefinitely (default)
   0  -> no retries
   N  -> retries up to N times

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vsantwana vsantwana changed the title Add test proving backoffLimit=-1 (unlimited retries) bug in FlinkStat… [FLINK-39967] Fix backoffLimit=-1 to mean unlimited retries in FlinkStateSnapshot Jun 22, 2026
@vsantwana vsantwana force-pushed the FLINK-snapshot-backoff-bug-test branch from cfdf159 to 91654f3 Compare June 22, 2026 12:46
@vsantwana vsantwana changed the title [FLINK-39967] Fix backoffLimit=-1 to mean unlimited retries in FlinkStateSnapshot [FLINK-39967] Fix backoffLimit=-1 to mean unlimited retries in FlinkStateSnapshot Jun 22, 2026
@gyfora gyfora merged commit 5350465 into apache:main Jun 22, 2026
128 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants