Skip to content

Retry interrupted at-most-once steps#399

Closed
kiwigitops wants to merge 1 commit into
aws:mainfrom
kiwigitops:fix-at-most-once-step-retry
Closed

Retry interrupted at-most-once steps#399
kiwigitops wants to merge 1 commit into
aws:mainfrom
kiwigitops:fix-at-most-once-step-retry

Conversation

@kiwigitops

@kiwigitops kiwigitops commented May 24, 2026

Copy link
Copy Markdown

Summary

  • let interrupted AT_MOST_ONCE_PER_RETRY steps use the configured retry strategy instead of failing before retry handling
  • keep the no-retry checkpoint-failure path covered and add a regression for retrying after a STARTED checkpoint replay

Closes #394.

Testing

  • docker run --rm -v ${PWD}:/workspace -w /workspace maven:3.9-eclipse-temurin-21 mvn -pl sdk-integration-tests -am -Dtest=StepSemanticsIntegrationTest -Dsurefire.failIfNoSpecifiedTests=false test
  • git diff --check

@kiwigitops kiwigitops requested a review from a team May 24, 2026 21:32
var retryDecision = config.retryStrategy().makeRetryDecision(exception, attempt);

if (isRetryable && retryDecision.shouldRetry()) {
if (retryDecision.shouldRetry()) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this change leading to retry and break the at-most-once semantic?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this keeps the at-most-once guarantee scoped to each retry attempt.

For a replayed STARTED at-most-once step, we still do not call executeStepLogic for that same attempt. The interrupted attempt is converted through the retry strategy into either RETRY/PENDING or FAIL. User code only runs again after the operation becomes READY, which is the next retry attempt.

That is what the two tests are intended to pin down: testAtMostOnceNoRetryFailsAfterCheckpointFailure proves the started attempt is not re-executed when retries are disabled, and testAtMostOnceRetriesAfterCheckpointFailure proves a configured retry creates exactly one next-attempt execution.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a breaking change to users who rely on the existing behavior.

@kiwigitops

Copy link
Copy Markdown
Author

Thanks, agreed. The existing integration tests make this clearer: for AT_MOST_ONCE_PER_RETRY, an interrupted started step is currently exposed as StepInterruptedException, and user code can rely on catching that or on the execution failing rather than entering retry/pending state.

This PR changes that observable behavior under the normal retry path, so I am going to close it rather than keep pushing a breaking semantic change. If retrying interrupted at-most-once steps is still desirable later, it probably needs an explicit opt-in API or a design discussion rather than changing the current semantics in place.

@kiwigitops kiwigitops closed this May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: step does not retry with at_most_once_per_retry config

2 participants