fix: detect ROLLBACK_COMPLETE nodegroup stacks during create #8713

Open
costela wants to merge 1 commit into eksctl-io:main from costela:fix/rollback-complete-nodegroup-detection

costela commented Apr 16, 2026

Summary

  • eksctl create nodegroup now fails fast with an actionable error when it encounters a nodegroup stack in ROLLBACK_COMPLETE state that matches a nodegroup in the user's config, instead of silently skipping it and exiting 0
  • Rewords the misleading "all nodegroups have up-to-date cloudformation templates" log message to accurately describe what is checked (shared security group compatibility)
  • Removes ROLLBACK_COMPLETE from nonTransitionalReadyStackStatuses, which incorrectly grouped failed-create stacks with healthy terminal states like CREATE_COMPLETE
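The fail-fast behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `stack` type, the `checkRolledBack` function, and the error wording are all hypothetical, statuses are plain strings rather than AWS SDK types, and the `onlyLocal` parameter merely stands in for the create-path gate.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// stack is a hypothetical, simplified view of a nodegroup CloudFormation stack.
type stack struct {
	Nodegroup string // nodegroup name the stack belongs to
	Status    string // CloudFormation stack status as a plain string
}

// checkRolledBack returns an error naming every config nodegroup whose stack
// is in ROLLBACK_COMPLETE. It is a no-op unless onlyLocal is set, which here
// stands in for "we are on the create path" (an assumption of this sketch).
func checkRolledBack(onlyLocal bool, cluster string, configNames []string, stacks []stack) error {
	if !onlyLocal {
		return nil
	}
	inConfig := make(map[string]bool, len(configNames))
	for _, n := range configNames {
		inConfig[n] = true
	}
	var broken []string
	for _, s := range stacks {
		if s.Status == "ROLLBACK_COMPLETE" && inConfig[s.Nodegroup] {
			broken = append(broken, s.Nodegroup)
		}
	}
	if len(broken) == 0 {
		return nil
	}
	sort.Strings(broken)
	// Illustrative wording only; the real error message may differ.
	return fmt.Errorf(
		"nodegroup stack(s) in ROLLBACK_COMPLETE: %s; delete them first, e.g. eksctl delete nodegroup --cluster=%s --name=<name>",
		strings.Join(broken, ", "), cluster)
}

func main() {
	stacks := []stack{
		{Nodegroup: "ng-1", Status: "ROLLBACK_COMPLETE"},
		{Nodegroup: "ng-2", Status: "CREATE_COMPLETE"},
	}
	fmt.Println(checkRolledBack(true, "demo", []string{"ng-1"}, stacks))  // error: ng-1 is in the config and rolled back
	fmt.Println(checkRolledBack(true, "demo", []string{"ng-2"}, stacks))  // no error: rolled-back stack is not in the config
	fmt.Println(checkRolledBack(false, "demo", []string{"ng-1"}, stacks)) // no error: not on the create path
}
```

Returning an error rather than logging a warning is what turns the silent exit 0 into a non-zero exit with a concrete next step for the user.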

Fixes #8712
Related: #4006 (same symptom, closed by stale-bot without a fix)

Root cause

Two bugs combined to produce the silent failure:

  1. The create-nodegroup filter treated ROLLBACK_COMPLETE stacks as healthy existing nodegroups. NodeGroupFilter.SetOnlyLocal calls loadLocalAndRemoteNodegroups, which lists all nodegroup stacks and marks them as "remote" (i.e. already exists, skip creation). ListNodeGroupStacks only filters out DELETE_COMPLETE/DELETE_FAILED, so ROLLBACK_COMPLETE stacks pass through — causing the nodegroup to be silently excluded from the create plan.

  2. The post-create compatibility check used a misleading log message. ValidateExistingNodeGroupsForCompatibility only checks shared security group CFN outputs via isNodeGroupCompatible, but logged "all nodegroups have up-to-date cloudformation templates" — reading like a general health assertion. Its stack filter StackStatusIsNotTransitional also included ROLLBACK_COMPLETE in the "ready" set, so broken stacks passed the check.
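To make the first bug concrete, here is a small self-contained reproduction of the silent skip. It is a sketch under stated assumptions: `listNodegroupStacks` and `createPlan` are hypothetical stand-ins for the listing and filtering behaviour described above, not the actual eksctl functions, and statuses are plain strings.

```go
package main

import "fmt"

// stack is a hypothetical, simplified nodegroup stack record.
type stack struct {
	Nodegroup string
	Status    string
}

// listNodegroupStacks mimics the described listing behaviour: only deleted
// stacks are dropped, so a ROLLBACK_COMPLETE stack is still returned.
func listNodegroupStacks(all []stack) []stack {
	var out []stack
	for _, s := range all {
		if s.Status == "DELETE_COMPLETE" || s.Status == "DELETE_FAILED" {
			continue
		}
		out = append(out, s)
	}
	return out
}

// createPlan returns the config nodegroups that have no listed stack, that
// is, everything not already considered "remote".
func createPlan(configNames []string, listed []stack) []string {
	remote := make(map[string]bool, len(listed))
	for _, s := range listed {
		remote[s.Nodegroup] = true
	}
	var plan []string
	for _, n := range configNames {
		if !remote[n] {
			plan = append(plan, n)
		}
	}
	return plan
}

func main() {
	all := []stack{
		{"ng-broken", "ROLLBACK_COMPLETE"},
		{"ng-old", "DELETE_COMPLETE"},
	}
	plan := createPlan([]string{"ng-broken", "ng-old", "ng-new"}, listNodegroupStacks(all))
	// ng-broken is absent: its failed stack made it look like an existing nodegroup.
	fmt.Println(plan) // prints [ng-old ng-new]
}
```

Nothing in this flow reports that `ng-broken` was dropped, which matches the symptom of the command exiting 0 without creating the nodegroup.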

Changes

| File | Change |
|---|---|
| pkg/ctl/cmdutils/filter/nodegroup_filter.go | Detect ROLLBACK_COMPLETE stacks in loadLocalAndRemoteNodegroups (gated on f.onlyLocal so only the create path is affected); return actionable error with eksctl delete nodegroup hint |
| pkg/eks/nodegroup_service.go | Reword line 309 log message to "all nodegroups have compatible shared security group configuration" |
| pkg/cfn/manager/api.go | Remove StackStatusRollbackComplete from nonTransitionalReadyStackStatuses (keep UPDATE_ROLLBACK_COMPLETE which is genuinely healthy) |
| pkg/ctl/cmdutils/filter/nodegroup_filter_test.go | 3 new test cases: error on config nodegroup in ROLLBACK_COMPLETE, no error for unrelated nodegroup, no error on delete path |
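The effect of the pkg/cfn/manager/api.go change can be pictured with a simplified readiness predicate. This is an assumption-laden sketch: `readyStatuses` and `isReady` are hypothetical names, only the statuses named in this description are listed (the real nonTransitionalReadyStackStatuses may contain more), and plain strings replace the SDK status type.

```go
package main

import "fmt"

// readyStatuses is a hypothetical stand-in for the "ready" set after the
// change. ROLLBACK_COMPLETE is deliberately absent: it marks a failed create,
// whereas UPDATE_ROLLBACK_COMPLETE means an update was reverted to a
// previously working state.
var readyStatuses = map[string]bool{
	"CREATE_COMPLETE":          true,
	"UPDATE_ROLLBACK_COMPLETE": true,
}

// isReady reports whether a stack status counts as a healthy terminal state.
func isReady(status string) bool {
	return readyStatuses[status]
}

func main() {
	for _, s := range []string{"CREATE_COMPLETE", "UPDATE_ROLLBACK_COMPLETE", "ROLLBACK_COMPLETE"} {
		fmt.Printf("%s ready=%t\n", s, isReady(s))
	}
}
```

With the status removed from the set, a stack left over from a failed create no longer counts as ready in any check that relies on this predicate.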

Test plan

  • New unit tests pass: go test -tags=release ./pkg/ctl/cmdutils/filter/... -ginkgo.focus "ROLLBACK_COMPLETE"
  • Full test suites pass for all modified packages: go test -tags=release ./pkg/ctl/cmdutils/filter/... ./pkg/eks/... ./pkg/cfn/manager/...
  • golangci-lint run reports 0 issues on modified packages
  • go build -tags=release ./... compiles cleanly
  • Manual E2E: create a nodegroup with an invalid config so CFN reaches ROLLBACK_COMPLETE, then re-run eksctl create nodegroup — should now fail fast with the actionable error

🤖 Generated with Claude Code

When `eksctl create nodegroup` encounters an existing CloudFormation
stack in ROLLBACK_COMPLETE state for a nodegroup in the user's config,
it now fails fast with an actionable error instead of silently treating
the broken stack as a healthy existing nodegroup.

Also fixes a misleading log message that claimed "all nodegroups have
up-to-date cloudformation templates" when the check only validates
shared security group compatibility, and removes ROLLBACK_COMPLETE
from the set of non-transitional "ready" stack statuses.

Fixes eksctl-io#8712

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

[Bug] ROLLBACK_COMPLETE stacks pass compatibility check with misleading log
