Skip to content

fix: EKS cluster template and kro RGD issues#744

Merged
shapirov103 merged 6 commits into
mainfrom
fix/eks-cluster-template-issues
Jul 2, 2026
Merged

fix: EKS cluster template and kro RGD issues#744
shapirov103 merged 6 commits into
mainfrom
fix/eks-cluster-template-issues

Conversation

@allamand

@allamand allamand commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes multiple issues in the EKS cluster Backstage template and kro ResourceGraphDefinition.

Closes #743

Changes

Severity File Fix
High platform/backstage/templates/eks-cluster-template/template.yaml Add missing enableAwsEfsCsiDriver parameter definition
High gitops/addons/charts/kro/resource-groups/manifests/eks/rg-eks-vpc.yaml Fix readyWhen: stateclusterState to match EksCluster RGD status schema
Medium rg-eks-vpc.yaml Fix duplicate YAML key argocd.argoproj.io/sync-options by combining values
Medium template.yaml Fix component link: add gitops/ prefix, use dynamic ${{ parameters.tenant }}
Low template.yaml Fix spec.owner: guestsuser:guest

Testing

  • Validate kro readyWhen condition resolves correctly with an active EKS cluster
  • Verify Backstage template renders enableAwsEfsCsiDriver parameter correctly
  • Confirm ArgoCD sync-options are both applied after the annotation fix

Workshop User added 4 commits June 9, 2026 17:10
- Add missing enableAwsEfsCsiDriver parameter to Backstage template
- Fix readyWhen condition: use clusterState instead of state in rg-eks-vpc.yaml
- Fix duplicate argocd.argoproj.io/sync-options annotation (YAML key collision)
- Fix Backstage component link: add gitops/ prefix and use dynamic tenant param
- Fix spec.owner: use user:guest instead of bare guests
The ACK secretsmanager controller running as an EKS Capability needs
ClusterRole permissions to manage secretsmanager.services.k8s.aws CRDs.
Without this, the controller cannot resolve secretString references to
k8s secrets, causing clusterConfigSecret to never sync.
…conciliation

- Add critical section in eks-best-practices.md to prevent deleting K8s resources to force reconciliation
- Add troubleshooting best practices in aws-env-best-practices.md emphasizing patience and non-destructive debugging
- These rules prevent breaking status propagation chains in Kro resource graphs and ACK controllers
- Recommend 15-30 minute wait time for controller reconciliation after config changes
@allamand

Copy link
Copy Markdown
Contributor Author

Missing: secretsmanager IAMRoleSelector in iam-role-selectors.yaml

The PR adds RBAC for secretsmanager.services.k8s.aws in eks-capabilities-rbac.yaml (good), but the rg-eks.yaml creates Secrets Manager resources (clusterConfigSecret, clusterSecretsSecret) in spoke namespaces. Without an IAMRoleSelector for secretsmanager, ACK uses the default capability role which has no secretsmanager:* permissions → the secrets fail with AccessDeniedException.

Fix needed

Add to gitops/addons/charts/multi-acct/templates/iam-role-selectors.yaml (inside the range loop):

---
apiVersion: services.k8s.aws/v1alpha1
kind: IAMRoleSelector
metadata:
  name: secretsmanager-{{ $key }}
spec:
  arn: "arn:aws:iam::{{ $value }}:role/{{ $.Values.global.resourcePrefix }}-cluster-mgmt-secretsmanager"
  namespaceSelector:
    names:
      - {{ $key }}
  resourceTypeSelector:
    - group: secretsmanager.services.k8s.aws
      kind: ""
      version: ""

Reference

This was fixed on feature/cloudfront-on-agent-platform in commit 7f123cde — see the full template there.

Also: apply the PR #744 changes to feature/cloudfront-on-agent-platform

The following changes from this PR should be cherry-picked onto our branch:

  1. Merge duplicate argocd.argoproj.io/sync-options annotations into one line
  2. Change observability defaults to true (enable_cni_metrics_helper, enable_prometheus_node_exporter, enable_kube_state_metrics)
  3. Fix readyWhen to use eks.status.clusterState instead of eks.status.state
  4. Add secretsmanager RBAC in eks-capabilities-rbac.yaml

allamand pushed a commit that referenced this pull request Jun 11, 2026
- Merge duplicate sync-options annotations
- Enable observability by default (cni_metrics, prometheus_node_exporter, kube_state_metrics)
- Fix readyWhen: use eks.status.clusterState instead of eks.status.state
- Add secretsmanager RBAC in eks-capabilities-rbac
allamand added a commit that referenced this pull request Jun 12, 2026
- Merge duplicate sync-options annotations
- Enable observability by default (cni_metrics, prometheus_node_exporter, kube_state_metrics)
- Fix readyWhen: use eks.status.clusterState instead of eks.status.state
- Add secretsmanager RBAC in eks-capabilities-rbac
@allamand allamand marked this pull request as ready for review June 13, 2026 10:36
@allamand

Copy link
Copy Markdown
Contributor Author

@shapirov103 shapirov103 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool, looks great, please see my comment.

# Get admin role name from current AWS context
ADMIN_ROLE_NAME=$(aws sts get-caller-identity --query 'Arn' --output text | sed 's|.*assumed-role/||' | sed 's|/.*||')
# Use WSParticipantRole (Workshop Studio standard role). Fallback to caller identity for self-paced deployments.
ADMIN_ROLE_NAME=$(aws iam list-roles --query 'Roles[?contains(RoleName,`WSParticipantRole`)].RoleName' --output text 2>/dev/null)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WSParticipant role is specific to the workshop. Should go into the gitlab repo. What is the plan to start moving this out of the repo? Otherwise the refactoring effort for agent-platform will break the workshop and will be hard to merge back to main.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use utils.sh in the agentic branch (pr-709) but will need to double check there that we don't reference it out of the workshop pattern

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we don't use this in the agentic branch, but the agentic branch is going to target main soon - what will happen with this change? Does it mean someone will have to replicate in the workshop? if so, why not do it now, save on the merge, I am afraid it maybe lost otherwise after the merge.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes if the plan is to merge agentic work into main quickly then we can save effort. but for now, main is broken with the optional module to create eks staging cluster not working. so everything depend on the timeline we are agree to merge agentic work to main, and I think there is still some work to do before that

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, we can apply the fix now, the sh change is one of the things in the PR, other changes seem to be applicable regardless of the workshop

@allamand

Copy link
Copy Markdown
Contributor Author

All 5 fixes from this PR are already present in feature/cloudfront-on-agent-platform (PR #709) — no separate port needed. They will land on main when PR #709 merges.

@allamand allamand added the ready-to-merge this PR is ready to be merged label Jun 24, 2026

@shapirov103 shapirov103 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (utils.sh change to move out of the repo, deferred)

@hmuthusamy hmuthusamy left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@shapirov103 shapirov103 merged commit fcc7536 into main Jul 2, 2026
6 checks passed
@shapirov103 shapirov103 deleted the fix/eks-cluster-template-issues branch July 2, 2026 13:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-to-merge this PR is ready to be merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: EKS cluster template and kro RGD issues in eks-cluster-template

3 participants