Commit fcc7536

Merge pull request #744 from aws-samples/fix/eks-cluster-template-issues
fix: EKS cluster template and kro RGD issues
2 parents cbc6918 + c628b11 commit fcc7536

7 files changed

Lines changed: 41 additions & 15 deletions

.kiro/steering/aws-env-best-practices.md

Lines changed: 8 additions & 0 deletions
@@ -28,6 +28,14 @@ Enforces environment variable usage for all AWS operations to prevent hardcoded
  - When using Terraform specifically, use deployment scripts instead of direct terraform commands (ID: AWS_ENV_TF_SCRIPTS)
  - NEVER commit files containing actual AWS account IDs, access keys, or other sensitive data (ID: AWS_ENV_NO_COMMIT_SENSITIVE)

+ ### Troubleshooting and Reconciliation Best Practices
+
+ - AWS controllers (ACK, Kro, ArgoCD via EKS Capabilities) reconcile asynchronously - ALWAYS wait for natural reconciliation before attempting destructive actions (ID: AWS_RECONCILE_PATIENCE)
+ - When changes are made (IAM roles, policies, selectors), expect 15-30 minutes for full reconciliation across distributed controllers (ID: AWS_RECONCILE_TIME_EXPECTATION)
+ - Troubleshooting should be non-destructive: check status, events, logs, describe resources - NEVER delete resources to "force" reconciliation (ID: AWS_TROUBLESHOOT_NON_DESTRUCTIVE)
+ - If a resource appears stuck: (1) Check events with get_k8s_events, (2) Check logs, (3) Verify IAM permissions, (4) Wait longer - do NOT delete (ID: AWS_TROUBLESHOOT_SEQUENCE)
+ - Resource deletion breaks status propagation in resource graphs (Kro), disrupts GitOps sync (ArgoCD), and can orphan AWS resources - avoid unless explicitly cleaning up (ID: AWS_DELETE_CONSEQUENCES)
+
  ## Priority

  Critical
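
The wait-instead-of-delete guidance added in this file can be sketched as a small polling helper. This is an illustrative sketch and not part of the commit: `wait_until` and `is_cluster_active` are hypothetical names, and the `ekscluster` resource name, cluster name, and namespace in the commented usage are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this commit): re-run a read-only check
# until it succeeds or the timeout elapses. It never modifies or deletes anything.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  local timeout_s="$1" interval_s="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout_s ))
  while true; do
    if "$@"; then
      return 0   # the check passed
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1   # still failing when the deadline passed
    fi
    sleep "$interval_s"
  done
}

# Example usage (assumed resource and namespace names; requires cluster access):
# is_cluster_active() {
#   [ "$(kubectl get ekscluster "$1" -n "$2" -o jsonpath='{.status.clusterState}')" = "ACTIVE" ]
# }
# wait_until 1800 60 is_cluster_active my-cluster my-namespace
```

A 1800-second timeout with a 60-second interval corresponds to the upper end of the 15-30 minute window the rules mention. If the helper returns non-zero, the next step under these rules would be to inspect events and logs, not to delete the resource.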

.kiro/steering/eks-best-practices.md

Lines changed: 9 additions & 0 deletions
@@ -15,6 +15,15 @@ Ensures all EKS and Kubernetes operations leverage the EKS MCP server capabiliti
  - For CloudWatch metrics and logs, prefer `get_cloudwatch_logs` and `get_cloudwatch_metrics` MCP tools over AWS CLI commands (ID: EKS_MCP_MONITORING)
  - When applying YAML manifests, use the `apply_yaml` MCP tool to benefit from validation and error handling (ID: EKS_MCP_APPLY)

+ ### Critical: Never Delete Resources During Reconciliation
+
+ - **NEVER delete Kubernetes resources (kubectl delete, manage_k8s_resource with delete operation) to "force" reconciliation or troubleshoot** (ID: EKS_NO_DELETE_FORCE_RECONCILE)
+ - Controllers (including EKS Capabilities-managed ACK, Kro, ArgoCD) will eventually reconcile - be patient and wait (ID: EKS_WAIT_FOR_RECONCILE)
+ - Deleting resources is **destructive** and breaks status propagation chains, especially with Kro resource graphs that depend on child resource status (ID: EKS_DELETE_BREAKS_STATUS)
+ - After making changes like adding IAMRoleSelectors, wait 15-30 minutes for reconciliation before investigating further (ID: EKS_RECONCILE_WAIT_TIME)
+ - If a resource appears stuck, investigate via events, logs, and status conditions - do NOT delete unless explicitly required for cleanup (ID: EKS_INVESTIGATE_DONT_DELETE)
+ - The only acceptable deletions are: (1) Removing finalizers from stuck resources during controlled cleanup, (2) Cleaning up test resources at end of validation, (3) Explicitly requested by user (ID: EKS_DELETE_ONLY_WHEN_REQUIRED)
+
  ### ArgoCD CLI Authentication

  - If the `argocd` CLI returns authentication errors or token expired messages, run the `argocd-refresh-token` bash function to obtain a new 12-hour token (ID: EKS_ARGOCD_REFRESH_TOKEN)

gitops/addons/charts/kro/resource-groups/manifests/eks/rg-eks-vpc.yaml

Lines changed: 5 additions & 6 deletions
@@ -3,9 +3,8 @@ kind: ResourceGraphDefinition
  metadata:
  name: eksclusterwithvpc.kro.run
  annotations:
- argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
+ argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true,Replace=false
  argocd.argoproj.io/sync-wave: "-1"
- argocd.argoproj.io/sync-options: Replace=false
  spec:
  schema:
  apiVersion: v1alpha1
@@ -65,9 +64,9 @@ spec:
  enable_kyverno: string | default="true"
  enable_kyverno_policies: string | default="true"
  enable_kyverno_policy_reporter: string | default="true"
- enable_cni_metrics_helper: string | default="false"
- enable_prometheus_node_exporter: string | default="false"
- enable_kube_state_metrics: string | default="false"
+ enable_cni_metrics_helper: string | default="true"
+ enable_prometheus_node_exporter: string | default="true"
+ enable_kube_state_metrics: string | default="true"
  enable_opentelemetry_operator: string | default="false"
  enable_cert_manager: string | default="true"
  enable_aws_efs_csi_driver: string | default="true"
@@ -131,7 +130,7 @@ spec:
  privateSubnet2Cidr: ${schema.spec.cidr.privateSubnet2Cidr}
  - id: eks
  readyWhen:
- - ${eks.status.state == "ACTIVE"}
+ - ${eks.status.clusterState == "ACTIVE"}
  template:
  apiVersion: kro.run/v1alpha1
  kind: EksCluster
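
The first hunk in this file folds two `argocd.argoproj.io/sync-options` entries into a single comma-separated value. Duplicate keys within one YAML mapping are not valid, and parsers typically either reject them or keep only one of the values, so one of the two options was probably not taking effect before the change. The sketch below shows one way to spot such duplicates. It is not part of the commit: `count_yaml_key` is a hypothetical helper, and the indentation in the sample strings is assumed because the diff view does not preserve it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this commit): count how many lines of the
# YAML text on stdin contain the given key followed by a colon.
count_yaml_key() {
  grep -c -F -- "$1:" || true
}

# Annotation blocks before and after the change (indentation assumed).
old_annotations='annotations:
  argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
  argocd.argoproj.io/sync-wave: "-1"
  argocd.argoproj.io/sync-options: Replace=false'

new_annotations='annotations:
  argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true,Replace=false
  argocd.argoproj.io/sync-wave: "-1"'

old_count="$(printf '%s\n' "$old_annotations" | count_yaml_key 'argocd.argoproj.io/sync-options')"
new_count="$(printf '%s\n' "$new_annotations" | count_yaml_key 'argocd.argoproj.io/sync-options')"
echo "before: $old_count, after: $new_count"   # prints: before: 2, after: 1
```

A plain substring count is a rough check only: it does not understand YAML nesting, so a count above one is a prompt to look at the file, not proof of a duplicate key.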

gitops/addons/charts/multi-acct/templates/eks-capabilities-rbac.yaml

Lines changed: 4 additions & 0 deletions
@@ -51,6 +51,10 @@ rules:
  - apiGroups: ["kro.run"]
  resources: ["*"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
+ # ACK Secrets Manager
+ - apiGroups: ["secretsmanager.services.k8s.aws"]
+ resources: ["*"]
+ verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRoleBinding

platform/backstage/templates/eks-cluster-template/template.yaml

Lines changed: 9 additions & 7 deletions
@@ -85,10 +85,7 @@ spec:
  type: string
  default: platform-on-eks-workshop
  ui:widget: hidden
- ingressDomainName:
- type: string
- default: <ingress_domain_name>
- ui:widget: hidden
+
  - title: Git Repository URLs
  properties:
  addonsRepoUrl:
@@ -183,6 +180,11 @@ spec:
  type: boolean
  description: Enable Cert Manager addon
  default: true
+ enableAwsEfsCsiDriver:
+ title: Enable AWS EFS CSI Driver
+ type: boolean
+ description: Enable AWS EFS CSI Driver addon
+ default: false
  steps:
  - id: fetchSystem
  name: Fetch System
@@ -223,11 +225,11 @@ spec:
  - url: https://${{ steps['fetchSystem'].output.entity.spec.argocd_hostname }}/applications/argocd/clusters
  title: ArgoCD App URL
  icon: externalLink
- - url: https://${{ steps['fetchSystem'].output.entity.spec.gitlab_hostname }}/${{ steps['fetchSystem'].output.entity.spec.gituser }}/${{ parameters.repoName }}/-/blob/main/fleet/kro-values/tenants/tenant1/kro-clusters/values.yaml
+ - url: https://${{ steps['fetchSystem'].output.entity.spec.gitlab_hostname }}/${{ steps['fetchSystem'].output.entity.spec.gituser }}/${{ parameters.repoName }}/-/blob/main/gitops/fleet/kro-values/tenants/${{ parameters.tenant }}/kro-clusters/values.yaml
  title: Git Repo URL
  icon: github
  spec:
- owner: guests
+ owner: user:guest
  lifecycle: experimental
  type: service
  # - id: mergeConfig
@@ -272,7 +274,7 @@ spec:
  - Added Backstage component definition
  sourcePath: ./repo
  targetPath: .
- branchName: cluster-${{ parameters.clusterName }}-${{ parameters.environment }}
+ branchName: cluster-${{ parameters.clusterName }}
  removeSourceBranch: true
  commitAction: auto
  - id: register

platform/infra/terraform/scripts/utils.sh

Lines changed: 5 additions & 1 deletion
@@ -430,7 +430,11 @@ update_backstage_defaults() {
  CATALOG_INFO_PATH="${GIT_ROOT_PATH}/platform/backstage/templates/catalog-info.yaml"

  # Get admin role name from current AWS context
- ADMIN_ROLE_NAME=$(aws sts get-caller-identity --query 'Arn' --output text | sed 's|.*assumed-role/||' | sed 's|/.*||')
+ # Use WSParticipantRole (Workshop Studio standard role). Fallback to caller identity for self-paced deployments.
+ ADMIN_ROLE_NAME=$(aws iam list-roles --query 'Roles[?contains(RoleName,`WSParticipantRole`)].RoleName' --output text 2>/dev/null)
+ if [[ -z "$ADMIN_ROLE_NAME" ]]; then
+ ADMIN_ROLE_NAME=$(aws sts get-caller-identity --query 'Arn' --output text | sed 's|.*assumed-role/||' | sed 's|/.*||')
+ fi

  # Get ArgoCD URL from EKS capability if available
  ARGOCD_SERVER_URL=$(aws eks describe-capability --cluster-name ${CLUSTER_NAME:-${RESOURCE_PREFIX}-hub} --capability-name argocd --query 'capability.configuration.argoCd.serverUrl' --output text 2>/dev/null || echo "")
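
The fallback branch keeps the original `sed` pipeline, which derives the role name from the caller's ARN. The sketch below runs that pipeline on sample ARNs without calling AWS. It is illustrative only: `role_name_from_arn` is a hypothetical wrapper, `123456789012` is a placeholder account ID, and the session and user names are made up.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of this commit) around the sed pipeline
# used in the fallback branch of update_backstage_defaults.
role_name_from_arn() {
  printf '%s\n' "$1" | sed 's|.*assumed-role/||' | sed 's|/.*||'
}

# Placeholder ARNs; no real account data.
assumed_role_arn='arn:aws:sts::123456789012:assumed-role/WSParticipantRole/example-session'
user_arn='arn:aws:iam::123456789012:user/example-user'

role_name_from_arn "$assumed_role_arn"   # prints: WSParticipantRole
role_name_from_arn "$user_arn"           # prints: arn:aws:iam::123456789012:user
```

The second call shows a limitation of the pipeline: it yields a usable role name only when the caller is an assumed role. For an IAM user ARN the first `sed` matches nothing, so the result is a truncated ARN.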

scripts/validation/backstage-auth.sh

Lines changed: 1 addition & 1 deletion
@@ -89,7 +89,7 @@ backstage_scaffolder() {
  }

  # When sourced, export BS_TOKEN for direct use
- if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
+ if [[ "${BASH_SOURCE[0]:-}" == "${0}" ]] || [[ "${ZSH_EVAL_CONTEXT:-}" == "toplevel" ]]; then
  backstage_get_token
  else
  export BS_TOKEN
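
The updated guard distinguishes a script that is executed directly from one that is sourced. The sketch below exercises the same condition under bash by writing it to a temporary file and loading that file both ways. It is not part of the commit: the temporary script and the variable names are illustrative, and the zsh branch (`ZSH_EVAL_CONTEXT`) is not exercised here.

```shell
#!/usr/bin/env bash
# Write a throwaway script that contains only the guard from the diff.
demo_script="$(mktemp)"
cat > "$demo_script" <<'EOF'
if [[ "${BASH_SOURCE[0]:-}" == "${0}" ]] || [[ "${ZSH_EVAL_CONTEXT:-}" == "toplevel" ]]; then
  echo "executed"
else
  echo "sourced"
fi
EOF

# Run directly: $0 and BASH_SOURCE[0] are both the script path.
executed_result="$(bash "$demo_script")"

# Source from another shell: $0 is "_" while BASH_SOURCE[0] is the script path.
sourced_result="$(bash -c 'source "$1"' _ "$demo_script")"

rm -f "$demo_script"
echo "direct run: $executed_result, sourced: $sourced_result"
```

The `:-` defaults in the guard keep the test from failing under `set -u` in shells where `BASH_SOURCE` or `ZSH_EVAL_CONTEXT` is not set.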
