feat: move GitLab from EKS to EC2 docker-compose #755

Draft
allamand wants to merge 201 commits into feature/cloudfront-on-agent-platform from feature/cloudfront-on-agent-platform-without-gitlab

Conversation

@allamand

Contributor

Moves GitLab CE out of the EKS hub cluster onto the IDE EC2 instance via docker-compose, exposed through a CDK-managed NLB + CloudFront distribution.

Changes

  • Disable gitlab ArgoCD addon (enabled-addons, hub-config, platform.yaml)
  • Remove GitLab Keycloak SSO client registration (no SSO for EC2 GitLab)
  • Remove gitlab-nlb + gitlab-distribution from Taskfile.cloudfront.yaml (CloudFront now CDK-managed in platform-engineering-on-eks)
  • Add gitlab:init-ec2 task: wait for GitLab CE readiness, create root token, user1, repos via GitLab API (replaces kubectl exec into the k8s pod)
  • Replace k8s Job wait in clone-repos with CloudFront readiness poll
  • GITLAB_DOMAIN_INT now uses EC2 private IP (private/gitlab-ec2-private-ip) for in-cluster ArgoCD git access
  • Remove git_token from seed-secrets (seeded by CDK deploy-time Lambda)
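
The readiness wait in gitlab:init-ec2 could be sketched roughly as below. This is not taken from the PR: the helper name `retry_until`, the attempt count, and the delay are hypothetical, and the commented curl probe against GitLab's `/-/readiness` endpoint is only one plausible check.

```shell
# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is exhausted.
# (Hypothetical helper, not part of the PR.)
retry_until() {
  max="$1"; delay="$2"; shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Example probe (assumes GITLAB_URL is set by the caller):
# retry_until 60 10 curl -fsS -o /dev/null "$GITLAB_URL/-/readiness"
```

Passing the probe as arguments keeps the loop reusable for the CloudFront readiness poll in clone-repos as well.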

Companion MR

platform-engineering-on-eks: feat/pr-709-kind-crossplane-gitlab-on-ec2 → feat/pr-709-kind-crossplane

Closes #754

- Disable GitLab ArgoCD addon (enabled-addons, hub-config, platform.yaml registry)
- Remove Keycloak SSO client for GitLab
- Remove gitlab-nlb/gitlab-distribution Taskfile.cloudfront.yaml tasks
- Add gitlab:init-ec2 task: wait for CE readiness, create root token, user1, repos
- Replace k8s Job wait in clone-repos with CF readiness poll
- GITLAB_DOMAIN_INT uses EC2 private IP for in-cluster ArgoCD git access
- Remove git_token from seed-secrets (seeded by CDK at deploy time)

Refs #754
allamand force-pushed the feature/cloudfront-on-agent-platform-without-gitlab branch from f5ad84b to ea0b1b1 on June 26, 2026 09:24
allamand added 23 commits June 26, 2026 11:28
Added: urls, hub:set-overlay-repo, hub:restart-langfuse, hub:wait-for-full-sync,
secrets-manager:seed-secrets, secrets-manager:seed-observability,
hub:create-mgmt-roles, hub:restart-identity-pods, hub:update,
spokes:enable-crossplane/kro, spokes:create-capabilities,
spokes:disable-crossplane/kro/all, spokes:seed-provider-identity

Removed: idc:setup from install task (moved to workshop/Taskfile.yaml)
- Add parallel install phases (phase1/phase2) to kind-kro-ack install task
- Move gitlab:init-ec2 + gitlab:clone-repos to workshop/Taskfile.yaml
- crossplane-system refs in copied tasks left intentional (Crossplane addon on hub EKS)
- Restore argocd login block in ssm-setup-ide-logs.sh
- Restore progressive-app image_name/service name (rollout-demo → progressive-app)
RGDs ekscluster.kro.run and eksclusterwithvpc.kro.run were Inactive because:
1. ACK SecretsManager controller was missing (rg-eks.yaml references secretsmanager.services.k8s.aws/v1alpha1)
2. ESO was not installed on the bootstrap kind cluster (rg-eks.yaml references external-secrets.io/v1)

Fixes:
- Add ACK_SECRETSMANAGER_VERSION=1.3.1 and install it in ack:install
- Add kro:install-eso-bootstrap task (idempotent) before kro:apply-rgds
…dd argoCdCapabilityRoleArn

- Remove spec.argocdCapability block (not in EksclusterWithVpc RGD schema, causes strict decode error)
- Add missing argoCdCapabilityRoleArn field (required by RGD)
- Remove now-unused IDC sed substitution lines
…lity

Port from feature/platform-cluster-kro-ack:
- rg-eks.yaml: add argocdCapability schema object, replace accessEntryArgoCdCapability
  with conditional argocdCapabilityRole + argocdCapability (EKS Capability with IDC) +
  argocdCapabilityAccessEntry (all guarded by includeWhen enabled==true)
- rg-eks-vpc.yaml: add argocdCapability schema object, pass through to nested EksCluster
- hub:claim: restore argocdCapability block with IDC sed substitutions
Deletes eksclusters.kro.run and eksclusterwithvpcs.kro.run CRDs on the
bootstrap kind cluster. Needed when RGD schema changes remove fields
(KRO breaking-change protection blocks the update otherwise).
Run: task kind-kro-ack:kro:reset-crds, then task install.
Hardcoded 10.0.x.0/24 subnets break when vpcCidr is not 10.0.0.0/16.
Add HUB_VPC_PREFIX var (first two octets of HUB_VPC_CIDR) and use it
in hub:claim subnet CIDRs.
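
A minimal sketch of deriving such a prefix, assuming a /16 VPC CIDR as in the commit's example. The function name `vpc_prefix` and the fallback default are illustrative, not from the Taskfile.

```shell
# Print the first two octets of a VPC CIDR, e.g. 172.31.0.0/16 -> 172.31.
# Only meaningful for /16 VPCs, where the last two octets are free for subnets.
vpc_prefix() {
  printf '%s\n' "$1" | cut -d. -f1-2
}

HUB_VPC_CIDR="${HUB_VPC_CIDR:-10.0.0.0/16}"   # default value is an assumption
HUB_VPC_PREFIX="$(vpc_prefix "$HUB_VPC_CIDR")"

# Subnet CIDRs then follow the VPC instead of a hardcoded 10.0.x.0/24.
echo "${HUB_VPC_PREFIX}.1.0/24"
```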
…xist

When the EKS ArgoCD Capability is used, ArgoCD CRDs are already present
on the hub cluster. Helm install then fails with 'CRD already exists'.
Add a second status check: skip if applications.argoproj.io CRD exists.
allamand added 13 commits July 1, 2026 21:43
…pendency)

Switch spoke-dev from spokes:enable-crossplane + create-capabilities to
spokes:enable-kro, matching spoke-prod. Both spokes are now provisioned
via the KRO EksclusterWithVpc RGD + ACK controllers.

This eliminates the Crossplane provider credential chicken-and-egg for
spoke provisioning entirely — the hub's Crossplane providers are still
used for pod-identity/IAM on the hub itself, but spoke EKS clusters no
longer depend on them.

Also makes the environment label dynamic (derived from cluster name) so
spoke-dev correctly gets 'environment: dev' instead of hardcoded 'prod'.
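
The commit does not say how the label is derived. One plausible sketch, assuming the environment is whatever follows the last hyphen in the cluster name (the function name is hypothetical):

```shell
# Derive the environment label from a cluster name by taking the text after
# the last hyphen: spoke-dev -> dev, peeks-spoke-prod -> prod.
environment_for_cluster() {
  printf '%s\n' "${1##*-}"
}

environment_for_cluster "spoke-dev"
```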
The task failed immediately (exit 1) because under 'set -e -o pipefail'
(the global Taskfile setting), the kubectl jsonpath that queries unhealthy
providers returns non-zero when no items match (empty result). Combined
with pipefail, this propagated through the pipe to wc and killed the
script on the first loop iteration — even when all providers were healthy.

Fix:
- Split kubectl|wc pipeline: kubectl with '|| true', then wc separately
- Replace '&& exit 0' conditional chain with if/then/break (chains with
  set -e are a trap: any false part returns non-zero → errexit fires)
- Use 'break' instead of 'exit 0' so cleanup (rm kubeconfig) still runs
- Add '|| true' to the progress printf line (same && chain issue)
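
The shape of that fix can be sketched as follows. `query_unhealthy` is a hypothetical stand-in for the kubectl jsonpath query (the real command is not shown in the commit), and the attempt limit and messages are arbitrary.

```shell
#!/usr/bin/env bash
set -e -o pipefail   # mirrors the global Taskfile setting named in the commit

# Hypothetical stand-in for the kubectl query: prints one line per unhealthy
# provider and, as the commit describes, exits non-zero when none match.
query_unhealthy() { return 1; }

count_unhealthy() {
  local out
  out="$(query_unhealthy || true)"             # tolerate the non-zero exit
  if [ -z "$out" ]; then
    echo 0
  else
    printf '%s\n' "$out" | wc -l | tr -d ' '   # count in a separate step
  fi
}

wait_for_providers() {
  local attempt
  for attempt in 1 2 3; do                     # attempt limit is arbitrary here
    if [ "$(count_unhealthy)" -eq 0 ]; then
      echo "all providers healthy"
      break                                    # not 'exit 0': fall through to cleanup
    fi
    sleep 1
  done
  echo "cleanup runs here"
}
```

Using `break` rather than `exit 0` means the line after the loop always executes, which is the property the commit relies on for its cleanup step.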
…-for-providers

The rm -f at the end of hub:wait-for-providers deletes a kubeconfig that
the very next task immediately recreates. It adds no value (the file
doesn't go stale within a single install run) and can cause failures if
a subsequent task expects it to exist.
…served

Wait for applicationsets/applications.argoproj.io CRDs to be Established
before applying root-appset.yaml, to avoid 'no matches for kind
ApplicationSet in version argoproj.io/v1alpha1' when the ArgoCD EKS
Capability has not yet reconciled after the EksCluster becomes ACTIVE.
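
A sketch of such a wait, assuming `kubectl wait --for=condition=Established` is the mechanism. The function name, retry count, timeout, and sleep interval are illustrative, and the actual task may differ.

```shell
# Wait until each named CRD reports condition Established.
# 'kubectl wait' fails immediately if the CRD does not exist yet, so retry.
# Usage: wait_for_crds_established MAX_ATTEMPTS CRD [CRD...]
wait_for_crds_established() {
  attempts="$1"; shift
  for crd in "$@"; do
    i=1
    until kubectl wait --for=condition=Established "crd/$crd" --timeout=30s >/dev/null 2>&1; do
      if [ "$i" -ge "$attempts" ]; then
        echo "timed out waiting for $crd" >&2
        return 1
      fi
      i=$((i + 1))
      sleep 5
    done
  done
}

# wait_for_crds_established 20 applicationsets.argoproj.io applications.argoproj.io
```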
…dfrontDomain prereq gate

- kind-crossplane: run cloudfront:sync-domain right after setup-exposure and
  before hub:seed (mirroring kind-kro-ack) so cloudfrontDomain is written to
  config.local.yaml + private/cloudfront-domain before hub:seed reads it. Covers
  install retries where setup-exposure's status guard skips because the
  <hub>-platform distribution already exists, which previously left hub:seed with
  an empty ingress_domain_name.
- workshop:pre-install: cloudfrontDomain (platform CloudFront) is created by
  hub-distribution during install, not a prerequisite. Remove the hard exit 1 on
  empty cloudfrontDomain and correct messaging (create-config.sh sets
  cloudfront.gitlabDomain, not cloudfrontDomain).
…rs values

phase2 ran spokes:enable-kro for spoke-dev and spoke-prod concurrently. Both
commit+push to the same fleet-config main branch and write the same shared files,
so the second push was rejected, its retry rebased two add/add commits into an
unresolvable conflict, left a detached HEAD, and failed the whole task install
(exit 128).

- install:phase2-parallel now runs spoke-dev then spoke-prod sequentially (the
  tasks only do git clone/commit/push; EKS provisioning is async via ArgoCD/KRO,
  so serialising costs ~seconds).
- spokes:enable-kro now deep-merges its cluster block into
  gitops/fleet/.../kro-clusters/values.yaml via 'yq . *= load(...)' instead of a
  full 'cat >' overwrite that dropped the sibling spoke's block (last-writer-wins
  data loss, and the source of the add/add rebase conflict).
…p config.local.yaml storage

The platform CloudFront domain is created mid-install by cloudfront:hub-distribution,
but consumers read it via taskfile-level (global) vars that go-task evaluates ONCE at
load time — before the distribution exists — so the value was frozen empty. That is
why idc:configure got an empty --keycloak-dns and 'urls' printed https:/// even though
config.local.yaml and private/cloudfront-domain both held the domain.

Make private/cloudfront-domain the single source of truth (runtime artifact), resolved
lazily in each consumer's own task-level vars with an AWS 'Comment==<hub>-platform'
fallback. Stop storing/reading it in config.local.yaml entirely.

common/Taskfile.cloudfront.yaml:
- hub-distribution writes only private/cloudfront-domain (drop yq -i config write)
- sync-domain reads private -> AWS (drop config read), writes only private

workshop/Taskfile.yaml (shared by both providers):
- idc:configure KEYCLOAK_DNS is now a self-contained task-level resolver (private -> AWS)
- setup-env CF_DOMAIN resolves lazily; gitlab domain from gitlabDomain -> private file
- top-level CLOUDFRONT_DOMAIN reads private only (informational/pre-install); messaging updated

kind-crossplane + kind-kro-ack:
- hub:seed ingress_domain_name/exposure_mode use a task-level lazy CLOUDFRONT_DOMAIN
- urls fallback resolves inline (private -> AWS) instead of the frozen global
- crossplane hub:update-ingress-domain reads private -> AWS
- top-level CLOUDFRONT_DOMAIN reads private only
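
The "private -> AWS" resolution order described above might look like the sketch below when evaluated inside a task at run time. Both function names are hypothetical. The AWS call uses the standard `aws cloudfront list-distributions` command with a JMESPath filter on `Comment`, which is an assumption about how the fallback is implemented.

```shell
# Hypothetical AWS fallback: find the distribution whose Comment is
# "<hub>-platform". Prints nothing if the lookup fails or finds no match.
lookup_domain_in_aws() {
  hub="$1"
  domain="$(aws cloudfront list-distributions \
    --query "DistributionList.Items[?Comment=='${hub}-platform'].DomainName | [0]" \
    --output text 2>/dev/null || true)"
  if [ "$domain" = "None" ]; then
    domain=""
  fi
  printf '%s' "$domain"
}

# Resolve lazily: prefer the runtime artifact, fall back to AWS.
# Usage: resolve_cloudfront_domain private/cloudfront-domain <hub-name>
resolve_cloudfront_domain() {
  file="$1"; hub="$2"
  if [ -s "$file" ]; then
    tr -d '[:space:]' < "$file"
  else
    lookup_domain_in_aws "$hub"
  fi
}
```

Because the function runs only when a consumer calls it, it sees the domain written mid-install, unlike a value captured when the Taskfile is loaded.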
…nt mode

The platform ALB is pre-created as scheme=internal (CloudFront VPC Origin backend),
but the platform IngressClassParams hardcoded scheme=internet-facing. ALB scheme is
immutable, so when the AWS LBC adopts the ALB by loadBalancerName it refuses to
reconcile on the scheme conflict — no listener rules are attached, every platform
ingress (keycloak, backstage, grafana, ...) gets no address, and all requests hit the
ALB default 404 action. This blocked idc:configure (Keycloak SAML descriptor 404) and
made all platform URLs return 404, for both kind-crossplane and kind-kro-ack.

- ingress-class-alb chart: scheme is now templated ({{ .Values.scheme | default "internet-facing" }})
- core.yaml appset: pass scheme=internal when exposure_mode=cloudfront, internet-facing otherwise
…s to EKS RGD

The kro-ack EksCluster RGD created PodIdentityAssociations for external-secrets,
external-dns, adot, policy-reporter, cni-metrics-helper and cloudwatch-agent, but
NOT for two service accounts that make AWS calls:

- kube-system/aws-load-balancer-controller-sa: without creds the LBC can't build
  the ingress model (IMDS fallback fails: 'no EC2 IMDS role found'), attaches no
  listener rules to the platform ALB, so every platform ingress (keycloak,
  backstage, grafana, argo-workflows) gets no address and the ALB serves its
  default 404 — which blocked idc:configure (Keycloak SAML descriptor 404).
- keycloak/keycloak-config: the config job runs 'aws secretsmanager create/
  put-secret-value' to publish keycloak-clients; without creds ESO never syncs it
  and Backstage/Argo SSO break.

Adds an ACK Policy + Role (pods.eks.amazonaws.com trust) + PodIdentityAssociation
for each, modeled on the existing external-dns chain. LBC policy reuses the
canonical AWS LBC IAM policy; keycloak-config gets least-privilege Secrets Manager
permissions. (Crossplane provider already covers these via crossplane-pod-identity.)
…amp hub provider label

On the kro-ack provider, pod identities are created by the KRO EksCluster RGD (ACK).
Crossplane's own providers are not bootstrapped on the kro-ack hub, so the
crossplane-pod-identity (pod_identities) app there produces permanently-Degraded
CRs (SYNCED=False) and is a latent 409 conflict with the ACK-created associations.

- core.yaml: pod_identities selector now excludes clusters labelled provider=kro-ack
  (NotIn also matches clusters with no provider label, so crossplane/legacy clusters
  — including crossplane's KRO-provisioned spoke-prod, which needs the crossplane
  provider bootstrap identities — are unaffected).
- kind-kro-ack hub:seed writes addons='{"provider":"kro-ack"}' into peeks-hub/config;
  the fleet-secret ExternalSecret renders the addons key as labels, so the kro-ack hub
  cluster secret gets provider=kro-ack and pod_identities is not generated for it.

Note: kro-ack spoke secrets still need provider=kro-ack wired through
spokes:enable-kro -> EksclusterWithVpc/EksCluster schema -> argocdSecret label (follow-up).
… pod_identities

Threads a provider field from spokes:enable-kro through the KRO claim to the spoke
argocd cluster secret label, so kro-ack spokes (like the hub) are excluded from the
crossplane pod_identities app (provider NotIn [kro-ack]).

- kind-kro-ack spokes:enable-kro writes provider: kro-ack into the kro-clusters values
- kro-clusters chart maps $cluster.provider -> EksclusterWithVpc.spec.provider
- rg-eks-vpc EksclusterWithVpc schema + passthrough to nested EksCluster
- rg-eks EksCluster schema + argocdSecret label provider: ${schema.spec.provider}

Default is empty, so crossplane-provisioned clusters (and crossplane's KRO-provisioned
spoke-prod, whose spokes:enable-kro does not set it) keep provider unset -> NotIn matches
-> pod_identities stays enabled for them (preserves crossplane provider bootstrap identities).
…GD (temporary bridge)

Until devlake (and other data resources) are migrated off Crossplane, kro-ack still
needs Crossplane working on the hub. But on kro-ack the Crossplane AWS providers have
no credentials — provider-aws-iam/eks pod identities are themselves crossplane CRs
(chicken-and-egg) — so all downstream crossplane roles/PIAs (amp/rds/grafana/devlake)
stay SYNCED=False and devlake RDS / AMP / Grafana break.

Seed the two ROOT providers' pod identities via ACK in the EksCluster RGD (which
provisions both hub and spokes on kro-ack). Once provider-aws-iam/eks have creds,
Crossplane reconciles the rest itself.

- rg-eks.yaml: add crossplaneProviderRole (AdministratorAccess, pods.eks trust) +
  PodIdentityAssociations for crossplane-system/provider-aws-iam and provider-aws-eks.
  Gated includeWhen provider=='kro-ack' && enable_crossplane_aws=='true' so it does
  NOT run on crossplane-provisioned clusters (incl. crossplane's KRO-provisioned
  spoke-prod, where crossplane-pod-identity already creates these -> avoids 409).
- platform-cluster-kro hub claim: set provider=kro-ack so the gate fires on the hub.

TEMPORARY: remove these resources once the devlake->kro-ack migration lands.
…created ALB (curl 000)

Document the failure where platform URLs time out with curl 000 in cloudfront mode
because the CloudFront VPC origin is bound to a deleted ALB ARN (the LBC delete-recreated
the ALB, classically due to an IngressClassParams scheme mismatch). Includes the
direct-ALB isolation test, the VPC-origin-ARN vs current-ALB-ARN check, the CloudTrail
DeleteLoadBalancer lookup, and the fix (scheme=internal to stop churn + recreate/swap the
VPC origin to re-point CloudFront).
allamand added 16 commits July 2, 2026 21:01
…s ALB in place (no churn)

Root-cause hardening for the stale-VPC-origin issue: the LBC recreated the platform ALB
when its subnet set didn't match what create-alb built, orphaning the CloudFront VPC origin.

- create-alb now selects PRIVATE subnets by internal-elb tag then by MapPublicIpOnLaunch==false
  (instead of a fragile *private* name tag), and tags the chosen subnets
  kubernetes.io/role/internal-elb=1 so the AWS LBC discovers the SAME set and adopts the ALB in
  place instead of calling SetSubnets / recreating it.
- steering/troubleshooting.md: document the implemented prevention (scheme=internal + subnet
  tagging) and the proposed-but-NOT-implemented cloudfront:sync-vpc-origin reconcile as future
  hardening (design + wiring), to add only if ALB churn recurs.
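
The subnet selection order described in the first bullet can be sketched as below. The function names are hypothetical and the real create-alb task may structure this differently. The filters (`tag:<key>`, `map-public-ip-on-launch`) and `create-tags` are standard AWS CLI EC2 features.

```shell
# Print private subnet IDs for a VPC, space-separated.
# First choice: subnets already tagged kubernetes.io/role/internal-elb=1.
# Fallback: subnets with MapPublicIpOnLaunch == false.
select_private_subnets() {
  vpc_id="$1"
  ids="$(aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$vpc_id" \
              "Name=tag:kubernetes.io/role/internal-elb,Values=1" \
    --query 'Subnets[].SubnetId' --output text | tr '\t' ' ')"
  if [ -z "$ids" ] || [ "$ids" = "None" ]; then
    ids="$(aws ec2 describe-subnets \
      --filters "Name=vpc-id,Values=$vpc_id" \
                "Name=map-public-ip-on-launch,Values=false" \
      --query 'Subnets[].SubnetId' --output text | tr '\t' ' ')"
  fi
  printf '%s\n' "$ids"
}

# Tag the chosen subnets so the AWS LBC discovers the same set.
tag_internal_elb() {
  # "$@" is the list of subnet IDs
  aws ec2 create-tags --resources "$@" \
    --tags Key=kubernetes.io/role/internal-elb,Value=1
}
```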
…kfile (not RGD)

The hub's crossplane iam/eks providers had no AWS credentials, so crossplane-base's
provider roles/PIAs (amp/rds/grafana/devlake) never reconciled -> no AMP/Grafana/RDS ->
crossplane-base, observability-aws, grafana-dashboards, devlake all Degraded.

Root cause: crossplane-base declares the downstream provider PIAs but iam/eks are
createIdentity:false (chicken-and-egg), and nothing on the kro-ack hub seeds the first
credential. The prior RGD-based bootstrap (6f8fb2c) can't fix the hub because the hub's
EksclusterWithVpc lives on the transient kind bootstrap cluster and is never reconciled by
the hub-EKS RGD, so it can't self-heal an already-running hub.

Fix (mirrors the crossplane/terraform flows): new hub:bootstrap-crossplane-identity task,
run after hub:wait-for-sync, that idempotently creates the peeks-hub-crossplane-provider
admin role + provider-aws-iam/eks pod identities, then restarts the crossplane provider
pods (label pkg.crossplane.io/revision) twice: pass 1 gives iam/eks their credentials so they
reconcile the downstream provider roles/PIAs, and pass 2 gives those providers theirs. Idempotent via a
status guard (skips once CrossplaneAMPProviderRole exists). RGD bootstrap left in place for
spokes.
…moved in bb387b0

The previous commit's strReplace anchored on the repeated 'rm -f private/hub-kubeconfig'
boilerplate and inadvertently dropped the spokes:create-capabilities task. Restore it
verbatim; the hub:bootstrap-crossplane-identity task and its install wiring are unchanged.
Task count back to 55 (was erroneously net-neutral at 54).
… + async spokes + tolerant spoke gate

Replace the imperative EKS Capability creation with declarative Crossplane
Capability MRs, make Crossplane spokes provision fully async, and add a
non-fatal final gate that verifies spokes before install completes.

- Bump provider-aws-eks (+family/iam/ec2/rds/dynamodb/amp/grafana) v2.5.3->v2.6.1
  (registry, kind-crossplane bootstrap, terraform helm.tf). v2.6.1 serves a
  cluster-scoped eks.aws.upbound.io/v1beta1 Capability, so it composes into the
  existing legacy composition with no v2/namespaced migration.
- platform-cluster XRD/composition: native Capability MRs (kro/ack/argocd) +
  capability IAM roles, matching the kro-ack RGD (capabilities.eks.amazonaws.com
  trust; KRO=AmazonEKSClusterPolicy; ACK=inline AssumeWorkloadRoles+ManageIRSARoles),
  CEL-gated on spec.capabilities.<type>.enabled; deletePropagationPolicy RETAIN.
  Adds spec.accountId + spec.capabilities.* to the XRD and claim chart.
- kind-crossplane: spokes:enable-crossplane sets capabilities+accountId; drop the
  imperative spokes:create-capabilities and its ~19min foreground EKS-ACTIVE wait
  (spokes now async). Hub claim sets kro/ack/argocd(+IDC); hub:seed waits on the
  Capability MRs instead of the create-capability.yaml Job. Delete
  argocd:capability / argocd:delete-capability tasks + create/delete-capability.yaml
  (per ITERATION_PLAN.md).
- kind-kro-ack: remove the dead, unreferenced spokes:create-capabilities copy.
- workshop:install: add tolerant, non-fatal wait-for-spokes gate before setup-env
  (EKS ACTIVE -> ArgoCD cluster secrets -> spoke apps synced within tolerance,
  with backoff nudge), overlapping the ray/IDC/model tail.
…d argocd:delete-capability call in destroy

- README/ITERATION_PLAN/steering: reflect native Capability MRs instead of the
  create-capability.yaml Job (mark ITERATION_PLAN item 10 done).
- destroy Phase 3: remove the now-broken '- task: argocd:delete-capability' reference
  (task deleted in prior commit). The AWS-API fallback already force-deletes
  capabilities before the cluster delete; capability IAM role cleanup retained.
…estart/wait)

hub:bootstrap-crossplane-identity aborted the entire install (exit 201):
the wait loop's 'grep -c True' returns exit 1 on zero Ready PIAs, which under
go-task's set:[errexit,pipefail] failed the task. It also ran before
crossplane-base is deployed (No resources found on the pod restart), making
the restart premature.

Strip the task to PIA creation only. Provider pods start after these PIAs
exist (credentialed at startup); the phase1 hub:restart-identity-pods task
already waits for all PIAs Ready and restarts providers non-fatally.
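
For reference, the `grep -c` behaviour behind the abort can be reproduced and neutralised as in this sketch. `count_ready` is a hypothetical helper; the commit itself removes the wait loop rather than patching it.

```shell
#!/usr/bin/env bash
set -e -o pipefail

# Count stdin lines containing "True". 'grep -c' prints 0 but exits 1 when
# nothing matches, so '|| true' keeps errexit from aborting the caller.
count_ready() {
  grep -c True || true
}

printf 'False\nFalse\n' | count_ready
```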
…ned spokes

Provisioning a spoke via the KRO path creates the EksclusterWithVpc claim on the
HUB, so the HUB's kro capability (peeks-hub-kro-capability-role/KRO) renders the
ACK vpc/subnet/eks CRs. The eks-capabilities-rbac ClusterRole/Binding that grants
those ACK API groups only targeted enable_kro_manifests (spokes); the hub uses
enable_kro_manifests_hub, so it never got the RBAC. The hub's kro cap role is not
cluster-admin, so KRO was RBAC-denied:

  vpcs.ec2.services.k8s.aws is forbidden: User .../peeks-hub-kro-capability-role/KRO
  cannot get resource vpcs in ec2.services.k8s.aws in namespace peeks-spoke-prod

Add eks-capabilities-rbac-hub, gated on enable_kro_manifests_hub, mirroring the
kro-manifests/kro-manifests-hub split. No cluster has both labels, so no collision.
…bject delete recovery

Document the 'vpcs.ec2.services.k8s.aws forbidden' error for kro-provisioned
spokes (hub kro cap not cluster-admin, eks-capabilities-rbac only on spokes),
the eks-capabilities-rbac-hub fix, and the key manual step: KRO won't reconcile
over the ACK object left half-created during the denied window — delete it so
KRO recreates it cleanly.
KRO-provisioned spokes are created by the hub's ACK EKS capability, not the
Crossplane providers, so spoke EKS creation does not depend on phase1 restart /
wait-for-providers. Move set-overlay-repo + install:phase2-parallel ahead of the
provider bring-up so the ~25min spoke build starts earlier and overlaps with the
provider restart, observability seeding, ray image build and idc. kro-ack only;
kind-crossplane spoke-dev genuinely uses the Crossplane path and is unchanged.
…usters-kro

The clusters-kro ApplicationSet hardcoded argoCdHubRoleArn/argoCdCapabilityRoleArn
to <cluster>-argocd-capability-role (kebab). That's correct on kro-ack (its RGD
creates the role kebab-case) but wrong on the crossplane hub, whose ArgoCD
capability role is <cluster>-ArgoCDCapabilityRole (PascalCase). A KRO-provisioned
spoke on the crossplane hub then built an argocd-role trust policy referencing a
non-existent principal, and ACK IAM went terminal:

  MalformedPolicyDocument: Invalid principal in policy:
  AWS: arn:...:role/peeks-hub-argocd-capability-role

Key the suffix off the hub secret's provider label (kro-ack => kebab; crossplane/
no-label => PascalCase) so the trust policy references an existing role.
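
The selection rule, expressed as a shell sketch purely for illustration. The actual change lives in the ApplicationSet template, and the function name here is hypothetical.

```shell
# Pick the ArgoCD capability role name from the hub's provider label:
# kro-ack hubs use the kebab-case name, anything else (crossplane or no
# label) uses the PascalCase name.
argocd_capability_role_name() {
  cluster="$1"; provider="${2:-}"
  if [ "$provider" = "kro-ack" ]; then
    printf '%s-argocd-capability-role\n' "$cluster"
  else
    printf '%s-ArgoCDCapabilityRole\n' "$cluster"
  fi
}

argocd_capability_role_name "peeks-hub" "kro-ack"
```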
…Capability enabled

The spoke argocd-role trust policy unconditionally listed
<cluster>-argocd-capability-role as a principal, but that role is only created
when argocdCapability.enabled==true (hub only). On any spoke (enabled=false) the
principal doesn't exist, so IAM rejected the whole trust:

  MalformedPolicyDocument: Invalid principal in policy:
  AWS: arn:...:role/peeks-spoke-prod-argocd-capability-role

Make the second principal conditional: capability role ARN when enabled (hub),
otherwise duplicate argoCdHubRoleArn (IAM dedupes) so there is no dangling
principal. Fixes KRO-provisioned spokes on both hubs (surfaced on the crossplane
hub where spoke-prod is the KRO path).
…labels

The ApplicationSet clusters generator exposes .metadata.labels as map[string]string,
but sprig hasKey expects map[string]interface{} -> 'wrong type for value' template
error, leaving clusters-kro Degraded and blocking spoke generation. Use a plain
index lookup instead: index on map[string]string returns "" for a missing key
without tripping missingkey=error (which only affects .field access), so crossplane
hubs (no provider label) correctly fall through to the PascalCase suffix.
…visions

secrets-manager:seed wrote the config secret's `vpc` property as a bare VPC-id
string (--arg vpc '{{.VPC_ID}}'). But the aws-resources ExternalSecret reads that
property into peeks-hub-vpc-secret.vpc_data, and the init-env-config Job parses
.id/.subnet_ids/.cluster_security_group_id from it to build the vpc-config
EnvironmentConfig the devlake RDS composition consumes. A bare string makes every
jq lookup empty -> empty vpc-config -> RDS security group gets vpcId=null and the
subnet group gets empty subnetIds -> RDS Instance never created -> devlake stuck
waiting for its MySQL endpoint secret.

Resolve PRIVATE_SUBNET_IDS + CLUSTER_SG and write `vpc` as the JSON object
{id,subnet_ids,cluster_security_group_id} (stored as a JSON string via --arg,
matching kind-crossplane). Sole consumer of the property is that ExternalSecret;
bare-id consumers use the separate aws_vpc_id metadata field, so no other reader
is affected.
… access entry

clusters-crossplane set argoCDRoleArn to <cluster>-argocd-capability-role (kebab),
but the crossplane hub's ArgoCD capability role is <cluster>-ArgoCDCapabilityRole
(PascalCase). The composition therefore created the spoke's argocd AccessEntry for a
non-existent principal, so the hub ArgoCD (connecting as the real PascalCase role)
failed with 'failed to verify the access entry' and could not load state / sync any
app on the crossplane-provisioned spoke (spoke-dev). Correct to PascalCase; this
appset only runs on the crossplane hub.
The eks-capabilities-kro ClusterRole granted ec2/eks/iam/ecr/secretsmanager/
dynamodb ACK groups but NOT s3.services.k8s.aws. Creating an S3 bucket via the
Backstage ACK/KRO template (workshop module 10.6) was RBAC-denied for KRO/ACK on
s3.services.k8s.aws. Add the S3 group (covers both spoke eks-capabilities-rbac and
hub eks-capabilities-rbac-hub, same chart).
…lass

Two workshop-breaking blockers found by kro-ack testing:

#3 Backstage 'Invalid GitLab integration config, $GITLAB_CF_DOMAIN is not a valid
host': the kro-ack secrets seed wrote the config metadata as a single-quoted jq
--arg, so the bash var $GITLAB_CF_DOMAIN was NOT expanded and the cluster-secret
gitlab_domain_name annotation held the literal '$GITLAB_CF_DOMAIN'. That flows to
the Backstage gitlab_domain_name Helm value -> dynamic-catalog gitlab_hostname ->
integrations.gitlab host, breaking EVERY Backstage template. Break out of the single
quotes so the real GitLab CloudFront host is substituted (crossplane already did this
via jq --arg).
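
The quoting difference can be seen in isolation with the sketch below. The domain value is a placeholder and the JSON shape is simplified to the one key named above.

```shell
GITLAB_CF_DOMAIN="gitlab.example.test"   # placeholder value

# Single quotes: the shell passes the text through untouched, so the
# variable name itself ends up in the JSON.
literal='{"gitlab_domain_name":"$GITLAB_CF_DOMAIN"}'

# Close the single quotes, expand inside double quotes, then reopen.
expanded='{"gitlab_domain_name":"'"$GITLAB_CF_DOMAIN"'"}'

printf '%s\n' "$literal" "$expanded"
```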

#4 kro AppmodService/RayService hardcoded ingressClassName: alb, but EKS Auto Mode
provides the 'platform' IngressClass (controller eks.amazonaws.com/alb) and there is
no 'alb' class -> every app Ingress failed with 'ingressClass alb not found', the ALB
address never populated, curl to /hello-world etc. failed. Align appmod-service.yaml
(x2) and ray-service.yaml to 'platform' (cicd-pipeline.yaml already used it).