feat(hyperpod-eks): add Cilium CNI support with overlay and chaining modes#1090
feat(hyperpod-eks): add Cilium CNI support with overlay and chaining modes#1090bluecrayon52 wants to merge 17 commits into
Conversation
…ilium deploys HyperPod nodes only join the cluster after the hyperpod_cluster module runs, which depends on this module completing. Using wait=true causes a deadlock since the Cilium DaemonSet cannot become ready without nodes.
…od incompatibility - Add ec2:DescribeRouteTables and ec2:CreateTags to Cilium ENI IAM policy (required by Cilium operator for route table discovery and ENI tagging) - Add check block that warns when ENI mode is used with HyperPod (SageMaker-managed instances are not visible in EC2 API) - Update cilium_mode description to document ENI+HyperPod incompatibility
Run terraform fmt -recursive across the hyperpod-eks-tf directory. All changes are alignment/spacing only (no logic changes).
SageMaker HyperPod instances are not visible in the EC2 API, so the Cilium operator cannot discover or manage ENIs on nodes. Removes ENI mode entirely since this Terraform stack is exclusively for HyperPod. - Remove 'eni' from cilium_mode allowed values - Remove ENI IAM policy resource and sagemaker_execution_role_name var - Remove eni_values from Cilium module locals - Update README and tfvars examples
mvinci12
left a comment
There was a problem hiding this comment.
Thanks for this — the architecture is clean, the test matrix is excellent, and the ENI-mode rationale is exactly right. Approving the direction; requesting a few changes before merge.
1. Add idempotency detection for the new VXLAN SG rules (should-fix)
What's wrong
The two new rules in modules/security_group/main.tf (cilium_vxlan_ingress / cilium_vxlan_egress) are gated only on var.enable_vxlan_rule. Every other rule in this module first scans the existing security group (has_intra_sg_ingress, has_fsx_lustre_ingress_988, has_fsx_lustre_ingress_1018_1023, has_vpc_endpoint_https_ingress, etc.) and only creates the rule if it isn't already there.
Why it matters
The PR explicitly supports the "existing EKS cluster with Cilium already installed" path (create_eks_module = false + enable_cilium = true). In that scenario the UDP 8472 rules will almost certainly already exist on the SG, and terraform apply will fail with InvalidPermission.Duplicate. This breaks the integration scenario the README documents.
Suggested fix
Add has_vxlan_ingress / has_vxlan_egress locals that scan local.rules for ip_protocol == "udp" && from_port == 8472 && to_port == 8472 && referenced_security_group_id == var.existing_security_group_id, and gate the new resources the same way the others are gated:
count = var.enable_vxlan_rule && (var.create_new_sg || !local.has_vxlan_ingress) ? 1 : 02. Remove unused `eks_cluster_name` variable in cilium module (should-fix)
What's wrong
modules/cilium/variables.tf declares eks_cluster_name and the root main.tf passes it in, but nothing inside modules/cilium/main.tf references it.
Why it matters
Dead inputs create a maintenance contract that doesn't actually exist and confuse future readers ("what is this for? what happens if I leave it empty?").
Suggested fix
Either drop the variable (and the corresponding line in root main.tf), or use it — e.g., as a Helm value (cluster.name = var.eks_cluster_name) or a tag on the release.
3. Remove stale `eni` reference in `cilium_helm_values` description (should-fix)
What's wrong
In root variables.tf, the description string for cilium_helm_values reads:
"For overlay/eni/chaining modes, these values override the base defaults."
ENI mode was intentionally removed from this stack (HyperPod nodes aren't visible to the EC2 API), and the validation block correctly rejects it.
Why it matters
Documentation that references an unsupported mode will mislead users into trying cilium_mode = "eni" and hitting a validation error.
Suggested fix
Drop `eni/` from the description so it reads "For overlay/chaining modes, these values override the base defaults."
4. Expose `cilium_helm_repository` for closed-network (discuss)
What's wrong
modules/cilium/main.tf hardcodes `repository = "https://helm.cilium.io/"\`.
Why it matters
The PR's own README Limitations section calls out closed-network as a constraint: "Cilium images must be pre-staged to ECR in closed-network deployments." HyperPod has real closed-network customers (regulated industries, gov regions), and a hardcoded public URL forces them into cilium_mode = "custom" and rebuilding the entire values surface just to redirect the chart source. That's a clunky workaround for a one-line change.
Suggested fix
Add a variable and thread it through:
variable "cilium_helm_repository" {
description = "Helm chart repository URL for Cilium. Override for closed-network deployments using a private mirror."
type = string
default = "https://helm.cilium.io/"
}Open question worth your input: would you prefer a separate `cilium_helm_chart` variable too (so users can point at a fully qualified OCI ref like `oci://.dkr.ecr..amazonaws.com/cilium`), or keep it to just the repository URL?
Summary
Adds optional Cilium CNI support to the HyperPod EKS Terraform stack. Users can deploy Cilium in overlay or chaining mode as an alternative to the default AWS VPC CNI.
modules/cilium/module with Helm-based deploymentChanges
New files
modules/cilium/main.tf— Helm release + mode-specific value locals (overlay/chaining/custom)modules/cilium/variables.tf—cilium_mode,cilium_version,cilium_helm_valuesmodules/cilium/outputs.tf— release name, namespace, modeAGENTS.md— developer conventions and platform constraints documentationModified files
main.tf— Cilium module wiring,skip_vpc_cni/create_ciliumlocals, dependency orderingvariables.tf—enable_cilium,cilium_mode,cilium_version,cilium_helm_valuesvariablesmodules/eks_cluster/main.tf— conditionalvpc-cniaddon (skipped whenskip_vpc_cni = true)modules/eks_cluster/variables.tf+outputs.tf—skip_vpc_cnivariable,eks_cluster_role_arnoutputmodules/security_group/main.tf+variables.tf— conditional UDP 8472 VXLAN ingress/egress rulesREADME.md— Cilium usage docs, mode comparison table, limitationsterraform.tfvars— commented Cilium examples.gitignore— addeddocs/for agent working filesterraform fmtalignment (cosmetic only, no logic changes)Supported Modes
overlaychainingcustomENI mode is intentionally excluded — SageMaker HyperPod instances are not visible in the EC2 API (
ec2:DescribeInstancesreturns nothing), so the Cilium operator cannot discover or manage ENIs on nodes.Test Results
Tested on live HyperPod clusters in
us-west-2withml.m5.12xlargeinstances.Overlay mode — PASS
Routing: Tunnel [vxlan]10.0.0.0/24)Chaining mode — PASS
CNI Chaining: aws-cni,Routing: Native10.1.x.x)ENI mode — INCOMPATIBLE (removed)
numInstances=0, "Instance not found"Regression (no Cilium) — PASS
Usage
Reviewer Validation Steps
Static validation:
cd 1.architectures/7.sagemaker-hyperpod-eks/terraform-modules/hyperpod-eks-tf terraform init terraform validate terraform fmt -check -recursivePlan check (overlay):
Verify: Cilium helm release planned, no
vpc-cniaddon, VXLAN SG rules present.Plan check (chaining):
Verify: Cilium helm release planned,
vpc-cniaddon present, no VXLAN SG rules.Plan check (no Cilium — regression):
Verify: No cilium module,
vpc-cniaddon present, no VXLAN rules.Review module code:
modules/cilium/main.tf— confirm mode defaults are reasonablemain.tflines 45-47 — verifyskip_vpc_cniandcreate_ciliumlogicmodules/security_group/main.tf— VXLAN rules gated byenable_vxlan_ruleNotes
helm_release.ciliumuseswait = falsebecause Cilium DaemonSet can't schedule until HyperPod nodes join (they join after the cluster is created, which depends on this module)/tmp/helm-repomust contain a clone ofhttps://github.com/aws/sagemaker-hyperpod-cli.gitfor the helm_chart moduleterraform fmtchanges (cosmetic alignment) are in a separate commit for easy reviewExisting EKS Clusters with Cilium
When integrating HyperPod with an existing EKS cluster that already has Cilium installed, set
create_eks_module = falseandenable_cilium = truewith thecilium_modematching the existing Cilium configuration. This will:The customer's existing Cilium installation must be running in overlay or chaining mode — ENI mode will not work once HyperPod nodes join.