Skip to content

feat(hyperpod-eks): add Cilium CNI support with overlay and chaining modes#1090

Open
bluecrayon52 wants to merge 17 commits into
mainfrom
hyperpod-eks-cilium-tf
Open

feat(hyperpod-eks): add Cilium CNI support with overlay and chaining modes#1090
bluecrayon52 wants to merge 17 commits into
mainfrom
hyperpod-eks-cilium-tf

Conversation

@bluecrayon52
Copy link
Copy Markdown
Contributor

@bluecrayon52 bluecrayon52 commented May 14, 2026

Summary

Adds optional Cilium CNI support to the HyperPod EKS Terraform stack. Users can deploy Cilium in overlay or chaining mode as an alternative to the default AWS VPC CNI.

  • New modules/cilium/ module with Helm-based deployment
  • Conditional VPC CNI skip when Cilium replaces it (overlay/custom modes)
  • VXLAN security group rules for overlay mode
  • Full integration tested on live HyperPod clusters

Changes

New files

  • modules/cilium/main.tf — Helm release + mode-specific value locals (overlay/chaining/custom)
  • modules/cilium/variables.tfcilium_mode, cilium_version, cilium_helm_values
  • modules/cilium/outputs.tf — release name, namespace, mode
  • AGENTS.md — developer conventions and platform constraints documentation

Modified files

  • main.tf — Cilium module wiring, skip_vpc_cni / create_cilium locals, dependency ordering
  • variables.tfenable_cilium, cilium_mode, cilium_version, cilium_helm_values variables
  • modules/eks_cluster/main.tf — conditional vpc-cni addon (skipped when skip_vpc_cni = true)
  • modules/eks_cluster/variables.tf + outputs.tfskip_vpc_cni variable, eks_cluster_role_arn output
  • modules/security_group/main.tf + variables.tf — conditional UDP 8472 VXLAN ingress/egress rules
  • README.md — Cilium usage docs, mode comparison table, limitations
  • terraform.tfvars — commented Cilium examples
  • .gitignore — added docs/ for agent working files
  • Various modules — terraform fmt alignment (cosmetic only, no logic changes)

Supported Modes

Mode Description VPC CNI
overlay VXLAN tunnel, Cilium IPAM (non-VPC-routable pod IPs) Removed
chaining VPC CNI handles IPAM, Cilium adds eBPF policy/observability Kept
custom User provides all Helm values, no defaults applied Removed

ENI mode is intentionally excluded — SageMaker HyperPod instances are not visible in the EC2 API (ec2:DescribeInstances returns nothing), so the Cilium operator cannot discover or manage ENIs on nodes.

Test Results

Tested on live HyperPod clusters in us-west-2 with ml.m5.12xlarge instances.

Overlay mode — PASS

  • Cilium agent + operator healthy, Routing: Tunnel [vxlan]
  • VPC CNI addon absent, VXLAN SG rules (UDP 8472) present
  • Pods get Cilium cluster-pool IPs (10.0.0.0/24)
  • Pod-to-pod ping: 0% loss, <0.2ms
  • CoreDNS resolution + Kubernetes API access working
  • HyperPod node registered and all Cilium endpoints ready

Chaining mode — PASS

  • Cilium agent + operator healthy, CNI Chaining: aws-cni, Routing: Native
  • VPC CNI addon present, no VXLAN SG rules
  • Pods get VPC IPs from subnet (10.1.x.x)
  • Pod-to-pod ping: 0% loss
  • DNS + K8s API working, Cilium endpoints tracking VPC-IP pods

ENI mode — INCOMPATIBLE (removed)

  • Cilium operator crashes: numInstances=0, "Instance not found"
  • Root cause: HyperPod instances invisible to EC2 API
  • Decision: removed from stack since this is a HyperPod-only deployment

Regression (no Cilium) — PASS

  • Standard VPC CNI behavior unchanged
  • No Cilium artifacts deployed
  • Pod networking works identically to pre-change

Usage

# Overlay mode (Cilium replaces VPC CNI entirely)
enable_cilium = true
cilium_mode   = "overlay"

# Chaining mode (VPC CNI + Cilium policy/observability)
enable_cilium = true
cilium_mode   = "chaining"

Reviewer Validation Steps

  1. Static validation:

    cd 1.architectures/7.sagemaker-hyperpod-eks/terraform-modules/hyperpod-eks-tf
    terraform init
    terraform validate
    terraform fmt -check -recursive
  2. Plan check (overlay):

    terraform plan -var-file="terraform.tfvars" \
      -var="enable_cilium=true" -var='cilium_mode=overlay' \
      -var="create_fsx_module=false"

    Verify: Cilium helm release planned, no vpc-cni addon, VXLAN SG rules present.

  3. Plan check (chaining):

    terraform plan -var-file="terraform.tfvars" \
      -var="enable_cilium=true" -var='cilium_mode=chaining' \
      -var="create_fsx_module=false"

    Verify: Cilium helm release planned, vpc-cni addon present, no VXLAN SG rules.

  4. Plan check (no Cilium — regression):

    terraform plan -var-file="terraform.tfvars" -var="create_fsx_module=false"

    Verify: No cilium module, vpc-cni addon present, no VXLAN rules.

  5. Review module code:

    • modules/cilium/main.tf — confirm mode defaults are reasonable
    • main.tf lines 45-47 — verify skip_vpc_cni and create_cilium logic
    • modules/security_group/main.tf — VXLAN rules gated by enable_vxlan_rule

Notes

  • helm_release.cilium uses wait = false because Cilium DaemonSet can't schedule until HyperPod nodes join (they join after the cluster is created, which depends on this module)
  • /tmp/helm-repo must contain a clone of https://github.com/aws/sagemaker-hyperpod-cli.git for the helm_chart module
  • The terraform fmt changes (cosmetic alignment) are in a separate commit for easy review

Existing EKS Clusters with Cilium

When integrating HyperPod with an existing EKS cluster that already has Cilium installed, set create_eks_module = false and enable_cilium = true with the cilium_mode matching the existing Cilium configuration. This will:

  • Skip Cilium deployment (it's already on the cluster)
  • Add appropriate security group rules to the HyperPod SG (e.g., VXLAN UDP 8472 for overlay mode)
  • Skip VPC CNI addon creation

The customer's existing Cilium installation must be running in overlay or chaining mode — ENI mode will not work once HyperPod nodes join.

…ilium deploys

HyperPod nodes only join the cluster after the hyperpod_cluster module
runs, which depends on this module completing. Using wait=true causes a
deadlock since the Cilium DaemonSet cannot become ready without nodes.
…od incompatibility

- Add ec2:DescribeRouteTables and ec2:CreateTags to Cilium ENI IAM policy
  (required by Cilium operator for route table discovery and ENI tagging)
- Add check block that warns when ENI mode is used with HyperPod
  (SageMaker-managed instances are not visible in EC2 API)
- Update cilium_mode description to document ENI+HyperPod incompatibility
Run terraform fmt -recursive across the hyperpod-eks-tf directory.
All changes are alignment/spacing only (no logic changes).
SageMaker HyperPod instances are not visible in the EC2 API, so the
Cilium operator cannot discover or manage ENIs on nodes. Removes ENI
mode entirely since this Terraform stack is exclusively for HyperPod.

- Remove 'eni' from cilium_mode allowed values
- Remove ENI IAM policy resource and sagemaker_execution_role_name var
- Remove eni_values from Cilium module locals
- Update README and tfvars examples
@bluecrayon52 bluecrayon52 requested a review from mvinci12 May 14, 2026 11:22
Copy link
Copy Markdown
Contributor

@mvinci12 mvinci12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this — the architecture is clean, the test matrix is excellent, and the ENI-mode rationale is exactly right. Approving the direction; requesting a few changes before merge.


1. Add idempotency detection for the new VXLAN SG rules (should-fix)

What's wrong
The two new rules in modules/security_group/main.tf (cilium_vxlan_ingress / cilium_vxlan_egress) are gated only on var.enable_vxlan_rule. Every other rule in this module first scans the existing security group (has_intra_sg_ingress, has_fsx_lustre_ingress_988, has_fsx_lustre_ingress_1018_1023, has_vpc_endpoint_https_ingress, etc.) and only creates the rule if it isn't already there.

Why it matters
The PR explicitly supports the "existing EKS cluster with Cilium already installed" path (create_eks_module = false + enable_cilium = true). In that scenario the UDP 8472 rules will almost certainly already exist on the SG, and terraform apply will fail with InvalidPermission.Duplicate. This breaks the integration scenario the README documents.

Suggested fix
Add has_vxlan_ingress / has_vxlan_egress locals that scan local.rules for ip_protocol == "udp" && from_port == 8472 && to_port == 8472 && referenced_security_group_id == var.existing_security_group_id, and gate the new resources the same way the others are gated:

count = var.enable_vxlan_rule && (var.create_new_sg || !local.has_vxlan_ingress) ? 1 : 0

2. Remove unused `eks_cluster_name` variable in cilium module (should-fix)

What's wrong
modules/cilium/variables.tf declares eks_cluster_name and the root main.tf passes it in, but nothing inside modules/cilium/main.tf references it.

Why it matters
Dead inputs create a maintenance contract that doesn't actually exist and confuse future readers ("what is this for? what happens if I leave it empty?").

Suggested fix
Either drop the variable (and the corresponding line in root main.tf), or use it — e.g., as a Helm value (cluster.name = var.eks_cluster_name) or a tag on the release.


3. Remove stale `eni` reference in `cilium_helm_values` description (should-fix)

What's wrong
In root variables.tf, the description string for cilium_helm_values reads:

"For overlay/eni/chaining modes, these values override the base defaults."

ENI mode was intentionally removed from this stack (HyperPod nodes aren't visible to the EC2 API), and the validation block correctly rejects it.

Why it matters
Documentation that references an unsupported mode will mislead users into trying cilium_mode = "eni" and hitting a validation error.

Suggested fix
Drop `eni/` from the description so it reads "For overlay/chaining modes, these values override the base defaults."


4. Expose `cilium_helm_repository` for closed-network (discuss)

What's wrong
modules/cilium/main.tf hardcodes `repository = "https://helm.cilium.io/"\`.

Why it matters
The PR's own README Limitations section calls out closed-network as a constraint: "Cilium images must be pre-staged to ECR in closed-network deployments." HyperPod has real closed-network customers (regulated industries, gov regions), and a hardcoded public URL forces them into cilium_mode = "custom" and rebuilding the entire values surface just to redirect the chart source. That's a clunky workaround for a one-line change.

Suggested fix
Add a variable and thread it through:

variable "cilium_helm_repository" {
  description = "Helm chart repository URL for Cilium. Override for closed-network deployments using a private mirror."
  type        = string
  default     = "https://helm.cilium.io/"
}

Open question worth your input: would you prefer a separate `cilium_helm_chart` variable too (so users can point at a fully qualified OCI ref like `oci://.dkr.ecr..amazonaws.com/cilium`), or keep it to just the repository URL?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants