Add fully Terraform-managed EKS deployment mode #232

Closed

Erik Weathers (erikdw) wants to merge 1 commit into main from erikdw/add-eks-deployment-mode

Conversation

@erikdw


Introduces `create_eks_cluster = true`, which provisions an EKS cluster, the supporting AWS infrastructure, and the Braintrust Helm release end-to-end. Previously `use_deployment_mode_external_eks` assumed the cluster was managed outside Terraform; the new mode lets the module own the full lifecycle.

## Structure

Two new submodules under `modules/`, plus a thin root-level wiring file (`eks.tf`).

### `modules/eks-cluster/` — AWS infrastructure

- EKS cluster via `terraform-aws-modules/eks` v21
- OIDC provider and IRSA trust policies, OIDC-only and scoped to the `braintrust-api` and `brainstore` service accounts (a trust-policy sketch follows this list)
- Core addons: `vpc-cni`, `coredns`, `kube-proxy`
- Private-subnet tagging (`kubernetes.io/role/internal-elb`) for Load Balancer Controller discovery
- Pre-created internal NLB with a CloudFront-restricted security group. NLB security groups cannot be attached after creation, so the module creates the NLB itself and lets the LB Controller adopt it via the `service.beta.kubernetes.io/aws-load-balancer-name` annotation
- CloudFront VPC Origin and distribution: the default behavior routes to the API service running in EKS; AI-proxy paths route to `braintrustproxy.com`
- IAM role for the AWS Load Balancer Controller (IRSA)
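
For readers unfamiliar with IRSA, the trust policies above follow the standard `sts:AssumeRoleWithWebIdentity` pattern. A minimal sketch, assuming the OIDC provider resource is named `eks` and the service accounts live in a `braintrust` namespace (both assumptions; the module's actual identifiers may differ):

    # Sketch only: the resource name and "braintrust" namespace are assumptions.
    data "aws_iam_policy_document" "api_trust" {
      statement {
        actions = ["sts:AssumeRoleWithWebIdentity"]

        principals {
          type        = "Federated"
          identifiers = [aws_iam_openid_connect_provider.eks.arn]
        }

        # Scope the role to exactly one service account; OIDC-only means no
        # EC2 or Lambda principals appear in the statement.
        condition {
          test     = "StringEquals"
          variable = "${replace(aws_iam_openid_connect_provider.eks.url, "https://", "")}:sub"
          values   = ["system:serviceaccount:braintrust:braintrust-api"]
        }

        condition {
          test     = "StringEquals"
          variable = "${replace(aws_iam_openid_connect_provider.eks.url, "https://", "")}:aud"
          values   = ["sts.amazonaws.com"]
        }
      }
    }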

### `modules/eks-deploy/` — Kubernetes + Helm

- Kubernetes namespace
- Runtime `Secret` with keys `PG_URL`, `REDIS_URL`, `FUNCTION_SECRET_KEY`, `BRAINSTORE_LICENSE_KEY`. The name (`braintrust-secrets`) and keys are hardcoded by the chart.
- Helm release: AWS Load Balancer Controller
- Helm release: Braintrust chart, with a thin values template that sets only what the module owns (org name, namespace, `cloud`, S3 buckets, IRSA role ARNs, NLB adoption annotations, WAL / no-PG flags). Chart defaults handle everything else.
- Structured per-component overrides — `api_helm`, `brainstore_reader_helm`, `brainstore_fastreader_helm`, `brainstore_writer_helm` — each accepting optional `replicas` and `resources`. Raw-YAML `helm_chart_extra_values` as an escape hatch for anything the structured variables do not cover.
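
As a sketch of the structured-override shape (the exact `variables.tf` schema may differ, and the `resources` type here is a guess):

    variable "api_helm" {
      description = "Structured Helm overrides for the API component."
      type = object({
        replicas  = optional(number)
        resources = optional(map(map(string))) # e.g. { requests = { cpu = "500m" } }
      })
      default = {}
    }

A consumer would then pass something like `api_helm = { replicas = 3 }` and reach for `helm_chart_extra_values` only for settings without a structured knob.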

### Why two submodules (and a root-level wiring file)

`services_common` creates IAM roles shared with the non-EKS (Lambda / EC2) path, so it must sit at the root between `eks_cluster` (produces trust policies) and `eks_deploy` (consumes role ARNs). Wrapping both EKS submodules in a single parent would create a module-level dependency cycle through `services_common`: `eks_deploy` would need role ARNs from `services_common`, while `services_common` would need trust policies from `eks_cluster`, and Terraform treats module I/O atomically for cycle detection.
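
Concretely, the root wiring looks something like the following (argument and output names here are illustrative, not the module's actual interface):

    module "eks_cluster" {
      source = "./modules/eks-cluster"
      count  = var.create_eks_cluster ? 1 : 0
      # ...
    }

    module "services_common" {
      source = "./modules/services-common"

      # Trust policies flow from eks_cluster into services_common...
      override_api_trust_policy = var.create_eks_cluster ? module.eks_cluster[0].api_trust_policy : null
    }

    module "eks_deploy" {
      source = "./modules/eks-deploy"
      count  = var.create_eks_cluster ? 1 : 0

      # ...and the resulting role ARNs flow into eks_deploy. At the root each
      # edge is a plain reference, so no single module both feeds and consumes
      # services_common.
      api_role_arn = module.services_common.api_role_arn
    }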

## Changes to existing root files

- `main.tf`: `services_common` uses `eks_cluster`'s trust policies as `override_*_trust_policy` when `create_eks_cluster = true`; the `database` and `redis` `authorized_security_groups` include the EKS node security group in EKS mode.
- `outputs.tf`: `api_url` and `cloudfront_*` outputs resolve to the EKS CloudFront distribution in EKS mode. Adds EKS-specific outputs plus database, Redis, storage, and IAM outputs consumed by downstream integrations.
- `variables.tf`: new EKS knobs — `create_eks_cluster`, `eks_node_instance_type`, `eks_node_min_size`, `eks_node_max_size`, `eks_node_desired_size`, `eks_kubernetes_version`, `helm_chart_version`, and the four structured helm-override variables plus the raw-YAML escape hatch.
- `versions.tf`: notes that `kubernetes`, `helm`, and `random` are declared in `modules/eks-deploy`. Non-EKS consumers still need empty provider blocks at the root because Terraform aggregates provider requirements across all submodules regardless of `count`, but the underlying resources are never evaluated when `create_eks_cluster = false`.
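
For a non-EKS consumer, satisfying those aggregated requirements amounts to declaring empty blocks at the root, along the lines of:

    # Required only so Terraform can resolve provider requirements; never
    # configured or used when create_eks_cluster = false.
    provider "kubernetes" {}
    provider "helm" {}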

## Example

`examples/braintrust-data-plane-eks/` is a thin consumer — provider configuration plus a single module call — demonstrating the two-step apply workflow required on a fresh deployment:

    terraform apply -target=module.braintrust.module.eks_cluster[0]
    terraform apply

Step 1 creates the cluster so the `kubernetes` and `helm` providers in `provider.tf` can resolve its endpoint via `data.aws_eks_cluster` looked up by the known name `${deployment_name}-eks`. Step 2 deploys the Kubernetes namespace, `Secret`, and Helm releases.
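
A sketch of what that `provider.tf` wiring could look like, assuming exec-based auth and helm provider 2.x block syntax (both assumptions):

    data "aws_eks_cluster" "this" {
      name = "${var.deployment_name}-eks"
    }

    provider "kubernetes" {
      host                   = data.aws_eks_cluster.this.endpoint
      cluster_ca_certificate = base64decode(data.aws_eks_cluster.this.certificate_authority[0].data)

      # Fetch a short-lived token at plan/apply time instead of relying on a
      # static kubeconfig.
      exec {
        api_version = "client.authentication.k8s.io/v1beta1"
        command     = "aws"
        args        = ["eks", "get-token", "--cluster-name", data.aws_eks_cluster.this.name]
      }
    }

    provider "helm" {
      kubernetes {
        host                   = data.aws_eks_cluster.this.endpoint
        cluster_ca_certificate = base64decode(data.aws_eks_cluster.this.certificate_authority[0].data)

        exec {
          api_version = "client.authentication.k8s.io/v1beta1"
          command     = "aws"
          args        = ["eks", "get-token", "--cluster-name", data.aws_eks_cluster.this.name]
        }
      }
    }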

## Contract

`CONTRACT.md` documents the coupling surface between this module and `braintrustdata/helm`: service account names, `Secret` name and keys, API port `8000`, the helm-values schema this module writes, and the assumption that `brainstore.fastreader.replicas >= 1` — the chart's `api-configmap.yaml` unconditionally emits `BRAINSTORE_FAST_READER_URL`, so `replicas = 0` would leave the API pointing at an empty service. A mirror of `CONTRACT.md` lives in the helm repo.
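
One way to surface that constraint at plan time, rather than as a broken `BRAINSTORE_FAST_READER_URL` at runtime, would be a variable validation (a sketch of the pattern, not a claim that the module does this):

    variable "brainstore_fastreader_helm" {
      type = object({
        replicas  = optional(number)
        resources = optional(map(map(string)))
      })
      default = {}

      validation {
        condition     = coalesce(var.brainstore_fastreader_helm.replicas, 1) >= 1
        error_message = "brainstore.fastreader.replicas must be >= 1; the chart always emits BRAINSTORE_FAST_READER_URL."
      }
    }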

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@erikdw
Author

Superseded by: #233
