Add EKS Auto Mode deployment mode #233
Draft
Erik Weathers (erikdw) wants to merge 23 commits into main from
Introduces `create_eks_cluster = true`, which provisions an EKS Auto Mode cluster and deploys the Braintrust Helm chart on it end-to-end. Uses raw AWS provider resources (no `terraform-aws-modules/eks` dependency) and EKS Pod Identity for pod-to-IAM binding.
## Why Auto Mode
Auto Mode collapses most of the yak shave for a production EKS deployment:
- **Node provisioning**: AWS runs Karpenter internally; no managed node group to define.
- **Core addons**: `vpc-cni`, `coredns`, `kube-proxy`, EBS CSI driver, and the AWS Load Balancer Controller come preinstalled — no `aws_eks_addon` resources, no LB Controller IAM role / Helm release.
- **Pod Identity**: the Pod Identity Agent ships built-in, enabling a simpler alternative to IRSA (no OIDC provider, no TLS-thumbprint wrangling, no `data.tls_certificate`).
This module therefore only has to own the cluster + node IAM roles, the VPC wiring, the pre-created NLB + CloudFront distribution, and the Braintrust-specific K8s objects.
## Structure
Two submodules under `modules/`, with a thin root-level wiring file (`eks.tf`). Both use only AWS provider primitives — no community module.
### `modules/eks-cluster/` — AWS infrastructure
- `aws_iam_role` for the cluster, with Auto Mode's five required managed policies attached: `AmazonEKSClusterPolicy`, `AmazonEKSComputePolicy`, `AmazonEKSBlockStoragePolicy`, `AmazonEKSLoadBalancingPolicy`, `AmazonEKSNetworkingPolicy`.
- `aws_iam_role` for Auto Mode nodes, with `AmazonEKSWorkerNodeMinimalPolicy` and `AmazonEC2ContainerRegistryPullOnly`.
- `aws_eks_cluster` with `compute_config`, `storage_config`, and `kubernetes_network_config.elastic_load_balancing` all enabled. `access_config.authentication_mode = "API"` uses EKS access entries (no aws-auth configmap).
- Pre-created internal NLB (`aws_lb`) with a CloudFront-prefix-list security group. NLB security groups cannot be attached after creation, so the module creates the NLB itself; the Auto-Mode-managed LB Controller adopts it later via the chart's `service.beta.kubernetes.io/aws-load-balancer-name` annotation.
- `aws_cloudfront_vpc_origin` wrapping the NLB, plus an `aws_cloudfront_distribution` whose default behavior routes to the EKS API and whose AI-proxy paths route to `braintrustproxy.com`.
- Private subnet tags (`kubernetes.io/role/internal-elb`) for LB Controller subnet auto-discovery.
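The cluster resource described by these bullets can be sketched roughly as follows; resource names and variable references here are illustrative, not the module's actual code:

```hcl
resource "aws_eks_cluster" "this" {
  name     = "${var.deployment_name}-eks"
  role_arn = aws_iam_role.cluster.arn
  version  = var.eks_kubernetes_version

  # Auto Mode: EKS runs Karpenter, the EBS CSI driver, and the LB
  # Controller internally when these three blocks are enabled.
  compute_config {
    enabled       = true
    node_pools    = ["general-purpose", "system"]
    node_role_arn = aws_iam_role.node.arn
  }
  storage_config {
    block_storage {
      enabled = true
    }
  }
  kubernetes_network_config {
    elastic_load_balancing {
      enabled = true
    }
  }

  # EKS access entries instead of the aws-auth configmap.
  access_config {
    authentication_mode = "API"
  }

  # Auto Mode rejects CreateCluster if the self-managed addon
  # bootstrap is left at its default of true.
  bootstrap_self_managed_addons = false

  vpc_config {
    subnet_ids = var.private_subnet_ids
  }
}
```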
### `modules/eks-deploy/` — Kubernetes + Helm
- `kubernetes_namespace` for Braintrust workloads.
- `kubernetes_secret` (`braintrust-secrets`) with `PG_URL`, `REDIS_URL`, `FUNCTION_SECRET_KEY`, `BRAINSTORE_LICENSE_KEY`. The name and keys are hardcoded by the chart.
- `aws_eks_pod_identity_association` resources for the `braintrust-api` and `brainstore` service accounts, binding each to its IAM role from `services_common`.
- `kubernetes_manifest` for a custom `NodeClass` (Auto Mode API: `eks.amazonaws.com/v1`) and `NodePool` (`karpenter.sh/v1`) that constrain Karpenter to NVMe-backed instance families (`c8gd`, `c7gd`, `m7gd` by default, configurable via `eks_brainstore_nodepool_instance_families`). Brainstore pods pin to this NodePool via a `braintrust.dev/node-pool: brainstore` nodeSelector in helm values.
- `helm_release` for the Braintrust chart, with a thin values template that sets only what this module owns and structured per-component overrides (`eks_api_helm`, `eks_brainstore_{reader,fastreader,writer}_helm`) plus a raw-YAML `eks_helm_chart_extra_values` escape hatch.
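The Pod Identity bindings in the list above have a small Terraform surface. A sketch, with variable names assumed rather than taken from the module:

```hcl
# One association per service account; the role ARN comes from
# services_common at the root.
resource "aws_eks_pod_identity_association" "api" {
  cluster_name    = var.cluster_name
  namespace       = var.namespace
  service_account = "braintrust-api"
  role_arn        = var.api_role_arn
}
```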
### Why two submodules
`services_common` creates IAM roles shared with the non-EKS (Lambda / EC2) deployment path, so it must sit at the root between `eks_cluster` (which provides the cluster ARN used to scope Pod Identity trust policies) and `eks_deploy` (which consumes the resulting role ARNs). Wrapping both EKS submodules in a single parent would create a module-level dependency cycle through `services_common`.
## Pod Identity (not IRSA)
Auto Mode's Pod Identity Agent intercepts AWS SDK credential resolution before IRSA is consulted, so pods authenticate via Pod Identity even though the chart still emits an `eks.amazonaws.com/role-arn` annotation (the IRSA path). The module:
- Sets `enable_eks_pod_identity = true` on `services_common` and passes it the cluster ARN. `services_common` builds a trust policy with the `pods.eks.amazonaws.com` principal, scoped via session tags (`aws:RequestTag/eks-cluster-arn`, `aws:RequestTag/kubernetes-namespace`) to this specific cluster and namespace.
- Creates an `aws_eks_pod_identity_association` for each service account, binding `(cluster, namespace, service-account)` to the IAM role.
No OIDC provider, no TLS cert thumbprint management.
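The trust policy shape described above can be sketched like this, assuming illustrative variable names (the module's actual statement layout may differ):

```hcl
data "aws_iam_policy_document" "pod_identity_trust" {
  statement {
    # Pod Identity requires both actions: the agent tags the session.
    actions = ["sts:AssumeRole", "sts:TagSession"]
    principals {
      type        = "Service"
      identifiers = ["pods.eks.amazonaws.com"]
    }
    # Scope the role to one cluster and namespace via the session
    # tags the Pod Identity Agent attaches on AssumeRole.
    condition {
      test     = "StringEquals"
      variable = "aws:RequestTag/eks-cluster-arn"
      values   = [var.eks_cluster_arn]
    }
    condition {
      test     = "StringEquals"
      variable = "aws:RequestTag/kubernetes-namespace"
      values   = [var.namespace]
    }
  }
}
```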
## Changes to existing root files
- `main.tf`: `services_common` gets `enable_eks_pod_identity = true` + the EKS cluster ARN when `create_eks_cluster = true`. `database` and `redis` `authorized_security_groups` include the cluster's primary security group (which Auto Mode attaches to all nodes) in EKS mode.
- `outputs.tf`: `api_url` and `cloudfront_*` outputs resolve to the EKS CloudFront distribution in EKS mode. No other outputs added — existing output contract is otherwise unchanged.
- `variables.tf`: new EKS knobs (`create_eks_cluster`, `eks_kubernetes_version`, `eks_brainstore_nodepool_instance_families`, `helm_chart_version`, and the four structured helm-override variables plus the raw-YAML escape hatch).
- `versions.tf`: notes that `kubernetes`, `helm`, and `random` are declared in `modules/eks-deploy`. Non-EKS consumers must still declare empty provider blocks at the root because Terraform aggregates provider requirements across all submodules regardless of `count`, but the underlying resources are never evaluated when `create_eks_cluster = false`.
## Example
`examples/braintrust-data-plane-eks/` is a thin consumer — provider configuration plus a single module call — demonstrating the two-step apply workflow required on a fresh deployment:
```shell
terraform apply -target=module.braintrust.module.eks_cluster[0]
terraform apply
```
Step 1 creates the cluster so the `data.aws_eks_cluster` lookup in `provider.tf` (keyed by the statically-knowable name `${deployment_name}-eks`) can resolve. Step 2 plans the kubernetes and helm resources, including the `kubernetes_manifest` NodeClass/NodePool which require Auto Mode's CRDs to exist on the cluster before plan time.
The kubernetes and helm providers use a static token from `data.aws_eks_cluster_auth`. Step 2's runtime is well under the 15-minute token TTL because Auto Mode's in-cluster setup is fast and `helm_release` defaults to a 5-minute wait timeout.
## Contract
`CONTRACT.md` documents the coupling surface between this module and `braintrustdata/helm`: service account names, `Secret` name and keys, API port `8000`, the helm-values schema this module writes, the Pod-Identity-over-IRSA precedence, and the assumption that `brainstore.fastreader.replicas >= 1`.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`braintrustdata/helm` released `6.1.0` today. Audit of the 5.0.1 → 6.1.0 diff against this module's coupling surface (service account names, `braintrust-secrets` name and keys, API port `8000`, values-schema keys the template writes, `brainstore.{reader,fastreader,writer}.nodeSelector`) came back clean — nothing on the contract moved.
What actually changed in 6.x:
- Image tags bumped `v1.1.32` → `v2.0.0` (chart semver policy treats image major bumps as chart major bumps).
- `skipPgForBrainstoreObjects` and `brainstoreWalFooterVersion` are now top-level `values.yaml` defaults (this module was already writing them at top-level, so no template change needed).
- Chart now emits additional Brainstore env vars derived from existing values (`BRAINSTORE_RESPONSE_CACHE_URI`, `BRAINSTORE_CODE_BUNDLE_URI`, `BRAINSTORE_ASYNC_SCORING_OBJECTS`, `BRAINSTORE_LOG_AUTOMATIONS_OBJECTS`, `BRAINSTORE_WAL_USE_EFFICIENT_FORMAT`) and adds `checksum/config` annotations on deployments so pods restart when the configmap changes.
None of it requires a template, values-schema, or variable change on this side.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Aligns `examples/braintrust-data-plane-eks/` with the style of the other examples in this directory (`braintrust-data-plane/`, `braintrust-data-plane-sandbox/`): literal values in `main.tf`, `variables.tf` reduced to just the sensitive/per-deployment `brainstore_license_key`.
Before, the example had a variable for every knob it set on the module (`deployment_name`, `braintrust_org_name`, `helm_chart_version`, `eks_namespace`, `brainstore_wal_footer_version`, `skip_pg_for_brainstore_objects`), which meant a user copying the example had to wire up `.tfvars` or `-var` flags for all of them. Now the example ships with sensible defaults as literals, users edit the values directly in their copy of `main.tf`, and only the license key flows through a variable (consistent with the sandbox and production examples).
Also:
- Module block renamed `module "braintrust"` → `module "braintrust-data-plane"` to match the other examples' naming.
- `helm_chart_version = "6.1.0"` pinned as a literal.
- `eks_cluster_name` local in `provider.tf` hardcoded to `"braintrust-eks"` with a comment noting it must match `${deployment_name}-eks` from `main.tf` (the `var.deployment_name` reference was dropped along with the variable).
- Output references updated to the new module block name.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`provider.tf` needs the EKS cluster name to configure the kubernetes and helm providers' `data.aws_eks_cluster` lookup. Previously this was a hardcoded literal (`"braintrust-eks"`) with a comment asking the user to keep it in sync with `deployment_name` in `main.tf`. That split source of truth bit us in practice: changing `deployment_name` in `main.tf` without also updating `provider.tf` silently points the providers at a nonexistent cluster, and step 2 of the two-step apply fails.
Move `deployment_name` into a `locals` block at the top of `main.tf`. Terraform merges locals across files in the same module, so `provider.tf` can compute `eks_cluster_name = "${local.deployment_name}-eks"` without duplicating the string. One place to edit.
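The cross-file merge can be sketched as (values illustrative):

```hcl
# main.tf: single source of truth for the deployment name.
locals {
  deployment_name = "braintrust"
}

# provider.tf: same module, so Terraform merges locals across files
# and this can derive the cluster name without a duplicate literal.
locals {
  eks_cluster_name = "${local.deployment_name}-eks"
}
```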
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Learnings from standing up the Auto Mode deployment for the first time:
- `aws_eks_cluster`: set `bootstrap_self_managed_addons = false`. Auto Mode rejects CreateCluster otherwise, since its built-in addons conflict with the self-managed bootstrap path.
- Brainstore NodeClass: scope `subnetSelectorTerms` to this deployment's VPC via `BraintrustDeploymentName`. `kubernetes.io/role/internal-elb` alone matches subnets in other VPCs (the default VPC, other clusters in the same region), making Karpenter pick a subnet in the wrong VPC and fail RunInstances with a cross-VPC SG/subnet error.
- Brainstore NodeClass: drop custom tags. `AmazonEKSComputePolicy` gates `ec2:CreateLaunchTemplate` on a tag-key allowlist; any extra key fails the controller's IAM pre-check.
- Brainstore NodePool: switch the instance-family requirement key from `karpenter.k8s.aws/instance-family` to `eks.amazonaws.com/instance-family`. Auto Mode restricts requirement domains and the `karpenter.k8s.aws` one isn't accepted.
- `helm_release` timeout: bump to 1200s. Cold first deploys take longer than the 300s default (Karpenter node provisioning + three large Brainstore image pulls + readiness).
- VPC private-subnet lifecycle: `ignore_changes` on the `kubernetes.io/role/internal-elb` tag so Terraform doesn't fight `aws_ec2_tag` (from `modules/eks-cluster`) on every apply.
- Example `provider.tf`: switch kubernetes/helm auth from the 15-minute static `aws_eks_cluster_auth` token to `exec { aws eks get-token }` so long applies and extended approval-prompt pauses don't fail with expired-token errors.
- Example `main.tf`: expand the two-step-apply doc comment with the zsh-globbing caveat, and explain the 400 GB gp3 IOPS/throughput threshold that trips up smaller (sandbox) `postgres_storage_size` values.
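The exec-auth switch in the provider.tf learning looks roughly like this; the cluster attributes here are illustrative local references:

```hcl
# Exec-based auth: the provider shells out for a fresh token on each
# use instead of carrying one 15-minute static token for the whole run.
provider "kubernetes" {
  host                   = local.cluster_endpoint
  cluster_ca_certificate = base64decode(local.cluster_ca_data)

  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", local.eks_cluster_name]
  }
}
```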
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto Mode's Load Balancer Controller uses the `ip` target-type for NLBs, which sends health checks and traffic directly to pod IPs on the container port (8000) — not via the NodePort on a node IP. When the NLB SG is pre-created and attached via the `aws-load-balancer-security-groups` annotation, the controller only opens the NodePort range on the cluster SG (the rule it would need for `instance` target-type) and leaves the container port unreachable. Result: TCP health checks time out, the target group stays unhealthy, the NLB has no backends, and CloudFront hangs.

Fix: replace the NodePort-range rule (30000-32767) with a single TCP 8000 rule from the NLB SG to the cluster SG. NodePort wasn't being used by the `ip` target-type path anyway, so removing it is safe and avoids carrying a misleading rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
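The replacement rule can be sketched as follows, with the resource references being illustrative:

```hcl
# Allow the NLB to reach pod IPs directly on the container port,
# which is what the ip target-type actually uses.
resource "aws_vpc_security_group_ingress_rule" "nlb_to_pods" {
  security_group_id            = aws_eks_cluster.this.vpc_config[0].cluster_security_group_id
  referenced_security_group_id = aws_security_group.nlb.id
  ip_protocol                  = "tcp"
  from_port                    = 8000
  to_port                      = 8000
  description                  = "NLB to pod IPs on the container port (ip target-type)"
}
```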
The LB Controller names the TargetGroups auto-generated from the Service as `k8s-<ns-8>-<svc-8>-<hash>`, and doesn't expose an override. For a Braintrust dataplane the namespace and service names are fixed (`braintrust`/`braintrust-api`), so every deployment in an AWS account ends up with TGs named `k8s-braintru-braintru-*` — visually indistinguishable in the console even though they're functionally isolated by the controller's cluster-scoping tag.

Add the `aws-load-balancer-additional-resource-tags` annotation so the controller tags its TGs (and listeners) with `BraintrustDeploymentName`, matching the tag scheme we already use on Terraform-owned resources. Now `tag:BraintrustDeploymentName` is a reliable way to identify all AWS resources belonging to a specific dataplane deployment. `deployment_name` is wired into the helm-values template to pass it through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The example only showed api and brainstore writer overrides; reader and fastreader were undocumented even though they have the same structured override variables. Add them so all four chart components have a copy-pasteable sandbox sizing example. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two blockers made the initial EKS deploy require a `-target`'d two-step apply:

1. The example's `provider.tf` looked up the cluster via `data.aws_eks_cluster`, which reads at refresh (pre-plan) and fails if the cluster doesn't exist yet. There's no way to defer a data source read through the initial plan.
2. The NodeClass and NodePool were delivered via `kubernetes_manifest`, which reads CRD schemas from the live cluster at plan time to validate the manifest. On a fresh deploy the cluster doesn't exist and the plan fails.

Neither has to be this way:

1. Expose the cluster endpoint, CA data, and name as root-module outputs. The example's `provider.tf` reads those instead of the data source. Terraform treats module outputs that trace back to unknown resource attributes as "known after apply" and defers provider resolution — no data source, no refresh-time failure.
2. Replace `kubernetes_manifest` for the NodeClass + NodePool with a `helm_release` pointing at a tiny local chart (`modules/eks-deploy/charts/brainstore-nodepool/`). Helm renders templates locally and applies at apply time, so there's no plan-time cluster contact.

Result: a single `terraform apply` from an empty AWS account brings up everything — VPC, cluster, RDS, Redis, S3, IAM, NodeClass/NodePool, Braintrust Helm release — in one command.

Tradeoff we accepted: if the cluster is destroyed out of band while Terraform state still references in-cluster resources, refresh will fail because the cluster outputs become unreadable. Recovery is `terraform state rm` of the `kubernetes_*`/`helm_release` resources followed by `terraform apply`. In-band `terraform destroy` is handled correctly by the dependency graph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
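The local-chart delivery can be sketched as follows; the chart path comes from the commit message, while the value plumbing is assumed:

```hcl
# Helm renders the NodeClass/NodePool templates locally, so nothing
# contacts the cluster at plan time (unlike kubernetes_manifest).
resource "helm_release" "brainstore_nodepool" {
  name      = "brainstore-nodepool"
  chart     = "${path.module}/charts/brainstore-nodepool"
  namespace = var.namespace

  set {
    name  = "instanceFamilies"
    value = join(",", var.eks_brainstore_nodepool_instance_families)
  }
}
```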
The EKS CloudFront distribution was unconditionally routing `/function/*`, `/v1/proxy*`, and `/v1/eval*` to the CloudflareProxy origin (braintrustproxy.com). For a self-hosted dataplane this is wrong on two counts:

- Request payloads round-trip through Braintrust's hosted proxy rather than staying inside the customer's AWS account — defeating a core reason for self-hosting.
- The preflight OPTIONS that browsers send for these paths hits a Cloudflare 404 with no CORS headers, so the UI (braintrust.dev) fails every cross-origin request to these paths.

Fix: default `target_origin_id` for those path patterns to `EKSAPIOrigin` (the in-cluster API pod via the NLB — standalone-api serves these paths in Dataplane 2.0). This mirrors the Lambda ingress module's default behavior, where paths route to the local AIProxy Lambda unless `use_global_ai_proxy = true`. Expose the same `use_global_ai_proxy` toggle so both modes have identical semantics — opt in to braintrustproxy.com if Braintrust instructs, otherwise stay local. The root-level `var.use_global_ai_proxy` already existed (shared with the Lambda path); this wires it through to the EKS cluster submodule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The example's `locals { deployment_name = ... }` block only existed so
provider.tf could derive the EKS cluster name from it without a
duplicate literal. Since provider.tf now reads the cluster name from
module outputs directly (`module.braintrust-data-plane.eks_cluster_name`),
the local has no remaining cross-file use. Fold the constant back into
the module call and carry the comment on the attribute instead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Observed failure mode: `terraform destroy` freezes for ~5 minutes on
helm_release.braintrust because the LB Controller holds the
`service.eks.amazonaws.com/resources` finalizer on the api Service
while it waits for target-group drain to complete. The default
deregistration delay is 300s. In failure-mode states (cluster never
had nodes register, a failed helm install, pods never reached Ready)
the drain wait is spent on nothing — there are no targets to drain —
but LB Controller respects it anyway. To the operator, `terraform
destroy` looks hung; the hang resolves only after a manual
`kubectl patch svc ... --patch '{"metadata":{"finalizers":null}}'`.
Hit this class three times now: yesterday on the 2nd deployment, and
today on both a failed redux apply and the intentional destroy of the
2nd deployment.
Fix: annotate the api Service with
`aws-load-balancer-target-group-attributes: deregistration_delay.timeout_seconds=0`.
Zero drain wait means the NLB deregisters targets instantly, the
finalizer clears immediately, and `helm uninstall` (and therefore
`terraform destroy`) converges in seconds.
Safe for production: the drain delay exists to let in-flight
connections finish before a target is removed. For a stateless HTTP
API fronted by CloudFront (which retries on connection failure), a
few aborted connections on scale-in or destroy are acceptable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `use_global_ai_proxy` toggle was carried over from the Lambda ingress module, where it exists to let Braintrust's own multi-tenant SaaS deployment route through braintrustproxy.com instead of the local AIProxy Lambda. For self-hosted customers there's no reason to route through Braintrust's hosted proxy — doing so defeats the point of self-hosting and requires Braintrust-side registration of the customer's deployment to work at all. Hardcode the LLM-proxy path routing to the in-cluster API (`EKSAPIOrigin`), remove the `CloudflareProxy` origin from the distribution entirely, and drop the `use_global_ai_proxy` variable from the EKS cluster submodule. Lambda mode is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`terraform destroy` on a non-empty dataplane currently requires two manual cleanup steps before it'll succeed:

1. The RDS instance has `deletion_protection = true` by default. Destroy errors with `InvalidParameterCombination: Cannot delete protected DB Instance`. Fix today: `aws rds modify-db-instance --no-deletion-protection --apply-immediately` out of band.
2. The S3 buckets are versioned and non-empty (especially the Brainstore bucket, which accumulates WAL + cache). Destroy errors with `BucketNotEmpty: The bucket you tried to delete is not empty. You must delete all versions`. Fix today: write a loop against `list-object-versions` + `delete-objects` for every bucket.

For real customer deployments this safety is the right default — it prevents accidental data loss on a typo'd destroy. For sandbox / CI / throwaway deployments the friction is painful.

New root variable `force_destroy_data` (default: false). When true:

- Every S3 bucket gets `force_destroy = true`, so destroy empties the bucket (all versions + delete markers) before deleting it.
- RDS `deletion_protection` is disabled (OR'd with the existing `DANGER_disable_database_deletion_protection` toggle).
- RDS `skip_final_snapshot = true`, so destroy doesn't block on snapshot creation.

The default stays false, so existing consumers are unaffected. Sandbox users set `force_destroy_data = true` in their example `main.tf` and subsequent destroys are a single command.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restore a set of root-module outputs that were previously pruned. They're broadly useful for consumers wiring this module into larger deployments — IAM role ARNs (for Pod Identity / IRSA references), Postgres and Redis connection details (for Kubernetes Secret construction from the root module's state), S3 bucket names (for downstream IAM policy templates), and EKS NLB identifiers.

Omitted three outputs that appeared in earlier iterations but don't apply to Auto Mode:

- `eks_oidc_provider_arn` — Auto Mode uses Pod Identity; there's no OIDC provider resource.
- `eks_node_security_group_id` — we don't create a dedicated node SG; Auto Mode attaches the cluster SG to nodes. Expose `eks_cluster_security_group_id` instead.
- `eks_lb_controller_role_arn` — Auto Mode owns the LB Controller; there's no IAM role to expose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Terraform state already stores these values; exposing them as root outputs amplifies the blast radius:

- `terraform_remote_state` consumers pull values into a second state file.
- `terraform output -json` in CI pipelines writes them to stdout/logs unredacted (`sensitive = true` only suppresses plaintext at the CLI; it doesn't scrub downstream logging).

Removed:

- `postgres_database_password` — the database module already creates a Secrets Manager secret; consumers can resolve credentials via `postgres_database_secret_arn` (still exposed), which is the canonical path.
- `function_tools_secret_key` — a Braintrust-internal encryption key used only by our own `kubernetes_secret.braintrust`. External consumers have no legitimate need for it.

`eks_cluster_ca_certificate_data` stays — it's marked sensitive upstream but is a public CA cert by definition, and our own `provider.tf` consumes it from module outputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Too dangerous a footgun to keep in the module. A consumer who accidentally flags `force_destroy_data = true` (or leaves it on after a test destroy and then starts using the deployment for real) would have no safety rails — all customer data evaporates on the next `destroy` with no final snapshot and no S3 version retention. Sandbox teardown friction is real but narrowly felt (just the TF module authors); the risk is broadly felt (every consumer). Prefer operators to run the same `aws s3api` version-delete loop and `aws rds modify-db-instance --no-deletion-protection` ceremony we ran during development — it's slower but impossible to trigger by accident. This reverts commit 1c350b2d; PR description updated separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove `data "aws_region" "current"` from `modules/eks-deploy/main.tf`; it was imported in the early days of the module and never actually referenced in any rendered value.
- Remove the `custom_tags` variable from `modules/eks-deploy/variables.tf` and its unused pass-through in `eks.tf`. The eks-deploy submodule doesn't own any AWS resources (only Kubernetes + `helm_release`), so custom AWS tags have no effect there.

Also incorporates the `terraform fmt -recursive` whitespace fixes on `main.tf`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the four structured per-component variables
(`eks_api_helm`, `eks_brainstore_{reader,fastreader,writer}_helm`)
and the `eks_helm_chart_extra_values` heredoc string with a single
`eks_helm_values_file` variable pointing at a YAML file alongside
the caller's main.tf.
Why:
- The four structured variables only covered two specific fields
(replicas, resources) on four specific components. Anyone tweaking
anything else (annotations, probes, env, image pins, nodeSelector)
was forced into the heredoc escape hatch. The structured-vars
abstraction was a half-abstraction.
- Helm's native interface is "a list of values files." Collapsing to
"module defaults + one caller-supplied values file" matches the
mental model customers already have from `helm install -f values.yaml`.
- Heredocs in HCL are unwieldy — no YAML lint, no IDE support, not
shareable between deployments. A separate `.yaml` file fixes all
three.
Mechanics: the submodule accepts a filename (not a `file()` result).
Path is interpreted by `file()` inside the submodule — use
`${path.module}/values.yaml` or an absolute path on the caller side.
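Caller-side usage might look like this; the module source and surrounding arguments are assumptions:

```hcl
module "braintrust-data-plane" {
  source = "braintrustdata/braintrust-data-plane/aws"

  create_eks_cluster = true

  # Interpolated to an absolute path here, then read with file()
  # inside the eks-deploy submodule.
  eks_helm_values_file = "${path.module}/values.yaml"
}
```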
Also adds `examples/braintrust-data-plane-eks-sandbox/` — a cheap
disposable-sandbox variant of the existing EKS example. Smaller RDS
(`db.r8g.large` / 100GB / gp3 baseline), smaller Redis
(`cache.t4g.small`), and a `values.yaml` that shrinks every chart
component to 1 replica with tight CPU/memory so the whole dataplane
fits on a single small Karpenter-provisioned node. Matches the
existing `braintrust-data-plane` / `braintrust-data-plane-sandbox`
pattern elsewhere in the examples directory.
Removes `modules/eks-deploy/overrides.tf` (the locals that
synthesized YAML from the structured variables — no longer needed).
CONTRACT.md updated to point at `eks_helm_values_file` for the
fast-reader opt-out warning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before-GA doc coverage items that came up in review prep:

- `modules/eks-cluster/README.md`: a terse "what this submodule owns" overview + outputs table + key variables, so readers browsing the module on GitHub or the registry have a landing page rather than raw `.tf`.
- `modules/eks-deploy/README.md`: same for the K8s/Helm layer, plus an explanation of the in-repo `brainstore-nodepool` chart (why we use `helm_release` instead of `kubernetes_manifest` for the NodeClass + NodePool) and the helm-values merge precedence.
- `CONTRACT.md` "Deployment isolation" section: an explicit note that `deployment_name` must be unique per account+region. Enumerates the resources that would collide, confirms multiple deployments with distinct names are supported and validated, and points at the cosmetic LB Controller TG-name overlap (disambiguated via the `BraintrustDeploymentName` tag).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two discoverability gaps that make the EKS mode hard to find and recover from, surfaced during PR review prep:

- The root `README.md` previously didn't mention EKS mode at all. Readers browsing the module on GitHub or the registry would have no signal that the `create_eks_cluster = true` path exists. Added a one-paragraph subsection under "How to use this module" pointing at the prod + sandbox examples and the new `TROUBLESHOOTING.md`.
- `TROUBLESHOOTING.md` promotes the EKS-mode recovery ritual from the PR description (where it would evaporate after merge) into a durable operator-facing doc. Covers the four failure modes we actually hit during development: out-of-band cluster deletion + state-rm recovery, `helm_release` destroy hanging on the Service finalizer, EIP quota exhaustion on fresh apply, and pods stuck Pending due to broken NAT.

Also notes that the existing Lambda-mode `dump-logs.sh` script does not cover EKS mode (observability parity is a tracked follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four .md-audit-pass fixes and one structure change:

- `CONTRACT.md`: "the four NLB annotations" → an explicit list of the six now present (`-scheme`, `-type`, `-security-groups`, `-name`, `-additional-resource-tags`, `-target-group-attributes`). Matches the current `helm-values.yaml.tpl`.
- `CONTRACT.md`: the deployment-isolation section dropped a dangling "See the 'TG naming' follow-up in the PR description" reference; the paragraph now explains the cosmetic collision + tag-disambiguation story inline, so it survives PR merge.
- `TROUBLESHOOTING.md`: dropped a dangling "See the PR description's 'Remaining challenges' section" reference in the `dump-logs.sh` note; the observability gap is now described inline.
- `README.md` (`dump-logs.sh` section): added a note that the script covers only the Lambda/EC2 deployment mode, pointing at `TROUBLESHOOTING.md` + `RECOVERY.md` for EKS-mode runbooks.

Plus: the out-of-band-cluster-deletion runbook is promoted from a buried section in `TROUBLESHOOTING.md` to its own top-level `RECOVERY.md`. It's a disaster-recovery scenario (a state mismatch requiring state-level intervention), distinct from the routine apply/destroy failures that `TROUBLESHOOTING.md` collects. Cross-refs between the two docs redirect readers who land on the wrong one. `RECOVERY.md` also includes a "why the module accepts this failure mode" note explaining the single-apply-bootstrap tradeoff that makes this scenario possible. The README's EKS-mode signal now points at both `TROUBLESHOOTING.md` and `RECOVERY.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Status: implemented and validated end-to-end on the erikdw-sandbox-5 teardown, but unclear whether to ship in this PR. The chart-level annotation already added in fc11624 covers the same drain-wait finalizer hang for fresh deploys; this adds a redundant module-level preflight that catches the failure when the chart annotation didn't propagate (older chart, manual override, broken state). Decide before merge whether the broader coverage is worth the extra surface area.

What it does, when `var.prepare_for_destroy = true` (default false):

- `kubernetes_annotations.api_drain_zero` forces the api Service annotation `service.beta.kubernetes.io/aws-load-balancer-target-group-attributes` to `deregistration_delay.timeout_seconds=0`. Same key the chart template (`helm-values.yaml.tpl`) already sets — this resource only matters if the live annotation drifted.
- `terraform_data.api_tg_drain_zero` loops over every TargetGroup tagged `BraintrustDeploymentName=<deployment_name>` and calls `aws elbv2 modify-target-group-attributes` to set the same attribute directly. This is a faster path than the LB Controller's reconcile loop, and covers the case where the controller created the TG before our annotation propagated.

With the drain wait at zero, the LB Controller releases its `service.eks.amazonaws.com/resources` finalizer the moment helm uninstall deletes the Service, finishes its own TG cleanup, and `helm_release.braintrust` returns in seconds. No kubectl-patch workaround needed, and no orphan TGs left in AWS.

Why this exists: on the erikdw-sandbox-5 teardown, `terraform destroy` hung on `helm_release.braintrust` for ~10 minutes (past the default 5-min drain timer, suggesting the chart annotation never made it to the live TG on chart 6.1.0). A manual `kubectl patch svc braintrust-api ... finalizers:null` unblocks the destroy but interrupts the LBC mid-cleanup, leaving an orphan TG (`k8s-braintru-braintru-*` tagged with the deployment name). `prepare_for_destroy` avoids both problems.
Scope is service infra only. Data-bearing resources keep their separate, explicit knobs:

- RDS: DANGER_disable_database_deletion_protection (existing)
- S3: deliberately not destroyable from TF — emptying buckets is a manual operator step before destroy. No DANGER_* flag, no force_destroy var. Matches the prior 398f997 revert of `force_destroy_data`.

Files:

- variables.tf: add `prepare_for_destroy` (root)
- eks.tf: plumb through to module.eks_deploy
- modules/eks-deploy/variables.tf: declare the var
- modules/eks-deploy/main.tf: add `data.aws_region.current`, `kubernetes_annotations.api_drain_zero`, and `terraform_data.api_tg_drain_zero` (both gated by count)
- TROUBLESHOOTING.md: prepare_for_destroy is now the documented happy path; the manual kubectl-patch runbook stays as the in-flight recovery, with a tag-driven cleanup snippet for the orphan TG that workaround leaves behind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Add EKS Auto Mode deployment mode
## Summary

Adds a `create_eks_cluster = true` deployment mode that provisions a complete Braintrust dataplane on EKS Auto Mode instead of the existing Lambda + EC2 path. All Braintrust workloads (API, brainstore reader / fastreader / writer) run in-cluster as pods and are deployed via the Braintrust Helm chart.

When enabled, the module owns the EKS Auto Mode cluster (`${deployment_name}-eks`), the cluster and node IAM roles, the pre-created NLB + CloudFront distribution, and the Braintrust-specific Kubernetes objects. Everything else (VPC, RDS, ElastiCache, S3, KMS, API/Brainstore IAM) is shared with the existing Lambda/EC2 path.

The Lambda, EC2 Brainstore, and Lambda-URL ingress submodules are disabled in this mode (gated by `use_deployment_mode_external_eks = true`, which `create_eks_cluster = true` requires).

## Why EKS Auto Mode
Auto Mode lets AWS manage the control plane add-ons (VPC CNI, CoreDNS, kube-proxy, Pod Identity Agent, AWS Load Balancer Controller, EBS CSI driver) and node lifecycle (via a managed Karpenter). The same capabilities on self-managed EKS mean owning the install, IAM configuration, version-compatibility matrix, and upgrade choreography for each addon — plus whichever subset of {Karpenter, Pod Identity Agent, EBS CSI, metrics-server} matches your feature choices. None of it is individually hard; collectively it's real recurring work on every cluster upgrade. Auto Mode hands that coordination surface to AWS in exchange for the managed-mode premium and some lost flexibility. This module therefore uses Auto Mode exclusively rather than self-managed EKS or the `terraform-aws-modules/eks` community module.

## Usage

### Minimal example

See `examples/braintrust-data-plane-eks/` for the production-sized canonical config, or `examples/braintrust-data-plane-eks-sandbox/` for a cheap disposable sandbox variant (smaller RDS, Redis, and a `values.yaml` alongside that shrinks the chart components to 1-replica with tight CPU/memory). The shortest working invocation:
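A sketch of that shortest invocation, assuming the module is consumed from the repo root; the `source` path, `deployment_name`, and chart version shown here are illustrative placeholders, not the example's exact contents:

```hcl
module "braintrust-data-plane" {
  source = "../.."            # illustrative: path to this module in your layout

  deployment_name = "sandbox" # illustrative

  # Enable the EKS Auto Mode deployment mode.
  create_eks_cluster               = true
  use_deployment_mode_external_eks = true # required by create_eks_cluster

  # No default: chart upgrades are always deliberate.
  helm_chart_version = "6.1.0"
}
```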
### Single-apply bootstrap

`terraform apply`. One command. Cold first-deploy runtime is ~15 minutes (cluster ~8-10, then RDS + Redis + Helm release). Subsequent applies are incremental.

Two design choices make the one-command path work:

- **Provider config from module outputs, not `data.aws_eks_cluster`.** The example's `provider.tf` reads `module.braintrust-data-plane.eks_cluster_endpoint`, `eks_cluster_ca_certificate_data`, and `eks_cluster_name` directly off the module. Terraform treats these as "known after apply" on the first run and defers provider resolution until the cluster exists. A data source, by contrast, reads at refresh (pre-plan) and would fail on a fresh deploy — that was the reason the first iteration of this module required a `-target`'d two-step apply.
- **NodePool delivered via `helm_release`, not `kubernetes_manifest`.** `kubernetes_manifest` reads CRD schemas from the live cluster at plan time to validate manifests, which fails on a fresh deploy; Helm renders templates locally and applies at apply time, with no plan-time cluster dependency. The CRDs live in an in-repo chart at `modules/eks-deploy/charts/brainstore-nodepool/`.

## Architecture
### Module layout

New submodules:

- `modules/eks-cluster/` — EKS cluster, cluster + node IAM roles, NLB pre-creation, CloudFront VPC Origin wiring, CloudFront distribution.
- `modules/eks-deploy/` — the Kubernetes / Helm layer: namespace, `braintrust-secrets` Secret, Pod Identity associations, the `brainstore-nodepool` helm release (NodeClass + NodePool), and the `braintrust` helm release itself.

New in-repo Helm chart: `modules/eks-deploy/charts/brainstore-nodepool/` — a tiny chart with just two templates (NodeClass + NodePool). Not published anywhere; it lives with the Terraform source so the module is self-contained.

Top-level `eks.tf` wires the three submodules together. Root-level `main.tf` is touched only lightly (for `services_common` to receive the EKS cluster ARN for Pod Identity trust scoping).

### Module ordering
1. `eks_cluster` provisions the cluster and exports its ARN.
2. `services_common` builds the API + Brainstore IAM roles with Pod Identity trust policies scoped to `(cluster_arn, namespace, service_account)`.
3. `eks_deploy` creates the Pod Identity associations binding SAs to roles, plus the namespace / Secret / brainstore-nodepool chart / Braintrust helm release.

This is why the EKS layer is split into two submodules rather than one: `services_common` is also used by the non-EKS path, so it can't live inside `eks_deploy`, and the role ARNs it produces are consumed by `eks_deploy`, so `services_common` can't live inside `eks_cluster`.
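The trust scoping in step 2 can be sketched roughly as follows. This is an illustrative `aws_iam_policy_document`, not the one `services_common` actually emits: the condition keys follow AWS's documented Pod Identity session tags and confused-deputy guidance, and `var.cluster_arn` plus the literal namespace/SA values are placeholders.

```hcl
data "aws_iam_policy_document" "pod_identity_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole", "sts:TagSession"]

    principals {
      type        = "Service"
      identifiers = ["pods.eks.amazonaws.com"]
    }

    # Confused-deputy guard: only this cluster may assume the role.
    condition {
      test     = "ArnEquals"
      variable = "aws:SourceArn"
      values   = [var.cluster_arn] # illustrative variable
    }

    # Pod Identity passes namespace/SA as session tags; scope to both.
    condition {
      test     = "StringEquals"
      variable = "aws:RequestTag/kubernetes-namespace"
      values   = ["braintrust"]
    }
    condition {
      test     = "StringEquals"
      variable = "aws:RequestTag/kubernetes-service-account"
      values   = ["braintrust-api"]
    }
  }
}
```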
## Key design decisions

**Pod Identity, not IRSA.** Auto Mode ships the Pod Identity Agent preinstalled. Pod Identity uses simpler trust policies, supports session tags, and doesn't require an OIDC provider. The module creates `aws_eks_pod_identity_association` resources for both the `braintrust-api` and `brainstore` service accounts. The chart still writes an IRSA-style `eks.amazonaws.com/role-arn` annotation on the service accounts; this is harmless because Pod Identity intercepts AWS SDK credential resolution before IRSA is consulted.

**Pre-created NLB adopted by the Load Balancer Controller.** The CloudFront VPC Origin needs the NLB ARN at plan time, but the Load Balancer Controller normally creates NLBs on demand when a Service becomes `type: LoadBalancer`. The module pre-creates the NLB in Terraform (`aws_lb.api`), and the chart's Service uses the `service.beta.kubernetes.io/aws-load-balancer-name` + `aws-load-balancer-security-groups` annotations to have the controller adopt the existing NLB rather than create a new one. Security groups can only be attached to an NLB at creation time, which is why the NLB SG is also owned by Terraform.

**Custom Brainstore NodePool.** Brainstore caches to local NVMe SSD via `emptyDir`, so its pods need NVMe-backed EC2 families (`c8gd`, `c7gd`, `m7gd`, etc.). Auto Mode's default `general-purpose` NodePool doesn't constrain to those families, so the module adds a custom NodeClass + NodePool that does, and Brainstore pods target it via the `braintrust.dev/node-pool: brainstore` nodeSelector injected into the Helm values.

**NodePool delivered via `helm_release`, not `kubernetes_manifest`.** `kubernetes_manifest` reads CRD schemas from the live cluster at plan time. That's incompatible with single-apply bootstrap because the cluster doesn't exist yet on the first plan. Wrapping the two manifests in a tiny local Helm chart moves the cluster contact to apply time. The rendered objects are structurally identical to what `kubernetes_manifest` produced — verified by rendering the chart and diffing field-by-field against the old values, including the tricky `aws:eks:cluster-name` colon-key.

**Provider config from module outputs, not `data.aws_eks_cluster`.** The example's `provider.tf` reads `eks_cluster_endpoint`, `eks_cluster_ca_certificate_data`, and `eks_cluster_name` directly off the module. Terraform treats module outputs that trace back to "known after apply" resource attributes as unknown at plan time and defers provider resolution until the cluster exists. A data source reads at refresh (pre-plan) and would fail on a fresh deploy.

**Exec auth for the Kubernetes/Helm providers, not static tokens.** The example's `provider.tf` uses `exec { aws eks get-token }` rather than the simpler `aws_eks_cluster_auth` data source. The static-token pattern expires after 15 minutes — short enough to fail if an apply sits at an approval prompt or if the operator walks away between `terraform plan` and `terraform apply`. Exec auth refreshes on every API call and requires only the AWS CLI on the runner (which consumers need anyway, for `aws eks update-kubeconfig`).
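Combining the last two decisions, the example's `provider.tf` looks along these lines. A sketch, not the exact file: the output names are the module's, but the wiring is an approximation.

```hcl
provider "kubernetes" {
  host                   = module.braintrust-data-plane.eks_cluster_endpoint
  cluster_ca_certificate = base64decode(module.braintrust-data-plane.eks_cluster_ca_certificate_data)

  # Exec auth: a fresh token on every API call, no 15-minute expiry.
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", module.braintrust-data-plane.eks_cluster_name]
  }
}

provider "helm" {
  kubernetes {
    host                   = module.braintrust-data-plane.eks_cluster_endpoint
    cluster_ca_certificate = base64decode(module.braintrust-data-plane.eks_cluster_ca_certificate_data)
    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      command     = "aws"
      args        = ["eks", "get-token", "--cluster-name", module.braintrust-data-plane.eks_cluster_name]
    }
  }
}
```

Because every attribute traces back to the module's "known after apply" outputs, provider resolution is deferred until the cluster exists, which is what makes the single-apply bootstrap possible.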
## Destroy choreography

Tearing down an EKS-mode deployment is a two-step apply→destroy:

1. Set `prepare_for_destroy = true` and run `terraform apply`.
2. Run `terraform destroy`.

The preflight resources live in `modules/eks-deploy/main.tf` behind `count = var.prepare_for_destroy ? 1 : 0`:

- `kubernetes_annotations.api_drain_zero` patches `service.beta.kubernetes.io/aws-load-balancer-target-group-attributes` on the api Service to `deregistration_delay.timeout_seconds=0`. Belt-and-suspenders: the chart's `helm-values.yaml.tpl` already sets this, but this resource forces the live annotation back to the right value if it ever drifted (older chart, manual override, broken state).
- `terraform_data.api_tg_drain_zero` calls `aws elbv2 modify-target-group-attributes` directly on every TargetGroup tagged `BraintrustDeploymentName=<deployment_name>`. A faster path than waiting for the LB Controller's reconcile loop to propagate the annotation, and it works even on TGs created before the annotation was set.

With drain wait at zero, the LB Controller releases its `service.eks.amazonaws.com/resources` finalizer the moment helm uninstall deletes the api Service, finishes its own TG cleanup, and `helm_release.braintrust` returns in seconds. No `kubectl patch` workarounds, no orphan TargetGroups left in AWS.

Why this exists: in earlier sandbox tear-downs, `terraform destroy` froze for ~5 min on the helm_release while the LBC waited out the default 300s drain timer. The manual workaround (`kubectl -n braintrust patch svc braintrust-api --type merge -p '{"metadata":{"finalizers":null}}'`) unblocks the destroy but interrupts the controller mid-cleanup, leaving an orphan TG behind. `prepare_for_destroy` is the supported alternative.

Scope: service infra only. Data-bearing resources have separate, explicit knobs:

- `DANGER_disable_database_deletion_protection = true` (existing) flips the RDS `deletion_protection` attribute. Required for `terraform destroy` to remove the database; intentionally not bundled into `prepare_for_destroy`.
- S3 buckets get no `force_destroy` toggle, no DANGER_* flag. If you need to tear a sandbox down completely, empty the buckets manually first (`aws s3 rm s3://<bucket> --recursive`, then `aws s3api delete-objects` for non-current versions and delete-markers if versioning was enabled). The cost of a stray destroy hitting a data bucket is too high to mitigate with a flag.
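The count-gating pattern behind the preflight resources above can be sketched as follows. The real AWS CLI loop in `modules/eks-deploy/main.tf` is elided here; the `triggers_replace` wiring and the provisioner body are illustrative assumptions, not the module's actual code.

```hcl
resource "terraform_data" "api_tg_drain_zero" {
  # The resource exists only while prepare_for_destroy is flipped on.
  count = var.prepare_for_destroy ? 1 : 0

  triggers_replace = [var.deployment_name] # illustrative: re-run per deployment

  provisioner "local-exec" {
    # Illustrative placeholder: the real command enumerates every TargetGroup
    # tagged BraintrustDeploymentName=<deployment_name> and zeroes its drain
    # delay via `aws elbv2 modify-target-group-attributes`.
    command = "echo preflight"
  }
}
```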
## New variables

All defaulted except `helm_chart_version` when `create_eks_cluster = true`.

| Variable | Default | Notes |
|---|---|---|
| `create_eks_cluster` | `false` | Enables this mode; requires `use_deployment_mode_external_eks = true`. |
| `eks_kubernetes_version` | `"1.31"` | |
| `eks_brainstore_nodepool_instance_families` | `["c8gd", "c7gd", "m7gd"]` | |
| `helm_chart_version` | `null` | Required when `create_eks_cluster = true`. No default so chart upgrades are always deliberate. |
| `eks_helm_values_file` | `null` | Set `eks_helm_values_file = "${path.module}/values.yaml"` so the file lives alongside your main.tf. Leave null to accept chart defaults. See the `braintrust-data-plane-eks-sandbox` example for sandbox-sized values. |
| `prepare_for_destroy` | `false` | Preflight for `terraform destroy`. Flip true, apply, then destroy. Zeroes deregistration_delay on the LB Controller's TargetGroup(s) so the finalizer doesn't hang `helm_release.braintrust` on destroy and the controller cleans up its own TGs (no orphans). EKS-mode only. See Destroy choreography above and `TROUBLESHOOTING.md`. |
## New module outputs

The three starred outputs below are required by the example's `provider.tf` to configure the kubernetes/helm providers from module outputs instead of a `data.aws_eks_cluster` lookup — which is what enables single-apply bootstrap. The rest are broadly useful for downstream consumers wiring this module into larger deployments (IAM references for external Pod Identity associations, Postgres/Redis connection details for downstream Kubernetes Secret construction, S3 bucket names for downstream IAM policy templates, NLB identifiers, etc.).

- ★ `eks_cluster_name` — the `aws eks get-token` exec arg in provider.tf.
- ★ `eks_cluster_endpoint` — the provider `host`.
- ★ `eks_cluster_ca_certificate_data` — the provider `cluster_ca_certificate` (after `base64decode()`).
- `eks_cluster_security_group_id`
- `eks_nlb_arn`
- `eks_nlb_name` — matches the `aws-load-balancer-name` annotation.
- `nlb_security_group_id`
- `code_bundle_bucket_id`
- `lambda_responses_bucket_id`
- `postgres_database_address`
- `postgres_database_port`
- `redis_endpoint`
- `redis_port`
- `api_handler_role_arn` — role for the `braintrust-api` service account.
- `brainstore_iam_role_arn` — role for the `brainstore` service account (also the EC2 role on the EC2-Brainstore path).

★ = required for the single-apply `provider.tf` pattern.

## Module ↔ Helm chart contract
The module and chart are tightly coupled — several names, ports, keys, and paths have to match exactly on both sides. The full list is documented in `CONTRACT.md` (tested chart version `6.1.0`, supported range `6.x`). Highlights:

- the `braintrust-secrets` Secret and its keys (`PG_URL`, `REDIS_URL`, `FUNCTION_SECRET_KEY`, `BRAINSTORE_LICENSE_KEY`)
- the service account names (`braintrust-api`, `brainstore`) used in both the Pod Identity associations and the chart
- API port `8000` — used by the CloudFront VPC Origin and by the cluster SG ingress rule that admits NLB traffic
- the `aws-load-balancer-*` service annotations the controller reads to adopt our pre-created NLB, plus `aws-load-balancer-additional-resource-tags` for deployment-scoped tagging of controller-created resources
- the nodeSelector label `braintrust.dev/node-pool: brainstore`

Any of these moving or renaming on the chart side breaks us, often silently. Drift detection between the module and chart is a follow-up (below).
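Chart-side, the adoption annotations look roughly like this on the api Service. The annotation keys are the AWS Load Balancer Controller's documented ones; the concrete values and the exact shape of the chart's rendered Service are placeholders, not a verbatim excerpt.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: braintrust-api
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-scheme: internal
    # Adopt the Terraform-created NLB instead of creating a new one.
    service.beta.kubernetes.io/aws-load-balancer-name: <pre-created NLB name>
    service.beta.kubernetes.io/aws-load-balancer-security-groups: <NLB SG id>
    service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags: BraintrustDeploymentName=<deployment_name>
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: deregistration_delay.timeout_seconds=0
spec:
  type: LoadBalancer
  ports:
    - port: 8000
      targetPort: 8000
```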
## Tradeoffs accepted with single-apply bootstrap
Single-apply is strictly an improvement if the target audience can handle the failure modes below. For Braintrust's self-hosted data plane audience (sophisticated operators), the judgment is that it is. For less-experienced consumers, two-step would have been safer. This PR accepts the tradeoff.
- Out-of-band cluster deletion (`aws eks delete-cluster`) breaks `terraform plan` at refresh — the provider can no longer read `eks_cluster_endpoint`. Recovery: `terraform state rm` the `kubernetes_*` and `helm_release.*` resources, then `terraform apply` to recreate. Full runbook in `RECOVERY.md`.
- A `-target`ed partial destroy of just the cluster orphans K8s state with no way to reach it.
- Fallback if this proves too sharp: revert to a `data.aws_eks_cluster` lookup + `-target` two-step. The change is reversible.
- If the `brainstore-nodepool` release gets corrupted or a customer `kubectl delete`s the CRs out of band, recovery is rougher than before: `helm uninstall brainstore-nodepool -n braintrust` + `terraform apply`.
- Template errors in `charts/brainstore-nodepool/templates/*.yaml` fail at apply time, not plan. Mitigation: add `helm lint` to CI later.
- A race: `helm_release.brainstore_nodepool` could theoretically try to create a `NodeClass`/`NodePool` before Auto Mode finishes installing the Karpenter CRDs.

In-band `terraform destroy` still works correctly via the dependency graph — Terraform drains K8s resources first, then the cluster.

## Known limitations / follow-ups
- The controller-generated TargetGroup name `k8s-braintru-braintru-*` still collides visually across deployments. The `additional-resource-tags` fix makes the tags disambiguating, but the name itself isn't configurable on the controller side. Low priority.
- `CONTRACT.md` enumerates the coupling surfaces, but there's no automated check. A CI smoke test that renders the chart against the module's template values and grep-asserts the known-good keys would prevent silent breakage on chart upgrades. Deferred.
- Multiple Brainstore pods can land on the same NVMe node (all instance families are NVMe-backed `c8gd.*`). That's fine for sandbox throughput but defeats the isolation/headroom story of the EC2 path for production workloads. The fix lives in the Helm chart (pod anti-affinity on `braintrust.dev/brainstore-role`), not in this module. Deferred.
- The API pod has no `nodeSelector` and falls through to Auto Mode's default `general-purpose` NodePool, with opaque instance-family selection managed by AWS. Brainstore, by contrast, targets the module-owned `brainstore` NodePool pinned to NVMe-Graviton families. Parallel follow-up: add a second custom NodePool (`api`) in the local chart with its own `eks_api_nodepool_instance_families` variable (default non-NVMe Graviton compute: `c8g`/`c7g`/`m7g`). That brings the API pod under the same explicit control as Brainstore — predictable instance selection, a consistent operational model, and per-pool tuning of disruption/consolidation policy (API pods tolerate more aggressive consolidation than Brainstore). Implementation is ~50 lines of chart YAML + one variable + one helm-values-template edit; guard the `nodeSelector` rendering on the pool being enabled to avoid a stuck-Pending footgun. Deferred.
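The proposed `api` NodePool would mirror the existing brainstore one. A rough sketch under Auto Mode's Karpenter-style API — the `eks.amazonaws.com` NodeClass group and instance-family label are assumptions from EKS Auto Mode conventions, and the in-repo chart's actual templates may differ:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: api
spec:
  template:
    metadata:
      labels:
        braintrust.dev/node-pool: api # matched by a guarded nodeSelector in helm values
    spec:
      nodeClassRef:
        group: eks.amazonaws.com # assumed: Auto Mode's NodeClass API group
        kind: NodeClass
        name: api
      requirements:
        - key: eks.amazonaws.com/instance-family
          operator: In
          values: ["c8g", "c7g", "m7g"] # non-NVMe Graviton compute
```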
## Testing

End-to-end validated in a sandbox AWS account:

- A single `terraform apply` succeeds from an empty AWS account.
- All pods (`braintrust-api`, `brainstore-fastreader`, `brainstore-reader`, `brainstore-writer`) reach `Running 1/1`.
- Brainstore pods land on NVMe-backed nodes from the custom NodePool (`c8gd.xlarge` observed).
- `curl https://<cloudfront-domain>/` returns `200 OK` + `Hello World!` from the API through CloudFront → NLB → pod.
- Controller-created resources carry the `BraintrustDeploymentName` tag after the tagging commit.
- The `brainstore-nodepool` chart renders to the exact same Kubernetes manifests as the previous `kubernetes_manifest` resources (field-by-field JSON diff, including the `aws:eks:cluster-name` colon-key YAML parse).