Add EKS Auto Mode deployment mode #233
Draft
Erik Weathers (erikdw) wants to merge 23 commits into main from
Introduces `create_eks_cluster = true`, which provisions an EKS Auto Mode cluster and deploys the Braintrust Helm chart on it end-to-end. Uses raw AWS provider resources (no `terraform-aws-modules/eks` dependency) and EKS Pod Identity for pod-to-IAM binding.
## Why Auto Mode
Auto Mode collapses most of the yak shave for a production EKS deployment:
- **Node provisioning**: AWS runs Karpenter internally; no managed node group to define.
- **Core addons**: `vpc-cni`, `coredns`, `kube-proxy`, EBS CSI driver, and the AWS Load Balancer Controller come preinstalled — no `aws_eks_addon` resources, no LB Controller IAM role / Helm release.
- **Pod Identity**: the Pod Identity Agent ships built-in, enabling a simpler alternative to IRSA (no OIDC provider, no TLS-thumbprint wrangling, no `data.tls_certificate`).
This module therefore only has to own the cluster + node IAM roles, the VPC wiring, the pre-created NLB + CloudFront distribution, and the Braintrust-specific K8s objects.
## Structure
Two submodules under `modules/`, with a thin root-level wiring file (`eks.tf`). Both use only AWS provider primitives — no community module.
### `modules/eks-cluster/` — AWS infrastructure
- `aws_iam_role` for the cluster, with Auto Mode's five required managed policies attached: `AmazonEKSClusterPolicy`, `AmazonEKSComputePolicy`, `AmazonEKSBlockStoragePolicy`, `AmazonEKSLoadBalancingPolicy`, `AmazonEKSNetworkingPolicy`.
- `aws_iam_role` for Auto Mode nodes, with `AmazonEKSWorkerNodeMinimalPolicy` and `AmazonEC2ContainerRegistryPullOnly`.
- `aws_eks_cluster` with `compute_config`, `storage_config`, and `kubernetes_network_config.elastic_load_balancing` all enabled. `access_config.authentication_mode = "API"` uses EKS access entries (no aws-auth configmap).
- Pre-created internal NLB (`aws_lb`) with a CloudFront-prefix-list security group. NLB security groups cannot be attached after creation, so the module creates the NLB itself; the Auto-Mode-managed LB Controller adopts it later via the chart's `service.beta.kubernetes.io/aws-load-balancer-name` annotation.
- `aws_cloudfront_vpc_origin` wrapping the NLB, plus an `aws_cloudfront_distribution` whose default behavior routes to the EKS API and whose AI-proxy paths route to `braintrustproxy.com`.
- Private subnet tags (`kubernetes.io/role/internal-elb`) for LB Controller subnet auto-discovery.
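The cluster resource described by these bullets can be sketched roughly as follows; resource names and variable references here are illustrative, not the module's actual code:

```hcl
resource "aws_eks_cluster" "this" {
  name     = "${var.deployment_name}-eks"
  role_arn = aws_iam_role.cluster.arn
  version  = var.eks_kubernetes_version

  # Auto Mode: EKS runs Karpenter, the EBS CSI driver, and the LB
  # Controller internally when these three blocks are enabled.
  compute_config {
    enabled       = true
    node_pools    = ["general-purpose", "system"]
    node_role_arn = aws_iam_role.node.arn
  }
  storage_config {
    block_storage {
      enabled = true
    }
  }
  kubernetes_network_config {
    elastic_load_balancing {
      enabled = true
    }
  }

  # EKS access entries instead of the aws-auth configmap.
  access_config {
    authentication_mode = "API"
  }

  # Auto Mode rejects CreateCluster if the self-managed addon
  # bootstrap is left at its default of true.
  bootstrap_self_managed_addons = false

  vpc_config {
    subnet_ids = var.private_subnet_ids
  }
}
```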
### `modules/eks-deploy/` — Kubernetes + Helm
- `kubernetes_namespace` for Braintrust workloads.
- `kubernetes_secret` (`braintrust-secrets`) with `PG_URL`, `REDIS_URL`, `FUNCTION_SECRET_KEY`, `BRAINSTORE_LICENSE_KEY`. The name and keys are hardcoded by the chart.
- `aws_eks_pod_identity_association` resources for the `braintrust-api` and `brainstore` service accounts, binding each to its IAM role from `services_common`.
- `kubernetes_manifest` for a custom `NodeClass` (Auto Mode API: `eks.amazonaws.com/v1`) and `NodePool` (`karpenter.sh/v1`) that constrain Karpenter to NVMe-backed instance families (`c8gd`, `c7gd`, `m7gd` by default, configurable via `eks_brainstore_nodepool_instance_families`). Brainstore pods pin to this NodePool via a `braintrust.dev/node-pool: brainstore` nodeSelector in helm values.
- `helm_release` for the Braintrust chart, with a thin values template that sets only what this module owns and structured per-component overrides (`eks_api_helm`, `eks_brainstore_{reader,fastreader,writer}_helm`) plus a raw-YAML `eks_helm_chart_extra_values` escape hatch.
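The Pod Identity bindings in the list above have a small Terraform surface. A sketch, with variable names assumed rather than taken from the module:

```hcl
# One association per service account; the role ARN comes from
# services_common at the root.
resource "aws_eks_pod_identity_association" "api" {
  cluster_name    = var.cluster_name
  namespace       = var.namespace
  service_account = "braintrust-api"
  role_arn        = var.api_role_arn
}
```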
### Why two submodules
`services_common` creates IAM roles shared with the non-EKS (Lambda / EC2) deployment path, so it must sit at the root between `eks_cluster` (which provides the cluster ARN used to scope Pod Identity trust policies) and `eks_deploy` (which consumes the resulting role ARNs). Wrapping both EKS submodules in a single parent would create a module-level dependency cycle through `services_common`.
## Pod Identity (not IRSA)
Auto Mode's Pod Identity Agent intercepts AWS SDK credential resolution before IRSA is consulted, so pods authenticate via Pod Identity even though the chart still emits an `eks.amazonaws.com/role-arn` annotation (the IRSA path). The module:
- Sets `enable_eks_pod_identity = true` on `services_common` and passes it the cluster ARN. `services_common` builds a trust policy with the `pods.eks.amazonaws.com` principal, scoped via session tags (`aws:RequestTag/eks-cluster-arn`, `aws:RequestTag/kubernetes-namespace`) to this specific cluster and namespace.
- Creates an `aws_eks_pod_identity_association` for each service account, binding `(cluster, namespace, service-account)` to the IAM role.
No OIDC provider, no TLS cert thumbprint management.
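The trust policy shape described above can be sketched like this, assuming illustrative variable names (the module's actual statement layout may differ):

```hcl
data "aws_iam_policy_document" "pod_identity_trust" {
  statement {
    # Pod Identity requires both actions: the agent tags the session.
    actions = ["sts:AssumeRole", "sts:TagSession"]
    principals {
      type        = "Service"
      identifiers = ["pods.eks.amazonaws.com"]
    }
    # Scope the role to one cluster and namespace via the session
    # tags the Pod Identity Agent attaches on AssumeRole.
    condition {
      test     = "StringEquals"
      variable = "aws:RequestTag/eks-cluster-arn"
      values   = [var.eks_cluster_arn]
    }
    condition {
      test     = "StringEquals"
      variable = "aws:RequestTag/kubernetes-namespace"
      values   = [var.namespace]
    }
  }
}
```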
## Changes to existing root files
- `main.tf`: `services_common` gets `enable_eks_pod_identity = true` + the EKS cluster ARN when `create_eks_cluster = true`. `database` and `redis` `authorized_security_groups` include the cluster's primary security group (which Auto Mode attaches to all nodes) in EKS mode.
- `outputs.tf`: `api_url` and `cloudfront_*` outputs resolve to the EKS CloudFront distribution in EKS mode. No other outputs added — existing output contract is otherwise unchanged.
- `variables.tf`: new EKS knobs (`create_eks_cluster`, `eks_kubernetes_version`, `eks_brainstore_nodepool_instance_families`, `helm_chart_version`, and the four structured helm-override variables plus the raw-YAML escape hatch).
- `versions.tf`: notes that `kubernetes`, `helm`, and `random` are declared in `modules/eks-deploy`. Non-EKS consumers must still declare empty provider blocks at the root because Terraform aggregates provider requirements across all submodules regardless of `count`, but the underlying resources are never evaluated when `create_eks_cluster = false`.
## Example
`examples/braintrust-data-plane-eks/` is a thin consumer — provider configuration plus a single module call — demonstrating the two-step apply workflow required on a fresh deployment:
```shell
terraform apply -target=module.braintrust.module.eks_cluster[0]
terraform apply
```
Step 1 creates the cluster so the `data.aws_eks_cluster` lookup in `provider.tf` (keyed by the statically-knowable name `${deployment_name}-eks`) can resolve. Step 2 plans the kubernetes and helm resources, including the `kubernetes_manifest` NodeClass/NodePool which require Auto Mode's CRDs to exist on the cluster before plan time.
The kubernetes and helm providers use a static token from `data.aws_eks_cluster_auth`. Step 2's runtime is well under the 15-minute token TTL because Auto Mode's in-cluster setup is fast and `helm_release` defaults to a 5-minute wait timeout.
## Contract
`CONTRACT.md` documents the coupling surface between this module and `braintrustdata/helm`: service account names, `Secret` name and keys, API port `8000`, the helm-values schema this module writes, the Pod-Identity-over-IRSA precedence, and the assumption that `brainstore.fastreader.replicas >= 1`.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`braintrustdata/helm` released `6.1.0` today. Audit of the 5.0.1 → 6.1.0 diff against this module's coupling surface (service account names, `braintrust-secrets` name and keys, API port `8000`, values-schema keys the template writes, `brainstore.{reader,fastreader,writer}.nodeSelector`) came back clean — nothing on the contract moved.
What actually changed in 6.x:
- Image tags bumped `v1.1.32` → `v2.0.0` (chart semver policy treats image major bumps as chart major bumps).
- `skipPgForBrainstoreObjects` and `brainstoreWalFooterVersion` are now top-level `values.yaml` defaults (this module was already writing them at top-level, so no template change needed).
- Chart now emits additional Brainstore env vars derived from existing values (`BRAINSTORE_RESPONSE_CACHE_URI`, `BRAINSTORE_CODE_BUNDLE_URI`, `BRAINSTORE_ASYNC_SCORING_OBJECTS`, `BRAINSTORE_LOG_AUTOMATIONS_OBJECTS`, `BRAINSTORE_WAL_USE_EFFICIENT_FORMAT`) and adds `checksum/config` annotations on deployments so pods restart when the configmap changes.
None of it requires a template, values-schema, or variable change on this side.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Aligns `examples/braintrust-data-plane-eks/` with the style of the other examples in this directory (`braintrust-data-plane/`, `braintrust-data-plane-sandbox/`): literal values in `main.tf`, `variables.tf` reduced to just the sensitive/per-deployment `brainstore_license_key`.
Before, the example had a variable for every knob it set on the module (`deployment_name`, `braintrust_org_name`, `helm_chart_version`, `eks_namespace`, `brainstore_wal_footer_version`, `skip_pg_for_brainstore_objects`), which meant a user copying the example had to wire up `.tfvars` or `-var` flags for all of them. Now the example ships with sensible defaults as literals, users edit the values directly in their copy of `main.tf`, and only the license key flows through a variable (consistent with the sandbox and production examples).
Also:
- Module block renamed `module "braintrust"` → `module "braintrust-data-plane"` to match the other examples' naming.
- `helm_chart_version = "6.1.0"` pinned as a literal.
- `eks_cluster_name` local in `provider.tf` hardcoded to `"braintrust-eks"` with a comment noting it must match `${deployment_name}-eks` from `main.tf` (the `var.deployment_name` reference was dropped along with the variable).
- Output references updated to the new module block name.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`provider.tf` needs the EKS cluster name to configure the kubernetes and helm providers' `data.aws_eks_cluster` lookup. Previously this was a hardcoded literal (`"braintrust-eks"`) with a comment asking the user to keep it in sync with `deployment_name` in `main.tf`. That split source of truth bit us in practice: changing `deployment_name` in `main.tf` without also updating `provider.tf` silently points the providers at a nonexistent cluster, and step 2 of the two-step apply fails.
Move `deployment_name` into a `locals` block at the top of `main.tf`. Terraform merges locals across files in the same module, so `provider.tf` can compute `eks_cluster_name = "${local.deployment_name}-eks"` without duplicating the string. One place to edit.
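The cross-file merge can be sketched as (values illustrative):

```hcl
# main.tf: single source of truth for the deployment name.
locals {
  deployment_name = "braintrust"
}

# provider.tf: same module, so Terraform merges locals across files
# and this can derive the cluster name without a duplicate literal.
locals {
  eks_cluster_name = "${local.deployment_name}-eks"
}
```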
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Learnings from standing up the Auto Mode deployment for the first time:
- `aws_eks_cluster`: set `bootstrap_self_managed_addons = false`. Auto Mode rejects CreateCluster otherwise, since its built-in addons conflict with the self-managed bootstrap path.
- Brainstore NodeClass: scope `subnetSelectorTerms` to this deployment's VPC via `BraintrustDeploymentName`. `kubernetes.io/role/internal-elb` alone matches subnets in other VPCs (the default VPC, other clusters in the same region), making Karpenter pick a subnet in the wrong VPC and fail RunInstances with a cross-VPC SG/subnet error.
- Brainstore NodeClass: drop custom tags. `AmazonEKSComputePolicy` gates `ec2:CreateLaunchTemplate` on a tag-key allowlist; any extra key fails the controller's IAM pre-check.
- Brainstore NodePool: switch the instance-family requirement key from `karpenter.k8s.aws/instance-family` to `eks.amazonaws.com/instance-family`. Auto Mode restricts requirement domains and the `karpenter.k8s.aws` one isn't accepted.
- `helm_release` timeout: bump to 1200s. Cold first deploys take longer than the 300s default (Karpenter node provisioning + three large Brainstore image pulls + readiness).
- VPC private-subnet lifecycle: `ignore_changes` on the `kubernetes.io/role/internal-elb` tag so Terraform doesn't fight `aws_ec2_tag` (from `modules/eks-cluster`) on every apply.
- Example `provider.tf`: switch kubernetes/helm auth from the 15-minute static `aws_eks_cluster_auth` token to `exec { aws eks get-token }` so long applies and extended approval-prompt pauses don't fail with expired-token errors.
- Example `main.tf`: expand the two-step-apply doc comment with the zsh-globbing caveat, and explain the 400 GB gp3 IOPS/throughput threshold that trips up smaller (sandbox) `postgres_storage_size` values.
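The exec-auth switch in the provider.tf learning looks roughly like this; the cluster attributes here are illustrative local references:

```hcl
# Exec-based auth: the provider shells out for a fresh token on each
# use instead of carrying one 15-minute static token for the whole run.
provider "kubernetes" {
  host                   = local.cluster_endpoint
  cluster_ca_certificate = base64decode(local.cluster_ca_data)

  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", local.eks_cluster_name]
  }
}
```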
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto Mode's Load Balancer Controller uses the `ip` target-type for NLBs, which sends health checks and traffic directly to pod IPs on the container port (8000) — not via the NodePort on a node IP. When the NLB SG is pre-created and attached via the `aws-load-balancer-security-groups` annotation, the controller only opens the NodePort range on the cluster SG (the rule it would need for `instance` target-type) and leaves the container port unreachable. Result: TCP health checks time out, the target group stays unhealthy, the NLB has no backends, and CloudFront hangs.

Fix: replace the NodePort-range rule (30000-32767) with a single TCP 8000 rule from the NLB SG to the cluster SG. NodePort wasn't being used by the `ip` target-type path anyway, so removing it is safe and avoids carrying a misleading rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
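The replacement rule can be sketched as follows, with the resource references being illustrative:

```hcl
# Allow the NLB to reach pod IPs directly on the container port,
# which is what the ip target-type actually uses.
resource "aws_vpc_security_group_ingress_rule" "nlb_to_pods" {
  security_group_id            = aws_eks_cluster.this.vpc_config[0].cluster_security_group_id
  referenced_security_group_id = aws_security_group.nlb.id
  ip_protocol                  = "tcp"
  from_port                    = 8000
  to_port                      = 8000
  description                  = "NLB to pod IPs on the container port (ip target-type)"
}
```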
The LB Controller names the TargetGroups auto-generated from the Service as `k8s-<ns-8>-<svc-8>-<hash>`, and doesn't expose an override. For a Braintrust dataplane the namespace and service names are fixed (`braintrust`/`braintrust-api`), so every deployment in an AWS account ends up with TGs named `k8s-braintru-braintru-*` — visually indistinguishable in the console even though they're functionally isolated by the controller's cluster-scoping tag.

Add the `aws-load-balancer-additional-resource-tags` annotation so the controller tags its TGs (and listeners) with `BraintrustDeploymentName`, matching the tag scheme we already use on Terraform-owned resources. Now `tag:BraintrustDeploymentName` is a reliable way to identify all AWS resources belonging to a specific dataplane deployment. `deployment_name` is wired into the helm-values template to pass it through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The example only showed api and brainstore writer overrides; reader and fastreader were undocumented even though they have the same structured override variables. Add them so all four chart components have a copy-pasteable sandbox sizing example. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two blockers made the initial EKS deploy require a `-target`'d two-step apply:

1. The example's `provider.tf` looked up the cluster via `data.aws_eks_cluster`, which reads at refresh (pre-plan) and fails if the cluster doesn't exist yet. There's no way to defer a data source read through the initial plan.
2. The NodeClass and NodePool were delivered via `kubernetes_manifest`, which reads CRD schemas from the live cluster at plan time to validate the manifest. On a fresh deploy the cluster doesn't exist and the plan fails.

Neither has to be this way:

1. Expose the cluster endpoint, CA data, and name as root-module outputs. The example's `provider.tf` reads those instead of the data source. Terraform treats module outputs that trace back to unknown resource attributes as "known after apply" and defers provider resolution — no data source, no refresh-time failure.
2. Replace `kubernetes_manifest` for the NodeClass + NodePool with a `helm_release` pointing at a tiny local chart (`modules/eks-deploy/charts/brainstore-nodepool/`). Helm renders templates locally and applies at apply time, so there's no plan-time cluster contact.

Result: a single `terraform apply` from an empty AWS account brings up everything — VPC, cluster, RDS, Redis, S3, IAM, NodeClass/NodePool, Braintrust Helm release — in one command.

Tradeoff we accepted: if the cluster is destroyed out of band while Terraform state still references in-cluster resources, refresh will fail because the cluster outputs become unreadable. Recovery is `terraform state rm` of the `kubernetes_*`/`helm_release` resources followed by `terraform apply`. In-band `terraform destroy` is handled correctly by the dependency graph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
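The local-chart delivery can be sketched as follows; the chart path comes from the commit message, while the value plumbing is assumed:

```hcl
# Helm renders the NodeClass/NodePool templates locally, so nothing
# contacts the cluster at plan time (unlike kubernetes_manifest).
resource "helm_release" "brainstore_nodepool" {
  name      = "brainstore-nodepool"
  chart     = "${path.module}/charts/brainstore-nodepool"
  namespace = var.namespace

  set {
    name  = "instanceFamilies"
    value = join(",", var.eks_brainstore_nodepool_instance_families)
  }
}
```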
The EKS CloudFront distribution was unconditionally routing `/function/*`, `/v1/proxy*`, and `/v1/eval*` to the CloudflareProxy origin (braintrustproxy.com). For a self-hosted dataplane this is wrong on two counts:

- Request payloads round-trip through Braintrust's hosted proxy rather than staying inside the customer's AWS account — defeating a core reason for self-hosting.
- The preflight OPTIONS that browsers send for these paths hits a Cloudflare 404 with no CORS headers, so the UI (braintrust.dev) fails every cross-origin request to these paths.

Fix: default `target_origin_id` for those path patterns to `EKSAPIOrigin` (the in-cluster API pod via the NLB — standalone-api serves these paths in Dataplane 2.0). This mirrors the Lambda ingress module's default behavior, where paths route to the local AIProxy Lambda unless `use_global_ai_proxy = true`. Expose the same `use_global_ai_proxy` toggle so both modes have identical semantics — opt in to braintrustproxy.com if Braintrust instructs, otherwise stay local. The root-level `var.use_global_ai_proxy` already existed (shared with the Lambda path); this wires it through to the EKS cluster submodule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The example's `locals { deployment_name = ... }` block only existed so
provider.tf could derive the EKS cluster name from it without a
duplicate literal. Since provider.tf now reads the cluster name from
module outputs directly (`module.braintrust-data-plane.eks_cluster_name`),
the local has no remaining cross-file use. Fold the constant back into
the module call and carry the comment on the attribute instead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Observed failure mode: `terraform destroy` freezes for ~5 minutes on
helm_release.braintrust because the LB Controller holds the
`service.eks.amazonaws.com/resources` finalizer on the api Service
while it waits for target-group drain to complete. The default
deregistration delay is 300s. In failure-mode states (cluster never
had nodes register, a failed helm install, pods never reached Ready)
the drain wait is spent on nothing — there are no targets to drain —
but LB Controller respects it anyway. To the operator, `terraform
destroy` looks hung; the hang resolves only after a manual
`kubectl patch svc ... --patch '{"metadata":{"finalizers":null}}'`.
Hit this class three times now: yesterday on the 2nd deployment, and
today on both a failed redux apply and the intentional destroy of the
2nd deployment.
Fix: annotate the api Service with
`aws-load-balancer-target-group-attributes: deregistration_delay.timeout_seconds=0`.
Zero drain wait means the NLB deregisters targets instantly, the
finalizer clears immediately, and `helm uninstall` (and therefore
`terraform destroy`) converges in seconds.
Safe for production: the drain delay exists to let in-flight
connections finish before a target is removed. For a stateless HTTP
API fronted by CloudFront (which retries on connection failure), a
few aborted connections on scale-in or destroy are acceptable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `use_global_ai_proxy` toggle was carried over from the Lambda ingress module, where it exists to let Braintrust's own multi-tenant SaaS deployment route through braintrustproxy.com instead of the local AIProxy Lambda. For self-hosted customers there's no reason to route through Braintrust's hosted proxy — doing so defeats the point of self-hosting and requires Braintrust-side registration of the customer's deployment to work at all. Hardcode the LLM-proxy path routing to the in-cluster API (`EKSAPIOrigin`), remove the `CloudflareProxy` origin from the distribution entirely, and drop the `use_global_ai_proxy` variable from the EKS cluster submodule. Lambda mode is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`terraform destroy` on a non-empty dataplane currently requires two manual cleanup steps before it'll succeed:

1. The RDS instance has `deletion_protection = true` by default. Destroy errors with `InvalidParameterCombination: Cannot delete protected DB Instance`. Fix today: `aws rds modify-db-instance --no-deletion-protection --apply-immediately` out of band.
2. The S3 buckets are versioned and non-empty (especially the Brainstore bucket, which accumulates WAL + cache). Destroy errors with `BucketNotEmpty: The bucket you tried to delete is not empty. You must delete all versions`. Fix today: write a loop against `list-object-versions` + `delete-objects` for every bucket.

For real customer deployments this safety is the right default — it prevents accidental data loss on a typo'd destroy. For sandbox / CI / throwaway deployments the friction is painful.

New root variable `force_destroy_data` (default: false). When true:

- Every S3 bucket gets `force_destroy = true`, so destroy empties the bucket (all versions + delete markers) before deleting it.
- RDS `deletion_protection` is disabled (OR'd with the existing `DANGER_disable_database_deletion_protection` toggle).
- RDS `skip_final_snapshot = true`, so destroy doesn't block on snapshot creation.

The default stays false, so existing consumers are unaffected. Sandbox users set `force_destroy_data = true` in their example `main.tf` and subsequent destroys are a single command.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restore a set of root-module outputs that were previously pruned. They're broadly useful for consumers wiring this module into larger deployments — IAM role ARNs (for Pod Identity / IRSA references), Postgres and Redis connection details (for Kubernetes Secret construction from the root module's state), S3 bucket names (for downstream IAM policy templates), and EKS NLB identifiers.

Omitted three outputs that appeared in earlier iterations but don't apply to Auto Mode:

- `eks_oidc_provider_arn` — Auto Mode uses Pod Identity; there's no OIDC provider resource.
- `eks_node_security_group_id` — we don't create a dedicated node SG; Auto Mode attaches the cluster SG to nodes. Expose `eks_cluster_security_group_id` instead.
- `eks_lb_controller_role_arn` — Auto Mode owns the LB Controller; there's no IAM role to expose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Terraform state already stores these values; exposing them as root outputs amplifies the blast radius:

- `terraform_remote_state` consumers pull values into a second state file.
- `terraform output -json` in CI pipelines writes them to stdout/logs unredacted (`sensitive = true` only suppresses plaintext at the CLI; it doesn't scrub downstream logging).

Removed:

- `postgres_database_password` — the database module already creates a Secrets Manager secret; consumers can resolve credentials via `postgres_database_secret_arn` (still exposed), which is the canonical path.
- `function_tools_secret_key` — a Braintrust-internal encryption key used only by our own `kubernetes_secret.braintrust`. External consumers have no legitimate need for it.

`eks_cluster_ca_certificate_data` stays — it's marked sensitive upstream but is a public CA cert by definition, and our own `provider.tf` consumes it from module outputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Too dangerous a footgun to keep in the module. A consumer who accidentally flags `force_destroy_data = true` (or leaves it on after a test destroy and then starts using the deployment for real) would have no safety rails — all customer data evaporates on the next `destroy` with no final snapshot and no S3 version retention. Sandbox teardown friction is real but narrowly felt (just the TF module authors); the risk is broadly felt (every consumer). Prefer operators to run the same `aws s3api` version-delete loop and `aws rds modify-db-instance --no-deletion-protection` ceremony we ran during development — it's slower but impossible to trigger by accident. This reverts commit 1c350b2d; PR description updated separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove `data "aws_region" "current"` from `modules/eks-deploy/main.tf`; it was imported in the early days of the module and never actually referenced in any rendered value.
- Remove the `custom_tags` variable from `modules/eks-deploy/variables.tf` and its unused pass-through in `eks.tf`. The eks-deploy submodule doesn't own any AWS resources (only Kubernetes + `helm_release`), so custom AWS tags have no effect there.

Also incorporates the `terraform fmt -recursive` whitespace fixes on `main.tf`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the four structured per-component variables
(`eks_api_helm`, `eks_brainstore_{reader,fastreader,writer}_helm`)
and the `eks_helm_chart_extra_values` heredoc string with a single
`eks_helm_values_file` variable pointing at a YAML file alongside
the caller's main.tf.
Why:
- The four structured variables only covered two specific fields
(replicas, resources) on four specific components. Anyone tweaking
anything else (annotations, probes, env, image pins, nodeSelector)
was forced into the heredoc escape hatch. The structured-vars
abstraction was a half-abstraction.
- Helm's native interface is "a list of values files." Collapsing to
"module defaults + one caller-supplied values file" matches the
mental model customers already have from `helm install -f values.yaml`.
- Heredocs in HCL are unwieldy — no YAML lint, no IDE support, not
shareable between deployments. A separate `.yaml` file fixes all
three.
Mechanics: the submodule accepts a filename (not a `file()` result).
Path is interpreted by `file()` inside the submodule — use
`${path.module}/values.yaml` or an absolute path on the caller side.
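Caller-side usage might look like this; the module source and surrounding arguments are assumptions:

```hcl
module "braintrust-data-plane" {
  source = "braintrustdata/braintrust-data-plane/aws"

  create_eks_cluster = true

  # Interpolated to an absolute path here, then read with file()
  # inside the eks-deploy submodule.
  eks_helm_values_file = "${path.module}/values.yaml"
}
```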
Also adds `examples/braintrust-data-plane-eks-sandbox/` — a cheap
disposable-sandbox variant of the existing EKS example. Smaller RDS
(`db.r8g.large` / 100GB / gp3 baseline), smaller Redis
(`cache.t4g.small`), and a `values.yaml` that shrinks every chart
component to 1 replica with tight CPU/memory so the whole dataplane
fits on a single small Karpenter-provisioned node. Matches the
existing `braintrust-data-plane` / `braintrust-data-plane-sandbox`
pattern elsewhere in the examples directory.
Removes `modules/eks-deploy/overrides.tf` (the locals that
synthesized YAML from the structured variables — no longer needed).
CONTRACT.md updated to point at `eks_helm_values_file` for the
fast-reader opt-out warning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before-GA doc coverage items that came up in review prep:

- `modules/eks-cluster/README.md`: a terse "what this submodule owns" overview + outputs table + key variables, so readers browsing the module on GitHub or the registry have a landing page rather than raw `.tf`.
- `modules/eks-deploy/README.md`: same for the K8s/Helm layer, plus an explanation of the in-repo `brainstore-nodepool` chart (why we use `helm_release` instead of `kubernetes_manifest` for the NodeClass + NodePool) and the helm-values merge precedence.
- `CONTRACT.md` "Deployment isolation" section: an explicit note that `deployment_name` must be unique per account+region. Enumerates the resources that would collide, confirms multiple deployments with distinct names are supported and validated, and points at the cosmetic LB Controller TG-name overlap (disambiguated via the `BraintrustDeploymentName` tag).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two discoverability gaps that make the EKS mode hard to find and recover from, surfaced during PR review prep:

- The root `README.md` previously didn't mention EKS mode at all. Readers browsing the module on GitHub or the registry would have no signal that the `create_eks_cluster = true` path exists. Added a one-paragraph subsection under "How to use this module" pointing at the prod + sandbox examples and the new `TROUBLESHOOTING.md`.
- `TROUBLESHOOTING.md` promotes the EKS-mode recovery ritual from the PR description (where it would evaporate after merge) into a durable operator-facing doc. Covers the four failure modes we actually hit during development: out-of-band cluster deletion + state-rm recovery, `helm_release` destroy hanging on the Service finalizer, EIP quota exhaustion on fresh apply, and pods stuck Pending due to broken NAT.

Also notes that the existing Lambda-mode `dump-logs.sh` script does not cover EKS mode (observability parity is a tracked follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four .md-audit-pass fixes and one structure change:

- `CONTRACT.md`: "the four NLB annotations" → an explicit list of the six now present (`-scheme`, `-type`, `-security-groups`, `-name`, `-additional-resource-tags`, `-target-group-attributes`). Matches the current `helm-values.yaml.tpl`.
- `CONTRACT.md`: the deployment-isolation section dropped a dangling "See the 'TG naming' follow-up in the PR description" reference; the paragraph now explains the cosmetic collision + tag-disambiguation story inline, so it survives PR merge.
- `TROUBLESHOOTING.md`: dropped a dangling "See the PR description's 'Remaining challenges' section" reference in the `dump-logs.sh` note; the observability gap is now described inline.
- `README.md` (`dump-logs.sh` section): added a note that the script covers only the Lambda/EC2 deployment mode, pointing at `TROUBLESHOOTING.md` + `RECOVERY.md` for EKS-mode runbooks.

Plus: the out-of-band-cluster-deletion runbook is promoted from a buried section in `TROUBLESHOOTING.md` to its own top-level `RECOVERY.md`. It's a disaster-recovery scenario (a state mismatch requiring state-level intervention), distinct from the routine apply/destroy failures that `TROUBLESHOOTING.md` collects. Cross-refs between the two docs redirect readers who land on the wrong one. `RECOVERY.md` also includes a "why the module accepts this failure mode" note explaining the single-apply-bootstrap tradeoff that makes this scenario possible. The README's EKS-mode signal now points at both `TROUBLESHOOTING.md` and `RECOVERY.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Status: implemented and validated end-to-end on the erikdw-sandbox-5 teardown, but unclear whether to ship in this PR. The chart-level annotation already added in fc11624 covers the same drain-wait finalizer hang for fresh deploys; this adds a redundant module-level preflight that catches the failure when the chart annotation didn't propagate (older chart, manual override, broken state). Decide before merge whether the broader coverage is worth the extra surface area.

What it does, when `var.prepare_for_destroy = true` (default false):

- `kubernetes_annotations.api_drain_zero` forces the api Service annotation `service.beta.kubernetes.io/aws-load-balancer-target-group-attributes` to `deregistration_delay.timeout_seconds=0`. Same key the chart template (`helm-values.yaml.tpl`) already sets — this resource only matters if the live annotation drifted.
- `terraform_data.api_tg_drain_zero` loops over every TargetGroup tagged `BraintrustDeploymentName=<deployment_name>` and calls `aws elbv2 modify-target-group-attributes` to set the same attribute directly. This is a faster path than the LB Controller's reconcile loop, and covers the case where the controller created the TG before our annotation propagated.

With the drain wait at zero, the LB Controller releases its `service.eks.amazonaws.com/resources` finalizer the moment helm uninstall deletes the Service, finishes its own TG cleanup, and `helm_release.braintrust` returns in seconds. No kubectl-patch workaround needed, and no orphan TGs left in AWS.

Why this exists: on the erikdw-sandbox-5 teardown, `terraform destroy` hung on `helm_release.braintrust` for ~10 minutes (past the default 5-min drain timer, suggesting the chart annotation never made it to the live TG on chart 6.1.0). A manual `kubectl patch svc braintrust-api ... finalizers:null` unblocks the destroy but interrupts the LBC mid-cleanup, leaving an orphan TG (`k8s-braintru-braintru-*` tagged with the deployment name). `prepare_for_destroy` avoids both problems.
Scope is service infra only. Data-bearing resources keep their separate, explicit knobs:

- RDS: DANGER_disable_database_deletion_protection (existing)
- S3: deliberately not destroyable from TF — emptying buckets is a manual operator step before destroy. No DANGER_* flag, no force_destroy var. Matches the prior 398f997 revert of `force_destroy_data`.

Files:

- variables.tf: add `prepare_for_destroy` (root)
- eks.tf: plumb through to module.eks_deploy
- modules/eks-deploy/variables.tf: declare the var
- modules/eks-deploy/main.tf: add `data.aws_region.current`, `kubernetes_annotations.api_drain_zero`, and `terraform_data.api_tg_drain_zero` (both gated by count)
- TROUBLESHOOTING.md: prepare_for_destroy is now the documented happy path; the manual kubectl-patch runbook stays as the in-flight recovery, with a tag-driven cleanup snippet for the orphan TG that workaround leaves behind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Add EKS Auto Mode deployment mode
## Summary

Adds a `create_eks_cluster = true` deployment mode that provisions a complete Braintrust dataplane on EKS Auto Mode instead of the existing Lambda + EC2 path. All Braintrust workloads (API, brainstore reader / fastreader / writer) run in-cluster as pods and are deployed via the Braintrust Helm chart.

When enabled, the module owns the EKS Auto Mode cluster (`${deployment_name}-eks`), the cluster and node IAM roles, the pre-created NLB + CloudFront distribution, and the Braintrust-specific Kubernetes objects. Everything else (VPC, RDS, ElastiCache, S3, KMS, API/Brainstore IAM) is shared with the existing Lambda/EC2 path.

The Lambda, EC2 Brainstore, and Lambda-URL ingress submodules are disabled in this mode (gated by `use_deployment_mode_external_eks = true`, which `create_eks_cluster = true` requires).

## Why EKS Auto Mode
Auto Mode lets AWS manage the control plane add-ons (VPC CNI, CoreDNS, kube-proxy, Pod Identity Agent, AWS Load Balancer Controller, EBS CSI driver) and node lifecycle (via a managed Karpenter). The same capabilities on self-managed EKS mean owning the install, IAM configuration, version-compatibility matrix, and upgrade choreography for each addon — plus whichever subset of {Karpenter, Pod Identity Agent, EBS CSI, metrics-server} matches your feature choices. None of it is individually hard; collectively it's real recurring work on every cluster upgrade. Auto Mode hands that coordination surface to AWS in exchange for the managed-mode premium and some lost flexibility. This module therefore uses Auto Mode exclusively rather than self-managed EKS or the `terraform-aws-modules/eks` community module.

## Usage

### Minimal example

See `examples/braintrust-data-plane-eks/` for the production-sized canonical config, or `examples/braintrust-data-plane-eks-sandbox/` for a cheap disposable sandbox variant (smaller RDS, Redis, and a `values.yaml` alongside that shrinks the chart components to 1-replica with tight CPU/memory). The shortest working invocation:
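A sketch of that shortest invocation, assuming the module is consumed from the repo root; the `source` path, `deployment_name`, and chart version shown here are illustrative placeholders, not the example's exact contents:

```hcl
module "braintrust-data-plane" {
  source = "../.."            # illustrative: path to this module in your layout

  deployment_name = "sandbox" # illustrative

  # Enable the EKS Auto Mode deployment mode.
  create_eks_cluster               = true
  use_deployment_mode_external_eks = true # required by create_eks_cluster

  # No default: chart upgrades are always deliberate.
  helm_chart_version = "6.1.0"
}
```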
### Single-apply bootstrap

`terraform apply`. One command. Cold first-deploy runtime is ~15 minutes (cluster ~8-10, then RDS + Redis + Helm release). Subsequent applies are incremental.

Two design choices make the one-command path work:

- **Provider config from module outputs, not `data.aws_eks_cluster`.** The example's `provider.tf` reads `module.braintrust-data-plane.eks_cluster_endpoint`, `eks_cluster_ca_certificate_data`, and `eks_cluster_name` directly off the module. Terraform treats these as "known after apply" on the first run and defers provider resolution until the cluster exists. A data source, by contrast, reads at refresh (pre-plan) and would fail on a fresh deploy — that was the reason the first iteration of this module required a `-target`'d two-step apply.
- **NodePool delivered via `helm_release`, not `kubernetes_manifest`.** `kubernetes_manifest` reads CRD schemas from the live cluster at plan time to validate manifests, which fails on a fresh deploy; Helm renders templates locally and applies at apply time, with no plan-time cluster dependency. The CRDs live in an in-repo chart at `modules/eks-deploy/charts/brainstore-nodepool/`.

## Architecture
### Module layout

New submodules:

- `modules/eks-cluster/` — EKS cluster, cluster + node IAM roles, NLB pre-creation, CloudFront VPC Origin wiring, CloudFront distribution.
- `modules/eks-deploy/` — the Kubernetes / Helm layer: namespace, `braintrust-secrets` Secret, Pod Identity associations, the `brainstore-nodepool` helm release (NodeClass + NodePool), and the `braintrust` helm release itself.

New in-repo Helm chart: `modules/eks-deploy/charts/brainstore-nodepool/` — a tiny chart with just two templates (NodeClass + NodePool). Not published anywhere; it lives with the Terraform source so the module is self-contained.

Top-level `eks.tf` wires the three submodules together. Root-level `main.tf` is touched only lightly (for `services_common` to receive the EKS cluster ARN for Pod Identity trust scoping).

### Module ordering
1. `eks_cluster` provisions the cluster and exports its ARN.
2. `services_common` builds the API + Brainstore IAM roles with Pod Identity trust policies scoped to `(cluster_arn, namespace, service_account)`.
3. `eks_deploy` creates the Pod Identity associations binding SAs to roles, plus the namespace / Secret / brainstore-nodepool chart / Braintrust helm release.

This is why the EKS layer is split into two submodules rather than one: `services_common` is also used by the non-EKS path, so it can't live inside `eks_deploy`, and the role ARNs it produces are consumed by `eks_deploy`, so `services_common` can't live inside `eks_cluster`.
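The trust scoping in step 2 can be sketched roughly as follows. This is an illustrative `aws_iam_policy_document`, not the one `services_common` actually emits: the condition keys follow AWS's documented Pod Identity session tags and confused-deputy guidance, and `var.cluster_arn` plus the literal namespace/SA values are placeholders.

```hcl
data "aws_iam_policy_document" "pod_identity_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole", "sts:TagSession"]

    principals {
      type        = "Service"
      identifiers = ["pods.eks.amazonaws.com"]
    }

    # Confused-deputy guard: only this cluster may assume the role.
    condition {
      test     = "ArnEquals"
      variable = "aws:SourceArn"
      values   = [var.cluster_arn] # illustrative variable
    }

    # Pod Identity passes namespace/SA as session tags; scope to both.
    condition {
      test     = "StringEquals"
      variable = "aws:RequestTag/kubernetes-namespace"
      values   = ["braintrust"]
    }
    condition {
      test     = "StringEquals"
      variable = "aws:RequestTag/kubernetes-service-account"
      values   = ["braintrust-api"]
    }
  }
}
```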
## Key design decisions

**Pod Identity, not IRSA.** Auto Mode ships the Pod Identity Agent preinstalled. Pod Identity uses simpler trust policies, supports session tags, and doesn't require an OIDC provider. The module creates `aws_eks_pod_identity_association` resources for both the `braintrust-api` and `brainstore` service accounts. The chart still writes an IRSA-style `eks.amazonaws.com/role-arn` annotation on the service accounts; this is harmless because Pod Identity intercepts AWS SDK credential resolution before IRSA is consulted.

**Pre-created NLB adopted by the Load Balancer Controller.** The CloudFront VPC Origin needs the NLB ARN at plan time, but the Load Balancer Controller normally creates NLBs on demand when a Service becomes `type: LoadBalancer`. The module pre-creates the NLB in Terraform (`aws_lb.api`), and the chart's Service uses the `service.beta.kubernetes.io/aws-load-balancer-name` + `aws-load-balancer-security-groups` annotations to have the controller adopt the existing NLB rather than create a new one. Security groups can only be attached to an NLB at creation time, which is why the NLB SG is also owned by Terraform.

**Custom Brainstore NodePool.** Brainstore caches to local NVMe SSD via `emptyDir`, so its pods need NVMe-backed EC2 families (`c8gd`, `c7gd`, `m7gd`, etc.). Auto Mode's default `general-purpose` NodePool doesn't constrain to those families, so the module adds a custom NodeClass + NodePool that does, and Brainstore pods target it via the `braintrust.dev/node-pool: brainstore` nodeSelector injected into the Helm values.

**NodePool delivered via `helm_release`, not `kubernetes_manifest`.** `kubernetes_manifest` reads CRD schemas from the live cluster at plan time. That's incompatible with single-apply bootstrap because the cluster doesn't exist yet on the first plan. Wrapping the two manifests in a tiny local Helm chart moves the cluster contact to apply time. The rendered objects are structurally identical to what `kubernetes_manifest` produced — verified by rendering the chart and diffing field-by-field against the old values, including the tricky `aws:eks:cluster-name` colon-key.

**Provider config from module outputs, not `data.aws_eks_cluster`.** The example's `provider.tf` reads `eks_cluster_endpoint`, `eks_cluster_ca_certificate_data`, and `eks_cluster_name` directly off the module. Terraform treats module outputs that trace back to "known after apply" resource attributes as unknown at plan time and defers provider resolution until the cluster exists. A data source reads at refresh (pre-plan) and would fail on a fresh deploy.

**Exec auth for the Kubernetes/Helm providers, not static tokens.** The example's `provider.tf` uses `exec { aws eks get-token }` rather than the simpler `aws_eks_cluster_auth` data source. The static-token pattern expires after 15 minutes — short enough to fail if an apply sits at an approval prompt or if the operator walks away between `terraform plan` and `terraform apply`. Exec auth refreshes on every API call and requires only the AWS CLI on the runner (which consumers need anyway, for `aws eks update-kubeconfig`).
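Combining the last two decisions, the example's `provider.tf` looks along these lines. A sketch, not the exact file: the output names are the module's, but the wiring is an approximation.

```hcl
provider "kubernetes" {
  host                   = module.braintrust-data-plane.eks_cluster_endpoint
  cluster_ca_certificate = base64decode(module.braintrust-data-plane.eks_cluster_ca_certificate_data)

  # Exec auth: a fresh token on every API call, no 15-minute expiry.
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", module.braintrust-data-plane.eks_cluster_name]
  }
}

provider "helm" {
  kubernetes {
    host                   = module.braintrust-data-plane.eks_cluster_endpoint
    cluster_ca_certificate = base64decode(module.braintrust-data-plane.eks_cluster_ca_certificate_data)
    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      command     = "aws"
      args        = ["eks", "get-token", "--cluster-name", module.braintrust-data-plane.eks_cluster_name]
    }
  }
}
```

Because every attribute traces back to the module's "known after apply" outputs, provider resolution is deferred until the cluster exists, which is what makes the single-apply bootstrap possible.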
## Destroy choreography

Tearing down an EKS-mode deployment is a two-step apply→destroy:

1. Set `prepare_for_destroy = true` and run `terraform apply`.
2. Run `terraform destroy`.

The preflight resources live in `modules/eks-deploy/main.tf` behind `count = var.prepare_for_destroy ? 1 : 0`:

- `kubernetes_annotations.api_drain_zero` patches `service.beta.kubernetes.io/aws-load-balancer-target-group-attributes` on the api Service to `deregistration_delay.timeout_seconds=0`. Belt-and-suspenders: the chart's `helm-values.yaml.tpl` already sets this, but this resource forces the live annotation back to the right value if it ever drifted (older chart, manual override, broken state).
- `terraform_data.api_tg_drain_zero` calls `aws elbv2 modify-target-group-attributes` directly on every TargetGroup tagged `BraintrustDeploymentName=<deployment_name>`. A faster path than waiting for the LB Controller's reconcile loop to propagate the annotation, and it works even on TGs created before the annotation was set.

With drain wait at zero, the LB Controller releases its `service.eks.amazonaws.com/resources` finalizer the moment helm uninstall deletes the api Service, finishes its own TG cleanup, and `helm_release.braintrust` returns in seconds. No `kubectl patch` workarounds, no orphan TargetGroups left in AWS.

Why this exists: in earlier sandbox tear-downs, `terraform destroy` froze for ~5 min on the helm_release while the LBC waited out the default 300s drain timer. The manual workaround (`kubectl -n braintrust patch svc braintrust-api --type merge -p '{"metadata":{"finalizers":null}}'`) unblocks the destroy but interrupts the controller mid-cleanup, leaving an orphan TG behind. `prepare_for_destroy` is the supported alternative.

Scope: service infra only. Data-bearing resources have separate, explicit knobs:

- `DANGER_disable_database_deletion_protection = true` (existing) flips the RDS `deletion_protection` attribute. Required for `terraform destroy` to remove the database; intentionally not bundled into `prepare_for_destroy`.
- S3 buckets get no `force_destroy` toggle, no DANGER_* flag. If you need to tear a sandbox down completely, empty the buckets manually first (`aws s3 rm s3://<bucket> --recursive`, then `aws s3api delete-objects` for non-current versions and delete-markers if versioning was enabled). The cost of a stray destroy hitting a data bucket is too high to mitigate with a flag.
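The count-gating pattern behind the preflight resources above can be sketched as follows. The real AWS CLI loop in `modules/eks-deploy/main.tf` is elided here; the `triggers_replace` wiring and the provisioner body are illustrative assumptions, not the module's actual code.

```hcl
resource "terraform_data" "api_tg_drain_zero" {
  # The resource exists only while prepare_for_destroy is flipped on.
  count = var.prepare_for_destroy ? 1 : 0

  triggers_replace = [var.deployment_name] # illustrative: re-run per deployment

  provisioner "local-exec" {
    # Illustrative placeholder: the real command enumerates every TargetGroup
    # tagged BraintrustDeploymentName=<deployment_name> and zeroes its drain
    # delay via `aws elbv2 modify-target-group-attributes`.
    command = "echo preflight"
  }
}
```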
## New variables

All defaulted except `helm_chart_version` when `create_eks_cluster = true`.

| Variable | Default | Notes |
|---|---|---|
| `create_eks_cluster` | `false` | Enables this mode; requires `use_deployment_mode_external_eks = true`. |
| `eks_kubernetes_version` | `"1.31"` | |
| `eks_brainstore_nodepool_instance_families` | `["c8gd", "c7gd", "m7gd"]` | |
| `helm_chart_version` | `null` | Required when `create_eks_cluster = true`. No default so chart upgrades are always deliberate. |
| `eks_helm_values_file` | `null` | Set `eks_helm_values_file = "${path.module}/values.yaml"` so the file lives alongside your main.tf. Leave null to accept chart defaults. See the `braintrust-data-plane-eks-sandbox` example for sandbox-sized values. |
| `prepare_for_destroy` | `false` | Preflight for `terraform destroy`. Flip true, apply, then destroy. Zeroes deregistration_delay on the LB Controller's TargetGroup(s) so the finalizer doesn't hang `helm_release.braintrust` on destroy and the controller cleans up its own TGs (no orphans). EKS-mode only. See Destroy choreography above and `TROUBLESHOOTING.md`. |
## New module outputs

The three starred outputs below are required by the example's `provider.tf` to configure the kubernetes/helm providers from module outputs instead of a `data.aws_eks_cluster` lookup — which is what enables single-apply bootstrap. The rest are broadly useful for downstream consumers wiring this module into larger deployments (IAM references for external Pod Identity associations, Postgres/Redis connection details for downstream Kubernetes Secret construction, S3 bucket names for downstream IAM policy templates, NLB identifiers, etc.).

- ★ `eks_cluster_name` — the `aws eks get-token` exec arg in provider.tf.
- ★ `eks_cluster_endpoint` — the provider `host`.
- ★ `eks_cluster_ca_certificate_data` — the provider `cluster_ca_certificate` (after `base64decode()`).
- `eks_cluster_security_group_id`
- `eks_nlb_arn`
- `eks_nlb_name` — matches the `aws-load-balancer-name` annotation.
- `nlb_security_group_id`
- `code_bundle_bucket_id`
- `lambda_responses_bucket_id`
- `postgres_database_address`
- `postgres_database_port`
- `redis_endpoint`
- `redis_port`
- `api_handler_role_arn` — role for the `braintrust-api` service account.
- `brainstore_iam_role_arn` — role for the `brainstore` service account (also the EC2 role on the EC2-Brainstore path).

★ = required for the single-apply `provider.tf` pattern.

## Module ↔ Helm chart contract
The module and chart are tightly coupled — several names, ports, keys, and paths have to match exactly on both sides. The full list is documented in `CONTRACT.md` (tested chart version `6.1.0`, supported range `6.x`). Highlights:

- the `braintrust-secrets` Secret and its keys (`PG_URL`, `REDIS_URL`, `FUNCTION_SECRET_KEY`, `BRAINSTORE_LICENSE_KEY`)
- the service account names (`braintrust-api`, `brainstore`) used in both the Pod Identity associations and the chart
- API port `8000` — used by the CloudFront VPC Origin and by the cluster SG ingress rule that admits NLB traffic
- the `aws-load-balancer-*` service annotations the controller reads to adopt our pre-created NLB, plus `aws-load-balancer-additional-resource-tags` for deployment-scoped tagging of controller-created resources
- the nodeSelector label `braintrust.dev/node-pool: brainstore`

Any of these moving or renaming on the chart side breaks us, often silently. Drift detection between the module and chart is a follow-up (below).
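Chart-side, the adoption annotations look roughly like this on the api Service. The annotation keys are the AWS Load Balancer Controller's documented ones; the concrete values and the exact shape of the chart's rendered Service are placeholders, not a verbatim excerpt.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: braintrust-api
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-scheme: internal
    # Adopt the Terraform-created NLB instead of creating a new one.
    service.beta.kubernetes.io/aws-load-balancer-name: <pre-created NLB name>
    service.beta.kubernetes.io/aws-load-balancer-security-groups: <NLB SG id>
    service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags: BraintrustDeploymentName=<deployment_name>
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: deregistration_delay.timeout_seconds=0
spec:
  type: LoadBalancer
  ports:
    - port: 8000
      targetPort: 8000
```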
## Tradeoffs accepted with single-apply bootstrap
Single-apply is strictly an improvement if the target audience can handle the failure modes below. For Braintrust's self-hosted data plane audience (sophisticated operators), the judgment is that it is. For less-experienced consumers, two-step would have been safer. This PR accepts the tradeoff.
- Out-of-band cluster deletion (`aws eks delete-cluster`) breaks `terraform plan` at refresh — the provider can no longer read `eks_cluster_endpoint`. Recovery: `terraform state rm` the `kubernetes_*` and `helm_release.*` resources, then `terraform apply` to recreate. Full runbook in `RECOVERY.md`.
- A `-target`ed partial destroy of just the cluster orphans K8s state with no way to reach it.
- Fallback if this proves too sharp: revert to a `data.aws_eks_cluster` lookup + `-target` two-step. The change is reversible.
- If the `brainstore-nodepool` release gets corrupted or a customer `kubectl delete`s the CRs out of band, recovery is rougher than before: `helm uninstall brainstore-nodepool -n braintrust` + `terraform apply`.
- Template errors in `charts/brainstore-nodepool/templates/*.yaml` fail at apply time, not plan. Mitigation: add `helm lint` to CI later.
- A race: `helm_release.brainstore_nodepool` could theoretically try to create a `NodeClass`/`NodePool` before Auto Mode finishes installing the Karpenter CRDs.

In-band `terraform destroy` still works correctly via the dependency graph — Terraform drains K8s resources first, then the cluster.

## Known limitations / follow-ups
- The controller-generated TargetGroup name `k8s-braintru-braintru-*` still collides visually across deployments. The `additional-resource-tags` fix makes the tags disambiguating, but the name itself isn't configurable on the controller side. Low priority.
- `CONTRACT.md` enumerates the coupling surfaces, but there's no automated check. A CI smoke test that renders the chart against the module's template values and grep-asserts the known-good keys would prevent silent breakage on chart upgrades. Deferred.
- Multiple Brainstore pods can land on the same NVMe node (all instance families are NVMe-backed `c8gd.*`). That's fine for sandbox throughput but defeats the isolation/headroom story of the EC2 path for production workloads. The fix lives in the Helm chart (pod anti-affinity on `braintrust.dev/brainstore-role`), not in this module. Deferred.
- The API pod has no `nodeSelector` and falls through to Auto Mode's default `general-purpose` NodePool, with opaque instance-family selection managed by AWS. Brainstore, by contrast, targets the module-owned `brainstore` NodePool pinned to NVMe-Graviton families. Parallel follow-up: add a second custom NodePool (`api`) in the local chart with its own `eks_api_nodepool_instance_families` variable (default non-NVMe Graviton compute: `c8g`/`c7g`/`m7g`). That brings the API pod under the same explicit control as Brainstore — predictable instance selection, a consistent operational model, and per-pool tuning of disruption/consolidation policy (API pods tolerate more aggressive consolidation than Brainstore). Implementation is ~50 lines of chart YAML + one variable + one helm-values-template edit; guard the `nodeSelector` rendering on the pool being enabled to avoid a stuck-Pending footgun. Deferred.
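The proposed `api` NodePool would mirror the existing brainstore one. A rough sketch under Auto Mode's Karpenter-style API — the `eks.amazonaws.com` NodeClass group and instance-family label are assumptions from EKS Auto Mode conventions, and the in-repo chart's actual templates may differ:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: api
spec:
  template:
    metadata:
      labels:
        braintrust.dev/node-pool: api # matched by a guarded nodeSelector in helm values
    spec:
      nodeClassRef:
        group: eks.amazonaws.com # assumed: Auto Mode's NodeClass API group
        kind: NodeClass
        name: api
      requirements:
        - key: eks.amazonaws.com/instance-family
          operator: In
          values: ["c8g", "c7g", "m7g"] # non-NVMe Graviton compute
```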
## Testing

End-to-end validated in a sandbox AWS account:

- A single `terraform apply` succeeds from an empty AWS account.
- All pods (`braintrust-api`, `brainstore-fastreader`, `brainstore-reader`, `brainstore-writer`) reach `Running 1/1`.
- Brainstore pods land on NVMe-backed nodes from the custom NodePool (`c8gd.xlarge` observed).
- `curl https://<cloudfront-domain>/` returns `200 OK` + `Hello World!` from the API through CloudFront → NLB → pod.
- Controller-created resources carry the `BraintrustDeploymentName` tag after the tagging commit.
- The `brainstore-nodepool` chart renders to the exact same Kubernetes manifests as the previous `kubernetes_manifest` resources (field-by-field JSON diff, including the `aws:eks:cluster-name` colon-key YAML parse).