103 changes: 103 additions & 0 deletions CONTRACT.md
@@ -0,0 +1,103 @@
# Terraform ↔ Helm Contract

This Terraform module is paired with the Braintrust Helm chart in
[`braintrustdata/helm`](https://github.com/braintrustdata/helm). When
`create_eks_cluster = true` (EKS deployment mode), the module provisions
AWS infrastructure that the Helm chart expects to consume, and the Helm
release in turn must match a set of names, ports, and keys this module
hardcodes into IAM trust policies and security groups.

The coupling surface is small, but **several items fail silently at pod
runtime, not at `terraform apply`**. This document enumerates them so that
the author of a PR on either side can verify it hasn't broken the other.

## Pinned chart compatibility

| Field | Value |
|---|---|
| Braintrust Helm chart | `oci://public.ecr.aws/braintrust/helm` |
| Tested chart version | `5.0.1` |
| Supported range | `5.x` (no hard validation today — revisit when 6.x ships) |

The `helm_chart_version` variable in `examples/braintrust-data-plane-eks/`
has no default — consumers must pin.
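
For example, a consumer's `terraform.tfvars` would pin the tested version (a minimal sketch; the value shown is the currently tested one, not a recommendation to stay on it):

```hcl
# Pin explicitly; the variable has no default.
helm_chart_version = "5.0.1"
```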

## Coupling surfaces

Anything the module *writes into the chart values* OR *trusts the chart
to name* is listed here. If you change either side, audit this list.

### Names and identifiers

| Thing | TF location | Chart location | Failure mode |
|---|---|---|---|
| API service account name `braintrust-api` | IRSA `sub` claim in API handler role trust policy (computed in `modules/eks-cluster/iam.tf` `locals.api_iam_trust_policy`, with SA name from `var.api_service_account_name`) | `api.serviceAccount.name` default in chart `values.yaml` | **Silent runtime**: pod starts, `AssumeRoleWithWebIdentity` is rejected, every AWS SDK call returns 403 |
| Brainstore service account name `brainstore` | IRSA `sub` claim in Brainstore role trust policy (computed in `modules/eks-cluster/iam.tf` `locals.brainstore_iam_trust_policy`, with SA name from `var.brainstore_service_account_name`) | `brainstore.serviceAccount.name` default | Silent runtime (same as above) |
| LB Controller service account `kube-system:aws-load-balancer-controller` | `aws_iam_role.lb_controller` trust policy in `modules/eks-cluster/iam.tf` | AWS LB Controller helm chart (upstream, not ours) | LB Controller fails to create NLB targets; API service stays unreachable |
| K8s Secret name `braintrust-secrets` | `kubernetes_secret.braintrust` in `modules/eks-deploy/main.tf` | `api-deployment.yaml` and `brainstore-*-deployment.yaml` hardcode `secretKeyRef.name: braintrust-secrets` | Pod fails to start: `CreateContainerConfigError` |
| Secret keys `PG_URL`, `REDIS_URL`, `FUNCTION_SECRET_KEY`, `BRAINSTORE_LICENSE_KEY` | `data = { ... }` in `kubernetes_secret.braintrust` (`modules/eks-deploy/main.tf`) | Referenced in chart deployment templates | Pod fails to start (missing env var key) |
| Namespace | `var.eks_namespace` → `kubernetes_namespace.braintrust` in `modules/eks-deploy/main.tf` + passed as template `namespace` var | `global.namespace` (used in configmap to build `BRAINSTORE_*_URL`); runtime namespace resolved via `braintrust.namespace` helper to `.Release.Namespace` when `createNamespace: false` | Pods run in wrong namespace; intra-cluster DNS fails |
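
The first two rows hinge on the IRSA `sub` claim matching
`system:serviceaccount:<namespace>:<service-account-name>` exactly. A minimal
sketch of the trust-policy shape (local and resource names here are
illustrative, not the actual identifiers in `modules/eks-cluster/iam.tf`):

```hcl
data "aws_iam_policy_document" "api_irsa_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.eks.arn] # illustrative name
    }

    # If this value and the chart's api.serviceAccount.name drift apart, the
    # pod still starts; only AssumeRoleWithWebIdentity is rejected at runtime.
    condition {
      test     = "StringEquals"
      variable = "${local.oidc_issuer_hostpath}:sub" # issuer URL minus "https://"
      values   = ["system:serviceaccount:${var.eks_namespace}:${var.api_service_account_name}"]
    }
  }
}
```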

### Network / ports

| Thing | TF location | Chart location | Failure mode |
|---|---|---|---|
| API service port `8000` | `aws_cloudfront_vpc_origin.api.http_port` in `modules/eks-cluster/cloudfront.tf`, NLB target port implicit via LB Controller | `api.service.port` default `8000`; `api-deployment.yaml` containerPort | **Silent at deploy**: CloudFront → NLB → node NodePort path dead |
| NodePort range `30000-32767` | `aws_vpc_security_group_ingress_rule.eks_nodes_from_nlb` in `modules/eks-cluster/networking.tf` | Kubernetes kube-apiserver default (outside our control) | Would require K8s project default change — very low risk |
| Pre-created NLB adopted by the chart via `service.beta.kubernetes.io/aws-load-balancer-name` | `aws_lb.api.name` in `modules/eks-cluster/networking.tf` (exposed as the root's `eks_nlb_name` output) | `api.annotations.service.*` — controller reads this annotation | If the chart renames the annotation or a consumer unsets it, the controller creates a parallel NLB; CloudFront VPC Origin points at the orphan |
| NLB security group | `aws_security_group.nlb_cloudfront` in `modules/eks-cluster/networking.tf` (NLBs only accept SGs at creation; cannot be added later) | `service.beta.kubernetes.io/aws-load-balancer-security-groups` in `api.annotations.service` | Adopted NLB gets wrong SG; CloudFront can't reach it |
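
The NodePort row corresponds to an ingress rule of roughly this shape (a
sketch; only the resource name, SG name, and port range come from the module):

```hcl
resource "aws_vpc_security_group_ingress_rule" "eks_nodes_from_nlb" {
  security_group_id            = local.eks_node_security_group_id # hypothetical local
  referenced_security_group_id = aws_security_group.nlb_cloudfront.id
  ip_protocol                  = "tcp"
  from_port                    = 30000 # default kube-apiserver NodePort range
  to_port                      = 32767
}
```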

### Helm values schema written by the module template

The template lives at `modules/eks-deploy/assets/helm-values.yaml.tpl`.
If any of these keys moves or is renamed in the chart, we break silently:
the template writes a dead key and the chart falls back to its own default.
A sketch of the wiring follows the list.

- `global.orgName`
- `global.createNamespace`
- `global.namespace`
- `cloud` (set to `"aws"`)
- `skipPgForBrainstoreObjects`
- `brainstoreWalFooterVersion`
- `objectStorage.aws.brainstoreBucket`
- `objectStorage.aws.responseBucket`
- `objectStorage.aws.codeBundleBucket`
- `api.service.type` (set to `LoadBalancer`)
- `api.annotations.service.*` (the four NLB annotations)
- `api.serviceAccount.awsRoleArn`
- `brainstore.serviceAccount.awsRoleArn`
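
A minimal sketch of that wiring, assuming a hypothetical resource name and
template variables (only the template path and chart URL come from this repo):

```hcl
resource "helm_release" "braintrust" {
  name       = "braintrust"
  repository = "oci://public.ecr.aws/braintrust"
  chart      = "helm"
  version    = var.helm_chart_version
  namespace  = var.namespace

  values = [
    templatefile("${path.module}/assets/helm-values.yaml.tpl", {
      # Every key the template writes must still exist in the chart's
      # values.yaml; Helm silently ignores keys the chart never reads.
      org_name  = var.braintrust_org_name
      namespace = var.namespace
    }),
  ]
}
```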

### Feature-flag value domains

- `brainstoreWalFooterVersion` — TF validation allows `""`, `"v1"`, `"v2"`, `"v3"` (see `variables.tf`). Chart must accept the same set; when the chart adds a new version, TF validation needs updating.
- `skipPgForBrainstoreObjects` — TF allows `""`, `"all"`, `"include:…"`, `"exclude:…"`. Chart passes through unchanged.
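
The first flag's guard in `variables.tf` has roughly this shape (a
self-contained sketch; the description text is illustrative):

```hcl
variable "brainstore_wal_footer_version" {
  type        = string
  default     = ""
  description = "Brainstore WAL footer version, passed through to the chart."

  validation {
    # Keep this set in lockstep with the versions the chart accepts.
    condition     = contains(["", "v1", "v2", "v3"], var.brainstore_wal_footer_version)
    error_message = "brainstore_wal_footer_version must be one of \"\", \"v1\", \"v2\", \"v3\"."
  }
}
```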

### Assumptions baked into the contract

- **EKS mode assumes a fast reader is always deployed.** The chart defaults `brainstore.fastreader.replicas = 2` and unconditionally emits `BRAINSTORE_FAST_READER_URL` + `BRAINSTORE_FAST_READER_QUERY_SOURCES` from `api-configmap.yaml`, so the API always believes fast readers are available. This differs from EC2 Brainstore mode where `brainstore_fast_reader_instance_count = 0` is a supported "disabled" state (the services module conditionally omits the env vars). In EKS mode we intentionally do not support the 0-replicas case — users who scale `eks_brainstore_fastreader_helm.replicas` to 0 opt out of this contract and own the resulting query failures.
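
The EC2-mode gating mentioned above amounts to something like this (a sketch
with hypothetical local names; the real logic lives in the services module):

```hcl
locals {
  # The fast reader env vars are omitted entirely when the instance count is
  # 0, so the API never believes fast readers exist.
  fast_reader_env = var.brainstore_fast_reader_instance_count > 0 ? {
    BRAINSTORE_FAST_READER_URL = local.fast_reader_url # hypothetical
  } : {}
}
```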

## Checklist: making a change

### Changing the TF module

- If the change touches any row of a table above, open a matching issue/PR in `braintrustdata/helm`.
- Regenerate `helm-values.yaml.tpl` and confirm every key still exists in the chart's `values.yaml`.
- If you rename a service-account name or secret, update both the IRSA trust policy *and* the kubernetes_secret / chart values in the example.

### Changing the Helm chart

- If you rename any `.Values.*` key listed in the "Helm values schema" section, file an issue here to update `helm-values.yaml.tpl`.
- If you rename an SA (`api.serviceAccount.name` or `brainstore.serviceAccount.name`) or change the hardcoded secret name in a deployment template, this module's IRSA trust policy breaks silently — file a coordinated PR.
- If you change the API service port default away from 8000, ship a matching TF variable for `eks_api_service_port` or coordinate a default bump.
- If you want to support `fastreader.replicas = 0` in EKS mode (parity with EC2's `brainstore_fast_reader_instance_count = 0`), gate the `BRAINSTORE_FAST_READER_URL` configmap entry on `replicas > 0` first, then update the assumption in this doc.

### Bumping the chart version used in the example

- Diff the chart's `values.yaml` between versions and scan for any key listed above.
- Run `helm template` locally with this module's rendered values and grep for the hardcoded names/ports/keys listed in the tables.
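
For example (a hypothetical invocation; adjust the release name and the path
to your rendered values):

```sh
helm template braintrust oci://public.ecr.aws/braintrust/helm \
  --version 5.0.1 \
  -f rendered-values.yaml \
  | grep -E 'braintrust-secrets|braintrust-api|8000'
```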

## Future: mechanical drift detection

This document is a manual safety net. See `memory` notes for deferred
ideas: CI smoke tests that render `helm template` with TF-shaped values
and assert the contract, plus a symmetric test in the helm repo.
99 changes: 99 additions & 0 deletions eks.tf
@@ -0,0 +1,99 @@
# Composition of the EKS cluster and EKS-deploy submodules.
# All resource-level logic lives under modules/eks-cluster/ and
# modules/eks-deploy/. This file is just wiring.

locals {
# Kubernetes namespace the Braintrust workloads run in. Falls back to
# "braintrust" when var.eks_namespace is null (keeps the existing
# non-EKS behavior of the var while providing a sensible default here).
eks_namespace_resolved = coalesce(var.eks_namespace, "braintrust")

# Safe accessors for count-gated module outputs — one() returns null for empty lists.
eks_cluster_arn_val = one(module.eks_cluster[*].cluster_arn)
eks_cluster_name_val = one(module.eks_cluster[*].cluster_name)
eks_cluster_endpoint_val = one(module.eks_cluster[*].cluster_endpoint)
eks_cluster_ca_certificate_val = one(module.eks_cluster[*].cluster_certificate_authority_data)
eks_oidc_provider_arn = one(module.eks_cluster[*].oidc_provider_arn)
eks_node_security_group_id = one(module.eks_cluster[*].node_security_group_id)
eks_api_iam_trust_policy = one(module.eks_cluster[*].api_iam_trust_policy)
eks_brainstore_iam_trust_policy = one(module.eks_cluster[*].brainstore_iam_trust_policy)
eks_lb_controller_role_arn = one(module.eks_cluster[*].lb_controller_role_arn)
eks_nlb_arn_val = one(module.eks_cluster[*].nlb_arn)
eks_nlb_name_val = one(module.eks_cluster[*].nlb_name)
eks_nlb_security_group_id = one(module.eks_cluster[*].nlb_security_group_id)
eks_cloudfront_domain_name = one(module.eks_cluster[*].cloudfront_distribution_domain_name)
eks_cloudfront_arn = one(module.eks_cluster[*].cloudfront_distribution_arn)
eks_cloudfront_hosted_zone_id = one(module.eks_cluster[*].cloudfront_distribution_hosted_zone_id)
}

module "eks_cluster" {
source = "./modules/eks-cluster"
count = var.create_eks_cluster ? 1 : 0

deployment_name = var.deployment_name
custom_tags = var.custom_tags
permissions_boundary_arn = var.permissions_boundary_arn

vpc_id = local.main_vpc_id
private_subnet_ids = [
local.main_vpc_private_subnet_1_id,
local.main_vpc_private_subnet_2_id,
local.main_vpc_private_subnet_3_id,
]

eks_namespace = local.eks_namespace_resolved
eks_kubernetes_version = var.eks_kubernetes_version
eks_node_instance_type = var.eks_node_instance_type
eks_node_min_size = var.eks_node_min_size
eks_node_max_size = var.eks_node_max_size
eks_node_desired_size = var.eks_node_desired_size

cloudfront_price_class = var.cloudfront_price_class
custom_domain = var.custom_domain
custom_certificate_arn = var.custom_certificate_arn
waf_acl_id = var.waf_acl_id
}

module "eks_deploy" {
source = "./modules/eks-deploy"
count = var.create_eks_cluster ? 1 : 0

deployment_name = var.deployment_name
custom_tags = var.custom_tags
braintrust_org_name = var.braintrust_org_name
namespace = local.eks_namespace_resolved

cluster_name = module.eks_cluster[0].cluster_name
vpc_id = local.main_vpc_id
lb_controller_role_arn = module.eks_cluster[0].lb_controller_role_arn
nlb_security_group_id = module.eks_cluster[0].nlb_security_group_id
nlb_name = module.eks_cluster[0].nlb_name

# IAM role ARNs come from services_common, which in turn consumes the
# trust policies output by eks_cluster above. This forms a linear chain:
# eks_cluster -> services_common -> eks_deploy.
api_handler_role_arn = module.services_common.api_handler_role_arn
brainstore_iam_role_arn = module.services_common.brainstore_iam_role_arn

brainstore_bucket_name = module.storage.brainstore_bucket_id
response_bucket_name = module.storage.lambda_responses_bucket_id
code_bundle_bucket_name = module.storage.code_bundle_bucket_id

postgres_host = module.database.postgres_database_address
postgres_port = module.database.postgres_database_port
postgres_username = module.database.postgres_database_username
postgres_password = module.database.postgres_database_password
redis_host = module.redis.redis_endpoint
redis_port = module.redis.redis_port

brainstore_license_key = var.brainstore_license_key
brainstore_wal_footer_version = var.brainstore_wal_footer_version
skip_pg_for_brainstore_objects = var.skip_pg_for_brainstore_objects

helm_chart_version = var.helm_chart_version
api_helm = var.eks_api_helm
brainstore_reader_helm = var.eks_brainstore_reader_helm
brainstore_fastreader_helm = var.eks_brainstore_fastreader_helm
brainstore_writer_helm = var.eks_brainstore_writer_helm
helm_chart_extra_values = var.eks_helm_chart_extra_values
}
73 changes: 73 additions & 0 deletions examples/braintrust-data-plane-eks/main.tf
@@ -0,0 +1,73 @@
# tflint-ignore-file: terraform_module_pinned_source

# Example: Fully Terraform-managed EKS-based Braintrust data plane.
#
# This example is a thin configuration layer — all logic (EKS cluster, OIDC,
# addons, NLB, CloudFront, Kubernetes namespace + secret, Helm releases) lives
# inside the module. The example just sets variables and configures providers.
#
# IMPORTANT — two-step apply required on first deployment:
#
# Step 1: terraform apply -target=module.braintrust.module.eks_cluster[0]
# Step 2: terraform apply
#
# Step 1 creates the EKS cluster so the kubernetes/helm providers in
# provider.tf can resolve the cluster endpoint via data.aws_eks_cluster.
# Step 2 deploys the K8s namespace, secret, and Helm releases.

module "braintrust" {
source = "../../"
# For production use, pin to a released version:
# source = "github.com/braintrustdata/terraform-braintrust-data-plane?ref=vX.Y.Z"

deployment_name = var.deployment_name
braintrust_org_name = var.braintrust_org_name
brainstore_license_key = var.brainstore_license_key

# EKS deployment mode — disables Lambda, EC2 Brainstore, and Lambda-based ingress
use_deployment_mode_external_eks = true
# Create and manage the EKS cluster with Terraform
create_eks_cluster = true

eks_namespace = var.eks_namespace
helm_chart_version = var.helm_chart_version

brainstore_wal_footer_version = var.brainstore_wal_footer_version
skip_pg_for_brainstore_objects = var.skip_pg_for_brainstore_objects

# Disable quarantine VPC (Lambda-based, not relevant for EKS mode)
enable_quarantine_vpc = false

### Postgres
postgres_instance_type = "db.r8g.2xlarge"
postgres_storage_size = 1000
postgres_max_storage_size = 10000
postgres_storage_type = "gp3"
postgres_storage_iops = 15000
postgres_storage_throughput = 500
postgres_version = "15"
postgres_auto_minor_version_upgrade = true

### Redis
redis_instance_type = "cache.t4g.medium"
redis_version = "7.0"

### Sandbox helm overrides (example — uncomment and adjust for deployments
### smaller than what the chart's production defaults assume).
# eks_node_instance_type = "c8gd.2xlarge"
# eks_node_desired_size = 2
# eks_api_helm = {
# replicas = 1
# resources = {
# requests = { cpu = "500m", memory = "1Gi" }
# limits = { cpu = "1", memory = "2Gi" }
# }
# }
# eks_brainstore_writer_helm = {
# replicas = 1
# resources = {
# requests = { cpu = "1", memory = "2Gi" }
# limits = { cpu = "2", memory = "4Gi" }
# }
# }
}
24 changes: 24 additions & 0 deletions examples/braintrust-data-plane-eks/outputs.tf
@@ -0,0 +1,24 @@
output "api_url" {
value = module.braintrust.api_url
description = "Braintrust API URL — enter this in the Braintrust dashboard under Settings > Data Plane > API URL"
}

output "eks_cluster_name" {
value = module.braintrust.eks_cluster_name
description = "EKS cluster name"
}

output "eks_cluster_endpoint" {
value = module.braintrust.eks_cluster_endpoint
description = "EKS cluster API server endpoint"
}

output "cloudfront_distribution_domain_name" {
value = module.braintrust.cloudfront_distribution_domain_name
description = "CloudFront distribution domain name"
}

output "postgres_database_identifier" {
value = module.braintrust.postgres_database_identifier
description = "RDS instance identifier"
}
46 changes: 46 additions & 0 deletions examples/braintrust-data-plane-eks/provider.tf
@@ -0,0 +1,46 @@
# provider "aws" {
# region = "<your AWS region>"
#
# # Optional but recommended: restrict to a specific credential profile/account.
# # profile = "<your AWS credential profile>"
# # allowed_account_ids = ["<your AWS account ID>"]
# }

# The kubernetes and helm providers need the EKS cluster endpoint, which is created
# by the braintrust module. On a fresh deployment, use a two-step apply:
#
# Step 1: terraform apply -target=module.braintrust.module.eks_cluster[0]
# Step 2: terraform apply
#
# After step 1 the cluster exists and the data sources below succeed.
locals {
eks_cluster_name = "${var.deployment_name}-eks"
}

data "aws_region" "current" {}

data "aws_eks_cluster" "braintrust" {
name = local.eks_cluster_name
}

provider "kubernetes" {
host = data.aws_eks_cluster.braintrust.endpoint
cluster_ca_certificate = base64decode(data.aws_eks_cluster.braintrust.certificate_authority[0].data)
exec {
api_version = "client.authentication.k8s.io/v1beta1"
command = "aws"
args = ["eks", "get-token", "--cluster-name", local.eks_cluster_name, "--region", data.aws_region.current.region]
}
}

provider "helm" {
kubernetes {
host = data.aws_eks_cluster.braintrust.endpoint
cluster_ca_certificate = base64decode(data.aws_eks_cluster.braintrust.certificate_authority[0].data)
exec {
api_version = "client.authentication.k8s.io/v1beta1"
command = "aws"
args = ["eks", "get-token", "--cluster-name", local.eks_cluster_name, "--region", data.aws_region.current.region]
}
}
}
9 changes: 9 additions & 0 deletions examples/braintrust-data-plane-eks/terraform.tf
@@ -0,0 +1,9 @@
# Configure remote state backend here (e.g. S3):
#
# terraform {
# backend "s3" {
# bucket = "<your-state-bucket>"
# key = "braintrust-eks/terraform.tfstate"
# region = "<your-region>"
# }
# }