AWS reference architecture

Terraform modules and example configurations for deploying the Poolside platform on AWS EKS. A single terraform apply provisions the AWS infrastructure, pushes the Poolside container images to ECR, uploads model checkpoints to S3, and installs the Poolside Helm releases.

[Diagram: Poolside Reference Architecture for AWS]

See docs/architecture.md for a detailed walkthrough of the architecture.

Deployment profiles

| Profile | GPU | Cognito | Use case |
| --- | --- | --- | --- |
| examples/platform-only | No | Optional | Core platform; models hosted elsewhere or via external API |
| examples/full | Yes | Optional | Platform + local GPU inference |

Deployment workflow

  1. Prepare prerequisites: an AWS account with admin-equivalent credentials, the Poolside Helm bundle extracted on disk, model checkpoint tarballs (full profile only), a public DNS hostname, and an ACM certificate covering that hostname in your target region. See docs/prerequisites.md.
  2. Copy the example that matches your profile out of this repo, then fill in terraform.tfvars from terraform.tfvars.example.
  3. Run the first terraform apply. It creates the VPC, EKS cluster, node groups, RDS, S3, ECR (with Poolside images pushed via skopeo), KMS keys, IAM roles, ALB controller, and External Secrets Operator, and uploads model checkpoints to S3. The Helm releases are gated off by default, so the first apply exercises only infrastructure.
  4. Set install_poolside_deployment = true (and install_inference_stack = true for the full profile) in terraform.tfvars and re-apply. Terraform installs the Poolside Helm releases against the cluster.
  5. Point your public hostname at the ALB. Create a Route 53 A (Alias) record that targets the ALB. The ALB DNS name appears in the ADDRESS column of the ingress: kubectl get ingress -n poolside.
  6. Bind your identity provider. Visit https://<your-hostname> and follow the on-screen prompts. With Cognito, retrieve the issuer URL, client ID, and client secret from the Terraform outputs.
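
The two-phase flow in steps 3 to 5 can be sketched as a small shell script. This is a minimal sketch under stated assumptions, not tooling shipped with this repo: the function names (enable_flag, deploy, alb_hostname) and the working-directory argument are hypothetical, and it assumes terraform and kubectl are on your PATH and terraform.tfvars is already filled in. Only the two flag names, the poolside namespace, and the kubectl get ingress command come from the steps above.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: set `<name> = true` in a tfvars file.
# Rewrites an existing assignment in place, or appends one if none exists.
enable_flag() {
  file="$1"
  name="$2"
  if grep -Eq "^[[:space:]]*${name}[[:space:]]*=" "$file"; then
    sed -E "s|^[[:space:]]*${name}[[:space:]]*=.*|${name} = true|" "$file" > "${file}.tmp"
    mv "${file}.tmp" "$file"
  else
    printf '%s = true\n' "$name" >> "$file"
  fi
}

# Hypothetical wrapper around the two applies described in steps 3 and 4.
deploy() {
  workdir="$1"   # directory holding your copied example and terraform.tfvars
  profile="$2"   # "platform-only" or "full"
  cd "$workdir"
  terraform init
  # Phase 1: the Helm releases are gated off by default, so this is infrastructure only.
  terraform apply
  # Phase 2: enable the Helm releases and re-apply.
  enable_flag terraform.tfvars install_poolside_deployment
  if [ "$profile" = "full" ]; then
    enable_flag terraform.tfvars install_inference_stack
  fi
  terraform apply
}

# Prints the ALB DNS name for the Route 53 alias target (step 5).
# Assumption: the first ingress in the poolside namespace is the one to expose.
alb_hostname() {
  kubectl get ingress -n poolside \
    -o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}'
}
```

Calling deploy ./my-stack full (with ./my-stack standing in for wherever you copied the example) would run both phases, after which alb_hostname prints the name to use as the alias target. Running terraform output in the same directory lists the available outputs for step 6.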

Quickstart: docs/quickstart.md.

What Terraform creates

| Layer | Resources |
| --- | --- |
| Network | VPC, three-tier subnets (public, worker, control-plane), NAT gateways, S3 gateway endpoint |
| Compute | EKS cluster, CPU node group, optional GPU node group, managed addons |
| Data | RDS PostgreSQL (AWS-managed password), S3 buckets (data, access logs, models) |
| Security | KMS keys, least-privilege IRSA roles, permissions-boundary support |
| Registry | ECR repositories for all Poolside container images, populated from the bundle |
| Model checkpoints | Streaming upload of *.tar checkpoints into the models bucket (full profile) |
| Cluster setup | Kubernetes namespaces, gp3 StorageClass, ALB controller, External Secrets Operator, optional GPU Operator |
| Helm releases | poolside-deployment and (full profile) inference-stack, gated by capability flags |
| Auth | Optional Cognito user pool + client |

Modules

All modules live under modules/:

| Module | Purpose |
| --- | --- |
| reference-stack | Composition wrapper that wires all infra modules into a single deployable stack |
| network | VPC, subnets, NAT gateways, S3 endpoint |
| eks | EKS cluster, OIDC provider, managed addons, access entries |
| eks-node-groups/cpu | CPU (platform) node group |
| eks-node-groups/gpu | GPU (inference) node group with capacity-reservation support |
| data-stores | RDS PostgreSQL + S3 buckets |
| ecr | ECR repositories with bundle-driven image push |
| iam | Node group instance roles plus IRSA roles for cluster addons and Poolside workloads |
| security | KMS keys (EKS, RDS, S3, EBS, application encryption) |
| cluster-bootstrap | Namespaces, gp3 StorageClass, optional custom CA bundle |
| ingress | AWS Load Balancer Controller (Helm) |
| gpu-operator | NVIDIA GPU Operator (Helm) |
| secrets-sync | External Secrets Operator + RDS password sync |
| cognito | AWS Cognito user pool, client, domain |
| model-checkpoints | Streaming uploader for model checkpoint tarballs into S3 |
| poolside-values | Composes Helm values for the Poolside charts from infra outputs |
| helm-wrapper | Chart-agnostic helm_release wrapper used by the example roots |

Architecture decisions

This reference architecture is intentionally opinionated:

  • Single terraform apply provisions infrastructure, pushes container images, uploads model checkpoints, and installs the Helm releases. The Helm releases are gated off by default (see Deployment workflow).
  • IRSA only. EKS Pod Identity is not used.
  • ALB only. No nginx ingress controller.
  • Public EKS API endpoint with a mandatory CIDR allowlist; the private endpoint is also enabled.
  • AWS-managed RDS password. Never stored in Terraform state.
  • KMS encryption for application secrets. No static key option.
  • p5e.48xlarge as the minimum GPU instance type.
  • Terraform 1.5.7 or later. Works with OpenTofu 1.x.
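
The Terraform version floor in the last bullet can be checked before the first apply. The following is a minimal sketch, not part of this repo: version_ge and check_terraform_version are hypothetical names, the comparison only looks at the numeric MAJOR.MINOR.PATCH fields, and it assumes the terraform binary is on your PATH (OpenTofu is not covered here).

```shell
#!/bin/sh
set -eu

# Succeeds when version $1 >= version $2, comparing MAJOR.MINOR.PATCH numerically.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, "."); split(b, y, ".")
    for (i = 1; i <= 3; i++) {
      if (x[i] + 0 > y[i] + 0) exit 0
      if (x[i] + 0 < y[i] + 0) exit 1
    }
    exit 0
  }'
}

# Reads the installed Terraform version and compares it with the documented minimum.
check_terraform_version() {
  min="1.5.7"
  # `terraform version -json` prints a JSON object with a "terraform_version" field.
  found="$(terraform version -json |
    sed -n 's/.*"terraform_version": *"\([^"]*\)".*/\1/p')"
  if version_ge "$found" "$min"; then
    echo "terraform $found satisfies >= $min"
  else
    echo "terraform $found is older than $min" >&2
    return 1
  fi
}
```

Running check_terraform_version at the top of a deployment script would stop early on an older binary. The comparison is numeric per field, so 1.10.0 is correctly treated as newer than 1.5.7.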

See docs/customizing.md for permissions boundaries, custom CA bundles, AMI overrides, and other tunable knobs.

Documentation