
Architecture overview

This document describes what the AWS reference architecture produces: the AWS resources and Kubernetes components needed to deploy the Poolside platform and, optionally, local model inference.


What Terraform creates

A single terraform apply creates the complete AWS foundation and Kubernetes requirements for a Poolside deployment, then deploys the Poolside platform and (optional) local inference stack on GPU EKS worker nodes.

The infrastructure is organized into layers:

Network

A dedicated VPC with three subnet tiers across multiple availability zones:

  • Public subnets: NAT gateways and the internet-facing Application Load Balancer (ALB)
  • Private worker subnets: EKS worker nodes (CPU and GPU), RDS instance, with outbound internet via NAT
  • Private control plane subnets: EKS control plane ENIs (AWS-managed)

An S3 VPC gateway endpoint routes S3 traffic directly, bypassing NAT gateways to reduce data transfer costs for image pulls and model artifact downloads.
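The endpoint described above might look like the following in Terraform (resource and variable names here are illustrative, not the module's actual identifiers):

```hcl
# Illustrative sketch only: names and references are placeholders.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"

  # Associating the endpoint with the private worker route tables lets
  # image-layer pulls and model artifact downloads bypass the NAT gateways.
  route_table_ids = aws_route_table.private[*].id
}
```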

EKS cluster

A managed Kubernetes cluster with:

  • OIDC provider for IAM Roles for Service Accounts (IRSA)
  • Managed EKS add-ons: vpc-cni, kube-proxy, coredns, metrics-server, snapshot-controller, aws-ebs-csi-driver
  • Public API endpoint protected by a mandatory CIDR allowlist, plus a private endpoint for in-VPC traffic
  • Access entries for cluster admin principals (API mode, not aws-auth ConfigMap)
  • Envelope encryption of Kubernetes Secrets using a customer-managed KMS key
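Taken together, the cluster settings above map onto the AWS provider's aws_eks_cluster resource roughly as follows (a sketch with placeholder values, not the module's real code):

```hcl
# Hedged sketch: attribute names follow the aws_eks_cluster resource,
# but variable and resource names are assumptions.
resource "aws_eks_cluster" "this" {
  name     = var.deployment_name
  role_arn = aws_iam_role.cluster.arn

  vpc_config {
    subnet_ids              = var.control_plane_subnet_ids
    endpoint_public_access  = true
    endpoint_private_access = true
    public_access_cidrs     = var.cluster_endpoint_public_access_cidrs
  }

  # Envelope encryption of Kubernetes Secrets with a customer-managed KMS key.
  encryption_config {
    provider {
      key_arn = aws_kms_key.eks.arn
    }
    resources = ["secrets"]
  }

  # API-mode access entries rather than the aws-auth ConfigMap.
  access_config {
    authentication_mode = "API"
  }
}
```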

Node groups

  • CPU node group (always created): runs the Poolside platform services, ALB controller, External Secrets Operator, and cluster addons
  • GPU node group (optional, full profile only): runs model inference workloads via the NVIDIA GPU Operator. Supports EC2 capacity reservations for guaranteed GPU instance availability.
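One way the capacity-reservation support could be wired is a launch template pinned to a reservation, referenced by the managed node group (variable names and the instance type are placeholders, not module inputs):

```hcl
# Illustrative only: a GPU node group targeting an EC2 capacity reservation.
resource "aws_launch_template" "gpu" {
  name_prefix = "${var.deployment_name}-gpu-"

  capacity_reservation_specification {
    capacity_reservation_target {
      capacity_reservation_id = var.gpu_capacity_reservation_id
    }
  }
}

resource "aws_eks_node_group" "gpu" {
  cluster_name    = aws_eks_cluster.this.name
  node_group_name = "gpu"
  node_role_arn   = aws_iam_role.gpu_nodes.arn
  subnet_ids      = var.private_worker_subnet_ids
  instance_types  = ["p5.48xlarge"] # placeholder GPU instance type

  launch_template {
    id      = aws_launch_template.gpu.id
    version = "$Latest"
  }

  scaling_config {
    desired_size = 1
    min_size     = 0
    max_size     = 2
  }
}
```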

Data stores

  • RDS PostgreSQL: application database with AWS-managed master password (stored in Secrets Manager, never in Terraform state), multi-AZ by default, Performance Insights enabled, CloudWatch log exports
  • S3 buckets: data bucket (model artifacts, telemetry, repositories) and access log bucket. Both SSE-KMS encrypted with public access blocked.
  • ECR repositories: one per container image the Helm chart needs, namespaced under the deployment name
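The managed-master-password pattern on the RDS instance corresponds to the manage_master_user_password flag, which keeps the credential in Secrets Manager and out of Terraform state. A sketch with illustrative identifiers:

```hcl
# Sketch only: identifiers and sizes are placeholders.
resource "aws_db_instance" "app" {
  identifier        = "${var.deployment_name}-db"
  engine            = "postgres"
  instance_class    = var.db_instance_class
  allocated_storage = 100

  # AWS generates and stores the master password in Secrets Manager,
  # so it never appears in Terraform state.
  manage_master_user_password = true
  username                    = "poolside"

  multi_az                        = true
  storage_encrypted               = true
  kms_key_id                      = aws_kms_key.rds.arn
  performance_insights_enabled    = true
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
}
```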

Security

  • KMS keys: EKS secret encryption, RDS storage encryption, S3 object encryption, EBS volume encryption, and application-level encryption (used by core-api for encrypting sensitive data in the database)
  • IAM roles with least-privilege policies: node group instance roles, IRSA workload roles (core-api, inference, external-secrets, ALB controller, EBS CSI, VPC CNI), plus the EKS cluster role
  • Permissions boundary support: an optional permissions_boundary_arn threads through every IAM role for regulated environments
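Threading an optional boundary through every role typically looks like this (role and policy names are illustrative):

```hcl
# Illustrative: each IAM role accepts the same optional boundary ARN.
variable "permissions_boundary_arn" {
  type    = string
  default = null
}

resource "aws_iam_role" "core_api_irsa" {
  name                 = "${var.deployment_name}-core-api"
  assume_role_policy   = data.aws_iam_policy_document.core_api_trust.json
  permissions_boundary = var.permissions_boundary_arn # null means no boundary
}
```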

Cluster prerequisites (Kubernetes resources created by Terraform)

Before Helm runs, Terraform creates the Kubernetes resources the chart expects to find:

  • Namespaces: poolside (platform) and poolside-models (inference)
  • gp3 StorageClass (cluster default): EBS-backed, KMS-encrypted, WaitForFirstConsumer binding
  • Custom CA bundle ConfigMap (optional): for environments with TLS-intercepting proxies or private PKI
  • AWS Load Balancer Controller: Helm-managed, creates ALBs from Kubernetes Ingress resources
  • External Secrets Operator: syncs the RDS master password from Secrets Manager into a Kubernetes Secret (poolside-db-secret)
  • NVIDIA GPU Operator (full profile only): installs GPU device drivers and the Kubernetes device plugin
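Because Terraform creates these objects, the default gp3 StorageClass can be sketched with the Terraform kubernetes provider (assumed here; parameter names follow the EBS CSI driver's StorageClass parameters):

```hcl
# Sketch of the cluster-default gp3 StorageClass described above.
resource "kubernetes_storage_class_v1" "gp3" {
  metadata {
    name = "gp3"
    annotations = {
      "storageclass.kubernetes.io/is-default-class" = "true"
    }
  }

  storage_provisioner    = "ebs.csi.aws.com"
  volume_binding_mode    = "WaitForFirstConsumer"
  allow_volume_expansion = true

  parameters = {
    type      = "gp3"
    encrypted = "true"
    kmsKeyId  = aws_kms_key.ebs.arn
  }
}
```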

Pods reach KMS and S3 via IRSA, so no static-key or AWS-credentials Kubernetes Secrets are created.

What Helm installs

Terraform also owns the Helm installs, via the helm-wrapper module. Two releases come from the Poolside bundle:

  • poolside-deployment: platform workloads:
    • core-api: Poolside API server (chat, completions, agent orchestration, repository indexing)
    • core-api-worker: async worker pool running the same forge_api binary in worker mode
    • core-api-temporal-server: embedded Temporal server
    • web-assistant: Svelte SPA frontend served by Caddy
    • public-docs: static docs served at the same ALB under /docs
  • inference-stack (full profile only): one deployment per enabled model subchart, plus an envoy proxy and an extproc sidecar for request dispatch

Values for both charts are composed by the poolside-values module from reference-stack outputs (database endpoints, S3 bucket names, KMS ARNs, ECR registry URIs, IRSA role ARNs). The three-layer composition (reference-stack → poolside-values → helm-wrapper) isolates chart-specific knowledge to a single module; see customizing.md for the operator-visible knobs.
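The composition could be pictured like this (module sources, output names, and attributes are illustrative stand-ins, not the real interface):

```hcl
# Hedged sketch of the layering: reference-stack outputs feed the
# values module, whose rendered YAML feeds the Helm release.
module "poolside_values" {
  source = "./modules/poolside-values" # assumed path

  database_endpoint = module.reference_stack.rds_endpoint
  data_bucket       = module.reference_stack.data_bucket_name
  kms_key_arn       = module.reference_stack.app_kms_key_arn
  ecr_registry      = module.reference_stack.ecr_registry_uri
  irsa_role_arns    = module.reference_stack.irsa_role_arns
}

resource "helm_release" "poolside_deployment" {
  name      = "poolside-deployment"
  chart     = var.poolside_chart_path
  namespace = "poolside"

  values = [module.poolside_values.platform_values_yaml]
}
```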

All application pods use IRSA for AWS API access (S3, KMS).

Ingress

The core-api, web-assistant, and public-docs workloads share a single internet-facing ALB via the group.name annotation. TLS termination uses an ACM certificate (looked up by domain name). After deployment, the operator must create a DNS record that points the public hostname at the ALB domain name.
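The certificate lookup and the operator's post-deploy DNS step could be expressed as follows (a sketch, not part of the reference stack; the record is shown as Route 53, though the document only requires some DNS record):

```hcl
# Illustrative: look up an already-issued certificate by domain.
data "aws_acm_certificate" "public" {
  domain   = var.public_hostname
  statuses = ["ISSUED"]
}

# Operator-side step: point the public hostname at the ALB hostname.
resource "aws_route53_record" "app" {
  zone_id = var.hosted_zone_id # assumed variable
  name    = var.public_hostname
  type    = "CNAME"
  ttl     = 300
  records = [var.alb_dns_name] # the ALB hostname the controller reports
}
```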

Authentication

The platform supports any OIDC-compliant identity provider. Optionally, Terraform can create an AWS Cognito user pool and client. The Cognito endpoint, client ID, and client secret are output for the first-time IdP binding in the Poolside Console.
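A minimal sketch of the optional Cognito resources and their outputs (pool name, callback path, and OAuth settings are assumptions):

```hcl
# Sketch only: the optional Cognito user pool and app client.
resource "aws_cognito_user_pool" "idp" {
  name = "${var.deployment_name}-users"
}

resource "aws_cognito_user_pool_client" "console" {
  name                                 = "poolside-console"
  user_pool_id                         = aws_cognito_user_pool.idp.id
  generate_secret                      = true
  allowed_oauth_flows                  = ["code"]
  allowed_oauth_scopes                 = ["openid", "email", "profile"]
  allowed_oauth_flows_user_pool_client = true
  callback_urls                        = ["https://${var.public_hostname}/auth/callback"] # placeholder path
}

# Surfaced for the first-time IdP binding in the Poolside Console.
output "cognito_client_secret" {
  value     = aws_cognito_user_pool_client.console.client_secret
  sensitive = true
}
```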

High availability

  • Multi-AZ by default: VPC subnets, NAT gateways, EKS control plane, and RDS are spread across availability zones
  • Single NAT gateway option: available via single_nat_gateway = true for cost-sensitive deployments (trades AZ-level NAT redundancy for lower cost)
  • RDS Multi-AZ: synchronous standby replica with automatic failover (RTO < 60 seconds). Can be disabled for non-production use.
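In tfvars form, the two cost/availability levers might look like this (only single_nat_gateway is named by this document; the RDS variable name is an assumption):

```hcl
single_nat_gateway = true  # one NAT gateway instead of one per AZ
db_multi_az        = false # assumed variable name; disables the RDS standby
```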

Architectural limitations

A few deliberate design choices that affect operator workflow:

  • Public EKS API endpoint required. The cluster is created with both a public API endpoint (gated by cluster_endpoint_public_access_cidrs) and a private endpoint. Terraform talks to the public one; in-cluster workloads use the private one. Fully-private API access isn't supported by the reference architecture because the documented operator workflow assumes Helm and Terraform run from outside the VPC. If your organization requires private-only API access, you'll need to run Terraform and Helm from inside the VPC (bastion, peered VPC, or transit gateway).
  • Single ALB, HTTPS-only. The Poolside Console, core-api, and public-docs share one internet-facing ALB joined via the group.name annotation. There is no HTTP-only fallback. An ACM certificate covering public_hostname must already be issued in var.region before terraform plan runs.