End-to-end walkthrough from an unconfigured AWS account to a running
Poolside platform. Assumes you've read
prerequisites.md and have tools + creds +
bundle + (for full profile) model checkpoints in place.
Pick platform-only or full — see the profile comparison in quickstart for the component-by-component breakdown and rough hourly cost.
Each profile has an example root in public/examples/<profile>/. Copy
the one you want; don't edit the in-tree example directly.
# 1. Stage your working dir outside the repo (so updates to the reference architecture
# don't conflict with your customizations)
cp -r path/to/aws-reference-architecture/public/examples/full ~/my-poolside-deployment
cd ~/my-poolside-deployment
# 2. Fill in your variables
cp terraform.tfvars.example terraform.tfvars
$EDITOR terraform.tfvars
# 3. Export your AWS profile (NOT pinned in providers.tf on purpose)
export AWS_PROFILE=my-admin-profile
aws sts get-caller-identity
# 4. Init + plan + apply, STAGED
terraform init
terraform plan # review carefully on first apply
terraform apply

The full example defaults BOTH Helm installs OFF via
install_poolside_deployment = false and install_inference_stack = false. First apply creates infrastructure, pushes container images to
ECR, and uploads model checkpoints, with no Helm releases yet. This
gives you a clean stopping point to verify each layer before
installing the application on top.
# Both install_* flags stay at their defaults (false)
terraform apply

Verify before moving on:
# ECR repos populated
aws ecr describe-repositories --region us-east-2 \
--query 'repositories[?starts_with(repositoryName, `<your-deployment>/`)].repositoryName'
aws ecr list-images --region us-east-2 \
--repository-name <your-deployment>/atlas
# Model checkpoints uploaded, marker objects present
aws s3 ls s3://<your-deployment>-models/models/checkpoints/
aws s3api head-object \
--bucket <your-deployment>-models \
--key models/checkpoints/malibu-v2.20251021_int4/.checkpoint-complete \
--query Metadata
# Cluster up, kubectl works
eval "$(terraform output -raw kubeconfig_command)"
kubectl get nodes
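# Optionally confirm that no Helm releases exist yet -- stage 1 should
# install none (assumes the helm CLI is available locally)
helm list --all-namespaces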
kubectl get ns

# Set install_poolside_deployment = true in terraform.tfvars
terraform apply
kubectl get pods -n poolside

Wait for all poolside pods to be Ready. If a pod is CrashLooping,
see the Troubleshooting section below.
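If you'd rather block than poll, kubectl can wait for every pod in the
namespace to report Ready (a sketch; adjust the timeout to your cluster):

kubectl wait -n poolside --for=condition=Ready pod --all --timeout=15m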
# Set install_inference_stack = true in terraform.tfvars
terraform apply
kubectl get pods -n poolside-models
kubectl logs -n poolside-models <inference-pod> -c init

The init container streams 20-75GB of model checkpoint data from S3 on first startup. Give it time. The default Helm timeout for inference is 30 minutes.
Or skip staging and set both install_* to true in the first
apply. Fine for routine redeployments where the code + bundle + models
are known-good.
Platform-only installs the poolside-deployment Helm chart via
Terraform (gated by the same install_poolside_deployment flag as
the full profile) and skips inference-stack entirely. The staged
rollout collapses to two stages instead of three: stage 1 = infra +
ECR (install_poolside_deployment = false), stage 2 = platform Helm
release (install_poolside_deployment = true). The deployed
platform connects to an external OpenAI-compatible model API rather
than running local inference; that endpoint is configured in the
Poolside Console after install, not via Terraform.
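Concretely, the platform-only sequence looks something like this (a
sketch mirroring the staged commands above, minus the inference stage):

# Stage 1: infra + ECR (install_poolside_deployment stays false)
terraform apply
# Stage 2: set install_poolside_deployment = true in terraform.tfvars, then
terraform apply
kubectl get pods -n poolside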
Once Helm is installed:
- Get the ALB hostname:
kubectl get ingress -n poolside
- Create a Route 53 A record (Alias) pointing your
public_hostname to the ALB (see the CLI sketch after this list).
- Navigate to https://<public_hostname> and follow the IdP binding prompts.
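If you'd rather create the alias record from the CLI than the console,
something like the following works; the hosted zone ID and hostname are
placeholders, and the region matches the examples above:

# Look up the ALB hostname and its canonical hosted zone ID
# (assumes a single ingress in the poolside namespace)
ALB_DNS=$(kubectl get ingress -n poolside \
  -o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}')
ALB_ZONE=$(aws elbv2 describe-load-balancers --region us-east-2 \
  --query "LoadBalancers[?DNSName=='${ALB_DNS}'].CanonicalHostedZoneId" --output text)
# Upsert the alias A record in your own hosted zone (zone ID is a placeholder)
aws route53 change-resource-record-sets \
  --hosted-zone-id <your-hosted-zone-id> \
  --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{
      "Name":"<public_hostname>","Type":"A",
      "AliasTarget":{"HostedZoneId":"'"${ALB_ZONE}"'",
                     "DNSName":"'"${ALB_DNS}"'","EvaluateTargetHealth":false}}}]}'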
If you set enable_cognito = true, the Cognito endpoint, client ID,
and client secret are Terraform outputs:
terraform output -raw cognito_user_pool_endpoint
terraform output -raw cognito_user_pool_client_id
terraform output -raw cognito_user_pool_client_secret

Routine terraform apply against an existing deployment:
- No code change → no-op or very small plan
- New bundle: update containers_dir and bundle_root → ECR gets new
images pushed (skip-if-exists on identical tags), Helm gets
re-released with new chart versions
- New model checkpoint: give the new tarball a distinct filename
(the version is already part of the alias-prefix convention, e.g.
malibu-v2.20251021_int4.tar → malibu-v2.20260101.tar) and Terraform
plans the upload; see the marker check after this list. In-place
replacement of an existing tarball is not detected at plan time.
See model-checkpoints.md.
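After the apply, the same marker-object check from the staged
walkthrough confirms the new checkpoint finished uploading (the key
assumes the checkpoint prefix matches the new tarball's basename):

aws s3api head-object \
  --bucket <your-deployment>-models \
  --key models/checkpoints/malibu-v2.20260101/.checkpoint-complete \
  --query Metadata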
# 1. Uninstall Helm first (clean removal of ALB, finalizers)
# Set install_inference_stack = false, install_poolside_deployment = false
terraform apply
# 2. Destroy the rest
terraform destroy

Destroy typically takes 15-25 minutes. Common hangs:
- EKS waiting on a finalizer: a Kubernetes namespace is stuck
terminating. Use kubectl get ns --field-selector status.phase=Terminating
to find it, then kubectl get all -n <ns> to find the holdout resource.
- ECR / S3 non-empty: if ecr_force_destroy_repositories or
s3_force_destroy_buckets wasn't set to true, Terraform refuses to
delete. Either flip those flags for the destroy, or empty the
resources manually first (see the sketch after this list).
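If you hit the non-empty case and don't want to flip the force-destroy
flags, emptying by hand looks roughly like this (bucket and repository
names are placeholders; a versioned bucket also needs its object
versions deleted separately):

aws s3 rm s3://<your-deployment>-models --recursive
aws ecr batch-delete-image --region us-east-2 \
  --repository-name <your-deployment>/atlas \
  --image-ids "$(aws ecr list-images --region us-east-2 \
      --repository-name <your-deployment>/atlas --query 'imageIds' --output json)"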
The reference architecture adds plan-time preconditions that catch common misconfigurations early. The error message tells you what to fix:
- "bundle is missing required images":
atlas/envoy/gateway/forge_api(orforge_api/web-assistant/public-docsfor the deployment chart) aren't all in yourcontainers_dir. Re-extract the bundle. - "bundle_root does not contain charts/...": typo in
bundle_root, or bundle isn't fully extracted.
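A quick way to tell which case you're in is to list the extracted
bundle from the paths your tfvars point at (a sketch; the exact layout
is whatever your extraction produced):

ls <containers_dir>        # the required images (atlas, envoy, gateway, forge_api, ...) should appear here
ls <bundle_root>/charts/   # the Helm charts should appear here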
If pods can't pull their images, it's usually one of two things:
- ECR wasn't populated (stage 1 didn't finish): rerun terraform apply
- Image tags in the bundle don't match what the chart references:
verify terraform output ecr_images against the chart's image names
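To pin down which of the two it is, compare the image a failing pod is
pulling against what Terraform pushed (pod name and namespace are
placeholders):

kubectl describe pod -n poolside <failing-pod> | grep -i image
terraform output ecr_images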
First startup downloads 20-75GB from S3. If it's stuck more than ~30 minutes:
- Check kubectl logs -n poolside-models <pod> -c init for S3 errors
- Check the inference pod's IRSA role has s3:GetObject on the models
bucket: aws iam get-role-policy --role-name <deployment>-inference-pod
--policy-name <inline-policy-name>
- Check the models S3 bucket has the checkpoint's .checkpoint-complete
marker; if missing, the upload didn't complete
Check the ALB controller pods in kube-system. Most commonly the
controller's mutating webhook wasn't ready yet when the chart's
Services were created. Rerun terraform apply to retry.
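To look at the controller directly (the label selector assumes the
standard aws-load-balancer-controller chart labels):

kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller --tail=50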
The callback URLs are computed from public_hostname. If you changed
it after the first apply, re-run terraform apply so the Cognito
client's callback URLs update.
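To confirm the client picked up the new hostname, read the callback
URLs back from Cognito. Extracting the user pool ID from the endpoint
output this way is an assumption about that output's format:

POOL_ID=$(terraform output -raw cognito_user_pool_endpoint | awk -F/ '{print $NF}')
aws cognito-idp describe-user-pool-client --region us-east-2 \
  --user-pool-id "$POOL_ID" \
  --client-id "$(terraform output -raw cognito_user_pool_client_id)" \
  --query 'UserPoolClient.CallbackURLs'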
Re-read architecture.md after each release of this reference architecture for the list of what's new.