Deployment guide

End-to-end walkthrough from an unconfigured AWS account to a running Poolside platform. Assumes you've read prerequisites.md and have tools + creds + bundle + (for full profile) model checkpoints in place.

Choose a profile

Pick platform-only or full — see the profile comparison in quickstart for the component-by-component breakdown and rough hourly cost.

Each profile has an example root in public/examples/<profile>/. Copy the one you want; don't edit the in-tree example directly.

Bootstrap steps

# 1. Stage your working dir outside the repo (so updates to the reference architecture
#    don't conflict with your customizations)
cp -r path/to/aws-reference-architecture/public/examples/full ~/my-poolside-deployment
cd ~/my-poolside-deployment

# 2. Fill in your variables
cp terraform.tfvars.example terraform.tfvars
$EDITOR terraform.tfvars

# 3. Export your AWS profile (NOT pinned in providers.tf on purpose)
export AWS_PROFILE=my-admin-profile
aws sts get-caller-identity

# 4. Init + plan + apply, STAGED
terraform init
terraform plan     # review carefully on first apply
terraform apply

The staged-rollout pattern (full profile)

The full example defaults BOTH Helm installs OFF via install_poolside_deployment = false and install_inference_stack = false. First apply creates infrastructure, pushes container images to ECR, and uploads model checkpoints, with no Helm releases yet. This gives you a clean stopping point to verify each layer before installing the application on top.
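
Before each apply, a quick grep of terraform.tfvars confirms which stage you're about to run (a sketch; empty output means both flags are still at their false defaults):

# Which stage am I about to apply? No output = both flags at default (false)
grep -E 'install_(poolside_deployment|inference_stack)' terraform.tfvars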

Stage 1: infra + ECR + model uploads

# Both install_* flags stay at their defaults (false)
terraform apply

Verify before moving on:

# ECR repos populated
aws ecr describe-repositories --region us-east-2 \
  --query 'repositories[?starts_with(repositoryName, `<your-deployment>/`)].repositoryName'

aws ecr list-images --region us-east-2 \
  --repository-name <your-deployment>/atlas

# Model checkpoints uploaded, marker objects present
aws s3 ls s3://<your-deployment>-models/models/checkpoints/
aws s3api head-object \
  --bucket <your-deployment>-models \
  --key models/checkpoints/malibu-v2.20251021_int4/.checkpoint-complete \
  --query Metadata

# Cluster up, kubectl works
eval "$(terraform output -raw kubeconfig_command)"
kubectl get nodes
kubectl get ns

Stage 2: install the platform

# Set install_poolside_deployment = true in terraform.tfvars
terraform apply
kubectl get pods -n poolside

Wait for all poolside pods to be Ready. If a pod is CrashLooping, see the Troubleshooting section below.
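
Rather than polling by hand, kubectl wait can block until everything is Ready (a sketch; lengthen the timeout if your cluster pulls images slowly):

# Block until every pod in the poolside namespace reports Ready (up to 15 minutes)
kubectl wait --for=condition=Ready pods --all -n poolside --timeout=15m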

Stage 3: install inference

# Set install_inference_stack = true in terraform.tfvars
terraform apply
kubectl get pods -n poolside-models
kubectl logs -n poolside-models <inference-pod> -c init

The init container streams 20-75GB of model checkpoint data from S3 on first startup. Give it time. The default Helm timeout for inference is 30 minutes.
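
To keep an eye on the download instead of re-running the logs command (a sketch):

# Stream the init container's output; -f follows until the container exits
kubectl logs -n poolside-models <inference-pod> -c init -f

# Or watch pod status move from Init to Running as downloads complete
kubectl get pods -n poolside-models -w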

Or skip staging and set both install_* to true in the first apply. Fine for routine redeployments where the code + bundle + models are known-good.

Platform-only profile

Platform-only installs the poolside-deployment Helm chart via Terraform (gated by the same install_poolside_deployment flag as the full profile) and skips inference-stack entirely. The staged rollout collapses to two stages instead of three: stage 1 = infra + ECR (install_poolside_deployment = false), stage 2 = platform Helm release (install_poolside_deployment = true). The deployed platform connects to an external OpenAI-compatible model API rather than running local inference; that endpoint is configured in the Poolside Console after install, not via Terraform.
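
Condensed, that two-stage flow looks like this (a sketch; assuming the example root is public/examples/platform-only per the <profile> convention above):

# Stage 1: infra + ECR; install_poolside_deployment stays at its default (false)
cp -r path/to/aws-reference-architecture/public/examples/platform-only ~/my-poolside-deployment
cd ~/my-poolside-deployment
cp terraform.tfvars.example terraform.tfvars && $EDITOR terraform.tfvars
terraform init
terraform apply

# Stage 2: set install_poolside_deployment = true in terraform.tfvars, then
terraform apply
kubectl get pods -n poolside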

Post-install: public DNS and IdP

Once Helm is installed:

  1. Get the ALB hostname:
    kubectl get ingress -n poolside
  2. Create a Route 53 A record (Alias) pointing your public_hostname to the ALB (CLI sketch below).
  3. Navigate to https://<public_hostname> and follow the IdP binding prompts.
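
Step 2 can be scripted with the AWS CLI (a sketch; <alb-dns-name> is the hostname from the ingress above, and <alb-zone-id> is the ALB's canonical hosted zone ID, exposed by aws elbv2 describe-load-balancers as CanonicalHostedZoneId):

# UPSERT an alias A record for <public_hostname> pointing at the ALB
aws route53 change-resource-record-sets \
  --hosted-zone-id <your-zone-id> \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "<public_hostname>",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "<alb-zone-id>",
          "DNSName": "<alb-dns-name>",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'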

If you set enable_cognito = true, the Cognito endpoint, client ID, and client secret are Terraform outputs:

terraform output -raw cognito_user_pool_endpoint
terraform output -raw cognito_user_pool_client_id
terraform output -raw cognito_user_pool_client_secret

Re-applies

Routine terraform apply against an existing deployment:

  • No code change → no-op or very small plan (verifiable with the drift check below)
  • New bundle: update containers_dir and bundle_root → ECR gets new images pushed (skip-if-exists on identical tags), Helm gets re-released with new chart versions
  • New model checkpoint: give the new tarball a distinct filename (the version is already part of the alias-prefix convention, e.g. malibu-v2.20251021_int4.tar → malibu-v2.20260101.tar) and Terraform plans the upload. In-place replacement of an existing tarball is not detected at plan time. See model-checkpoints.md.
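
The first bullet is easy to check mechanically: terraform plan -detailed-exitcode exits 0 for an empty plan, 2 when changes are pending, 1 on error (a sketch):

# Drift check: 0 = no changes, 2 = changes pending, 1 = error
terraform plan -detailed-exitcode
echo "plan exit code: $?"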

Teardown

# 1. Uninstall Helm first (clean removal of ALB, finalizers)
#    Set install_inference_stack = false, install_poolside_deployment = false
terraform apply

# 2. Destroy the rest
terraform destroy

Destroy typically takes 15-25 minutes. Common hangs:

  • EKS waiting on finalizer: a Kubernetes namespace is stuck terminating. kubectl get ns --field-selector status.phase=Terminating to find it; kubectl get all -n <ns> to find the holdout resource.
  • ECR / S3 non-empty: if ecr_force_destroy_repositories or s3_force_destroy_buckets wasn't set true, Terraform refuses to delete. Either flip those flags for the destroy, or empty the resources manually first (sketch below).
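
Two ways out of the non-empty case (a sketch; bucket name follows this guide's <your-deployment> convention):

# Option A: set the force-destroy flags in terraform.tfvars, apply once so they
# land in state, then destroy:
#   ecr_force_destroy_repositories = true
#   s3_force_destroy_buckets       = true
terraform apply && terraform destroy

# Option B: empty the models bucket by hand, then retry the destroy
aws s3 rm s3://<your-deployment>-models --recursive
terraform destroy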

Troubleshooting

terraform plan fails with precondition errors

The reference architecture adds plan-time preconditions that catch common misconfigurations early. The error message tells you what to fix:

  • "bundle is missing required images": atlas/envoy/gateway/forge_api (or forge_api/web-assistant/public-docs for the deployment chart) aren't all in your containers_dir. Re-extract the bundle.
  • "bundle_root does not contain charts/...": typo in bundle_root, or bundle isn't fully extracted.

Helm install fails with ImagePullBackOff

Either:

  • ECR wasn't populated (stage 1 didn't finish): rerun terraform apply
  • Image tags in the bundle don't match what the chart references: verify terraform output ecr_images against the chart's image names (sketch below)
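
A quick way to line the two up (a sketch; the grep surfaces the image references the failing pod actually resolved):

# What Terraform pushed vs. what the pod is pulling
terraform output ecr_images
kubectl describe pod -n poolside <failing-pod> | grep -i 'image:'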

Inference pod stuck on init container

First startup downloads 20-75GB from S3. If it's stuck more than ~30 minutes:

  • Check kubectl logs -n poolside-models <pod> -c init for S3 errors
  • Check the inference pod's IRSA role has s3:GetObject on the models bucket: aws iam get-role-policy --role-name <deployment>-inference-pod
  • Check the models S3 bucket has the checkpoint's .checkpoint-complete marker; if missing, the upload didn't complete

ALB not provisioning

Check the ALB controller pods in kube-system. Most commonly, the controller's mutating webhook wasn't ready when the chart's Services were created. Rerun terraform apply to retry.
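
To inspect the controller directly (a sketch, assuming it was installed under the standard aws-load-balancer-controller chart labels):

# Controller pod health, then recent logs for webhook/provisioning errors
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
kubectl logs -n kube-system deployment/aws-load-balancer-controller --tail=50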

Cognito "Invalid redirect URI"

The callback URLs are computed from public_hostname. If you changed it after first apply, re-run terraform apply so Cognito updates.
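
To confirm what Cognito currently has on file (a sketch; <pool-id> is the user pool ID, the suffix of the cognito_user_pool_endpoint output):

# Callback URLs registered on the app client
aws cognito-idp describe-user-pool-client \
  --user-pool-id <pool-id> \
  --client-id "$(terraform output -raw cognito_user_pool_client_id)" \
  --query 'UserPoolClient.CallbackURLs'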

Staying up to date

Re-read architecture.md after each release of this reference architecture for the list of what's new.