Customizing your deployment

How to tune the reference architecture for your environment without forking it.

The layering model

The reference architecture composes three modules in your operator root:

├── reference-stack   — infra (VPC, EKS, RDS, S3, IAM, ECR, model uploads)
├── poolside-values   — chart-specific glue (YAML values for Poolside Helm charts)
└── helm-wrapper (x2) — chart-agnostic helm_release runner

Most customizations happen via reference-stack variables (infra sizing and toggles) or poolside-values inputs (chart-level overrides). You rarely touch the helm-wrapper calls.
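
A skeletal view of the operator root (a sketch only — the module labels, source paths, and inputs below are illustrative, not the exact contents of the example roots):

# main.tf — the three-layer composition, roughly
module "stack" {
  source = "../modules/reference-stack"   # all the AWS infra
  # ...deployment_name, region, sizing knobs, toggles...
}

module "values" {
  source = "../modules/poolside-values"   # renders chart values from stack outputs
  # ...
}

module "platform_charts" {
  source = "../modules/helm-wrapper"      # instantiated twice, hence the "x2" above
  # ...
}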

How to customize in practice

The examples/full and examples/platform-only roots expose a small set of variables (deployment_name, region, public_hostname, containers_dir, etc.). That's enough to stand up a deployment from a tfvars file. Most reference-stack inputs fall to their module defaults, which you won't see in the example's variables.tf.

To customize a knob that isn't already a variable, copy the example root and pick one of two patterns:

  • Edit module "stack" inline: add the input directly to the block. Good for a one-shot deployment.
  • Add a passthrough variable: declare a new variable in variables.tf, wire it into module "stack", surface it in terraform.tfvars. Good when multiple deployments share a root.

The full example already has commented-out GPU-sizing lines in module "stack" showing the inline pattern.
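
A minimal sketch of the passthrough pattern, using gpu_desired_size as the knob being surfaced (the module source path is whatever your copied root uses):

# variables.tf — declare the passthrough variable
variable "gpu_desired_size" {
  type        = number
  default     = 2
  description = "Desired GPU node count, passed through to the stack module."
}

# main.tf — wire it into the existing module "stack" block
module "stack" {
  source = "../../modules/reference-stack"   # path per your copy
  # ...existing inputs unchanged...
  gpu_desired_size = var.gpu_desired_size
}

# terraform.tfvars — set it per deployment
gpu_desired_size = 4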

Common customizations

Override a model's GPU count (or any other per-model chart field)

The example surfaces a single inference_models variable mirroring the chart's inference.models.<key> schema. To override a model's GPU count or other fields, declare the full set of models you want and pass whatever per-model fields the chart accepts:

inference_models = {
  malibu      = { model = "s3://...", gpus = 4 }
  point       = { model = "s3://...", gpus = 2 }
  "laguna-xs" = { model = "s3://...", gpus = 1 }
}

Keys become chart Deployment names verbatim. For well-known aliases (malibu, point, laguna-m, laguna-xs, laguna), missing optional fields (modelName, modelType, gpus) fall back to built-in defaults. For unknown keys, the operator must supply all required fields.

Other fields you can pass:

inference_models = {
  malibu = {
    model          = "s3://..."
    gpus           = 4
    modelName      = "MyOrg-Malibu-Variant"
    modelType      = "agent"
    modelExtraArgs = {
      "distributed-executor-backend" = "mp"
    }
  }
}

If inference_models is left null (the example's default), the example auto-derives one entry per uploaded tarball using the first-hyphen rule described in model-checkpoints.md.

Scale the CPU node group

cpu_instance_type       = "m5.8xlarge"    # default m5.4xlarge
cpu_desired_size        = 5
cpu_max_size            = 10
cpu_ebs_volume_size_gib = 200

Turn GPUs off entirely (cost control)

# Still creates node group + GPU Operator, but desired size = 0
gpu_desired_size = 0

# Or disable GPU provisioning entirely
enable_gpu_node_group = false

Note: enable_gpu_node_group = false means no inference workloads will schedule even if tarballs exist. For platform-only (no inference) deployments, use the platform-only profile instead. It's cheaper and skips several other GPU-adjacent resources.

Use a different GPU instance type

gpu_instance_type           = "p5.48xlarge"   # default p5e.48xlarge
gpu_capacity_reservation_id = "cr-xxxxxxxxxx" # pin to a reservation

p5e.48xlarge is the recommended minimum. Smaller GPUs (p4, g5) may not have enough memory for the Malibu/Point models.

Restrict EKS public API access

cluster_endpoint_public_access_cidrs = [
  "203.0.113.0/24",   # your office VPN
  "198.51.100.5/32",  # your bastion
]

The list must be non-empty (an empty list fails plan). To open the API to the entire internet, pass ["0.0.0.0/0"] explicitly. This is discouraged for any deployment with production workloads.

Add more admin principals

admin_principal_arns = [
  data.aws_iam_session_context.current.issuer_arn,
  "arn:aws:iam::<acct>:role/AWSReservedSSO_AdministratorAccess_*",
]

These get EKS cluster-admin access via Access Entries (not the legacy aws-auth ConfigMap).

Lockout guard

allow_locked_out_cluster = false   # default

When false (the default), plan fails if the running principal isn't in admin_principal_arns. This prevents you from creating a cluster you can't access. Set it to true only if you're deliberately handing control of the cluster over to another role.

Custom CA bundle

If your environment intercepts TLS (corporate proxy, private PKI):

custom_ca_bundle_pem = file("./certs/ca-chain.pem")

Terraform creates a ConfigMap (custom-ca-bundle in the poolside namespace), and the chart mounts it into pods at /etc/ssl/certs/custom-ca-bundle.crt.

Permissions boundaries (regulated environments)

permissions_boundary_arn = "arn:aws:iam::<acct>:policy/my-boundary"

Every IAM role created by the reference architecture gets this boundary attached:

  • the direct roles in modules/iam/ (CPU/GPU node group instance roles, core-api pod, inference pod, External Secrets Operator)
  • the module-managed addon roles (ALB controller, VPC CNI, EBS CSI driver, via terraform-aws-modules/iam/aws)
  • the EKS cluster role created by the upstream terraform-aws-modules/eks/aws module

This is common in PubSec / FedRAMP environments where SCPs mandate a boundary on every new role. The default (empty) attaches no boundary.

IAM name prefix (regulated environments)

iam_name_prefix = "MyOrg-"

Prepends the prefix to every IAM role and policy name.

ACM cert for ingress TLS

The reference architecture is HTTPS-only: the ALB terminates TLS and the ACM cert is mandatory. Both example roots look up the cert by public_hostname:

public_hostname = "poolside.example.com"

public_hostname is required (no default, no empty value allowed) and must be a valid DNS hostname. An ACM cert covering it must be issued in var.region before terraform apply. The data "aws_acm_certificate" lookup fails plan otherwise. There is no HTTP-only fallback.
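
The lookup is roughly this (a sketch; the example roots' exact arguments may differ):

data "aws_acm_certificate" "ingress" {
  domain   = var.public_hostname
  statuses = ["ISSUED"]   # plan fails here if no issued cert covers the hostname
}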

After apply, point the public hostname at the ALB by creating a Route 53 A (Alias) record. The ALB DNS name is on the ingress resource: kubectl get ingress -n poolside.
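
If the hosted zone is also managed in Terraform, the record can be created in HCL — a sketch with placeholder values (note the alias zone_id is the ALB hosted-zone ID for your region, not your own zone's):

data "aws_route53_zone" "main" {
  name = "example.com"
}

resource "aws_route53_record" "poolside" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "poolside.example.com"
  type    = "A"

  alias {
    name                   = "<ALB DNS name from kubectl get ingress>"
    zone_id                = "<ALB hosted-zone ID for your region>"
    evaluate_target_health = true
  }
}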

Less common customizations

BYO models bucket

See model-checkpoints.md Mode B.

BYO bundle-container-registry (private registry other than ECR)

Not supported. The reference architecture is opinionated: ECR only. If you need a different registry, the pattern is:

  1. Fork the reference architecture, replace modules/ecr with your own pushing module
  2. Keep everything else
  3. Point poolside-values's ecr input at your registry's outputs

Non-trivial. Open an issue if this matters to you.

Disable the staged-rollout gates

The default install_poolside_deployment = false and install_inference_stack = false are development-workflow niceties. For routine redeployments where the code is stable, just set both to true in your tfvars.
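
That is:

install_poolside_deployment = true
install_inference_stack     = true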

Database sizing

database_instance_class        = "db.m7g.xlarge"  # default
database_multi_az              = true             # default true
database_allocated_storage_gib = 200              # default 64

The default db.m7g.xlarge Multi-AZ instance is sized for production use. Override only if you need to scale further up, or down for non-production environments. database_multi_az = true is strongly recommended for any deployment that has an SLA.

S3 transfer acceleration for model uploads

use_s3_transfer_acceleration = true

Enables the transfer-acceleration endpoint on the models bucket. Worth it when the operator's machine is geographically distant from the models bucket's region (e.g. Europe ↔ us-east-2). It costs a few cents per GB, which is negligible next to the upload time it saves.

EKS AMI overrides

Default: AWS-managed EKS-optimized AL2023 AMIs (AL2023_x86_64_STANDARD for CPU nodes, AL2023_x86_64_NVIDIA for GPU nodes). Two override patterns, in order of preference:

# Pin to a specific managed AMI release (still managed, version-locked)
cpu_ami_release_version = "1.32.0-20251120"
gpu_ami_release_version = "1.32.0-20251120"

# OR pick a different managed AMI family (e.g. Bottlerocket, Graviton)
cpu_ami_type = "BOTTLEROCKET_x86_64"   # default AL2023_x86_64_STANDARD

Bring-your-own AMI is also supported, but you own the lifecycle:

cpu_custom_ami_id    = "ami-0123456789abcdef0"
cpu_custom_user_data = file("./userdata/cpu-bootstrap.sh")

When cpu_custom_ami_id is set, AWS-managed AMI updates are disabled. You're responsible for kubelet, containerd, CNI binaries, SSM agent, and any custom-CA injection. cpu_custom_user_data is required (pass "" only if your AMI self-bootstraps, which is rare).

The same four knobs exist with a gpu_ prefix. A custom GPU AMI must include working NVIDIA drivers, nvidia-container-toolkit, and the runtime setup the device plugin expects. The AL2023 NVIDIA AMI ships all of this; a barebones custom AMI will boot but pods will fail to schedule because the GPUs aren't visible.

Stick with managed AMIs unless you have a hard requirement (FIPS, custom hardening baseline, internal-build-promotion policy). With custom AMIs, keeping every EKS upgrade working becomes your problem.

Logging / observability (BYO)

The reference architecture does not install a cluster-wide log shipper or metrics stack. Pick whatever fits your existing observability tooling. Two data sources are already wired and ready to consume:

  • EKS control-plane logs go to CloudWatch via the upstream EKS module's cluster_enabled_log_types / create_cloudwatch_log_group. See control_plane_log_types in modules/eks/variables.tf to pick which streams (api, audit, authenticator, controllerManager, scheduler) get enabled; a sketch follows this list.
  • RDS logs export to CloudWatch via enabled_cloudwatch_logs_exports in modules/data-stores.
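
A sketch, assuming control_plane_log_types is surfaced at your root via the passthrough pattern above (it's a module variable in modules/eks/, not necessarily a root one):

control_plane_log_types = ["api", "audit", "authenticator"]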

For pod and node logs, layer on whichever shipper your team already runs:

  • CloudWatch Container Insights / Fluent Bit: AWS-native, easiest if you're already in CloudWatch.
  • Managed offering (Datadog, New Relic, Splunk): drop in their agent DaemonSet against the same EKS cluster.
  • Self-hosted (Vector, Fluent Bit, Promtail → Loki / OpenSearch / S3): install via separate Helm chart.

Poolside workloads log to stdout/stderr; any of the above will pick them up. The reference architecture deliberately doesn't pick one for you because operators almost always have an organizational standard already.

What you can't customize (by design)

  • Chart-internal values like pod resource requests, probe configs, env vars, autoscaling on inference-envoy/inference-extproc, and anything outside inference.models.<key>. Those are chart defaults. If you need to override one, use a custom Helm values overlay at install time, but doing so bypasses this reference architecture's poolside-values module. (inference.models.<key> itself is fully operator-controlled via inference_models; see above.)
  • Bundle layout. containers/*.tar and charts/<chart>/ are the expected structure; re-extract your bundle if this looks different.

Where to next

  • GPU cost control: gpu_desired_size = 0 (above) plus the capacity-reservation knobs (gpu_capacity_reservation_id, gpu_capacity_reservation_resource_group_arn, gpu_use_capacity_block) in modules/reference-stack/variables.tf.
  • Regulated environments: permissions_boundary_arn + iam_name_prefix + allow_locked_out_cluster = false are the core knobs (above).
  • Observability: see Logging / observability (BYO) above.
  • Want to configure something not listed here? Open an issue. This reference architecture's variable surface is intentionally small; we'd rather add a specific variable than point people at a generic escape hatch.