Model checkpoints

How Poolside model weights get from your disk to S3 to GPU pods.

The short version

For the full profile, the reference architecture ships a streaming uploader that reads *.tar files from a local directory and uploads each tarball's members directly to an S3 bucket it creates for you. The inference-stack Helm values reference the S3 URIs automatically.

The alternative, operator-managed buckets, is also supported.

Two modes

Mode A: architecture-managed (default)

# terraform.tfvars
enable_model_s3_upload = true   # default
checkpoints_dir        = "/home/ops/poolside/models"

What happens:

  1. The reference architecture creates a <deployment>-models S3 bucket (SSE-KMS, HTTPS-only, BucketOwnerEnforced).
  2. The model-checkpoints module scans checkpoints_dir for *.tar files.
  3. For each tarball, it streams members directly into S3 at s3://<bucket>/models/checkpoints/<tarball_stem>/<member>. No local extraction, no scratch disk.
  4. On success, a zero-byte .checkpoint-complete marker is written with the source tarball's SHA-256 in its S3 metadata. Subsequent applies HEAD this marker and short-circuit the upload if the SHA matches (see the sketch after this list).
  5. The inference IAM role is scoped to this bucket.
  6. inference-stack Helm values point each inference subchart's model: field at the right S3 path.
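
A minimal sketch of the step-4 short-circuit, assuming boto3 and a marker metadata key named sha256 (the key name and error handling here are assumptions; check the module source for the real ones):

import hashlib
import boto3

s3 = boto3.client("s3")

def checkpoint_is_current(bucket, prefix, tarball_path):
    # Hash the local tarball in chunks (a large checkpoint won't fit in memory).
    digest = hashlib.sha256()
    with open(tarball_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    try:
        head = s3.head_object(Bucket=bucket, Key=f"{prefix}/.checkpoint-complete")
    except s3.exceptions.ClientError:
        return False  # no marker yet: first upload for this checkpoint
    # True means this exact tarball already landed; the apply can skip re-uploading.
    return head.get("Metadata", {}).get("sha256") == digest.hexdigest()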

Mode B: BYO bucket (operator-managed)

# terraform.tfvars
enable_model_s3_upload = false
external_models_bucket = "your-existing-models-bucket"
# You populate the bucket yourself, out-of-band.

What happens:

  1. The reference architecture does NOT create a bucket or run any uploads.
  2. The inference IAM role gets s3:GetObject + s3:ListBucket on external_models_bucket.
  3. inference-stack Helm values don't auto-derive S3 URIs from tarballs. You supply them yourself by passing inference_models directly to the poolside-values module (one entry per model, with the s3:// URI in the model field). See customizing.md.

Cross-account buckets: not supported. The bucket must live in the same AWS account as the deployment. Cross-account requires a bucket policy on the source side granting the inference_pod role, plus KMS key cross-account grants. Both are out of scope for this reference architecture.

Tarball filename convention

The reference architecture splits each <file>.tar stem on the first hyphen to yield a model alias:

Tarball filename              Alias    Version
malibu-v2.20251021.tar        malibu   v2.20251021
malibu-v2.20251021_int4.tar   malibu   v2.20251021_int4
point-v2.20250403.tar         point    v2.20250403
laguna-v1.0.tar               laguna   v1.0

The alias identifies which inference-stack subchart this checkpoint feeds (e.g. inference-malibu, inference-point). The version (everything after the first hyphen) is carried verbatim in the S3 path:

s3://<deployment>-models/models/checkpoints/malibu-v2.20251021_int4/
  config.yaml
  model.safetensors
  pipeline_config.json
  recipe.yaml
  tokenizer/
    tokenizer.json
    chat_template.jinja
    ...
  .checkpoint-complete     ← zero-byte marker; source SHA in metadata

Keeping the full stem as the S3 directory name makes each checkpoint addressable by its version: swapping in a different tarball doesn't silently replace the S3 contents under the same path.
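
The split itself is a simple first-hyphen partition. In Python terms (illustrative only, not the module's actual code):

def alias_and_version(stem):
    # "malibu-v2.20251021_int4" -> ("malibu", "v2.20251021_int4")
    alias, _, version = stem.partition("-")
    return alias, version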

One tarball per alias when using the auto-derive helper

examples/full calls inference_models_from_uploads whenever you leave inference_models unset. That helper keys the map by the segment before the first hyphen of each tarball stem, so multiple tarballs whose stems share a first segment (e.g. laguna-12341a.tar and laguna-1234513.tar) collapse to the same key, and only one survives, unpredictably.
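
A toy illustration of the collision, with hypothetical stems:

stems = ["laguna-12341a", "laguna-1234513", "point-v2.20250403"]
models = {s.partition("-")[0]: s for s in stems}
# {'laguna': 'laguna-1234513', 'point': 'point-v2.20250403'}
# the first laguna tarball was silently dropped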

If you need to deploy multiple checkpoints under what the helper would treat as the same alias, either rename the tarballs so each produces a distinct first segment, or set inference_models explicitly in your root module (see customizing.md) rather than using the helper.

Selecting a subset of your tarball library

Every *.tar at the top level of checkpoints_dir gets uploaded, so pointing it at your full model library uploads everything. To deploy only a subset, point checkpoints_dir at a directory of symlinks:

mkdir -p ~/poolside/models-for-this-deployment
ln -s ~/poolside/models/malibu-v2.20251021_int4.tar ~/poolside/models-for-this-deployment/
ln -s ~/poolside/models/point-v2.20250403.tar      ~/poolside/models-for-this-deployment/

Then set checkpoints_dir = "~/poolside/models-for-this-deployment".

Which models deploy

A model deploys whenever its key appears in the inference_models map passed to modules/poolside-values. Each key becomes a separate inference-<key> Deployment/Service.

When var.inference_models is left null, the examples/full root auto-derives the map from module.stack.inference_models_from_uploads, producing one entry per uploaded tarball, keyed by first-hyphen alias. poolside-values fills in modelName/modelType/gpus from a defaults table for well-known alias keys (malibu, point, laguna-m, laguna-xs, laguna). Unknown keys pass through untouched; you supply the chart fields they need yourself.
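
The fill-in behaves like a shallow merge keyed on the alias. A Python sketch of the semantics (the module itself is HCL; the field values below are placeholders, not the real defaults table):

DEFAULTS = {
    "malibu": {"modelName": "malibu", "gpus": 8},   # placeholder values
    "point":  {"modelName": "point",  "gpus": 1},   # placeholder values
}

def resolve(alias, entry):
    # Known aliases inherit defaults; fields set explicitly in entry win.
    # Unknown aliases pass through and must carry every chart field themselves.
    return {**DEFAULTS.get(alias, {}), **entry}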

To see which models will deploy and what values they'll receive:

terraform output -json inference_models_resolved

Change detection

Terraform's change trigger on each upload is keyed on the tarball filename and the destination S3 path, not the file's contents. That means:

  • Add / remove / rename a tarball → trigger fires, Terraform plans the upload (or teardown).
  • In-place replace (same filename, new content) → Terraform trigger does NOT fire. The Python uploader's SHA-check against the marker object catches the mismatch and re-uploads, but plan/apply output won't reflect it until the provisioner runs.

The recommended pattern is to give every distinct version a distinct filename. The version is part of the alias-prefix convention anyway (for example, malibu-v2.20251021_int4.tar vs malibu-v2.20260101.tar), so the "in-place replace" case rarely comes up in practice.

Why a custom Python uploader?

aws s3 cp --recursive would work, but requires extracting each tarball to disk first: 75GB of scratch per large model, written twice (extract, then upload). This reference architecture's uploader uses Python's tarfile module to read the archive's member index, then pipes each member directly through boto3's upload_fileobj to S3 with no local extraction.
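
A minimal sketch of the streaming approach (names here are assumptions; the shipped uploader adds retries, metadata, and the completion marker on top of this):

import tarfile
import boto3

s3 = boto3.client("s3")

def stream_tarball(tarball_path, bucket, prefix):
    # Upload each regular-file member straight out of the archive,
    # with no extraction to local disk.
    with tarfile.open(tarball_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks, etc.
            fileobj = tar.extractfile(member)  # file-like view into the tar
            s3.upload_fileobj(fileobj, bucket, f"{prefix}/{member.name}")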

Tradeoff: you need a python3 that can import boto3. See prerequisites.md for the requirement statement and the install-method notes. This reference architecture doesn't prescribe an install method; a system package, venv, pip --user, uv, and similar are all fine as long as import boto3 succeeds for whichever python3 is first on PATH.

If the Python dependency is a dealbreaker for your environment, use Mode B (BYO bucket) and populate the bucket with whatever tool you already have (aws s3 sync, rclone, a CI pipeline, etc.).

Teardown

Checkpoints persist across terraform apply runs (that's the point of the marker). To clear them:

# Clear one checkpoint
aws s3 rm --recursive s3://<deployment>-models/models/checkpoints/<stem>/

# Clear everything (e.g. before destroy)
aws s3 rm --recursive s3://<deployment>-models/

On full terraform destroy, s3_force_destroy_buckets = true deletes the bucket even if it still has objects. It defaults to false: flip it to true for POCs, leave it false for production.