Skip to content

AI Edge: eliminate slow/failed TLS cert provisioning at scale via Envoy Gateway extension server #216

Description

@scotwells

Problem

AI Edge sites — especially those with custom hostnames — had slow or failed HTTPS provisioning: certificates stuck "pending," ordinary sites taking 10+ minutes instead of seconds.

Every configuration change rebuilt the full Envoy config for all gateways and re-applied each gateway's customizations (WAF, connectors) via EnvoyPatchPolicy. With WAF + connectors past ~140 gateways, programming took 30+ seconds — long enough to miss the time-limited Let's Encrypt challenge window, so certificates never issued. Latent since inception; surfaced past a scale we'd never load-tested.

Direction

Retire per-gateway EnvoyPatchPolicy and apply WAF/connector customizations inline during config build via an Envoy Gateway extension server.

Result: ~30 s → ~50 ms at 250 gateways. Shipped in NSO v0.22.0; provisioning restored and faster than before. Parent incident: datum-cloud/engineering#317.

Work

Direction & core migration

Provisioning fix & cluster-naming follow-ups (post-upgrade)

Behavior restored/improved under the extension server

Observability, CI & tests

Safety / rollback

Related hardening (tracked separately under #317)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions