Problem
AI Edge sites — especially those with custom hostnames — had slow or failed HTTPS provisioning: certificates stuck "pending," ordinary sites taking 10+ minutes instead of seconds.
Every configuration change rebuilt the full Envoy config for all gateways and re-applied each gateway's customizations (WAF, connectors) via EnvoyPatchPolicy. With WAF + connectors past ~140 gateways, programming took 30+ seconds — long enough to miss the time-limited Let's Encrypt challenge window, so certificates never issued. Latent since inception; surfaced past a scale we'd never load-tested.
Direction
Retire per-gateway EnvoyPatchPolicy and apply WAF/connector customizations inline during config build via an Envoy Gateway extension server.
Result: ~30 s → ~50 ms at 250 gateways. Shipped in NSO v0.22.0; provisioning restored and faster than before. Parent incident: datum-cloud/engineering#317.
Work
Direction & core migration
Provisioning fix & cluster-naming follow-ups (post-upgrade)
Behavior restored/improved under the extension server
Observability, CI & tests
Safety / rollback
Related hardening (tracked separately under #317)
Problem
AI Edge sites — especially those with custom hostnames — had slow or failed HTTPS provisioning: certificates stuck "pending," ordinary sites taking 10+ minutes instead of seconds.
Every configuration change rebuilt the full Envoy config for all gateways and re-applied each gateway's customizations (WAF, connectors) via
EnvoyPatchPolicy. With WAF + connectors past ~140 gateways, programming took 30+ seconds — long enough to miss the time-limited Let's Encrypt challenge window, so certificates never issued. Latent since inception; surfaced past a scale we'd never load-tested.Direction
Retire per-gateway
EnvoyPatchPolicyand apply WAF/connector customizations inline during config build via an Envoy Gateway extension server.Result: ~30 s → ~50 ms at 250 gateways. Shipped in NSO v0.22.0; provisioning restored and faster than before. Parent incident: datum-cloud/engineering#317.
Work
Direction & core migration
EnvoyPatchPolicyper HTTPS-listener cert readinessProvisioning fix & cluster-naming follow-ups (post-upgrade)
Behavior restored/improved under the extension server
backendRefin extension-server modeObservability, CI & tests
Safety / rollback
Related hardening (tracked separately under #317)