Commit b1eded2

Merge pull request #217 from datum-cloud/fix/issue-212-invalid-cert-listener-isolation
2 parents a474ece + 63fa912 commit b1eded2

16 files changed

Lines changed: 2541 additions & 27 deletions

.golangci.yml

Lines changed: 0 additions & 6 deletions
@@ -8,7 +8,6 @@ linters:
     - dupl
     - errcheck
     - ginkgolinter
-    - goconst
     - gocyclo
     - govet
     - ineffassign
@@ -37,11 +36,6 @@ linters:
           - dupl
           - lll
         path: internal/*
-      # Repeated string literals in tests are usually fixture/table data;
-      # extracting them to constants hurts readability more than it helps.
-      - linters:
-          - goconst
-        path: _test\.go
       # The validation packages are built almost entirely from field.ErrorList
       # accumulators that hold a handful of errors; preallocating them adds noise
       # without meaningful benefit.

config/default/kustomization.yaml

Lines changed: 8 additions & 4 deletions
@@ -17,6 +17,13 @@ namePrefix: network-services-operator-
 components:
 - ../resource-metrics
 - ../webhook
+# Prometheus ServiceMonitor for the controller-manager metrics endpoint.
+# The controller serves metrics over HTTPS on :8443 with delegated authn/authz
+# (controller-runtime WithAuthenticationAndAuthorization). The ServiceMonitor
+# uses insecureSkipVerify because the controller auto-generates a self-signed
+# TLS cert: there is no cert-manager-issued cert for the metrics endpoint and
+# no CA bundle to reference. Prometheus still authenticates via the bearer token.
+- ../prometheus
 
 resources:
 - ../crd
@@ -26,10 +33,7 @@ resources:
 # crd/kustomization.yaml
 # [CERTMANAGER] To enable cert-manager, uncomment all sections with 'CERTMANAGER'. 'WEBHOOK' components are required.
 #- ../certmanager
-# [PROMETHEUS] To enable prometheus monitor, uncomment all sections with 'PROMETHEUS'.
-#- ../prometheus
-# [METRICS] Expose the controller manager metrics service.
-- metrics_service.yaml
+# metrics_service.yaml is now included by the ../prometheus component above.
 # [NETWORK POLICY] Protect the /metrics endpoint and Webhook Server with NetworkPolicy.
 # Only Pod(s) running a namespace labeled with 'metrics: enabled' will be able to gather the metrics.
 # Only CR(s) which requires webhooks and are applied on namespaces labeled with 'webhooks: enabled' will

config/extension-server/kustomization.yaml

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ resources:
 - rbac
 - certmanager
 - network-policy
+- metrics-monitor.yaml
 
 images:
 - name: ghcr.io/datum-cloud/network-services-operator
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  labels:
+    app.kubernetes.io/name: network-services-operator
+    app.kubernetes.io/component: envoy-gateway-extension-server
+    app.kubernetes.io/managed-by: kustomize
+  name: envoy-gateway-extension-server-metrics
+  namespace: system
+spec:
+  endpoints:
+    # The extension server serves /metrics on plain HTTP (no TLS) on the
+    # health-addr port (:8080). Only the gRPC port uses mTLS; the health
+    # address intentionally stays plain HTTP so Kubernetes probes don't
+    # need certificates.
+    - path: /metrics
+      port: metrics
+      scheme: http
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: network-services-operator
+      app.kubernetes.io/component: envoy-gateway-extension-server

config/telemetry/alerts/gateways.yaml

Lines changed: 82 additions & 0 deletions
@@ -22,6 +22,16 @@ spec:
             summary: "Gateway {{ $labels.resource_name }} is taking longer than 60 seconds to reach Ready status"
             description: "Gateway {{ $labels.resource_name }} in namespace {{ $labels.resource_namespace }} has been in creation state for {{ $value }} seconds without reaching Ready status (Accepted=True AND Programmed=True), which exceeds the 60-second SLO threshold."
 
+        - alert: EnvoyPatchPolicyProgrammingFailed
+          expr: |
+            envoy_gateway_envoypatchpolicy_status_condition{type="Programmed"} == 0
+          for: 5m
+          labels:
+            severity: critical
+          annotations:
+            summary: "EnvoyPatchPolicy {{ $labels.name }} failed to program"
+            description: "EnvoyPatchPolicy {{ $labels.name }} in namespace {{ $labels.namespace }} has been unable to apply its xDS patch for over 5 minutes (reason: {{ $labels.reason }}). Customer traffic on affected gateways may be impacted."
+
         - alert: GatewayDegradedSLOViolation
           expr: |
             (
@@ -39,3 +49,75 @@ spec:
           annotations:
             summary: "Gateway {{ $labels.resource_name }} has been degraded for over 60 seconds"
             description: "Gateway {{ $labels.resource_name }} in namespace {{ $labels.resource_namespace }} has been in a degraded state for over 60 seconds without recovering, which exceeds the 60-second SLO threshold."
+
+    # TLS certificate health alerts fire on nso_* metrics emitted directly by the
+    # NSO operator and extension server. They are available in the same Prometheus
+    # that loads this rule, alongside the envoy_gateway_* metrics above.
+    # These complement the infrastructure EnvoyListenerUpdateRejected alert (which
+    # fires when Envoy rejects a bad LDS update). These alerts cover the earlier
+    # prevention path: NSO withholds a listener or the extension server drops a
+    # broken filter chain before Envoy has a chance to reject the update.
+    - name: nso-tls-cert-health
+      interval: 30s
+      rules:
+        # Fires when NSO has withheld a Gateway listener because its TLS certificate
+        # is unusable. The customer's HTTPS hostname is dark until the cert recovers.
+        # If EnvoyListenerUpdateRejected is also firing without this alert, NSO's
+        # cert gating has regressed and a bad cert reached Envoy directly.
+        - alert: GatewayListenerCertUnusable
+          expr: |
+            nso_gateway_listener_cert_withheld == 1
+          for: 5m
+          labels:
+            severity: warning
+          annotations:
+            summary: "Gateway listener {{ $labels.namespace }}/{{ $labels.name }}/{{ $labels.listener }} has an unusable TLS certificate"
+            description: "NSO has withheld listener {{ $labels.listener }} (hostname {{ $labels.hostname }}) on Gateway {{ $labels.name }} in namespace {{ $labels.namespace }} because its TLS certificate is unusable (reason: {{ $labels.reason }}). The customer cannot serve HTTPS on this hostname. Check the cert-manager Certificate and Secret in the downstream cluster."
+            runbook_url: "https://github.com/datum-cloud/network-services-operator/blob/main/docs/runbooks/gateway-tls-certificates.md#gatewaylistenercertunusable"
+
+        # Fires when a managed TLS certificate is within 7 days of expiry while it
+        # is still healthy. cert-manager renews automatically, but renewal fails if
+        # the domain's DNS no longer points to Datum. Acting here avoids a future
+        # GatewayListenerCertUnusable alert.
+        - alert: GatewayListenerCertExpiringSoon
+          expr: |
+            (nso_gateway_listener_cert_expiry_time - time()) / 86400 < 7
+          for: 1h
+          labels:
+            severity: warning
+          annotations:
+            summary: "TLS certificate for Gateway listener {{ $labels.namespace }}/{{ $labels.name }}/{{ $labels.listener }} expires in less than 7 days"
+            description: "The cert-manager Certificate for listener {{ $labels.listener }} (hostname {{ $labels.hostname }}, secret {{ $labels.secret }}) on Gateway {{ $labels.name }} in namespace {{ $labels.namespace }} expires within 7 days. cert-manager should renew it automatically, but renewal fails if the domain's DNS no longer points to Datum. Verify the Certificate is Ready=True in the downstream cluster."
+            runbook_url: "https://github.com/datum-cloud/network-services-operator/blob/main/docs/runbooks/gateway-tls-certificates.md#gatewaylistenercertexpiringsoon"
+
+        # Fires when the extension server is actively dropping broken certificates
+        # from the configuration it sends to the edge. This is expected briefly
+        # between a certificate failing and the controller withholding the listener.
+        # If only this fires and GatewayListenerCertUnusable does not, the controller
+        # may have missed the listener.
+        - alert: TLSBackstopPruningChains
+          expr: |
+            nso_extension_tls_pruned_chains_active > 0
+          for: 2m
+          labels:
+            severity: warning
+          annotations:
+            summary: "Extension server is dropping {{ $value }} broken certificate(s) to protect the edge listener"
+            description: "The extension server is dropping {{ $value }} broken certificate(s) from the configuration it sends to the edge gateway. Check extension server logs for 'pruned invalid TLS chains' to find the affected hostnames. If GatewayListenerCertUnusable is also firing, both layers of protection are working as expected. If only this alert fires, the controller may have missed the listener."
+            runbook_url: "https://github.com/datum-cloud/network-services-operator/blob/main/docs/runbooks/gateway-tls-certificates.md#tlsbackstoppruningchains"
+
+        # Fires when the extension server could not protect a listener because every
+        # certificate on it is broken. It never removes a listener entirely, so the
+        # edge will reject the update for that listener; EnvoyListenerUpdateRejected
+        # (infra) confirms it. It means the controller did not withhold the listener
+        # before it reached the edge.
+        - alert: TLSBackstopListenerAllCertsBroken
+          expr: |
+            nso_extension_tls_listeners_left_intact_active > 0
+          for: 2m
+          labels:
+            severity: critical
+          annotations:
+            summary: "{{ $value }} edge listener(s) have every TLS certificate broken and cannot be protected"
+            description: "The extension server left {{ $value }} edge listener(s) untouched because every certificate on them is broken. It never removes a listener entirely, so the edge will reject the configuration update for those listeners (EnvoyListenerUpdateRejected confirms it). This means the controller did not withhold the listener before it reached the edge. Check extension server logs for 'listeners_left_intact' and why the controller did not withhold the listener."
+            runbook_url: "https://github.com/datum-cloud/network-services-operator/blob/main/docs/runbooks/gateway-tls-certificates.md#tlsbackstoplistenerallcertsbroken"
Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
+# Runbook: Gateway TLS certificate alerts
+
+These alerts cover the health of the TLS certificates that gateway listeners use
+to serve HTTPS. Every HTTPS hostname on a gateway shares a single edge listener,
+so an unusable certificate is handled in two layers:
+
+1. The **controller** leaves a listener with an unusable certificate out of the
+   downstream gateway, so one bad certificate only affects its own hostname and
+   every other hostname keeps serving. The affected listener reports the problem
+   to the customer through its status conditions.
+2. The **extension server** is a backstop: if a bad certificate reaches the edge
+   anyway, it drops only the affected part of the listener rather than letting
+   the whole listener fail.
+
+A certificate is "unusable" when it has expired, is not yet valid, is missing,
+has a certificate and key that do not match, or has not been issued yet.
+
+Related: issue [#212](https://github.com/datum-cloud/network-services-operator/issues/212).
+The infra-side `EnvoyListenerUpdateRejected` alert fires when the edge actually
+rejects a listener update. The alerts here are designed to fire *before* that
+happens, or to explain it when it does.
+
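
As a rough illustration of the two time-based cases in the definition above (expired, not yet valid), the following shell sketch classifies a certificate from its validity window. The function name `cert_window_state` and the state strings are hypothetical and are not taken from the operator. The missing, mismatched-key, and not-yet-issued cases depend on the Secret and Certificate objects and are not covered here.

```sh
# Hypothetical helper: classify a certificate by its validity window.
# Arguments are Unix timestamps in seconds: notBefore, notAfter, and "now".
# The window is treated as inclusive of both endpoints.
cert_window_state() {
  not_before=$1
  not_after=$2
  now=$3
  if [ "$now" -lt "$not_before" ]; then
    echo "not-yet-valid"
  elif [ "$now" -gt "$not_after" ]; then
    echo "expired"
  else
    echo "usable"
  fi
}
```

For example, `cert_window_state 100 200 150` reports a usable window, while a `now` of `50` or `201` falls outside it.
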
+## Shared diagnosis
+
+Each alert carries labels identifying the affected object: `namespace`, `name`
+(the gateway), `listener`, and usually `hostname`.
+
+Find the gateway and the failing listener's status:
+
+```sh
+kubectl -n <namespace> get gateway <name> -o yaml | yq '.status.listeners'
+```
+
+A gated listener reports `Programmed: False` (reason `Invalid`) and
+`ResolvedRefs: False` (reason `InvalidCertificateRef`) with a plain-language
+message naming the hostname.
+
+Inspect the backing certificate on the downstream (edge) cluster. The Certificate
+and its Secret are named `<gateway>-<listener>`:
+
+```sh
+kubectl --context <downstream> -n <downstream-ns> get certificate <gateway>-<listener> -o yaml
+kubectl --context <downstream> -n <downstream-ns> get secret <gateway>-<listener> -o yaml
+```
+
+The most common root cause is a customer pointing their domain away from Datum:
+ACME renewal then fails, the certificate goes `Ready: False`, and it eventually
+expires. That is a customer action, not a platform fault: the listener is
+correctly withheld and recovers on its own once the certificate can be issued.
+
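
To read the validity window straight from the Secret inspected above, the stored certificate can be decoded and passed to `openssl`. This is a sketch that assumes the Secret is a standard `kubernetes.io/tls` Secret with a `tls.crt` key, which is how cert-manager normally writes it. The function name `secret_cert_dates` is hypothetical.

```sh
# Hypothetical helper: print the subject and validity dates of the certificate
# stored in a TLS Secret. Arguments: kube context, namespace, Secret name.
# If tls.crt holds a chain, openssl x509 reads only the first certificate.
# base64 -d is the GNU coreutils spelling; some platforms use -D instead.
secret_cert_dates() {
  kubectl --context "$1" -n "$2" get secret "$3" \
    -o jsonpath='{.data.tls\.crt}' \
    | base64 -d \
    | openssl x509 -noout -subject -dates
}
```

It would be called as `secret_cert_dates <downstream> <downstream-ns> <gateway>-<listener>`, using the same placeholders as the commands above.
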
+## GatewayListenerCertUnusable
+
+**Meaning.** The controller is withholding a listener because its certificate is
+unusable. The customer's HTTPS hostname is unavailable until the certificate
+recovers.
+
+**Impact.** Limited to the one hostname. Other hostnames on the gateway are
+unaffected; this is the isolation working as intended.
+
+**Diagnose.** Read the `reason` label and the listener status message (see Shared
+diagnosis). Check the downstream Certificate's `Ready` condition and its
+`status.notAfter`.
+
+**Remediate.** Usually no platform action is needed; confirm whether the
+customer's domain still points to Datum. If it does and issuance is genuinely
+stuck, investigate cert-manager (the issuer, ACME order, and challenge for that
+hostname). The listener returns automatically once the certificate is issued.
+
+## GatewayListenerCertExpiringSoon
+
+**Meaning.** A currently-healthy certificate expires within seven days. This is a
+warning to act before it starts gating the listener.
+
+**Impact.** None yet. It becomes `GatewayListenerCertUnusable` if the certificate
+expires before it is renewed.
+
+**Diagnose.** Check the downstream Certificate's `status.renewalTime` and whether
+recent renewal attempts are failing (cert-manager events / logs for that
+Certificate). Confirm the hostname's DNS still resolves to Datum, since ACME
+renewal depends on it.
+
+**Remediate.** If renewal is failing because DNS moved away, this will become a
+customer-driven gating event, with no platform fix. If renewal is failing for a
+platform reason, fix the issuer / ACME path so cert-manager can renew.
+
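
The alert expression `(nso_gateway_listener_cert_expiry_time - time()) / 86400 < 7` converts the seconds remaining until expiry into days and compares the result against seven. The sketch below reproduces that arithmetic in shell, assuming the metric value is a Unix timestamp in seconds (which subtracting `time()` implies). The function name is hypothetical.

```sh
# Hypothetical helper mirroring the alert expression with integer arithmetic.
# Arguments: certificate expiry and current time, both Unix timestamps in seconds.
# Comparing seconds against 7 * 86400 avoids the rounding of integer division.
expires_within_7_days() {
  expiry=$1
  now=$2
  [ $(( expiry - now )) -lt $(( 7 * 86400 )) ]
}
```

Like the expression, the check is also true for an expiry time that is already in the past. Whether the metric is still exported for such a certificate is not stated in this change.
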
+## TLSBackstopPruningChains
+
+**Meaning.** The extension server is actively dropping broken certificates from
+the configuration it sends to the edge. This is expected for a short window
+between a certificate failing and the controller withholding the listener.
+
+**Impact.** None on its own; the backstop is protecting the listener. The
+affected hostname is the one whose certificate is broken.
+
+**Diagnose.** Check extension server logs for `pruned invalid TLS chains` to find
+the affected hostnames:
+
+```sh
+kubectl -n <ext-server-ns> logs -l <ext-server-selector> | grep 'pruned invalid TLS chains'
+```
+
+If `GatewayListenerCertUnusable` is also firing for the same hostname, both
+layers are working as expected and no action is needed. If **only** this alert
+fires, the controller did not withhold the listener. See the next alert and
+check why (start with the listener's status conditions and the controller logs).
+
+**Remediate.** Generally none. If it persists without a matching
+`GatewayListenerCertUnusable`, treat it as a controller gap and investigate the
+gateway reconcile for that listener.
+
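
The decision described in this section amounts to a small truth table over the two alerts. The sketch below is illustrative only: the function name and the output strings are hypothetical, and the inputs are simply whether each alert is currently firing for the same hostname.

```sh
# Hypothetical triage helper. Each argument is "true" or "false":
#   $1: TLSBackstopPruningChains is firing
#   $2: GatewayListenerCertUnusable is firing for the same hostname
backstop_triage() {
  if [ "$1" != "true" ]; then
    echo "backstop idle"
  elif [ "$2" = "true" ]; then
    echo "both layers working: no action"
  else
    echo "controller gap: investigate gateway reconcile"
  fi
}
```
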
+## TLSBackstopListenerAllCertsBroken
+
+**Meaning (critical).** Every certificate on an edge listener is broken. The
+backstop never removes a listener entirely, so the edge will reject the
+configuration update for that listener and its config will freeze on its last
+good state.
+
+**Impact.** The listener stops accepting configuration changes. Because the edge
+listener is shared, this can affect every hostname on it. This is the
+fleet-impacting failure the two-layer design exists to prevent, so reaching it
+means the controller-side protection did not catch the listener.
+
+**Diagnose.**
+
+```sh
+kubectl -n <ext-server-ns> logs -l <ext-server-selector> | grep 'listeners_left_intact'
+```
+
+Cross-check the infra `EnvoyListenerUpdateRejected` alert, which confirms the
+edge is rejecting the update. Identify every certificate on the affected listener
+and why each is broken (expired, not yet valid, or mismatched), then determine
+why the controller did not withhold the listener before it reached the edge.
+
+**Remediate.** Restore or remove the broken certificates so the listener has at
+least one usable certificate, which lets the edge accept the update again. Then
+follow up on the controller gap that allowed an all-broken listener to be
+programmed.
