Commit 474cf86

lmiccini and claude committed
Add metrics Service and TLS support for InstanceHA
- Add a Kubernetes Service exposing the InstanceHA Prometheus metrics endpoint, with labels for automatic discovery by the telemetry operator's ScrapeConfig.
- Add MetricsTLS field (tls.SimpleService) to the InstanceHa API, allowing TLS certificate configuration for the metrics endpoint.
- Mount TLS certificate secret into the deployment and pass cert/key paths via environment variables when MetricsTLS is enabled.
- Validate the MetricsTLS secret in the controller with hash tracking for automatic pod rollout on certificate rotation.
- Add field indexer for the metrics TLS secret so the controller reconciles on secret changes.
- Update the Python health/metrics server to wrap the HTTP socket with TLS when certificate environment variables are present.
- Add RBAC annotation for Services to the InstanceHA controller.
- Add functional tests for the metrics Service creation.
- Update documentation for Prometheus metrics integration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent fc58bcd commit 474cf86

14 files changed

Lines changed: 618 additions & 34 deletions

apis/bases/instanceha.openstack.org_instancehas.yaml

Lines changed: 12 additions & 0 deletions
@@ -100,6 +100,18 @@ spec:
                 default: 7410
                 format: int32
                 type: integer
+              metricsTLS:
+                description: MetricsTLS - Parameters related to TLS for the metrics
+                  endpoint
+                properties:
+                  caBundleSecretName:
+                    description: CaBundleSecretName - holding the CA certs in a pre-created
+                      bundle file
+                    type: string
+                  secretName:
+                    description: SecretName - holding the cert, key for the service
+                    type: string
+                type: object
               networkAttachments:
                 description: |-
                   NetworkAttachments is a list of NetworkAttachment resource names to expose

apis/instanceha/v1beta1/instanceha_types.go

Lines changed: 5 additions & 0 deletions
@@ -115,6 +115,11 @@ type InstanceHaSpec struct {
 	// +kubebuilder:validation:Optional
 	// Auth - Parameters related to authentication
 	Auth AuthSpec `json:"auth,omitempty"`
+
+	// +kubebuilder:validation:Optional
+	//+operator-sdk:csv:customresourcedefinitions:type=spec
+	// MetricsTLS - Parameters related to TLS for the metrics endpoint
+	MetricsTLS tls.SimpleService `json:"metricsTLS,omitempty"`
 }

 // InstanceHaStatus defines the observed state of InstanceHa

apis/instanceha/v1beta1/zz_generated.deepcopy.go

Lines changed: 1 addition & 0 deletions
Generated file; diff not rendered by default.

config/crd/bases/instanceha.openstack.org_instancehas.yaml

Lines changed: 12 additions & 0 deletions
@@ -100,6 +100,18 @@ spec:
                 default: 7410
                 format: int32
                 type: integer
+              metricsTLS:
+                description: MetricsTLS - Parameters related to TLS for the metrics
+                  endpoint
+                properties:
+                  caBundleSecretName:
+                    description: CaBundleSecretName - holding the CA certs in a pre-created
+                      bundle file
+                    type: string
+                  secretName:
+                    description: SecretName - holding the cert, key for the service
+                    type: string
+                type: object
               networkAttachments:
                 description: |-
                   NetworkAttachments is a list of NetworkAttachment resource names to expose

docs/instanceha_guide.md

Lines changed: 3 additions & 1 deletion
@@ -190,7 +190,9 @@ groups:

 #### Scraping Configuration

-The InstanceHA pod exposes metrics on TCP port 8080. To scrape with Prometheus, create a `PodMonitor` or `ServiceMonitor`:
+The InstanceHA pod exposes metrics on TCP port 8080. The infra-operator automatically creates a Kubernetes Service (`<instance-name>-metrics`) with the labels `metrics: enabled` and `service: instanceha`, which the telemetry-operator discovers and scrapes via the COO Prometheus. **No manual configuration is needed when the telemetry-operator is deployed.**
+
+For environments using OpenShift user workload monitoring instead of (or in addition to) the telemetry-operator, create a `PodMonitor`:

 ```yaml
 apiVersion: monitoring.coreos.com/v1

docs/instanceha_prometheus.md

Lines changed: 53 additions & 31 deletions
@@ -6,6 +6,8 @@ InstanceHA exposes Prometheus metrics at `:8080/metrics` on the workload pod, co

 The metrics are served by the `prometheus_client` Python library on the same HTTP server used for liveness and readiness probes. No sidecar or additional container is needed.

+When pod-level TLS is enabled, the metrics endpoint serves over **HTTPS**. The openstack-operator creates a cert-manager Certificate producing a TLS secret (`cert-instanceha-metrics`), which the infra-operator mounts into the pod. The Python HTTP server wraps its socket with TLS automatically when the certificate files are present.
+
 ---

 ## Prerequisites
@@ -17,6 +19,34 @@ The metrics are served by the `prometheus_client` Python library on the same HTT

 ---

+## TLS Configuration
+
+When `OpenStackControlPlane` has pod-level TLS enabled (`spec.tls.podLevel.enabled: true`), the openstack-operator automatically provisions a cert-manager Certificate for the InstanceHA metrics endpoint. This produces a Kubernetes TLS secret (`cert-instanceha-metrics`) containing `tls.crt`, `tls.key`, and `ca.crt`.
+
+The infra-operator InstanceHA controller **auto-detects** this secret: if the default secret `cert-instanceha-metrics` exists in the namespace, TLS is enabled automatically without any configuration on the InstanceHa CR. The controller:
+1. Validates the TLS secret exists and is well-formed
+2. Mounts the certificate at `/etc/pki/tls/certs/metrics.crt` and the key at `/etc/pki/tls/private/metrics.key`
+3. Sets `METRICS_TLS_CERT` and `METRICS_TLS_KEY` environment variables
+4. Switches liveness and readiness probes to HTTPS
+
+The Python process detects these environment variables and wraps the HTTP server socket with TLS. A single wildcard certificate (`*.NAMESPACE.svc`) covers all InstanceHA instances in a namespace.
+
+To use a custom TLS secret instead of the auto-detected default, set `metricsTLS.secretName` in the InstanceHa CR:
+
+```yaml
+apiVersion: instanceha.openstack.org/v1beta1
+kind: InstanceHa
+metadata:
+  name: instanceha
+spec:
+  metricsTLS:
+    secretName: my-custom-metrics-cert
+```
+
+When the telemetry-operator is deployed, its `ScrapeConfig` automatically switches to `scheme: HTTPS` with the appropriate TLS configuration when `PrometheusTLS` is enabled — no manual changes are needed.
+
+---
+
 ## Enabling Scraping

 ### Step 1: Deploy a PodMonitor
@@ -95,8 +125,9 @@ curl -sk -H "Authorization: Bearer $TOKEN" \

 ```bash
 # Scrape metrics directly from the pod
+# Use https and -k when TLS is enabled
 oc exec -n openstack deployment/instanceha-instanceha -- \
-  curl -s http://localhost:8080/metrics
+  curl -sk https://localhost:8080/metrics

 # Query a specific metric in Prometheus
 # (via Prometheus UI or API)
@@ -299,13 +330,17 @@ promtool check rules instanceha-prometheusrule.yaml
 ### Verify Metrics Endpoint

 ```bash
-# Scrape all metrics from the pod
+# Scrape all metrics from the pod (HTTP, when TLS is not enabled)
 oc exec -n openstack deployment/instanceha-instanceha -- \
   curl -s http://localhost:8080/metrics

+# When TLS is enabled, use HTTPS with -k to skip certificate verification
+oc exec -n openstack deployment/instanceha-instanceha -- \
+  curl -sk https://localhost:8080/metrics
+
 # Check a specific metric family
 oc exec -n openstack deployment/instanceha-instanceha -- \
-  curl -s http://localhost:8080/metrics | grep instanceha_poll_cycles_total
+  curl -sk https://localhost:8080/metrics | grep instanceha_poll_cycles_total
 ```

 Expected output (counters start at zero, increment over time):
@@ -319,16 +354,16 @@ instanceha_poll_cycles_total{result="error"} 0.0

 ### Verify Poll Loop Metrics

-After the pod has been running for a few poll cycles:
+After the pod has been running for a few poll cycles (use `https` and `-k` when TLS is enabled):

 ```bash
 # Should show increasing success count
 oc exec -n openstack deployment/instanceha-instanceha -- \
-  curl -s http://localhost:8080/metrics | grep poll_cycles
+  curl -sk https://localhost:8080/metrics | grep poll_cycles

 # Should show 0 consecutive failures (healthy state)
 oc exec -n openstack deployment/instanceha-instanceha -- \
-  curl -s http://localhost:8080/metrics | grep poll_consecutive_failures
+  curl -sk https://localhost:8080/metrics | grep poll_consecutive_failures
 ```

 ### Simulate a Nova API Failure
@@ -353,8 +388,9 @@ Fencing and evacuation counters only increment during actual host failures. To v

 ```bash
 # Check that the metric families are registered (even if values are 0)
+# Use https and -k when TLS is enabled
 oc exec -n openstack deployment/instanceha-instanceha -- \
-  curl -s http://localhost:8080/metrics | grep "^instanceha_" | grep "# TYPE"
+  curl -sk https://localhost:8080/metrics | grep "^instanceha_" | grep "# TYPE"
 ```

 Expected output:
@@ -482,37 +518,23 @@ When the [telemetry-operator](https://github.com/openstack-k8s-operators/telemet
 | OpenShift user workload monitoring | `prometheus-user-workload` in `openshift-user-workload-monitoring` | `thanos-querier` route in `openshift-monitoring` |
 | telemetry-operator (COO) | `prometheus-metric-storage` in `openstack` | `metric-storage-prometheus.openstack.svc:9090` |

-The PodMonitor approach described above places InstanceHA metrics in the OpenShift user workload Prometheus. If you want InstanceHA metrics alongside other OpenStack metrics (Ceilometer, RabbitMQ, node-exporter, OVN) in the COO Prometheus, create a `ScrapeConfig` CR instead.
+### Automatic Discovery (default)

-### Creating a ScrapeConfig for COO Prometheus
+The telemetry-operator **automatically discovers and scrapes InstanceHA metrics** — no manual configuration is required. The infra-operator creates a Kubernetes Service (`<instance-name>-metrics`) with the labels `metrics: enabled` and `service: instanceha`. The telemetry-operator's `MetricStorage` controller watches for Services with these labels and automatically generates a `ScrapeConfig` CR named `telemetry-instanceha` targeting port 8080.

-The COO Prometheus only picks up CRs with the label `service: metricStorage`. Create a `ScrapeConfig` targeting the InstanceHA pod:
+This works the same way as the OVN metrics integration. When a `MetricStorage` CR exists in the namespace:

-```yaml
-apiVersion: monitoring.rhobs/v1alpha1
-kind: ScrapeConfig
-metadata:
-  name: instanceha-metrics
-  namespace: openstack
-  labels:
-    service: metricStorage
-spec:
-  scrapeInterval: 30s
-  metricsPath: /metrics
-  staticConfigs:
-    - targets:
-        - "<instanceha-pod-ip>:8080"
-```
+1. The telemetry-operator discovers the InstanceHA metrics Service via label selectors
+2. A `ScrapeConfig` CR is created with the target `<service-name>.<namespace>.svc:8080`
+3. The COO Prometheus picks up the `ScrapeConfig` and begins scraping
+4. If the InstanceHA Service is deleted or recreated, the `ScrapeConfig` is automatically reconciled

-To discover the pod IP dynamically:
+To verify the automatic scrapeconfig was created:

 ```bash
-POD_IP=$(oc get pod -n openstack -l service=instanceha -o jsonpath='{.items[0].status.podIP}')
-echo "Target: ${POD_IP}:8080"
+oc get scrapeconfig -n openstack telemetry-instanceha -o yaml
 ```

-> **Note**: The COO `ScrapeConfig` uses static targets (IP:port), not label-based pod discovery like a `PodMonitor`. If the InstanceHA pod is rescheduled and gets a new IP, the `ScrapeConfig` must be updated. For automatic discovery, consider requesting native InstanceHA support in the telemetry-operator — the OVN metrics integration uses a label-based service discovery pattern that could be extended to InstanceHA.
-
 ### Alert Rules for COO Prometheus

 The alert rules from the [Alert Rules](#alert-rules) section use the `monitoring.coreos.com/v1` API group, which is picked up by OpenShift's built-in Prometheus Operator. To use these alerts with the COO Prometheus instead, change the API group and add the `service: metricStorage` label:
@@ -532,7 +554,7 @@ spec:
 ### Which Approach to Use

 - **OpenShift user workload monitoring only** (no telemetry-operator): Use the PodMonitor approach from [Enabling Scraping](#enabling-scraping). This is simpler and uses automatic pod discovery.
-- **telemetry-operator deployed**: Use the ScrapeConfig approach if you want all OpenStack metrics in a single Prometheus. You can also use both approaches simultaneously — the PodMonitor and ScrapeConfig target different Prometheus instances and do not conflict.
+- **telemetry-operator deployed** (default): InstanceHA metrics are automatically scraped by the COO Prometheus alongside other OpenStack metrics (Ceilometer, RabbitMQ, node-exporter, OVN). No manual configuration needed. You can also deploy the PodMonitor simultaneously — it targets the OpenShift user workload Prometheus and does not conflict with the COO scrapeconfig.
 - **Querying across both**: OpenShift's `thanos-querier` route aggregates the cluster and user workload Prometheus instances. The COO Prometheus is separate and must be queried directly at `metric-storage-prometheus.openstack.svc:9090`.

 ---
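
Note on the documentation change above: the discovery flow it describes (select Services by label, then derive a `<service-name>.<namespace>.svc:8080` target) can be illustrated with a minimal, stdlib-only sketch. The `service` struct and the `matches` and `scrapeTargets` helpers are hypothetical stand-ins, not telemetry-operator code; only the label values, the port, and the target format come from the docs in this commit.

```go
package main

import "fmt"

// service is a hypothetical, trimmed stand-in for a Kubernetes Service.
type service struct {
	Name      string
	Namespace string
	Labels    map[string]string
}

// selector holds the two labels the docs say the metrics Service carries.
var selector = map[string]string{"metrics": "enabled", "service": "instanceha"}

// matches reports whether every selector label is present with the same value.
func matches(labels, sel map[string]string) bool {
	for k, v := range sel {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// scrapeTargets builds one "<service-name>.<namespace>.svc:8080" target
// for each Service that matches the selector.
func scrapeTargets(svcs []service) []string {
	var out []string
	for _, s := range svcs {
		if matches(s.Labels, selector) {
			out = append(out, fmt.Sprintf("%s.%s.svc:8080", s.Name, s.Namespace))
		}
	}
	return out
}

func main() {
	svcs := []service{
		{Name: "instanceha-metrics", Namespace: "openstack",
			Labels: map[string]string{"metrics": "enabled", "service": "instanceha"}},
		{Name: "unrelated", Namespace: "openstack",
			Labels: map[string]string{"service": "other"}},
	}
	fmt.Println(scrapeTargets(svcs))
}
```

Because the target is a Service DNS name rather than a pod IP, a rescheduled pod does not invalidate it, which is the limitation the removed static `ScrapeConfig` note described.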

internal/controller/instanceha/instanceha_controller.go

Lines changed: 71 additions & 2 deletions
@@ -55,6 +55,7 @@ import (

 	commondeployment "github.com/openstack-k8s-operators/lib-common/modules/common/deployment"
 	"github.com/openstack-k8s-operators/lib-common/modules/common/secret"
+	commonservice "github.com/openstack-k8s-operators/lib-common/modules/common/service"
 	"github.com/openstack-k8s-operators/lib-common/modules/common/util"

 	networkv1 "github.com/k8snetworkplumbingwg/network-attachment-definition-client/pkg/apis/k8s.cni.cncf.io/v1"
@@ -80,6 +81,7 @@ func (r *Reconciler) GetLogger(ctx context.Context) logr.Logger {
 // +kubebuilder:rbac:groups=instanceha.openstack.org,resources=instancehas/finalizers,verbs=update;patch
 // +kubebuilder:rbac:groups=core,resources=configmaps,verbs=get;list;watch;
 // +kubebuilder:rbac:groups=core,resources=secrets,verbs=get;list;watch;
+// +kubebuilder:rbac:groups=core,resources=services,verbs=get;list;watch;create;update;patch;delete
 // +kubebuilder:rbac:groups=k8s.cni.cncf.io,resources=network-attachment-definitions,verbs=get;list;watch
 // service account, role, rolebinding
 // +kubebuilder:rbac:groups="",resources=serviceaccounts,verbs=get;list;watch;create;update;patch
@@ -164,6 +166,7 @@ func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (result ct
 		condition.UnknownCondition(condition.RoleReadyCondition, condition.InitReason, condition.RoleReadyInitMessage),
 		condition.UnknownCondition(condition.RoleBindingReadyCondition, condition.InitReason, condition.RoleBindingReadyInitMessage),
 		condition.UnknownCondition(condition.NetworkAttachmentsReadyCondition, condition.InitReason, condition.NetworkAttachmentsReadyInitMessage),
+		condition.UnknownCondition(condition.CreateServiceReadyCondition, condition.InitReason, condition.CreateServiceReadyInitMessage),
 	)
 	instance.Status.Conditions.Init(&cl)
 	instance.Status.ObservedGeneration = instance.Generation
@@ -369,8 +372,6 @@ func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (result ct
 		)
 		if err != nil {
 			if k8s_errors.IsNotFound(err) {
-				// Since the CA cert secret should have been manually created by the user and provided in the spec,
-				// we treat this as a warning because it means that the service will not be able to start.
 				instance.Status.Conditions.Set(condition.FalseCondition(
 					condition.TLSInputReadyCondition,
 					condition.ErrorReason,
@@ -390,6 +391,38 @@ func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (result ct
 		configVars[instance.Spec.CaBundleSecretName] = env.SetValue(secretHash)
 	}

+	metricsTLSExplicit := instance.Spec.MetricsTLS.Enabled()
+	if !metricsTLSExplicit {
+		certName := instanceha.DefaultMetricsCertSecret
+		instance.Spec.MetricsTLS.SecretName = &certName
+	}
+
+	hash, err := instance.Spec.MetricsTLS.ValidateCertSecret(ctx, helper, instance.Namespace)
+	if err != nil {
+		if k8s_errors.IsNotFound(err) {
+			if metricsTLSExplicit {
+				instance.Status.Conditions.Set(condition.FalseCondition(
+					condition.TLSInputReadyCondition,
+					condition.RequestedReason,
+					condition.SeverityInfo,
+					condition.TLSInputReadyWaitingMessage, err.Error()))
+				return ctrl.Result{}, nil
+			}
+			// Auto-detect: default cert not found, proceed without TLS
+			instance.Spec.MetricsTLS.SecretName = nil
+		} else {
+			instance.Status.Conditions.Set(condition.FalseCondition(
+				condition.TLSInputReadyCondition,
+				condition.ErrorReason,
+				condition.SeverityWarning,
+				condition.TLSInputErrorMessage,
+				err.Error()))
+			return ctrl.Result{}, err
+		}
+	} else {
+		configVars[tls.TLSHashName+"_metrics"] = env.SetValue(hash)
+	}
+
 	// all cert input checks out so report InputReady
 	instance.Status.Conditions.MarkTrue(condition.TLSInputReadyCondition, condition.InputReadyMessage)
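
Note on the hunk above: the branch structure has four outcomes, which are easier to see in isolation. The following is a minimal sketch using only the standard library; `resolveMetricsTLS`, `errNotFound`, `errInvalid`, and the `lookup` callback are hypothetical stand-ins for `ValidateCertSecret` and `k8s_errors.IsNotFound`, and it assumes an unset or empty secret name means "not explicitly configured".

```go
package main

import (
	"errors"
	"fmt"
)

// defaultSecret mirrors instanceha.DefaultMetricsCertSecret from this commit.
const defaultSecret = "cert-instanceha-metrics"

// Hypothetical sentinel errors standing in for Kubernetes API errors.
var (
	errNotFound = errors.New("secret not found")
	errInvalid  = errors.New("secret malformed")
)

// outcome describes what the reconciler would do next.
type outcome int

const (
	tlsEnabled  outcome = iota // secret validated, hash recorded
	tlsDisabled                // default secret absent, continue over plain HTTP
	waiting                    // explicit secret absent, stop and wait for it
	failed                     // any other validation error
)

// resolveMetricsTLS follows the same branches as the hunk above.
// explicit is the user-supplied secret name ("" when unset); lookup is a
// hypothetical validator that returns the secret's hash.
func resolveMetricsTLS(explicit string, lookup func(name string) (string, error)) (outcome, string) {
	name := explicit
	if name == "" {
		name = defaultSecret
	}
	hash, err := lookup(name)
	switch {
	case err == nil:
		return tlsEnabled, hash
	case errors.Is(err, errNotFound) && explicit != "":
		return waiting, ""
	case errors.Is(err, errNotFound):
		return tlsDisabled, ""
	default:
		return failed, ""
	}
}

func main() {
	// A fake cluster that only contains the default secret.
	secrets := map[string]string{defaultSecret: "abc123"}
	lookup := func(name string) (string, error) {
		if name == "broken" {
			return "", errInvalid
		}
		if h, ok := secrets[name]; ok {
			return h, nil
		}
		return "", errNotFound
	}

	o, h := resolveMetricsTLS("", lookup)
	fmt.Println("auto-detect enabled TLS:", o == tlsEnabled, "hash:", h)

	o, _ = resolveMetricsTLS("my-custom-metrics-cert", lookup)
	fmt.Println("explicit but missing waits:", o == waiting)

	o, _ = resolveMetricsTLS("broken", lookup)
	fmt.Println("malformed secret fails:", o == failed)
}
```

The asymmetry is the point of the design: a missing default secret is treated as "TLS not requested", while a missing explicitly named secret is treated as "not ready yet".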

@@ -505,6 +538,28 @@ func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (result ct
 		// remove LastAppliedTopology from the .Status
 		instance.Status.LastAppliedTopology = nil
 	}
+	commonsvc, err := commonservice.NewService(instanceha.MetricsService(instance), time.Duration(5)*time.Second, nil)
+	if err != nil {
+		instance.Status.Conditions.Set(condition.FalseCondition(
+			condition.CreateServiceReadyCondition,
+			condition.ErrorReason,
+			condition.SeverityWarning,
+			condition.CreateServiceReadyErrorMessage,
+			err.Error()))
+		return ctrl.Result{}, err
+	}
+	sres, serr := commonsvc.CreateOrPatch(ctx, helper)
+	if serr != nil {
+		instance.Status.Conditions.Set(condition.FalseCondition(
+			condition.CreateServiceReadyCondition,
+			condition.ErrorReason,
+			condition.SeverityWarning,
+			condition.CreateServiceReadyErrorMessage,
+			serr.Error()))
+		return sres, serr
+	}
+	instance.Status.Conditions.MarkTrue(condition.CreateServiceReadyCondition, condition.CreateServiceReadyMessage)
+
 	deployment := commondeployment.NewDeployment(instanceha.Deployment(instance, deploymentLabels, serviceAnnotations, cloud, configVarsHash, containerImage, topology, acSecretName), time.Duration(5)*time.Second)
 	sfres, sferr := deployment.CreateOrPatch(ctx, helper)
 	if sferr != nil {
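
Note on `configVarsHash` in the hunk above: the commit message says the metrics TLS secret hash is tracked "for automatic pod rollout on certificate rotation". A minimal sketch of that idea follows, assuming the per-input hashes are folded into one digest that ends up in the pod template. `combinedHash` and the key names are hypothetical; this is not the lib-common implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// combinedHash is a hypothetical helper. It folds every per-input hash into
// one digest, visiting keys in sorted order so the result does not depend on
// Go's randomized map iteration.
func combinedHash(vars map[string]string) string {
	keys := make([]string, 0, len(vars))
	for k := range vars {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		// A zero byte separates fields so "ab"+"c" and "a"+"bc" differ.
		h.Write([]byte(k))
		h.Write([]byte{0})
		h.Write([]byte(vars[k]))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Key names are illustrative only.
	vars := map[string]string{
		"ca_bundle":   "hash-of-ca-bundle",
		"tls_metrics": "hash-of-cert-v1",
	}
	before := combinedHash(vars)

	// Certificate rotation changes the secret contents, hence its hash.
	vars["tls_metrics"] = "hash-of-cert-v2"
	after := combinedHash(vars)

	// A changed digest in the pod template is what triggers a new rollout.
	fmt.Println("rollout needed:", before != after)
}
```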
@@ -558,6 +613,7 @@ const (
 	instanceHaConfigMapField = ".spec.instanceHaConfigMap"
 	topologyField = ".spec.topologyRef.Name"
 	acSecretField = ".spec.auth.applicationCredentialSecret" // #nosec G101
+	metricsTLSField = ".spec.metricsTLS.secretName" // #nosec G101
 )

 var allWatchFields = []string{
@@ -568,6 +624,7 @@ var allWatchFields = []string{
 	instanceHaConfigMapField,
 	topologyField,
 	acSecretField,
+	metricsTLSField,
 }

 // SetupWithManager sets up the controller with the Manager.
@@ -649,9 +706,21 @@ func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error {
 		return err
 	}

+	// index metricsTLSField
+	if err := mgr.GetFieldIndexer().IndexField(context.Background(), &instancehav1.InstanceHa{}, metricsTLSField, func(rawObj client.Object) []string {
+		cr := rawObj.(*instancehav1.InstanceHa)
+		if cr.Spec.MetricsTLS.SecretName == nil {
+			return []string{instanceha.DefaultMetricsCertSecret}
+		}
+		return []string{*cr.Spec.MetricsTLS.SecretName}
+	}); err != nil {
+		return err
+	}
+
 	return ctrl.NewControllerManagedBy(mgr).
 		For(&instancehav1.InstanceHa{}).
 		Owns(&appsv1.Deployment{}).
+		Owns(&corev1.Service{}).
 		Owns(&corev1.ServiceAccount{}).
 		Owns(&rbacv1.Role{}).
 		Owns(&rbacv1.RoleBinding{}).
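
Note on the indexer in the hunk above: indexing CRs with an unset secret name under the default secret name is what lets auto-detected TLS react when `cert-instanceha-metrics` appears or rotates. A minimal, stdlib-only sketch of that lookup follows; the `spec` struct, `indexKeys`, and `crsForSecret` are hypothetical stand-ins for the real CR type and the controller-runtime field index.

```go
package main

import (
	"fmt"
	"sort"
)

// defaultMetricsCertSecret mirrors instanceha.DefaultMetricsCertSecret.
const defaultMetricsCertSecret = "cert-instanceha-metrics"

// spec is a hypothetical, trimmed stand-in for the InstanceHa spec.
type spec struct {
	MetricsTLSSecretName *string
}

// indexKeys mirrors the extractor function in the hunk above: an unset
// secret name is indexed under the default secret name.
func indexKeys(s spec) []string {
	if s.MetricsTLSSecretName == nil {
		return []string{defaultMetricsCertSecret}
	}
	return []string{*s.MetricsTLSSecretName}
}

// crsForSecret returns the sorted names of CRs whose index keys include the
// given secret, i.e. the CRs to reconcile when that secret changes.
func crsForSecret(crs map[string]spec, secret string) []string {
	var out []string
	for name, s := range crs {
		for _, k := range indexKeys(s) {
			if k == secret {
				out = append(out, name)
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	custom := "my-custom-metrics-cert"
	crs := map[string]spec{
		"instanceha-a": {},                             // relies on auto-detect
		"instanceha-b": {MetricsTLSSecretName: &custom}, // explicit secret
	}
	fmt.Println(crsForSecret(crs, defaultMetricsCertSecret))
	fmt.Println(crsForSecret(crs, custom))
}
```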

internal/instanceha/const.go

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+package instanceha
+
+const (
+	// MetricsCertPath is the path to the metrics certificate file
+	MetricsCertPath = "/etc/pki/tls/certs/metrics.crt"
+	// MetricsKeyPath is the path to the metrics private key file
+	MetricsKeyPath = "/etc/pki/tls/private/metrics.key"
+	// DefaultMetricsCertSecret is the default secret name for the metrics TLS certificate
+	DefaultMetricsCertSecret = "cert-instanceha-metrics" //nolint:gosec
+)
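
Note on these constants: the deployment change that consumes them is not shown on this page. Based on the documentation in this commit (the controller sets `METRICS_TLS_CERT` and `METRICS_TLS_KEY` and switches the probes to HTTPS), a minimal sketch of how they might be wired could look like the following. `envVar`, `metricsTLSEnv`, and `probeScheme` are hypothetical names, not the actual deployment code.

```go
package main

import "fmt"

// Paths copied from internal/instanceha/const.go in this commit.
const (
	metricsCertPath = "/etc/pki/tls/certs/metrics.crt"
	metricsKeyPath  = "/etc/pki/tls/private/metrics.key"
)

// envVar is a hypothetical, trimmed stand-in for a container env entry.
type envVar struct {
	Name  string
	Value string
}

// metricsTLSEnv returns the two variables the docs say the Python server
// looks for, or nil when TLS is not enabled.
func metricsTLSEnv(tlsEnabled bool) []envVar {
	if !tlsEnabled {
		return nil
	}
	return []envVar{
		{Name: "METRICS_TLS_CERT", Value: metricsCertPath},
		{Name: "METRICS_TLS_KEY", Value: metricsKeyPath},
	}
}

// probeScheme picks the probe URI scheme to match the server's protocol.
func probeScheme(tlsEnabled bool) string {
	if tlsEnabled {
		return "HTTPS"
	}
	return "HTTP"
}

func main() {
	for _, enabled := range []bool{false, true} {
		fmt.Println(enabled, probeScheme(enabled), metricsTLSEnv(enabled))
	}
}
```

Keeping the probe scheme and the environment variables behind the same flag avoids a state where the server speaks HTTPS while the kubelet still probes over HTTP.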
