
NE-1113: Implement podMonitor provisioning with Gateway API #1425

Open
rikatz wants to merge 1 commit into openshift:master from rikatz:ne-1113

Conversation


@rikatz rikatz commented Apr 28, 2026

Summary

Adds a new gateway-podmonitor controller that creates a PodMonitor resource per Gateway in openshift-ingress, enabling Prometheus to scrape Istio gateway pod metrics for consumption by Kiali.

  • New unmanaged dependent controller following the gateway-networkpolicy pattern
  • Creates a PodMonitor per Gateway selecting pods via gateway.networking.k8s.io/gateway-name
  • Scrapes Istio envoy metrics at /stats/prometheus (port http-envoy-prom) every 60s
  • Filters metrics via metricRelabelings to keep only istio_* and select envoy_* series
  • PodMonitor lifecycle tied to Gateway via OwnerReference (cascading deletion)
  • RBAC scoped to openshift-ingress namespace via namespaced Role/RoleBinding (not ClusterRole)
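
Putting the bullets above together, the desired PodMonitor can be sketched in the map-based unstructured form the PR description says `podmonitor.go` uses. This is a sketch inferred from the description, not the actual code; the exact allowlist regex and field layout in the PR may differ.

```go
package main

import "fmt"

// desiredPodMonitor sketches the PodMonitor the controller would build for a
// Gateway: selected by the gateway-name label, scraping http-envoy-prom at
// /stats/prometheus every 60s, with a "keep" allowlist relabeling. All field
// values here are inferred from the PR description.
func desiredPodMonitor(gatewayName string) map[string]interface{} {
	return map[string]interface{}{
		"apiVersion": "monitoring.coreos.com/v1",
		"kind":       "PodMonitor",
		"metadata": map[string]interface{}{
			"name":      gatewayName + "-monitor",
			"namespace": "openshift-ingress",
		},
		"spec": map[string]interface{}{
			"selector": map[string]interface{}{
				"matchLabels": map[string]interface{}{
					"gateway.networking.k8s.io/gateway-name": gatewayName,
				},
			},
			"podMetricsEndpoints": []interface{}{
				map[string]interface{}{
					"port":     "http-envoy-prom",
					"path":     "/stats/prometheus",
					"interval": "60s",
					"metricRelabelings": []interface{}{
						map[string]interface{}{
							"action":       "keep",
							"sourceLabels": []interface{}{"__name__"},
							// Illustrative allowlist; the PR's actual envoy_* patterns differ.
							"regex": "istio_.*|envoy_server_.*",
						},
					},
				},
			},
		},
	}
}

func main() {
	pm := desiredPodMonitor("demo-gateway")
	fmt.Println(pm["metadata"].(map[string]interface{})["name"]) // demo-gateway-monitor
}
```

The OwnerReference to the Gateway (omitted here) is what makes deletion cascade when the Gateway is removed.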

Files Changed

| File | Change |
| --- | --- |
| pkg/operator/controller/gateway-podmonitor/controller.go | New controller: watches Gateways and PodMonitors in openshift-ingress |
| pkg/operator/controller/gateway-podmonitor/podmonitor.go | PodMonitor ensure/create/update logic using unstructured.Unstructured |
| pkg/operator/controller/names.go | Added GatewayPodMonitorName() returning <gateway-name>-monitor |
| pkg/operator/operator.go | Registered controller as dependent of gatewayapi controller |
| manifests/01-role.yaml | Added namespaced Role for podmonitors in openshift-ingress |
| manifests/01-role-binding.yaml | Added corresponding RoleBinding |
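
The naming helper added to names.go amounts to the convention stated in the table: PodMonitor name = gateway name plus a "-monitor" suffix. A minimal sketch (the real helper may return a NamespacedName rather than a plain string):

```go
package main

import "fmt"

// gatewayPodMonitorName sketches the GatewayPodMonitorName() helper described
// above: it derives the PodMonitor name from the owning Gateway's name.
func gatewayPodMonitorName(gatewayName string) string {
	return gatewayName + "-monitor"
}

func main() {
	fmt.Println(gatewayPodMonitorName("demo-gateway")) // demo-gateway-monitor
}
```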

Prometheus Cardinality Impact

The metricRelabelings keep filter restricts scraped metrics to:

  • istio_* (request count, duration, size, TCP metrics)
  • Select envoy_* (upstream connections/requests, listener activity, server memory, uptime)

Estimated time series per deployment

| Scale | Series estimate | Impact |
| --- | --- | --- |
| 1 Gateway, 2 replicas | ~4,000 | Negligible |
| 5 Gateways, 3 replicas | ~30,000 | Modest |
| 20 Gateways, 3 replicas | ~120,000 | Moderate — monitor prometheus_tsdb_head_series |
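
The estimates above are consistent with a simple linear model: roughly 2,000 series per gateway replica (a figure assumed from the first row, and which in practice varies with routes, clusters, and histogram buckets):

```go
package main

import "fmt"

// estimateSeries models the table above: total head series scale linearly
// with gateways × replicas × per-replica series. The 2,000-per-replica figure
// is an assumption backed out of the "1 Gateway, 2 replicas ≈ 4,000" row.
func estimateSeries(gateways, replicasPerGateway, seriesPerReplica int) int {
	return gateways * replicasPerGateway * seriesPerReplica
}

func main() {
	const perReplica = 2000
	fmt.Println(estimateSeries(1, 2, perReplica))  // 4000
	fmt.Println(estimateSeries(5, 3, perReplica))  // 30000
	fmt.Println(estimateSeries(20, 3, perReplica)) // 120000
}
```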

Mitigations in place

  • keep action with regex filter drops all metrics not matching the allowlist
  • 60s scrape interval reduces ingestion load 2-4x vs common 15-30s scrape intervals
  • Per-gateway PodMonitor allows targeted deletion if a specific Gateway causes issues
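
The effect of the keep filter can be illustrated with a regular expression over metric names. The regex below is a stand-in; the PR's actual envoy_* allowlist is not reproduced on this page:

```go
package main

import (
	"fmt"
	"regexp"
)

// allowlist mirrors the shape of the "keep" metricRelabeling: only metric
// names matching the regex survive scraping. Patterns here are illustrative.
var allowlist = regexp.MustCompile(`^(istio_.*|envoy_server_.*|envoy_listener_.*)$`)

// kept reports whether a scraped metric name would survive the keep filter.
func kept(name string) bool {
	return allowlist.MatchString(name)
}

func main() {
	for _, name := range []string{
		"istio_requests_total",           // kept: istio_*
		"envoy_server_memory_allocated",  // kept: allowlisted envoy_*
		"envoy_http_downstream_rq_total", // dropped: not in the allowlist
	} {
		fmt.Printf("%s keep=%v\n", name, kept(name))
	}
}
```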

Recommendations for high-scale environments

  • Monitor prometheus_tsdb_head_series after enabling Gateway PodMonitors
  • If histogram buckets drive excessive cardinality, add a drop relabeling for istio_request_duration_milliseconds_bucket (keep only _count and _sum)
  • For >20 Gateways, consider tightening the regex further based on actual Kiali requirements
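
The histogram-bucket recommendation above can be sketched as a predicate for an additional drop relabeling: discard the per-bucket series while keeping _count and _sum, which still supports average-latency queries at far lower cardinality. Metric and function names here are illustrative:

```go
package main

import "fmt"

// dropBucket sketches the suggested drop rule: per-bucket histogram series
// for istio request duration are dropped; _count and _sum survive.
func dropBucket(name string) bool {
	return name == "istio_request_duration_milliseconds_bucket"
}

func main() {
	for _, name := range []string{
		"istio_request_duration_milliseconds_bucket",
		"istio_request_duration_milliseconds_count",
		"istio_request_duration_milliseconds_sum",
	} {
		fmt.Printf("%s drop=%v\n", name, dropBucket(name))
	}
}
```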

Followup

  • Add a NetworkPolicy for gateway provisioning that blocks metric scraping and allows only the openshift-monitoring namespace to do so

Test Plan

  • Deploy operator with Gateway API feature gate enabled
  • Create a Gateway in openshift-ingress
  • Verify PodMonitor <gateway-name>-monitor is created in openshift-ingress
  • Verify PodMonitor selector matches Gateway pod labels
  • Verify Prometheus targets show the Gateway pods as scrape targets
  • Verify istio_* and allowed envoy_* metrics appear in Prometheus
  • Delete the Gateway and verify the PodMonitor is garbage-collected
  • Verify no PodMonitor leak after Gateway deletion
  • Check Prometheus prometheus_tsdb_head_series for cardinality impact

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 28, 2026

openshift-ci-robot commented Apr 28, 2026

@rikatz: This pull request references NE-1113 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set.

Details

In response to this: (the PR description, quoted in full above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


coderabbitai Bot commented Apr 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: f9c091c6-aa72-4e1e-b8ab-9c0962aaaafb

📥 Commits

Reviewing files that changed from the base of the PR and between 5ceccfd and 7ae9aed.

📒 Files selected for processing (6)
  • manifests/00-cluster-role.yaml
  • pkg/operator/controller/gateway-podmonitor/controller.go
  • pkg/operator/controller/gateway-podmonitor/podmonitor.go
  • pkg/operator/controller/names.go
  • pkg/operator/operator.go
  • test/e2e/gateway_api_test.go
✅ Files skipped from review due to trivial changes (1)
  • manifests/00-cluster-role.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/operator/controller/names.go
  • pkg/operator/operator.go
  • pkg/operator/controller/gateway-podmonitor/controller.go

📝 Walkthrough

Walkthrough

The pull request introduces a new unmanaged gateway-podmonitor controller that reconciles PodMonitor resources for each Gateway. The controller watches Gateway objects and PodMonitor resources in the operand namespace, mapping PodMonitor events to their owning Gateway. On reconciliation, it creates or updates a PodMonitor with configured metrics scraping on port http-envoy-prom at path /stats/prometheus. The controller is integrated into the operator's lifecycle management and includes corresponding RBAC permissions and e2e test coverage to validate PodMonitor creation and configuration.

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: implementing PodMonitor provisioning for Gateway API, which is the core objective of this PR. |
| Description check | ✅ Passed | The description comprehensively covers the changeset with detailed rationale, cardinality analysis, test plan, and implementation notes directly related to the PodMonitor controller addition. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All Ginkgo test declarations in gateway_api_test.go contain stable, deterministic names without dynamic content, fmt.Sprintf, string concatenation, or variable interpolation. |
| Test Structure And Quality | ✅ Passed | The assertGatewayPodMonitor helper function tests a single logical behavior with appropriate setup/cleanup patterns, proper timeouts, meaningful assertion messages, and follows existing codebase patterns. |
| Microshift Test Compatibility | ✅ Passed | New e2e test additions do not use MicroShift-incompatible APIs. PodMonitor resources from the monitoring.coreos.com/v1 API and standard Kubernetes APIs are used without OpenShift-specific dependencies. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | The assertGatewayPodMonitor test helper function does not make multi-node or HA cluster assumptions and works equally well in SNO environments. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR does not introduce topology-aware scheduling constraints; creates only configuration PodMonitors using standard label selectors. |
| Ote Binary Stdout Contract | ✅ Passed | Pull request introduces gateway-podmonitor controller code that properly adheres to the OTE Binary Stdout Contract, with all logging using the configured logf.Logger writing to stderr. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The assertGatewayPodMonitor test helper contains no IPv4-specific assumptions, IP parsing logic, or external connectivity requirements, operating entirely within the Kubernetes cluster using standard client operations. |



openshift-ci Bot commented Apr 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign miciah for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/operator/operator.go (1)

358-375: ⚠️ Potential issue | 🟠 Major

Avoid one-shot startup for the PodMonitor dependent controller.

gatewayapi.ensureDependentControllers() starts dependents once and never retries after a Start() error. That worked for the existing Gateway API dependents because their prerequisite CRDs are created by the gatewayapi bootstrap path, but gateway-podmonitor adds a separate dependency on monitoring.coreos.com/v1/PodMonitor. If this controller fails to start during bootstrap, PodMonitor reconciliation stays off until the operator restarts.

Please gate this controller on PodMonitor CRD readiness or make dependent-controller startup retryable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/operator.go` around lines 358 - 375, The gateway-podmonitor
controller (created via gatewaypodmonitorcontroller.NewUnmanaged and passed into
the DependentControllers slice of gatewayapicontroller.New) may fail to start
one-shot if the monitoring.coreos.com/v1/PodMonitor CRD isn’t present; update
the startup logic so PodMonitor is gated on CRD readiness or make
ensureDependentControllers retryable: either add a CRD-ready check for
monitoring.coreos.com/v1/PodMonitor before registering/starting
gatewayPodMonitorController, or change gatewayapi.ensureDependentControllers to
attempt Start() with retry/backoff and continue retrying on transient Start()
errors so PodMonitor reconciliation comes up without requiring an operator
restart.
🧹 Nitpick comments (1)
test/e2e/gateway_api_test.go (1)

1647-1658: Assert the keep relabeling too.

The critical safeguard in this feature is the metricRelabelings keep rule. Right now the test still passes if a future change drops that filter and starts scraping the full Envoy metric set, which is exactly the failure mode called out in the PR description.

Suggested assertion
```diff
 	// Verify podMetricsEndpoints configuration.
 	endpoints, found, err := unstructured.NestedSlice(pm.Object, "spec", "podMetricsEndpoints")
 	if assert.NoError(t, err) && assert.True(t, found, "PodMonitor should have spec.podMetricsEndpoints") {
 		if assert.Len(t, endpoints, 1, "PodMonitor should have exactly one endpoint") {
 			ep, ok := endpoints[0].(map[string]interface{})
 			if assert.True(t, ok, "endpoint should be a map") {
 				assert.Equal(t, "/stats/prometheus", ep["path"], "endpoint path should be /stats/prometheus")
 				assert.Equal(t, "60s", ep["interval"], "endpoint interval should be 60s")
 				assert.Equal(t, "http-envoy-prom", ep["port"], "endpoint port should be http-envoy-prom")
+				relabelings, ok := ep["metricRelabelings"].([]interface{})
+				if assert.True(t, ok, "endpoint should define metricRelabelings") &&
+					assert.Len(t, relabelings, 1, "endpoint should have exactly one metricRelabeling") {
+					relabeling, ok := relabelings[0].(map[string]interface{})
+					if assert.True(t, ok, "metricRelabeling should be a map") {
+						assert.Equal(t, "keep", relabeling["action"], "metricRelabeling action should be keep")
+					}
+				}
 			}
 		}
 	}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/gateway_api_test.go` around lines 1647 - 1658, Add an assertion that
the endpoint's metricRelabelings includes a keep rule: after you obtain ep from
endpoints[0] (the existing variables pm.Object, endpoints and ep), fetch
ep["metricRelabelings"] as a slice, assert no error and that it exists, then
assert at least one entry is a map containing "action" == "keep" (and optionally
the expected "regex" key if you want to be stricter) so the test fails if the
keep relabeling is removed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/operator/controller/gateway-podmonitor/controller.go`:
- Around line 63-75: The mapper enqueueRequestForOwningGateway currently only
uses manifests.OwningGatewayLabel; update it to fall back to the object's owner
references when that label is absent or changed: if
labels[manifests.OwningGatewayLabel] is missing, iterate a.GetOwnerReferences(),
find an owner with Kind matching the Gateway kind (use the Gateway kind constant
if available or "Gateway"), take owner.Name and enqueue a reconcile.Request for
that name (use operatorcontroller.DefaultOperandNamespace for the gateway
namespace to match desiredGatewayPodMonitor behavior); keep the original
label-based path and only use the ownerRef fallback when label lookup fails.

---

Outside diff comments:
In `@pkg/operator/operator.go`:
- Around line 358-375: The gateway-podmonitor controller (created via
gatewaypodmonitorcontroller.NewUnmanaged and passed into the
DependentControllers slice of gatewayapicontroller.New) may fail to start
one-shot if the monitoring.coreos.com/v1/PodMonitor CRD isn’t present; update
the startup logic so PodMonitor is gated on CRD readiness or make
ensureDependentControllers retryable: either add a CRD-ready check for
monitoring.coreos.com/v1/PodMonitor before registering/starting
gatewayPodMonitorController, or change gatewayapi.ensureDependentControllers to
attempt Start() with retry/backoff and continue retrying on transient Start()
errors so PodMonitor reconciliation comes up without requiring an operator
restart.

---

Nitpick comments:
In `@test/e2e/gateway_api_test.go`:
- Around line 1647-1658: Add an assertion that the endpoint's metricRelabelings
includes a keep rule: after you obtain ep from endpoints[0] (the existing
variables pm.Object, endpoints and ep), fetch ep["metricRelabelings"] as a
slice, assert no error and that it exists, then assert at least one entry is a
map containing "action" == "keep" (and optionally the expected "regex" key if
you want to be stricter) so the test fails if the keep relabeling is removed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 7025f267-551c-4c6a-be65-faec8aa0a877

📥 Commits

Reviewing files that changed from the base of the PR and between 961ac21 and 5ceccfd.

📒 Files selected for processing (7)
  • manifests/01-role-binding.yaml
  • manifests/01-role.yaml
  • pkg/operator/controller/gateway-podmonitor/controller.go
  • pkg/operator/controller/gateway-podmonitor/podmonitor.go
  • pkg/operator/controller/names.go
  • pkg/operator/operator.go
  • test/e2e/gateway_api_test.go

Comment thread pkg/operator/controller/gateway-podmonitor/controller.go

rikatz commented Apr 28, 2026

The pre-merge findings do not make sense, as I am not changing openshift/origin tests.


rikatz commented Apr 28, 2026

Claude's answer on guarding the controller for the CRD and ensureDependentControllers:

Regarding the ensureDependentControllers CRD readiness finding — no fix is needed. Here's why:

  1. Pre-existing pattern: ensureDependentControllers uses a fire-and-forget Start() in goroutines for all
    dependent controllers (networkpolicy, status, labeler, DNS). The controllersStarted flag prevents retry after
    the first attempt. This is the established design, not something specific to our PodMonitor controller.
  2. monitoring.coreos.com CRDs are always present on OpenShift: The cluster-monitoring-operator is a required
    component and installs these CRDs. The existing ServiceMonitor code in
    pkg/operator/controller/ingress/monitoring.go already makes this same assumption — it creates
    monitoring.coreos.com/v1/ServiceMonitor resources without any CRD readiness gating.
  3. Scope: Changing ensureDependentControllers to be retryable or CRD-gated would affect all dependent
    controllers and is a broader architectural change beyond the scope of this PodMonitor addition.

All the other findings from the review have already been verified and fixed where needed (RBAC list/watch
verbs, %w error wrapping, AlreadyExists guard, unused struct fields, consistent cmp.Equal usage). The
implementation is complete.


rikatz commented Apr 30, 2026

image


rikatz commented Apr 30, 2026

image


rikatz commented Apr 30, 2026

/hold
holding for now, we need to discuss a better strategy to ship metrics on Gateways and also possible impacts

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 30, 2026

openshift-ci Bot commented Apr 30, 2026

@rikatz: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn | 7ae9aed | link | true | /test e2e-aws-ovn |
| ci/prow/e2e-aws-operator-techpreview | 7ae9aed | link | false | /test e2e-aws-operator-techpreview |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
