NE-1113: Implement podMonitor provisioning with Gateway API#1425
Conversation
|
@rikatz: This pull request references NE-1113 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (6)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughThe pull request introduces a new unmanaged 🚥 Pre-merge checks | ✅ 12✅ Passed checks (12 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 60 minutes.Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/operator/operator.go (1)
358-375:⚠️ Potential issue | 🟠 MajorAvoid one-shot startup for the PodMonitor dependent controller.
gatewayapi.ensureDependentControllers()starts dependents once and never retries after aStart()error. That worked for the existing Gateway API dependents because their prerequisite CRDs are created by the gatewayapi bootstrap path, butgateway-podmonitoradds a separate dependency onmonitoring.coreos.com/v1/PodMonitor. If this controller fails to start during bootstrap, PodMonitor reconciliation stays off until the operator restarts.Please gate this controller on PodMonitor CRD readiness or make dependent-controller startup retryable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/operator.go` around lines 358 - 375, The gateway-podmonitor controller (created via gatewaypodmonitorcontroller.NewUnmanaged and passed into the DependentControllers slice of gatewayapicontroller.New) may fail to start one-shot if the monitoring.coreos.com/v1/PodMonitor CRD isn’t present; update the startup logic so PodMonitor is gated on CRD readiness or make ensureDependentControllers retryable: either add a CRD-ready check for monitoring.coreos.com/v1/PodMonitor before registering/starting gatewayPodMonitorController, or change gatewayapi.ensureDependentControllers to attempt Start() with retry/backoff and continue retrying on transient Start() errors so PodMonitor reconciliation comes up without requiring an operator restart.
🧹 Nitpick comments (1)
test/e2e/gateway_api_test.go (1)
1647-1658: Assert the keep relabeling too.The critical safeguard in this feature is the
metricRelabelingskeep rule. Right now the test still passes if a future change drops that filter and starts scraping the full Envoy metric set, which is exactly the failure mode called out in the PR description.Suggested assertion
// Verify podMetricsEndpoints configuration. endpoints, found, err := unstructured.NestedSlice(pm.Object, "spec", "podMetricsEndpoints") if assert.NoError(t, err) && assert.True(t, found, "PodMonitor should have spec.podMetricsEndpoints") { if assert.Len(t, endpoints, 1, "PodMonitor should have exactly one endpoint") { ep, ok := endpoints[0].(map[string]interface{}) if assert.True(t, ok, "endpoint should be a map") { assert.Equal(t, "/stats/prometheus", ep["path"], "endpoint path should be /stats/prometheus") assert.Equal(t, "60s", ep["interval"], "endpoint interval should be 60s") assert.Equal(t, "http-envoy-prom", ep["port"], "endpoint port should be http-envoy-prom") + relabelings, ok := ep["metricRelabelings"].([]interface{}) + if assert.True(t, ok, "endpoint should define metricRelabelings") && + assert.Len(t, relabelings, 1, "endpoint should have exactly one metricRelabeling") { + relabeling, ok := relabelings[0].(map[string]interface{}) + if assert.True(t, ok, "metricRelabeling should be a map") { + assert.Equal(t, "keep", relabeling["action"], "metricRelabeling action should be keep") + } + } } } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/gateway_api_test.go` around lines 1647 - 1658, Add an assertion that the endpoint's metricRelabelings includes a keep rule: after you obtain ep from endpoints[0] (the existing variables pm.Object, endpoints and ep), fetch ep["metricRelabelings"] as a slice, assert no error and that it exists, then assert at least one entry is a map containing "action" == "keep" (and optionally the expected "regex" key if you want to be stricter) so the test fails if the keep relabeling is removed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/controller/gateway-podmonitor/controller.go`:
- Around line 63-75: The mapper enqueueRequestForOwningGateway currently only
uses manifests.OwningGatewayLabel; update it to fall back to the object's owner
references when that label is absent or changed: if
labels[manifests.OwningGatewayLabel] is missing, iterate a.GetOwnerReferences(),
find an owner with Kind matching the Gateway kind (use the Gateway kind constant
if available or "Gateway"), take owner.Name and enqueue a reconcile.Request for
that name (use operatorcontroller.DefaultOperandNamespace for the gateway
namespace to match desiredGatewayPodMonitor behavior); keep the original
label-based path and only use the ownerRef fallback when label lookup fails.
---
Outside diff comments:
In `@pkg/operator/operator.go`:
- Around line 358-375: The gateway-podmonitor controller (created via
gatewaypodmonitorcontroller.NewUnmanaged and passed into the
DependentControllers slice of gatewayapicontroller.New) may fail to start
one-shot if the monitoring.coreos.com/v1/PodMonitor CRD isn’t present; update
the startup logic so PodMonitor is gated on CRD readiness or make
ensureDependentControllers retryable: either add a CRD-ready check for
monitoring.coreos.com/v1/PodMonitor before registering/starting
gatewayPodMonitorController, or change gatewayapi.ensureDependentControllers to
attempt Start() with retry/backoff and continue retrying on transient Start()
errors so PodMonitor reconciliation comes up without requiring an operator
restart.
---
Nitpick comments:
In `@test/e2e/gateway_api_test.go`:
- Around line 1647-1658: Add an assertion that the endpoint's metricRelabelings
includes a keep rule: after you obtain ep from endpoints[0] (the existing
variables pm.Object, endpoints and ep), fetch ep["metricRelabelings"] as a
slice, assert no error and that it exists, then assert at least one entry is a
map containing "action" == "keep" (and optionally the expected "regex" key if
you want to be stricter) so the test fails if the keep relabeling is removed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 7025f267-551c-4c6a-be65-faec8aa0a877
📒 Files selected for processing (7)
manifests/01-role-binding.yamlmanifests/01-role.yamlpkg/operator/controller/gateway-podmonitor/controller.gopkg/operator/controller/gateway-podmonitor/podmonitor.gopkg/operator/controller/names.gopkg/operator/operator.gotest/e2e/gateway_api_test.go
|
The pre-merge findings does not make sense as I am not changing openshift/origin tests |
|
claude answer on guarding the controller for CRD and ensureDependentControllers: Regarding the ensureDependentControllers CRD readiness finding — no fix is needed. Here's why:
All the other findings from the review have already been verified and fixed where needed (RBAC list/watch |
|
/hold |
|
@rikatz: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |


Summary
Adds a new
gateway-podmonitorcontroller that creates a PodMonitor resource per Gateway inopenshift-ingress, enabling Prometheus to scrape Istio gateway pod metrics for consumption by Kiali.gateway-networkpolicypatterngateway.networking.k8s.io/gateway-name/stats/prometheus(porthttp-envoy-prom) every 60smetricRelabelingsto keep onlyistio_*and selectenvoy_*seriesopenshift-ingressnamespace via namespaced Role/RoleBinding (not ClusterRole)Files Changed
pkg/operator/controller/gateway-podmonitor/controller.goopenshift-ingresspkg/operator/controller/gateway-podmonitor/podmonitor.gounstructured.Unstructuredpkg/operator/controller/names.goGatewayPodMonitorName()returning<gateway-name>-monitorpkg/operator/operator.gogatewayapicontrollermanifests/01-role.yamlpodmonitorsinopenshift-ingressmanifests/01-role-binding.yamlPrometheus Cardinality Impact
The
metricRelabelingskeepfilter restricts scraped metrics to:istio_*(request count, duration, size, TCP metrics)envoy_*(upstream connections/requests, listener activity, server memory, uptime)Estimated time series per deployment
prometheus_tsdb_head_seriesMitigations in place
keepaction with regex filter drops all metrics not matching the allowlistRecommendations for high-scale environments
prometheus_tsdb_head_seriesafter enabling Gateway PodMonitorsdroprelabeling foristio_request_duration_milliseconds_bucket(keep only_countand_sum)Followup
Test Plan
openshift-ingress<gateway-name>-monitoris created inopenshift-ingressistio_*and allowedenvoy_*metrics appear in Prometheusprometheus_tsdb_head_seriesfor cardinality impact