LOG-9356: Implement Collector ServiceMonitor Changes#3270
LOG-9356: Implement Collector ServiceMonitor Changes#3270Clee2691 wants to merge 1 commit intoopenshift:masterfrom
Conversation
|
@Clee2691: This pull request references LOG-9356 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.8.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Review Summary by QodoAdd metric filtering to collector ServiceMonitor
WalkthroughsDescription• Add metric filtering to ServiceMonitor with keep/drop relabel configs • Keep only essential metrics for alerts, dashboards, telemetry • Drop transform component metrics except errors and discarded events • Refactor relabel config structure for improved readability Diagramflowchart LR
SM["ServiceMonitor<br/>MetricRelabelConfigs"]
R1["Relabel 1:<br/>Replace - with _"]
R2["Relabel 2:<br/>Keep allowed metrics"]
R3["Relabel 3:<br/>Drop transform metrics"]
SM --> R1
SM --> R2
SM --> R3
R2 --> Filter["Filtered metrics<br/>for monitoring"]
R3 --> Filter
File Changes1. internal/metrics/service_monitor.go
|
Code Review by Qodo
Context used 1. LFME metrics filtered out
|
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (16)
WalkthroughReconciler now creates two ServiceMonitors (full and minimal) per component, with configurable Prometheus relabel configs and a metrics-collection-profile label. New relabel builder produces precomputed config sets; ServiceMonitor construction/signature and tests updated to accept and validate metricRelabelConfigs, TLS ServerName, and profile labeling. ChangesMetrics relabel & ServiceMonitor flow
Sequence DiagramsequenceDiagram
participant Controller as Controller
participant Relabel as Relabel Builder
participant K8s as Kubernetes API
participant PromOp as Prometheus Operator
Controller->>Relabel: Request relabel configs (full/minimal)
Relabel-->>Controller: Return relabel config sets
Controller->>K8s: Create/Update ServiceMonitor (endpoints, TLS ServerName, metricRelabelConfigs, profile label)
K8s->>PromOp: Prometheus Operator observes ServiceMonitor
PromOp->>K8s: Begin scraping targets per ServiceMonitor spec
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 8 | ❌ 4❌ Failed checks (4 warnings)
✅ Passed checks (8 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| { | ||
| Action: "keep", | ||
| SourceLabels: []monitoringv1.LabelName{"__name__"}, | ||
| Regex: "logcollector_component_event_unmatched_count|vector_buffer_byte_size|vector_buffer_discarded_events_total|vector_buffer_events|vector_buffer_sent_events_total|vector_component_discarded_events_total|vector_component_errors_total|vector_component_received_bytes_total|vector_component_received_event_bytes_total|vector_component_received_events_total|vector_component_sent_bytes_total|vector_events_in_total|vector_http_client_errors_total|vector_http_client_requests_sent_total|vector_http_client_responses_total|vector_open_files", | ||
| }, |
There was a problem hiding this comment.
1. Lfme metrics filtered out 🐞 Bug ≡ Correctness
newServiceMonitor() now applies a metric "keep" allowlist to all ServiceMonitors, including LogFileMetricsExporter’s ServiceMonitor, which will drop log_logged_bytes_total from Prometheus. This breaks the shipped recording rule and dashboard panels that query log_logged_bytes_total.
Agent Prompt
## Issue description
`metrics.newServiceMonitor()` now installs a metric relabel `keep` allowlist for **every** ServiceMonitor created via `metrics.ReconcileServiceMonitor()`. This unintentionally filters out LogFileMetricsExporter metrics (notably `log_logged_bytes_total`), breaking shipped recording rules and dashboard panels.
## Issue Context
- LogFileMetricsExporter reconciles its ServiceMonitor through the same `metrics.ReconcileServiceMonitor()` path used by the collector.
- The new keep-regex only allows a vector/logcollector metric subset, excluding `log_logged_bytes_total`.
## Fix Focus Areas
- internal/metrics/service_monitor.go[33-50]
- internal/metrics/logfilemetricexporter/metric_exporter.go[78-82]
- config/prometheus/collector_alerts.yaml[97-107]
### Expected fix direction
- Make ServiceMonitor relabeling **component-specific**:
- Either split into `newCollectorServiceMonitor()` (with allowlist/drop rules) and a generic `newServiceMonitor()` (without restrictive keep), or
- Add a parameter to `newServiceMonitor(..., metricRelabelConfigs []*monitoringv1.RelabelConfig)` and pass the collector-only configs from the collector reconciler.
- Ensure LFME-required metrics (at least `log_logged_bytes_total`) are not filtered out of Prometheus.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/metrics/service_monitor.go (1)
43-43: 💤 Low valueConsider extracting the keep-list regex to a named constant.
The allowlist is the core deliverable of this PR and a maintenance hotspot (new metrics are added, removed, or renamed). Inlining it as a 500+ character string makes diffs noisy and typos hard to spot.
♻️ Suggested refactor
+const collectorMetricsKeepRegex = "logcollector_component_event_unmatched_count|" + + "vector_buffer_byte_size|vector_buffer_discarded_events_total|" + + "vector_buffer_events|vector_buffer_sent_events_total|" + + "vector_component_discarded_events_total|vector_component_errors_total|" + + "vector_component_received_bytes_total|vector_component_received_event_bytes_total|" + + "vector_component_received_events_total|vector_component_sent_bytes_total|" + + "vector_events_in_total|vector_http_client_errors_total|" + + "vector_http_client_requests_sent_total|vector_http_client_responses_total|" + + "vector_open_files" // in newServiceMonitor: - Regex: "logcollector_component_event_unmatched_count|vector_buffer_byte_size|...", + Regex: collectorMetricsKeepRegex,The constant can then be referenced in the test as well, removing duplication.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/metrics/service_monitor.go` at line 43, Extract the long inline allowlist string used in the Regex field into a clearly named package-level constant (e.g., allowedMetricsRegex or keepListRegex) in internal/metrics/service_monitor.go, replace the inlined string in the struct literal's Regex field with that constant, and update any tests to import/reference the same constant instead of duplicating the string so the allowlist is a single source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/metrics/service_monitor_test.go`:
- Around line 83-84: The test currently omits asserting the allowlist regex so
changes silently pass; update the test in service_monitor_test.go to assert
relabelConfigs[1].Regex equals the expected allowlist regex (or, better, compare
it to the constant defined in service_monitor.go if you extract the regex
there), i.e., add an
Expect(relabelConfigs[1].Regex).To(Equal(<expectedRegexOrConstant>)) alongside
the existing Action and SourceLabels assertions.
---
Nitpick comments:
In `@internal/metrics/service_monitor.go`:
- Line 43: Extract the long inline allowlist string used in the Regex field into
a clearly named package-level constant (e.g., allowedMetricsRegex or
keepListRegex) in internal/metrics/service_monitor.go, replace the inlined
string in the struct literal's Regex field with that constant, and update any
tests to import/reference the same constant instead of duplicating the string so
the allowlist is a single source of truth.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c0367dba-620a-4f1b-abb3-d1fbd24ba9de
📒 Files selected for processing (2)
internal/metrics/service_monitor.gointernal/metrics/service_monitor_test.go
| Expect(string(relabelConfigs[1].Action)).To(Equal("keep")) | ||
| Expect(relabelConfigs[1].SourceLabels).To(Equal([]monitoringv1.LabelName{"__name__"})) |
There was a problem hiding this comment.
The keep-list regex is unasserted — the PR's core change has no test validation.
relabelConfigs[1].Regex is never checked. A typo, accidental truncation, or future metric name change in the allowlist would silently pass the test suite.
✅ Proposed fix
Expect(string(relabelConfigs[1].Action)).To(Equal("keep"))
Expect(relabelConfigs[1].SourceLabels).To(Equal([]monitoringv1.LabelName{"__name__"}))
+Expect(relabelConfigs[1].Regex).To(Equal(
+ "logcollector_component_event_unmatched_count|vector_buffer_byte_size|" +
+ "vector_buffer_discarded_events_total|vector_buffer_events|" +
+ "vector_buffer_sent_events_total|vector_component_discarded_events_total|" +
+ "vector_component_errors_total|vector_component_received_bytes_total|" +
+ "vector_component_received_event_bytes_total|vector_component_received_events_total|" +
+ "vector_component_sent_bytes_total|vector_events_in_total|" +
+ "vector_http_client_errors_total|vector_http_client_requests_sent_total|" +
+ "vector_http_client_responses_total|vector_open_files",
+))(Even better: extract the regex to a constant in service_monitor.go and reference it here directly.)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Expect(string(relabelConfigs[1].Action)).To(Equal("keep")) | |
| Expect(relabelConfigs[1].SourceLabels).To(Equal([]monitoringv1.LabelName{"__name__"})) | |
| Expect(string(relabelConfigs[1].Action)).To(Equal("keep")) | |
| Expect(relabelConfigs[1].SourceLabels).To(Equal([]monitoringv1.LabelName{"__name__"})) | |
| Expect(relabelConfigs[1].Regex).To(Equal( | |
| "logcollector_component_event_unmatched_count|vector_buffer_byte_size|" + | |
| "vector_buffer_discarded_events_total|vector_buffer_events|" + | |
| "vector_buffer_sent_events_total|vector_component_discarded_events_total|" + | |
| "vector_component_errors_total|vector_component_received_bytes_total|" + | |
| "vector_component_received_event_bytes_total|vector_component_received_events_total|" + | |
| "vector_component_sent_bytes_total|vector_events_in_total|" + | |
| "vector_http_client_errors_total|vector_http_client_requests_sent_total|" + | |
| "vector_http_client_responses_total|vector_open_files", | |
| )) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/metrics/service_monitor_test.go` around lines 83 - 84, The test
currently omits asserting the allowlist regex so changes silently pass; update
the test in service_monitor_test.go to assert relabelConfigs[1].Regex equals the
expected allowlist regex (or, better, compare it to the constant defined in
service_monitor.go if you extract the regex there), i.e., add an
Expect(relabelConfigs[1].Regex).To(Equal(<expectedRegexOrConstant>)) alongside
the existing Action and SourceLabels assertions.
|
/retest |
|
/hold |
|
Have you considered implementing collection profiles? The general idea is that the admin defines the level of details for metrics at the cluster level. We have 3 profiles:
We can help with the implementation (in summary, you need to define 1 additional service monitor per profile). Happy to discuss directly on your forum. cc @jan--f and @r2d2rnd (who informed us about the work in progress, thanks!) |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
internal/metrics/relabel.go (2)
13-17: 💤 Low value
excludeMetricsis semantically ambiguous — considerdroppedMetrics.The field name
excludeMetricscan be read as "metrics excluded from the drop action" (i.e., kept), but the implementation puts them inside thedroprule regex, meaning they are the metrics that get dropped for the given label combination.droppedMetricsormetricsToDropForLabelwould remove this ambiguity for future maintainers.♻️ Proposed rename
type metricDropConfig struct { labelName string labelValue string - excludeMetrics []string + droppedMetrics []string }var collectorMinimalDropConfigs = []metricDropConfig{ { labelName: "component_kind", labelValue: "transform", - excludeMetrics: []string{ + droppedMetrics: []string{And in
buildRelabelConfigs:- Regex: drop.labelValue + ";(" + strings.Join(drop.excludeMetrics, "|") + ")", + Regex: drop.labelValue + ";(" + strings.Join(drop.droppedMetrics, "|") + ")",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/metrics/relabel.go` around lines 13 - 17, Rename the ambiguous struct field excludeMetrics in metricDropConfig to a clearer name like droppedMetrics (or metricsToDropForLabel) and update all references accordingly; specifically change the field declaration in metricDropConfig and adjust any code that reads or writes metricDropConfig (e.g., the buildRelabelConfigs function and any callers or JSON/YAML tags) so they use the new symbol name to reflect that these are the metrics that will be dropped for the given label combination.
89-95: 💤 Low valueAdd a length guard for
excludeMetricsto match the allowlist pattern.The allowlist block guards
len(allowlist.allowedMetrics) > 0before appending a keep rule, but the drop loop has no equivalent guard. An emptyexcludeMetricsslice produces the regexlabelValue;()withaction: drop. While Prometheus's anchored full-string matching makes this effectively a no-op for real metrics (would only match a sample with an empty__name__), it emits a spurious relabel rule and is inconsistent with the allowlist guard style.♻️ Proposed fix
for _, drop := range dropConfigs { + if len(drop.excludeMetrics) == 0 { + continue + } configs = append(configs, &monitoringv1.RelabelConfig{🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/metrics/relabel.go` around lines 89 - 95, The drop loop appends a "drop" relabel rule even when drop.excludeMetrics is empty, producing a spurious regex like "labelValue;()"; add the same length guard used for the allowlist: check len(drop.excludeMetrics) > 0 before creating/appending the monitoringv1.RelabelConfig in the loop (the loop iterating over dropConfigs that currently builds SourceLabels/Regex), so only append the drop rule when excludeMetrics is non-empty.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@internal/metrics/relabel.go`:
- Around line 13-17: Rename the ambiguous struct field excludeMetrics in
metricDropConfig to a clearer name like droppedMetrics (or
metricsToDropForLabel) and update all references accordingly; specifically
change the field declaration in metricDropConfig and adjust any code that reads
or writes metricDropConfig (e.g., the buildRelabelConfigs function and any
callers or JSON/YAML tags) so they use the new symbol name to reflect that these
are the metrics that will be dropped for the given label combination.
- Around line 89-95: The drop loop appends a "drop" relabel rule even when
drop.excludeMetrics is empty, producing a spurious regex like "labelValue;()";
add the same length guard used for the allowlist: check len(drop.excludeMetrics)
> 0 before creating/appending the monitoringv1.RelabelConfig in the loop (the
loop iterating over dropConfigs that currently builds SourceLabels/Regex), so
only append the drop rule when excludeMetrics is non-empty.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: da0fa976-0000-4188-84d0-d85035bde888
📒 Files selected for processing (10)
internal/constants/constants.gointernal/constants/labels.gointernal/controller/observability/collector.gointernal/controller/observability/collector_test.gointernal/metrics/logfilemetricexporter/metric_exporter.gointernal/metrics/logfilemetricexporter/metric_exporter_test.gointernal/metrics/relabel.gointernal/metrics/relabel_test.gointernal/metrics/service_monitor.gointernal/metrics/service_monitor_test.go
✅ Files skipped from review due to trivial changes (2)
- internal/constants/labels.go
- internal/constants/constants.go
@simonpasquier I've made changes to this to implement profiles. Can someone on the monitoring team also review? |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Clee2691, jcantrill The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@Clee2691: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/helpers/prometheus/prometheus.go`:
- Around line 42-74: In QueryPrometheus replace the no-timeout http.Client with
one that has a 30s Timeout, stop using Fail(...) on request/response errors and
instead log the error and return an empty map[string]interface{} (so callers can
retry), validate response.StatusCode (expect 200) before attempting to
read/unmarshal the body and handle non-200 by logging and returning empty map,
and ensure response.Body is always closed; update references in this function
(QueryPrometheus, the http.Client/Request handling, and the json.Unmarshal
logic) accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9624b7e9-9df4-450c-918c-851df0a88fa1
📒 Files selected for processing (16)
internal/constants/constants.gointernal/constants/labels.gointernal/controller/observability/collector.gointernal/controller/observability/collector_test.gointernal/metrics/logfilemetricexporter/metric_exporter.gointernal/metrics/logfilemetricexporter/metric_exporter_test.gointernal/metrics/relabel.gointernal/metrics/relabel_test.gointernal/metrics/service_monitor.gointernal/metrics/service_monitor_test.gotest/e2e/collection/metrics/profile_test.gotest/e2e/collection/metrics/suite_test.gotest/e2e/flowcontrol/utils.gotest/e2e/logfilesmetricexporter/lfme_test.gotest/e2e/operator/metrics/e2e_test.gotest/helpers/prometheus/prometheus.go
✅ Files skipped from review due to trivial changes (3)
- internal/constants/labels.go
- internal/metrics/logfilemetricexporter/metric_exporter_test.go
- internal/constants/constants.go
🚧 Files skipped from review as they are similar to previous changes (4)
- internal/controller/observability/collector.go
- internal/metrics/logfilemetricexporter/metric_exporter.go
- internal/controller/observability/collector_test.go
- internal/metrics/service_monitor.go
Description
This PR makes use of metrics collection profiles.
The CMO supports metrics collection profiles that select
ServiceMonitorsby themonitoring.openshift.io/collection-profilelabel. This change adds that label to allServiceMonitorscreated by CLO, enabling cluster admins to reduce metrics cardinality by switching to the minimal profile.Because the minimal
ServiceMonitorhas a different resource name (e.g.,minimal-collector) but targets the same backing service, a serviceNameparameter was added to ensure TLS ServerName verification uses the correct service DNS name.Minimal Metrics - Collector:
logcollector_component_event_unmatched_countvector_buffer_byte_sizevector_buffer_discarded_events_totalvector_buffer_eventsvector_buffer_sent_events_totalvector_component_discarded_events_totalvector_component_errors_totalvector_component_received_bytes_totalvector_component_received_event_bytes_totalvector_component_received_events_totalvector_component_sent_bytes_totalvector_events_in_totalvector_http_client_errors_totalvector_http_client_requests_sent_totalvector_http_client_responses_totalvector_open_filesMinimal Metrics - LFME:
log_logged_bytes_total/cc @vparfonov
/assign @jcantrill
Links
Summary by CodeRabbit
New Features
Tests