Skip to content

LOG-9356: Implement Collector ServiceMonitor Changes#3270

Open
Clee2691 wants to merge 1 commit intoopenshift:masterfrom
Clee2691:LOG-9356
Open

LOG-9356: Implement Collector ServiceMonitor Changes#3270
Clee2691 wants to merge 1 commit intoopenshift:masterfrom
Clee2691:LOG-9356

Conversation

@Clee2691
Copy link
Copy Markdown
Contributor

@Clee2691 Clee2691 commented May 5, 2026

Description

This PR makes use of metrics collection profiles.

The CMO supports metrics collection profiles that select ServiceMonitors by the monitoring.openshift.io/collection-profile label. This change adds that label to all ServiceMonitors created by CLO, enabling cluster admins to reduce metrics cardinality by switching to the minimal profile.

Because the minimal ServiceMonitor has a different resource name (e.g., minimal-collector) but targets the same backing service, a serviceName parameter was added to ensure TLS ServerName verification uses the correct service DNS name.

Minimal Metrics - Collector:

  • logcollector_component_event_unmatched_count
  • vector_buffer_byte_size
  • vector_buffer_discarded_events_total
  • vector_buffer_events
  • vector_buffer_sent_events_total
  • vector_component_discarded_events_total
  • vector_component_errors_total
  • vector_component_received_bytes_total
  • vector_component_received_event_bytes_total
  • vector_component_received_events_total
  • vector_component_sent_bytes_total
  • vector_events_in_total
  • vector_http_client_errors_total
  • vector_http_client_requests_sent_total
  • vector_http_client_responses_total
  • vector_open_files

Minimal Metrics - LFME:

  • log_logged_bytes_total

/cc @vparfonov
/assign @jcantrill

Links

Summary by CodeRabbit

  • New Features

    • Metrics scraping now supports distinct "full" and "minimal" collection profiles, creating separate monitoring endpoints and adding a profile label to ServiceMonitors.
    • Prometheus query helper added for test-time validation of scraped metrics.
  • Tests

    • Expanded unit and end-to-end tests to validate relabeling rules, profile-specific relabel configs, TLS server name, and that full/minimal profiles expose expected metrics.

@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 5, 2026

@Clee2691: This pull request references LOG-9356 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.8.0" version, but no target version was set.

Details

In response to this:

Description

  • Add MetricRelabelConfigs to the collector ServiceMonitor to keep only the metrics needed for alerts, dashboards, and telemetry
  • Drop transform component metrics except errors and discarded events to further reduce cardinality

Allowed Metrics:

  • logcollector_component_event_unmatched_count
  • vector_buffer_byte_size
  • vector_buffer_discarded_events_total
  • vector_buffer_events
  • vector_buffer_sent_events_total
  • vector_component_discarded_events_total
  • vector_component_errors_total
  • vector_component_received_bytes_total
  • vector_component_received_event_bytes_total
  • vector_component_received_events_total
  • vector_component_sent_bytes_total
  • vector_events_in_total
  • vector_http_client_errors_total
  • vector_http_client_requests_sent_total
  • vector_http_client_responses_total
  • vector_open_files

/cc @vparfonov
/assign @jcantrill

Links

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 5, 2026
@openshift-ci openshift-ci Bot requested a review from vparfonov May 5, 2026 21:16
@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Add metric filtering to collector ServiceMonitor

✨ Enhancement

Grey Divider

Walkthroughs

Description
• Add metric filtering to ServiceMonitor with keep/drop relabel configs
• Keep only essential metrics for alerts, dashboards, telemetry
• Drop transform component metrics except errors and discarded events
• Refactor relabel config structure for improved readability
Diagram
flowchart LR
  SM["ServiceMonitor<br/>MetricRelabelConfigs"]
  R1["Relabel 1:<br/>Replace - with _"]
  R2["Relabel 2:<br/>Keep allowed metrics"]
  R3["Relabel 3:<br/>Drop transform metrics"]
  SM --> R1
  SM --> R2
  SM --> R3
  R2 --> Filter["Filtered metrics<br/>for monitoring"]
  R3 --> Filter
Loading

Grey Divider

File Changes

1. internal/metrics/service_monitor.go ✨ Enhancement +14/-9

Add metric filtering relabel configurations

• Removed comment explaining label replacement logic
• Reformatted first relabel config for consistency
• Added second relabel config with keep action to filter metrics by name
• Added third relabel config with drop action to exclude transform component metrics

internal/metrics/service_monitor.go


2. internal/metrics/service_monitor_test.go 🧪 Tests +14/-0

Add MetricRelabelConfigs verification tests

• Added test section verifying MetricRelabelConfigs structure
• Validates three relabel configs are present
• Verifies first config replaces hyphens with underscores
• Verifies second config keeps only allowed metrics
• Verifies third config drops transform component metrics

internal/metrics/service_monitor_test.go


Grey Divider

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 5, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0)

Context used

Grey Divider


Action required

1. LFME metrics filtered out 🐞 Bug ≡ Correctness
Description
newServiceMonitor() now applies a metric "keep" allowlist to all ServiceMonitors, including
LogFileMetricsExporter’s ServiceMonitor, which will drop log_logged_bytes_total from Prometheus.
This breaks the shipped recording rule and dashboard panels that query log_logged_bytes_total.
Code

internal/metrics/service_monitor.go[R40-44]

+				{
+					Action:       "keep",
+					SourceLabels: []monitoringv1.LabelName{"__name__"},
+					Regex:        "logcollector_component_event_unmatched_count|vector_buffer_byte_size|vector_buffer_discarded_events_total|vector_buffer_events|vector_buffer_sent_events_total|vector_component_discarded_events_total|vector_component_errors_total|vector_component_received_bytes_total|vector_component_received_event_bytes_total|vector_component_received_events_total|vector_component_sent_bytes_total|vector_events_in_total|vector_http_client_errors_total|vector_http_client_requests_sent_total|vector_http_client_responses_total|vector_open_files",
+				},
Evidence
The new metric relabel "keep" allowlist in newServiceMonitor() only allows vector_* and one
logcollector_* metric name, so LogFileMetricsExporter metrics (e.g., log_logged_bytes_total) will be
filtered out. The LogFileMetricsExporter controller reuses metrics.ReconcileServiceMonitor(), so it
will receive the same filtering; meanwhile, repository-shipped PrometheusRule recording and
dashboard queries depend on log_logged_bytes_total being present in Prometheus.

internal/metrics/service_monitor.go[33-50]
internal/metrics/logfilemetricexporter/metric_exporter.go[78-82]
config/prometheus/collector_alerts.yaml[97-107]
internal/metrics/dashboard/openshift-logging-dashboard.json[585-593]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`metrics.newServiceMonitor()` now installs a metric relabel `keep` allowlist for **every** ServiceMonitor created via `metrics.ReconcileServiceMonitor()`. This unintentionally filters out LogFileMetricsExporter metrics (notably `log_logged_bytes_total`), breaking shipped recording rules and dashboard panels.

## Issue Context
- LogFileMetricsExporter reconciles its ServiceMonitor through the same `metrics.ReconcileServiceMonitor()` path used by the collector.
- The new keep-regex only allows a vector/logcollector metric subset, excluding `log_logged_bytes_total`.

## Fix Focus Areas
- internal/metrics/service_monitor.go[33-50]
- internal/metrics/logfilemetricexporter/metric_exporter.go[78-82]
- config/prometheus/collector_alerts.yaml[97-107]

### Expected fix direction
- Make ServiceMonitor relabeling **component-specific**:
 - Either split into `newCollectorServiceMonitor()` (with allowlist/drop rules) and a generic `newServiceMonitor()` (without restrictive keep), or
 - Add a parameter to `newServiceMonitor(..., metricRelabelConfigs []*monitoringv1.RelabelConfig)` and pass the collector-only configs from the collector reconciler.
- Ensure LFME-required metrics (at least `log_logged_bytes_total`) are not filtered out of Prometheus.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

Qodo Logo

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 5, 2026

Warning

Rate limit exceeded

@Clee2691 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 16 minutes and 30 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3cc41c1c-1c00-4211-9ce1-58112f007950

📥 Commits

Reviewing files that changed from the base of the PR and between 99d21e3 and d94ab5f.

📒 Files selected for processing (16)
  • internal/constants/constants.go
  • internal/constants/labels.go
  • internal/controller/observability/collector.go
  • internal/controller/observability/collector_test.go
  • internal/metrics/logfilemetricexporter/metric_exporter.go
  • internal/metrics/logfilemetricexporter/metric_exporter_test.go
  • internal/metrics/relabel.go
  • internal/metrics/relabel_test.go
  • internal/metrics/service_monitor.go
  • internal/metrics/service_monitor_test.go
  • test/e2e/collection/metrics/profile_test.go
  • test/e2e/collection/metrics/suite_test.go
  • test/e2e/flowcontrol/utils.go
  • test/e2e/logfilesmetricexporter/lfme_test.go
  • test/e2e/operator/metrics/e2e_test.go
  • test/helpers/prometheus/prometheus.go

Walkthrough

Reconciler now creates two ServiceMonitors (full and minimal) per component, with configurable Prometheus relabel configs and a metrics-collection-profile label. New relabel builder produces precomputed config sets; ServiceMonitor construction/signature and tests updated to accept and validate metricRelabelConfigs, TLS ServerName, and profile labeling.

Changes

Metrics relabel & ServiceMonitor flow

Layer / File(s) Summary
Constants / Labels
internal/constants/..., internal/constants/labels.go
Introduce/relocate metrics-related constants and add LabelMetricsCollectionProfile constant.
Relabel Data & Builders
internal/metrics/relabel.go
Add buildRelabelConfigs with allowlist/drop helpers; export CollectorMinimalRelabelConfigs, LFMEMinimalRelabelConfigs, and FullRelabelConfigs.
Relabel Tests
internal/metrics/relabel_test.go
New tests validating ordering and contents of generated relabel configs across scenarios; asserts prebuilt sets' lengths/actions/regexes.
ServiceMonitor API
internal/metrics/service_monitor.go
newServiceMonitor and ReconcileServiceMonitor signatures updated to accept serviceName, metricRelabelConfigs, and profile; endpoint TLSConfig.ServerName set from serviceName; endpoint MetricRelabelConfigs set from input; profile label applied.
ServiceMonitor Tests
internal/metrics/service_monitor_test.go
Expanded tests for full/minimal profiles asserting exact MetricRelabelConfigs counts and specific Regex/TargetLabel/Action/SourceLabels ordering and TLS server name.
Controller Wiring
internal/controller/observability/collector.go, internal/metrics/logfilemetricexporter/metric_exporter.go
Replace single ServiceMonitor reconciliation with two reconciliations (full and minimal) per component, passing appropriate relabel config sets and profile labels; adjust names/logging.
Controller & Exporter Tests
internal/controller/observability/collector_test.go, internal/metrics/logfilemetricexporter/metric_exporter_test.go
Update tests to expect both full and minimal ServiceMonitors and validate endpoint presence and relabel config counts/actions.
E2E Prometheus Helpers
test/helpers/prometheus/prometheus.go
Add Prometheus/Thanos query helper: token/host discovery via oc, HTTPS query with bearer token, JSON parsing, and HasResults helper.
E2E Tests Updated/Added
test/e2e/...
Add e2e suite and tests asserting ServiceMonitor profile labels and that both allowlisted and non-allowlisted metrics are scraped; adapt utilities to use Prometheus helper and add Prometheus-based assertions in existing e2e tests.
Test Utils
test/e2e/flowcontrol/utils.go
Simplify/remove prior direct Prometheus HTTP logic; delegate queries to test/helpers/prometheus.

Sequence Diagram

sequenceDiagram
    participant Controller as Controller
    participant Relabel as Relabel Builder
    participant K8s as Kubernetes API
    participant PromOp as Prometheus Operator

    Controller->>Relabel: Request relabel configs (full/minimal)
    Relabel-->>Controller: Return relabel config sets
    Controller->>K8s: Create/Update ServiceMonitor (endpoints, TLS ServerName, metricRelabelConfigs, profile label)
    K8s->>PromOp: Prometheus Operator observes ServiceMonitor
    PromOp->>K8s: Begin scraping targets per ServiceMonitor spec
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 8 | ❌ 4

❌ Failed checks (4 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning Test quality issues: (1) profile_test.go violates single responsibility - tests two unrelated metrics in one test block; (2) prometheus.go hard Fail() breaks Eventually polling with transient errors Separate profile_test.go metrics test into individual tests. Fix prometheus.go error handling per review: replace Fail() with graceful logging and empty-map returns
Microshift Test Compatibility ⚠️ Warning New e2e tests reference openshift-monitoring namespace, prometheus-k8s, thanos-querier, and ServiceMonitor API unavailable on MicroShift. No skip mechanisms present. Add [Skipped:MicroShift] labels to tests in profile_test.go, e2e_test.go, and lfme_test.go; or guard with exutil.IsMicroShiftCluster() + Skip(). Update prometheus helper for graceful error handling.
Ipv6 And Disconnected Network Test Compatibility ⚠️ Warning IPv6 issue in test/helpers/prometheus/prometheus.go line 49: fmt.Sprintf("https://%s/...", host) fails for IPv6 addresses. Also hard-fails on errors instead of graceful retry. Use net.JoinHostPort() for IPv6-safe URLs. Replace Fail() calls with graceful error logging and empty-map returns. Add HTTP timeout.
✅ Passed checks (8 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: implementing Collector ServiceMonitor changes related to metrics collection profiles support.
Description check ✅ Passed The description includes mandatory sections (Description with context and rationale), reviewer assignment (/cc), approver assignment (/assign), and links including JIRA reference. It adequately explains the implementation of metrics collection profiles and lists allowed minimal metrics.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All 31 test declarations use static string literals with no dynamic content, pod name suffixes, timestamps, UUIDs, IPs, or random identifiers. Test names are descriptive and stable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed New e2e tests are SNO-compatible. DaemonSets work on SNO (1 node). Tests use standard OpenShift APIs. No multi-node assumptions detected.
Topology-Aware Scheduling Compatibility ✅ Passed Changes only modify ServiceMonitors and metrics configs. No pod affinity, topology spread constraints, nodeSelector, or replica adjustments based on topology introduced.
Ote Binary Stdout Contract ✅ Passed All Fail() calls are within test contexts (Eventually/It blocks). No module-level stdout writes. Suite setup uses standard Ginkgo patterns correctly.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread internal/metrics/service_monitor.go Outdated
Comment on lines +40 to +44
{
Action: "keep",
SourceLabels: []monitoringv1.LabelName{"__name__"},
Regex: "logcollector_component_event_unmatched_count|vector_buffer_byte_size|vector_buffer_discarded_events_total|vector_buffer_events|vector_buffer_sent_events_total|vector_component_discarded_events_total|vector_component_errors_total|vector_component_received_bytes_total|vector_component_received_event_bytes_total|vector_component_received_events_total|vector_component_sent_bytes_total|vector_events_in_total|vector_http_client_errors_total|vector_http_client_requests_sent_total|vector_http_client_responses_total|vector_open_files",
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Lfme metrics filtered out 🐞 Bug ≡ Correctness

newServiceMonitor() now applies a metric "keep" allowlist to all ServiceMonitors, including
LogFileMetricsExporter’s ServiceMonitor, which will drop log_logged_bytes_total from Prometheus.
This breaks the shipped recording rule and dashboard panels that query log_logged_bytes_total.
Agent Prompt
## Issue description
`metrics.newServiceMonitor()` now installs a metric relabel `keep` allowlist for **every** ServiceMonitor created via `metrics.ReconcileServiceMonitor()`. This unintentionally filters out LogFileMetricsExporter metrics (notably `log_logged_bytes_total`), breaking shipped recording rules and dashboard panels.

## Issue Context
- LogFileMetricsExporter reconciles its ServiceMonitor through the same `metrics.ReconcileServiceMonitor()` path used by the collector.
- The new keep-regex only allows a vector/logcollector metric subset, excluding `log_logged_bytes_total`.

## Fix Focus Areas
- internal/metrics/service_monitor.go[33-50]
- internal/metrics/logfilemetricexporter/metric_exporter.go[78-82]
- config/prometheus/collector_alerts.yaml[97-107]

### Expected fix direction
- Make ServiceMonitor relabeling **component-specific**:
  - Either split into `newCollectorServiceMonitor()` (with allowlist/drop rules) and a generic `newServiceMonitor()` (without restrictive keep), or
  - Add a parameter to `newServiceMonitor(..., metricRelabelConfigs []*monitoringv1.RelabelConfig)` and pass the collector-only configs from the collector reconciler.
- Ensure LFME-required metrics (at least `log_logged_bytes_total`) are not filtered out of Prometheus.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/metrics/service_monitor.go (1)

43-43: 💤 Low value

Consider extracting the keep-list regex to a named constant.

The allowlist is the core deliverable of this PR and a maintenance hotspot (new metrics are added, removed, or renamed). Inlining it as a 500+ character string makes diffs noisy and typos hard to spot.

♻️ Suggested refactor
+const collectorMetricsKeepRegex = "logcollector_component_event_unmatched_count|" +
+    "vector_buffer_byte_size|vector_buffer_discarded_events_total|" +
+    "vector_buffer_events|vector_buffer_sent_events_total|" +
+    "vector_component_discarded_events_total|vector_component_errors_total|" +
+    "vector_component_received_bytes_total|vector_component_received_event_bytes_total|" +
+    "vector_component_received_events_total|vector_component_sent_bytes_total|" +
+    "vector_events_in_total|vector_http_client_errors_total|" +
+    "vector_http_client_requests_sent_total|vector_http_client_responses_total|" +
+    "vector_open_files"

 // in newServiceMonitor:
-                Regex: "logcollector_component_event_unmatched_count|vector_buffer_byte_size|...",
+                Regex: collectorMetricsKeepRegex,

The constant can then be referenced in the test as well, removing duplication.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metrics/service_monitor.go` at line 43, Extract the long inline
allowlist string used in the Regex field into a clearly named package-level
constant (e.g., allowedMetricsRegex or keepListRegex) in
internal/metrics/service_monitor.go, replace the inlined string in the struct
literal's Regex field with that constant, and update any tests to
import/reference the same constant instead of duplicating the string so the
allowlist is a single source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/metrics/service_monitor_test.go`:
- Around line 83-84: The test currently omits asserting the allowlist regex so
changes silently pass; update the test in service_monitor_test.go to assert
relabelConfigs[1].Regex equals the expected allowlist regex (or, better, compare
it to the constant defined in service_monitor.go if you extract the regex
there), i.e., add an
Expect(relabelConfigs[1].Regex).To(Equal(<expectedRegexOrConstant>)) alongside
the existing Action and SourceLabels assertions.

---

Nitpick comments:
In `@internal/metrics/service_monitor.go`:
- Line 43: Extract the long inline allowlist string used in the Regex field into
a clearly named package-level constant (e.g., allowedMetricsRegex or
keepListRegex) in internal/metrics/service_monitor.go, replace the inlined
string in the struct literal's Regex field with that constant, and update any
tests to import/reference the same constant instead of duplicating the string so
the allowlist is a single source of truth.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c0367dba-620a-4f1b-abb3-d1fbd24ba9de

📥 Commits

Reviewing files that changed from the base of the PR and between ff06c7a and 962597a.

📒 Files selected for processing (2)
  • internal/metrics/service_monitor.go
  • internal/metrics/service_monitor_test.go

Comment on lines +83 to +84
Expect(string(relabelConfigs[1].Action)).To(Equal("keep"))
Expect(relabelConfigs[1].SourceLabels).To(Equal([]monitoringv1.LabelName{"__name__"}))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

The keep-list regex is unasserted — the PR's core change has no test validation.

relabelConfigs[1].Regex is never checked. A typo, accidental truncation, or future metric name change in the allowlist would silently pass the test suite.

✅ Proposed fix
 Expect(string(relabelConfigs[1].Action)).To(Equal("keep"))
 Expect(relabelConfigs[1].SourceLabels).To(Equal([]monitoringv1.LabelName{"__name__"}))
+Expect(relabelConfigs[1].Regex).To(Equal(
+    "logcollector_component_event_unmatched_count|vector_buffer_byte_size|" +
+    "vector_buffer_discarded_events_total|vector_buffer_events|" +
+    "vector_buffer_sent_events_total|vector_component_discarded_events_total|" +
+    "vector_component_errors_total|vector_component_received_bytes_total|" +
+    "vector_component_received_event_bytes_total|vector_component_received_events_total|" +
+    "vector_component_sent_bytes_total|vector_events_in_total|" +
+    "vector_http_client_errors_total|vector_http_client_requests_sent_total|" +
+    "vector_http_client_responses_total|vector_open_files",
+))

(Even better: extract the regex to a constant in service_monitor.go and reference it here directly.)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Expect(string(relabelConfigs[1].Action)).To(Equal("keep"))
Expect(relabelConfigs[1].SourceLabels).To(Equal([]monitoringv1.LabelName{"__name__"}))
Expect(string(relabelConfigs[1].Action)).To(Equal("keep"))
Expect(relabelConfigs[1].SourceLabels).To(Equal([]monitoringv1.LabelName{"__name__"}))
Expect(relabelConfigs[1].Regex).To(Equal(
"logcollector_component_event_unmatched_count|vector_buffer_byte_size|" +
"vector_buffer_discarded_events_total|vector_buffer_events|" +
"vector_buffer_sent_events_total|vector_component_discarded_events_total|" +
"vector_component_errors_total|vector_component_received_bytes_total|" +
"vector_component_received_event_bytes_total|vector_component_received_events_total|" +
"vector_component_sent_bytes_total|vector_events_in_total|" +
"vector_http_client_errors_total|vector_http_client_requests_sent_total|" +
"vector_http_client_responses_total|vector_open_files",
))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metrics/service_monitor_test.go` around lines 83 - 84, The test
currently omits asserting the allowlist regex so changes silently pass; update
the test in service_monitor_test.go to assert relabelConfigs[1].Regex equals the
expected allowlist regex (or, better, compare it to the constant defined in
service_monitor.go if you extract the regex there), i.e., add an
Expect(relabelConfigs[1].Regex).To(Equal(<expectedRegexOrConstant>)) alongside
the existing Action and SourceLabels assertions.

@Clee2691
Copy link
Copy Markdown
Contributor Author

Clee2691 commented May 6, 2026

/retest

@jcantrill
Copy link
Copy Markdown
Contributor

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2026
Comment thread internal/metrics/service_monitor.go Outdated
Comment thread internal/metrics/service_monitor.go Outdated
@simonpasquier
Copy link
Copy Markdown

simonpasquier commented May 6, 2026

Have you considered implementing collection profiles?

The general idea is that the admin defines the level of details for metrics at the cluster level. We have 3 profiles:

  • full (default) => collects all metrics available
  • minimal => collects only metrics relevant for alerting rules, dashboards and telemetry
  • telemetry => collects only metrics required for telemetry

We can help with the implementation (in summary, you need to define 1 additional service monitor per profile). Happy to discuss directly on your forum.

cc @jan--f and @r2d2rnd (who informed us about the work in progress, thanks!)

@Clee2691 Clee2691 changed the title LOG-9356: Implement Collector ServiceMontitor Changes LOG-9356: Implement Collector ServiceMonitor Changes May 6, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
internal/metrics/relabel.go (2)

13-17: 💤 Low value

excludeMetrics is semantically ambiguous — consider droppedMetrics.

The field name excludeMetrics can be read as "metrics excluded from the drop action" (i.e., kept), but the implementation puts them inside the drop rule regex, meaning they are the metrics that get dropped for the given label combination. droppedMetrics or metricsToDropForLabel would remove this ambiguity for future maintainers.

♻️ Proposed rename
 type metricDropConfig struct {
 	labelName      string
 	labelValue     string
-	excludeMetrics []string
+	droppedMetrics []string
 }
 var collectorMinimalDropConfigs = []metricDropConfig{
 	{
 		labelName:  "component_kind",
 		labelValue: "transform",
-		excludeMetrics: []string{
+		droppedMetrics: []string{

And in buildRelabelConfigs:

-		Regex: drop.labelValue + ";(" + strings.Join(drop.excludeMetrics, "|") + ")",
+		Regex: drop.labelValue + ";(" + strings.Join(drop.droppedMetrics, "|") + ")",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metrics/relabel.go` around lines 13 - 17, Rename the ambiguous
struct field excludeMetrics in metricDropConfig to a clearer name like
droppedMetrics (or metricsToDropForLabel) and update all references accordingly;
specifically change the field declaration in metricDropConfig and adjust any
code that reads or writes metricDropConfig (e.g., the buildRelabelConfigs
function and any callers or JSON/YAML tags) so they use the new symbol name to
reflect that these are the metrics that will be dropped for the given label
combination.

89-95: 💤 Low value

Add a length guard for excludeMetrics to match the allowlist pattern.

The allowlist block guards len(allowlist.allowedMetrics) > 0 before appending a keep rule, but the drop loop has no equivalent guard. An empty excludeMetrics slice produces the regex labelValue;() with action: drop. While Prometheus's anchored full-string matching makes this effectively a no-op for real metrics (would only match a sample with an empty __name__), it emits a spurious relabel rule and is inconsistent with the allowlist guard style.

♻️ Proposed fix
 	for _, drop := range dropConfigs {
+		if len(drop.excludeMetrics) == 0 {
+			continue
+		}
 		configs = append(configs, &monitoringv1.RelabelConfig{
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metrics/relabel.go` around lines 89 - 95, The drop loop appends a
"drop" relabel rule even when drop.excludeMetrics is empty, producing a spurious
regex like "labelValue;()"; add the same length guard used for the allowlist:
check len(drop.excludeMetrics) > 0 before creating/appending the
monitoringv1.RelabelConfig in the loop (the loop iterating over dropConfigs that
currently builds SourceLabels/Regex), so only append the drop rule when
excludeMetrics is non-empty.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/metrics/relabel.go`:
- Around line 13-17: Rename the ambiguous struct field excludeMetrics in
metricDropConfig to a clearer name like droppedMetrics (or
metricsToDropForLabel) and update all references accordingly; specifically
change the field declaration in metricDropConfig and adjust any code that reads
or writes metricDropConfig (e.g., the buildRelabelConfigs function and any
callers or JSON/YAML tags) so they use the new symbol name to reflect that these
are the metrics that will be dropped for the given label combination.
- Around line 89-95: The drop loop appends a "drop" relabel rule even when
drop.excludeMetrics is empty, producing a spurious regex like "labelValue;()";
add the same length guard used for the allowlist: check len(drop.excludeMetrics)
> 0 before creating/appending the monitoringv1.RelabelConfig in the loop (the
loop iterating over dropConfigs that currently builds SourceLabels/Regex), so
only append the drop rule when excludeMetrics is non-empty.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: da0fa976-0000-4188-84d0-d85035bde888

📥 Commits

Reviewing files that changed from the base of the PR and between 962597a and ef5f21d.

📒 Files selected for processing (10)
  • internal/constants/constants.go
  • internal/constants/labels.go
  • internal/controller/observability/collector.go
  • internal/controller/observability/collector_test.go
  • internal/metrics/logfilemetricexporter/metric_exporter.go
  • internal/metrics/logfilemetricexporter/metric_exporter_test.go
  • internal/metrics/relabel.go
  • internal/metrics/relabel_test.go
  • internal/metrics/service_monitor.go
  • internal/metrics/service_monitor_test.go
✅ Files skipped from review due to trivial changes (2)
  • internal/constants/labels.go
  • internal/constants/constants.go

@Clee2691
Copy link
Copy Markdown
Contributor Author

Clee2691 commented May 7, 2026

Have you considered implementing collection profiles?

The general idea is that the admin defines the level of details for metrics at the cluster level. We have 3 profiles:

* full (default) => collects all metrics available

* minimal => collects only metrics relevant for alerting rules, dashboards and telemetry

* telemetry => collects only metrics required for telemetry

We can help with the implementation (in summary, you need to define 1 additional service monitor per profile). Happy to discuss directly on your forum.

cc @jan--f and @r2d2rnd (who informed us about the work in progress, thanks!)

@simonpasquier I've made changes to this to implement profiles. Can someone on the monitoring team also review?

Copy link
Copy Markdown
Contributor

@jcantrill jcantrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Comment thread internal/metrics/relabel.go
Comment thread internal/metrics/relabel.go
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 7, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Clee2691, jcantrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 7, 2026

@Clee2691: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/helpers/prometheus/prometheus.go`:
- Around line 42-74: In QueryPrometheus replace the no-timeout http.Client with
one that has a 30s Timeout, stop using Fail(...) on request/response errors and
instead log the error and return an empty map[string]interface{} (so callers can
retry), validate response.StatusCode (expect 200) before attempting to
read/unmarshal the body and handle non-200 by logging and returning empty map,
and ensure response.Body is always closed; update references in this function
(QueryPrometheus, the http.Client/Request handling, and the json.Unmarshal
logic) accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9624b7e9-9df4-450c-918c-851df0a88fa1

📥 Commits

Reviewing files that changed from the base of the PR and between ef5f21d and 99d21e3.

📒 Files selected for processing (16)
  • internal/constants/constants.go
  • internal/constants/labels.go
  • internal/controller/observability/collector.go
  • internal/controller/observability/collector_test.go
  • internal/metrics/logfilemetricexporter/metric_exporter.go
  • internal/metrics/logfilemetricexporter/metric_exporter_test.go
  • internal/metrics/relabel.go
  • internal/metrics/relabel_test.go
  • internal/metrics/service_monitor.go
  • internal/metrics/service_monitor_test.go
  • test/e2e/collection/metrics/profile_test.go
  • test/e2e/collection/metrics/suite_test.go
  • test/e2e/flowcontrol/utils.go
  • test/e2e/logfilesmetricexporter/lfme_test.go
  • test/e2e/operator/metrics/e2e_test.go
  • test/helpers/prometheus/prometheus.go
✅ Files skipped from review due to trivial changes (3)
  • internal/constants/labels.go
  • internal/metrics/logfilemetricexporter/metric_exporter_test.go
  • internal/constants/constants.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • internal/controller/observability/collector.go
  • internal/metrics/logfilemetricexporter/metric_exporter.go
  • internal/controller/observability/collector_test.go
  • internal/metrics/service_monitor.go

Comment thread test/helpers/prometheus/prometheus.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. release/6.6

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants