
OCPBUGS-78775: Add namespace constraints to scheduler and controller-manager PromQL queries#16296

Open
Leo6Leo wants to merge 1 commit into openshift:release-4.18 from Leo6Leo:fix/scheduler-controller-manager-false-positive-release4.18

Conversation


@Leo6Leo Leo6Leo commented Apr 16, 2026

Summary

  • Add namespace="openshift-kube-scheduler" constraint to the SCHEDULERS_UP PromQL query
  • Add namespace="openshift-kube-controller-manager" constraint to the CONTROLLER_MANAGERS_UP PromQL query
  • Prevent false-positive control plane degradation when user workloads create Prometheus scrape targets with job="scheduler" or job="kube-controller-manager"

Root Cause

The SCHEDULERS_UP and CONTROLLER_MANAGERS_UP queries lacked namespace selectors, matching any Prometheus target with the corresponding job label across all namespaces. When a user creates a Service named scheduler with a ServiceMonitor, Prometheus assigns job="scheduler" to that scrape target. Since the user workload doesn't expose valid Prometheus metrics, the up metric returns 0, diluting the response rate calculation (e.g., 3/4 * 100 = 75%) and triggering a false "Degraded" status in Overview > Control Plane.
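The dilution described above is simple arithmetic; a minimal sketch (TypeScript, matching the console codebase, with made-up sample values) shows how a single failing fake target lowers the reported rate:

```typescript
// Illustrative only: the sample values mimic up{job="scheduler"} across
// all namespaces. Three real scheduler pods report up=1; a user workload
// scraped as job="scheduler" exposes no valid metrics and reports up=0.
const responseRate = (samples: number[]): number =>
  (samples.filter((v) => v === 1).length / samples.length) * 100;

console.log(responseRate([1, 1, 1, 0])); // 75 -> shown as "Degraded"
console.log(responseRate([1, 1, 1]));    // 100 -> fake target filtered out
```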

This fix aligns both queries with the existing API_SERVERS_UP query, which already constrains to namespace="openshift-kube-apiserver".
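A sketch of the resulting constants in frontend/packages/console-app/src/queries.ts; the surrounding query shape is an assumption based on the PR description, and the point is only the added namespace matchers:

```typescript
// Sketch only: the exact query text in queries.ts may differ; the PR's
// change is the namespace matcher added to each selector, mirroring the
// existing API_SERVERS_UP query.

// Before: matches up{job="scheduler"} in every namespace
const SCHEDULERS_UP_BEFORE =
  '(sum(up{job="scheduler"} == 1) / count(up{job="scheduler"})) * 100';

// After: scoped to the control plane namespaces
const SCHEDULERS_UP =
  '(sum(up{job="scheduler",namespace="openshift-kube-scheduler"} == 1) / ' +
  'count(up{job="scheduler",namespace="openshift-kube-scheduler"})) * 100';

const CONTROLLER_MANAGERS_UP =
  '(sum(up{job="kube-controller-manager",namespace="openshift-kube-controller-manager"} == 1) / ' +
  'count(up{job="kube-controller-manager",namespace="openshift-kube-controller-manager"})) * 100';
```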

Reproduction Steps

  1. Enable user workload monitoring
  2. Create a namespace with a Service named scheduler and a ServiceMonitor targeting it
  3. Observe Overview > Status > Control Plane shows "Degraded" / Schedulers at 75%
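The reproduction can be sketched with manifests like the following; the namespace, labels, and port are illustrative, while the Service name scheduler matters because the Prometheus Operator derives the job label from the Service name by default:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: scheduler          # becomes job="scheduler" on the scrape target
  namespace: my-user-ns    # illustrative user namespace
  labels:
    app: fake-scheduler
spec:
  selector:
    app: fake-scheduler
  ports:
    - name: metrics
      port: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fake-scheduler
  namespace: my-user-ns
spec:
  selector:
    matchLabels:
      app: fake-scheduler
  endpoints:
    - port: metrics
```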

Verification

Tested on a live 4.18 cluster with the fake workload present:

  • Old query (no namespace filter): 75% — false degradation
  • Fixed query (with namespace filter): 100% — correct

Test plan

  • Deploy the fix to a 4.18 cluster with user workload monitoring enabled
  • Create a fake scheduler workload (Service + ServiceMonitor with job="scheduler") in a user namespace
  • Verify Overview > Control Plane > Schedulers remains at 100% and shows no degradation
  • Verify that with no fake workload, all control plane components still report correctly
  • Verify the Control Plane popup shows correct response rates for all 4 components

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved accuracy of control plane component uptime metrics by refining the scope of metric queries to specific namespaces for controller managers and schedulers. This ensures more precise monitoring and visibility of component health.
[Screenshot: 2026-04-16 at 5:20:30 PM]

…omQL queries

The SCHEDULERS_UP and CONTROLLER_MANAGERS_UP queries lacked namespace
selectors, causing false positive control plane degradation when user
workloads create Prometheus targets with job="scheduler" or
job="kube-controller-manager". Scope both queries to their respective
openshift-kube-* namespaces, consistent with the existing API_SERVERS_UP
query which already constrains to namespace="openshift-kube-apiserver".

coderabbitai Bot commented Apr 16, 2026

Walkthrough

The Prometheus metric queries in the console app for monitoring control plane component health were updated to scope the up metric to specific namespaces, restricting measurement to openshift-kube-controller-manager for controller managers and openshift-kube-scheduler for schedulers.

Changes

  • Prometheus Metric Query Updates (frontend/packages/console-app/src/queries.ts): Updated the CONTROLLER_MANAGERS_UP and SCHEDULERS_UP constants to include namespace-scoped filtering in Prometheus queries for more precise control plane component health monitoring.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks: ✅ 10 passed

  • Docstring Coverage: Passed. No functions found in the changed files to evaluate docstring coverage; skipping the check.
  • Stable And Deterministic Test Names: Passed. The custom check for Ginkgo test names is not applicable; the PR only modifies Prometheus query strings in TypeScript and contains no Ginkgo test code.
  • Test Structure And Quality: Passed. The custom check for Ginkgo test code patterns is not applicable; the PR modifies only TypeScript constants and enum definitions with no test code.
  • Microshift Test Compatibility: Passed. The PR modifies a TypeScript file with Prometheus query constants, not Ginkgo e2e tests, so the MicroShift compatibility check is not applicable.
  • Single Node Openshift (SNO) Test Compatibility: Passed. The PR does not add any Ginkgo e2e tests; it only modifies Prometheus query constants in frontend code, so the custom check is not applicable.
  • Topology-Aware Scheduling Compatibility: Passed. The PR modifies only PromQL query strings for monitoring in the console UI; no scheduling constraints, deployment manifests, or topology assumptions are introduced.
  • Ote Binary Stdout Contract: Passed. Not applicable; the PR modifies only TypeScript string constants in a frontend console file, with no process-level code, logging, stdout/stderr operations, test suite setup, or binary communication logic.
  • Ipv6 And Disconnected Network Test Compatibility: Passed. Not applicable; the PR adds no Ginkgo e2e tests and only modifies Prometheus query constants in the console frontend.
  • Title Check: Passed. The title clearly and specifically describes the main change (adding namespace constraints to scheduler and controller-manager PromQL queries), matching the modifications to SCHEDULERS_UP and CONTROLLER_MANAGERS_UP.
  • Description Check: Passed. Check skipped; CodeRabbit's high-level summary is enabled.

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from jhadvig and rhamilto April 16, 2026 20:03

openshift-ci Bot commented Apr 16, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Leo6Leo
Once this PR has been reviewed and has the lgtm label, please assign spadgett for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the component/core Related to console core functionality label Apr 16, 2026
@Leo6Leo Leo6Leo changed the title Bug: Add namespace constraints to scheduler and controller-manager PromQL queries OCPBUGS-77017: Add namespace constraints to scheduler and controller-manager PromQL queries Apr 16, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 16, 2026
@openshift-ci-robot

@Leo6Leo: This pull request references Jira Issue OCPBUGS-77017, which is invalid:

  • expected the bug to target the "4.18.z" version, but no target version was set
  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected Jira Issue OCPBUGS-77017 to depend on a bug targeting a version in 4.19.0, 4.19.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


Leo6Leo commented Apr 16, 2026

@coderabbitai review


coderabbitai Bot commented Apr 16, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@Leo6Leo Leo6Leo changed the title OCPBUGS-77017: Add namespace constraints to scheduler and controller-manager PromQL queries OCPBUGS-78775: Add namespace constraints to scheduler and controller-manager PromQL queries Apr 16, 2026
@openshift-ci-robot

@Leo6Leo: This pull request references Jira Issue OCPBUGS-78775, which is invalid:

  • expected the bug to target the "4.18.z" version, but no target version was set
  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected Jira Issue OCPBUGS-78775 to depend on a bug targeting a version in 4.19.0, 4.19.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.



Leo6Leo commented Apr 17, 2026

/retest-required

2 similar comments

Leo6Leo commented Apr 20, 2026

/retest-required


Leo6Leo commented Apr 20, 2026

/retest-required


openshift-ci Bot commented Apr 20, 2026

@Leo6Leo: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/e2e-gcp-console (commit f85e68f, required: true): rerun with /test e2e-gcp-console



Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
