Skip to content

OBSINTA-1219: add e2e tests for UIPlugin incident detection#1038

Merged
openshift-merge-bot[bot] merged 1 commit into
rhobs:mainfrom
DavidRajnoha:test/e2e-uiplugin-cluster-health-analyzer
Apr 24, 2026
Merged

OBSINTA-1219: add e2e tests for UIPlugin incident detection#1038
openshift-merge-bot[bot] merged 1 commit into
rhobs:mainfrom
DavidRajnoha:test/e2e-uiplugin-cluster-health-analyzer

Conversation

@DavidRajnoha
Copy link
Copy Markdown
Collaborator

Add end-to-end tests that validate the monitoring UIPlugin with cluster-health-analyzer: a deployment readiness check and a functional test that triggers a CrashLoopBackOff alert and verifies the cluster_health_components_map incident metric is produced.

Also introduce AssertPromQLResultWithOptions to allow callers to override the default poll interval and timeout, and generalize waitForDBUIPluginDeletion to waitForUIPluginDeletion.

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Mar 16, 2026

Hi @DavidRajnoha. Thanks for your PR.

I'm waiting for a rhobs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@DavidRajnoha DavidRajnoha force-pushed the test/e2e-uiplugin-cluster-health-analyzer branch 2 times, most recently from 89ac678 to 0768ae2 Compare March 16, 2026 14:36
@tremes
Copy link
Copy Markdown
Contributor

tremes commented Mar 17, 2026

/ok-to-test

@DavidRajnoha
Copy link
Copy Markdown
Collaborator Author

/retest

@DavidRajnoha DavidRajnoha changed the title test: add e2e tests for UIPlugin incident detection OBSINTA-1219: add e2e tests for UIPlugin incident detection Mar 18, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Collaborator

@DavidRajnoha: This pull request references OBSINTA-1219 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add end-to-end tests that validate the monitoring UIPlugin with cluster-health-analyzer: a deployment readiness check and a functional test that triggers a CrashLoopBackOff alert and verifies the cluster_health_components_map incident metric is produced.

Also introduce AssertPromQLResultWithOptions to allow callers to override the default poll interval and timeout, and generalize waitForDBUIPluginDeletion to waitForUIPluginDeletion.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@DavidRajnoha DavidRajnoha force-pushed the test/e2e-uiplugin-cluster-health-analyzer branch 3 times, most recently from 970ac8c to 90b891c Compare March 18, 2026 11:25
Comment thread test/e2e/uiplugin_cluster_health_analyzer_test.go Outdated
Comment thread test/e2e/uiplugin_cluster_health_analyzer_test.go Outdated
@DavidRajnoha
Copy link
Copy Markdown
Collaborator Author

/hold

@DavidRajnoha
Copy link
Copy Markdown
Collaborator Author

/test observability-operator-e2e

@DavidRajnoha
Copy link
Copy Markdown
Collaborator Author

/retest

@DavidRajnoha DavidRajnoha force-pushed the test/e2e-uiplugin-cluster-health-analyzer branch from 90b891c to 23c1e1d Compare March 25, 2026 09:58
@DavidRajnoha
Copy link
Copy Markdown
Collaborator Author

/unhold

@tremes
Copy link
Copy Markdown
Contributor

tremes commented Mar 25, 2026

Thank you!
/approve
/lgtm

Copy link
Copy Markdown
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked the test output and it adds 3m30s to the total duration which is not negligible. Do we really need to deploy a crashlooping pod? Can't we write a "dummy" alerting rule with a static expression (vector(1)) and hardcoded labels?

Comment thread test/e2e/uiplugin_cluster_health_analyzer_test.go
Expr: intstr.FromString(fmt.Sprintf(
`max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace="%s", pod=~"%s.*", job="kube-state-metrics"}[5m]) >= 1`,
e2eTestNamespace, podPrefix)),
For: ptr.To(monv1.Duration("1m")),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing the for clause would speed up the test.

return *p
}

func skipIfClusterVersionBelow(t *testing.T, minVersion string) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) could it be a framework function?

t.Log("=== END DEBUG DUMP ===")
}

func ptrInt32(p *int32) int32 {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) ptr.To() already exists

Comment thread test/e2e/uiplugin_cluster_health_analyzer_test.go
assert.NilError(t, err, "alert %s never fired", alertName)

t.Log("Waiting for cluster-health-analyzer to expose incident metric...")
incidentQuery := fmt.Sprintf(`cluster_health_components_map{src_alertname="%s"}`, alertName)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apart from checking the metric generated by the cluster health analyzer, is there any other outcome that we should verify?

return fmt.Errorf("expected incident metric, got: %v", v)
}
for _, sample := range vec {
if string(sample.Metric["src_alertname"]) != alertName {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already checked by the PromQL expression

return fmt.Errorf("expected src_alertname=%s, got %s", alertName, sample.Metric["src_alertname"])
}
if string(sample.Metric["src_severity"]) != "warning" {
return fmt.Errorf("expected src_severity=warning, got %s", sample.Metric["src_severity"])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could also be checked in the promql expression

@openshift-ci openshift-ci Bot removed the lgtm label Mar 31, 2026
@DavidRajnoha
Copy link
Copy Markdown
Collaborator Author

/hold

@DavidRajnoha
Copy link
Copy Markdown
Collaborator Author

@simonpasquier I updated the test to use a static rule. For reference, here is the current timing breakdown of the test:

Step Duration Elapsed
UIPlugin creation 824ms 0.8s
Deployment ready 5.4s 6.2s
PrometheusRule creation 239ms 6.4s
Alert firing (PromQL poll) 32.2s 38.6s
Incident metric (PromQL poll) 32.0s 1m 10.6s

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/framework/framework.go`:
- Around line 297-302: The DumpNamespaceDebug method uses t.Context() which is
cancelled before t.Cleanup runs; replace the cancelled context with a
non-cancelled one (e.g. use context.Background() or a background context with a
timeout) so the List calls inside DumpNamespaceDebug succeed during cleanup.
Update the ctx variable in Framework.DumpNamespaceDebug (replace ctx :=
t.Context()) to use a persistent background context and ensure the context
package is imported; keep all List calls (deployments, pods, events) using this
new ctx.

In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 42-46: The cleanup callbacks and the debug helper are using
t.Context() which is already canceled when t.Cleanup runs; replace usages of
t.Context() in the t.Cleanup closures and inside dumpClusterHealthAnalyzerDebug
by creating an uncancelable context first (ctx :=
context.WithoutCancel(t.Context())) and pass that ctx into any API calls
(delete/get/list) and into dumpClusterHealthAnalyzerDebug instead of
t.Context(); update the anonymous t.Cleanup funcs and the
dumpClusterHealthAnalyzerDebug function to accept/use the new ctx so
cleanup-time API calls are not given a canceled context.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 99ba088d-c938-4352-9e06-056a58ac5d82

📥 Commits

Reviewing files that changed from the base of the PR and between 7dc043f and 25f4b52.

📒 Files selected for processing (4)
  • test/e2e/framework/assertions.go
  • test/e2e/framework/framework.go
  • test/e2e/uiplugin_cluster_health_analyzer_test.go
  • test/e2e/uiplugin_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/e2e/uiplugin_test.go
  • test/e2e/framework/assertions.go

Comment thread test/e2e/framework/framework.go Outdated
Comment thread test/e2e/uiplugin_cluster_health_analyzer_test.go
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
test/e2e/framework/framework.go (1)

300-303: ⚠️ Potential issue | 🟠 Major

Use a non-cancelled context in cleanup diagnostics (already reported).

Line 302 uses t.Context(). When this helper is invoked from t.Cleanup, list calls can fail with context canceled, which defeats failure-time diagnostics.

Suggested fix
 func (f *Framework) DumpNamespaceDebug(t *testing.T, namespace string) {
 	t.Helper()
-	ctx := t.Context()
+	ctx := context.WithoutCancel(t.Context())
In Go 1.24+ testing package behavior: is testing.T.Context canceled before T.Cleanup callbacks run?

As per coding guidelines "**: Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/framework/framework.go` around lines 300 - 303, The helper
DumpNamespaceDebug currently grabs ctx from t.Context() which can be canceled
when run inside t.Cleanup; change it to use a non-cancelled context (e.g., ctx
:= context.Background() or context.TODO()) so diagnostic list calls in
DumpNamespaceDebug won’t fail with “context canceled”; update the function
(DumpNamespaceDebug) to import context if needed and replace the t.Context()
usage with the background context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@test/e2e/framework/framework.go`:
- Around line 300-303: The helper DumpNamespaceDebug currently grabs ctx from
t.Context() which can be canceled when run inside t.Cleanup; change it to use a
non-cancelled context (e.g., ctx := context.Background() or context.TODO()) so
diagnostic list calls in DumpNamespaceDebug won’t fail with “context canceled”;
update the function (DumpNamespaceDebug) to import context if needed and replace
the t.Context() usage with the background context.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 566b5429-753e-4378-bcfa-22e74ddb0c0d

📥 Commits

Reviewing files that changed from the base of the PR and between 25f4b52 and a0245db.

📒 Files selected for processing (4)
  • test/e2e/framework/assertions.go
  • test/e2e/framework/framework.go
  • test/e2e/uiplugin_cluster_health_analyzer_test.go
  • test/e2e/uiplugin_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • test/e2e/uiplugin_test.go
  • test/e2e/framework/assertions.go
  • test/e2e/uiplugin_cluster_health_analyzer_test.go

@DavidRajnoha DavidRajnoha force-pushed the test/e2e-uiplugin-cluster-health-analyzer branch from a0245db to f5b49a6 Compare April 17, 2026 07:32
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
test/e2e/framework/framework.go (1)

262-295: Minor: -0 suffix on canonical versions may skew comparisons for pre-releases.

semver.Canonical("v4.19") returns "v4.19.0", so appending -0 produces "v4.19.0-0" (a pre-release lower than v4.19.0). For plain release versions on both sides this is harmless since the suffix cancels out, but for a cluster on something like v4.19.0-rc.1 the result becomes "v4.19.0-rc.1-0", which compares below "v4.19.0-0" and would cause release candidates of the minimum version to be treated as below minimum. If that's intentional (require GA), a comment explaining the trick would help; otherwise drop the -0 suffix.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/framework/framework.go` around lines 262 - 295, The code in
SkipIfClusterVersionBelow builds canonicalActual and canonicalMin by calling
semver.Canonical(...) then appending "-0", which incorrectly downgrades
pre-release versions (e.g., v4.19.0-rc.1 becomes v4.19.0-rc.1-0) and can cause
RCs to be treated as older than the minimum; remove the "-0" suffix so
canonicalActual := semver.Canonical(actual) and canonicalMin :=
semver.Canonical(minVersion) (or, if the intent is to require GA only, instead
add a clear comment near the semver.Canonical calls explaining the deliberate
"-0" trick).
test/e2e/uiplugin_cluster_health_analyzer_test.go (1)

49-51: Low-entropy suffix can collide across reruns.

time.Now().UnixNano()%100000 gives ~100ms wrap-around and can collide with a leftover PrometheusRule/alert from a recent rerun if cleanup failed. Consider a stronger unique suffix (full nanos, or rand.Int63()) to avoid "AlreadyExists" flakes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/uiplugin_cluster_health_analyzer_test.go` around lines 49 - 51, The
current low-entropy suffix generation (suffix :=
strconv.FormatInt(time.Now().UnixNano()%100000, 10)) can collide; replace it
with a higher-entropy value (e.g., use full nanos or rand.Int63()) and update
the variables that use it (suffix, ruleName, alertName). Specifically, change
the suffix assignment to something like strconv.FormatInt(rand.Int63(), 10) or
strconv.FormatInt(time.Now().UnixNano(), 10), ensure math/rand is imported and
seeded appropriately in the test setup, and leave ruleName :=
"e2e-health-analyzer-"+suffix and alertName := "E2EHealthAnalyzer"+suffix
unchanged so they get the stronger unique suffix.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/e2e/framework/framework.go`:
- Around line 262-295: The code in SkipIfClusterVersionBelow builds
canonicalActual and canonicalMin by calling semver.Canonical(...) then appending
"-0", which incorrectly downgrades pre-release versions (e.g., v4.19.0-rc.1
becomes v4.19.0-rc.1-0) and can cause RCs to be treated as older than the
minimum; remove the "-0" suffix so canonicalActual := semver.Canonical(actual)
and canonicalMin := semver.Canonical(minVersion) (or, if the intent is to
require GA only, instead add a clear comment near the semver.Canonical calls
explaining the deliberate "-0" trick).

In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 49-51: The current low-entropy suffix generation (suffix :=
strconv.FormatInt(time.Now().UnixNano()%100000, 10)) can collide; replace it
with a higher-entropy value (e.g., use full nanos or rand.Int63()) and update
the variables that use it (suffix, ruleName, alertName). Specifically, change
the suffix assignment to something like strconv.FormatInt(rand.Int63(), 10) or
strconv.FormatInt(time.Now().UnixNano(), 10), ensure math/rand is imported and
seeded appropriately in the test setup, and leave ruleName :=
"e2e-health-analyzer-"+suffix and alertName := "E2EHealthAnalyzer"+suffix
unchanged so they get the stronger unique suffix.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: ffbb481e-3ce6-4f90-9884-e0f3363dc6c8

📥 Commits

Reviewing files that changed from the base of the PR and between a0245db and f5b49a6.

📒 Files selected for processing (4)
  • test/e2e/framework/assertions.go
  • test/e2e/framework/framework.go
  • test/e2e/uiplugin_cluster_health_analyzer_test.go
  • test/e2e/uiplugin_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/uiplugin_test.go

@DavidRajnoha
Copy link
Copy Markdown
Collaborator Author

/unhold

@DavidRajnoha
Copy link
Copy Markdown
Collaborator Author

/retest

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
test/e2e/uiplugin_cluster_health_analyzer_test.go (2)

38-42: Cleanup ordering note.

f.CleanUp calls inside newAlwaysFiringRule (registered after this block) will run before this dump, so the rule is gone by the time diagnostics are captured. If the rule state matters for debugging, register the dump cleanup after rule creation, or capture rule status inside the dump helper before resource deletion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/uiplugin_cluster_health_analyzer_test.go` around lines 38 - 42, The
current t.Cleanup that calls dumpClusterHealthAnalyzerDebug runs before cleanup
registered inside newAlwaysFiringRule, so the rule may be deleted before
diagnostics are captured; to fix, either move this t.Cleanup registration to
after newAlwaysFiringRule is called (so the dump runs last) or modify
dumpClusterHealthAnalyzerDebug to first capture and store the rule's
status/state (e.g., query the rule created by newAlwaysFiringRule using
plugin.Name) before other cleanups run; update references in the test to ensure
newAlwaysFiringRule, dumpClusterHealthAnalyzerDebug, and the t.Cleanup call
order reflect this change.

59-85: Long timeouts (10m + 15m) dominate test duration.

PR description reports alert firing in ~32s and incident metric in ~32s. The 10/15 min timeouts are ~20x real latency and will slow failure feedback substantially. Consider tightening to something like 3–5 minutes to fail fast while keeping headroom.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/uiplugin_cluster_health_analyzer_test.go` around lines 59 - 85, The
test uses f.AssertPromQLResultWithOptions for alertQuery and incidentQuery with
overly long timeouts (framework.WithTimeout(10*time.Minute) and
framework.WithTimeout(15*time.Minute)) and a 30s poll interval; tighten these to
fail faster while allowing headroom—update the two calls to use shorter timeouts
(e.g., framework.WithTimeout(5*time.Minute) for both) and reduce
framework.WithPollInterval from 30*time.Second to something like 10*time.Second
so AssertPromQLResultWithOptions (used with alertQuery, incidentQuery, and
alertName) returns quicker on failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/framework/framework.go`:
- Around line 282-294: Check the outputs of semver.Canonical for both actual and
minVersion and if either returns an empty string treat the version as non-semver
instead of proceeding with the bogus comparison: after computing canonicalActual
and canonicalMin, if canonicalActual == "" or canonicalMin == "" call t.Skipf
with a clear message (e.g., include cv.Status.Desired.Version and minVersion)
indicating the cluster version could not be parsed, so the test explicitly
skips/logs non-semver versions rather than silently making a meaningless
comparison.

In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 49-51: The suffix generation using time.Now().UnixNano()%100000 is
too small and risks collisions; update the code that sets suffix (used to build
ruleName and alertName) to produce a larger-entropy identifier — e.g., use
rand.Int63() (seeded once with time.Now().UnixNano()) or concatenate
time.Now().UnixNano() with t.Name() — so the generated suffix is effectively
unique across parallel/retried runs and prevents PrometheusRule/alert name
collisions.

---

Nitpick comments:
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 38-42: The current t.Cleanup that calls
dumpClusterHealthAnalyzerDebug runs before cleanup registered inside
newAlwaysFiringRule, so the rule may be deleted before diagnostics are captured;
to fix, either move this t.Cleanup registration to after newAlwaysFiringRule is
called (so the dump runs last) or modify dumpClusterHealthAnalyzerDebug to first
capture and store the rule's status/state (e.g., query the rule created by
newAlwaysFiringRule using plugin.Name) before other cleanups run; update
references in the test to ensure newAlwaysFiringRule,
dumpClusterHealthAnalyzerDebug, and the t.Cleanup call order reflect this
change.
- Around line 59-85: The test uses f.AssertPromQLResultWithOptions for
alertQuery and incidentQuery with overly long timeouts
(framework.WithTimeout(10*time.Minute) and
framework.WithTimeout(15*time.Minute)) and a 30s poll interval; tighten these to
fail faster while allowing headroom—update the two calls to use shorter timeouts
(e.g., framework.WithTimeout(5*time.Minute) for both) and reduce
framework.WithPollInterval from 30*time.Second to something like 10*time.Second
so AssertPromQLResultWithOptions (used with alertQuery, incidentQuery, and
alertName) returns quicker on failures.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: bbf117ff-7493-4ae1-8ddb-7df0eb5b0b18

📥 Commits

Reviewing files that changed from the base of the PR and between a0245db and f5b49a6.

📒 Files selected for processing (4)
  • test/e2e/framework/assertions.go
  • test/e2e/framework/framework.go
  • test/e2e/uiplugin_cluster_health_analyzer_test.go
  • test/e2e/uiplugin_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/uiplugin_test.go

Comment thread test/e2e/framework/framework.go
Comment thread test/e2e/uiplugin_cluster_health_analyzer_test.go
@simonpasquier
Copy link
Copy Markdown
Contributor

/ok-to-test

Copy link
Copy Markdown
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of my comments are non-blocking and can be addressed in a follow-up (my main concern is the failure mode if/when the version can't be determined).

Comment thread test/e2e/framework/assertions.go
Comment thread test/e2e/framework/framework.go
f.GetResourceWithRetry(t, healthAnalyzerDeploymentName, uiPluginInstallNS, &haDeployment)
f.AssertDeploymentReady(healthAnalyzerDeploymentName, uiPluginInstallNS, framework.WithTimeout(5*time.Minute))(t)

suffix := strconv.FormatInt(time.Now().UnixNano()%100000, 10)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this to ensure that we can run the test multiple times locally and not hit conflicts?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, to ensure that potential leftovers from prior tests would not cause any potential conflicts.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment about this?

assert.NilError(t, err, "incident metric for %s never appeared", alertName)
}

func newMonitoringUIPlugin(t *testing.T) *uiv1.UIPlugin {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is a bit misleading since it will delete the plugin if one already exists. IIUC and given that UI plugin resources are singletons, we can't test a given plugin in parallel. It might be better to enforce this invariant in the structure of TestUIPlugin but it can be a follow-up exercise.

Comment thread test/e2e/uiplugin_cluster_health_analyzer_test.go Outdated
@DavidRajnoha DavidRajnoha force-pushed the test/e2e-uiplugin-cluster-health-analyzer branch from 86cc8b0 to 9146e40 Compare April 21, 2026 09:14
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/e2e/uiplugin_cluster_health_analyzer_test.go (1)

31-32: Register scheme once, not per test invocation.

monv1.AddToScheme mutates the shared f.K8sClient.Scheme(). Calling it inside the test body works but repeats on every run and is racy if the test is ever parallelized with another that uses the same types. Prefer registering in TestMain (or a package-level init) alongside other scheme registrations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/uiplugin_cluster_health_analyzer_test.go` around lines 31 - 32, Move
the call to monv1.AddToScheme away from the per-test body and register it once
during package initialization; remove monv1.AddToScheme(f.K8sClient.Scheme())
from the test and instead invoke monv1.AddToScheme(...) from a package-level
init function or TestMain where other scheme registrations occur (ensuring you
call it with the shared scheme used by tests), so monv1.AddToScheme is applied
only once and not during each test execution.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 107-113: The delete error handling for
f.K8sClient.Delete(existing) currently calls t.Fatalf on any error and doesn't
tolerate races; update the block around the Delete call (the use of
f.K8sClient.Delete, existing, and subsequent waitForUIPluginDeletion) to treat
errors.IsNotFound(err) as non-fatal — only call t.Fatalf for errors that are not
IsNotFound — so that a concurrent removal does not fail the test.

---

Nitpick comments:
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 31-32: Move the call to monv1.AddToScheme away from the per-test
body and register it once during package initialization; remove
monv1.AddToScheme(f.K8sClient.Scheme()) from the test and instead invoke
monv1.AddToScheme(...) from a package-level init function or TestMain where
other scheme registrations occur (ensuring you call it with the shared scheme
used by tests), so monv1.AddToScheme is applied only once and not during each
test execution.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 533751f8-3833-4a44-a1ef-676c42f8ff1c

📥 Commits

Reviewing files that changed from the base of the PR and between 86cc8b0 and 9146e40.

📒 Files selected for processing (4)
  • test/e2e/framework/assertions.go
  • test/e2e/framework/framework.go
  • test/e2e/uiplugin_cluster_health_analyzer_test.go
  • test/e2e/uiplugin_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/framework/assertions.go

Comment thread test/e2e/uiplugin_cluster_health_analyzer_test.go Outdated
@DavidRajnoha DavidRajnoha force-pushed the test/e2e-uiplugin-cluster-health-analyzer branch from 1942245 to 0522dc4 Compare April 21, 2026 12:53
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
test/e2e/framework/framework.go (1)

289-293: ⚠️ Potential issue | 🟡 Minor

Fail on unparseable cluster versions before comparing.

semver.Canonical(...) can return ""; formatting that as "-0" makes the comparison depend on invalid semver operands instead of the real OpenShift version. Treat an unparseable actual/minimum version as a hard test setup failure, not a skip.

Suggested fix
-	canonicalActual := fmt.Sprintf("%s-0", semver.Canonical(actual))
-	canonicalMin := fmt.Sprintf("%s-0", semver.Canonical(minVersion))
+	canonicalActual := semver.Canonical(actual)
+	canonicalMin := semver.Canonical(minVersion)
+	if canonicalActual == "" || canonicalMin == "" {
+		t.Fatalf("failed to parse cluster version for comparison: actual=%q minimum=%q", cv.Status.Desired.Version, minVersion)
+		return
+	}
+
+	canonicalActual = fmt.Sprintf("%s-0", canonicalActual)
+	canonicalMin = fmt.Sprintf("%s-0", canonicalMin)

Verification command:

#!/bin/bash
set -euo pipefail

# Verify the local comparison code.
rg -n -C3 'semver\.Canonical|canonicalActual|canonicalMin|semver\.Compare' --type go

# Verify semver.Canonical/Compare behavior for golang.org/x/mod v0.34.0.
curl -fsSL https://raw.githubusercontent.com/golang/mod/v0.34.0/semver/semver.go \
  | sed -n '/func Canonical/,/func Major/p;/func Compare/,/func Max/p'

As per coding guidelines, “Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/framework/framework.go` around lines 289 - 293, The current version
comparison uses semver.Canonical on actual and minVersion then appends "-0"
which can hide parse errors; update the logic around semver.Canonical(actual)
and semver.Canonical(minVersion) (the canonicalActual/canonicalMin variables
used with semver.Compare) to detect when Canonical returns an empty string and
treat that as a hard test setup failure: if either canonicalActual=="" or
canonicalMin=="" call t.Fatalf with a clear message including the original
actual/minVersion values so the test fails fast on unparseable versions rather
than silently skipping or comparing invalid semver operands.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 24-27: Replace the hard-coded prometheusRuleNamespace
("e2e-health-analyzer-rules") with a generated, unique namespace for the test
run and ensure cleanup only deletes that generated namespace; update uses of
prometheusRuleNamespace in the test (including where it's accepted if
already-existing and where it's deleted) to instead use the generated value
(create it at test setup with a helper like newTestNamespace or a UUID-suffixed
name), pass that generated namespace into functions interacting with Prometheus
rules and the healthAnalyzerDeploymentName, and modify teardown to only remove
the namespace the test created (or skip deletion if it pre-existed) so you don't
delete shared static namespaces or cause flakiness.

---

Duplicate comments:
In `@test/e2e/framework/framework.go`:
- Around line 289-293: The current version comparison uses semver.Canonical on
actual and minVersion then appends "-0" which can hide parse errors; update the
logic around semver.Canonical(actual) and semver.Canonical(minVersion) (the
canonicalActual/canonicalMin variables used with semver.Compare) to detect when
Canonical returns an empty string and treat that as a hard test setup failure:
if either canonicalActual=="" or canonicalMin=="" call t.Fatalf with a clear
message including the original actual/minVersion values so the test fails fast
on unparseable versions rather than silently skipping or comparing invalid
semver operands.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 488e88a5-2578-4d8d-8042-94f7f358f894

📥 Commits

Reviewing files that changed from the base of the PR and between 9146e40 and 0522dc4.

📒 Files selected for processing (4)
  • test/e2e/framework/assertions.go
  • test/e2e/framework/framework.go
  • test/e2e/uiplugin_cluster_health_analyzer_test.go
  • test/e2e/uiplugin_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/framework/assertions.go

Comment thread test/e2e/uiplugin_cluster_health_analyzer_test.go
@DavidRajnoha DavidRajnoha force-pushed the test/e2e-uiplugin-cluster-health-analyzer branch from 0522dc4 to 2595306 Compare April 24, 2026 08:41
Add an end-to-end test that deploys a Monitoring UIPlugin with the
cluster-health-analyzer enabled, fires a synthetic PrometheusRule alert,
and verifies the corresponding incident metric appears.

Framework additions: SkipIfClusterVersionBelow (semver gate that fails
on unknown version instead of silently skipping), DumpNamespaceDebug
(on-failure diagnostic dump of deployments, pods and events), and
AssertPromQLResultWithOptions (configurable poll interval/timeout).

The UIPlugin setup is race-tolerant: deleteUIPluginIfExists issues a
direct Delete and treats IsNotFound as success, avoiding a TOCTOU window
between Get and Delete that could cause spurious failures.

Made-with: Cursor
@DavidRajnoha DavidRajnoha force-pushed the test/e2e-uiplugin-cluster-health-analyzer branch from 2595306 to 4fbe7ef Compare April 24, 2026 09:16
Copy link
Copy Markdown
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/hold

Feel free to unblock and address my non-blocking remarks in a follow-up PR.

// DumpNamespaceDebug logs deployments (with conditions), pods (with container
// statuses), and events for the given namespace. Useful as a t.Cleanup or
// on-failure diagnostic helper.
func (f *Framework) DumpNamespaceDebug(t *testing.T, namespace string) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like if we have a follow-up Jira ticket or PR to generalize the gathering of post-test data to help with troubleshooting.

)

func clusterHealthAnalyzer(t *testing.T) {
f.SkipIfClusterVersionBelow(t, "4.19")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) a follow-up change could collect the cluster version before in the setup phase so that we don't have to query ClusterVersion multiple times (and avoid failures if the API is flaky).

f.GetResourceWithRetry(t, healthAnalyzerDeploymentName, uiPluginInstallNS, &haDeployment)
f.AssertDeploymentReady(healthAnalyzerDeploymentName, uiPluginInstallNS, framework.WithTimeout(5*time.Minute))(t)

suffix := strconv.FormatInt(time.Now().UnixNano()%100000, 10)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment about this?

@simonpasquier
Copy link
Copy Markdown
Contributor

/approve

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DavidRajnoha, simonpasquier, tremes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@DavidRajnoha
Copy link
Copy Markdown
Collaborator Author

Followup work tracked in #1071 and #1072

@DavidRajnoha
Copy link
Copy Markdown
Collaborator Author

/unhold

@openshift-merge-bot openshift-merge-bot Bot merged commit 942f362 into rhobs:main Apr 24, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants