OBSINTA-1219: add e2e tests for UIPlugin incident detection #1038
Conversation
Hi @DavidRajnoha. Thanks for your PR. I'm waiting for a rhobs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the ok-to-test label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@DavidRajnoha force-pushed from 89ac678 to 0768ae2
/ok-to-test

/retest
@DavidRajnoha: This pull request references OBSINTA-1219 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@DavidRajnoha force-pushed from 970ac8c to 90b891c
/hold

/test observability-operator-e2e

/retest
@DavidRajnoha force-pushed from 90b891c to 23c1e1d
/unhold

Thank you!
simonpasquier left a comment:
I've checked the test output and it adds 3m30s to the total duration, which is not negligible. Do we really need to deploy a crashlooping pod? Can't we write a "dummy" alerting rule with a static expression (vector(1)) and hardcoded labels?
```go
Expr: intstr.FromString(fmt.Sprintf(
	`max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace="%s", pod=~"%s.*", job="kube-state-metrics"}[5m]) >= 1`,
	e2eTestNamespace, podPrefix)),
For: ptr.To(monv1.Duration("1m")),
```
Removing the `for` clause would speed up the test.
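A static rule along the lines suggested above could look like this (a minimal sketch; the rule name, namespace, and alert name are illustrative — only the apiVersion/kind follow the Prometheus Operator PrometheusRule CRD):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: e2e-always-firing   # illustrative name
  namespace: e2e-test       # illustrative namespace
spec:
  groups:
    - name: e2e
      rules:
        - alert: E2EHealthAnalyzer
          expr: vector(1)   # static expression: fires on the first evaluation
          labels:
            severity: warning
          # no "for:" clause, so there is no extra pending period
```

Because the expression needs no workload and no scrape target, the alert fires as soon as the rule is evaluated.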
```go
	return *p
}

func skipIfClusterVersionBelow(t *testing.T, minVersion string) {
```
(nit) could it be a framework function?
```go
	t.Log("=== END DEBUG DUMP ===")
}

func ptrInt32(p *int32) int32 {
```
(nit) ptr.To() already exists
```go
assert.NilError(t, err, "alert %s never fired", alertName)

t.Log("Waiting for cluster-health-analyzer to expose incident metric...")
incidentQuery := fmt.Sprintf(`cluster_health_components_map{src_alertname="%s"}`, alertName)
```
Apart from checking the metric generated by the cluster health analyzer, is there any other outcome that we should verify?
```go
	return fmt.Errorf("expected incident metric, got: %v", v)
}
for _, sample := range vec {
	if string(sample.Metric["src_alertname"]) != alertName {
```
Already checked by the PromQL expression.
```go
		return fmt.Errorf("expected src_alertname=%s, got %s", alertName, sample.Metric["src_alertname"])
	}
	if string(sample.Metric["src_severity"]) != "warning" {
		return fmt.Errorf("expected src_severity=warning, got %s", sample.Metric["src_severity"])
```
Could also be checked in the PromQL expression.
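Both label assertions could be folded into the PromQL selector itself, so a non-empty result already implies the expected labels (a sketch; the alert name is illustrative):

```promql
cluster_health_components_map{src_alertname="E2EHealthAnalyzer", src_severity="warning"}
```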
/hold
@simonpasquier I updated the test to use a static rule. For reference, here is the current timing breakdown of the test:
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/framework/framework.go`:
- Around line 297-302: The DumpNamespaceDebug method uses t.Context() which is
cancelled before t.Cleanup runs; replace the cancelled context with a
non-cancelled one (e.g. use context.Background() or a background context with a
timeout) so the List calls inside DumpNamespaceDebug succeed during cleanup.
Update the ctx variable in Framework.DumpNamespaceDebug (replace ctx :=
t.Context()) to use a persistent background context and ensure the context
package is imported; keep all List calls (deployments, pods, events) using this
new ctx.
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 42-46: The cleanup callbacks and the debug helper are using
t.Context() which is already canceled when t.Cleanup runs; replace usages of
t.Context() in the t.Cleanup closures and inside dumpClusterHealthAnalyzerDebug
by creating an uncancelable context first (ctx :=
context.WithoutCancel(t.Context())) and pass that ctx into any API calls
(delete/get/list) and into dumpClusterHealthAnalyzerDebug instead of
t.Context(); update the anonymous t.Cleanup funcs and the
dumpClusterHealthAnalyzerDebug function to accept/use the new ctx so
cleanup-time API calls are not given a canceled context.
♻️ Duplicate comments (1)
test/e2e/framework/framework.go (1)
300-303: ⚠️ Potential issue | 🟠 Major — Use a non-cancelled context in cleanup diagnostics (already reported).

Line 302 uses t.Context(). When this helper is invoked from t.Cleanup, list calls can fail with "context canceled", which defeats failure-time diagnostics.

Suggested fix:

```diff
 func (f *Framework) DumpNamespaceDebug(t *testing.T, namespace string) {
 	t.Helper()
-	ctx := t.Context()
+	ctx := context.WithoutCancel(t.Context())
```

Note: in Go 1.24+, testing.T.Context is canceled before T.Cleanup callbacks run.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/framework/framework.go` around lines 300 - 303, The helper DumpNamespaceDebug currently grabs ctx from t.Context() which can be canceled when run inside t.Cleanup; change it to use a non-cancelled context (e.g., ctx := context.Background() or context.TODO()) so diagnostic list calls in DumpNamespaceDebug won’t fail with “context canceled”; update the function (DumpNamespaceDebug) to import context if needed and replace the t.Context() usage with the background context.
@DavidRajnoha force-pushed from a0245db to f5b49a6
🧹 Nitpick comments (2)

test/e2e/framework/framework.go (1)

262-295: Minor: "-0" suffix on canonical versions may skew comparisons for pre-releases.

semver.Canonical("v4.19") returns "v4.19.0", so appending "-0" produces "v4.19.0-0" (a pre-release that sorts below v4.19.0). For plain release versions on both sides this is harmless since the suffix cancels out, but for a cluster on something like v4.19.0-rc.1 the result becomes "v4.19.0-rc.1-0", which compares below "v4.19.0-0" and would cause release candidates of the minimum version to be treated as below minimum. If that's intentional (require GA), a comment explaining the trick would help; otherwise drop the "-0" suffix.

test/e2e/uiplugin_cluster_health_analyzer_test.go (1)

49-51: Low-entropy suffix can collide across reruns.

time.Now().UnixNano()%100000 wraps every 100µs and can collide with a leftover PrometheusRule/alert from a recent rerun if cleanup failed. Consider a stronger unique suffix (full nanos, or rand.Int63()) to avoid "AlreadyExists" flakes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@test/e2e/framework/framework.go`:
- Around line 262-295: The code in SkipIfClusterVersionBelow builds
canonicalActual and canonicalMin by calling semver.Canonical(...) then appending
"-0", which incorrectly downgrades pre-release versions (e.g., v4.19.0-rc.1
becomes v4.19.0-rc.1-0) and can cause RCs to be treated as older than the
minimum; remove the "-0" suffix so canonicalActual := semver.Canonical(actual)
and canonicalMin := semver.Canonical(minVersion) (or, if the intent is to
require GA only, instead add a clear comment near the semver.Canonical calls
explaining the deliberate "-0" trick).
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 49-51: The current low-entropy suffix generation (suffix :=
strconv.FormatInt(time.Now().UnixNano()%100000, 10)) can collide; replace it
with a higher-entropy value (e.g., use full nanos or rand.Int63()) and update
the variables that use it (suffix, ruleName, alertName). Specifically, change
the suffix assignment to something like strconv.FormatInt(rand.Int63(), 10) or
strconv.FormatInt(time.Now().UnixNano(), 10), ensure math/rand is imported and
seeded appropriately in the test setup, and leave ruleName :=
"e2e-health-analyzer-"+suffix and alertName := "E2EHealthAnalyzer"+suffix
unchanged so they get the stronger unique suffix.
/unhold

/retest
Actionable comments posted: 2
🧹 Nitpick comments (2)
test/e2e/uiplugin_cluster_health_analyzer_test.go (2)
38-42: Cleanup ordering note.

f.CleanUp calls inside newAlwaysFiringRule (registered after this block) will run before this dump, so the rule is gone by the time diagnostics are captured. If the rule state matters for debugging, register the dump cleanup after rule creation, or capture rule status inside the dump helper before resource deletion.

59-85: Long timeouts (10m + 15m) dominate test duration.

The PR description reports the alert firing in ~32s and the incident metric appearing in ~32s. The 10/15 minute timeouts are ~20x the real latency and will slow failure feedback substantially. Consider tightening to something like 3–5 minutes to fail fast while keeping headroom.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/framework/framework.go`:
- Around line 282-294: Check the outputs of semver.Canonical for both actual and
minVersion and if either returns an empty string treat the version as non-semver
instead of proceeding with the bogus comparison: after computing canonicalActual
and canonicalMin, if canonicalActual == "" or canonicalMin == "" call t.Skipf
with a clear message (e.g., include cv.Status.Desired.Version and minVersion)
indicating the cluster version could not be parsed, so the test explicitly
skips/logs non-semver versions rather than silently making a meaningless
comparison.
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 49-51: The suffix generation using time.Now().UnixNano()%100000 is
too small and risks collisions; update the code that sets suffix (used to build
ruleName and alertName) to produce a larger-entropy identifier — e.g., use
rand.Int63() (seeded once with time.Now().UnixNano()) or concatenate
time.Now().UnixNano() with t.Name() — so the generated suffix is effectively
unique across parallel/retried runs and prevents PrometheusRule/alert name
collisions.
---
Nitpick comments:
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 38-42: The current t.Cleanup that calls
dumpClusterHealthAnalyzerDebug runs before cleanup registered inside
newAlwaysFiringRule, so the rule may be deleted before diagnostics are captured;
to fix, either move this t.Cleanup registration to after newAlwaysFiringRule is
called (so the dump runs last) or modify dumpClusterHealthAnalyzerDebug to first
capture and store the rule's status/state (e.g., query the rule created by
newAlwaysFiringRule using plugin.Name) before other cleanups run; update
references in the test to ensure newAlwaysFiringRule,
dumpClusterHealthAnalyzerDebug, and the t.Cleanup call order reflect this
change.
- Around line 59-85: The test uses f.AssertPromQLResultWithOptions for
alertQuery and incidentQuery with overly long timeouts
(framework.WithTimeout(10*time.Minute) and
framework.WithTimeout(15*time.Minute)) and a 30s poll interval; tighten these to
fail faster while allowing headroom—update the two calls to use shorter timeouts
(e.g., framework.WithTimeout(5*time.Minute) for both) and reduce
framework.WithPollInterval from 30*time.Second to something like 10*time.Second
so AssertPromQLResultWithOptions (used with alertQuery, incidentQuery, and
alertName) returns quicker on failures.
/ok-to-test
simonpasquier left a comment:
Most of my comments are non-blocking and can be addressed in a follow-up (my main concern is the failure mode if/when the version can't be determined).
```go
f.GetResourceWithRetry(t, healthAnalyzerDeploymentName, uiPluginInstallNS, &haDeployment)
f.AssertDeploymentReady(healthAnalyzerDeploymentName, uiPluginInstallNS, framework.WithTimeout(5*time.Minute))(t)

suffix := strconv.FormatInt(time.Now().UnixNano()%100000, 10)
```
> is this to ensure that we can run the test multiple times locally and not hit conflicts?

Yes, to ensure that potential leftovers from prior tests would not cause any conflicts.

> can you add a comment about this?
```go
assert.NilError(t, err, "incident metric for %s never appeared", alertName)
}

func newMonitoringUIPlugin(t *testing.T) *uiv1.UIPlugin {
```
The name is a bit misleading since it will delete the plugin if one already exists. IIUC, and given that UIPlugin resources are singletons, we can't test a given plugin in parallel. It might be better to enforce this invariant in the structure of TestUIPlugin, but it can be a follow-up exercise.
@DavidRajnoha force-pushed from 86cc8b0 to 9146e40
Actionable comments posted: 1

🧹 Nitpick comments (1)

test/e2e/uiplugin_cluster_health_analyzer_test.go (1)

31-32: Register scheme once, not per test invocation.

monv1.AddToScheme mutates the shared f.K8sClient.Scheme(). Calling it inside the test body works but repeats on every run and is racy if the test is ever parallelized with another that uses the same types. Prefer registering in TestMain (or a package-level init) alongside other scheme registrations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 107-113: The delete error handling for
f.K8sClient.Delete(existing) currently calls t.Fatalf on any error and doesn't
tolerate races; update the block around the Delete call (the use of
f.K8sClient.Delete, existing, and subsequent waitForUIPluginDeletion) to treat
errors.IsNotFound(err) as non-fatal — only call t.Fatalf for errors that are not
IsNotFound — so that a concurrent removal does not fail the test.
---
Nitpick comments:
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 31-32: Move the call to monv1.AddToScheme away from the per-test
body and register it once during package initialization; remove
monv1.AddToScheme(f.K8sClient.Scheme()) from the test and instead invoke
monv1.AddToScheme(...) from a package-level init function or TestMain where
other scheme registrations occur (ensuring you call it with the shared scheme
used by tests), so monv1.AddToScheme is applied only once and not during each
test execution.
@DavidRajnoha force-pushed from 1942245 to 0522dc4
Actionable comments posted: 1
♻️ Duplicate comments (1)
test/e2e/framework/framework.go (1)
289-293: ⚠️ Potential issue | 🟡 Minor — Fail on unparseable cluster versions before comparing.

semver.Canonical(...) can return ""; formatting that as "-0" makes the comparison depend on invalid semver operands instead of the real OpenShift version. Treat an unparseable actual/minimum version as a hard test setup failure, not a skip.

Suggested fix:

```diff
-	canonicalActual := fmt.Sprintf("%s-0", semver.Canonical(actual))
-	canonicalMin := fmt.Sprintf("%s-0", semver.Canonical(minVersion))
+	canonicalActual := semver.Canonical(actual)
+	canonicalMin := semver.Canonical(minVersion)
+	if canonicalActual == "" || canonicalMin == "" {
+		t.Fatalf("failed to parse cluster version for comparison: actual=%q minimum=%q", cv.Status.Desired.Version, minVersion)
+		return
+	}
+
+	canonicalActual = fmt.Sprintf("%s-0", canonicalActual)
+	canonicalMin = fmt.Sprintf("%s-0", canonicalMin)
```

Verification command:

```bash
#!/bin/bash
set -euo pipefail
# Verify the local comparison code.
rg -n -C3 'semver\.Canonical|canonicalActual|canonicalMin|semver\.Compare' --type go
# Verify semver.Canonical/Compare behavior for golang.org/x/mod v0.34.0.
curl -fsSL https://raw.githubusercontent.com/golang/mod/v0.34.0/semver/semver.go \
  | sed -n '/func Canonical/,/func Major/p;/func Compare/,/func Max/p'
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 24-27: Replace the hard-coded prometheusRuleNamespace
("e2e-health-analyzer-rules") with a generated, unique namespace for the test
run and ensure cleanup only deletes that generated namespace; update uses of
prometheusRuleNamespace in the test (including where it's accepted if
already-existing and where it's deleted) to instead use the generated value
(create it at test setup with a helper like newTestNamespace or a UUID-suffixed
name), pass that generated namespace into functions interacting with Prometheus
rules and the healthAnalyzerDeploymentName, and modify teardown to only remove
the namespace the test created (or skip deletion if it pre-existed) so you don't
delete shared static namespaces or cause flakiness.
---
Duplicate comments:
In `@test/e2e/framework/framework.go`:
- Around line 289-293: The current version comparison uses semver.Canonical on
actual and minVersion then appends "-0" which can hide parse errors; update the
logic around semver.Canonical(actual) and semver.Canonical(minVersion) (the
canonicalActual/canonicalMin variables used with semver.Compare) to detect when
Canonical returns an empty string and treat that as a hard test setup failure:
if either canonicalActual=="" or canonicalMin=="" call t.Fatalf with a clear
message including the original actual/minVersion values so the test fails fast
on unparseable versions rather than silently skipping or comparing invalid
semver operands.
@DavidRajnoha force-pushed from 0522dc4 to 2595306
Add an end-to-end test that deploys a Monitoring UIPlugin with the cluster-health-analyzer enabled, fires a synthetic PrometheusRule alert, and verifies the corresponding incident metric appears.

Framework additions: SkipIfClusterVersionBelow (a semver gate that fails on an unknown version instead of silently skipping), DumpNamespaceDebug (on-failure diagnostic dump of deployments, pods and events), and AssertPromQLResultWithOptions (configurable poll interval/timeout).

The UIPlugin setup is race-tolerant: deleteUIPluginIfExists issues a direct Delete and treats IsNotFound as success, avoiding a TOCTOU window between Get and Delete that could cause spurious failures.

Made-with: Cursor
@DavidRajnoha force-pushed from 2595306 to 4fbe7ef
simonpasquier left a comment:
/lgtm
/hold
Feel free to unblock and address my non-blocking remarks in a follow-up PR.
```go
// DumpNamespaceDebug logs deployments (with conditions), pods (with container
// statuses), and events for the given namespace. Useful as a t.Cleanup or
// on-failure diagnostic helper.
func (f *Framework) DumpNamespaceDebug(t *testing.T, namespace string) {
```
I'd like us to have a follow-up Jira ticket or PR to generalize the gathering of post-test data to help with troubleshooting.
```go
)

func clusterHealthAnalyzer(t *testing.T) {
	f.SkipIfClusterVersionBelow(t, "4.19")
```
(nit) a follow-up change could collect the cluster version in the setup phase so that we don't have to query ClusterVersion multiple times (and avoid failures if the API is flaky).
```go
f.GetResourceWithRetry(t, healthAnalyzerDeploymentName, uiPluginInstallNS, &haDeployment)
f.AssertDeploymentReady(healthAnalyzerDeploymentName, uiPluginInstallNS, framework.WithTimeout(5*time.Minute))(t)

suffix := strconv.FormatInt(time.Now().UnixNano()%100000, 10)
```
can you add a comment about this?
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DavidRajnoha, simonpasquier, tremes

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment.
/unhold
Add end-to-end tests that validate the monitoring UIPlugin with cluster-health-analyzer: a deployment readiness check and a functional test that triggers a CrashLoopBackOff alert and verifies the cluster_health_components_map incident metric is produced.
Also introduce AssertPromQLResultWithOptions to allow callers to override the default poll interval and timeout, and generalize waitForDBUIPluginDeletion to waitForUIPluginDeletion.