Commit 6c7ee78

feat: add diagnostic log collection on test failure
Adds collectDiagnosticLogs() to KubernetesClientHelper that captures cluster state (events, pods, deployments, statefulsets, routes, configmaps, per-container pod logs) to files on test failure. TeardownReporter now tracks failed projects and collects diagnostics before namespace deletion. Log collection runs on both CI and local; namespace deletion remains CI-only. Bumps version to 1.1.34.
1 parent 69cc164 commit 6c7ee78

8 files changed

Lines changed: 258 additions & 24 deletions
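The per-container collection strategy described in the commit message (one log file per init and app container, avoiding `--all-containers`) can be sketched as pure path/argument computation. `podLogTargets` and the `PodSpec` shape below are hypothetical illustrations, not the actual `KubernetesClientHelper` internals:

```typescript
// Sketch: compute one log target per container (init + app), mirroring the
// pods/<pod-name>/<container-name>.log layout described above.
// PodSpec and podLogTargets are hypothetical, for illustration only.
interface PodSpec {
  name: string;
  initContainers: string[];
  containers: string[];
}

interface LogTarget {
  container: string;
  file: string;   // relative path under the diagnostics output dir
  args: string[]; // kubectl arguments for this container
}

function podLogTargets(pod: PodSpec, namespace: string): LogTarget[] {
  const all = [...pod.initContainers, ...pod.containers];
  return all.map((container) => ({
    container,
    file: `pods/${pod.name}/${container}.log`,
    // One kubectl invocation per container, so a container that never
    // started only fails its own collection, not the whole pod's.
    args: ["logs", pod.name, "-c", container, "-n", namespace],
  }));
}
```

The design point is isolation: `kubectl logs --all-containers` fails wholesale when any container has no logs yet, while per-container invocations degrade gracefully.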

docs/.vitepress/config.ts

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ export default defineConfig({
   { text: "Examples", link: "/examples/" },
   { text: "Overlay Testing", link: "/overlay/" },
   {
-    text: "v1.1.33",
+    text: "v1.1.34",
     items: [{ text: "Changelog", link: "/changelog" }],
   },
 ],

docs/changelog.md

Lines changed: 16 additions & 1 deletion
@@ -2,12 +2,27 @@
 
 All notable changes to this project will be documented in this file.
 
-## [1.1.33] - Current
+## [1.1.34] - Current
+
+### Added
+
+- **Diagnostic log collection on failure**: `collectDiagnosticLogs(namespace, outputDir?)` on `KubernetesClientHelper` captures comprehensive cluster state (events, pod status, deployments, statefulsets, routes, and per-container pod logs including init containers and previous restarts) to files under `node_modules/.cache/e2e-test-results/logs/<namespace>/`. Uses `kubectl` for cross-platform compatibility. Empty files (e.g. no previous logs) are not created.
+- **TeardownReporter collects diagnostics on test failure**: When any test in a project fails, the teardown reporter automatically calls `collectDiagnosticLogs` before namespace deletion. Diagnostic collection runs on both CI and local; namespace deletion remains CI-only.
+- **Per-container pod log collection**: Logs are collected per-container (init + app containers) instead of via `--all-containers`, which fails entirely if any container hasn't started. Files are saved to `pods/<pod-name>/<container-name>.log` and `pods/<pod-name>/<container-name>.previous.log`.
+
+### Changed
+
+- **TeardownReporter tracks test failures**: Added a `_projectsWithFailures` set to track which projects had test failures, so diagnostic logs are only collected when needed.
+- **TeardownReporter active on non-CI**: The reporter now processes `onTestEnd`/`onEnd` regardless of the `CI` env var. Log collection always runs; namespace deletion is still gated on `CI=true`.
+
+## [1.1.33]
 
 ### Added
 
 - **Automatic Vault secret loading for local development**: Set `VAULT=1` or `VAULT=true` to automatically fetch secrets from HashiCorp Vault during global setup. Handles OIDC login, fetches global and per-workspace secrets, and injects them into `process.env`. Only secret key names are logged, never values. Configurable via `VAULT_ADDR` and `VAULT_BASE_PATH` env vars. Logs a Slack channel (`#rhdh-e2e-tests`) when permission is denied.
 
+## [1.1.32]
+
 ### Fixed
 
 - **Normalize `-dynamic` suffix in `extractPluginName`**: Plugins whose metadata `dynamicArtifact` is a local path (ending in `-dynamic`) were not matched during PR OCI resolution or config injection, because the metadata map key included the `-dynamic` suffix while OCI URL lookups did not. `extractPluginName` now strips the `-dynamic` suffix so local paths and OCI refs for the same logical plugin produce the same key. ([RHDHBUGS-2987](https://issues.redhat.com/browse/RHDHBUGS-2987))

docs/guide/core-concepts/error-handling.md

Lines changed: 8 additions & 0 deletions
@@ -280,6 +280,14 @@ await page.click('button[data-testid="save"]');
 await expect(page.getByText("Saved")).toBeVisible();
 ```
 
+## Cluster Diagnostic Logs
+
+When tests fail, the framework automatically collects cluster diagnostics (pod logs, events, deployments) to `node_modules/.cache/e2e-test-results/logs/<namespace>/`. This includes per-container logs for all pods (init and app containers), with previous restart logs when available.
+
+Check these files first when debugging deployment or pod failures — they're often more useful than Playwright's HTML report for infrastructure issues.
+
+See [Kubernetes Client — Diagnostic Log Collection](/guide/utilities/kubernetes-client#diagnostic-log-collection) for the full list of collected resources and API details.
+
 ## Error Handling Checklist
 
 - [ ] Use specific error messages that include context

docs/guide/utilities/kubernetes-client.md

Lines changed: 38 additions & 0 deletions
@@ -121,6 +121,44 @@ When a failure is detected, the method:
 2. Fetches container logs via `oc logs`
 3. Throws an error with the failure details
 
+## Diagnostic Log Collection
+
+### `collectDiagnosticLogs(namespace, outputDir?)`
+
+Collects comprehensive cluster diagnostics and saves them to files. Uses `kubectl` for cross-platform compatibility (OpenShift, EKS, GKE, etc.). OpenShift-specific resources (routes) are collected on a best-effort basis.
+
+```typescript
+await k8sClient.collectDiagnosticLogs("my-namespace");
+// Saves to: node_modules/.cache/e2e-test-results/logs/my-namespace/
+
+// Or with a custom output directory:
+await k8sClient.collectDiagnosticLogs("my-namespace", "/tmp/debug-logs");
+```
+
+**Collected resources:**
+
+| File | Content |
+|------|---------|
+| `events.txt` | Namespace events sorted by timestamp |
+| `pods.txt` | Pod status (`kubectl get pods -o wide`) |
+| `describe-pods.txt` | Full pod descriptions |
+| `deployments.txt` | Deployment status |
+| `describe-deployments.txt` | Full deployment descriptions |
+| `statefulsets.txt` | StatefulSet status |
+| `routes.txt` | OpenShift routes (skipped on non-OpenShift clusters) |
+| `pods/<pod>/<container>.log` | Current logs per container (init + app) |
+| `pods/<pod>/<container>.previous.log` | Previous restart logs (only if pod restarted) |
+
+**Key behaviors:**
+- Logs are collected per-container rather than via `--all-containers`, so a failed init container doesn't block collection of other container logs
+- Empty files are not created (e.g., when there are no previous logs)
+- Resource types that don't exist on the cluster (e.g., routes on non-OpenShift) are silently skipped
+- All resource collection runs in parallel via `Promise.allSettled`
+
+**Automatic collection on test failure:**
+
+In the overlay testing flow, you don't need to call this manually. The built-in `TeardownReporter` automatically calls `collectDiagnosticLogs` for any project that had test failures. This works on both CI and local runs.
+
 ## Deployment Operations
 
 ### `scaleDeployment(namespace, name, replicas)`
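The parallel, best-effort behavior documented for `collectDiagnosticLogs` (collectors run via `Promise.allSettled`; failed collectors and empty output produce no files) can be sketched as follows. `collectAll` and its collector map are hypothetical stand-ins for the real `kubectl` invocations, not the library's API:

```typescript
// Sketch: run all resource collectors in parallel and tolerate individual
// failures (e.g. `routes` on a non-OpenShift cluster). Collectors here are
// hypothetical stand-ins for kubectl calls.
type Collector = () => Promise<string>;

async function collectAll(
  collectors: Record<string, Collector>,
): Promise<Record<string, string>> {
  const names = Object.keys(collectors);
  // allSettled never rejects, so one failing collector can't abort the rest.
  const results = await Promise.allSettled(names.map((n) => collectors[n]()));
  const out: Record<string, string> = {};
  results.forEach((r, i) => {
    // Skip failed collectors and empty output, mirroring the documented
    // "empty files are not created" behavior.
    if (r.status === "fulfilled" && r.value.trim().length > 0) {
      out[names[i]] = r.value;
    }
  });
  return out;
}
```

A caller would write each surviving entry to a file named after its key; resources missing from the cluster simply never appear in the result.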

docs/overlay/reference/troubleshooting.md

Lines changed: 34 additions & 0 deletions
@@ -271,6 +271,40 @@ oc login --token=<token> --server=<server>
 - Check route/service configuration
 - Verify network policies
 
+## Diagnostic Logs
+
+When tests fail, the `TeardownReporter` automatically collects cluster diagnostics and saves them to:
+
+```
+node_modules/.cache/e2e-test-results/logs/<project-name>/
+├── events.txt                # Namespace events (sorted by time)
+├── pods.txt                  # Pod status
+├── describe-pods.txt         # Full pod descriptions
+├── deployments.txt           # Deployment status
+├── describe-deployments.txt
+├── statefulsets.txt
+├── routes.txt                # OpenShift routes
+└── pods/
+    └── <pod-name>/
+        ├── <container>.log            # Current logs
+        └── <container>.previous.log   # Previous restart logs
+```
+
+This runs automatically on **both CI and local** — no configuration needed. Namespace deletion remains CI-only.
+
+**When using `run-e2e.sh`**, logs are written relative to the repo root. When running from a workspace (`cd workspaces/my-plugin/e2e-tests && yarn test`), they're relative to the `e2e-tests/` directory.
+
+**Logs are only collected for projects with failures.** If all tests pass, no diagnostic logs are written.
+
+To collect diagnostics manually (e.g., from a custom script):
+
+```typescript
+import { KubernetesClientHelper } from "@red-hat-developer-hub/e2e-test-utils/utils";
+
+const k8sClient = new KubernetesClientHelper();
+await k8sClient.collectDiagnosticLogs("my-namespace", "./my-logs");
+```
+
 ## Debugging Tips
 
 ### Use Headed Mode

package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "@red-hat-developer-hub/e2e-test-utils",
-  "version": "1.1.33",
+  "version": "1.1.34",
   "description": "Test utilities for RHDH E2E tests",
   "license": "Apache-2.0",
   "repository": {

src/playwright/teardown-reporter.ts

Lines changed: 45 additions & 17 deletions
@@ -4,28 +4,37 @@ import type {
   TestCase,
   TestResult,
 } from "@playwright/test/reporter";
+import path from "path";
 import { KubernetesClientHelper } from "../utils/kubernetes-client.js";
 import { getTeardownNamespaces } from "./teardown-namespaces.js";
 
 /**
- * Playwright reporter that deletes namespaces per-project as soon as all tests
- * in that project finish. This frees cluster resources early instead of waiting
- * for the entire suite to complete.
+ * Playwright reporter that collects diagnostic logs on failure and deletes
+ * namespaces per-project as soon as all tests in that project finish.
+ *
+ * Why a reporter (not afterAll / worker fixture teardown):
+ * - afterAll runs when a worker dies. On test failure Playwright kills the
+ *   worker for retries, so afterAll deletes the namespace before the retry.
+ * - Worker fixture teardown has the same problem — it fires on worker exit.
+ * - A reporter runs in the main Playwright process, survives worker restarts,
+ *   and can track per-project completion including retries.
  *
  * Handles retries: a test is only counted as done when it passes/is skipped,
  * or exhausts all retry attempts.
  *
  * Falls back in onEnd() to clean up any projects that didn't complete naturally
  * (e.g., interrupted runs, maxFailures).
  *
- * Only active when process.env.CI === "true".
+ * Diagnostic log collection runs always (CI and local).
+ * Namespace deletion only runs when process.env.CI === "true".
  *
  * By default, deletes the namespace matching the project name.
  * For custom namespaces, consumers can register them via registerTeardownNamespace().
  */
 export default class TeardownReporter implements Reporter {
   private _projectTestCounts = new Map<string, number>();
   private _projectCompleted = new Map<string, number>();
+  private _projectsWithFailures = new Set<string>();
   private _pendingDeletions = new Map<string, Promise<void>>();
 
   onBegin(_config: unknown, suite: Suite): void {
@@ -42,8 +51,6 @@ export default class TeardownReporter implements Reporter {
   }
 
   onTestEnd(test: TestCase, result: TestResult): void {
-    if (process.env.CI !== "true") return;
-
     const project = test.parent.project();
     if (!project) return;
 
@@ -55,10 +62,15 @@ export default class TeardownReporter implements Reporter {
     if (!isDone) return;
 
     const name = project.name;
+
+    if (result.status !== "passed" && result.status !== "skipped") {
+      this._projectsWithFailures.add(name);
+    }
+
     const completed = (this._projectCompleted.get(name) ?? 0) + 1;
     this._projectCompleted.set(name, completed);
 
-    // Start deletion immediately (fire-and-forget here, awaited in onEnd)
+    // Start cleanup immediately (fire-and-forget here, awaited in onEnd)
     if (
       completed === this._projectTestCounts.get(name) &&
       !this._pendingDeletions.has(name)
@@ -68,15 +80,14 @@ export default class TeardownReporter implements Reporter {
   }
 
   async onEnd(): Promise<void> {
-    if (process.env.CI !== "true") return;
-
-    // Await all in-flight deletions started from onTestEnd
+    // Await all in-flight cleanups started from onTestEnd
     await Promise.all(this._pendingDeletions.values());
 
     // Fallback: clean up projects that didn't complete naturally
-    // (e.g., interrupted run, maxFailures hit)
+    // (e.g., interrupted run, maxFailures hit) — always collect diagnostics
     for (const [project] of this._projectTestCounts) {
       if (!this._pendingDeletions.has(project)) {
+        this._projectsWithFailures.add(project);
         await this._deleteProjectNamespaces(project);
       }
     }
@@ -88,7 +99,7 @@ export default class TeardownReporter implements Reporter {
       k8sClient = new KubernetesClientHelper();
     } catch (error) {
       console.error(
-        `[TeardownReporter] Cannot connect to cluster, skipping teardown:`,
+        `[TeardownReporter] Cannot connect to cluster, skipping cleanup:`,
         error,
       );
       return;
@@ -98,11 +109,28 @@ export default class TeardownReporter implements Reporter {
     const namespaces =
       customNamespaces.length > 0 ? customNamespaces : [projectName];
 
-    for (const ns of namespaces) {
-      console.log(
-        `[TeardownReporter] Deleting namespace "${ns}" (project: ${projectName})`,
-      );
-      await k8sClient.deleteNamespace(ns);
+    // Collect diagnostic logs on failure (always, regardless of CI)
+    if (this._projectsWithFailures.has(projectName)) {
+      for (const ns of namespaces) {
+        const outputDir = path.join(
+          "node_modules",
+          ".cache",
+          "e2e-test-results",
+          "logs",
+          projectName,
+        );
+        await k8sClient.collectDiagnosticLogs(ns, outputDir);
+      }
+    }
+
+    // Delete namespaces only in CI
+    if (process.env.CI === "true") {
+      for (const ns of namespaces) {
+        console.log(
+          `[TeardownReporter] Deleting namespace "${ns}" (project: ${projectName})`,
        );
+        await k8sClient.deleteNamespace(ns);
+      }
     }
   }
 }
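The retry handling the reporter's doc comment describes (a test counts as done only once it passes, is skipped, or exhausts its retry attempts) reduces to a small predicate. This standalone sketch uses an assumed signature for illustration, not the reporter's actual private logic:

```typescript
// Sketch: decide whether a test result is final under Playwright retries.
// The signature is illustrative only; the real reporter derives this from
// TestCase/TestResult objects.
type Status = "passed" | "failed" | "timedOut" | "interrupted" | "skipped";

function isTestDone(
  status: Status,
  retryIndex: number, // which attempt this result is (0-based)
  maxRetries: number, // project-configured retry count
): boolean {
  // Passed/skipped results are never retried, so they are always final.
  if (status === "passed" || status === "skipped") return true;
  // A failure is only final once every retry attempt has been used;
  // otherwise Playwright will restart the worker and try again.
  return retryIndex >= maxRetries;
}
```

This is why the reporter counts a project's tests with a predicate like this before starting cleanup: deleting the namespace after a non-final failure would break the retry.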
