Commit aae3956
feat: add diagnostic log collection on test failure
Adds collectDiagnosticLogs() to KubernetesClientHelper that captures cluster state (events, pods, deployments, statefulsets, routes, per-container pod logs) to files on test failure. TeardownReporter now tracks failed projects and collects diagnostics before namespace deletion. Log collection runs on both CI and local; namespace deletion remains CI-only. Bumps version to 1.1.34.
1 parent: 69cc164

8 files changed, 249 additions and 21 deletions
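The commit message describes two independent gates: diagnostic collection (runs on failure, on both CI and local) and namespace deletion (CI only). A minimal sketch of that decision logic, using an illustrative helper name that is not part of the library:

```typescript
// Illustrative sketch of the teardown gating described in the commit message.
// `teardownActions` is a hypothetical helper, not an e2e-test-utils API.
function teardownActions(hasFailures: boolean, isCI: boolean): string[] {
  const actions: string[] = [];
  // Diagnostic collection runs on both CI and local, but only on failure.
  if (hasFailures) actions.push("collectDiagnosticLogs");
  // Namespace deletion stays gated on CI.
  if (isCI) actions.push("deleteNamespace");
  return actions;
}
```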


docs/.vitepress/config.ts

Lines changed: 1 addition & 1 deletion

```diff
@@ -33,7 +33,7 @@ export default defineConfig({
       { text: "Examples", link: "/examples/" },
       { text: "Overlay Testing", link: "/overlay/" },
       {
-        text: "v1.1.33",
+        text: "v1.1.34",
         items: [{ text: "Changelog", link: "/changelog" }],
       },
     ],
```

docs/changelog.md

Lines changed: 16 additions & 1 deletion

```diff
@@ -2,12 +2,27 @@
 
 All notable changes to this project will be documented in this file.
 
-## [1.1.33] - Current
+## [1.1.34] - Current
+
+### Added
+
+- **Diagnostic log collection on failure**: `collectDiagnosticLogs(namespace, outputDir?)` on `KubernetesClientHelper` captures comprehensive cluster state (events, pod status, deployments, statefulsets, routes, and per-container pod logs including init containers and previous restarts) to files under `node_modules/.cache/e2e-test-results/logs/<namespace>/`. Uses `kubectl` for cross-platform compatibility. Empty files (e.g. no previous logs) are not created.
+- **TeardownReporter collects diagnostics on test failure**: When any test in a project fails, the teardown reporter automatically calls `collectDiagnosticLogs` before namespace deletion. Diagnostic collection runs on both CI and local; namespace deletion remains CI-only.
+- **Per-container pod log collection**: Logs are collected per-container (init + app containers) instead of `--all-containers`, which fails entirely if any container hasn't started. Files saved to `pods/<pod-name>/<container-name>.log` and `pods/<pod-name>/<container-name>.previous.log`.
+
+### Changed
+
+- **TeardownReporter tracks test failures**: Added `_projectsWithFailures` set to track which projects had test failures, so diagnostic logs are only collected when needed.
+- **TeardownReporter active on non-CI**: The reporter now processes `onTestEnd`/`onEnd` regardless of the `CI` env var. Log collection always runs; namespace deletion is still gated on `CI=true`.
+
+## [1.1.33]
 
 ### Added
 
 - **Automatic Vault secret loading for local development**: Set `VAULT=1` or `VAULT=true` to automatically fetch secrets from HashiCorp Vault during global setup. Handles OIDC login, fetches global and per-workspace secrets, and injects them into `process.env`. Only secret key names are logged, never values. Configurable via `VAULT_ADDR` and `VAULT_BASE_PATH` env vars. Logs a Slack channel (`#rhdh-e2e-tests`) when permission is denied.
 
+## [1.1.32]
+
 ### Fixed
 
 - **Normalize `-dynamic` suffix in `extractPluginName`**: Plugins whose metadata `dynamicArtifact` is a local path (ending in `-dynamic`) were not matched during PR OCI resolution or config injection, because the metadata map key included the `-dynamic` suffix while OCI URL lookups did not. `extractPluginName` now strips the `-dynamic` suffix so local paths and OCI refs for the same logical plugin produce the same key. ([RHDHBUGS-2987](https://issues.redhat.com/browse/RHDHBUGS-2987))
```

docs/guide/core-concepts/error-handling.md

Lines changed: 8 additions & 0 deletions

````diff
@@ -280,6 +280,14 @@ await page.click('button[data-testid="save"]');
 await expect(page.getByText("Saved")).toBeVisible();
 ```
 
+## Cluster Diagnostic Logs
+
+When tests fail, the framework automatically collects cluster diagnostics (pod logs, events, deployments) to `node_modules/.cache/e2e-test-results/logs/<namespace>/`. This includes per-container logs for all pods (init and app containers), with previous restart logs when available.
+
+Check these files first when debugging deployment or pod failures — they're often more useful than Playwright's HTML report for infrastructure issues.
+
+See [Kubernetes Client — Diagnostic Log Collection](/guide/utilities/kubernetes-client#diagnostic-log-collection) for the full list of collected resources and API details.
+
 ## Error Handling Checklist
 
 - [ ] Use specific error messages that include context
````

docs/guide/utilities/kubernetes-client.md

Lines changed: 38 additions & 0 deletions

````diff
@@ -121,6 +121,44 @@ When a failure is detected, the method:
 2. Fetches container logs via `oc logs`
 3. Throws an error with the failure details
 
+## Diagnostic Log Collection
+
+### `collectDiagnosticLogs(namespace, outputDir?)`
+
+Collects comprehensive cluster diagnostics and saves them to files. Uses `kubectl` for cross-platform compatibility (OpenShift, EKS, GKE, etc.). OpenShift-specific resources (routes) are collected on a best-effort basis.
+
+```typescript
+await k8sClient.collectDiagnosticLogs("my-namespace");
+// Saves to: node_modules/.cache/e2e-test-results/logs/my-namespace/
+
+// Or with a custom output directory:
+await k8sClient.collectDiagnosticLogs("my-namespace", "/tmp/debug-logs");
+```
+
+**Collected resources:**
+
+| File | Content |
+|------|---------|
+| `events.txt` | Namespace events sorted by timestamp |
+| `pods.txt` | Pod status (`kubectl get pods -o wide`) |
+| `describe-pods.txt` | Full pod descriptions |
+| `deployments.txt` | Deployment status |
+| `describe-deployments.txt` | Full deployment descriptions |
+| `statefulsets.txt` | StatefulSet status |
+| `routes.txt` | OpenShift routes (skipped on non-OpenShift clusters) |
+| `pods/<pod>/<container>.log` | Current logs per container (init + app) |
+| `pods/<pod>/<container>.previous.log` | Previous restart logs (only if pod restarted) |
+
+**Key behaviors:**
+- Logs are collected per-container rather than `--all-containers`, so a failed init container doesn't block collection of other container logs
+- Empty files are not created (e.g., when there are no previous logs)
+- Resource types that don't exist on the cluster (e.g., routes on non-OpenShift) are silently skipped
+- All resource collection runs in parallel via `Promise.allSettled`
+
+**Automatic collection on test failure:**
+
+In the overlay testing flow, you don't need to call this manually. The built-in `TeardownReporter` automatically calls `collectDiagnosticLogs` for any project that had test failures. This works on both CI and local runs.
+
 ## Deployment Operations
 
 ### `scaleDeployment(namespace, name, replicas)`
````
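As a rough illustration of the `pods/<pod>/<container>.log` layout documented in that section, the per-container file paths could be derived like this (a sketch; `containerLogPaths` and the container names in the usage example are illustrative, not part of the library):

```typescript
import path from "node:path";

// Sketch of the on-disk layout described in the collected-resources table:
// each container of a pod gets a current-log file and a previous-restart file.
function containerLogPaths(
  outputDir: string,
  podName: string,
  containerNames: string[],
): string[] {
  const podDir = path.join(outputDir, "pods", podName);
  return containerNames.flatMap((name) => [
    path.join(podDir, `${name}.log`),          // current logs
    path.join(podDir, `${name}.previous.log`), // previous restart, if any
  ]);
}
```

In the real helper, the `.previous.log` file is only written when `kubectl logs --previous` actually returns output.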

docs/overlay/reference/troubleshooting.md

Lines changed: 34 additions & 0 deletions

````diff
@@ -271,6 +271,40 @@ oc login --token=<token> --server=<server>
 - Check route/service configuration
 - Verify network policies
 
+## Diagnostic Logs
+
+When tests fail, the `TeardownReporter` automatically collects cluster diagnostics and saves them to:
+
+```
+node_modules/.cache/e2e-test-results/logs/<project-name>/
+├── events.txt                 # Namespace events (sorted by time)
+├── pods.txt                   # Pod status
+├── describe-pods.txt          # Full pod descriptions
+├── deployments.txt            # Deployment status
+├── describe-deployments.txt
+├── statefulsets.txt
+├── routes.txt                 # OpenShift routes
+└── pods/
+    └── <pod-name>/
+        ├── <container>.log           # Current logs
+        └── <container>.previous.log  # Previous restart logs
+```
+
+This runs automatically on **both CI and local** — no configuration needed. Namespace deletion remains CI-only.
+
+**When using `run-e2e.sh`**, logs are written relative to the repo root. When running from a workspace (`cd workspaces/my-plugin/e2e-tests && yarn test`), they're relative to the `e2e-tests/` directory.
+
+**Logs are only collected for projects with failures.** If all tests pass, no diagnostic logs are written.
+
+To collect diagnostics manually (e.g., from a custom script):
+
+```typescript
+import { KubernetesClientHelper } from "@red-hat-developer-hub/e2e-test-utils/utils";
+
+const k8sClient = new KubernetesClientHelper();
+await k8sClient.collectDiagnosticLogs("my-namespace", "./my-logs");
+```
+
 ## Debugging Tips
 
 ### Use Headed Mode
````

package.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,6 +1,6 @@
 {
   "name": "@red-hat-developer-hub/e2e-test-utils",
-  "version": "1.1.33",
+  "version": "1.1.34",
   "description": "Test utilities for RHDH E2E tests",
   "license": "Apache-2.0",
   "repository": {
```

src/playwright/teardown-reporter.ts

Lines changed: 36 additions & 14 deletions

```diff
@@ -4,6 +4,7 @@ import type {
   TestCase,
   TestResult,
 } from "@playwright/test/reporter";
+import path from "path";
 import { KubernetesClientHelper } from "../utils/kubernetes-client.js";
 import { getTeardownNamespaces } from "./teardown-namespaces.js";
 
@@ -18,14 +19,16 @@ import { getTeardownNamespaces } from "./teardown-namespaces.js";
  * Falls back in onEnd() to clean up any projects that didn't complete naturally
  * (e.g., interrupted runs, maxFailures).
  *
- * Only active when process.env.CI === "true".
+ * Diagnostic log collection runs always (CI and local).
+ * Namespace deletion only runs when process.env.CI === "true".
  *
  * By default, deletes the namespace matching the project name.
  * For custom namespaces, consumers can register them via registerTeardownNamespace().
  */
 export default class TeardownReporter implements Reporter {
   private _projectTestCounts = new Map<string, number>();
   private _projectCompleted = new Map<string, number>();
+  private _projectsWithFailures = new Set<string>();
   private _pendingDeletions = new Map<string, Promise<void>>();
 
   onBegin(_config: unknown, suite: Suite): void {
@@ -42,8 +45,6 @@ export default class TeardownReporter implements Reporter {
   }
 
   onTestEnd(test: TestCase, result: TestResult): void {
-    if (process.env.CI !== "true") return;
-
     const project = test.parent.project();
     if (!project) return;
 
@@ -55,10 +56,15 @@ export default class TeardownReporter implements Reporter {
     if (!isDone) return;
 
     const name = project.name;
+
+    if (result.status !== "passed" && result.status !== "skipped") {
+      this._projectsWithFailures.add(name);
+    }
+
     const completed = (this._projectCompleted.get(name) ?? 0) + 1;
     this._projectCompleted.set(name, completed);
 
-    // Start deletion immediately (fire-and-forget here, awaited in onEnd)
+    // Start cleanup immediately (fire-and-forget here, awaited in onEnd)
     if (
       completed === this._projectTestCounts.get(name) &&
       !this._pendingDeletions.has(name)
@@ -68,15 +74,14 @@ export default class TeardownReporter implements Reporter {
   }
 
   async onEnd(): Promise<void> {
-    if (process.env.CI !== "true") return;
-
-    // Await all in-flight deletions started from onTestEnd
+    // Await all in-flight cleanups started from onTestEnd
     await Promise.all(this._pendingDeletions.values());
 
     // Fallback: clean up projects that didn't complete naturally
-    // (e.g., interrupted run, maxFailures hit)
+    // (e.g., interrupted run, maxFailures hit) — always collect diagnostics
     for (const [project] of this._projectTestCounts) {
       if (!this._pendingDeletions.has(project)) {
+        this._projectsWithFailures.add(project);
         await this._deleteProjectNamespaces(project);
       }
     }
@@ -88,7 +93,7 @@ export default class TeardownReporter implements Reporter {
       k8sClient = new KubernetesClientHelper();
     } catch (error) {
       console.error(
-        `[TeardownReporter] Cannot connect to cluster, skipping teardown:`,
+        `[TeardownReporter] Cannot connect to cluster, skipping cleanup:`,
         error,
       );
       return;
@@ -98,11 +103,28 @@ export default class TeardownReporter implements Reporter {
     const namespaces =
       customNamespaces.length > 0 ? customNamespaces : [projectName];
 
-    for (const ns of namespaces) {
-      console.log(
-        `[TeardownReporter] Deleting namespace "${ns}" (project: ${projectName})`,
-      );
-      await k8sClient.deleteNamespace(ns);
+    // Collect diagnostic logs on failure (always, regardless of CI)
+    if (this._projectsWithFailures.has(projectName)) {
+      for (const ns of namespaces) {
+        const outputDir = path.join(
+          "node_modules",
+          ".cache",
+          "e2e-test-results",
+          "logs",
+          projectName,
+        );
+        await k8sClient.collectDiagnosticLogs(ns, outputDir);
+      }
+    }
+
+    // Delete namespaces only in CI
+    if (process.env.CI === "true") {
+      for (const ns of namespaces) {
+        console.log(
+          `[TeardownReporter] Deleting namespace "${ns}" (project: ${projectName})`,
+        );
+        await k8sClient.deleteNamespace(ns);
+      }
     }
   }
 }
```

src/utils/kubernetes-client.ts

Lines changed: 115 additions & 4 deletions

```diff
@@ -629,15 +629,12 @@ class KubernetesClientHelper {
       await new Promise((r) => setTimeout(r, pollIntervalMs));
     }
 
-    // Timeout reached - collect diagnostic info before throwing
+    // Timeout reached - print diagnostics to stdio before throwing
     console.log(`\n[K8sHelper] ═══ Pod Diagnostics (timeout reached) ═══`);
     try {
       console.log(`\n[K8sHelper] ─── Pod Status ───`);
       await $`oc get pods -n ${namespace} -l ${labelSelector} -o wide`;
 
-      console.log(`\n[K8sHelper] ─── Namespace Events ───`);
-      await $`oc get events -n ${namespace} --sort-by='.lastTimestamp'`;
-
       console.log(`\n[K8sHelper] ─── Pod Logs ───`);
       await $`oc logs -n ${namespace} -l ${labelSelector} --all-containers --tail=100 2>&1 || true`;
     } catch {
@@ -650,6 +647,120 @@ class KubernetesClientHelper {
     );
   }
 
+  /**
+   * Collects diagnostic logs for all resources in a namespace and saves them as files.
+   * Uses kubectl for cross-platform compatibility (works on OpenShift, EKS, GKE, etc.).
+   * OpenShift-specific resources (routes) are collected on a best-effort basis.
+   *
+   * @param namespace - Namespace to collect diagnostics from
+   * @param outputDir - Directory to write log files to (defaults to node_modules/.cache/e2e-test-results/logs/<namespace>)
+   */
+  async collectDiagnosticLogs(
+    namespace: string,
+    outputDir: string = path.join(
+      "node_modules",
+      ".cache",
+      "e2e-test-results",
+      "logs",
+      namespace,
+    ),
+  ): Promise<void> {
+    fs.mkdirSync(outputDir, { recursive: true });
+    console.log(
+      `[K8sHelper] Collecting diagnostic logs for "${namespace}" → ${outputDir}`,
+    );
+    const quiet = $({
+      stdio: ["pipe", "pipe", "pipe"],
+      timeout: "20s",
+    });
+
+    const save = async (filePath: string, cmd: Promise<{ stdout: string }>) => {
+      try {
+        const result = await cmd;
+        fs.mkdirSync(path.dirname(filePath), { recursive: true });
+        fs.writeFileSync(filePath, result.stdout);
+      } catch {
+        // ignore — resource type may not exist on this cluster
+      }
+    };
+
+    await Promise.allSettled([
+      save(
+        path.join(outputDir, "events.txt"),
+        quiet`kubectl get events -n ${namespace} --sort-by='.lastTimestamp'`,
+      ),
+      save(
+        path.join(outputDir, "pods.txt"),
+        quiet`kubectl get pods -n ${namespace} -o wide`,
+      ),
+      save(
+        path.join(outputDir, "describe-pods.txt"),
+        quiet`kubectl describe pods -n ${namespace}`,
+      ),
+      save(
+        path.join(outputDir, "deployments.txt"),
+        quiet`kubectl get deployments -n ${namespace} -o wide`,
+      ),
+      save(
+        path.join(outputDir, "describe-deployments.txt"),
+        quiet`kubectl describe deployments -n ${namespace}`,
+      ),
+      save(
+        path.join(outputDir, "statefulsets.txt"),
+        quiet`kubectl get statefulsets -n ${namespace} -o wide`,
+      ),
+      save(
+        path.join(outputDir, "routes.txt"),
+        quiet`kubectl get routes -n ${namespace} -o wide`,
+      ),
+    ]);
+
+    try {
+      const pods = (await this._k8sApi.listNamespacedPod({ namespace })).items;
+      const saveLogs = async (
+        filePath: string,
+        cmd: Promise<{ stdout: string }>,
+      ) => {
+        try {
+          const result = await cmd;
+          if (result.stdout.trim()) {
+            fs.mkdirSync(path.dirname(filePath), { recursive: true });
+            fs.writeFileSync(filePath, result.stdout);
+          }
+        } catch {
+          // ignore — container may not have started or no previous logs
+        }
+      };
+
+      await Promise.allSettled(
+        pods
+          .filter((pod) => pod.metadata?.name)
+          .flatMap((pod) => {
+            const podName = pod.metadata!.name!;
+            const podDir = path.join(outputDir, "pods", podName);
+            const containers = [
+              ...(pod.spec?.initContainers ?? []),
+              ...(pod.spec?.containers ?? []),
+            ];
+            return containers
+              .filter((c) => c.name)
+              .flatMap((c) => [
+                saveLogs(
+                  path.join(podDir, `${c.name}.log`),
+                  quiet`kubectl logs ${podName} -n ${namespace} -c ${c.name}`,
+                ),
+                saveLogs(
+                  path.join(podDir, `${c.name}.previous.log`),
+                  quiet`kubectl logs ${podName} -n ${namespace} -c ${c.name} --previous`,
+                ),
+              ]);
+          }),
+      );
+    } catch {
+      // ignore
+    }
+  }
+
   /**
    * Check if a pod is in a failure state. Returns failure info or null if healthy.
    */
```
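The `save` helper plus `Promise.allSettled` pattern in the new method (run every collector in parallel, keep non-empty output, swallow individual failures) reduces to a small sketch. The `collectAll` helper below is illustrative, not part of the library:

```typescript
// Minimal sketch of the best-effort collection pattern used above:
// all collectors run concurrently; a throwing or empty collector is
// simply skipped, mirroring how missing resource types and absent
// previous logs are ignored.
async function collectAll(
  collectors: Record<string, () => Promise<string>>,
): Promise<Record<string, string>> {
  const results: Record<string, string> = {};
  await Promise.allSettled(
    Object.entries(collectors).map(async ([name, fn]) => {
      const out = await fn();
      if (out.trim()) results[name] = out; // skip empty output (no file written)
    }),
  );
  return results;
}
```

`Promise.allSettled` never rejects, so one failing `kubectl` invocation cannot abort the other collectors, which is exactly the property the per-container approach relies on.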
