Commit 441bcb2

ci: Improve che-happy-path test reliability with retry logic and health checks (#1581)
* Improve Che happy-path test reliability with retry logic and health checks

  This commit enhances the `.ci/oci-devworkspace-happy-path.sh` script to significantly improve test reliability in CI environments by adding:

  - Health checks for DWO and Che deployments using kubectl wait
  - Retry logic with exponential backoff (2 retries, 60s base delay)
  - Comprehensive artifact collection on failures
  - Graceful error handling and cleanup between retries
  - Clear error messages with stage identification

  The improvements address flakiness in the v14-che-happy-path Prow test by handling transient failures (image pull timeouts, API server issues, operator reconciliation delays) and providing detailed diagnostics for genuine failures.

  Key features:
  - DWO verification: Waits for deployment condition=available
  - Che verification: Waits for CheCluster condition=Available
  - Retry strategy: 2 attempts with exponential backoff + jitter
  - Artifact collection: Operator logs, CheCluster CR, pod info, events
  - Cleanup: Deletes failed deployments before retry
  - Realistic timeouts: 24 hours (86400s) for pod wait/ready

  Expected impact: Reduce CI flakiness from ~50% to >90% success rate for infrastructure-related failures, with significantly better diagnostics.

  Assisted-by: Claude Sonnet 4.5 <noreply@anthropic.com>
  Co-Authored-By: Oleksii Kurinnyi <okurinny@redhat.com>
  Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>

* fixup! Improve Che happy-path test reliability with retry logic and health checks

  Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>

* fixup! fixup! Improve Che happy-path test reliability with retry logic and health checks

  Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>

* fixup! fixup! fixup! Improve Che happy-path test reliability with retry logic and health checks

  - Remove broken verifyCheDeployment() (CheCluster has no condition=Available)
  - Fix exit 1 -> return 1 in main()
  - Fix cleanup: use kubectl wait --for=delete instead of sleep 10
  - Add retry for happy-path test (1 retry with 30s delay)
  - Add failure classification, reporting, and PR commenting
  - Fix pipe delimiter injection, variable quoting, artifact overwrite
  - Update README to match code changes

  Assisted-by: Claude Opus 4.6 (1M context)
  Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>

* fixup! fixup! fixup! fixup! Improve Che happy-path test reliability with retry logic and health checks

  Add CHE_REPO_BRANCH format validation to prevent injection

  Assisted-by: Claude Opus 4.6 (1M context)
  Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>

---------

Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
1 parent 2d6966e commit 441bcb2

File tree

2 files changed (+815, -13 lines)

.ci/README-CHE-HAPPY-PATH.md

Lines changed: 302 additions & 0 deletions
# Che Happy-Path Test

**Script**: `.ci/oci-devworkspace-happy-path.sh`

**Purpose**: Integration test validating the DevWorkspace Operator with an Eclipse Che deployment

## Overview

This script deploys and validates the full DevWorkspace Operator + Eclipse Che stack on OpenShift, ensuring the happy-path user workflow succeeds. It is used in the `v14-che-happy-path` Prow CI test.
## Features

### Retry Logic

- **Che deployment**: 2 attempts with exponential backoff (60s base delay + jitter)
- **Cleanup**: waits for CheCluster CR deletion before retrying
- **Happy-path test**: 1 retry with a 30s delay if the Selenium test fails
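The retry schedule above can be sketched as a small helper. This is a minimal illustration, not the script's actual code: the function name `retryWithBackoff` is an assumption, though the variable names mirror the configuration table below.

```shell
#!/usr/bin/env bash
# Sketch of retry with exponential backoff plus jitter (illustrative only).
# Defaults match the configuration table in this README.
MAX_RETRIES="${MAX_RETRIES:-2}"
BASE_DELAY="${BASE_DELAY:-60}"
MAX_JITTER="${MAX_JITTER:-15}"

retryWithBackoff() {
  local attempt=1
  local delay="$BASE_DELAY"
  while [ "$attempt" -le "$MAX_RETRIES" ]; do
    echo "Attempt ${attempt}/${MAX_RETRIES}"
    if "$@"; then
      return 0            # command succeeded, no retry needed
    fi
    if [ "$attempt" -lt "$MAX_RETRIES" ]; then
      # sleep for the current delay plus random jitter in [0, MAX_JITTER)
      local pause=$(( delay + RANDOM % MAX_JITTER ))
      echo "Retrying in ${pause}s..."
      sleep "$pause"
      delay=$(( delay * 2 ))   # exponential backoff: delay doubles each attempt
    fi
    attempt=$(( attempt + 1 ))
  done
  return 1                # all attempts exhausted
}
```

Used as `retryWithBackoff deployChe`, this produces the "Attempt 1/2" and "Retrying in 71s..." progress lines described under Error Handling below.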
### Health Checks

- **OLM**: verifies `catalog-operator` and `olm-operator` are available before Che deployment (2-minute timeout each)
- **DWO**: waits for `deployment condition=available` (5-minute timeout)
- **Che**: chectl's built-in readiness checks ensure the deployment is healthy
### Artifact Collection

On each failure, the script collects:

- OLM diagnostics (Subscription, InstallPlan, CSV, CatalogSource)
- CatalogSource pod logs
- Che operator logs (last 1000 lines)
- CheCluster CR status (full YAML)
- All pod logs from the Che namespace
- Kubernetes events
- chectl server logs
### Error Handling

- Graceful error handling with stage-specific messages
- Progress indicators, e.g. "Attempt 1/2", "Retrying in 71s..."
- Failures are reported cleanly instead of crashing the script
## Configuration

Environment variables (all optional except `DEVWORKSPACE_OPERATOR`):

| Variable | Default | Description |
|----------|---------|-------------|
| `CHE_NAMESPACE` | `eclipse-che` | Namespace for the Che deployment |
| `MAX_RETRIES` | `2` | Maximum retry attempts |
| `BASE_DELAY` | `60` | Base delay in seconds for exponential backoff |
| `MAX_JITTER` | `15` | Maximum jitter in seconds |
| `ARTIFACT_DIR` | `/tmp/dwo-e2e-artifacts` | Directory for diagnostic artifacts |
| `DEVWORKSPACE_OPERATOR` | (required) | DWO image to deploy |
## Usage

### In Prow CI

The script is called automatically by the `v14-che-happy-path` Prow job. Prow sets `DEVWORKSPACE_OPERATOR` based on the context:

**For PR checks** (testing PR code):

```bash
export DEVWORKSPACE_OPERATOR="quay.io/devfile/devworkspace-controller:pr-${PR_NUMBER}-${COMMIT_SHA}"
./.ci/oci-devworkspace-happy-path.sh
```

**For periodic/nightly runs** (testing the main branch):

```bash
export DEVWORKSPACE_OPERATOR="quay.io/devfile/devworkspace-controller:next"
./.ci/oci-devworkspace-happy-path.sh
```

### Local Testing

```bash
export DEVWORKSPACE_OPERATOR="quay.io/youruser/devworkspace-controller:your-tag"
export ARTIFACT_DIR="/tmp/my-test-artifacts"
./.ci/oci-devworkspace-happy-path.sh
```
## Test Flow

1. **Deploy DWO**
   - Runs `make install`
   - Waits for the controller deployment to become available
   - Collects artifacts if the deployment fails

2. **Deploy Che** (with retry)
   - Runs `chectl server:deploy` with extended timeouts (24h)
   - chectl handles readiness checks internally
   - Collects artifacts on failure
   - Cleans up and retries if needed

3. **Run Happy-Path Test**
   - Downloads the test script from the Eclipse Che repository
   - Executes the Che happy-path workflow
   - Retries once after 30s if the test fails
   - Collects artifacts on failure
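The flow above can be condensed into a sketch of the script's top level. The stage names follow the stage names used in failure reports (`deployChe`, `runHappyPathTest`); the function bodies here are placeholders, not the real implementation.

```shell
#!/usr/bin/env bash
# Placeholder stages: in the real script these run `make install`,
# `chectl server:deploy` (with retry), and the downloaded happy-path test.
deployDWO()        { echo "deploying DWO"; }
deployChe()        { echo "deploying Che"; }
runHappyPathTest() { echo "running happy-path test"; }

main() {
  # Stages use `return 1` (not `exit 1`) so main() can report the failure cleanly.
  deployDWO        || { echo "ERROR: DWO controller is not ready" >&2; return 1; }
  deployChe        || { echo "ERROR: chectl server:deploy failed" >&2; return 1; }
  runHappyPathTest || { echo "ERROR: happy-path test failed" >&2;      return 1; }
  echo "All stages completed"
}

main
```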
## Exit Codes

- `0`: Success - all stages completed
- `1`: Failure - check `$ARTIFACT_DIR` for diagnostics
## Timeouts

| Component | Timeout | Purpose |
|-----------|---------|---------|
| DWO deployment | 5 minutes | Pod becomes available |
| chectl pod wait/ready | 24 hours | Generous allowance for slow environments |
## Common Failures

### OLM Infrastructure Not Ready

**Symptoms**: "ERROR: OLM infrastructure is not healthy, cannot proceed with Che deployment"

**Check**: `$ARTIFACT_DIR/olm-diagnostics-olm-check.yaml`

**Common causes**:
- OLM operators not running (`catalog-operator`, `olm-operator`)
- Cluster provisioning issues during bootstrap
- Resource constraints preventing OLM operator scheduling

**Resolution**: This indicates a fundamental cluster infrastructure issue. Check cluster health and the OLM operator logs before retrying.

### DWO Deployment Fails

**Symptoms**: "ERROR: DWO controller is not ready"

**Check**: `$ARTIFACT_DIR/devworkspace-controller-info/`

**Common causes**: Image pull errors, resource constraints, webhook conflicts

### Che Deployment Timeout

**Symptoms**: "ERROR: chectl server:deploy failed" with timeout-related messages

**Check**: `$ARTIFACT_DIR/che-operator-logs-attempt-*.log`, `$ARTIFACT_DIR/olm-diagnostics-attempt-*.yaml`, `$ARTIFACT_DIR/chectl-logs-attempt-*/`

**Common causes**:
- OLM subscription timeout (check `olm-diagnostics` for the subscription state)
- Database connection issues
- Image pull failures
- Operator reconciliation errors
- chectl timeout waiting for pods/resources to become ready

### Pod CrashLoopBackOff

**Symptoms**: "ERROR: chectl server:deploy failed"

**Check**: `$ARTIFACT_DIR/eclipse-che-info/` for pod logs

**Common causes**: Configuration errors, resource limits, TLS certificate issues

### OLM Subscription Stuck

**Symptoms**: Subscription timeout after 120 seconds with no resources created

**Check**: `$ARTIFACT_DIR/olm-diagnostics-attempt-*.yaml`, `$ARTIFACT_DIR/catalogsource-logs-attempt-*.log`

**Common causes**:
- CatalogSource pod not pulling/running
- InstallPlan not created (subscription cannot resolve dependencies)
- Cluster resource exhaustion preventing operator pod scheduling

**Resolution**: Check the OLM operator logs and CatalogSource pod status. See the "Advanced Troubleshooting" section for monitoring and alternative deployment options.
## Artifact Locations

After a failed test run:

```
$ARTIFACT_DIR/
├── attempt-log.txt
├── failure-report.json
├── failure-report.md
├── devworkspace-controller-info/
│   ├── <pod-name>-<container>.log
│   └── events.log
├── eclipse-che-info/
│   ├── <pod-name>-<container>.log
│   └── events.log
├── che-operator-logs-attempt-1.log
├── che-operator-logs-attempt-2.log
├── checluster-status-attempt-1.yaml
├── checluster-status-attempt-2.yaml
├── olm-diagnostics-attempt-1.yaml
├── olm-diagnostics-attempt-2.yaml
├── catalogsource-logs-attempt-1.log
├── catalogsource-logs-attempt-2.log
├── chectl-logs-attempt-1/
└── chectl-logs-attempt-2/
```
## Dependencies

- `kubectl` - Kubernetes CLI
- `oc` - OpenShift CLI (for log collection)
- `chectl` - Eclipse Che CLI (v7.114.0+)
- `jq` - JSON processor (required by chectl)
## Advanced Troubleshooting

### OLM Infrastructure Issues

If you experience persistent OLM subscription timeouts (see the `olm-diagnostics-*.yaml` artifacts):

#### Option 1: OLM Health Check (Implemented)

The script now verifies OLM infrastructure health before deploying Che:

- Checks that `catalog-operator` is available
- Checks that `olm-operator` is available
- Verifies that `openshift-marketplace` is accessible

If OLM is unhealthy, the test fails fast with diagnostic artifacts instead of waiting through timeouts.
#### Option 2: Monitor Subscription Progress (Advanced)

For debugging stuck subscriptions, you can add active monitoring to detect zero-progress scenarios earlier:

```bash
# Example: monitor the subscription state every 10 seconds, for up to 300 seconds
elapsed=0
while [ $elapsed -lt 300 ]; do
  state=$(kubectl get subscription eclipse-che -n eclipse-che \
    -o jsonpath='{.status.state}' 2>/dev/null)
  echo "[$elapsed/300s] Subscription state: ${state:-unknown}"
  if [ "$state" = "AtLatestKnown" ]; then
    break
  fi
  sleep 10
  elapsed=$((elapsed + 10))
done
```

This helps identify whether subscriptions are progressing slowly or completely stuck.
#### Option 3: Skip OLM Installation (Alternative Approach)

For CI environments with persistent OLM issues, consider deploying the Che operator directly instead of via OLM:

```bash
# --installer=operator uses a direct YAML deployment instead of OLM
chectl server:deploy \
  --installer=operator \
  -p openshift \
  --batch \
  --telemetry=off \
  --skip-devworkspace-operator \
  --chenamespace="$CHE_NAMESPACE"
```

**Trade-offs**:
- ✅ Bypasses OLM infrastructure entirely
- ✅ More reliable in resource-constrained CI environments
- ❌ Doesn't test the OLM integration path (used by production OperatorHub)
- ❌ May miss OLM-specific issues

**When to use**: As a temporary workaround for CI infrastructure issues while OLM problems are being resolved.
### Subscription Timeout Issues

If OLM subscriptions consistently time out (visible in `olm-diagnostics-*.yaml`):

1. **Check the OLM operator logs**:
   ```bash
   kubectl logs -n openshift-operator-lifecycle-manager \
     deployment/catalog-operator --tail=100
   kubectl logs -n openshift-operator-lifecycle-manager \
     deployment/olm-operator --tail=100
   ```

2. **Verify the CatalogSource pod is running**:
   ```bash
   kubectl get pods -n openshift-marketplace \
     -l olm.catalogSource=eclipse-che
   kubectl logs -n openshift-marketplace \
     -l olm.catalogSource=eclipse-che
   ```

3. **Check InstallPlan creation**:
   ```bash
   kubectl get installplan -n eclipse-che -o yaml
   ```
   - If no InstallPlan exists, OLM could not resolve the subscription
   - If an InstallPlan exists but is not complete, check its status conditions
## CI Failure Reports

The script automatically generates failure reports and posts them as PR comments after each run (both failures and successes that needed retries). **Do not delete these comments**: they are used to track flakiness patterns across PRs.

### What gets reported

Each report includes a table of all attempts with:

- **Attempt**: the attempt number (e.g., `1/2`, `2/2`)
- **Stage**: which function failed (`deployChe`, `runHappyPathTest`, etc.)
- **Result**: `PASSED` or `FAILED`
- **Reason**: the classified failure reason (e.g., "Che operator reconciliation failure")
### Failure categories

| Category | Meaning | Retryable? |
|----------|---------|------------|
| `INFRA` | Infrastructure issue (OLM, image pull, operator reconciliation) | Yes — `/retest` |
| `TEST` | Test execution issue (Dashboard UI timeout, workspace start) | Maybe |
| `MIXED` | Both infrastructure and test issues across attempts | Yes — `/retest` |
| `UNKNOWN` | Could not classify — check artifacts | Investigate |
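A classification like this can be sketched as a pattern match over the failure reason string. The patterns below are illustrative assumptions; the script's actual classification rules may differ.

```shell
#!/usr/bin/env bash
# Sketch: map a failure-reason string to a category.
# Pattern lists are illustrative, not the script's real rules.
classifyFailure() {
  local reason="$1"
  case "$reason" in
    *"image pull"*|*"OLM"*|*"reconciliation"*|*"subscription"*) echo "INFRA" ;;
    *"Dashboard"*|*"workspace start"*|*"Selenium"*)             echo "TEST" ;;
    *)                                                          echo "UNKNOWN" ;;
  esac
}
```

A reason matching both lists would hit the first matching case; tracking mixed infrastructure/test failures across attempts (the `MIXED` category) requires aggregating per-attempt results rather than classifying a single string.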
### Report artifacts

Reports are always saved to `$ARTIFACT_DIR/` regardless of whether PR commenting succeeds:

- `failure-report.json` — structured data for programmatic analysis
- `failure-report.md` — human-readable markdown (same content as the PR comment)
- `attempt-log.txt` — raw attempt tracking log
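To summarize `attempt-log.txt` from a shell, something like the following works, assuming a pipe-delimited `attempt|stage|result|reason` layout (the exact format is an assumption; check the file before relying on it):

```shell
#!/usr/bin/env bash
# Build a sample attempt log (hypothetical format) and summarize it with awk.
log=$(mktemp)
printf '%s\n' \
  '1/2|deployChe|FAILED|OLM subscription timeout' \
  '2/2|deployChe|PASSED|-' > "$log"

# One summary line per attempt: "attempt N: stage -> result (reason)"
awk -F'|' '{ printf "attempt %s: %s -> %s (%s)\n", $1, $2, $3, $4 }' "$log"
```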
### Why these comments matter

Over time, these reports reveal:

- Which failure categories are most common
- Whether flakiness is improving or worsening
- Which infrastructure components are least reliable
- Whether the retry logic is effective (passed-on-retry patterns)
## Related Documentation

- [Eclipse Che Documentation](https://eclipse.dev/che/docs/)
- [chectl GitHub Repository](https://github.com/che-incubator/chectl)
- [OLM Troubleshooting Guide](https://olm.operatorframework.io/docs/troubleshooting/)
- [DevWorkspace Operator README](../README.md)
- [Contributing Guidelines](../CONTRIBUTING.md)
