Commit 664e936

kaovilai, claude, and weshayutin authored
Enhance Claude failure analysis with Velero source and must-gather feedback (#2051)
* Enhance Claude failure analysis with Velero source and must-gather feedback

  - Clone openshift/velero (oadp-dev branch) in ci-Dockerfile for source code investigation during failure analysis
  - Add Velero source code investigation prompts to analyze_failures.sh, enabling Claude to trace errors back to Velero implementation
  - Add must-gather improvement suggestions section to analysis output, creating a feedback loop for improving diagnostics collection
  - Add data mover volume restore limitation to error ignore patterns (claim Selector not supported per velero-io/velero#7946)

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add OADP operator source code to failure analysis prompts

  Enable Claude to investigate OADP operator source at /go/src/github.com/openshift/oadp-operator/ during failure analysis:

  - Add OADP operator source to Available Artifacts section
  - Rename "Velero Source Code Investigation" to "Source Code Investigation" with subsections for both Velero and OADP packages
  - Update Claude invocation prompt to reference OADP source
  - List key OADP packages: internal/controller/, pkg/velero/, pkg/credentials/, api/v1alpha1/, tests/e2e/lib/

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Wesley Hayutin <138787+weshayutin@users.noreply.github.com>
1 parent b23a69e commit 664e936

3 files changed

Lines changed: 80 additions & 2 deletions

build/ci-Dockerfile

Lines changed: 6 additions & 0 deletions
@@ -21,6 +21,12 @@ RUN curl -fsSL https://rpm.nodesource.com/setup_20.x | bash - && \
     npm install -g @anthropic-ai/claude-code && \
     dnf clean all

+# Clone openshift/velero source code for failure analysis
+# Uses oadp-dev branch to match OADP operator development
+RUN git clone --depth 1 --branch oadp-dev \
+    https://github.com/openshift/velero.git \
+    /go/src/github.com/openshift/velero
+
 RUN go mod download && \
     mkdir -p $(go env GOCACHE) && \
     chmod -R 777 ./ $(go env GOCACHE) $(go env GOPATH)

tests/e2e/lib/flakes.go

Lines changed: 3 additions & 1 deletion
@@ -22,7 +22,9 @@ var errorIgnorePatterns = []string{
 	"level=error msg=\"error patch for managed fields ",
 	"VolumeSnapshot has a temporary error Failed to create snapshot: error updating status for volume snapshot content snapcontent-",
 	"Skipping hypershift plugin execution - not a hypershift backup: error checking for HostedControlPlane CRD",
-	"claim Selector is not supported",
+
+	// Data mover volume restore limitation per https://github.com/vmware-tanzu/velero/issues/7946#issuecomment-2196590014
+	"failed to restore volume with StorageClass, claim Selector is not supported",
 }

 type FlakePattern struct {
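
The change above replaces a short ignore pattern with a longer, more specific one. The diff does not show how `errorIgnorePatterns` is consumed, so the following is only a minimal sketch that assumes plain substring matching. `shouldIgnore` is a hypothetical helper, and both sample log lines are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// errorIgnorePatterns copies two entries from the diff above.
// The real list in tests/e2e/lib/flakes.go is longer.
var errorIgnorePatterns = []string{
	"Skipping hypershift plugin execution - not a hypershift backup: error checking for HostedControlPlane CRD",
	"failed to restore volume with StorageClass, claim Selector is not supported",
}

// shouldIgnore is a hypothetical helper (not taken from the repository).
// It reports whether line contains any pattern as a plain substring.
func shouldIgnore(line string, patterns []string) bool {
	for _, p := range patterns {
		if strings.Contains(line, p) {
			return true
		}
	}
	return false
}

func main() {
	// Made-up log lines for illustration only.
	dataMover := `level=error msg="failed to restore volume with StorageClass, claim Selector is not supported"`
	unrelated := `level=error msg="claim Selector is not supported"`

	fmt.Println(shouldIgnore(dataMover, errorIgnorePatterns)) // matches the narrowed pattern
	fmt.Println(shouldIgnore(unrelated, errorIgnorePatterns)) // no longer matches
}
```

Under this assumption, the narrowed pattern ignores only the data mover restore message, while any other error that merely mentions `claim Selector is not supported` would again be reported.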

tests/e2e/scripts/analyze_failures.sh

Lines changed: 71 additions & 1 deletion
@@ -97,6 +97,8 @@ Read the log file and output a summary containing:

 5. **Correlation**: Group related errors together - if multiple errors reference the same resource (backup name, PVC, pod), keep them together with their context.

+6. **Source references**: When you find errors from Velero packages (pkg/backup/, pkg/restore/, pkg/controller/, pkg/nodeagent/), note the file:line references for later source code investigation.
+
 Format each error group as:
 --- [package/component name] ---
 [context lines from same package]
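
The new "Source references" instruction asks for file:line references from the four listed Velero packages. In this commit that extraction is left to Claude via the prompt. The sketch below only shows what the same step could look like in Go, assuming references appear in the log text as `pkg/<package>/<file>.go:<line>`. `SourceRef`, `extractSourceRefs`, and the sample log line are all hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// SourceRef is a hypothetical type holding one file:line reference.
type SourceRef struct {
	File string
	Line int
}

// sourceRefRe matches paths under the four packages named in the prompt,
// followed by a colon and a line number. The log format is an assumption.
var sourceRefRe = regexp.MustCompile(`(pkg/(?:backup|restore|controller|nodeagent)/[\w./-]+\.go):(\d+)`)

// extractSourceRefs returns every file:line reference found in logLine.
func extractSourceRefs(logLine string) []SourceRef {
	var refs []SourceRef
	for _, m := range sourceRefRe.FindAllStringSubmatch(logLine, -1) {
		n, err := strconv.Atoi(m[2])
		if err != nil {
			continue // skip line numbers too large to fit in an int
		}
		refs = append(refs, SourceRef{File: m[1], Line: n})
	}
	return refs
}

func main() {
	// Made-up log line for illustration only.
	line := `level=error msg="error restoring item" logSource="pkg/restore/restore.go:1234"`
	for _, r := range extractSourceRefs(line) {
		fmt.Printf("%s:%d\n", r.File, r.Line)
	}
}
```

References from packages outside the four listed ones are deliberately not matched, mirroring the scope of the prompt text.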
@@ -215,6 +217,14 @@ You are analyzing a failed OADP (OpenShift API for Data Protection) E2E test run
 4. **preprocessed-logs.txt**: Pre-extracted errors from large log files (>1MB)
 - Contains error summaries from large logs that were too big to analyze directly
 - Use this for quick access to relevant errors without reading full logs
+5. **Velero Source Code**: `/go/src/github.com/openshift/velero/`
+- OpenShift's fork of Velero with OADP-specific patches
+- Use to investigate error messages originating from Velero packages
+- Key directories: `pkg/backup/`, `pkg/restore/`, `pkg/controller/`, `pkg/nodeagent/`
+6. **OADP Operator Source Code**: `/go/src/github.com/openshift/oadp-operator/`
+- The OADP operator codebase being tested
+- Key directories: `internal/controller/`, `pkg/`, `api/v1alpha1/`
+- Use to investigate OADP-specific errors and reconciliation logic

 **Note**: Prow's build-log.txt is written by CI infrastructure after tests complete and is NOT available during this analysis. Use the artifacts listed above.

@@ -229,6 +239,35 @@ This file contains:

 Cross-reference failures against these patterns before diagnosing as real failures.

+## Source Code Investigation
+
+When analyzing failures, use the source code to understand error origins:
+
+1. Locate the error message in the source code
+2. Trace the code path that led to the error
+3. Identify what conditions trigger the error
+4. Check if the error is recoverable, transient, or indicates a real bug
+5. Look for related error handling or retry logic
+
+### Velero Source (`/go/src/github.com/openshift/velero/`)
+
+Key Velero packages:
+- `pkg/backup/` - Backup workflow and item processing
+- `pkg/restore/` - Restore workflow and item processing
+- `pkg/controller/` - Kubernetes controllers for backup/restore CRs
+- `pkg/nodeagent/` - Node agent (restic/kopia) operations
+- `pkg/persistence/` - Object storage operations
+- `pkg/plugin/` - Plugin framework and built-in plugins
+
+### OADP Operator Source (`/go/src/github.com/openshift/oadp-operator/`)
+
+Key OADP packages:
+- `internal/controller/` - DPA reconciler and other controllers
+- `pkg/velero/` - Velero deployment and configuration
+- `pkg/credentials/` - Cloud credential management
+- `api/v1alpha1/` - CRD type definitions
+- `tests/e2e/lib/` - E2E test utilities and flake patterns
+
 ## Analysis Tasks

 1. Parse junit_report.xml to identify all failed tests and extract failure messages
@@ -337,6 +376,27 @@ From must-gather analysis:
 2. Check if failures match existing GitHub issues
 3. Re-run flakes to confirm transient nature
 4. Investigate environmental issues in cluster/cloud provider
+
+## Must-Gather Improvement Suggestions
+
+If information was missing or incomplete during analysis, list what additional data would have helped:
+
+### Missing Data That Would Have Helped
+- <What was needed and why it would have helped diagnosis>
+- <Specific resource/log/metric that was missing>
+
+### Recommended Must-Gather Enhancements
+1. **<Category>**: <Specific improvement suggestion>
+- Current gap: <What's missing>
+- Suggested addition: <What to collect>
+- Example: <Concrete example of the data needed>
+
+Examples of potential improvements:
+- Additional pod logs (e.g., init containers, sidecar containers)
+- Specific CRD status fields not currently captured
+- Cluster-level resources affecting OADP (NetworkPolicies, ResourceQuotas)
+- Timing/metrics data (pod startup times, API latencies)
+- Cloud provider specific diagnostics (S3 bucket policies, IAM roles)
 ```

 ## Important Guidelines
@@ -349,6 +409,9 @@ From must-gather analysis:
 - Cross-reference: Link similar failures across multiple tests
 - Prioritize: Put critical issues before warnings before flakes
 - Use preprocessed-logs.txt: Check this file first for errors from large log files
+- Must-gather feedback: When you cannot determine root cause due to missing information,
+explicitly note what additional must-gather data would have helped. This feedback loop
+improves future debugging capabilities.
 PROMPT_EOF

 # Count failed tests from JUnit (count individual test failures, not just suites)
@@ -389,9 +452,16 @@ Analyze these artifacts:
 2. Preprocessed log errors: ${ARTIFACT_DIR}/preprocessed-logs.txt (check this FIRST for large log summaries)
 3. Must-gather: ${ARTIFACT_DIR}/must-gather/
 4. Per-test failure directories: ${ARTIFACT_DIR}/*/
+5. Velero source code: /go/src/github.com/openshift/velero/
+6. OADP operator source code: /go/src/github.com/openshift/oadp-operator/
+
+When errors reference Velero or OADP packages, read the relevant source code to understand:
+- What conditions trigger the error
+- If there's retry logic that should have handled it
+- If this is a known limitation or edge case

 Note: Prow's build-log.txt is NOT available during this analysis (it's written after tests complete).
-Focus on JUnit results, preprocessed log summaries, must-gather diagnostics, and per-test pod logs.
+Focus on JUnit results, preprocessed log summaries, must-gather diagnostics, per-test pod logs, and source code investigation.

 Generate comprehensive failure analysis following the output format specified in the prompt.
 Focus on actionable insights and clear root cause identification.
