Create test failure analysis skill and simplify workflow

StackRox Automation · claude · StackRox Automation · commit 1aef3b3ba450 · 2026-05-20T22:38:48.000Z
Following the pattern from PR #3170, moved all analysis logic into a proper Claude Code skill. Added: - .claude/commands/analyze-test-failures.md - Skill definition with usage, implementation details - Follows repo's slash command pattern - Contains all analysis instructions and report format Changed: - .github/workflows/analyze-and-notify.yml - Simplified prompt to just invoke the skill - Removed ~60 lines of inline instructions - Cleaner, more maintainable workflow Benefits: - Single source of truth for analysis logic - Can be tested locally: `claude /analyze-test-failures` - Can be improved independently of workflow - Follows established repo conventions Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
diff --git a/.claude/commands/analyze-test-failures.md b/.claude/commands/analyze-test-failures.md
@@ -0,0 +1,94 @@
+---
+name: analyze-test-failures
+description: Analyze test failure artifacts and generate root cause analysis report
+---
+
+# Test Failure Analysis
+
+Analyze test failures from CI artifacts and generate a concise root cause analysis for the oncall team.
+
+## Usage
+
+```
+/analyze-test-failures <artifacts-dir> <workflow-name> <failed-jobs>
+```
+
+**Arguments:**
+- `artifacts-dir`: Directory containing test artifacts (default: test-artifacts/)
+- `workflow-name`: Name of the workflow that failed (e.g., "Integration Tests")
+- `failed-jobs`: Comma-separated list of failed job names
+
+**Example:**
+```
+/analyze-test-failures test-artifacts/ "Integration Tests" "amd64-integration-tests,arm64-integration-tests"
+```
+
+## What This Does
+
+1. **Find test reports**: Searches for JUnit XML files (integration-test-report-*.xml, junit.xml)
+2. **Parse failures**: Extracts test names, error messages, stack traces
+3. **Investigate code**: Reads failing test source and implementation code
+4. **Check git history**: Looks for recent changes that may have caused failures
+5. **Identify patterns**: Detects platform-specific issues (arch/OS)
+6. **Generate report**: Creates analysis-report.md with findings
+
+## Report Format
+
+The generated `analysis-report.md` contains:
+
+```markdown
+**🤖 AI Analysis**
+
+**Root Cause**: [1-2 sentence summary with file:line references]
+
+**Evidence**:
+• [Specific code observations]
+• [Patterns across failures]
+• [Recent changes correlation]
+
+**Affected Platforms**: [Architectures/OS if pattern found]
+
+**Recommendations**:
+• [Specific file:line to fix with suggested change]
+• [Additional investigation needed]
+• [Prevention strategy]
+
+---
+**Statistics**
+• Total Failures: [count]
+• Total Errors: [count]
+• Failed Jobs: [list]
+```
+
+## Implementation
+
+Start by finding and parsing test reports:
+
+```bash
+# Find all XML test reports
+find <artifacts-dir> -name "*.xml" -type f
+```
+
+For each failure:
+- Read the test source code to understand intent
+- Examine the implementation being tested
+- Check `git log --oneline -20` for recent changes
+- Look for patterns across different platforms
+
+Generate the report focusing on **actionable insights** for the oncall engineer:
+- File paths and line numbers for fixes
+- Platform-specific patterns (endianness, timing, etc.)
+- Links to similar past failures if found
+
+Keep the analysis **under 500 words** and emphasize:
+- What broke
+- Why it broke
+- How to fix it
+
+**IMPORTANT**: After analysis, create the report file:
+
+```bash
+cat > analysis-report.md <<'EOF'
+[your markdown report]
+EOF
+```
diff --git a/.github/workflows/analyze-and-notify.yml b/.github/workflows/analyze-and-notify.yml
@@ -48,58 +48,15 @@ jobs:
         env:
           ANTHROPIC_VERTEX_PROJECT_ID: ${{ secrets.GCP_CLAUDE_PROJECT_ID }}
           CLOUD_ML_REGION: us-east5
+          TEST_MODE: ${{ inputs.is-test && 'true' || 'false' }}
         with:
           use_vertex: true
-          allowed_tools: "Read,Grep,Glob,Bash"
           prompt: |
-            ${{ inputs.is-test && '[TEST MODE] ' || '' }}Analyze test failures for the ${{ inputs.workflow-name }} workflow.
-
-            Failed jobs: ${{ inputs.failed-jobs }}
-            Artifacts directory: test-artifacts/
-
-            Instructions:
-            1. Find all XML test reports in test-artifacts/ (integration-test-report-*.xml, junit.xml)
-            2. Parse the XML to extract failed tests, error messages, and stack traces
-            3. Read the failing test source code to understand what's being tested
-            4. Examine the implementation code that's failing
-            5. Check git log for recent changes that might have caused failures
-            6. Look for platform-specific patterns (amd64, arm64, s390x, ppc64le)
-            7. Look for OS-specific patterns (RHEL, Ubuntu, COS, Flatcar, etc.)
-
-            Generate a markdown report at analysis-report.md with:
-
-            **🤖 AI Analysis${{ inputs.is-test && ' [TEST MODE]' || '' }}**
-
-            **Root Cause**: [1-2 sentence summary with file:line references]
-
-            **Evidence**:
-            • [Specific code observations]
-            • [Patterns across failures]
-            • [Recent changes correlation]
-
-            **Affected Platforms**: [Architectures/OS if pattern found]
-
-            **Recommendations**:
-            • [Specific file:line to fix with suggested change]
-            • [Additional investigation needed]
-            • [Prevention strategy]
-
-            ---
-            **Statistics**
-            • Total Failures: [count]
-            • Total Errors: [count]
-            • Failed Jobs: ${{ inputs.failed-jobs }}
-
-            Keep it under 500 words and actionable.
-
-            IMPORTANT: After your analysis, you MUST create the file analysis-report.md with your markdown report using:
-            cat > analysis-report.md <<'EOF'
-            [your markdown report here]
-            EOF
-          claude_env: |
-            WORKFLOW_NAME: ${{ inputs.workflow-name }}
-            FAILED_JOBS: ${{ inputs.failed-jobs }}
-            ARTIFACTS_DIR: test-artifacts
+            Run the test failure analysis skill:
+
+            /analyze-test-failures test-artifacts/ "${{ inputs.workflow-name }}" "${{ inputs.failed-jobs }}"
+
+            ${{ inputs.is-test && 'Note: This is a TEST run. Add [TEST MODE] prefix to the report title.' || '' }}
 
       - name: Check Claude action result
         if: always()