
Commit a1aad30

jerm-dro and claude authored
Add deflake skill for finding and fixing flaky tests (#4746)
* Add deflake skill for finding and fixing flaky tests

  Adds a /deflake skill that analyzes GitHub Actions failures on main to discover, rank, and plan fixes for flaky tests. The skill includes a Python collection script that deterministically fetches failed run logs in parallel, extracts test names from Ginkgo and gotestfmt output, and aggregates failures into a ranked report.

  Used this skill to identify and fix the #1 flake (workload lifecycle E2E test, 12/147 runs) in #4745.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Address review feedback on collect-flakes script

  - Extract per-test log context (50 lines before the failure marker) before classifying failure mode, so tests in the same run get accurate individual mode labels instead of all inheriting the first match from the full run log
  - Add try/except around future.result() so one failed run doesn't crash the script and lose all collected data
  - Fix misleading comment about MAX_PAGES covering 300 Main build runs — the API returns all workflows' runs, not just Main build

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix per-test failure mode extraction in collect-flakes

  The previous attempt to extract per-test failure context used a 50-line window before the [FAIL] summary line, but Ginkgo's [FAILED] reason line (e.g., "Timed out after 120s") can appear thousands of lines earlier. Also needed ANSI stripping when searching for [FAILED] markers.

  Now searches backwards from the [FAIL] summary to find all [FAILED] lines in the failure block, uses the earliest one (which has the actual failure reason), and extracts context spanning all of them.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
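The backward search described in that last fix is the heart of the change. A hypothetical sketch of that extraction step (the real implementation lives in collect-flakes.py, the second changed file, and may differ):

```python
# Hypothetical sketch of the per-test context extraction described above;
# the actual collect-flakes.py may structure this differently.
import re

ANSI = re.compile(r"\x1b\[[0-9;]*m")  # strip ANSI colors before matching

def failure_context(lines, fail_idx, fallback_window=50):
    """Given the index of a test's [FAIL] summary line, search backwards for
    its [FAILED] reason lines and return the block spanning all of them."""
    failed = []
    for i in range(fail_idx - 1, -1, -1):
        plain = ANSI.sub("", lines[i])
        if "[FAILED]" in plain:
            failed.append(i)
        elif "[FAIL]" in plain:
            break  # assumed block boundary: the previous test's summary
    if not failed:
        # no reason line found: fall back to a fixed window before [FAIL]
        return lines[max(0, fail_idx - fallback_window):fail_idx + 1]
    # the earliest [FAILED] line carries the actual failure reason
    return lines[min(failed):fail_idx + 1]
```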
1 parent aa68b58 · commit a1aad30

2 files changed

Lines changed: 430 additions & 0 deletions

File tree

.claude/skills/deflake/SKILL.md

Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
---
name: deflake
description: Finds flaky tests on the main branch by analyzing GitHub Actions failures, ranks them by frequency, and enters parallel plan mode to design deflake strategies. Use when you want to find and fix the flakiest tests.
---

# Deflake Tests

Discovers, ranks, and plans fixes for flaky tests by analyzing GitHub Actions failures on `main`.

## Arguments

```
/deflake            # Full analysis: discover, rank, and plan fixes
/deflake --report   # Report only: show flake rankings without planning fixes
/deflake --top N    # Analyze and plan fixes for the top N flakes (default: 3)
```

---

## Phase 1: Collect and Rank Flakes

Run the collection script. It handles all deterministic data collection and aggregation. If CI log formats change over time, update the script directly.

```bash
python3 .claude/skills/deflake/collect-flakes.py
```

The script outputs three sections:
1. **FLAKE REPORT** — overall stats (total runs, failure rate, date range)
2. **RANKED FAILURES** — table sorted by failure count with job, mode, and test name
3. **FAILURE DETAILS** — per-test breakdown with links to each failed run
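The script itself is not shown in this file. As a rough sketch of the kind of collection loop it implements per the commit message (parallel log fetch, Ginkgo `[FAIL]` extraction, aggregation), where `OWNER/REPO`, the regexes, and all tuning values are placeholders rather than the script's actual contents:

```python
# Sketch only: parallel fetch of failed-run logs and failure aggregation.
import io
import os
import re
import zipfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

API = "https://api.github.com/repos/OWNER/REPO/actions"  # placeholder repo
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
ANSI = re.compile(r"\x1b\[[0-9;]*m")     # CI logs are ANSI-colored
FAIL = re.compile(r"\[FAIL\]\s+(\S.*)")  # simplified Ginkgo summary line

def failed_runs():
    """List failed workflow runs on main (first page only, for brevity)."""
    r = requests.get(f"{API}/runs", headers=HEADERS,
                     params={"branch": "main", "status": "failure"})
    r.raise_for_status()
    return r.json()["workflow_runs"]

def failing_tests(run_id):
    """Download one run's log archive and pull out failing test names."""
    r = requests.get(f"{API}/runs/{run_id}/logs", headers=HEADERS)
    r.raise_for_status()  # endpoint redirects to a zip of all job logs
    names = []
    with zipfile.ZipFile(io.BytesIO(r.content)) as zf:
        for entry in zf.namelist():
            text = ANSI.sub("", zf.read(entry).decode("utf-8", "replace"))
            names.extend(FAIL.findall(text))
    return names

counts = Counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(failing_tests, run["id"]): run["id"]
               for run in failed_runs()}
    for fut in as_completed(futures):
        try:
            counts.update(fut.result())
        except Exception as exc:  # one bad run must not lose all data
            print(f"skipping run {futures[fut]}: {exc}")

for name, n in counts.most_common(10):  # the ranked-failures core
    print(f"{n:3d}  {name}")
```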

### Phase 1 complete

Read the script output and use it directly for the report. The LLM's only job in this phase is to **categorize** each entry as a flake, real bug, or infra issue (a minimal classifier sketch follows this list):

- **Flake**: Appears multiple times intermittently, interspersed with successful runs
- **Real bug**: Appeared after a specific commit, and every run after that failed until a fix landed. Check `git log` for related fixes
- **Infra flake**: Entries tagged `[INFRA]` by the script, or failures with mode `connection refused` / `infra`
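To make the flake-versus-real-bug call concrete, a toy illustration (the `categorize` helper and its pass/fail histories are invented for this example, not part of the skill):

```python
# Toy classifier for the rule above. `history` is one test's chronological
# run results on main: True = passed, False = failed. Invented for
# illustration; the skill leaves this judgment to the LLM.
def categorize(history: list[bool]) -> str:
    if all(history):
        return "stable"
    first_fail = history.index(False)
    if any(history[first_fail:]):
        return "flake"      # failures interspersed with later successes
    return "real bug?"      # every run failed after some commit: check git log

assert categorize([True, False, True, False, True]) == "flake"
assert categorize([True, True, False, False, False]) == "real bug?"
```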

---

## Phase 2: Present the Report

Present the script output as a formatted report. Add categorization (flake / real bug / infra) to each entry. Example format:

```markdown
## Flake Report — main branch

**Period**: 2026-04-01 to 2026-04-10
**Runs analyzed**: 23 total, 8 failed (35% failure rate)

### Top Flaky Tests

| Rank | Test | Job | Failures | Failure Mode |
|------|------|-----|----------|--------------|
| 1 | Workload lifecycle ... [It] should track ... | E2E (api-workloads) | 5/23 | timeout (120s) |
| 2 | ... | ... | ... | ... |

### Real Bugs (not flakes)
- [Test name] — Introduced by [commit], fixed by [commit/PR]

### Infra Failures
- [N] runs failed due to [description]
```

If the user passed `--report`, stop here. Otherwise, continue to Phase 3.

---

## Phase 3: Plan Deflake Fixes

### 3.1 Parallel Investigation

For the top N flakes (default 3), launch **parallel agents** to investigate each one simultaneously.

For each flake, spawn an Agent (subagent_type: `general-purpose`) that:

1. **Reads the test code**: Find the test file; understand what it does and what behavior it's verifying
2. **Reads the production code**: Read all the production code that the test exercises — handlers, services, middleware, etc. Understand the code path end-to-end
3. **Maps test coverage for this feature**: Search the entire repo for all tests that cover this same feature or code path. Don't assume test locations — grep for the feature name, function names, and related keywords across the whole codebase. Tests may live in `_test.go` files alongside prod code, in `e2e/`, in `acceptance_test` files, or elsewhere. For each test found, document what it covers, what level it operates at (unit/integration/E2E), and whether it's stable or also flaky
4. **Reads the failure logs**: Get 2-3 example failure logs from different runs
5. **Identifies the root cause**: Why does this test fail intermittently?
   - Timing-dependent (hardcoded sleeps, tight timeouts)?
   - Resource contention (port conflicts, shared state)?
   - Ordering dependency (relies on another test's side effects)?
   - External dependency (network call, container pull)?
   - Race condition (concurrent access, missing synchronization)? See the sketch after this list
6. **Proposes a fix strategy**: Following the deflake principles below, informed by the full picture of prod code and existing test coverage
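The root-cause sketch promised above, as a self-contained toy (Python for brevity; the repo's E2E suites are Go, and nothing here is project code):

```python
# Toy illustration of the last category above: two workers mutate shared
# state without synchronization, so an asserted total only *usually* holds.
import threading

counter = 0

def bump(n):
    global counter
    for _ in range(n):
        counter += 1  # read-modify-write; a thread switch here loses updates

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# A test asserting counter == 200_000 passes on most runs and flakes on the
# rest -- exactly the intermittent signature the Phase 1 report surfaces.
print(counter)
```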

**IMPORTANT**: Launch all agents in a single message so they run in parallel.

Wait for all agents to complete, then consolidate findings.

### 3.2 Present Deflake Plans

For each flake, present a high-level plan with the alternatives considered:

```markdown
### Flake #N: [Test Name]

**Root cause**: [one-sentence explanation]
**Failure logs**: [links to 2-3 example runs]

**Options considered**:
1. [Option A]: [why it was rejected or chosen]
2. [Option B]: [why it was rejected or chosen]
3. [Option C]: [why it was rejected or chosen]

**Recommended approach**: [which option and why it's the best fit]
- [High-level description of the changes]

**Confidence**: High / Medium / Low
**Risk**: [what could go wrong with this approach]
```

Present all plans and wait for user feedback. The user may choose a different option, combine approaches, or ask for more investigation. Do NOT enter plan mode or start implementing until the user approves the approach for each flake.

### 3.3 Implement Approved Fixes

Once the user approves the approaches, enter plan mode to design the detailed implementation. The plan should:

- Group related fixes (e.g., if multiple tests share the same root cause)
- Order fixes by impact (fix the flake that fails most often first)
- Make each fix its own commit for easy revert

---

## Deflake Principles

These principles guide all fix proposals. **Prefer simplifying code and tests over adding complexity.**

### Prefer removal over addition
- Delete flaky tests only if they're redundant with other **stable tests at the same level**
- If multiple E2E tests cover fine-grained behavior for one feature, move the fine-grained cases to unit tests and keep a single E2E smoke test
- Never remove **all** E2E coverage for a feature — at least one smoke test must remain
- Remove unnecessary setup/teardown that introduces timing sensitivity

### Fix the test, not the production code
- If flakiness exposes a real bug, fix the production code
- Do NOT add complexity to production code just to make a flaky test pass (retry logic, test-only hooks, feature flags)
- Ask: what is the intention of this test? Can we capture it in a more reliable form?

### Fix options
- **Delete the test** if redundant (keeping at least one E2E smoke test per feature)
- **Rewrite as a unit test** if the behavior can be tested without integration
- **Refactor hard-to-test code** so the behavior under test can be easily isolated and reliably examined
- **Reduce scope** — test one thing instead of a full lifecycle
- **Use polling with short intervals** instead of fixed sleeps (e.g., `Eventually` with a 1s poll interval; see the sketch after this list)
- **Increase timeouts** — only as a last resort, and only for `Eventually`/`Consistently` matchers, never arbitrary `time.Sleep` calls
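`Eventually` above is Gomega's polling matcher in the Go suites. The same pattern, sketched in Python with invented names (`workload_ready` and the timings are illustrative, not project code):

```python
# Polling-over-sleeping: check a condition at short intervals under a
# deadline instead of sleeping a fixed amount and asserting once.
import time

def eventually(condition, timeout=120.0, interval=1.0):
    """Poll `condition` every `interval` seconds until it returns True or
    the `timeout` deadline passes; return whether it ever succeeded."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# Flaky:  time.sleep(30); assert workload_ready()  # guesses how long to wait
# Stable: assert eventually(workload_ready)  # waits only as long as needed,
#         up to a generous deadline
```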

### Anti-patterns to avoid
- Adding `time.Sleep()` to "fix" timing issues
- Adding retry loops around flaky assertions
- Marking tests as `[Flaky]` or `Skip` without fixing them
- Adding production-code complexity (feature flags, test modes) to make tests pass
- Increasing parallelism limits or resource requests as a band-aid
