Commit c941249

docs: added the guide for collecting dumps for hanged tests. (#10687)
Added docs: added the guide for collecting dumps from hanged smoke tests.
Removed link to internal resources.
Merge branch 'master' into alexeyk/how-to-dump-docs
WIP
WIP
polished

Co-authored-by: alexey.kuznetsov <alexey.kuznetsov@datadoghq.com>
1 parent c6896b7 commit c941249

4 files changed

Lines changed: 108 additions & 0 deletions

docs/how_to_dump_hanged_test.md

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
# Reproduce a hung test and collect dumps (including child processes)

Sometimes a test that passes locally fails on CI due to a timeout.
One common indicator is an error saying that a condition was not satisfied after some time and several attempts.
For example:

```
Condition not satisfied after 30.00 seconds and 39 attempts
...
Caused by: Condition not satisfied:
decodeTraces.find { predicate.apply(it) } != null
```

This failure often masks a real problem, usually a deadlock or livelock in the tested application or in the test
process itself.
To investigate, collect thread and heap dumps; they simplify root-cause analysis.

Use this guide when a test repeatedly times out on CI while passing locally and you need actionable JVM dumps.
Step 1 is optional; it only reduces CI turnaround time.

See this [PR](https://github.com/DataDog/dd-trace-java/pull/10698) for an example investigation using this guide.

## Step 0: Setup

Create a branch for testing.

## Step 1 (Optional): Modify build scripts to minimize CI time

These are temporary, debugging-only changes. Revert them after collecting dumps.

Modify `.gitlab-ci.yml`:

- Keep only the Java versions you want to test, for example Java 21 only:

```
DEFAULT_TEST_JVMS: /^(21)$/
```

- Comment out heavy jobs such as `check_base`, `check_inst`, `muzzle`, `test_base`, `test_inst`, and `test_inst_latest`.

Modify `buildSrc/src/main/kotlin/dd-trace-java.configure-tests.gradle.kts`:

- Reduce the timeout from 20 minutes to 10 minutes:

```
timeout.set(Duration.of(10, ChronoUnit.MINUTES))
```

## Step 2: Modify the target test

Adjust the target test so it stays alive until the Gradle timeout triggers dump collection. For Spock tests, one option is
to use `PollingConditions` with a long timeout in a base class or directly in the target test class:

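```
// Needs: import spock.lang.Shared and import spock.util.concurrent.PollingConditions
// timeout, initialDelay and delay are in seconds; delay is multiplied by factor after each failed attempt
@Shared
protected final PollingConditions hangedPoll = new PollingConditions(timeout: 700, initialDelay: 0, delay: 5, factor: 2)
```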

> [!NOTE]
> Use `timeout: 700` if you executed Step 1, otherwise use `timeout: 1500`. The value is in seconds and must exceed the
> Gradle test timeout (10 minutes = 600 s with Step 1, 20 minutes = 1200 s without it).

This poll keeps the test running until Gradle detects the timeout and `DumpHangedTestPlugin` triggers dump collection.
Use this poll in the test, for example by replacing `defaultPoll` with `hangedPoll`:

```
waitForTrace(hangedPoll, checkTrace())
```

In other test frameworks, use an equivalent approach to keep the test running past the timeout (for example,
`Thread.sleep(XXX)` in a temporary debugging branch).
The main goal is to keep the test process alive to allow dump collection for all related JVMs.
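
For a framework without `PollingConditions`, the same keep-alive idea can be sketched in plain Java. This is only an illustration under stated assumptions: the class `HangedPoll` and its `await` method are hypothetical names, not part of dd-trace-java or Spock, and the parameters merely mirror the `timeout`, `delay`, and `factor` settings shown above (expressed as `Duration` values instead of seconds).

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not part of dd-trace-java): polls a condition with a growing delay. */
final class HangedPoll {
  private HangedPoll() {}

  /**
   * Re-checks the condition until it holds or the timeout elapses.
   * Returns true if the condition was satisfied, false on timeout or interruption.
   */
  static boolean await(BooleanSupplier condition, Duration timeout, Duration delay, double factor) {
    long deadline = System.nanoTime() + timeout.toNanos();
    long delayMillis = delay.toMillis();
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      long remainingMillis = (deadline - System.nanoTime()) / 1_000_000;
      if (remainingMillis <= 0) {
        return false;
      }
      try {
        // Never sleep past the deadline.
        Thread.sleep(Math.min(delayMillis, remainingMillis));
      } catch (InterruptedException e) {
        // Preserve the interrupt flag so the test runner can still stop the thread.
        Thread.currentThread().interrupt();
        return false;
      }
      // Grow the delay after each failed attempt.
      delayMillis = Math.max(1, (long) (delayMillis * factor));
    }
  }
}
```

In a temporary debugging branch, a call such as `HangedPoll.await(() -> traceReceived(), Duration.ofSeconds(700), Duration.ofSeconds(5), 2.0)` would keep the test alive past a 10-minute Gradle timeout whenever the condition never becomes true; `traceReceived()` is a placeholder for the test's own check.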

## Step 3: Run the test on CI and collect dumps

- Commit your changes.
- Push the reproducer branch to trigger the GitLab pipeline.
- Wait for the target test job to hit the timeout.
- In the job logs, confirm that the dump hook executed (look for `Taking dumps after ... for :...`).
- Wait until the job fails, then download the job artifacts.
![Download dumps](how_to_dump_hanged_test/download_dumps.png)

> [!NOTE]
> You may need to re-run CI several times if the bug does not reproduce on the first try.

Quick verification checklist:

- The test job timed out (rather than failing fast for another reason).
- Logs contain `Taking dumps after ... for :...`.
- Downloaded artifacts include dump files from the failed run.
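
The log check in this list can be automated for a job log saved locally. The sketch below is a hypothetical helper, not part of dd-trace-java: `DumpLogCheck` and `dumpHookExecuted` are illustrative names, and because the guide elides parts of the marker with `...`, the pattern treats those parts as wildcards rather than guessing their content.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of dd-trace-java): checks a saved job log for the dump marker. */
final class DumpLogCheck {
  // The "..." parts of the marker are elided in the guide, so they are matched as wildcards.
  private static final Pattern DUMP_MARKER = Pattern.compile("Taking dumps after .* for :.*");

  private DumpLogCheck() {}

  /** Returns true if any log line contains the dump marker. */
  static boolean dumpHookExecuted(List<String> logLines) {
    return logLines.stream().anyMatch(line -> DUMP_MARKER.matcher(line).find());
  }
}
```

For example, `DumpLogCheck.dumpHookExecuted(Files.readAllLines(Path.of("job.log")))` would cover the second checklist item, where `job.log` is a placeholder file name.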

## Step 4: Locate dumps by JVM type

### HotSpot/OpenJDK (heap + thread dumps):

- Open the report folder of the failed module/test task.
- You should see files such as `<pid>-heap-dump-<timestamp>.hprof`, `<pid>-thread-dump-<timestamp>.log`, and
  `all-thread-dumps-<timestamp>.log`.
![Dumps](how_to_dump_hanged_test/dumps.png)

### IBM JDK (javacore thread dumps only):

- In this case, dumps are produced via `kill -3` and written as `javacore` text files (thread dumps).
- Collect root-level javacore artifacts with the path pattern `reports/javacore.YYYYMMDD.HHMMSS.PID.SEQ.txt`.
![Javacores](how_to_dump_hanged_test/javacores.png)
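
When the artifact archive contains many files, the file name patterns listed in this step can be used to sort dumps by kind. The following is a minimal sketch, not project code: `DumpFileClassifier` is a hypothetical name, `<pid>` is assumed to be numeric, and `<timestamp>` and the javacore fields are matched loosely because their exact formats are not specified here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of dd-trace-java): groups artifact file names by dump kind. */
final class DumpFileClassifier {
  // Assumptions: <pid> is numeric; <timestamp> is matched loosely (format not specified).
  private static final Map<String, Pattern> KINDS = new LinkedHashMap<>();

  static {
    KINDS.put("heap dump", Pattern.compile("\\d+-heap-dump-.+\\.hprof"));
    KINDS.put("thread dump", Pattern.compile("\\d+-thread-dump-.+\\.log"));
    KINDS.put("all thread dumps", Pattern.compile("all-thread-dumps-.+\\.log"));
    KINDS.put("javacore", Pattern.compile("javacore\\..+\\.txt"));
  }

  private DumpFileClassifier() {}

  /** Maps each dump kind to the matching file names; files that are not dumps are ignored. */
  static Map<String, List<String>> classify(List<String> fileNames) {
    Map<String, List<String>> result = new LinkedHashMap<>();
    for (String name : fileNames) {
      for (Map.Entry<String, Pattern> kind : KINDS.entrySet()) {
        if (kind.getValue().matcher(name).matches()) {
          result.computeIfAbsent(kind.getKey(), k -> new ArrayList<>()).add(name);
        }
      }
    }
    return result;
  }
}
```

The input can be produced by walking the extracted artifacts directory (for example with `Files.walk`) and passing each path's file name; an empty result suggests that the dump hook did not run or that the wrong job's artifacts were downloaded.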

## Step 5: Run the investigation

Use tools like Eclipse MAT, or ask Codex or Claude to analyze the collected dumps.
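
Before opening a heavier tool, a quick first pass over a thread dump can show whether threads are piling up in one state. The sketch below assumes the `.log` thread dumps use the usual HotSpot text format (a `java.lang.Thread.State: ...` line per Java thread and a `Found one Java-level deadlock` header when the JVM detects a lock cycle); this guide does not confirm that format, and `ThreadDumpTriage` is a hypothetical name, not part of dd-trace-java.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of dd-trace-java): first-pass triage of a HotSpot-style thread dump. */
final class ThreadDumpTriage {
  // Assumed format: each Java thread has a line "java.lang.Thread.State: <STATE> ...".
  private static final Pattern STATE = Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");
  // Assumed format: the JVM's deadlock report starts with this header.
  private static final String DEADLOCK_HEADER = "Found one Java-level deadlock";

  private ThreadDumpTriage() {}

  /** Counts threads per state, for example RUNNABLE, BLOCKED, or TIMED_WAITING. */
  static Map<String, Integer> countStates(String dump) {
    Map<String, Integer> counts = new TreeMap<>();
    Matcher m = STATE.matcher(dump);
    while (m.find()) {
      counts.merge(m.group(1), 1, Integer::sum);
    }
    return counts;
  }

  /** Returns true if the dump contains a deadlock report. */
  static boolean reportsDeadlock(String dump) {
    return dump.contains(DEADLOCK_HEADER);
  }
}
```

Many `BLOCKED` threads together with a deadlock report point toward a deadlock, while a livelock usually needs several dumps compared over time, so this check only narrows down where to look.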