Commit efe47c7

Added docs: added the guide for collecting dumps from hanged smoke tests.
1 parent 969d21d commit efe47c7

4 files changed

Lines changed: 102 additions & 0 deletions

@@ -0,0 +1,102 @@
# Reproduce a hung smoke test and collect dumps (including the child process)

Sometimes a smoke test that passes locally fails on CI.
Such tests can be found
on [Test optimization: Flaky Management](https://app.datadoghq.com/ci/test/flaky?query=%40git.repository.id_v2%3Agithub.com%2Fdatadog%2Fdd-trace-java%20%40test.suite%3A%2Asmoke%2A&sort=-pipelines_failed&viewMode=flaky).
An indicator of a potential problem in the smoke application is an error saying that a condition was not
satisfied after some time and several attempts.
For example:

```
Condition not satisfied after 30.00 seconds and 39 attempts
...
Caused by: Condition not satisfied:
decodeTraces.find { predicate.apply(it) } != null
```

Such a failure very likely hides a real problem, usually a deadlock in the smoke application caused by a
bug in the instrumentation.
To investigate such issues we need thread and heap dumps for both the test and the smoke application started as a child
process.

This document describes several steps that simplify the collection of dumps. Some steps are optional and only reduce CI
turnaround time.

## Step 0: Setup

Create a branch for testing.
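
Creating the branch can be sketched with plain git, as below. The helper name `create_repro_branch` and the default branch name are hypothetical and only serve as an illustration; use whatever naming your team prefers.

```shell
# Hypothetical helper: create and switch to a throwaway branch for the reproducer.
# The default branch name is an illustrative assumption, not a project convention.
create_repro_branch() {
  branch="${1:-repro/hung-smoke-test}"
  git checkout -b "$branch"
}

# Usage (from the repository root):
#   create_repro_branch repro/hung-smoke-test
```

Keeping the reproducer on its own branch makes it easy to discard the CI and test modifications from the later steps once the dumps are collected.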

## Step 1 (optional): Modify build scripts to minimize CI time

Modify `.gitlab-ci.yml`:

- Keep only the Java versions you want to test, for example Java 21 only:

```
DEFAULT_TEST_JVMS: /^(21)$/
```

- Comment out heavy jobs, such as `check_base`, `check_inst`, `muzzle`, `test_base`, `test_inst`, and `test_inst_latest`.

Modify `buildSrc/src/main/kotlin/dd-trace-java.configure-tests.gradle.kts`:

- Replace the 20-minute timeout with 10 minutes:

```
timeout.set(Duration.of(10, ChronoUnit.MINUTES))
```

## Step 2: Modify the target test

Add a dedicated poll to `AbstractSmokeTest.groovy` that prevents the test from being retried:

```
@Shared
protected final PollingConditions hangedPoll = new PollingConditions(timeout: 700, initialDelay: 0, delay: 5, factor: 2)
```

> [!NOTE]
> Use `timeout: 700` if you executed step 1, otherwise use `timeout: 1500`. The `PollingConditions` timeout is given in
> seconds, so either value outlasts the corresponding Gradle timeout (10 minutes = 600 s, or 20 minutes = 1200 s).

This poll simply waits until Gradle detects the timeout and triggers thread and heap dump collection by
`DumpHangedTestPlugin`.
Use this poll in the test, for example as follows (usually it is enough to replace `defaultPoll` with `hangedPoll`):

```
waitForTrace(hangedPoll, checkTrace())
```

In other situations, just keep the test running longer than the test timeout, for example with `Thread.sleep(XXX)`.
The main goal is to keep the test alive so that dumps can be collected for the smoke application.

## Step 3: Run the test on CI and collect dumps

- Commit your changes.
- Push the reproducer branch; this triggers the GitLab pipeline.
- Wait for the smoke test job to hit the timeout.
- In the job logs, confirm that the dump hook executed (look for `Taking dumps after ... for :...`).
- Wait until the job fails and download the job artifacts (see screenshot).
![Download dumps](how_to_dump_hanged_smoke_test/download_dumps.png)

> [!NOTE]
> You may need to re-run CI several times if the bug is not reproduced on the first try.
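
If the job log is long, the check for the dump hook can be done with a text search. The sketch below assumes the job log was first saved to a local file; the helper name `dump_hook_ran` is hypothetical, and only the `Taking dumps after` prefix is taken from the log message mentioned above.

```shell
# Hypothetical helper: succeed if a saved job log shows that the dump hook ran.
# Assumes the GitLab job log was downloaded to a local file beforehand.
dump_hook_ran() {
  grep -q 'Taking dumps after' "$1"
}

# Usage:
#   dump_hook_ran job.log && echo "dumps were taken"
```

If the marker is missing, the job most likely ended before the timeout was reached, so re-running the pipeline is the next thing to try.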
83+
84+
## Step 4: Locate dumps by JVM type
85+
86+
### HotSpot/OpenJDK (heap + thread dumps):
87+
88+
- Open the report folder of the failed smoke module (the hanged test folder), for example under
89+
`reports/dd-smoke-tests/...`.
90+
- There will be files, such as `<pid>-heap-dump-<timestamp>.hprof`, `<pid>-thread-dump-<timestamp>.log`, and
91+
`all-thread-dumps-<timestamp>.log`.
92+
![Dumps](how_to_dump_hanged_smoke_test/dumps.png)
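
After unpacking the artifacts, the dump files can also be located from the command line. This is a minimal sketch that only relies on the file-name patterns listed above; the helper name `list_hotspot_dumps` is hypothetical, and the directory you pass in depends on where you extracted the artifacts.

```shell
# Hypothetical helper: list HotSpot heap and thread dumps under a reports directory.
# The name patterns mirror the file names listed above; the directory is up to you.
list_hotspot_dumps() {
  find "$1" -type f \( -name '*-heap-dump-*.hprof' \
    -o -name '*-thread-dump-*.log' \
    -o -name 'all-thread-dumps-*.log' \) | sort
}

# Usage:
#   list_hotspot_dumps reports/
```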

### IBM JDK (javacore thread dumps only)

- In this case dumps are produced via `kill -3` and written as `javacore` text files, which are essentially thread dumps.
- Collect the root-level javacore artifacts from `reports/`, for example `javacore.YYYYMMDD.HHMMSS.PID.SEQ.txt`.
![Javacores](how_to_dump_hanged_smoke_test/javacores.png)
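
The javacore files can be listed in a similar way. The sketch below assumes the artifacts were unpacked locally and, following the description above, only looks at the top level of the given directory; the helper name `list_javacores` is hypothetical.

```shell
# Hypothetical helper: list javacore files stored directly in a reports directory.
# -maxdepth 1 restricts the search to the root level, without descending into subfolders.
list_javacores() {
  find "$1" -maxdepth 1 -type f -name 'javacore.*.txt' | sort
}

# Usage:
#   list_javacores reports/
```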

## Step 5: Run the investigation

Use tools like Eclipse MAT, or simply ask Codex or Claude to analyze the collected dumps.
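
Before opening the heap dumps, a quick text search of the thread dumps can show whether the JVM itself reported a deadlock. The sketch below assumes HotSpot-style thread dumps, which contain the phrase `Java-level deadlock` when the JVM detects a lock cycle and mark waiting threads with `java.lang.Thread.State: BLOCKED`. Whether the collected files follow exactly this format is an assumption, and both helper names are hypothetical.

```shell
# Hypothetical helper: print the thread dump files that report a Java-level deadlock.
# Assumes HotSpot-style thread dump text; other JVMs use different wording.
find_deadlock_dumps() {
  find "$1" -type f -name '*thread-dump*.log' \
    -exec grep -l 'Java-level deadlock' {} +
}

# Hypothetical helper: count threads in the BLOCKED state in one thread dump file.
count_blocked_threads() {
  grep -c 'java.lang.Thread.State: BLOCKED' "$1"
}

# Usage:
#   find_deadlock_dumps reports/
#   count_blocked_threads reports/dd-smoke-tests/.../<pid>-thread-dump-<timestamp>.log
```

An empty result does not rule out a hang: the JVM only reports cycles it can detect, so threads stuck waiting on I/O or on a condition that never becomes true still need manual inspection.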