| 1 | +# Reproduce hung test and collect dumps (including child process) |
| 2 | + |
| 3 | +Sometimes a test that passes locally fails on CI due to a timeout. |
| 4 | +One common indicator is an error saying that a condition was not satisfied after some time and several attempts. |
| 5 | +For example: |
| 6 | + |
| 7 | +``` |
| 8 | +Condition not satisfied after 30.00 seconds and 39 attempts |
| 9 | +... |
| 10 | +Caused by: Condition not satisfied: |
| 11 | +decodeTraces.find { predicate.apply(it) } != null |
| 12 | +``` |
| 13 | + |
| 14 | +This failure often masks a real problem, usually a deadlock or livelock in the tested application or in the test |
| 15 | +process itself. |
| 16 | +To investigate these issues, collect thread and heap dumps to simplify root-cause analysis. |
| 17 | + |
| 18 | +Use this guide when a test repeatedly times out on CI while passing locally and you need actionable JVM dumps. |
| 19 | +Step 1 is optional; it only reduces CI turnaround time. |
| 20 | + |
| 21 | +See this [PR](https://github.com/DataDog/dd-trace-java/pull/10698) for an example investigation using this guide. |
| 22 | + |
| 23 | +## Step 0: Setup |
| 24 | + |
| 25 | +Create a branch for testing. |
| 26 | + |
| 27 | +## Step 1 (Optional): Modify build scripts to minimize CI time |
| 28 | + |
| 29 | +These are temporary debugging-only changes. Revert them after collecting dumps. |
| 30 | + |
| 31 | +Modify `.gitlab-ci.yml`: |
| 32 | + |
| 33 | +- Keep Java versions you want to test, for example Java 21 only: |
| 34 | + |
| 35 | +``` |
| 36 | +DEFAULT_TEST_JVMS: /^(21)$/ |
| 37 | +``` |
| 38 | + |
| 39 | +- Comment out heavy jobs, such as `check_base`, `check_inst`, `muzzle`, `test_base`, `test_inst`, and `test_inst_latest`. |
| 40 | + |
| 41 | +Modify `buildSrc/src/main/kotlin/dd-trace-java.configure-tests.gradle.kts`: |
| 42 | + |
| 43 | +- Reduce the timeout from 20 minutes to 10 minutes: |
| 44 | + |
| 45 | +``` |
| 46 | +timeout.set(Duration.of(10, ChronoUnit.MINUTES)) |
| 47 | +``` |
| 48 | + |
| 49 | +## Step 2: Modify the target test |
| 50 | + |
| 51 | +Adjust the target test so it stays alive until the Gradle timeout triggers dump collection. For Spock tests, one option is |
| 52 | +to use `PollingConditions` with a long timeout in a base class or directly in the target test class: |
| 53 | + |
| 54 | +``` |
| 55 | +@Shared |
| 56 | +protected final PollingConditions hangedPoll = new PollingConditions(timeout: 700, initialDelay: 0, delay: 5, factor: 2) |
| 57 | +``` |
| 58 | + |
| 59 | +> [!NOTE] |
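| 60 | +> Use `timeout: 700` if you applied Step 1; otherwise use `timeout: 1500`. `PollingConditions` timeouts are in seconds, so each value exceeds the corresponding Gradle timeout (600 or 1200 seconds). |
| 61 | + |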
| 62 | +This poll keeps the test running until Gradle detects the timeout and `DumpHangedTestPlugin` triggers dump collection. |
| 63 | +Use this poll in the test, for example by replacing `defaultPoll` with `hangedPoll`: |
| 64 | + |
| 65 | +``` |
| 66 | +waitForTrace(hangedPoll, checkTrace()) |
| 67 | +``` |
| 68 | + |
| 69 | +In other test frameworks, use an equivalent approach to keep the test running past the timeout (for example, |
| 70 | +`Thread.sleep(XXX)` in a temporary debugging branch). |
| 71 | +The main goal is to keep the test process alive to allow dump collection for all related JVMs. |
| 72 | + |
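For a test that does not use Spock, the hold can be sketched in plain Java as follows. This is a minimal illustration and not code from the repository: `HangHold` and `holdUntil` are hypothetical names, and the hold duration you pass in is assumed to exceed the Gradle timeout in effect.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper for a temporary debugging branch.
final class HangHold {
    // Polls `condition` until it returns true or `hold` elapses.
    // Returns true if the condition was met, false on expiry or interruption.
    static boolean holdUntil(BooleanSupplier condition, Duration hold, Duration pollInterval) {
        long deadline = System.nanoTime() + hold.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            // Sleep for the poll interval, but never past the deadline.
            long sleepMillis = Math.min(pollInterval.toMillis(), Math.max(1, remainingNanos / 1_000_000));
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag so the test runner can still stop the test.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A test could then call `HangHold.holdUntil(() -> traceArrived(), Duration.ofSeconds(700), Duration.ofSeconds(5))`, where `traceArrived()` stands in for whatever check the test already performs. If the application is hung, the condition never becomes true and the process stays alive until the Gradle timeout fires.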
| 73 | +## Step 3: Run the test on CI and collect dumps |
| 74 | + |
| 75 | +- Commit your changes. |
| 76 | +- Push the reproducer branch to trigger the GitLab pipeline. |
| 77 | +- Wait for the target test job to hit the timeout. |
| 78 | +- In job logs, confirm the dump hook executed (look for `Taking dumps after ... for :...`). |
| 79 | +- Wait until the job fails and download job artifacts. |
| 80 | +  |
| 81 | + |
| 82 | +> [!NOTE] |
| 83 | +> You may need to re-run CI several times if the bug is not reproduced on the first try. |
| 84 | + |
| 85 | +Quick verification checklist: |
| 86 | + |
| 87 | +- The test job timed out (rather than failing fast for another reason). |
| 88 | +- Logs contain `Taking dumps after ... for :...`. |
| 89 | +- Downloaded artifacts include dump files from the failed run. |
| 90 | + |
| 91 | +## Step 4: Locate dumps by JVM type |
| 92 | + |
| 93 | +### HotSpot/OpenJDK (heap + thread dumps) |
| 94 | + |
| 95 | +- Open the report folder of the failed module/test task. |
| 96 | +- You should see files such as `<pid>-heap-dump-<timestamp>.hprof`, `<pid>-thread-dump-<timestamp>.log`, and |
| 97 | + `all-thread-dumps-<timestamp>.log`. |
| 98 | +  |
| 99 | + |
| 100 | +### IBM JDK (javacore thread dumps only) |
| 101 | + |
| 102 | +- On IBM JDK, dumps are produced via `kill -3` and written as `javacore` text files (thread dumps). |
| 103 | +- Collect root-level javacore artifacts with the path pattern `reports/javacore.YYYYMMDD.HHMMSS.PID.SEQ.txt`. |
| 104 | +  |
| 105 | + |
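Because a hung run can produce dumps for both the test JVM and a child process, it can help to group the downloaded files by PID. The sketch below extracts the PID from a file name, assuming the names follow the patterns listed above. `DumpFileNames` and `pidOf` are hypothetical names, and the timestamp portion is treated as opaque because its exact format is not specified here.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; file-name patterns are taken from this guide.
final class DumpFileNames {
    // HotSpot/OpenJDK: <pid>-heap-dump-<timestamp>.hprof or <pid>-thread-dump-<timestamp>.log
    private static final Pattern HOTSPOT =
        Pattern.compile("(\\d{1,18})-(?:heap-dump-.+\\.hprof|thread-dump-.+\\.log)");

    // IBM JDK: javacore.YYYYMMDD.HHMMSS.PID.SEQ.txt
    private static final Pattern JAVACORE =
        Pattern.compile("javacore\\.\\d{8}\\.\\d{6}\\.(\\d{1,18})\\.\\d+\\.txt");

    // Returns the PID encoded in a per-process dump file name, or empty for
    // anything else (including the aggregated all-thread-dumps file).
    static OptionalLong pidOf(String fileName) {
        for (Pattern pattern : new Pattern[] {HOTSPOT, JAVACORE}) {
            Matcher matcher = pattern.matcher(fileName);
            if (matcher.matches()) {
                return OptionalLong.of(Long.parseLong(matcher.group(1)));
            }
        }
        return OptionalLong.empty();
    }
}
```

Files that share a PID belong to the same JVM, so a heap dump and a thread dump with the same prefix can be analyzed together.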
| 106 | +## Step 5: Run the investigation |
| 107 | + |
| 108 | +Use a tool such as Eclipse MAT, or ask an AI assistant such as Codex or Claude, to analyze the collected dumps. |
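
Before opening a full analysis tool, a quick first pass over a thread dump can show whether threads are piling up in `BLOCKED` or `WAITING`. The sketch below is one possible triage helper, assuming the HotSpot text format in which each stack begins with a `java.lang.Thread.State: <STATE>` line and a detected deadlock is reported with the phrase `Java-level deadlock`. Whether the deadlock section is present depends on how the dump was taken, and IBM javacore files use a different layout. `ThreadDumpSummary` is a hypothetical name.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical triage helper for HotSpot-style thread dump text.
final class ThreadDumpSummary {
    private static final String STATE_PREFIX = "java.lang.Thread.State: ";
    private static final String DEADLOCK_MARKER = "Java-level deadlock";

    // Counts threads per state, e.g. BLOCKED, WAITING, TIMED_WAITING, RUNNABLE.
    static Map<String, Integer> countStates(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.startsWith(STATE_PREFIX)) {
                String rest = trimmed.substring(STATE_PREFIX.length());
                // Drop the parenthesized detail, e.g. "(on object monitor)".
                int space = rest.indexOf(' ');
                String state = space < 0 ? rest : rest.substring(0, space);
                counts.merge(state, 1, Integer::sum);
            }
        }
        return counts;
    }

    // True if the dump text contains a deadlock report.
    static boolean reportsDeadlock(List<String> lines) {
        for (String line : lines) {
            if (line.contains(DEADLOCK_MARKER)) {
                return true;
            }
        }
        return false;
    }
}
```

Feeding it `Files.readAllLines(path)` for each `<pid>-thread-dump-<timestamp>.log` gives a per-process summary. A high `BLOCKED` count or a reported deadlock points to the stacks worth reading first.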