|
| 1 | +# Reproduce a hung smoke test and collect dumps (including the child process) |
| 2 | + |
| 3 | +Sometimes a smoke test that passes locally fails on CI. |
| 4 | +Such tests can be found |
| 5 | +on [Test optimization: Flaky Management](https://app.datadoghq.com/ci/test/flaky?query=%40git.repository.id_v2%3Agithub.com%2Fdatadog%2Fdd-trace-java%20%40test.suite%3A%2Asmoke%2A&sort=-pipelines_failed&viewMode=flaky). |
| 6 | +An indicator of a potential problem in the smoke application is an error saying that a condition was not |
| 7 | +satisfied after some time and several attempts. |
| 8 | +For example: |
| 9 | + |
| 10 | +``` |
| 11 | +Condition not satisfied after 30.00 seconds and 39 attempts |
| 12 | +... |
| 13 | +Caused by: Condition not satisfied: |
| 14 | +decodeTraces.find { predicate.apply(it) } != null |
| 15 | +``` |
| 16 | + |
| 17 | +This error most likely hides a real problem, usually a deadlock in the smoke application caused by a bug in the |
| 18 | +instrumentation. |
| 19 | +To investigate such issues we need thread and heap dumps for both the test and the smoke application started as a child |
| 20 | +process. |
| 21 | + |
| 22 | +This document describes several steps that will simplify collection of dumps. Some steps are optional and only reduce CI |
| 23 | +turnaround time. |
| 24 | + |
| 25 | +## Step 0: Setup |
| 26 | + |
| 27 | +Create a branch for testing. |
| 28 | + |
| 29 | +## Step 1 (optional): Modify build scripts to minimize CI time |
| 30 | + |
| 31 | +Modify `.gitlab-ci.yml`: |
| 32 | + |
| 33 | +- Keep Java versions you want to test, for example Java 21 only: |
| 34 | + |
| 35 | +``` |
| 36 | +DEFAULT_TEST_JVMS: /^(21)$/ |
| 37 | +``` |
| 38 | + |
| 39 | +- Comment out heavy jobs, like `check_base`, `check_inst`, `muzzle`, `test_base`, `test_inst`, `test_inst_latest`. |
| 40 | + |
| 41 | +Modify `buildSrc/src/main/kotlin/dd-trace-java.configure-tests.gradle.kts`: |
| 42 | + |
| 43 | +- Replace the 20-minute timeout with a 10-minute one: |
| 44 | + |
| 45 | +``` |
| 46 | +timeout.set(Duration.of(10, ChronoUnit.MINUTES)) |
| 47 | +``` |
| 48 | + |
| 49 | +## Step 2: Modify target test |
| 50 | + |
| 51 | +Add a special poll to `AbstractSmokeTest.groovy` that prevents the test from being retried: |
| 52 | + |
| 53 | +``` |
| 54 | +@Shared |
| 55 | +protected final PollingConditions hangedPoll = new PollingConditions(timeout: 700, initialDelay: 0, delay: 5, factor: 2) |
| 56 | +``` |
| 57 | + |
| 58 | +> [!NOTE] |
| 59 | +> Use `timeout: 700` if you executed step 1, otherwise use `timeout: 1500`. The value is in seconds and must exceed the Gradle test timeout (600 or 1200 seconds). |
| 60 | + |
| 61 | +This poll keeps waiting until Gradle detects the timeout and triggers thread and heap dump collection by |
| 62 | +`DumpHangedTestPlugin`. |
| 63 | +Use this poll in the test, usually by replacing `defaultPoll` with `hangedPoll`: |
| 64 | + |
| 65 | +``` |
| 66 | +waitForTrace(hangedPoll, checkTrace()) |
| 67 | +``` |
| 68 | + |
| 69 | +In other situations, make the test run longer than the test timeout, for example with `Thread.sleep(XXX)`. |
| 70 | +The main goal is to keep the test alive to allow dump collection for the smoke application. |
| 71 | + |
| 72 | +## Step 3: Run test on CI and collect dumps |
| 73 | + |
| 74 | +- Commit your changes. |
| 75 | +- Push the reproducer branch; this triggers the GitLab pipeline. |
| 76 | +- Wait for the smoke test job to hit the timeout. |
| 77 | +- In the job logs, confirm that the dump hook executed (look for `Taking dumps after ... for :...`). |
| 78 | +- Wait until the job fails and download the job artifacts (see screenshot). |
| 79 | +  |
| 80 | + |
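The log check in the list above can also be done locally. The following is a minimal sketch, assuming the GitLab job log has been saved as a plain text file; the `dump_hook_ran` helper and the `job.log` file name are hypothetical, and only the `Taking dumps after` message prefix comes from this document.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the saved job log shows that the dump hook ran.
# Assumes the GitLab job log was downloaded as a plain text file.
dump_hook_ran() {
  log_file="$1"
  # The message prefix is the one quoted in the step above.
  grep -q 'Taking dumps after' "$log_file"
}

# Example usage (file name is an assumption):
#   dump_hook_ran job.log && echo "dump hook executed"
```

If the helper reports nothing, the job most likely ended before the Gradle timeout fired, so the artifacts will not contain dumps.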
| 81 | +> [!NOTE] |
| 82 | +> You may need to re-run CI several times if the bug is not reproduced on the first try. |
| 83 | + |
| 84 | +## Step 4: Locate dumps by JVM type |
| 85 | + |
| 86 | +### HotSpot/OpenJDK (heap + thread dumps) |
| 87 | + |
| 88 | +- Open the report folder of the failed smoke module (the folder of the hung test), for example under |
| 89 | +  `reports/dd-smoke-tests/...`. |
| 90 | +- It contains files such as `<pid>-heap-dump-<timestamp>.hprof`, `<pid>-thread-dump-<timestamp>.log`, and |
| 91 | +  `all-thread-dumps-<timestamp>.log`. |
| 92 | +  |
| 93 | + |
| 94 | +### IBM JDK (javacore thread dumps only) |
| 95 | + |
| 96 | +- In this case dumps are produced via `kill -3` and written as `javacore` text files, which are essentially thread dumps. |
| 97 | +- Collect root-level javacore artifacts from `reports/`, for example `javacore.YYYYMMDD.HHMMSS.PID.SEQ.txt`. |
| 98 | +  |
| 99 | + |
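To find all dumps at once after unpacking the artifacts, something like the sketch below may help. The file name patterns are the ones listed in this step; the `list_dumps` helper and the `artifacts` directory name are assumptions, so adjust them to wherever the archive was extracted.

```shell
#!/bin/sh
# Hypothetical helper: list dump files under an unpacked artifacts directory.
# File name patterns are taken from this step; the directory layout is an assumption.
list_dumps() {
  artifacts_dir="$1"
  find "$artifacts_dir" -type f \( \
      -name '*-heap-dump-*.hprof' \
      -o -name '*-thread-dump-*.log' \
      -o -name 'all-thread-dumps-*.log' \
      -o -name 'javacore.*.txt' \
    \) | sort
}

# Example usage (directory name is an assumption):
#   list_dumps artifacts/reports
```

Searching the whole tree covers both the per-module HotSpot dumps and the root-level IBM javacore files.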
| 100 | +## Step 5: Run the investigation |
| 101 | + |
| 102 | +Use tools like Eclipse MAT, or simply ask Codex or Claude to analyze collected dumps. |
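Before opening a heap dump, a quick scan of a HotSpot thread dump can hint at whether a deadlock is involved. This is a minimal sketch: the `summarize_thread_dump` helper is hypothetical, and it relies only on the `Java-level deadlock` marker and the `java.lang.Thread.State: BLOCKED` lines that HotSpot prints in thread dumps. IBM javacore files use a different format and are not covered here.

```shell
#!/bin/sh
# Hypothetical helper: print a quick summary of one HotSpot thread dump.
summarize_thread_dump() {
  dump_file="$1"
  # HotSpot prints this marker when it detects a cycle of threads waiting on locks.
  if grep -q 'Java-level deadlock' "$dump_file"; then
    echo "deadlock: yes"
  else
    echo "deadlock: no"
  fi
  # Count threads blocked on a monitor; grep -c exits non-zero on zero matches.
  blocked=$(grep -c 'java.lang.Thread.State: BLOCKED' "$dump_file" || true)
  echo "blocked threads: $blocked"
}

# Example usage (file name is an assumption):
#   summarize_thread_dump 12345-thread-dump-1700000000.log
```

A "deadlock: no" result does not rule out a hang: threads can also wait forever on conditions that the JVM does not report as a deadlock, so the full dump still needs a closer look.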