Commit 51759df
Diagnose log injection smoke test flakiness instead of masking it (#11075)
# What Does This Do
Adds diagnostic instrumentation to the `check raw file injection` smoke test so that the next CI failure reveals the root cause instead of a bare "Condition not satisfied after 30s" with `traceCount=0`.
Changes to `LogInjectionSmokeTest`:
1. **`waitForTraceCountAlive`**: checks process liveness on every poll iteration. If the process dies, the wait fails immediately and reports the exit code plus the last 20 lines of process output.
2. **Enriched timeout errors**: on timeout, the error reports whether the process is alive, the `traceCount`, the number of RC polls received, and the last 30 lines of process output.
3. **Reordered waits**: `waitForTraceCount(4)` now runs before `waitFor`, and the `waitFor` return value is asserted.
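The liveness-aware wait described above can be sketched roughly as follows. This is a minimal Java illustration, not the Groovy code from `LogInjectionSmokeTest`: the class name, the supplier-based parameters, and the 50 ms poll interval are all assumptions chosen to keep the example self-contained. Only the message wording and the 20/30-line tails follow the PR description.

```java
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.IntSupplier;
import java.util.function.Supplier;

// Hypothetical sketch: names and signatures are assumptions, not the PR's Groovy code.
final class LivenessAwareWait {

  // Returns the last n lines of captured process output, newline-joined.
  static String tail(List<String> lines, int n) {
    int from = Math.max(0, lines.size() - n);
    return String.join("\n", lines.subList(from, lines.size()));
  }

  static void waitForTraceCountAlive(
      BooleanSupplier alive,          // e.g. process::isAlive
      IntSupplier exitCode,           // e.g. process::exitValue
      IntSupplier traceCount,         // traces received so far
      IntSupplier rcPolls,            // RC polls received so far
      Supplier<List<String>> output,  // captured process output lines
      int expected,
      long timeoutMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (System.nanoTime() - deadline < 0) {
      if (traceCount.getAsInt() >= expected) {
        return; // condition met
      }
      if (!alive.getAsBoolean()) {
        // Fail fast: a dead process will never send the missing traces.
        throw new AssertionError(
            "Process exited with code " + exitCode.getAsInt()
                + " while waiting for " + expected + " traces (received "
                + traceCount.getAsInt() + ", RC polls: " + rcPolls.getAsInt() + ").\n"
                + "Last process output:\n" + tail(output.get(), 20));
      }
      try {
        Thread.sleep(50); // poll interval is an arbitrary choice for this sketch
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new AssertionError("Interrupted while waiting for traces", e);
      }
    }
    // Timeout: report everything known about the process state.
    throw new AssertionError(
        "Timed out waiting for " + expected + " traces after " + (timeoutMillis / 1000) + "s. "
            + "traceCount=" + traceCount.getAsInt()
            + ", process.alive=" + alive.getAsBoolean()
            + ", RC polls received: " + rcPolls.getAsInt() + ".\n"
            + "Last process output:\n" + tail(output.get(), 30));
  }
}
```

With a real `java.lang.Process`, the first two arguments could be `process::isAlive` and `process::exitValue`. The trace count is checked before liveness so that a process which exits after delivering all expected traces is not reported as a failure.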
# Motivation
CI Visibility data for the last 30 days on master shows 10 failures of `check raw file injection`:
| Failure mode | Count | Line | Duration | Root cause |
|---|---|---|---|---|
| `traceCount=0` at `waitForTraceCount(2)` | 9/10 | 368 | 30.3s | Unknown — no diagnostics |
| `logLines.size()=3` at `assertRawLogLinesWithInjection` | 1/10 | 229 | 8.3s | Incomplete log file |
The run-duration distribution is **bimodal**: successful runs complete in 3.5-8.7s (80 data points, zero above 9s), while all 9 timeout failures sit at 30.3s, with nothing in between. (The remaining failure, at 8.3s, is the separate incomplete-log-file mode.) This suggests that the process either works or is completely broken, so a timeout increase would only delay the same failure.
```
<9s: ████████████████████████████████████████ 80/80 passes
9-30s: 0 runs
30s: █████████ 9/10 failures (at timeout)
```
The current test is blind during the wait: it only polls `traceCount` in a loop. We cannot tell whether the process crashed, hung during agent init, failed to connect to the test server, or failed in some other way. This PR makes the next failure self-diagnosing.
**Example output when process crashes:**
```
Process exited with code 1 while waiting for 2 traces (received 0, RC polls: 3).
Last process output:
[dd.trace ...] ERROR ... NullPointerException during instrumentation
...
```
**Example output on timeout (process alive but not sending traces):**
```
Timed out waiting for 2 traces after 30s. traceCount=0, process.alive=true, RC polls received: 142.
Last process output:
[dd.trace ...] DEBUG ... Still loading instrumentations...
...
```
# Additional Notes
- Only `LogInjectionSmokeTest.groovy` is changed
- No timeout increase — the 30s `defaultPoll` is kept as-is
- All 11 historically flaky backends pass locally
- `rcClientMessages.size()` tells us whether the agent connected to the test server at all (RC polls hit `/v0.7/config` every 200ms)
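
As a rough illustration of how the RC poll count might be read alongside the other diagnostics, the following Java sketch classifies a failed wait. The class, method names, and category wording are hypothetical and not part of the PR; only the 200ms poll interval and the 30s window come from the notes above. Under those numbers, at most 150 polls fit in the window, which is consistent with the 142 polls shown in the example timeout output.

```java
// Hypothetical helper: the class, method names, and wording are assumptions for illustration.
final class RcPollTriage {

  // Upper bound on the number of polls that fit in a window at a fixed interval.
  static long expectedPolls(long windowMillis, long intervalMillis) {
    return windowMillis / intervalMillis;
  }

  // Coarse reading of the three diagnostics reported on failure.
  static String triage(boolean alive, int traceCount, int rcPolls) {
    if (!alive) {
      return "process exited";
    }
    if (rcPolls == 0) {
      return "agent never reached the test server";
    }
    if (traceCount == 0) {
      return "agent connected but sent no traces";
    }
    return "traces are flowing";
  }
}
```

For example, `RcPollTriage.triage(true, 0, 142)` falls into the "agent connected but sent no traces" category, which matches the timeout example in the Motivation section.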
# Contributor Checklist
- [x] Format the title according to [the contribution guidelines](https://github.com/DataDog/dd-trace-java/blob/master/CONTRIBUTING.md#title-format)
- [x] Assign the `type:` and (`comp:` or `inst:`) labels
- [x] Avoid using `close`, `fix`, or [any linking keywords](https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword) when referencing an issue
- [x] Update the [CODEOWNERS](https://github.com/DataDog/dd-trace-java/blob/master/.github/CODEOWNERS) file on source file addition, migration, or deletion — N/A (no file additions)
- [x] Update [public documentation](https://docs.datadoghq.com/tracing/trace_collection/library_config/java/) with any new configuration flags or behaviors — N/A (test-only change)
tag: no release notes
tag: ai generated
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: devflow.devflow-routing-intake <devflow.devflow-routing-intake@kubernetes.us1.ddbuild.io>
1 parent 9aedff2 commit 51759df
1 file changed
Lines changed: 60 additions & 5 deletions
| Original file lines | New file lines | Change |
|---|---|---|
| | 350-403 | 54 lines added |
| 368 | 422 | 1 line replaced by 1 |
| 377 | 431 | 1 line replaced by 1 |
| 381-382 | 435-436 | 2 lines replaced by 2 |
| 384 | 438-439 | 1 line replaced by 2 |