Skip to content

Commit 51759df

Browse files
bm1549devflow.devflow-routing-intake
andauthored
Diagnose log injection smoke test flakiness instead of masking it (#11075)
# What Does This Do Adds diagnostic instrumentation to the `check raw file injection` smoke test so the next CI failure tells us the root cause instead of a bare "Condition not satisfied after 30s" with `traceCount=0`. Changes to `LogInjectionSmokeTest`: 1. **`waitForTraceCountAlive`** — checks process liveness on every poll iteration; if the process dies, fails immediately with exit code + last 20 lines of process output 2. **Enriched timeout errors** — on timeout, dumps: process alive?, traceCount, RC polls received, last 30 lines of process output 3. **Reorder `waitForTraceCount(4)` before `waitFor`** + assert `waitFor` return value # Motivation CI Visibility data for the last 30 days on master shows 10 failures of `check raw file injection`: | Failure mode | Count | Line | Duration | Root cause | |---|---|---|---|---| | `traceCount=0` at `waitForTraceCount(2)` | 9/10 | 368 | 30.3s | Unknown — no diagnostics | | `logLines.size()=3` at `assertRawLogLinesWithInjection` | 1/10 | 229 | 8.3s | Incomplete log file | The failure distribution is **bimodal** — successful runs complete in 3.5-8.7s (80 data points, zero above 9s), while failures sit at exactly 30.3s. There is nothing in between. This means the process either works or is totally broken — a timeout increase would just delay the same failure. ``` <9s: ████████████████████████████████████████ 80/80 passes 9-30s: 0 runs 30s: █████████ 9/10 failures (at timeout) ``` The current test is blind during the wait — it just polls `traceCount` in a loop. We don't know if the process crashed, hung during agent init, failed to connect to the test server, or something else entirely. This PR makes the next failure self-diagnosing. **Example output when process crashes:** ``` Process exited with code 1 while waiting for 2 traces (received 0, RC polls: 3). Last process output: [dd.trace ...] ERROR ... NullPointerException during instrumentation ... ``` **Example output on timeout (process alive but not sending traces):** ``` Timed out waiting for 2 traces after 30s. traceCount=0, process.alive=true, RC polls received: 142. Last process output: [dd.trace ...] DEBUG ... Still loading instrumentations... ... ``` # Additional Notes - Only `LogInjectionSmokeTest.groovy` is changed - No timeout increase — the 30s `defaultPoll` is kept as-is - All 11 historically flaky backends pass locally - `rcClientMessages.size()` tells us whether the agent connected to the test server at all (RC polls hit `/v0.7/config` every 200ms) # Contributor Checklist - [x] Format the title according to [the contribution guidelines](https://github.com/DataDog/dd-trace-java/blob/master/CONTRIBUTING.md#title-format) - [x] Assign the `type:` and (`comp:` or `inst:`) labels - [x] Avoid using `close`, `fix`, or [any linking keywords](https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword) when referencing an issue - [x] Update the [CODEOWNERS](https://github.com/DataDog/dd-trace-java/blob/master/.github/CODEOWNERS) file on source file addition, migration, or deletion — N/A (no file additions) - [x] Update [public documentation](https://docs.datadoghq.com/tracing/trace_collection/library_config/java/) with any new configuration flags or behaviors — N/A (test-only change) tag: no release notes tag: ai generated 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: devflow.devflow-routing-intake <devflow.devflow-routing-intake@kubernetes.us1.ddbuild.io>
1 parent 9aedff2 commit 51759df

1 file changed

Lines changed: 60 additions & 5 deletions

File tree

dd-smoke-tests/log-injection/src/test/groovy/datadog/smoketest/LogInjectionSmokeTest.groovy

Lines changed: 60 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -347,6 +347,60 @@ abstract class LogInjectionSmokeTest extends AbstractSmokeTest {
347347
return logEvent[key]
348348
}
349349

350+
/**
351+
* Like {@link AbstractSmokeTest#waitForTraceCount} but checks process liveness on every poll
352+
* iteration and dumps diagnostic state on failure, so CI failures produce actionable output
353+
* instead of a bare "Condition not satisfied" after a 30s timeout.
354+
*/
355+
int waitForTraceCountAlive(int count) {
356+
try {
357+
defaultPoll.eventually {
358+
if (traceDecodingFailure != null) {
359+
throw traceDecodingFailure
360+
}
361+
// Check the count BEFORE liveness — the process may have exited normally
362+
// after delivering all traces, and we don't want to treat that as a failure.
363+
if (traceCount.get() >= count) {
364+
return
365+
}
366+
if (testedProcess != null && !testedProcess.isAlive()) {
367+
def lastLines = tailProcessLog(20)
368+
// RuntimeException (not AssertionError) so PollingConditions propagates
369+
// immediately instead of retrying for the full timeout.
370+
throw new RuntimeException(
371+
"Process exited with code ${testedProcess.exitValue()} while waiting for ${count} traces " +
372+
"(received ${traceCount.get()}, RC polls: ${rcClientMessages.size()}).\n" +
373+
"Last process output:\n${lastLines}")
374+
}
375+
assert traceCount.get() >= count
376+
}
377+
} catch (AssertionError e) {
378+
// The default error ("Condition not satisfied after 30s") is useless — enrich with diagnostic state
379+
def alive = testedProcess?.isAlive()
380+
def lastLines = tailProcessLog(30)
381+
throw new AssertionError(
382+
"Timed out waiting for ${count} traces after ${defaultPoll.timeout}s. " +
383+
"traceCount=${traceCount.get()}, process.alive=${alive}, " +
384+
"RC polls received: ${rcClientMessages.size()}.\n" +
385+
"Last process output:\n${lastLines}", e)
386+
}
387+
traceCount.get()
388+
}
389+
390+
private String tailProcessLog(int lines) {
391+
try {
392+
def logFile = new File(logFilePath)
393+
if (!logFile.exists()) {
394+
return "(log file does not exist: ${logFilePath})"
395+
}
396+
def allLines = logFile.readLines()
397+
def tail = allLines.size() > lines ? allLines[-lines..-1] : allLines
398+
return tail.join("\n")
399+
} catch (Exception e) {
400+
return "(failed to read log: ${e.message})"
401+
}
402+
}
403+
350404
def parseTraceFromStdOut( String line ) {
351405
if (line == null) {
352406
throw new IllegalArgumentException("Line is null")
@@ -365,7 +419,7 @@ abstract class LogInjectionSmokeTest extends AbstractSmokeTest {
365419
@Flaky(condition = () -> JavaVirtualMachine.isIbm8() || JavaVirtualMachine.isOracleJDK8())
366420
def "check raw file injection"() {
367421
when:
368-
def count = waitForTraceCount(2)
422+
def count = waitForTraceCountAlive(2)
369423

370424
def newConfig = """
371425
{"lib_config":
@@ -374,14 +428,15 @@ abstract class LogInjectionSmokeTest extends AbstractSmokeTest {
374428
""".toString()
375429
setRemoteConfig("datadog/2/APM_TRACING/config_overrides/config", newConfig)
376430

377-
count = waitForTraceCount(3)
431+
count = waitForTraceCountAlive(3)
378432

379433
setRemoteConfig("datadog/2/APM_TRACING/config_overrides/config", """{"lib_config":{}}""".toString())
380434

381-
testedProcess.waitFor(TIMEOUT_SECS, SECONDS)
382-
def exitValue = testedProcess.exitValue()
435+
// Wait for all 4 traces before waiting for process exit to ensure trace delivery is confirmed
436+
count = waitForTraceCountAlive(4)
383437

384-
count = waitForTraceCount(4)
438+
assert testedProcess.waitFor(TIMEOUT_SECS, SECONDS) : "Process did not exit within ${TIMEOUT_SECS}s"
439+
def exitValue = testedProcess.exitValue()
385440

386441
def logLines = outputLogFile.readLines()
387442
println "log lines: " + logLines

0 commit comments

Comments
 (0)