Capture deeper diagnostics for log-injection smoke test flake#11211
Draft
Capture deeper diagnostics for log-injection smoke test flake#11211
Conversation
PR #11075 narrowed the flake to "process alive, RC polling, captured stdout empty" but couldn't pinpoint where the JVM was wedged. This extends the failure diagnostic to capture every signal needed to discriminate the remaining hypotheses in a single failure. On the failure path of `waitForTraceCountAlive`: - SIGQUIT to the live process so its thread dump flows through the captured stdout pipe (when the OutputThreads writer is alive) - bounded tail of captured stdout via RandomAccessFile (avoids loading a now-thread-dump-inflated log into memory) - bounded tail of the application logger file (logback/log4j2/JBoss target via dd.test.logfile) — distinguishes "app never ran" from "app ran but stdout pipe lost output" - enumeration of the smoke-output ThreadGroup so we can see if the OutputThreads writer thread is dead - jcmd Thread.print fallback that drains stdout incrementally (codex review caught the pipe-buffer deadlock the naive form would hit) - previously-swallowed rcClientDecodingFailure surfaced - launch command line included so we don't have to dig through the Gradle test reports Process.pid() is guarded for Java 8, and resolveJcmdPath probes both JRE and JDK layouts. tag: no release note tag: ai generated Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BenchmarksStartupParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 61 metrics, 10 unstable metrics. Startup time reports for insecure-bankgantt
title insecure-bank - global startup overhead: candidate=1.62.0-SNAPSHOT~cb27fd64db, baseline=1.62.0-SNAPSHOT~3ba8d13c73
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.065 s) : 0, 1065486
Total [baseline] (8.803 s) : 0, 8803278
Agent [candidate] (1.063 s) : 0, 1062711
Total [candidate] (8.86 s) : 0, 8859633
section iast
Agent [baseline] (1.249 s) : 0, 1248750
Total [baseline] (9.539 s) : 0, 9538927
Agent [candidate] (1.251 s) : 0, 1250835
Total [candidate] (9.523 s) : 0, 9522530
gantt
title insecure-bank - break down per module: candidate=1.62.0-SNAPSHOT~cb27fd64db, baseline=1.62.0-SNAPSHOT~3ba8d13c73
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.232 ms) : 0, 1232
crashtracking [candidate] (1.216 ms) : 0, 1216
BytebuddyAgent [baseline] (637.21 ms) : 0, 637210
BytebuddyAgent [candidate] (635.013 ms) : 0, 635013
AgentMeter [baseline] (29.477 ms) : 0, 29477
AgentMeter [candidate] (29.656 ms) : 0, 29656
GlobalTracer [baseline] (248.826 ms) : 0, 248826
GlobalTracer [candidate] (249.734 ms) : 0, 249734
AppSec [baseline] (32.889 ms) : 0, 32889
AppSec [candidate] (33.081 ms) : 0, 33081
Debugger [baseline] (59.897 ms) : 0, 59897
Debugger [candidate] (60.297 ms) : 0, 60297
Remote Config [baseline] (596.576 µs) : 0, 597
Remote Config [candidate] (603.398 µs) : 0, 603
Telemetry [baseline] (9.65 ms) : 0, 9650
Telemetry [candidate] (8.942 ms) : 0, 8942
Flare Poller [baseline] (9.752 ms) : 0, 9752
Flare Poller [candidate] (8.257 ms) : 0, 8257
section iast
crashtracking [baseline] (1.243 ms) : 0, 1243
crashtracking [candidate] (1.232 ms) : 0, 1232
BytebuddyAgent [baseline] (828.4 ms) : 0, 828400
BytebuddyAgent [candidate] (830.064 ms) : 0, 830064
AgentMeter [baseline] (11.41 ms) : 0, 11410
AgentMeter [candidate] (11.418 ms) : 0, 11418
GlobalTracer [baseline] (238.347 ms) : 0, 238347
GlobalTracer [candidate] (239.246 ms) : 0, 239246
AppSec [baseline] (30.359 ms) : 0, 30359
AppSec [candidate] (32.29 ms) : 0, 32290
Debugger [baseline] (62.262 ms) : 0, 62262
Debugger [candidate] (62.294 ms) : 0, 62294
Remote Config [baseline] (520.643 µs) : 0, 521
Remote Config [candidate] (524.722 µs) : 0, 525
Telemetry [baseline] (7.628 ms) : 0, 7628
Telemetry [candidate] (7.66 ms) : 0, 7660
Flare Poller [baseline] (3.419 ms) : 0, 3419
Flare Poller [candidate] (3.307 ms) : 0, 3307
IAST [baseline] (29.046 ms) : 0, 29046
IAST [candidate] (26.599 ms) : 0, 26599
Startup time reports for petclinicgantt
title petclinic - global startup overhead: candidate=1.62.0-SNAPSHOT~cb27fd64db, baseline=1.62.0-SNAPSHOT~3ba8d13c73
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.064 s) : 0, 1064282
Total [baseline] (11.05 s) : 0, 11050284
Agent [candidate] (1.069 s) : 0, 1069466
Total [candidate] (11.075 s) : 0, 11075238
section appsec
Agent [baseline] (1.273 s) : 0, 1273056
Total [baseline] (11.067 s) : 0, 11066925
Agent [candidate] (1.268 s) : 0, 1267613
Total [candidate] (11.046 s) : 0, 11046423
section iast
Agent [baseline] (1.267 s) : 0, 1267360
Total [baseline] (11.331 s) : 0, 11331229
Agent [candidate] (1.243 s) : 0, 1243230
Total [candidate] (11.251 s) : 0, 11251021
section profiling
Agent [baseline] (1.185 s) : 0, 1185397
Total [baseline] (11.052 s) : 0, 11052359
Agent [candidate] (1.189 s) : 0, 1189067
Total [candidate] (11.09 s) : 0, 11089924
gantt
title petclinic - break down per module: candidate=1.62.0-SNAPSHOT~cb27fd64db, baseline=1.62.0-SNAPSHOT~3ba8d13c73
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.217 ms) : 0, 1217
crashtracking [candidate] (1.227 ms) : 0, 1227
BytebuddyAgent [baseline] (636.454 ms) : 0, 636454
BytebuddyAgent [candidate] (639.569 ms) : 0, 639569
AgentMeter [baseline] (29.454 ms) : 0, 29454
AgentMeter [candidate] (29.684 ms) : 0, 29684
GlobalTracer [baseline] (248.773 ms) : 0, 248773
GlobalTracer [candidate] (250.116 ms) : 0, 250116
AppSec [baseline] (32.755 ms) : 0, 32755
AppSec [candidate] (32.85 ms) : 0, 32850
Debugger [baseline] (60.523 ms) : 0, 60523
Debugger [candidate] (60.867 ms) : 0, 60867
Remote Config [baseline] (592.92 µs) : 0, 593
Remote Config [candidate] (593.866 µs) : 0, 594
Telemetry [baseline] (8.111 ms) : 0, 8111
Telemetry [candidate] (9.6 ms) : 0, 9600
Flare Poller [baseline] (10.491 ms) : 0, 10491
Flare Poller [candidate] (8.98 ms) : 0, 8980
section appsec
crashtracking [baseline] (1.228 ms) : 0, 1228
crashtracking [candidate] (1.222 ms) : 0, 1222
BytebuddyAgent [baseline] (680.288 ms) : 0, 680288
BytebuddyAgent [candidate] (677.635 ms) : 0, 677635
AgentMeter [baseline] (12.306 ms) : 0, 12306
AgentMeter [candidate] (12.352 ms) : 0, 12352
GlobalTracer [baseline] (250.807 ms) : 0, 250807
GlobalTracer [candidate] (250.64 ms) : 0, 250640
AppSec [baseline] (185.843 ms) : 0, 185843
AppSec [candidate] (186.022 ms) : 0, 186022
Debugger [baseline] (65.329 ms) : 0, 65329
Debugger [candidate] (64.879 ms) : 0, 64879
Remote Config [baseline] (582.616 µs) : 0, 583
Remote Config [candidate] (579.585 µs) : 0, 580
Telemetry [baseline] (7.674 ms) : 0, 7674
Telemetry [candidate] (7.622 ms) : 0, 7622
Flare Poller [baseline] (5.69 ms) : 0, 5690
Flare Poller [candidate] (5.626 ms) : 0, 5626
IAST [baseline] (24.826 ms) : 0, 24826
IAST [candidate] (24.913 ms) : 0, 24913
section iast
crashtracking [baseline] (1.25 ms) : 0, 1250
crashtracking [candidate] (1.229 ms) : 0, 1229
BytebuddyAgent [baseline] (843.307 ms) : 0, 843307
BytebuddyAgent [candidate] (823.126 ms) : 0, 823126
AgentMeter [baseline] (11.524 ms) : 0, 11524
AgentMeter [candidate] (11.309 ms) : 0, 11309
GlobalTracer [baseline] (239.584 ms) : 0, 239584
GlobalTracer [candidate] (237.373 ms) : 0, 237373
AppSec [baseline] (29.153 ms) : 0, 29153
AppSec [candidate] (31.34 ms) : 0, 31340
Debugger [baseline] (63.783 ms) : 0, 63783
Debugger [candidate] (62.832 ms) : 0, 62832
Remote Config [baseline] (537.124 µs) : 0, 537
Remote Config [candidate] (523.253 µs) : 0, 523
Telemetry [baseline] (7.834 ms) : 0, 7834
Telemetry [candidate] (7.691 ms) : 0, 7691
Flare Poller [baseline] (3.473 ms) : 0, 3473
Flare Poller [candidate] (3.468 ms) : 0, 3468
IAST [baseline] (28.891 ms) : 0, 28891
IAST [candidate] (26.647 ms) : 0, 26647
section profiling
crashtracking [baseline] (1.181 ms) : 0, 1181
crashtracking [candidate] (1.173 ms) : 0, 1173
BytebuddyAgent [baseline] (691.877 ms) : 0, 691877
BytebuddyAgent [candidate] (694.88 ms) : 0, 694880
AgentMeter [baseline] (8.868 ms) : 0, 8868
AgentMeter [candidate] (8.912 ms) : 0, 8912
GlobalTracer [baseline] (207.36 ms) : 0, 207360
GlobalTracer [candidate] (208.002 ms) : 0, 208002
AppSec [baseline] (32.742 ms) : 0, 32742
AppSec [candidate] (32.576 ms) : 0, 32576
Debugger [baseline] (65.843 ms) : 0, 65843
Debugger [candidate] (65.657 ms) : 0, 65657
Remote Config [baseline] (568.088 µs) : 0, 568
Remote Config [candidate] (576.076 µs) : 0, 576
Telemetry [baseline] (7.801 ms) : 0, 7801
Telemetry [candidate] (7.765 ms) : 0, 7765
Flare Poller [baseline] (3.557 ms) : 0, 3557
Flare Poller [candidate] (3.519 ms) : 0, 3519
ProfilingAgent [baseline] (93.901 ms) : 0, 93901
ProfilingAgent [candidate] (94.64 ms) : 0, 94640
Profiling [baseline] (94.466 ms) : 0, 94466
Profiling [candidate] (95.193 ms) : 0, 95193
LoadParameters
See matching parameters
SummaryFound 0 performance improvements and 3 performance regressions! Performance is the same for 17 metrics, 16 unstable metrics.
Request duration reports for petclinicgantt
title petclinic - request duration [CI 0.99] : candidate=1.62.0-SNAPSHOT~cb27fd64db, baseline=1.62.0-SNAPSHOT~3ba8d13c73
dateFormat X
axisFormat %s
section baseline
no_agent (17.236 ms) : 17066, 17406
. : milestone, 17236,
appsec (18.598 ms) : 18409, 18787
. : milestone, 18598,
code_origins (17.996 ms) : 17818, 18175
. : milestone, 17996,
iast (17.98 ms) : 17803, 18158
. : milestone, 17980,
profiling (18.902 ms) : 18715, 19090
. : milestone, 18902,
tracing (17.917 ms) : 17740, 18094
. : milestone, 17917,
section candidate
no_agent (19.222 ms) : 19024, 19421
. : milestone, 19222,
appsec (18.875 ms) : 18683, 19066
. : milestone, 18875,
code_origins (17.901 ms) : 17723, 18078
. : milestone, 17901,
iast (17.749 ms) : 17575, 17923
. : milestone, 17749,
profiling (18.676 ms) : 18491, 18862
. : milestone, 18676,
tracing (18.651 ms) : 18465, 18837
. : milestone, 18651,
Request duration reports for insecure-bankgantt
title insecure-bank - request duration [CI 0.99] : candidate=1.62.0-SNAPSHOT~cb27fd64db, baseline=1.62.0-SNAPSHOT~3ba8d13c73
dateFormat X
axisFormat %s
section baseline
no_agent (1.276 ms) : 1264, 1289
. : milestone, 1276,
iast (3.242 ms) : 3206, 3279
. : milestone, 3242,
iast_FULL (6.065 ms) : 6003, 6127
. : milestone, 6065,
iast_GLOBAL (3.601 ms) : 3541, 3660
. : milestone, 3601,
profiling (2.007 ms) : 1991, 2023
. : milestone, 2007,
tracing (1.932 ms) : 1916, 1949
. : milestone, 1932,
section candidate
no_agent (1.249 ms) : 1236, 1261
. : milestone, 1249,
iast (3.411 ms) : 3365, 3456
. : milestone, 3411,
iast_FULL (6.212 ms) : 6148, 6276
. : milestone, 6212,
iast_GLOBAL (3.773 ms) : 3708, 3839
. : milestone, 3773,
profiling (2.1 ms) : 2082, 2119
. : milestone, 2100,
tracing (1.895 ms) : 1879, 1912
. : milestone, 1895,
DacapoParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 11 metrics, 1 unstable metrics. Execution time for tomcatgantt
title tomcat - execution time [CI 0.99] : candidate=1.62.0-SNAPSHOT~cb27fd64db, baseline=1.62.0-SNAPSHOT~3ba8d13c73
dateFormat X
axisFormat %s
section baseline
no_agent (1.486 ms) : 1474, 1498
. : milestone, 1486,
appsec (3.785 ms) : 3567, 4004
. : milestone, 3785,
iast (2.289 ms) : 2219, 2360
. : milestone, 2289,
iast_GLOBAL (2.332 ms) : 2260, 2403
. : milestone, 2332,
profiling (2.116 ms) : 2060, 2172
. : milestone, 2116,
tracing (2.082 ms) : 2028, 2136
. : milestone, 2082,
section candidate
no_agent (1.489 ms) : 1478, 1501
. : milestone, 1489,
appsec (3.825 ms) : 3602, 4049
. : milestone, 3825,
iast (2.295 ms) : 2223, 2367
. : milestone, 2295,
iast_GLOBAL (2.336 ms) : 2264, 2408
. : milestone, 2336,
profiling (2.115 ms) : 2059, 2172
. : milestone, 2115,
tracing (2.086 ms) : 2031, 2140
. : milestone, 2086,
Execution time for biojavagantt
title biojava - execution time [CI 0.99] : candidate=1.62.0-SNAPSHOT~cb27fd64db, baseline=1.62.0-SNAPSHOT~3ba8d13c73
dateFormat X
axisFormat %s
section baseline
no_agent (15.547 s) : 15547000, 15547000
. : milestone, 15547000,
appsec (14.836 s) : 14836000, 14836000
. : milestone, 14836000,
iast (18.438 s) : 18438000, 18438000
. : milestone, 18438000,
iast_GLOBAL (17.854 s) : 17854000, 17854000
. : milestone, 17854000,
profiling (15.024 s) : 15024000, 15024000
. : milestone, 15024000,
tracing (14.944 s) : 14944000, 14944000
. : milestone, 14944000,
section candidate
no_agent (15.429 s) : 15429000, 15429000
. : milestone, 15429000,
appsec (14.921 s) : 14921000, 14921000
. : milestone, 14921000,
iast (18.821 s) : 18821000, 18821000
. : milestone, 18821000,
iast_GLOBAL (17.939 s) : 17939000, 17939000
. : milestone, 17939000,
profiling (15.044 s) : 15044000, 15044000
. : milestone, 15044000,
tracing (15.115 s) : 15115000, 15115000
. : milestone, 15115000,
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What Does This Do
Extends the failure diagnostic in
LogInjectionSmokeTest.waitForTraceCountAliveso the next failure ofcheck raw file injectionproduces enough information to identify the actual root cause in one CI run, rather than requiring another iteration. Test-only change; no agent or product code touched.Motivation
Prior work
This test has flaked across 11+ logging backends for months.
BaseApplication.TIMEOUT_IN_NANOSfires before the test's 30sPollingConditions, kills the daemon trace writerThe two
#11075-instrumented failures both showed:Hypotheses ruled out by #11075
process.alive=true/v0.7/configin ~27straceDecodingFailuretest_ssi_smoke, but 4/16 are regulartest_smoke. Same root cause in both job typesWhat #11075 could not tell us (and why this PR exists)
The captured stdout file is empty. With dd-trace at
defaultLogLevel=debug, that is unexpected: in 30s of agent + app startup we would normally see thousands of log lines. ThetailProcessLogfrom #11075 cannot distinguish:ClassLoadCallBack.run()defer pattern (Agent.java:596) that exists because this class of bug is real@Traceadvice deadlocks on first invocationfirstTracedMethod;System.out.println("FIRSTTRACEID …")is inside the wedged advicewaitForCondition; spans were created but never flushed@Traceis no-op0wc.write/ decoder)writeBytesAgent.java:689")The existing fields are insufficient to discriminate these. The next iteration should not have to be another guessing game.
Additional Notes
What this PR captures
In
waitForTraceCountAlive's catch block (failure path only),captureFullDiagnosticcollects:Process.pid()is Java 9+, guarded for zulu8outputLogFile, thedd.test.logfiletarget) — bounded tail. This is the H1/H2 vs H3/H4 discriminator: if the app log hasBEFORE FIRST SPAN, the app started; if it has all 4 lines, the app finishedsmoke-outputThreadGroup enumeration — count + state ofOutputThreadswriter threads. Non-empty thread set tells us H5 is wrong; if empty, the captured file is unreliable regardlessjcmd Thread.printfallback — works even if theOutputThreadswriter is dead. Drains stdout incrementally (the naive form deadlocks on full pipes, per codex review)rcClientDecodingFailure— surfacedHow each remaining hypothesis falls out
BLOCKEDonloadClass/<clinit>, monitor held by an agent transformer thread@Traceadvicedatadog.trace.bootstrap.instrumentation.api.*/datadog.trace.core.*BEFORE/INSIDE/AFTERlines; captured stdout hasFIRSTTRACEID/SECONDTRACEID; thread dump shows trace-writer/disruptor threadBLOCKED/WAITINGFIRSTTRACEID 0 0(literal zero ids)smoke-output threads=0(or threads exist but inTERMINATEDstate)mtime; thread dump shows main inwriteBytesnative frameReview fixes already applied
Codex adversarial review caught two issues that would have made this PR ironic — diagnostic code that itself behaves badly under stress:
proc.in.textafterproc.waitFor()self-deadlocks when the dump exceeds the OS pipe buffer (~64KB on Linux), exactly when we most need it. The implementation now drains stdout incrementally withredirectErrorStream(true).readLines()on a thread-dump-inflated log loads everything into memory and could OOM the Gradle worker. Replaced with a boundedRandomAccessFiletail.Contributor Checklist
type: bug,comp: testing,tag: ai generated,tag: no release notes,tag: flaky test,tag: diagnosticsJira ticket: N/A — test infrastructure / flake investigation
🤖 Generated with Claude Code