Reduce flakiness in io.opentelemetry.javaagent.instrumentation.vertx.rx.v3_5.server.VertxReactivePropagationTest.highConcurrency() #18508

Closed
trask wants to merge 1 commit into open-telemetry:main from trask:otelbot/flaky-test-remediation-io-opentelemetry-javaagent-instrumentation-vertx-rx-v3-5-ser-20260502160046

@trask trask commented May 2, 2026

Automated attempt at fixing flakiness in io.opentelemetry.javaagent.instrumentation.vertx.rx.v3_5.server.VertxReactivePropagationTest.highConcurrency().

Recent failed/flaky scans

  • z22sma3k2dags (flaky, :instrumentation:vertx:vertx-rx-java-3.5:javaagent:version5TestStableSemconv)

Flake history (per UTC day)

Day         flaky  failed  passed
2026-04-25     10       0     105
2026-04-26      6       0     100
2026-04-27     17       0     223
2026-04-28     33       0     355
2026-04-29     28       0     316
2026-04-30     35       0     346
2026-05-01     14       0     149
2026-05-02      9       0      99

Sample failure (from Develocity)

java.lang.AssertionError: [Trace 3] 
Expected size: 5 but was: 6 in:
[TestSpanData{name=client 5, kind=INTERNAL, spanContext=ImmutableSpanContext{traceId=deb4def3c70141b0cd09f9b93b7bf43e, spanId=b121751c799b4765, traceFlags=03, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=true}, parentSpanContext=ImmutableSpanContext{traceId=00000000000000000000000000000000, spanId=0000000000000000, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=false}, status=ImmutableStatusData{statusCode=UNSET, description=}, startEpochNanos=1777703214468606836, attributes={test.request.id=5}, events=[], links=[], endEpochNanos=1777703214532726185, totalRecordedEvents=0, totalRecordedLinks=0, totalAttributeCount=1, resource=Resource{schemaUrl=null, attributes={service.instance.id="db00169f-6eb0-4ff7-ada9-c88702c19f7e", service.name="unknown_service:java", telemetry.distro.name="opentelemetry-java-instrumentation", telemetry.distro.version="2.28.0-SNAPSHOT", telem…

Copilot diagnosis

Root cause

The failing traces contained six spans instead of the expected five because two high-concurrency client requests were occasionally associated with the same trace. The test reused executor threads and started each synthetic client N span from the ambient Context.current(), so any context left current on a reused worker could make the next request inherit the previous trace. The submitted task futures were also ignored, so latch interruptions or request failures could stay hidden until the span assertions timed out or observed incomplete or unexpected telemetry.
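
To make the suspected leak concrete, here is a minimal sketch of how a value left in thread-local state on a reused executor thread can be picked up by the next task. It deliberately does not use the OpenTelemetry API: `CURRENT_TRACE`, `StaleContextDemo`, and `traceSeenBySecondTask` are hypothetical names, and a plain `ThreadLocal<String>` stands in for the current context. It only illustrates the mechanism described in the diagnosis, not the actual test code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class StaleContextDemo {
  // Hypothetical stand-in for a thread-local "current context"; NOT the OpenTelemetry API.
  static final ThreadLocal<String> CURRENT_TRACE = ThreadLocal.withInitial(() -> "root");

  /** Runs two tasks on one reused worker thread; returns the trace the second task observes. */
  static String traceSeenBySecondTask(boolean resetToRoot) throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      // First task makes a trace current and never restores the previous value.
      executor.submit(() -> CURRENT_TRACE.set("trace-of-request-1")).get(5, TimeUnit.SECONDS);

      // Second task runs on the same worker thread as the first.
      return executor
          .submit(
              () -> {
                if (resetToRoot) {
                  // Analogous in spirit to starting from a root context.
                  CURRENT_TRACE.set("root");
                }
                return CURRENT_TRACE.get();
              })
          .get(5, TimeUnit.SECONDS);
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(traceSeenBySecondTask(false)); // prints "trace-of-request-1"
    System.out.println(traceSeenBySecondTask(true)); // prints "root"
  }
}
```

Without the reset, the second task inherits whatever the first task left behind, which would match the "two requests in one trace" symptom if the real test behaves the same way.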

Fix

  • Start every synthetic client request from Context.root() before creating the client N span.
  • Keep the existing explicit context injection into the HTTP request so the server-side propagation coverage is unchanged.
  • Track submitted futures and wait for each request task to complete before asserting traces.
  • Assert the start latch wait result instead of ignoring CountDownLatch.await(...).
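
The last two bullets can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the diff from this PR: the class and method names (`TrackedRequestsDemo`, `runRequests`), the pool size, and the timeouts are all hypothetical, and a counter increment stands in for the real HTTP request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class TrackedRequestsDemo {
  /** Submits {@code count} placeholder requests and returns how many completed. */
  static int runRequests(int count) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(4); // hypothetical pool size
    CountDownLatch startLatch = new CountDownLatch(1);
    AtomicInteger completed = new AtomicInteger();
    List<Future<?>> futures = new ArrayList<>();
    try {
      for (int i = 0; i < count; i++) {
        futures.add(
            executor.submit(
                () -> {
                  // Fail the task if the latch wait times out instead of ignoring the result.
                  if (!startLatch.await(10, TimeUnit.SECONDS)) {
                    throw new IllegalStateException("start latch was not released in time");
                  }
                  completed.incrementAndGet(); // placeholder for the real request
                  return null;
                }));
      }
      startLatch.countDown();

      // Bounded wait on every task: a failure or hang surfaces here as an
      // ExecutionException or TimeoutException, before any trace assertions run.
      for (Future<?> future : futures) {
        future.get(30, TimeUnit.SECONDS);
      }
      return completed.get();
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(runRequests(100)); // prints 100
  }
}
```

Because every `Future` is awaited, a task that throws fails the caller immediately with the underlying cause rather than showing up later as a missing or extra span.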

Why this addresses the root cause

Forcing a root context isolates each concurrent request from stale thread-local context on reused executor threads, so each request owns exactly one client root span and one propagated server trace. Waiting on the futures makes task-level failures deterministic and ensures trace assertions run only after all 100 requests have completed.

Risks / follow-ups

  • The new bounded future waits may surface genuine request hangs as test failures instead of letting trace polling mask them.
  • The same change was applied to the duplicated Vert.x 3.5, 4.1, and 5 test-source variants, so maintainers should confirm that all three variants are expected to remain identical.

Review the diagnosis and the diff carefully before merging - automated fixes can mask flakiness instead of addressing the root cause.

…rx.v3_5.server.VertxReactivePropagationTest.highConcurrency()

Automated fix attempt based on Develocity flaky-test analysis.
@trask trask force-pushed the otelbot/flaky-test-remediation-io-opentelemetry-javaagent-instrumentation-vertx-rx-v3-5-ser-20260502160046 branch from f78b6e5 to d44376e Compare May 2, 2026 16:32
trask commented May 2, 2026

Closing in favor of more conservative change: #18511

@trask trask closed this May 2, 2026