Reduce flakiness in GrpcTest.clientCallAfterServerCompleted()#152

Open
trask wants to merge 1 commit into main from otelbot/flaky-fix-io-opentelemetry-javaagent-instrumentation-grpc-v1-6-GrpcTes-20260501012637

@trask trask commented May 1, 2026

Automated attempt at fixing flakiness in io.opentelemetry.javaagent.instrumentation.grpc.v1_6.GrpcTest.clientCallAfterServerCompleted().

Recent failed/flaky scans

  • t4zimg4w7ytzc (flaky, :instrumentation:grpc-1.6:javaagent:testStableSemconv)
  • ybfgrbxorublm (flaky, :instrumentation:grpc-1.6:javaagent:testStableSemconv)
  • 4ly6xtvtirmeg (flaky, :instrumentation:grpc-1.6:javaagent:testExperimental)
  • gu2n3rvmwkrsy (flaky, :instrumentation:grpc-1.6:javaagent:testBothSemconv)
  • 6gv7ek33bw7b4 (flaky, :instrumentation:grpc-1.6:javaagent:testExperimental)

Flake history (per UTC day)

Day         flaky  failed  passed
2026-04-24      0       0     318
2026-04-25      0       0     118
2026-04-26     55       0      75
2026-04-27     75       0     142
2026-04-28      0       0     287
2026-04-29      0       0     219
2026-04-30      0       0     300

Sample failure (from Develocity)

java.lang.AssertionError: 
Expecting AtomicReference[io.grpc.StatusRuntimeException: CANCELLED: io.grpc.Context was cancelled without error
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:368)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:349)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:174)
	...(6 remaining lines not displayed - this can be changed with Assertions.setMaxStackTraceElementsDisplayed)] to have value:
  null
but did not.

Copilot diagnosis

Root cause

The frontend gRPC handler completes its response with responseObserver.onCompleted(), which cancels the server-side gRPC Context for that call. Under the javaagent, executor task scheduling propagates the current Context to the executor thread. The deferred executor.execute(...) task therefore runs backendStub.sayHello(request) under the now-cancelled Context, and gRPC immediately fails the new outbound call with StatusRuntimeException: CANCELLED: io.grpc.Context was cancelled without error. The test captures that exception in error, so assertThat(error).hasValue(null) fails. Without the agent (library mode), executors do not propagate Context, so the bug is invisible there.

Fix

  • In AbstractGrpcTest.clientCallAfterServerCompleted() (under instrumentation/grpc-1.6/testing/...), wrap the deferred backend call in io.grpc.Context.ROOT.run(...) so the new outbound RPC executes under a fresh, non-cancelled context instead of inheriting the cancelled one.
  • Replace clientCallDone.await(10, SECONDS); (return value ignored) with assertThat(clientCallDone.await(10, SECONDS)).as("client call should complete within timeout").isTrue(); so a real timeout produces a clean, deterministic failure rather than a silent race against a still-running background call.
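
The difference described in the second bullet can be sketched with plain JDK classes. This is a minimal, hypothetical model and not the code in AbstractGrpcTest: LatchAwaitDemo, Outcome, and runOnce are illustrative names, the worker thread stands in for the background backend call, and the timeouts are shortened. It relies only on the fact that CountDownLatch.await(long, TimeUnit) returns false when the wait times out.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class LatchAwaitDemo {

  /** What the test thread saw: did await() succeed, and what did it read from the error holder? */
  static final class Outcome {
    final boolean completedInTime;
    final Throwable observedError;

    Outcome(boolean completedInTime, Throwable observedError) {
      this.completedInTime = completedInTime;
      this.observedError = observedError;
    }
  }

  /**
   * Simulates a background call that always ends by recording a failure.
   * When letWorkerFinish is false, the worker is held back until after the
   * test thread has already read the error holder (the "still running" case).
   */
  static Outcome runOnce(boolean letWorkerFinish) {
    CountDownLatch gate = new CountDownLatch(1); // holds the worker back
    CountDownLatch done = new CountDownLatch(1); // plays the role of clientCallDone
    AtomicReference<Throwable> error = new AtomicReference<>();

    Thread worker = new Thread(() -> {
      try {
        gate.await();
        error.set(new IllegalStateException("simulated backend failure"));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        done.countDown();
      }
    });
    worker.setDaemon(true);
    worker.start();

    if (letWorkerFinish) {
      gate.countDown();
    }

    boolean completed;
    try {
      // The return value is the whole point: false means "timed out".
      completed = done.await(letWorkerFinish ? 5_000 : 100, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      completed = false;
    }
    Throwable observed = error.get(); // read happens whether or not the worker finished
    gate.countDown(); // release the worker so the thread does not linger
    return new Outcome(completed, observed);
  }

  public static void main(String[] args) {
    Outcome stalled = runOnce(false);
    System.out.println("stalled:  completed=" + stalled.completedInTime
        + ", error=" + stalled.observedError);
    Outcome finished = runOnce(true);
    System.out.println("finished: completed=" + finished.completedInTime
        + ", error=" + finished.observedError);
  }
}
```

In the stalled run the error holder is still empty when it is read, so a check equivalent to hasValue(null) would pass even though the worker was about to record a failure. Asserting on the boolean returned by await turns that case into an explicit timeout failure.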

Why this addresses the root cause

Context.ROOT.run(...) is gRPC's canonical pattern for "run this work outside the lifetime of my caller's context." It detaches the deferred task from the cancelled parent context, so the outbound sayHello call no longer sees a cancelled state and proceeds normally. The await assertion closes a secondary race in which the test could read error before the background thread finished writing to it.
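
The mechanism can be illustrated without any gRPC dependency. The sketch below is a toy model under stated assumptions, not gRPC or javaagent code: ToyContext is a hypothetical stand-in for io.grpc.Context (a thread-bound object with a cancelled flag), wrap(...) imitates the agent carrying the submitting thread's context onto the executor thread, and outboundCall() imitates a client call that fails when the current context is cancelled. All names are invented for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

public class ToyContextDemo {

  /** Hypothetical stand-in for io.grpc.Context; only models "current" and "cancelled". */
  static final class ToyContext {
    static final ToyContext ROOT = new ToyContext(); // never cancelled
    private static final ThreadLocal<ToyContext> CURRENT = ThreadLocal.withInitial(() -> ROOT);
    private final AtomicBoolean cancelled = new AtomicBoolean();

    static ToyContext current() {
      return CURRENT.get();
    }

    boolean isCancelled() {
      return cancelled.get();
    }

    void cancel() {
      if (this != ROOT) {
        cancelled.set(true);
      }
    }

    /** Runs r with this context as the thread's current one, then restores the previous one. */
    void run(Runnable r) {
      ToyContext previous = CURRENT.get();
      CURRENT.set(this);
      try {
        r.run();
      } finally {
        CURRENT.set(previous);
      }
    }

    /** Imitates context propagation: the task will run under this context on any thread. */
    Runnable wrap(Runnable r) {
      return () -> run(r);
    }
  }

  /** Imitates an outbound call that is rejected when the current context is cancelled. */
  static String outboundCall() {
    return ToyContext.current().isCancelled() ? "CANCELLED" : "OK";
  }

  static String simulate(boolean detachWithRoot) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    CountDownLatch handlerCompleted = new CountDownLatch(1);
    AtomicReference<String> status = new AtomicReference<>();
    ToyContext serverCall = new ToyContext();
    try {
      Runnable deferred = () -> {
        try {
          handlerCompleted.await(); // run only after the handler has completed
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        if (detachWithRoot) {
          ToyContext.ROOT.run(() -> status.set(outboundCall())); // the fix
        } else {
          status.set(outboundCall()); // inherits the propagated context
        }
      };
      // Equivalent to capturing the current context while the handler is running.
      Future<?> pending = executor.submit(serverCall.wrap(deferred));
      serverCall.cancel(); // the handler completes, so its context is cancelled
      handlerCompleted.countDown();
      pending.get(5, TimeUnit.SECONDS);
      return status.get();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("inherited context: " + simulate(false));
    System.out.println("detached via ROOT: " + simulate(true));
  }
}
```

In this model the deferred task fails only because it runs under the propagated, already-cancelled context; running the same call under ROOT sidesteps that without changing how the task is scheduled.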

Risks / follow-ups

  • This is a test-side workaround. Arguably the underlying instrumentation footgun (executor-scheduled tasks inheriting a cancelled Context from the originating server-side request) deserves a fix in the gRPC instrumentation itself; this PR does not address that.
  • If the background call is genuinely slow on some agents (>10 s), the new assertThat(...).isTrue() assertion will produce real timeout failures where the old code silently passed. That is intentional, but maintainers should keep an eye on it after merge.
  • Verified locally that the test passes; not yet validated under sustained CI load.

Generated locally by .github/scripts/flaky-test-fix/run-local.sh. Review the diagnosis and the diff carefully before merging - automated fixes can mask flakiness instead of addressing the root cause.

…1_6.GrpcTest.clientCallAfterServerCompleted()

Automated fix attempt based on Develocity flaky-test analysis.
