Reduce flakiness in GrpcTest.clientCallAfterServerCompleted() #152
Open
trask wants to merge 1 commit into
…1_6.GrpcTest.clientCallAfterServerCompleted() Automated fix attempt based on Develocity flaky-test analysis.
Automated attempt at fixing flakiness in `io.opentelemetry.javaagent.instrumentation.grpc.v1_6.GrpcTest.clientCallAfterServerCompleted()`.
`instrumentation/grpc-1.6/javaagent/src/test/java/io/opentelemetry/javaagent/instrumentation/grpc/v1_6/GrpcTest.java`
Recent failed/flaky scans
- `:instrumentation:grpc-1.6:javaagent:testStableSemconv`
- `:instrumentation:grpc-1.6:javaagent:testStableSemconv`
- `:instrumentation:grpc-1.6:javaagent:testExperimental`
- `:instrumentation:grpc-1.6:javaagent:testBothSemconv`
- `:instrumentation:grpc-1.6:javaagent:testExperimental`
Flake history (per UTC day)
Sample failure (from Develocity)
Copilot diagnosis
Root cause
The frontend gRPC handler completes its response with `responseObserver.onCompleted()`, which causes the gRPC server-side `Context` to be cancelled. Under the javaagent, executor task scheduling propagates the current `Context` to the executor thread. The deferred `executor.execute(...)` then runs `backendStub.sayHello(request)` while inheriting the now-cancelled `Context`, and gRPC immediately fails the new outbound call with `StatusRuntimeException: CANCELLED: io.grpc.Context was cancelled without error`. The captured exception then surfaces as a failure of `assertThat(error).hasValue(null)`. Without the agent (library mode), executors do not propagate `Context`, so the bug is invisible there.
Fix
- In `AbstractGrpcTest.clientCallAfterServerCompleted()` (under `instrumentation/grpc-1.6/testing/...`), wrap the deferred backend call in `io.grpc.Context.ROOT.run(...)` so the new outbound RPC executes under a fresh, non-cancelled context instead of inheriting the cancelled one.
- Replace `clientCallDone.await(10, SECONDS);` (return value ignored) with `assertThat(clientCallDone.await(10, SECONDS)).as("client call should complete within timeout").isTrue();` so a real timeout produces a clean, deterministic failure rather than a silent race against a still-running background call.
Why this addresses the root cause
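The mechanism can be modelled without gRPC or the javaagent. The following is a simplified, JDK-only sketch, not the code from this PR: `ModelContext` is a hypothetical stand-in for `io.grpc.Context`, `wrap` imitates the agent propagating the submitter's context to the executor thread, and `sayHello` stands in for the outbound RPC. It assumes, as the diagnosis describes, that the request context is already cancelled by the time the deferred task runs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for io.grpc.Context: a thread-bound, cancellable scope. */
final class ModelContext {
  static final ModelContext ROOT = new ModelContext(); // never cancelled
  private static final ThreadLocal<ModelContext> CURRENT = ThreadLocal.withInitial(() -> ROOT);
  private volatile boolean cancelled;

  static ModelContext current() { return CURRENT.get(); }
  void cancel() { if (this != ROOT) cancelled = true; }
  boolean isCancelled() { return cancelled; }

  /** Runs the task with this context attached, then restores the previous one. */
  void run(Runnable task) {
    ModelContext previous = CURRENT.get();
    CURRENT.set(this);
    try { task.run(); } finally { CURRENT.set(previous); }
  }

  /** Imitates agent propagation: the task re-attaches this context on the worker thread. */
  Runnable wrap(Runnable task) { return () -> run(task); }
}

final class ContextPropagationDemo {
  /** Stand-in for an outbound RPC: fails fast if the current context is cancelled. */
  static String sayHello() {
    if (ModelContext.current().isCancelled()) {
      throw new IllegalStateException("CANCELLED: context was cancelled");
    }
    return "Hello";
  }

  /** Returns the error captured by the deferred call, or null if the call succeeded. */
  static Throwable deferredCall(boolean detachWithRoot) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    AtomicReference<Throwable> error = new AtomicReference<>();
    CountDownLatch done = new CountDownLatch(1);
    ModelContext requestContext = new ModelContext();
    try {
      requestContext.run(() -> {
        Runnable call = () -> {
          try { sayHello(); } catch (Throwable t) { error.set(t); } finally { done.countDown(); }
        };
        // The fix: run the call under ROOT instead of the inherited context.
        Runnable task = detachWithRoot ? () -> ModelContext.ROOT.run(call) : call;
        requestContext.cancel(); // the handler has completed its response
        executor.execute(ModelContext.current().wrap(task));
      });
      if (!done.await(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("deferred call did not finish in time");
      }
      return error.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("inherited context: " + deferredCall(false));
    System.out.println("detached via ROOT: " + deferredCall(true));
  }
}
```

With `detachWithRoot` set to `false` the deferred call observes the cancelled context and records an error; with `true` it runs under `ROOT` and records none. In the real test the equivalent step is the `io.grpc.Context.ROOT.run(...)` wrapper named in the Fix section.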
`Context.ROOT.run(...)` is gRPC's canonical pattern for "run this work outside the lifetime of my caller's context." It detaches the deferred task from the cancelled parent context, so the outbound `sayHello` call no longer sees a cancelled state and proceeds normally. The `await` assertion closes a secondary race in which the test could read `error` before the background thread finished writing to it.
Risks / follow-ups
- … `Context` from the originating server-side request) deserves a fix in the gRPC instrumentation itself; this PR does not address that.
- `assertTrue` will produce real timeout failures where the old code silently passed. That is intentional, but maintainers should keep an eye on it after merge.
Generated locally by `.github/scripts/flaky-test-fix/run-local.sh`. Review the diagnosis and the diff carefully before merging - automated fixes can mask flakiness instead of addressing the root cause.
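The `await` change described under Fix relies on `CountDownLatch.await(long, TimeUnit)` returning `false` on timeout rather than throwing. A minimal JDK-only sketch of that behaviour follows; the class and method names are hypothetical, and the real test asserts on the return value with AssertJ instead of returning it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class AwaitResultDemo {
  /** Waits up to timeoutMillis; returns true only if the latch reached zero in time. */
  static boolean awaitChecked(CountDownLatch latch, long timeoutMillis) {
    try {
      return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for callers
      return false;
    }
  }

  public static void main(String[] args) {
    CountDownLatch finished = new CountDownLatch(1);
    finished.countDown(); // background work completed
    CountDownLatch stuck = new CountDownLatch(1); // background work never completes

    System.out.println(awaitChecked(finished, 100)); // true
    System.out.println(awaitChecked(stuck, 100));    // false: ignoring this hides the timeout
  }
}
```

Ignoring the return value, as the old code did, lets the test continue after a timeout and read state that the background thread may still be writing.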