CSHARP-5935: Command activities may be skipped when using pooled connection#1918
CSHARP-5935: Command activities may be skipped when using pooled connection#1918ajcvickers wants to merge 6 commits intomongodb:mainfrom
Conversation
| collection.Find(Builders<BsonDocument>.Filter.Empty).FirstOrDefault(); | ||
| collection.DeleteOne(Builders<BsonDocument>.Filter.Eq("name", "test")); | ||
|
|
||
| SpinWait.SpinUntil(() => capturedActivities.Count >= 6, millisecondsTimeout: 10000); |
There was a problem hiding this comment.
I'm not sure this is needed--I added it when I thought the test was flaky. But I don't think it does harm.
There was a problem hiding this comment.
So is this test failing because of the pooled connections bug or because of timing for the capturedActivities? Do you have a link to an evergreen failure for this test?
There was a problem hiding this comment.
The failure was caused the pooled connection bug together with the other tests run at the same time and their order. I have no evidence that things ever failed because this code ran before the activities were recorded. That being said, it's not clear to me that everything about the activities flow is guaranteed to be complete before getting here. If we are sure that is the case, then this code can be removed.
There was a problem hiding this comment.
I ran the failing set in a loop for about 20 minutes without the spin check and it didn't fail, so I have removed this. As long as we await every async operation before getting here there should not be a race.
There was a problem hiding this comment.
Pull request overview
Fixes a tracing/instrumentation bug where command-level OpenTelemetry activities could be permanently skipped on pooled connections when an ActivityListener was registered after the connection was created.
Changes:
- Update
CommandEventHelperto decide whether to start command activities based on currentActivitySourcelisteners (instead of only at connection construction time). - Stabilize the OpenTelemetry smoke test by capturing activities thread-safely and waiting for async activity completion.
- Minor smoke test / test infrastructure adjustments (key vault client URI selection; unobserved-exception tracking test case selection).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
src/MongoDB.Driver/Core/Connections/CommandEventHelper.cs |
Adjusts tracing gating and activity creation logic for command events. |
tests/SmokeTests/MongoDB.Driver.SmokeTests.Sdk/OpenTelemetryTests.cs |
Makes activity capture thread-safe and waits for expected activities. |
tests/SmokeTests/MongoDB.Driver.SmokeTests.Sdk/LibmongocryptTests.cs |
Changes how the key vault MongoClient chooses its connection string. |
tests/MongoDB.TestHelpers/XunitExtensions/TimeoutEnforcing/TimeoutEnforcingXunitTestAssemblyRunner.cs |
Avoids SingleOrDefault failure when multiple tracking test cases exist. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
tests/SmokeTests/MongoDB.Driver.SmokeTests.Sdk/LibmongocryptTests.cs
Outdated
Show resolved
Hide resolved
...goDB.TestHelpers/XunitExtensions/TimeoutEnforcing/TimeoutEnforcingXunitTestAssemblyRunner.cs
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Fixes a pooled-connection tracing bug where command Activity creation could be skipped if an ActivityListener was registered after a connection was created, by checking for listeners at execution time and lazily initializing per-command state only when needed.
Changes:
- Make tracing listener detection dynamic (
ActivitySource.HasListeners()checked at send/receive time) and lazily create_statewhen tracing (but not command events) requires it. - Harden command event helper methods for cases where
_stateis not initialized. - Update smoke/unit tests to reduce flakiness and align expectations with the new state-tracking behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/MongoDB.Driver/Core/Connections/CommandEventHelper.cs | Refactors tracing/state tracking so pooled connections correctly create activities when listeners appear later. |
| tests/MongoDB.Driver.Tests/Core/Connections/CommandEventHelperTests.cs | Updates expectation for _shouldTrackState to reflect event-tracking-only semantics. |
| tests/SmokeTests/MongoDB.Driver.SmokeTests.Sdk/OpenTelemetryTests.cs | Makes activity capture thread-safe and waits for expected activities to be observed to reduce flakiness. |
| tests/SmokeTests/MongoDB.Driver.SmokeTests.Sdk/LibmongocryptTests.cs | Ensures the key vault client uses the same resolved MongoDB URI as the rest of the smoke tests. |
| tests/MongoDB.TestHelpers/XunitExtensions/TimeoutEnforcing/TimeoutEnforcingXunitTestAssemblyRunner.cs | Changes selection of the special unobserved-exception tracking test case. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
...goDB.TestHelpers/XunitExtensions/TimeoutEnforcing/TimeoutEnforcingXunitTestAssemblyRunner.cs
Show resolved
Hide resolved
| { | ||
| CommandState state; | ||
| if (_state.TryRemove(message.RequestId, out state)) | ||
| CompleteCommandActivityWithException(exception); |
There was a problem hiding this comment.
no need to move this activity completion outside of finding the state entry? I would rather keep it gated on finding the state entry as that is more consistent with my original intent (complete activity only if we tracked the command) and easier to reason about.
There was a problem hiding this comment.
Will give it a try.
| { | ||
| CommandState state; | ||
| if (!_state.TryRemove(responseTo, out state)) | ||
| CompleteCommandActivityWithException(exception); |
There was a problem hiding this comment.
Same as the comment above about keeping this gated on finding the state entry.
There was a problem hiding this comment.
I think the test ShouldTrackState_should_be_correct should drop the traceCommands parameter entirely — it's testing _shouldTrackState which is now purely about event tracking, so tracing config is irrelevant.
As for the test ShouldTrackState_should_be_correct_with_activity_listener, I think we should repurpose it. The interesting behavior to test now is: "when no events are subscribed but a listener is registered, the dynamic properties return true." Something like asserting ShouldCallBeforeSending is true even when _shouldTrackState is false, as long as tracing isn't disabled and a listener exists.
...goDB.TestHelpers/XunitExtensions/TimeoutEnforcing/TimeoutEnforcingXunitTestAssemblyRunner.cs
Show resolved
Hide resolved
| collection.Find(Builders<BsonDocument>.Filter.Empty).FirstOrDefault(); | ||
| collection.DeleteOne(Builders<BsonDocument>.Filter.Eq("name", "test")); | ||
|
|
||
| SpinWait.SpinUntil(() => capturedActivities.Count >= 6, millisecondsTimeout: 10000); |
There was a problem hiding this comment.
So is this test failing because of the pooled connections bug or because of timing for the capturedActivities? Do you have a link to an evergreen failure for this test?
| }; | ||
| ActivitySource.AddActivityListener(listener); | ||
|
|
||
| // With listeners registered, these always return true when unless everything is disabled. |
There was a problem hiding this comment.
nit: typo in the comment, remove "when"
| private static ActivityListener CreateActivityListener(out IReadOnlyCollection<Activity> capturedActivities) | ||
| { | ||
| var activities = new List<Activity>(); | ||
| var activities = new ConcurrentBag<Activity>(); |
There was a problem hiding this comment.
I don't think we need this ConcurrentBag change given you removed the spin check code you had earlier.
| _queryTextMaxLength = tracingOptions?.QueryTextMaxLength ?? 0; | ||
|
|
||
| _shouldTrackState = _shouldTrackSucceeded || _shouldTrackFailed || _shouldTrace; | ||
| _shouldTrackState = _shouldTrackSucceeded || _shouldTrackFailed; |
There was a problem hiding this comment.
While reviewing I got a bit confused between all the namings.
So _shouldTrackFailed and _shouldTrackStarted correspond to Event tracking or Logging enabled/
Should _shouldTrackState naming reflect that as well? Maybe it should be _shouldTrackEventsOrLogs? Otherwise we can be get into situation where _shouldTrackState == false but state is still tracked and _state != null?
Or maybe there are better ways to simplify this?
There was a problem hiding this comment.
Updated names and reasoning from Claude:
Current flags and what they actually mean
┌───────────────────────────────┬──────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Flag │ Current Name │ What it actually does │
├───────────────────────────────┼──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ _shouldTrackSucceeded │ OK │ True when CommandSucceededEvent has subscribers │
├───────────────────────────────┼──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ _shouldTrackFailed │ OK │ True when CommandFailedEvent has subscribers │
├───────────────────────────────┼──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ _shouldTrackStarted │ OK │ True when CommandStartedEvent has subscribers │
├───────────────────────────────┼──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ _shouldTrackState │ Misleading │ _shouldTrackSucceeded || _shouldTrackFailed — "do events need state tracking?" — but the name implies it's the sole control over whether state is tracked, which is no longer true since tracing also drives state tracking (lazily, at runtime) │
├───────────────────────────────┼──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ _shouldProcessRequestMessages │ Vague │ _shouldTrackStarted || _shouldTrackState — "do events require the BeforeSending phase?" — the name describes implementation mechanics rather than intent │
├───────────────────────────────┼──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ _tracingDisabled │ OK │ True when tracing is disabled by configuration │
└───────────────────────────────┴──────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Proposed renames
_shouldTrackState → _eventsNeedState
The old name reads as if it's the single gatekeeper for all state tracking. But after the CSHARP-5935 fix, tracing also causes state to be tracked — lazily, checked dynamically via ShouldTraceWithActivityListener(). The flag only captures the event half. The new name makes the scope explicit:
// Before (misleading — doesn't cover tracing-driven state)
_shouldTrackState = _shouldTrackSucceeded || _shouldTrackFailed;
// After (accurate — this flag is only about events)
_eventsNeedState = _shouldTrackSucceeded || _shouldTrackFailed;
_shouldProcessRequestMessages → _eventsNeedBeforeSending
"Process request messages" describes a mechanism, not a purpose. The flag answers: "do any events require the BeforeSending callback?" — which is true when we need to fire a CommandStartedEvent or need to capture state for later succeeded/failed events:
// The property reads naturally with the new name:
public bool ShouldCallBeforeSending => _eventsNeedBeforeSending || ShouldTraceWithActivityListener();
The three _shouldTrack{Started,Succeeded,Failed} flags are consistent and accurate as-is. The two renamed flags are the ones whose names no longer matched their semantics after the lazy-state-initialization refactor.
| if (_state.TryGetValue(message.RequestId, out state) && | ||
| if (_state == null) | ||
| { | ||
| return; |
There was a problem hiding this comment.
Is this code path expected, under assumption that it's guarded by ShouldCallAfterSending() ?
If it's not, should we throw instead of return?
There was a problem hiding this comment.
Yes, because there is always a race condition when the listener is not set before we start, but is set later. This is fine, it just means there needs to be a null check.
There was a problem hiding this comment.
Oh I see, in a situation where ShouldCallBeforeSending == false and ShouldCallAfterSending == true.
Thanks!
| public void ConnectionFailed(ConnectionId connectionId, ObjectId? serviceId, Exception exception, bool skipLogging) | ||
| { | ||
| if (!_shouldTrackFailed && !_shouldTrace) | ||
| if (!_shouldTrackFailed && !ShouldTraceWithActivityListener()) |
There was a problem hiding this comment.
Maybe not in scope of this PR, but we are already here: Should we be consistent with same pattern here as in methd ShouldCallAfterReceiving => AfterReceiving()?
Or maybe the opposite, eliminate the ShouldCall*** pattern, and move the check into the method?
There was a problem hiding this comment.
I'm not really sure what you are suggesting here. Could you explain more?
There was a problem hiding this comment.
Currently ConnectionFailed is always called, and the check !_shouldTrackFailed && !ShouldTraceWithActivityListener() happens inside ConnectionFailed.
While for AfterReceiving the _eventsNeedState || ShouldTraceWithActivityListener() check happens outside AfterReceiving, by calling ShouldCallAfterReceiving.
Should we have the same pattern for all methods?
If (condition)
CallMethod
or
CallMethod() {
if (not condition)
return
}
…ection I was looking at flaky tests, and found MongoClient_should_create_activities_when_tracing_enabled was failing occasionally. Turns out it was an actual bug. Junie (Opus 4.6) says: ### Root Cause `_shouldTrace` in `CommandEventHelper` was set **once** at connection construction time via `MongoTelemetry.ActivitySource.HasListeners()`. Since connections are pooled and reused, if a connection was created before an `ActivityListener` was registered (or by a test with tracing disabled), `_shouldTrace` remained `false` permanently for that connection — command activities were never created even when a listener was later active.
I've successfully refactored the code to fix the bug where activity listeners were being checked when the connection was created (at which point none may be set), but not checked again when the connection was pulled from the pool. Key Changes: Constructor (CommandEventHelper.cs:60): Removed the listener check from _shouldTrackState initialization. Now _shouldTrackState only depends on whether event tracking is needed (_shouldTrackSucceeded || _shouldTrackFailed), not on tracing with listeners. Property Getters (CommandEventHelper.cs:71-94): Updated all property getters (ShouldCallBeforeSending, ShouldCallAfterSending, etc.) to dynamically check MongoTelemetry.ActivitySource.HasListeners() at execution time, ensuring that activity tracing is detected when connections are obtained from the pool. Lazy State Initialization (CommandEventHelper.cs:37, 741-745): Changed _state from readonly to allow lazy initialization. The state dictionary is now only created when needed - either for event tracking or when tracing with listeners is detected at runtime. Null Safety (CommandEventHelper.cs:132, 165, 192, 216, 233): Added null checks before accessing _state in all methods to handle cases where state tracking hasn't been initialized. TrackCommandState (CommandEventHelper.cs:734-770): Updated to check for listeners at execution time and lazily initialize the state dictionary only when tracing is actually needed. Test Update (CommandEventHelperTests.cs:122-124): Updated the test to reflect the new behavior where _shouldTrackState only reflects event tracking, not tracing with listeners (which is now checked dynamically). Benefits: Bug Fixed: Activity listeners are now checked when connections are obtained from the pool, not just when they're created No Unnecessary State Tracking: State tracking is only enabled when actually needed (either for events or when listeners are registered), avoiding the performance overhead of always tracking state Backward Compatible: The behavior is the same for all scenarios, just more efficient The changes ensure that the driver properly detects activity listeners whenever a connection is used, whether it's newly created or retrieved from the pool, while avoiding unnecessary state tracking when no listeners are registered.
BorisDog
left a comment
There was a problem hiding this comment.
There is a minor question pending. But it's not a blocker.
Everything else LGTM!
| if (_state.TryGetValue(message.RequestId, out state) && | ||
| if (_state == null) | ||
| { | ||
| return; |
There was a problem hiding this comment.
Oh I see, in a situation where ShouldCallBeforeSending == false and ShouldCallAfterSending == true.
Thanks!
| public void ConnectionFailed(ConnectionId connectionId, ObjectId? serviceId, Exception exception, bool skipLogging) | ||
| { | ||
| if (!_shouldTrackFailed && !_shouldTrace) | ||
| if (!_shouldTrackFailed && !ShouldTraceWithActivityListener()) |
There was a problem hiding this comment.
Currently ConnectionFailed is always called, and the check !_shouldTrackFailed && !ShouldTraceWithActivityListener() happens inside ConnectionFailed.
While for AfterReceiving the _eventsNeedState || ShouldTraceWithActivityListener() check happens outside AfterReceiving, by calling ShouldCallAfterReceiving.
Should we have the same pattern for all methods?
If (condition)
CallMethod
or
CallMethod() {
if (not condition)
return
}
| _queryTextMaxLength = tracingOptions?.QueryTextMaxLength ?? 0; | ||
|
|
||
| _shouldTrackState = _shouldTrackSucceeded || _shouldTrackFailed || _shouldTrace; | ||
| _shouldTrackState = _shouldTrackSucceeded || _shouldTrackFailed; |
I was looking at flaky tests, and found MongoClient_should_create_activities_when_tracing_enabled was failing occasionally. Turns out it was an actual bug.
New fix by Claude to avoid tracking state.
Summary
I've successfully refactored the code to fix the bug where activity listeners were being checked when the connection was created (at which point none may be set), but not checked again when the connection was pulled from the pool.
Key Changes:
Constructor (
CommandEventHelper.cs:60): Removed the listener check from_shouldTrackStateinitialization. Now_shouldTrackStateonly depends on whether event tracking is needed (_shouldTrackSucceeded || _shouldTrackFailed), not on tracing with listeners.Property Getters (
CommandEventHelper.cs:71-94): Updated all property getters (ShouldCallBeforeSending,ShouldCallAfterSending, etc.) to dynamically checkMongoTelemetry.ActivitySource.HasListeners()at execution time, ensuring that activity tracing is detected when connections are obtained from the pool.Lazy State Initialization (
CommandEventHelper.cs:37, 741-745): Changed_statefromreadonlyto allow lazy initialization. The state dictionary is now only created when needed - either for event tracking or when tracing with listeners is detected at runtime.Null Safety (
CommandEventHelper.cs:132, 165, 192, 216, 233): Added null checks before accessing_statein all methods to handle cases where state tracking hasn't been initialized.TrackCommandState (
CommandEventHelper.cs:734-770): Updated to check for listeners at execution time and lazily initialize the state dictionary only when tracing is actually needed.Test Update (
CommandEventHelperTests.cs:122-124): Updated the test to reflect the new behavior where_shouldTrackStateonly reflects event tracking, not tracing with listeners (which is now checked dynamically).Benefits:
The changes ensure that the driver properly detects activity listeners whenever a connection is used, whether it's newly created or retrieved from the pool, while avoiding unnecessary state tracking when no listeners are registered.
Junie (Opus 4.6) says:
Root Cause
_shouldTraceinCommandEventHelperwas set once at connection construction time viaMongoTelemetry.ActivitySource.HasListeners(). Since connections are pooled and reused, if a connection was created before anActivityListenerwas registered (or by a test with tracing disabled),_shouldTraceremainedfalsepermanently for that connection — command activities were never created even when a listener was later active.