Skip to content

CSHARP-5935: Command activities may be skipped when using pooled connection#1918

Open
ajcvickers wants to merge 6 commits intomongodb:mainfrom
ajcvickers:TestFixes_2
Open

CSHARP-5935: Command activities may be skipped when using pooled connection#1918
ajcvickers wants to merge 6 commits intomongodb:mainfrom
ajcvickers:TestFixes_2

Conversation

@ajcvickers
Copy link
Copy Markdown
Contributor

@ajcvickers ajcvickers commented Mar 18, 2026

I was looking at flaky tests, and found MongoClient_should_create_activities_when_tracing_enabled was failing occasionally. Turns out it was an actual bug.

New fix by Claude to avoid tracking state.

Summary

I've successfully refactored the code to fix the bug where activity listeners were being checked when the connection was created (at which point none may be set), but not checked again when the connection was pulled from the pool.

Key Changes:

  1. Constructor (CommandEventHelper.cs:60): Removed the listener check from _shouldTrackState initialization. Now _shouldTrackState only depends on whether event tracking is needed (_shouldTrackSucceeded || _shouldTrackFailed), not on tracing with listeners.

  2. Property Getters (CommandEventHelper.cs:71-94): Updated all property getters (ShouldCallBeforeSending, ShouldCallAfterSending, etc.) to dynamically check MongoTelemetry.ActivitySource.HasListeners() at execution time, ensuring that activity tracing is detected when connections are obtained from the pool.

  3. Lazy State Initialization (CommandEventHelper.cs:37, 741-745): Changed _state from readonly to allow lazy initialization. The state dictionary is now only created when needed - either for event tracking or when tracing with listeners is detected at runtime.

  4. Null Safety (CommandEventHelper.cs:132, 165, 192, 216, 233): Added null checks before accessing _state in all methods to handle cases where state tracking hasn't been initialized.

  5. TrackCommandState (CommandEventHelper.cs:734-770): Updated to check for listeners at execution time and lazily initialize the state dictionary only when tracing is actually needed.

  6. Test Update (CommandEventHelperTests.cs:122-124): Updated the test to reflect the new behavior where _shouldTrackState only reflects event tracking, not tracing with listeners (which is now checked dynamically).

Benefits:

  • Bug Fixed: Activity listeners are now checked when connections are obtained from the pool, not just when they're created
  • No Unnecessary State Tracking: State tracking is only enabled when actually needed (either for events or when listeners are registered), avoiding the performance overhead of always tracking state
  • Backward Compatible: The behavior is the same for all scenarios, just more efficient

The changes ensure that the driver properly detects activity listeners whenever a connection is used, whether it's newly created or retrieved from the pool, while avoiding unnecessary state tracking when no listeners are registered.


Junie (Opus 4.6) says:

Root Cause

_shouldTrace in CommandEventHelper was set once at connection construction time via MongoTelemetry.ActivitySource.HasListeners(). Since connections are pooled and reused, if a connection was created before an ActivityListener was registered (or by a test with tracing disabled), _shouldTrace remained false permanently for that connection — command activities were never created even when a listener was later active.

@ajcvickers ajcvickers requested a review from adelinowona March 18, 2026 12:18
@ajcvickers ajcvickers requested a review from a team as a code owner March 18, 2026 12:18
Copilot AI review requested due to automatic review settings March 18, 2026 12:18
@ajcvickers ajcvickers added the bug Fixes issues or unintended behavior. label Mar 18, 2026
collection.Find(Builders<BsonDocument>.Filter.Empty).FirstOrDefault();
collection.DeleteOne(Builders<BsonDocument>.Filter.Eq("name", "test"));

SpinWait.SpinUntil(() => capturedActivities.Count >= 6, millisecondsTimeout: 10000);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is needed--I added it when I thought the test was flaky. But I don't think it does harm.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is this test failing because of the pooled connections bug or because of timing for the capturedActivities? Do you have a link to an evergreen failure for this test?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failure was caused the pooled connection bug together with the other tests run at the same time and their order. I have no evidence that things ever failed because this code ran before the activities were recorded. That being said, it's not clear to me that everything about the activities flow is guaranteed to be complete before getting here. If we are sure that is the case, then this code can be removed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the failing set in a loop for about 20 minutes without the spin check and it didn't fail, so I have removed this. As long as we await every async operation before getting here there should not be a race.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a tracing/instrumentation bug where command-level OpenTelemetry activities could be permanently skipped on pooled connections when an ActivityListener was registered after the connection was created.

Changes:

  • Update CommandEventHelper to decide whether to start command activities based on current ActivitySource listeners (instead of only at connection construction time).
  • Stabilize the OpenTelemetry smoke test by capturing activities thread-safely and waiting for async activity completion.
  • Minor smoke test / test infrastructure adjustments (key vault client URI selection; unobserved-exception tracking test case selection).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
src/MongoDB.Driver/Core/Connections/CommandEventHelper.cs Adjusts tracing gating and activity creation logic for command events.
tests/SmokeTests/MongoDB.Driver.SmokeTests.Sdk/OpenTelemetryTests.cs Makes activity capture thread-safe and waits for expected activities.
tests/SmokeTests/MongoDB.Driver.SmokeTests.Sdk/LibmongocryptTests.cs Changes how the key vault MongoClient chooses its connection string.
tests/MongoDB.TestHelpers/XunitExtensions/TimeoutEnforcing/TimeoutEnforcingXunitTestAssemblyRunner.cs Avoids SingleOrDefault failure when multiple tracking test cases exist.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a pooled-connection tracing bug where command Activity creation could be skipped if an ActivityListener was registered after a connection was created, by checking for listeners at execution time and lazily initializing per-command state only when needed.

Changes:

  • Make tracing listener detection dynamic (ActivitySource.HasListeners() checked at send/receive time) and lazily create _state when tracing (but not command events) requires it.
  • Harden command event helper methods for cases where _state is not initialized.
  • Update smoke/unit tests to reduce flakiness and align expectations with the new state-tracking behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/MongoDB.Driver/Core/Connections/CommandEventHelper.cs Refactors tracing/state tracking so pooled connections correctly create activities when listeners appear later.
tests/MongoDB.Driver.Tests/Core/Connections/CommandEventHelperTests.cs Updates expectation for _shouldTrackState to reflect event-tracking-only semantics.
tests/SmokeTests/MongoDB.Driver.SmokeTests.Sdk/OpenTelemetryTests.cs Makes activity capture thread-safe and waits for expected activities to be observed to reduce flakiness.
tests/SmokeTests/MongoDB.Driver.SmokeTests.Sdk/LibmongocryptTests.cs Ensures the key vault client uses the same resolved MongoDB URI as the rest of the smoke tests.
tests/MongoDB.TestHelpers/XunitExtensions/TimeoutEnforcing/TimeoutEnforcingXunitTestAssemblyRunner.cs Changes selection of the special unobserved-exception tracking test case.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

{
CommandState state;
if (_state.TryRemove(message.RequestId, out state))
CompleteCommandActivityWithException(exception);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to move this activity completion outside of finding the state entry? I would rather keep it gated on finding the state entry as that is more consistent with my original intent (complete activity only if we tracked the command) and easier to reason about.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will give it a try.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

{
CommandState state;
if (!_state.TryRemove(responseTo, out state))
CompleteCommandActivityWithException(exception);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the comment above about keeping this gated on finding the state entry.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the test ShouldTrackState_should_be_correct should drop the traceCommands parameter entirely — it's testing _shouldTrackState which is now purely about event tracking, so tracing config is irrelevant.

As for the test ShouldTrackState_should_be_correct_with_activity_listener, I think we should repurpose it. The interesting behavior to test now is: "when no events are subscribed but a listener is registered, the dynamic properties return true." Something like asserting ShouldCallBeforeSending is true even when _shouldTrackState is false, as long as tracing isn't disabled and a listener exists.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated tests.

collection.Find(Builders<BsonDocument>.Filter.Empty).FirstOrDefault();
collection.DeleteOne(Builders<BsonDocument>.Filter.Eq("name", "test"));

SpinWait.SpinUntil(() => capturedActivities.Count >= 6, millisecondsTimeout: 10000);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is this test failing because of the pooled connections bug or because of timing for the capturedActivities? Do you have a link to an evergreen failure for this test?

};
ActivitySource.AddActivityListener(listener);

// With listeners registered, these always return true when unless everything is disabled.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typo in the comment, remove "when"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

private static ActivityListener CreateActivityListener(out IReadOnlyCollection<Activity> capturedActivities)
{
var activities = new List<Activity>();
var activities = new ConcurrentBag<Activity>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this ConcurrentBag change given you removed the spin check code you had earlier.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@ajcvickers ajcvickers requested a review from BorisDog April 2, 2026 15:48
_queryTextMaxLength = tracingOptions?.QueryTextMaxLength ?? 0;

_shouldTrackState = _shouldTrackSucceeded || _shouldTrackFailed || _shouldTrace;
_shouldTrackState = _shouldTrackSucceeded || _shouldTrackFailed;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While reviewing I got a bit confused between all the namings.
So _shouldTrackFailed and _shouldTrackStarted correspond to Event tracking or Logging enabled/
Should _shouldTrackState naming reflect that as well? Maybe it should be _shouldTrackEventsOrLogs? Otherwise we can be get into situation where _shouldTrackState == false but state is still tracked and _state != null?

Or maybe there are better ways to simplify this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated names and reasoning from Claude:

Current flags and what they actually mean

┌───────────────────────────────┬──────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐                                                                      
│             Flag              │ Current Name │                                                                                                              What it actually does                                                                                                               │                                                                      
├───────────────────────────────┼──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ _shouldTrackSucceeded         │ OK           │ True when CommandSucceededEvent has subscribers                                                                                                                                                                                                  │
├───────────────────────────────┼──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ _shouldTrackFailed            │ OK           │ True when CommandFailedEvent has subscribers                                                                                                                                                                                                     │                                                                      
├───────────────────────────────┼──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ _shouldTrackStarted           │ OK           │ True when CommandStartedEvent has subscribers                                                                                                                                                                                                    │                                                                      
├───────────────────────────────┼──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ _shouldTrackState             │ Misleading   │ _shouldTrackSucceeded || _shouldTrackFailed — "do events need state tracking?" — but the name implies it's the sole control over whether state is tracked, which is no longer true since tracing also drives state tracking (lazily, at runtime) │
├───────────────────────────────┼──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ _shouldProcessRequestMessages │ Vague        │ _shouldTrackStarted || _shouldTrackState — "do events require the BeforeSending phase?" — the name describes implementation mechanics rather than intent                                                                                         │
├───────────────────────────────┼──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤                                                                      
│ _tracingDisabled              │ OK           │ True when tracing is disabled by configuration                                                                                                                                                                                                   │
└───────────────────────────────┴──────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Proposed renames

_shouldTrackState → _eventsNeedState

The old name reads as if it's the single gatekeeper for all state tracking. But after the CSHARP-5935 fix, tracing also causes state to be tracked — lazily, checked dynamically via ShouldTraceWithActivityListener(). The flag only captures the event half. The new name makes the scope explicit:

// Before (misleading — doesn't cover tracing-driven state)
_shouldTrackState = _shouldTrackSucceeded || _shouldTrackFailed;

// After (accurate — this flag is only about events)
_eventsNeedState = _shouldTrackSucceeded || _shouldTrackFailed;

_shouldProcessRequestMessages → _eventsNeedBeforeSending

"Process request messages" describes a mechanism, not a purpose. The flag answers: "do any events require the BeforeSending callback?" — which is true when we need to fire a CommandStartedEvent or need to capture state for later succeeded/failed events:

// The property reads naturally with the new name:
public bool ShouldCallBeforeSending => _eventsNeedBeforeSending || ShouldTraceWithActivityListener();

The three _shouldTrack{Started,Succeeded,Failed} flags are consistent and accurate as-is. The two renamed flags are the ones whose names no longer matched their semantics after the lazy-state-initialization refactor.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

if (_state.TryGetValue(message.RequestId, out state) &&
if (_state == null)
{
return;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this code path expected, under assumption that it's guarded by ShouldCallAfterSending() ?
If it's not, should we throw instead of return?

Copy link
Copy Markdown
Contributor Author

@ajcvickers ajcvickers Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because there is always a race condition when the listener is not set before we start, but is set later. This is fine, it just means there needs to be a null check.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, in a situation where ShouldCallBeforeSending == false and ShouldCallAfterSending == true.
Thanks!

public void ConnectionFailed(ConnectionId connectionId, ObjectId? serviceId, Exception exception, bool skipLogging)
{
if (!_shouldTrackFailed && !_shouldTrace)
if (!_shouldTrackFailed && !ShouldTraceWithActivityListener())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not in scope of this PR, but we are already here: Should we be consistent with same pattern here as in methd ShouldCallAfterReceiving => AfterReceiving()?
Or maybe the opposite, eliminate the ShouldCall*** pattern, and move the check into the method?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure what you are suggesting here. Could you explain more?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently ConnectionFailed is always called, and the check !_shouldTrackFailed && !ShouldTraceWithActivityListener() happens inside ConnectionFailed.
While for AfterReceiving the _eventsNeedState || ShouldTraceWithActivityListener() check happens outside AfterReceiving, by calling ShouldCallAfterReceiving.

Should we have the same pattern for all methods?

If (condition)
   CallMethod

or

CallMethod() { 
 if (not condition)
   return
}

…ection

I was looking at flaky tests, and found MongoClient_should_create_activities_when_tracing_enabled was failing occasionally. Turns out it was an actual bug. Junie (Opus 4.6) says:

### Root Cause

`_shouldTrace` in `CommandEventHelper` was set **once** at connection construction time via `MongoTelemetry.ActivitySource.HasListeners()`. Since connections are pooled and reused, if a connection was created before an `ActivityListener` was registered (or by a test with tracing disabled), `_shouldTrace` remained `false` permanently for that connection — command activities were never created even when a listener was later active.
I've successfully refactored the code to fix the bug where activity listeners were being checked when the connection was created (at which point none may be set), but not checked again when the connection was pulled from the pool.
Key Changes:
Constructor (CommandEventHelper.cs:60): Removed the listener check from _shouldTrackState initialization. Now _shouldTrackState only depends on whether event tracking is needed (_shouldTrackSucceeded || _shouldTrackFailed), not on tracing with listeners.
Property Getters (CommandEventHelper.cs:71-94): Updated all property getters (ShouldCallBeforeSending, ShouldCallAfterSending, etc.) to dynamically check MongoTelemetry.ActivitySource.HasListeners() at execution time, ensuring that activity tracing is detected when connections are obtained from the pool.
Lazy State Initialization (CommandEventHelper.cs:37, 741-745): Changed _state from readonly to allow lazy initialization. The state dictionary is now only created when needed - either for event tracking or when tracing with listeners is detected at runtime.
Null Safety (CommandEventHelper.cs:132, 165, 192, 216, 233): Added null checks before accessing _state in all methods to handle cases where state tracking hasn't been initialized.
TrackCommandState (CommandEventHelper.cs:734-770): Updated to check for listeners at execution time and lazily initialize the state dictionary only when tracing is actually needed.
Test Update (CommandEventHelperTests.cs:122-124): Updated the test to reflect the new behavior where _shouldTrackState only reflects event tracking, not tracing with listeners (which is now checked dynamically).
Benefits:
Bug Fixed: Activity listeners are now checked when connections are obtained from the pool, not just when they're created
No Unnecessary State Tracking: State tracking is only enabled when actually needed (either for events or when listeners are registered), avoiding the performance overhead of always tracking state
Backward Compatible: The behavior is the same for all scenarios, just more efficient
The changes ensure that the driver properly detects activity listeners whenever a connection is used, whether it's newly created or retrieved from the pool, while avoiding unnecessary state tracking when no listeners are registered.
Copy link
Copy Markdown
Contributor

@BorisDog BorisDog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a minor question pending. But it's not a blocker.
Everything else LGTM!

if (_state.TryGetValue(message.RequestId, out state) &&
if (_state == null)
{
return;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, in a situation where ShouldCallBeforeSending == false and ShouldCallAfterSending == true.
Thanks!

public void ConnectionFailed(ConnectionId connectionId, ObjectId? serviceId, Exception exception, bool skipLogging)
{
if (!_shouldTrackFailed && !_shouldTrace)
if (!_shouldTrackFailed && !ShouldTraceWithActivityListener())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently ConnectionFailed is always called, and the check !_shouldTrackFailed && !ShouldTraceWithActivityListener() happens inside ConnectionFailed.
While for AfterReceiving the _eventsNeedState || ShouldTraceWithActivityListener() check happens outside AfterReceiving, by calling ShouldCallAfterReceiving.

Should we have the same pattern for all methods?

If (condition)
   CallMethod

or

CallMethod() { 
 if (not condition)
   return
}

_queryTextMaxLength = tracingOptions?.QueryTextMaxLength ?? 0;

_shouldTrackState = _shouldTrackSucceeded || _shouldTrackFailed || _shouldTrace;
_shouldTrackState = _shouldTrackSucceeded || _shouldTrackFailed;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Fixes issues or unintended behavior.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants