Skip to content

Tests | Flakiness improvements to XEventsTracingTest#4262

Merged
mdaigle merged 5 commits into
dotnet:mainfrom
edwardneal:tests/xevent-reliability
May 18, 2026
Merged

Tests | Flakiness improvements to XEventsTracingTest#4262
mdaigle merged 5 commits into
dotnet:mainfrom
edwardneal:tests/xevent-reliability

Conversation

@edwardneal

@edwardneal edwardneal commented May 7, 2026

Copy link
Copy Markdown
Contributor

Description

This performs some reliability improvements to XEventsTracingTest. We can see intermittent failures, most of which are the result of them being killed to resolve deadlocks.

I wouldn't normally expect the original SQL statements (sp_help and SELECT @@VERSION) to encounter that, but sp_help returns the list of objects in the current database; perhaps if objects are being created by another CI run, this becomes an issue.

To break the dependency on server state, I've replaced both method calls with a call to a new SP which just runs SELECT 1, and a SQL statement which runs SELECT 1 directly.

During investigation it also became clear that an activity ID is recorded (and the test is capable of passing) even when a deadlock or other SQL error occurs. I've made this explicit via a new test case. Technically this means that we could simply broaden the error handling to swallow all SqlException errors when executing the command and the test would continue to pass. I've not done this because I'm a little more concerned that we're encountering deadlocks on comparatively simple statements, and don't want to mask any underlying issue.

Besides this, there are a few QoL improvements:

  • One test method is now refactored into three coherent test cases. This also makes the reflection work to retrieve the test case name unnecessary.
  • Added XML documentation which describes the tests (and the reason why they're marked as flaky.)
  • A new FlushResultSet helper was added in an earlier PR, and we now use it.

Issues

Fixes #3453.

Testing

One new test case. All three XEvents tests pass, but I don't think I can easily reproduce the same kind of load. Could someone run CI against this PR multiple times at peak load please?

* Refactor one test method into three distinct test cases.
* Add reasons for the tests being marked as flaky.
* Switch call to 'sp_help' to a new, simpler SP.
* Switch execution of 'SELECT @@Version' to a simpler 'SELECT 1' statement.
* Use new FlushResultSet helper.
* Simplify XEvent session name generation.
* Add test case to verify that an activity ID is recorded in the extended event when the SQL statement throws an error.
@edwardneal edwardneal requested a review from a team as a code owner May 7, 2026 04:59
@github-project-automation github-project-automation Bot moved this to To triage in SqlClient Board May 7, 2026
@cheenamalhotra cheenamalhotra moved this from To triage to In review in SqlClient Board May 7, 2026
@cheenamalhotra cheenamalhotra added the Area\Tests Issues that are targeted to tests or test projects label May 7, 2026
@cheenamalhotra cheenamalhotra added this to the 7.1.0-preview2 milestone May 7, 2026
@cheenamalhotra

Copy link
Copy Markdown
Member

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 2 pipeline(s).

@edwardneal

Copy link
Copy Markdown
Contributor Author

That's looking positive for the first run, although it looks like the LocalAppContextSwitchesTest.TestDefaultAppContextSwitchValues unit test is unrelatedly flaky - possibly because it's running in parallel with other unit tests which change the values. I'd appreciate a few more runs of this.

@mdaigle

mdaigle commented May 12, 2026

Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 2 pipeline(s).

@edwardneal

Copy link
Copy Markdown
Contributor Author

No deadlocks on the second run, but it looks like there are a few problems with the number of sessions open against a single Azure SQL instance - there's a hard limit of 128MB session memory:

Microsoft.Data.SqlClient.SqlException : Operation failed. Operation will cause database event session memory to exceed allowed limit. Event session memory may be released by stopping active sessions or altering session memory options. Check sys.dm_xe_database_sessions for active sessions that can be stopped or altered.

We'd originally increased the MAX_MEMORY from 4MB to 16MB in an attempt to tackle the deadlocks, but perhaps these have been addressed at source. I've cut this back to 4MB, can someone run the pipelines a few times please?

@priyankatiwari08

Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 2 pipeline(s).

@edwardneal

Copy link
Copy Markdown
Contributor Author

I've added an explanatory comment and pushed, thanks @priyankatiwari08. The last pipeline run passed, so I'd appreciate a few more at peak load before I unmark them as flaky.

@mdaigle

mdaigle commented May 12, 2026

Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 2 pipeline(s).

@edwardneal

Copy link
Copy Markdown
Contributor Author

All tests passed on those two runs, which looks encouraging - particularly since this run had one or two failures which seemed to be an indicator that the remote server is under load. Thanks.

Could you run this twice more at peak please?

@mdaigle

mdaigle commented May 12, 2026

Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 2 pipeline(s).

@mdaigle

mdaigle commented May 13, 2026

Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 2 pipeline(s).

@priyankatiwari08 priyankatiwari08 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, approving this PR

@mdaigle

mdaigle commented May 14, 2026

Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 2 pipeline(s).

@mdaigle mdaigle enabled auto-merge (squash) May 14, 2026 20:25
@codecov

codecov Bot commented May 18, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 65.99%. Comparing base (b700269) to head (f055a68).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4262      +/-   ##
==========================================
- Coverage   66.04%   65.99%   -0.06%     
==========================================
  Files         275      272       -3     
  Lines       42976    66258   +23282     
==========================================
+ Hits        28383    43725   +15342     
- Misses      14593    22533    +7940     
Flag Coverage Δ
CI-SqlClient ?
PR-SqlClient-Project 65.99% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mdaigle mdaigle merged commit 58d784e into dotnet:main May 18, 2026
302 checks passed
@github-project-automation github-project-automation Bot moved this from In review to Done in SqlClient Board May 18, 2026
@edwardneal edwardneal deleted the tests/xevent-reliability branch May 18, 2026 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area\Tests Issues that are targeted to tests or test projects

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

XEvent tests intermittent failures

4 participants