dml: add per-instance mutexes to fix concurrent session crashes by oysteinkrog · Pull Request #28007 · microsoft/onnxruntime

oysteinkrog · 2026-04-08T07:18:54Z

Description

Adds thread-safety to 4 DML EP data structures that race when multiple InferenceSession instances run concurrently on the same D3D12 device. Without these locks, concurrent DML sessions crash with 0x8000FFFF ("Catastrophic failure") in MLOperatorAuthorImpl.cpp.

Problem

When creating multiple InferenceSession instances that share the same D3D12 device (e.g., running person detection and pose estimation models simultaneously), the DML EP crashes because several internal data structures are not thread-safe:

- `BucketizedBufferAllocator::Alloc`/`FreeResource`: concurrent allocations corrupt bucket lists
- `CommandQueue` methods: concurrent command list submissions race
- `ExecutionContext`: concurrent `Flush`/`SetCommandRecorder` calls race
- `DescriptorPool::AllocDescriptors`: concurrent descriptor allocation corrupts pool state

Fix

Add per-instance mutexes to each of the 4 classes:

| Class | Lock type | Why |
| --- | --- | --- |
| `BucketizedBufferAllocator` | `std::mutex` + `std::atomic` for `m_defaultRoundingMode` | `FreeResource` releases lock before calling `QueueReference` to prevent lock-order inversion |
| `CommandQueue` | `std::recursive_mutex` | Re-entrance: `ExecuteCommandList` → `ExecuteCommandLists`, `Close` → `GetCurrentCompletionEvent` |
| `ExecutionContext` | `std::recursive_mutex` + `std::atomic<bool>` for `m_closed` | Re-entrance: `Flush` ↔ `SetCommandRecorder` cycle |
| `DescriptorPool` | `std::mutex` | Simple mutual exclusion on `AllocDescriptors`, `Trim`, `GetTotalCapacity` |
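
The "release the lock before calling out" pattern in the first row can be sketched roughly as follows. This is an illustration only, not the DML EP code: `SketchAllocator`, its integer resource ids, and the `onFree` callback (standing in for the call to `QueueReference`) are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a pooled allocator; NOT the real BucketizedBufferAllocator.
class SketchAllocator {
public:
    // onFree models a call into another component that takes its own locks
    // and may call back into this allocator.
    explicit SketchAllocator(std::function<void(int)> onFree)
        : m_onFree(std::move(onFree)) {}

    int Alloc() {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!m_freeList.empty()) {
            int id = m_freeList.back();
            m_freeList.pop_back();
            return id;
        }
        return m_nextId++;
    }

    void Free(int id) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_freeList.push_back(id);
        }  // m_mutex is released here, before the outbound call
        m_onFree(id);  // safe even if the callee re-enters Alloc()
    }

private:
    std::mutex m_mutex;
    std::vector<int> m_freeList;
    int m_nextId = 0;
    std::function<void(int)> m_onFree;
};

// Frees a resource while the callback re-enters the allocator.
// Returns the id handed out by the re-entrant Alloc() call.
int RunAllocatorDemo() {
    SketchAllocator* self = nullptr;
    int reallocated = -1;
    SketchAllocator alloc([&](int) { reallocated = self->Alloc(); });
    self = &alloc;
    int id = alloc.Alloc();
    alloc.Free(id);
    return reallocated;
}
```

If `Free` held `m_mutex` across the callback, the re-entrant `Alloc` would try to lock a non-recursive `std::mutex` that the same thread already owns, which is undefined behavior and typically deadlocks. Scoping the guard to the list update avoids that, and also avoids holding the allocator lock while another component acquires its own.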

Each session has its own instances of these objects, so the mutexes only serialize intra-session calls. Cross-session concurrency is fully preserved.
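
As a rough sketch of what a per-instance re-entrant lock looks like, the example below mirrors the `ExecuteCommandList` → `ExecuteCommandLists` call chain and the atomic closed flag from the table. `SketchQueue` and its integer "commands" are hypothetical and only borrow the method names; the real classes wrap D3D12 objects.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a command queue; NOT the real DML EP CommandQueue.
class SketchQueue {
public:
    void ExecuteCommandList(int cmd) {
        std::lock_guard<std::recursive_mutex> lock(m_mutex);
        // Re-entrant call: the same thread locks m_mutex a second time.
        ExecuteCommandLists(&cmd, 1);
    }

    void ExecuteCommandLists(const int* cmds, int count) {
        std::lock_guard<std::recursive_mutex> lock(m_mutex);
        for (int i = 0; i < count; ++i) {
            m_submitted.push_back(cmds[i]);
        }
    }

    // The flag is atomic so it can be read without taking the mutex.
    void Close() { m_closed.store(true); }
    bool IsClosed() const { return m_closed.load(); }

    std::size_t SubmittedCount() const {
        std::lock_guard<std::recursive_mutex> lock(m_mutex);
        return m_submitted.size();
    }

private:
    // One mutex per object: separate SketchQueue instances never contend.
    mutable std::recursive_mutex m_mutex;
    std::vector<int> m_submitted;
    std::atomic<bool> m_closed{false};
};

// Several threads submit to one queue; returns how many commands were recorded.
std::size_t RunQueueDemo(int threads, int perThread) {
    SketchQueue queue;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&queue, perThread] {
            for (int i = 0; i < perThread; ++i) {
                queue.ExecuteCommandList(i);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return queue.SubmittedCount();
}
```

Because the mutex is a member rather than a static or global, two `SketchQueue` objects can be driven from different threads without blocking each other; only calls on the same object are serialized.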

Verification

Tested with concurrent inference stress tests:

2 models (person detection + pose estimation) running simultaneously — crashes consistently without fix, stable with fix
3 models running simultaneously — stable
1000-iteration stress test — no crashes

Tested on NVIDIA GeForce RTX 5070 Ti with DirectML, Windows 11.
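
The shape of such a stress test might look like the sketch below. The actual test code is not part of this description, so this is an assumption about its structure: `RunConcurrentStress` is a hypothetical helper, and each task is a placeholder for one model's inference call.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical harness: runs every task `iterations` times, one thread per task,
// and returns the total number of completed runs.
// In a real test each task would invoke one model's session; here it is a placeholder.
int RunConcurrentStress(const std::vector<std::function<void()>>& tasks,
                        int iterations) {
    std::atomic<int> completed{0};
    std::vector<std::thread> workers;
    for (const auto& task : tasks) {
        workers.emplace_back([&task, &completed, iterations] {
            for (int i = 0; i < iterations; ++i) {
                task();
                completed.fetch_add(1);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return completed.load();
}
```

With two tasks and 1000 iterations this corresponds to the two-model case above: both loops run at the same time, so any unsynchronized shared state is exercised on every iteration.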

Motivation and Context

Applications that run multiple ML models concurrently (e.g., real-time sports analysis with person detection + pose estimation) need concurrent DML sessions for performance. The current code assumes single-threaded access to per-session EP objects, which breaks when sessions share a D3D12 device.

This is a minimal fix — only adding locks where data races were observed. No API changes, no behavioral changes for single-session usage.

Add thread-safety to 4 DML EP data structures that race when multiple InferenceSessions run concurrently on the same D3D12 device:

- BucketizedBufferAllocator: std::mutex on Alloc/FreeResource, std::atomic for m_defaultRoundingMode. FreeResource releases lock before calling ExecutionContext::QueueReference to prevent lock-order inversion (allocator→context vs context→queue→allocator).
- CommandQueue: std::recursive_mutex on all methods (re-entrance: ExecuteCommandList→ExecuteCommandLists, Close→GetCurrentCompletionEvent).
- ExecutionContext: std::recursive_mutex on all public/private methods (re-entrance: Flush↔SetCommandRecorder cycle). std::atomic<bool> for m_closed to eliminate data race in IsClosed().
- DescriptorPool: std::mutex on AllocDescriptors, Trim, GetTotalCapacity.

Each session has its own instances of these objects, so the mutexes only serialize intra-session calls. Cross-session concurrency is fully preserved.

Fixes 0x8000FFFF "Catastrophic failure" in MLOperatorAuthorImpl.cpp when running concurrent DML inference sessions with per-session command queues.

Verified with concurrent inference stress tests (2-3 models running simultaneously, 1000+ iterations): crashes consistently without the fix, stable with the fix.

oysteinkrog · 2026-04-08T07:41:10Z

@microsoft-github-policy-service agree company="Initial Force AS"

oysteinkrog wants to merge 1 commit into microsoft:main from oysteinkrog:fix/dml-concurrent-session-crash
