dml: add per-instance mutexes to fix concurrent session crashes#28007

Open
oysteinkrog wants to merge 1 commit into microsoft:main from oysteinkrog:fix/dml-concurrent-session-crash

Conversation

@oysteinkrog

Description

Adds thread-safety to 4 DML EP data structures that race when multiple InferenceSession instances run concurrently on the same D3D12 device. Without these locks, concurrent DML sessions crash with 0x8000FFFF ("Catastrophic failure") in MLOperatorAuthorImpl.cpp.

Problem

When creating multiple InferenceSession instances that share the same D3D12 device (e.g., running person detection and pose estimation models simultaneously), the DML EP crashes because several internal data structures are not thread-safe:

  • BucketizedBufferAllocator::Alloc/FreeResource — concurrent allocations corrupt bucket lists
  • CommandQueue methods — concurrent command list submissions race
  • ExecutionContext — concurrent Flush/SetCommandRecorder calls race
  • DescriptorPool::AllocDescriptors — concurrent descriptor allocation corrupts pool state

Fix

Add per-instance mutexes to each of the 4 classes:

| Class | Lock type | Why |
| --- | --- | --- |
| `BucketizedBufferAllocator` | `std::mutex` + `std::atomic` for `m_defaultRoundingMode` | `FreeResource` releases the lock before calling `QueueReference` to prevent lock-order inversion |
| `CommandQueue` | `std::recursive_mutex` | Re-entrance: `ExecuteCommandList` → `ExecuteCommandLists`, `Close` → `GetCurrentCompletionEvent` |
| `ExecutionContext` | `std::recursive_mutex` + `std::atomic<bool>` for `m_closed` | Re-entrance: `Flush` ↔ `SetCommandRecorder` cycle |
| `DescriptorPool` | `std::mutex` | Simple mutual exclusion on `AllocDescriptors`, `Trim`, `GetTotalCapacity` |
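The `CommandQueue` row needs a `std::recursive_mutex` because a locked public method calls another locked public method on the same thread. A minimal sketch of that re-entrance pattern (class and member names here are illustrative stand-ins, not the actual DML EP code):

```cpp
#include <mutex>
#include <vector>

// Simplified stand-in for the DML EP's CommandQueue.
class CommandQueue {
public:
    // Convenience overload that re-enters ExecuteCommandLists below
    // while already holding m_mutex.
    void ExecuteCommandList(int list) {
        std::lock_guard<std::recursive_mutex> guard(m_mutex);
        ExecuteCommandLists({list});  // same-thread re-acquire: fine with recursive_mutex
    }

    void ExecuteCommandLists(const std::vector<int>& lists) {
        std::lock_guard<std::recursive_mutex> guard(m_mutex);
        for (int l : lists) m_submitted.push_back(l);
    }

    std::size_t SubmittedCount() {
        std::lock_guard<std::recursive_mutex> guard(m_mutex);
        return m_submitted.size();
    }

private:
    // A plain std::mutex here would self-deadlock on the
    // ExecuteCommandList -> ExecuteCommandLists call chain.
    std::recursive_mutex m_mutex;
    std::vector<int> m_submitted;
};
```

The same reasoning applies to the `Close` → `GetCurrentCompletionEvent` chain mentioned in the table.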

Each session has its own instances of these objects, so the mutexes only serialize intra-session calls. Cross-session concurrency is fully preserved.
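The allocator's unlock-before-callback detail is the subtle part of the fix. A hedged sketch of that pattern, with invented names (`Alloc`, `FreeResource`, and the `m_queueReference` callback are simplified stand-ins for the real `BucketizedBufferAllocator` and `ExecutionContext::QueueReference`):

```cpp
#include <atomic>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Illustrative sketch only; the real allocator manages D3D12 buffers in
// size-bucketed free lists.
class Allocator {
public:
    explicit Allocator(std::function<void(int)> queueReference)
        : m_queueReference(std::move(queueReference)) {}

    int Alloc() {
        std::lock_guard<std::mutex> guard(m_mutex);
        if (!m_freeList.empty()) {
            int id = m_freeList.back();
            m_freeList.pop_back();
            return id;
        }
        return m_next++;
    }

    void FreeResource(int id) {
        {
            std::lock_guard<std::mutex> guard(m_mutex);
            m_freeList.push_back(id);
        }  // release the allocator lock FIRST...
        // ...then call into the execution context, so we never hold the
        // allocator lock while acquiring context/queue locks
        // (avoids the allocator->context vs context->queue->allocator inversion).
        m_queueReference(id);
    }

    // Read/written outside the mutex, hence atomic.
    std::atomic<int> m_defaultRoundingMode{0};

private:
    std::mutex m_mutex;
    std::vector<int> m_freeList;
    int m_next = 0;
    std::function<void(int)> m_queueReference;  // stand-in for QueueReference
};
```

If `FreeResource` called the callback while still holding `m_mutex`, a second thread holding the context lock and calling back into `Alloc` could deadlock.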

Verification

Tested with concurrent inference stress tests:

  • 2 models (person detection + pose estimation) running simultaneously — crashes consistently without fix, stable with fix
  • 3 models running simultaneously — stable
  • 1000-iteration stress test — no crashes

Tested on NVIDIA GeForce RTX 5070 Ti with DirectML, Windows 11.

Motivation and Context

Applications that run multiple ML models concurrently (e.g., real-time sports analysis with person detection + pose estimation) need concurrent DML sessions for performance. The current code assumes single-threaded access to per-session EP objects, which breaks when sessions share a D3D12 device.

This is a minimal fix — only adding locks where data races were observed. No API changes, no behavioral changes for single-session usage.

Add thread-safety to 4 DML EP data structures that race when multiple
InferenceSessions run concurrently on the same D3D12 device:

- BucketizedBufferAllocator: std::mutex on Alloc/FreeResource,
  std::atomic for m_defaultRoundingMode. FreeResource releases lock
  before calling ExecutionContext::QueueReference to prevent lock-order
  inversion (allocator→context vs context→queue→allocator).

- CommandQueue: std::recursive_mutex on all methods (re-entrance:
  ExecuteCommandList→ExecuteCommandLists, Close→GetCurrentCompletionEvent).

- ExecutionContext: std::recursive_mutex on all public/private methods
  (re-entrance: Flush↔SetCommandRecorder cycle). std::atomic<bool> for
  m_closed to eliminate data race in IsClosed().

- DescriptorPool: std::mutex on AllocDescriptors, Trim, GetTotalCapacity.
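The `ExecutionContext` combination above (recursive mutex for the `Flush` ↔ `SetCommandRecorder` cycle, plus an atomic flag so `IsClosed()` never needs the lock) can be sketched as follows. Names and method bodies are illustrative, not the actual DML EP implementation:

```cpp
#include <atomic>
#include <mutex>

// Simplified stand-in for the DML EP's ExecutionContext.
class ExecutionContext {
public:
    void Close() {
        std::lock_guard<std::recursive_mutex> guard(m_mutex);
        Flush();  // re-enters m_mutex on the same thread
        m_closed.store(true, std::memory_order_release);
    }

    void Flush() {
        std::lock_guard<std::recursive_mutex> guard(m_mutex);
        SetCommandRecorder();  // another same-thread re-acquire
        ++m_flushCount;
    }

    // Lock-free: safe to call from any thread, even mid-Close/Flush,
    // which is why m_closed is atomic rather than mutex-guarded.
    bool IsClosed() const { return m_closed.load(std::memory_order_acquire); }

    int FlushCount() {
        std::lock_guard<std::recursive_mutex> guard(m_mutex);
        return m_flushCount;
    }

private:
    void SetCommandRecorder() {
        std::lock_guard<std::recursive_mutex> guard(m_mutex);
        // (re)bind recorder state; in the real code this can call back into Flush
    }

    std::recursive_mutex m_mutex;
    std::atomic<bool> m_closed{false};
    int m_flushCount = 0;
};
```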

Each session has its own instances of these objects, so the mutexes only
serialize intra-session calls. Cross-session concurrency is fully preserved.

Fixes 0x8000FFFF "Catastrophic failure" in MLOperatorAuthorImpl.cpp when
running concurrent DML inference sessions with per-session command queues.

Verified with concurrent inference stress tests (2-3 models running
simultaneously, 1000+ iterations) — crashes consistently without the fix,
stable with the fix.
@oysteinkrog
Author

@oysteinkrog please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree company="Initial Force AS"
