Additional diagnostics for DML failure path#28495

Merged: adrastogi merged 2 commits into main from adrastogi/dml-diagnostic-fix on May 23, 2026

@adrastogi
Contributor

Description

In DmlGraphFusionHelper::ExecuteReusableCommandList, after ExecuteCommandList fails:

  • Broaden the failure branch from just DXGI_ERROR_DEVICE_REMOVED to also catch DEVICE_HUNG, DEVICE_RESET, and
    DRIVER_INTERNAL_ERROR.
  • Query GetDeviceRemovedReason on both the DML and D3D12 devices (matching the pattern in DmlCommandRecorder.cpp).
  • Throw via ORT_THROW_HR_MSG with a clear message that names the failure as a TDR / device-removal event and includes all three HRESULTs (the ExecuteCommandList result and the two GetDeviceRemovedReason values) for triage. The existing DEVICE_REMOVED path still throws the same HRESULT as before.
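
The first and third bullets can be illustrated with a small portable sketch. It is not the code from this PR: `HResult`, `IsDeviceLost`, `FormatDeviceLostMessage`, and the message wording are hypothetical names chosen for illustration, and the real change throws through ORT_THROW_HR_MSG rather than returning a string. Only the four DXGI error values are taken from winerror.h.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Portable stand-in for HRESULT (a signed 32-bit value on Windows) so the
// sketch compiles without <windows.h>. This alias is an assumption.
using HResult = std::uint32_t;

// DXGI error codes, with the values defined in winerror.h.
constexpr HResult kDeviceRemoved = 0x887A0005u;        // DXGI_ERROR_DEVICE_REMOVED
constexpr HResult kDeviceHung = 0x887A0006u;           // DXGI_ERROR_DEVICE_HUNG
constexpr HResult kDeviceReset = 0x887A0007u;          // DXGI_ERROR_DEVICE_RESET
constexpr HResult kDriverInternalError = 0x887A0020u;  // DXGI_ERROR_DRIVER_INTERNAL_ERROR

// True for the four device-lost HRESULTs named in the description.
constexpr bool IsDeviceLost(HResult hr) {
  return hr == kDeviceRemoved || hr == kDeviceHung ||
         hr == kDeviceReset || hr == kDriverInternalError;
}

// Hypothetical message layout: names the event and lists all three HRESULTs.
std::string FormatDeviceLostMessage(HResult executeHr, HResult dmlReason,
                                    HResult d3d12Reason) {
  assert(IsDeviceLost(executeHr));  // only meaningful on the device-lost path
  char buffer[256];
  std::snprintf(buffer, sizeof(buffer),
                "GPU device lost (likely TDR / device removal). "
                "ExecuteCommandList: 0x%08X, DML removed reason: 0x%08X, "
                "D3D12 removed reason: 0x%08X",
                static_cast<unsigned>(executeHr),
                static_cast<unsigned>(dmlReason),
                static_cast<unsigned>(d3d12Reason));
  return std::string(buffer);
}
```

In the real helper, the two reason values would come from calling GetDeviceRemovedReason on the DML device and on the D3D12 device, as the second bullet describes.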

Motivation and Context

While investigating a WebNN sample failure on Chrome running Stable Diffusion 1.5 on an AMD Radeon 860M iGPU, ORT 1.23.4 surfaced this error:

DmlGraphFusionHelper.cpp(1078) ... 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.

0x887A0006 is DXGI_ERROR_DEVICE_HUNG. The text "...invalid command passed by the calling application" seems to be the FormatMessage string for that HRESULT.
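
As a quick check on that reading of 0x887A0006, the value can be split into its HRESULT fields. The sketch below is illustrative only: `HresultParts` and `DecodeHresult` are hypothetical names, and the facility is extracted with the same 0x1FFF mask that the HRESULT_FACILITY macro in winerror.h uses.

```cpp
#include <cassert>
#include <cstdint>

// Fields pulled out of an HRESULT: the failure (severity) bit, the facility,
// and the facility-specific code in the low 16 bits.
struct HresultParts {
  bool failure;
  std::uint32_t facility;
  std::uint32_t code;
};

// Mirrors what the FAILED, HRESULT_FACILITY, and HRESULT_CODE macros compute.
constexpr HresultParts DecodeHresult(std::uint32_t hr) {
  return HresultParts{(hr >> 31) != 0u, (hr >> 16) & 0x1FFFu, hr & 0xFFFFu};
}

constexpr std::uint32_t kFacilityDxgi = 0x87Au;  // FACILITY_DXGI in winerror.h

// 0x887A0006 decodes to: failure, DXGI facility, code 6 (DEVICE_HUNG).
static_assert(DecodeHresult(0x887A0006u).failure, "severity bit is set");
static_assert(DecodeHresult(0x887A0006u).facility == kFacilityDxgi, "DXGI facility");
static_assert(DecodeHresult(0x887A0006u).code == 6u, "code 6 within the facility");
```

The same decoding puts DXGI_ERROR_DEVICE_REMOVED (0x887A0005) and DXGI_ERROR_DEVICE_RESET (0x887A0007) in the same facility, with codes 5 and 7.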

The pre-existing code in DmlGraphFusionHelper::ExecuteReusableCommandList only special-cased DXGI_ERROR_DEVICE_REMOVED, so for DEVICE_HUNG / DEVICE_RESET / DRIVER_INTERNAL_ERROR HRESULTs the user just got the raw message. I wanted to add a little more diagnostic information to this.

Contributor

Copilot AI left a comment

Pull request overview

Adds richer diagnostics in DmlGraphFusionHelper::ExecuteReusableCommandList when ExecuteCommandList fails with a device-lost/removal-class HRESULT, so that TDR-induced failures (DEVICE_HUNG, DEVICE_RESET, DRIVER_INTERNAL_ERROR) no longer surface only as the misleading "invalid command" FormatMessage text.

Changes:

  • Broadens the failure branch from DXGI_ERROR_DEVICE_REMOVED only to also include DEVICE_HUNG, DEVICE_RESET, and DRIVER_INTERNAL_ERROR.
  • Queries GetDeviceRemovedReason on both the DML and D3D12 devices and selects the most specific HRESULT to throw with.
  • Uses ORT_THROW_HR_MSG to emit a clear message naming TDR/device-removal as the likely cause and including the original HRESULT and both removed-reason HRESULTs.
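
The overview does not say how the "most specific" HRESULT is chosen, so the following is only a guess at one plausible rule, not the logic merged in this PR. It assumes a failing removed-reason from the DML device is preferred, then one from the D3D12 device, with the original ExecuteCommandList result as the fallback. `HResult`, `Failed`, and `SelectHresultToThrow` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for the Windows HRESULT type (an assumption for this sketch).
using HResult = std::uint32_t;

constexpr HResult kOk = 0u;                      // S_OK: no removal reported
constexpr HResult kDeviceRemoved = 0x887A0005u;  // DXGI_ERROR_DEVICE_REMOVED
constexpr HResult kDeviceHung = 0x887A0006u;     // DXGI_ERROR_DEVICE_HUNG
constexpr HResult kDeviceReset = 0x887A0007u;    // DXGI_ERROR_DEVICE_RESET

// Same test as the FAILED macro: the top (severity) bit is set.
constexpr bool Failed(HResult hr) { return (hr >> 31) != 0u; }

// ASSUMED preference order, not taken from the PR diff:
// DML removed reason, then D3D12 removed reason, then the original HRESULT.
constexpr HResult SelectHresultToThrow(HResult executeHr, HResult dmlReason,
                                       HResult d3d12Reason) {
  return Failed(dmlReason)     ? dmlReason
         : Failed(d3d12Reason) ? d3d12Reason
                               : executeHr;
}
```

Under this rule, a hang reported as DEVICE_REMOVED by ExecuteCommandList but as DEVICE_HUNG by the D3D12 device would be thrown as DEVICE_HUNG, while a failure with no removed reason on either device keeps its original HRESULT.
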

File: onnxruntime/core/providers/dml/DmlExecutionProvider/src/DmlGraphFusionHelper.cpp
Description: Expands the device-lost detection branch and emits a more diagnostic error message including DML/D3D12 GetDeviceRemovedReason values.

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 0

@adrastogi adrastogi requested a review from fdwr May 15, 2026 03:20
@fdwr
Contributor

fdwr commented May 18, 2026

Note you might also get more info via the DML debug layer. It's probably already installed if you have Visual Studio installed, but otherwise see: https://learn.microsoft.com/en-us/windows/ai/directml/dml-debug-layer. Then:

  • Start / Run / dxcpl.exe
  • Add your process .exe path to the list.
  • Then Force debug messages on.
  • You should see additional output in the Visual Studio Output window, a different debugger of your choice, or via DebugView.

Contributor

@fdwr fdwr left a comment

👍

@adrastogi adrastogi merged commit a1fc916 into main May 23, 2026
88 checks passed
@adrastogi adrastogi deleted the adrastogi/dml-diagnostic-fix branch May 23, 2026 20:49

3 participants