Commit a1fc916

adrastogi and Aditya Rastogi authored
Additional diagnostics for DML failure path (#28495)
### Description

In `DmlGraphFusionHelper::ExecuteReusableCommandList`, after `ExecuteCommandList` fails:

* Broaden the failure branch from just `DXGI_ERROR_DEVICE_REMOVED` to also catch `DXGI_ERROR_DEVICE_HUNG`, `DXGI_ERROR_DEVICE_RESET`, and `DXGI_ERROR_DRIVER_INTERNAL_ERROR`.
* Query `GetDeviceRemovedReason` on both the DML and D3D12 devices (matching the pattern in `DmlCommandRecorder.cpp`).
* Throw via `ORT_THROW_HR_MSG` with a clear message that names the failure as a likely TDR / device-removal event and includes all three HRESULTs for triage.

The existing `DXGI_ERROR_DEVICE_REMOVED` path still throws the same HRESULT as before.

### Motivation and Context

While investigating a WebNN sample failure on Chrome running Stable Diffusion 1.5 on an AMD Radeon 860M iGPU, ORT 1.23.4 surfaced this error:

`DmlGraphFusionHelper.cpp(1078) ... 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.`

0x887A0006 is `DXGI_ERROR_DEVICE_HUNG`. The text "...invalid command passed by the calling application" seems to be the `FormatMessage` string for that HRESULT. The pre-existing code in `DmlGraphFusionHelper::ExecuteReusableCommandList` only special-cased `DXGI_ERROR_DEVICE_REMOVED`, so for `DXGI_ERROR_DEVICE_HUNG`, `DXGI_ERROR_DEVICE_RESET`, and `DXGI_ERROR_DRIVER_INTERNAL_ERROR` the user got only the raw message. This change adds more diagnostic information to that failure path.

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
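For triage, it can help to decode a raw HRESULT such as 0x887A0006 by hand. The following is a minimal, portable sketch and not part of the change: `DeviceLostName`, `Facility`, and `Code` are hypothetical helper names, and the numeric values are the standard `DXGI_ERROR_*` codes from the Windows SDK. On Windows, the real macros from `<winerror.h>` would be used instead.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical helper: maps the four HRESULTs handled by this change to their
// symbolic names. The values are the standard DXGI error codes.
constexpr std::string_view DeviceLostName(std::uint32_t hr) {
    switch (hr) {
        case 0x887A0005u: return "DXGI_ERROR_DEVICE_REMOVED";
        case 0x887A0006u: return "DXGI_ERROR_DEVICE_HUNG";
        case 0x887A0007u: return "DXGI_ERROR_DEVICE_RESET";
        case 0x887A0020u: return "DXGI_ERROR_DRIVER_INTERNAL_ERROR";
        default:          return "(not one of the four device-lost HRESULTs)";
    }
}

// Same bit layout as the HRESULT_FACILITY and HRESULT_CODE macros:
// bits 16..28 hold the facility, bits 0..15 hold the error code.
constexpr std::uint32_t Facility(std::uint32_t hr) { return (hr >> 16) & 0x1FFFu; }
constexpr std::uint32_t Code(std::uint32_t hr) { return hr & 0xFFFFu; }
```

With these helpers, `DeviceLostName(0x887A0006u)` yields `DXGI_ERROR_DEVICE_HUNG`, and `Facility(0x887A0006u)` yields `0x87A`, the DXGI facility.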
1 parent d464b2a commit a1fc916

1 file changed

Lines changed: 42 additions & 4 deletions

onnxruntime/core/providers/dml/DmlExecutionProvider/src/DmlGraphFusionHelper.cpp

@@ -1094,11 +1094,49 @@ namespace DmlGraphFusionHelper
         uint64_t completionValue;
         HRESULT hr = provider->ExecuteCommandList(commandListState.graphicsCommandList.Get(), fence.GetAddressOf(), &completionValue);
 
-        if (hr == DXGI_ERROR_DEVICE_REMOVED)
+        // ExecuteCommandList may report any of the device-lost / removal-class HRESULTs when
+        // the GPU device has transitioned to a "removed" state. Windows often surfaces these
+        // through FormatMessage as "...most likely because of an invalid command...", but in
+        // practice they almost always indicate a Timeout Detection and Recovery (TDR) event
+        // (e.g. a long-running shader exceeding the system TdrDelay), a driver fault, or a
+        // hardware reset rather than a malformed command from the calling application.
+        //
+        // GetDeviceRemovedReason on the DML and D3D12 devices reports the underlying reason
+        // when the device has been removed. Throw with the most specific reason and a
+        // clearer message so that downstream tooling (Watson, telemetry, anyone reading the
+        // log) doesn't have to guess at the cause.
+        if (hr == DXGI_ERROR_DEVICE_REMOVED ||
+            hr == DXGI_ERROR_DEVICE_HUNG ||
+            hr == DXGI_ERROR_DEVICE_RESET ||
+            hr == DXGI_ERROR_DRIVER_INTERNAL_ERROR)
         {
-            ComPtr<ID3D12Device> device;
-            ORT_THROW_IF_FAILED(provider->GetD3DDevice(&device));
-            ORT_THROW_IF_FAILED(device->GetDeviceRemovedReason());
+            ComPtr<ID3D12Device> d3dDevice;
+            ComPtr<IDMLDevice> dmlDevice;
+            ORT_THROW_IF_FAILED(provider->GetD3DDevice(&d3dDevice));
+            ORT_THROW_IF_FAILED(provider->GetDmlDevice(&dmlDevice));
+
+            const HRESULT dmlRemovedReason = dmlDevice->GetDeviceRemovedReason();
+            const HRESULT d3dRemovedReason = d3dDevice->GetDeviceRemovedReason();
+
+            // Prefer the more-specific reason returned by GetDeviceRemovedReason - matching
+            // the prior behavior for DXGI_ERROR_DEVICE_REMOVED and the pattern in
+            // DmlCommandRecorder.cpp which checks DML first, then D3D12. Fall back to the
+            // original ExecuteCommandList HRESULT if neither device reports a removal reason.
+            const HRESULT throwHr = FAILED(dmlRemovedReason) ? dmlRemovedReason
+                                  : FAILED(d3dRemovedReason) ? d3dRemovedReason
+                                  : hr;
+
+            ORT_THROW_HR_MSG(throwHr,
+                "DirectML execution failed because of a device-lost / removal-class error. "
+                "Windows may report this as 'invalid command' via FormatMessage, but in practice "
+                "this often indicates a Timeout Detection and Recovery (TDR) event (e.g. a shader "
+                "exceeding the system TdrDelay), a driver fault, or a hardware reset rather than "
+                "a malformed command from the application. ExecuteCommandList HRESULT=0x%08X, "
+                "ID3D12Device::GetDeviceRemovedReason=0x%08X, "
+                "IDMLDevice::GetDeviceRemovedReason=0x%08X.",
+                static_cast<unsigned int>(hr),
+                static_cast<unsigned int>(d3dRemovedReason),
+                static_cast<unsigned int>(dmlRemovedReason));
         }
 
         ORT_THROW_IF_FAILED(hr);
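
The precedence rule in the diff (DML reason first, then D3D12 reason, then the original `ExecuteCommandList` HRESULT) and the three-HRESULT message tail can be exercised outside of onnxruntime. The sketch below is illustrative only: `HResult`, `Failed`, `SelectThrowHr`, and `FormatTriage` are hypothetical portable stand-ins, not the project's or the Windows SDK's names, and `FormatTriage` reproduces only the final part of the thrown message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Portable stand-in for the Windows HRESULT type (a signed 32-bit value).
using HResult = std::int32_t;

// Same test as the FAILED macro: the severity bit makes failures negative.
constexpr bool Failed(HResult hr) { return hr < 0; }

// Mirrors the ternary chain in the diff: prefer the DML removal reason,
// then the D3D12 removal reason, and otherwise keep the original HRESULT.
constexpr HResult SelectThrowHr(HResult executeHr, HResult dmlReason, HResult d3dReason) {
    return Failed(dmlReason) ? dmlReason
         : Failed(d3dReason) ? d3dReason
         : executeHr;
}

// Formats the three HRESULTs in the same order and with the same %08X
// conversions as the tail of the message passed to ORT_THROW_HR_MSG.
std::string FormatTriage(HResult executeHr, HResult d3dReason, HResult dmlReason) {
    char buffer[256];
    const int written = std::snprintf(
        buffer, sizeof(buffer),
        "ExecuteCommandList HRESULT=0x%08X, "
        "ID3D12Device::GetDeviceRemovedReason=0x%08X, "
        "IDMLDevice::GetDeviceRemovedReason=0x%08X.",
        static_cast<unsigned int>(executeHr),
        static_cast<unsigned int>(d3dReason),
        static_cast<unsigned int>(dmlReason));
    // The formatted text is about 134 characters, so it always fits.
    assert(written > 0 && static_cast<std::size_t>(written) < sizeof(buffer));
    return std::string(buffer);
}
```

For example, if `ExecuteCommandList` returns 0x887A0006 and only the D3D12 device reports 0x887A0005, `SelectThrowHr` picks 0x887A0005 while the message still records all three values. When both devices report a failure the DML reason wins, and when neither does the original HRESULT is thrown.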
