First, I apologize in advance for not being able to provide a short repro.
With GROMACS on Arc A770 and UHD770 (same machine), when running tests with a recent IntelLLVM build, some tests quite reliably (but not always) hang when using Level zero backend.
#0 0x00007e41edcfd7db in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007e41ea111dde in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#2 0x00007e41ea107baa in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#3 0x00007e41ec5a9ecb in ur::level_zero::urEventWait(unsigned int, ur_event_handle_t_* const*) () from /opt/llvm-project/build/install/lib/libur_adapter_level_zero.so.0
#4 0x00007e41ec7f95e7 in urEventWait () from /opt/llvm-project/build/install/lib/libur_loader.so.0
#5 0x00007e41ed981a3f in sycl::_V1::detail::event_impl::waitInternal(bool*) () from /opt/llvm-project/build/install/lib/libsycl.so.8
#6 0x00007e41ed981ba0 in sycl::_V1::detail::event_impl::wait(std::shared_ptr<sycl::_V1::detail::event_impl>, bool*) () from /opt/llvm-project/build/install/lib/libsycl.so.8
#7 0x00007e41ed981cac in sycl::_V1::detail::event_impl::wait_and_throw(std::shared_ptr<sycl::_V1::detail::event_impl>) () from /opt/llvm-project/build/install/lib/libsycl.so.8
#8 0x00007e41eda964b8 in sycl::_V1::event::wait_and_throw() () from /opt/llvm-project/build/install/lib/libsycl.so.8
#9 0x00007e41ef2998e8 in gmx::StatePropagatorDataGpu::Impl::waitVelocitiesReadyOnHost(gmx::AtomLocality) () from /opt/gromacs/build/bin/../lib/libgromacs.so.10
[level_zero:gpu] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) A770 Graphics 12.55.8 [1.6.33276.160000]
Platforms: 1
Platform [#1]:
Version : 1.6
Name : Intel(R) oneAPI Unified Runtime over Level-Zero
Vendor : Intel(R) Corporation
Devices : 1
Type : gpu
Version : 12.55.8
Name : Intel(R) Arc(TM) A770 Graphics
Vendor : Intel(R) Corporation
Driver : 1.6.33276.160000
UUID : 13412816086800030000000
DeviceID : 22176
Num SubDevices : 0
Num SubSubDevices : 0
Aspects : gpu fp16 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address ext_intel_gpu_eu_count ext_intel_gpu_eu_simd_width ext_intel_gpu_slices ext_intel_gpu_subslices_per_slice ext_intel_gpu_eu_count_per_subslice atomic64 ext_intel_device_info_uuid ext_intel_gpu_hw_threads_per_eu ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_intel_legacy_image ext_oneapi_bindless_images ext_oneapi_bindless_images_1d_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_external_memory_import ext_oneapi_external_semaphore_import ext_intel_esimd ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_intel_matrix ext_oneapi_limited_graph ext_oneapi_private_alloca ext_oneapi_bindless_sampled_image_fetch_1d_usm ext_oneapi_bindless_sampled_image_fetch_2d_usm ext_oneapi_bindless_sampled_image_fetch_2d ext_oneapi_bindless_sampled_image_fetch_3d ext_oneapi_queue_profiling_tag ext_oneapi_virtual_mem ext_oneapi_image_array ext_oneapi_virtual_functions ext_intel_spill_memory_size ext_intel_current_clock_throttle_reasons ext_intel_power_limits ext_oneapi_async_memory_alloc
info::device::sub_group_sizes: 8 16 32
Architecture: intel_gpu_acm_g10
Reverting to Compute Runtime 24.52 (LevelZero version 1.6.32224.500000) does not seem to change the behavior, the application still hangs.
I am willing to run more tests and understand that this is a pain to debug based on the description, but I'm stumped right now and would appreciate some advise in what to try next.
Other than two of the integration tests, things work just fine, and there's nothing special about those two that hang.
Normal execution time for the test is a couple seconds.
UR <--- EventCreate( Queue->Context, Queue, IsMultiDevice, HostVisible.value(), Event, Queue->CounterBasedEventsEnabled, false , Queue->InterruptBasedEventsEnabled)(UR_RESULT_SUCCESS)
UR <--- createEventAndAssociateQueue(Queue, Event, CommandType, CommandList, IsInternal, false)(UR_RESULT_SUCCESS)
UR ---> setSignalEvent(Queue, UseCopyEngine, &ZeEvent, Event, NumEventsInWaitList, EventWaitList, CommandList->second.ZeQueue)
UR <--- setSignalEvent(Queue, UseCopyEngine, &ZeEvent, Event, NumEventsInWaitList, EventWaitList, CommandList->second.ZeQueue)(UR_RESULT_SUCCESS)
calling zeCommandListAppendMemoryCopy() with ZeEvent 448225560
NumEventsInWaitList 1: 448224632
ZE ---> zeCommandListAppendMemoryCopy(ZeCommandList, Dst, Src, Size, ZeEvent, WaitList.Length, WaitList.ZeEventList)
UR ---> Queue->executeCommandList(CommandList, BlockingWrite, OkToBatch)
UR <--- Queue->executeCommandList(CommandList, BlockingWrite, OkToBatch)(UR_RESULT_SUCCESS)
<--- urEnqueueUSMMemcpy(.hQueue = 0x1aa70220, .blocking = 0, .pDst = 0x7226a92bf000, .pSrc = 0xffffc001ff3f6000, .size = 7776, .numEventsInWaitList = 0, .phEventWaitList = nullptr, .phEvent = nullptr) -> UR_RESULT_SUCCESS;
---> urEnqueueEventsWaitWithBarrierExt
UR ---> TmpWaitList.createAndRetainUrZeEventList( NumEventsInWaitList, EventWaitList, Queue, false )
UR <--- TmpWaitList.createAndRetainUrZeEventList( NumEventsInWaitList, EventWaitList, Queue, false )(UR_RESULT_SUCCESS)
UR ---> Queue->Context->getAvailableCommandList( Queue, CmdList, false , NumEventsInWaitList, EventWaitList, OkToBatch, nullptr )
UR ---> Queue->insertStartBarrierIfDiscardEventsMode(CommandList)
UR <--- Queue->insertStartBarrierIfDiscardEventsMode(CommandList)(UR_RESULT_SUCCESS)
UR <--- Queue->Context->getAvailableCommandList( Queue, CmdList, false , NumEventsInWaitList, EventWaitList, OkToBatch, nullptr )(UR_RESULT_SUCCESS)
UR ---> insertBarrierIntoCmdList(CmdList, TmpWaitList, ResultEvent, IsInternal, InterruptBasedEventsEnabled)
UR ---> createEventAndAssociateQueue( Queue, &Event, UR_COMMAND_EVENTS_WAIT_WITH_BARRIER, CmdList, IsInternal, InterruptBasedEventsEnabled)
UR ---> EventCreate( Queue->Context, Queue, IsMultiDevice, HostVisible.value(), Event, Queue->CounterBasedEventsEnabled, false , Queue->InterruptBasedEventsEnabled)
Cache empty (Host Visible: 1, Profiling: 0, Counter: 0, Interrupt: 0, Device: 0x1a9f1dc0)
ZE ---> zeEventCreate(ZeEventPool, &ZeEventDesc, &ZeEvent)
UR <--- EventCreate( Queue->Context, Queue, IsMultiDevice, HostVisible.value(), Event, Queue->CounterBasedEventsEnabled, false , Queue->InterruptBasedEventsEnabled)(UR_RESULT_SUCCESS)
UR ---> ur::level_zero::urEventRetain(*Event)
UR <--- ur::level_zero::urEventRetain(*Event)(UR_RESULT_SUCCESS)
UR <--- createEventAndAssociateQueue( Queue, &Event, UR_COMMAND_EVENTS_WAIT_WITH_BARRIER, CmdList, IsInternal, InterruptBasedEventsEnabled)(UR_RESULT_SUCCESS)
ZE ---> zeCommandListAppendWaitOnEvents(CmdList->first, EventWaitList.Length, EventWaitList.ZeEventList)
ZE ---> zeCommandListAppendSignalEvent(CmdList->first, Event->ZeEvent)
UR <--- insertBarrierIntoCmdList(CmdList, TmpWaitList, ResultEvent, IsInternal, InterruptBasedEventsEnabled)(UR_RESULT_SUCCESS)
UR ---> Queue->executeCommandList(CmdList, false , OkToBatch)
UR <--- Queue->executeCommandList(CmdList, false , OkToBatch)(UR_RESULT_SUCCESS)
<--- urEnqueueEventsWaitWithBarrierExt(.hQueue = 0x1aa70220, .pProperties = 0x7ffd0ae54530 ((struct ur_exp_enqueue_ext_properties_t){.stype = UR_STRUCTURE_TYPE_EXP_ENQUEUE_EXT_PROPERTIES, .pNext = nullptr, .flags = 0}), .numEventsInWaitList = 0, .phEventWaitList = nullptr, .phEvent = 0x7ffd0ae54400 (0x1ab40020)) -> UR_RESULT_SUCCESS;
---> urEventWait
UR ---> UrQueue->executeAllOpenCommandLists()
UR <--- UrQueue->executeAllOpenCommandLists()(UR_RESULT_SUCCESS)
ZeEvent = 448003464 # This line (with different value) is the last thing the hanging process prints
ZE ---> zeHostSynchronize(ZeEvent)
UR ---> CleanupEventListFromResetCmdList(EventListToCleanup, QueueLocked)
UR ---> CleanupCompletedEvent(Event, QueueLocked, true )
UR ---> urEventReleaseInternal(Event)
UR <--- urEventReleaseInternal(Event)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR <--- CleanupCompletedEvent(Event, QueueLocked, true )(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(Event)
UR <--- urEventReleaseInternal(Event)(UR_RESULT_SUCCESS)
UR ---> CleanupCompletedEvent(Event, QueueLocked, true )
UR ---> ur::level_zero::urKernelRelease(AssociatedKernel)
UR <--- ur::level_zero::urKernelRelease(AssociatedKernel)(UR_RESULT_SUCCESS)
EDIT 1: Added info about a newer kernel and an older compute-runtime (no effect).
Describe the bug
First, I apologize in advance for not being able to provide a short repro.
With GROMACS on Arc A770 and UHD770 (same machine), when running tests with a recent IntelLLVM build, some tests quite reliably (but not always) hang when using Level zero backend.
Observations:
SYCL_UR_USE_LEVEL_ZERO_V2=1), the tests pass just fine.UR_L0_USE_DRIVER_INORDER_LISTS=1), the tests pass on A770, but still deadlock on UHD770.sycl::ext::oneapi::experimental::submitis used. If I build GROMACS with-DGMX_SYCL_ENABLE_EXPERIMENTAL_SUBMIT_API=OFF, or hackExecCGCommand::enqueueImpQueuein IntelLLVM to haveDiscardUrEventalways false, things work fine.SYCL_UR_TRACE=2also makes the test pass.ZE_DEBUG=1, the bug reproduces slightly less reliably on bare metal, and not at all in docker or when run under gdb (even without any breakpoints). See logs below.gdb(without tracing), the test hangs inzeHostSynchronize, which is consistent withZE_TRACEoutput:NEOReadDebugKeys=1 PrintIoctlEntries=1, the program seems to be stuck in the loop ofIOCTL DRM_IOCTL_I915_GET_RESET_STATS called/IOCTL DRM_IOCTL_I915_GET_RESET_STATS returns 0.Fence expiration time out i915-0000:00:02.0:mdrun-pull-test[2687535:8a!indmesg.UR_LEVEL_ZERO_LOADER_TAGinunified-runtime/cmake/FetchLevelZero.cmake, leads to Only Enable Teardown thread on windows and remove debug on success oneapi-src/level-zero#323. Given the sensitivity of the bug to tracing output, I suspect that this is not the breaking commit, but it somehow makes the bug more likely by making the destructors faster. With tracing enabled, there's quite somezeEventDestroyetc calls happening around the problematic code, and and it's in the middle of the run, so L0 should not be shutting down. Race condition in NEO / L0?To reproduce
docker run --device /dev/dri -e ONEAPI_DEVICE_SELECTOR=level_zero:0 -w /opt/gromacs/build --rm -it CONTAINER_NAME ./bin/mdrun-pull-test -ntmpi 1 -ntomp 120 steps, 0.0 ps.Dockerfile.txt
Environment
Reverting to Compute Runtime 24.52 (LevelZero version 1.6.32224.500000) does not seem to change the behavior, the application still hangs.
Additional context
I am willing to run more tests and understand that this is a pain to debug based on the description, but I'm stumped right now and would appreciate some advise in what to try next.
Other than two of the integration tests, things work just fine, and there's nothing special about those two that hang.
Normal execution time for the test is a couple seconds.
I'm attaching two log files,
ze_debug_fail.txt(bare-metal run withZE_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:0, that ends up hanging) andze_debug_pass.txt(bare-metal run withSYCL_UR_TRACE=2 ZE_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:0, that ends up passing). Diffing them one can more-or-less see what's going on in UR when the application is hanging:ze_debug_pass.txt
ze_debug_fail.txt
EDIT 1: Added info about a newer kernel and an older compute-runtime (no effect).