As described in numpy/numpy#30092 and scipy/scipy#23686, there is a deadlock in OpenBLAS when calling `dgetrf_` after a fork. I instrumented the calls to `LOCK_COMMAND` and `UNLOCK_COMMAND` in `blas_server.c` and I think the problem is in `exec_blas_async`. This is "new" after #5170.

Here is the `main()` of the test code:

    int main() {
        int64_t m = 200, n = 200;
        int64_t lda = m;
        int64_t info;
        int64_t ipiv[200];
        // array is an identity matrix: zero everything, then set the diagonal
        double arr[200*200] = {0};
        for (int i = 0; i < m*n; i += n + 1) {
            arr[i] = 1.0;
        }
        printf("before fork\n");
        pid_t pid = fork();
        printf("after fork\n");
        if (pid == 0) {
            // child exits immediately without touching BLAS
            printf("inside child\n");
            exit(0);
        } else {
            wait(NULL);
        }
        // the parent calls into OpenBLAS only after the child has exited
        printf("before dgetrf\n");
        dgetrf_(&m, &n, arr, &lda, ipiv, &info);
        printf("after dgetrf\n");
        return 0;
    }

and here is what I see with debug printing (on OpenBLAS HEAD, using ``)

    installing atfork handler in memory::openblas_fork_handler 2015
    in blas_thread_init 565
    in blas_thread_init 567 server_lock locked
    in blas_thread_init 615
    in blas_thread_init 623
    in blas_thread_init 626 server_lock unlocked
    before fork
    in blas_thread_shutdown
    in blas_thread_shutdown 1000 server_lock locked
    in blas_thread_shutdown 1042 server_lock unlocked
    after fork
    after fork
    inside child
    in blas_thread_shutdown
    in blas_thread_shutdown 1000 server_lock locked
    in blas_thread_shutdown 1042 server_lock unlocked
    before dgetrf
    in exec_blas_async 644
    in exec_blas_async 647 server_lock locked
    in blas_thread_init 565

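Note the call to `LOCK_COMMAND` in `exec_blas_async`, followed by the call to `blas_thread_init`, which tries to call `LOCK_COMMAND` again while `server_lock` is still held by the same thread. The log stops at `blas_thread_init 565` and never reaches the "server_lock locked" line. Boom.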

OpenBLAS/driver/others/blas_server.c, lines 638 to 644 in 0c59ae0:

    #ifdef SMP_SERVER
    // Handle lazy re-init of the thread-pool after a POSIX fork
    LOCK_COMMAND(&server_lock);
    if (unlikely(blas_server_avail == 0)) blas_thread_init();
    UNLOCK_COMMAND(&server_lock);
    #endif
    BLASLONG i = 0;

I am not sure what the best way is to solve this. Note that the first thing `blas_thread_init` does is to check `blas_server_avail` (with no lock), so maybe the lock/unlock in `exec_blas_async` should be removed?