Deadlock after fork when calling dgetrf_

As described in numpy/numpy#30092 and scipy/scipy#23686, there is a deadlock in OpenBLAS when calling `dgetrf_` after a fork. I instrumented the calls to `LOCK_COMMAND` and `UNLOCK_COMMAND` in `blas_server.c` and I think the problem is in `exec_blas_async`. This is "new" after #5170.

Here is the `main()` of the test code:
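```
// Needs <stdio.h>, <stdlib.h>, <stdint.h>, <unistd.h>, <sys/wait.h>
// and a prototype for dgetrf_ (omitted here).
int main() {
    int64_t m = 200, n = 200;
    int64_t lda = m;
    int64_t info;
    int64_t ipiv[200];

    // array is an identity matrix: zero everything, then set the diagonal
    double arr[200*200] = {0};
    for (int i = 0; i < m*n; i += n + 1) {
        arr[i] = 1.0;
    }

    printf("before fork\n");
    pid_t pid = fork();
    printf("after fork\n");
    if (pid == 0) {
        // the child exits immediately without calling into BLAS
        printf("inside child\n");
        exit(0);
    } else {
        wait(NULL);
    }

    // the parent calls into OpenBLAS after the fork; this is where it hangs
    printf("before dgetrf\n");
    dgetrf_(&m, &n, arr, &lda, ipiv, &info);
    printf("after dgetrf\n");
    return 0;
}
```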

and here is what I see with debug printing (on OpenBLAS HEAD, using ``)

```
installing atfork handler in memory::openblas_fork_handler 2015
in blas_thread_init 565
in blas_thread_init 567 server_lock locked
in blas_thread_init 615
in blas_thread_init 623
in blas_thread_init 626 server_lock unlocked
before fork
in blas_thread_shutdown
in blas_thread_shutdown 1000 server_lock locked
in blas_thread_shutdown 1042 server_lock unlocked
after fork
after fork
inside child
in blas_thread_shutdown
in blas_thread_shutdown 1000 server_lock locked
in blas_thread_shutdown 1042 server_lock unlocked
before dgetrf
in exec_blas_async 644
in exec_blas_async 647 server_lock locked
in blas_thread_init 565
```

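Note the call to `LOCK_COMMAND` in `exec_blas_async`, followed by the call to `blas_thread_init`, which tries to take `server_lock` again with `LOCK_COMMAND` while the caller still holds it. The trace stops at line 565 and never reaches the "server_lock locked" message at line 567, so the second lock attempt never returns. https://github.com/OpenMathLib/OpenBLAS/blob/0c59ae0b45a8f30224f045902bc558381d6f8974/driver/others/blas_server.c#L638-L644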
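
To illustrate the pattern in isolation, here is a minimal sketch that does not use OpenBLAS at all. All names (`demo_lock`, `demo_avail`, `demo_thread_init`, `demo_exec_async`, `demo_setup`) are hypothetical stand-ins, and it assumes `LOCK_COMMAND` behaves like `pthread_mutex_lock` on a non-recursive mutex. It uses an error-checking mutex so that the nested lock reports `EDEADLK` instead of hanging.

```c
#define _XOPEN_SOURCE 700  /* for PTHREAD_MUTEX_ERRORCHECK under strict -std=c11 */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-ins for server_lock and blas_server_avail. */
static pthread_mutex_t demo_lock;
static int demo_avail = 0;

/* Mimics blas_thread_init: it takes the lock itself. */
static int demo_thread_init(void) {
    int rc = pthread_mutex_lock(&demo_lock);
    if (rc != 0) return rc;  /* EDEADLK when this thread already holds the lock */
    demo_avail = 1;
    pthread_mutex_unlock(&demo_lock);
    return 0;
}

/* Mimics the lazy re-init in exec_blas_async: lock, then call init. */
static int demo_exec_async(void) {
    int rc = 0;
    pthread_mutex_lock(&demo_lock);
    if (demo_avail == 0) rc = demo_thread_init();  /* nested lock attempt */
    pthread_mutex_unlock(&demo_lock);
    return rc;
}

/* Creates demo_lock as an error-checking mutex and resets the flag. */
static void demo_setup(void) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&demo_lock, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);
    demo_avail = 0;
}
```

Calling `demo_setup()` and then `demo_exec_async()` returns `EDEADLK` and leaves `demo_avail` at 0. With a default mutex the nested `pthread_mutex_lock` would typically block forever instead, which matches the trace above.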

I am not sure what the best way is to solve this. Note that the first thing `blas_thread_init` does is to check `blas_server_avail` (with no lock), so maybe the lock/unlock in `exec_blas_async` should be removed?

	#ifdef SMP_SERVER
	// Handle lazy re-init of the thread-pool after a POSIX fork
	LOCK_COMMAND(&server_lock);
	if (unlikely(blas_server_avail == 0)) blas_thread_init();
	UNLOCK_COMMAND(&server_lock);
	#endif
	BLASLONG i = 0;
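
One possible shape for the idea above, as a sketch only and not a proposed patch: the caller checks the flag without holding the lock, and the init function takes the lock and re-checks the flag itself. All names (`sketch_lock`, `sketch_avail`, `sketch_init_count`, and the three functions) are hypothetical, and the flag is made a C11 atomic here so the unlocked read is well defined, which is an assumption on my part and not what `blas_server.c` does today.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins; these are not the OpenBLAS definitions. */
static pthread_mutex_t sketch_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int sketch_avail;   /* plays the role of blas_server_avail */
static int sketch_init_count;     /* how many times the pool was (re)built */

/* Init takes the lock itself and re-checks the flag under the lock. */
static void sketch_thread_init(void) {
    if (atomic_load(&sketch_avail)) return;   /* fast path, no lock */
    pthread_mutex_lock(&sketch_lock);
    if (!atomic_load(&sketch_avail)) {        /* another thread may have won */
        sketch_init_count++;                  /* stands in for starting threads */
        atomic_store(&sketch_avail, 1);
    }
    pthread_mutex_unlock(&sketch_lock);
}

/* The caller no longer holds the lock while calling init. */
static void sketch_exec_async(void) {
    if (!atomic_load(&sketch_avail)) sketch_thread_init();
}

/* Stands in for the shutdown that runs around fork: marks the pool unavailable. */
static void sketch_shutdown(void) {
    pthread_mutex_lock(&sketch_lock);
    atomic_store(&sketch_avail, 0);
    pthread_mutex_unlock(&sketch_lock);
}
```

With this layout, two calls to `sketch_exec_async()` initialize once, and a call after `sketch_shutdown()` initializes again without any nested locking. Whether the real code can tolerate the unlocked check depends on details of the thread pool that this sketch does not model.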

Deadlock after fork when calling dgetrf_ #5520
