Hi,
I am performing large-scale SVD computations using MAK on GPUs and frequently encounter Out-Of-Memory (OOM) errors as the matrix size grows. I would like to ask whether there are ways to make GPU SVD more memory-efficient.
For example, in svd_compact!(A), when size(A, 1) < size(A, 2), I noticed that _gpu_gesvd_maybe_transpose! transposes the input matrix before calling cuSOLVER’s _gpu_gesvd!, since cuSOLVER only supports the m ≥ n case. However, this creates a second, transposed copy on the GPU while the original is still alive, effectively doubling memory usage. Would it be possible for MAK to free the original matrix immediately after the transpose, if it is no longer needed? And likewise for the transposed output?
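For illustration, this is roughly the user-side workaround I have in mind for the wide case: materialize the adjoint myself, free the original right away, and swap the factors afterwards. It is only a sketch; I am assuming MAK here is MatrixAlgebraKit.jl and that svd_compact! returns the (U, S, Vᴴ) triple of its argument:

```julia
using CUDA, LinearAlgebra
using MatrixAlgebraKit

# A wide CuMatrix: size(A, 1) < size(A, 2)
A = CUDA.rand(Float32, 1_000, 4_000)

# Materialize the adjoint and free the original immediately,
# so only one full-size copy is resident during the factorization.
Ah = copy(A')                    # tall matrix, m ≥ n as cuSOLVER requires
CUDA.unsafe_free!(A)

# Assumption: svd_compact! returns (U, S, Vᴴ) of its argument.
Ut, S, Vht = svd_compact!(Ah)

# Since A = (Aᴴ)ᴴ, the factors of the original matrix follow by swapping:
U  = copy(Vht')                  # U of A
Vh = copy(Ut')                   # Vᴴ of A
```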
Also, are there any other ways to handle SVDs of larger matrices?
I have run into a similar issue when using TensorOperations.jl. For example, a contraction such as @tensor A[a, c, b, d] := B[a, k, b] * C[c, k, d] causes internal tensor permutations before contraction, resulting in additional temporary allocations. If B and C are large CuArrays, the temporary copies can again double memory usage.
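For illustration, this is roughly the kind of manual workaround I mean for the contraction above: doing the permutations by hand so each original can be freed as soon as its permuted copy exists, and then contracting with a single GEMM. The sizes are made up, and this is only a sketch of the idea, not how TensorOperations.jl works internally:

```julia
using CUDA, LinearAlgebra

# Made-up sizes, just for illustration.
da, db, dc, dd, dk = 64, 64, 64, 64, 256
B = CUDA.rand(Float32, da, dk, db)   # B[a, k, b]
C = CUDA.rand(Float32, dc, dk, dd)   # C[c, k, d]

# Permute each operand into GEMM layout and free the original as soon as
# the permuted copy exists, so only one extra copy is alive at a time.
Bp = permutedims(B, (1, 3, 2))       # (a, b, k)
CUDA.unsafe_free!(B)
Cp = permutedims(C, (2, 1, 3))       # (k, c, d)
CUDA.unsafe_free!(C)

# Single GEMM: (a*b, k) * (k, c*d) -> (a*b, c*d)
Atmp = reshape(Bp, da * db, dk) * reshape(Cp, dk, dc * dd)
CUDA.unsafe_free!(Bp); CUDA.unsafe_free!(Cp)   # drop the permuted copies

# Bring the result into the requested index order A[a, c, b, d].
A = permutedims(reshape(Atmp, da, db, dc, dd), (1, 3, 2, 4))
CUDA.unsafe_free!(Atmp)
```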
Additionally, I have locally modified TensorKit so that contractions and factorizations run on the GPU. In doing so, I noticed that the Julia garbage collector sometimes delays reclaiming GPU memory, causing the next large allocation to fail with OOM. To avoid this, I sometimes need to call CUDA.unsafe_free! manually. Have the maintainers observed this GC behavior as well?
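For illustration, the kind of pattern I end up reaching for looks roughly like this (the helper name is made up for this sketch):

```julia
using CUDA

# Hypothetical helper: retry an allocation-heavy step once after asking the
# Julia GC and the CUDA.jl memory pool to release memory, instead of failing
# immediately with OOM.
function with_memory_retry(f)
    try
        return f()
    catch err
        err isa CUDA.OutOfGPUMemoryError || rethrow()
        GC.gc(true)        # finalize unreachable CuArrays
        CUDA.reclaim()     # return cached pool memory to the driver
        return f()         # retry once
    end
end

# e.g. wrap the step that tends to OOM:
# USVᴴ = with_memory_retry(() -> svd_compact!(Ah))
```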
Thanks for the great packages and all your work!