
Update support for sparse linear algebra #937

Open: luraess wants to merge 1 commit into main from lr/sp-int

luraess (Member) commented Jun 25, 2026

Sparse LinearAlgebra interfaces

Follows up on #859 and supersedes #861.

Implements a first round of LinearAlgebra interfaces (+/-, UniformScaling, Diagonal, triu/tril, kron) for GPU sparse matrices, hooking into the GPUArrays generic machinery where possible and falling back to format-aware implementations where it is not.

The implementation was robot-helped by Claude under my steering.

What is added

src/sparse/interfaces.jl

  • _sptranspose / _spadjoint for CSR, CSC, COO — required by GPUArrays to materialise transpose / adjoint wrappers.
  • +/- for all combinations of plain/transposed/adjoint CSR and CSC via geam; COO routes through CSR and converts back. Cross-format pairs (CSR+CSC, CSR+BSR and their reverses) normalise both operands to CSR before calling geam.
  • +/-/* with UniformScaling for all three formats and their transposed/adjointed wrappers. The identity is materialised as a same-format sparse matrix (_sparse_identity); a TODO notes a potentially more efficient broadcast-singleton approach.
  • +/-/* with Diagonal for all three formats and wrappers. Addition converts the Diagonal to the same sparse format and delegates to the existing geam path. Multiplication scales the nonzero values directly via the COO index arrays (d[colInd] / d[rowInd]); for CSR and CSC this involves a round-trip through COO.
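The Diagonal multiplication described above can be illustrated with a small CPU-side sketch. It uses the SparseArrays stdlib instead of the ROCSparse types, and the helper names `scale_right` and `scale_left` are hypothetical (they are not functions from this PR). It only shows why indexing the diagonal by `colInd` or `rowInd` gives the right result.

```julia
using LinearAlgebra, SparseArrays

# Hypothetical helpers, CPU-side, for illustration only.
# A * D scales column j by d[j], so each stored value is multiplied by d[colInd].
scale_right(rowInd, colInd, nzVal, D::Diagonal) =
    (rowInd, colInd, nzVal .* D.diag[colInd])

# D * A scales row i by d[i], so each stored value is multiplied by d[rowInd].
scale_left(D::Diagonal, rowInd, colInd, nzVal) =
    (rowInd, colInd, D.diag[rowInd] .* nzVal)

A = sparse([1, 2, 3, 3], [1, 3, 2, 3], [1.0, 2.0, 3.0, 4.0], 3, 3)
D = Diagonal([10.0, 20.0, 30.0])

# findnz returns the COO triplets (row indices, column indices, values).
rowInd, colInd, nzVal = findnz(A)

AD = sparse(scale_right(rowInd, colInd, nzVal, D)..., 3, 3)
DA = sparse(scale_left(D, rowInd, colInd, nzVal)..., 3, 3)
```

The sparsity pattern is unchanged by either product, which is why only the value array needs to be touched once the per-entry row or column index is available.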

src/sparse/linalg.jl

  • triu / tril for COO by masking; CSR/CSC fall through to the GPUArrays generic path, which dispatches to coo_type.
  • kron for COO×COO, COO×Diagonal, Diagonal×COO via GPU-side repeat/broadcast on the index and value arrays.
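As a rough illustration of the two COO techniques listed above, the following CPU-side sketch applies a triangular mask and builds a Kronecker product from index and value arrays with `repeat` and broadcasting. The helpers `coo_triu` and `coo_kron` are hypothetical names chosen for this example, and the code assumes plain `Vector` inputs rather than GPU arrays.

```julia
using LinearAlgebra, SparseArrays

# Hypothetical helper: keep entries on or above the k-th diagonal (j - i >= k).
function coo_triu(rowInd, colInd, nzVal, k::Integer=0)
    keep = colInd .- rowInd .>= k
    return rowInd[keep], colInd[keep], nzVal[keep]
end

# Hypothetical helper: kron of two COO matrices, where B has size p × q.
# Entry (i, j) of A and entry (k, l) of B land at ((i-1)p + k, (j-1)q + l).
function coo_kron(ia, ja, va, ib, jb, vb, p::Integer, q::Integer)
    na, nb = length(va), length(vb)
    rows = repeat((ia .- 1) .* p, inner=nb) .+ repeat(ib, outer=na)
    cols = repeat((ja .- 1) .* q, inner=nb) .+ repeat(jb, outer=na)
    vals = repeat(va, inner=nb) .* repeat(vb, outer=na)
    return rows, cols, vals
end

A = sparse([1, 2, 2], [1, 1, 2], [1.0, 2.0, 3.0], 2, 2)
B = sparse([1, 2, 3], [2, 1, 2], [4.0, 5.0, 6.0], 3, 2)

U = sparse(coo_triu(findnz(A)...)..., 2, 2)
K = sparse(coo_kron(findnz(A)..., findnz(B)..., 3, 2)..., 2 * 3, 2 * 2)
```

Both helpers use only elementwise operations, masking, and `repeat`, which is the kind of operation that maps naturally onto GPU arrays.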

src/sparse/conversions.jl

  • ROCSparseMatrixCOO(::Diagonal) constructor chain.
  • Typed forwarding constructors ROCSparseMatrixCSR{Tv,Ti}(coo) / ROCSparseMatrixCSC{Tv,Ti}(coo) needed by GPUArrays generics.
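For context, a Diagonal maps onto COO storage very directly: row and column indices are both `1:n` and the values are the diagonal itself. The sketch below shows this on the CPU; `diagonal_to_coo` is a hypothetical name and is not the constructor added in this PR.

```julia
using LinearAlgebra, SparseArrays

# Hypothetical helper: COO triplets of an n × n Diagonal.
function diagonal_to_coo(D::Diagonal)
    n = size(D, 1)
    idx = collect(1:n)
    return idx, copy(idx), copy(D.diag)
end

D = Diagonal([1.0, 2.0, 3.0])
rowInd, colInd, nzVal = diagonal_to_coo(D)
S = sparse(rowInd, colInd, nzVal, 3, 3)
```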

src/sparse/array.jl

  • ROCSparseMatrix(transpose/adjoint(other_sparse)) constructors for GPU-to-GPU round-trips.

Potential follow-ups

  • Multiplication (*) with Diagonal on CSR/CSC goes through two format conversions (CSR→COO→CSR). A dedicated rocsparse kernel or a direct rowPtr-walk would be more efficient but would require more infrastructure.
  • COO +/- also routes through CSR; same trade-off applies.
  • Cross-format +/- (e.g. CSR + CSC) always returns CSR. The output type could arguably follow the left operand, but the current approach keeps things simple and correct.
  • kron with Diagonal uses collect on the CPU for the diagonal index arrays before uploading; a fully GPU-side construction would avoid the round-trip.
  • _sparse_identity allocates the full sparse identity, which is wasteful for large matrices (see TODO). A broadcast-singleton approach (as in SparseArrays.PromoteToSparse) would be cleaner.
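To make the rowPtr-walk idea from the first follow-up concrete, here is a minimal CPU-side sketch of D * A on CSR arrays without a COO round-trip. It is an assumption about what such a walk could look like, not code from this PR: `csr_scale_rows` is a hypothetical helper, and the CSR arrays are obtained from the CSC storage of `transpose(A)` purely for the sake of the example.

```julia
using LinearAlgebra, SparseArrays

# Hypothetical helper: scale the stored values of row i by d[i] by walking rowPtr.
function csr_scale_rows(rowPtr, nzVal, d)
    out = similar(nzVal)
    for i in 1:length(rowPtr)-1
        for k in rowPtr[i]:(rowPtr[i+1]-1)
            out[k] = d[i] * nzVal[k]
        end
    end
    return out
end

A = sparse([1, 1, 3], [1, 2, 3], [1.0, 2.0, 3.0], 3, 3)
D = Diagonal([10.0, 20.0, 30.0])

# The CSR arrays of A coincide with the CSC arrays of transpose(A).
At = copy(transpose(A))
rowPtr, colVal, nzVal = At.colptr, rowvals(At), nonzeros(At)

scaled = csr_scale_rows(rowPtr, nzVal, D.diag)

# Rebuild as CSC of the transpose, then transpose back to compare with D * A.
DA = copy(transpose(SparseMatrixCSC(3, 3, rowPtr, colVal, scaled)))
```

For A * D on CSR no walk is needed at all, since the column index of each stored value is already available in `colVal`.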

@github-actions github-actions Bot left a comment


AMDGPU.jl Benchmarks

| Benchmark suite | Current: 0517d58 | Previous: 7c9aab0 | Ratio |
|---|---|---|---|
| amdgpu/synchronization/context/device | 610 ns | 600 ns | 1.02 |
| amdgpu/synchronization/stream/blocking | 250 ns | 250 ns | 1 |
| amdgpu/synchronization/stream/nonblocking | 330 ns | 330 ns | 1 |
| array/accumulate/Float32/1d | 87831 ns | 85972 ns | 1.02 |
| array/accumulate/Float32/dims=1 | 391265 ns | 412075 ns | 0.95 |
| array/accumulate/Float32/dims=1L | 136152 ns | 137091 ns | 0.99 |
| array/accumulate/Float32/dims=2 | 129292 ns | 130332 ns | 0.99 |
| array/accumulate/Float32/dims=2L | 2810460 ns | 2810115 ns | 1.00 |
| array/accumulate/Int64/1d | 98861 ns | 102751 ns | 0.96 |
| array/accumulate/Int64/dims=1 | 281824 ns | 442706 ns | 0.64 |
| array/accumulate/Int64/dims=1L | 168312 ns | 167432 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 127811 ns | 127031 ns | 1.01 |
| array/accumulate/Int64/dims=2L | 2988642 ns | 2984467 ns | 1.00 |
| array/broadcast | 93701 ns | 70231 ns | 1.33 |
| array/construct | 1780 ns | 1700 ns | 1.05 |
| array/copy | 37160 ns | 40561 ns | 0.92 |
| array/copyto!/cpu_to_gpu | 183463 ns | 121541 ns | 1.51 |
| array/copyto!/gpu_to_cpu | 182693 ns | 114461 ns | 1.60 |
| array/copyto!/gpu_to_gpu | 82171 ns | 66551 ns | 1.23 |
| array/iteration/findall/bool | 181293 ns | 181832 ns | 1.00 |
| array/iteration/findall/int | 190223 ns | 192932 ns | 0.99 |
| array/iteration/findfirst/bool | 117851 ns | 122251 ns | 0.96 |
| array/iteration/findfirst/int | 116232 ns | 116342 ns | 1.00 |
| array/iteration/findmin/1d | 170642 ns | 170152 ns | 1.00 |
| array/iteration/findmin/2d | 156303 ns | 153822 ns | 1.02 |
| array/iteration/logical | 357785 ns | 350744 ns | 1.02 |
| array/iteration/scalar | 295964 ns | 296083 ns | 1.00 |
| array/permutedims/2d | 75072 ns | 74481 ns | 1.01 |
| array/permutedims/3d | 75231 ns | 74251 ns | 1.01 |
| array/permutedims/4d | 76821 ns | 76951 ns | 1.00 |
| array/random/rand/Float32 | 52021 ns | 52171 ns | 1.00 |
| array/random/rand/Int64 | 58311 ns | 58731 ns | 0.99 |
| array/random/rand!/Float32 | 88511 ns | 85101 ns | 1.04 |
| array/random/rand!/Int64 | 115051 ns | 69261 ns | 1.66 |
| array/random/randn/Float32 | 86562 ns | 98642 ns | 0.88 |
| array/random/randn!/Float32 | 167133 ns | 101231 ns | 1.65 |
| array/reductions/mapreduce/Float32/1d | 134282 ns | 134242 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1 | 95312 ns | 95431 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 774121 ns | 774349 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 97772 ns | 97531 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 298834 ns | 297464 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 134642 ns | 134951 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1 | 95321 ns | 95301 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 781201 ns | 781800 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 96681 ns | 96801 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 301074 ns | 299524 ns | 1.01 |
| array/reductions/reduce/Float32/1d | 134222 ns | 133912 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1 | 95582 ns | 95711 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 774211 ns | 775219 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 97542 ns | 97621 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 298544 ns | 297424 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 130452 ns | 134602 ns | 0.97 |
| array/reductions/reduce/Int64/dims=1 | 95422 ns | 95311 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 778521 ns | 780269 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 96882 ns | 97121 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 299684 ns | 299264 ns | 1.00 |
| array/reverse/1d | 44921 ns | 44550 ns | 1.01 |
| array/reverse/1dL | 76271 ns | 76661 ns | 0.99 |
| array/reverse/1dL_inplace | 169863 ns | 173202 ns | 0.98 |
| array/reverse/1d_inplace | 74581 ns | 84571 ns | 0.88 |
| array/reverse/2d | 52901 ns | 52831 ns | 1.00 |
| array/reverse/2dL | 102261 ns | 102811 ns | 0.99 |
| array/reverse/2dL_inplace | 125692 ns | 178873 ns | 0.70 |
| array/reverse/2d_inplace | 107632 ns | 96051 ns | 1.12 |
| array/sorting/1d | 341325 ns | 379995 ns | 0.90 |
| integration/byval/reference | 39271 ns | 39540 ns | 0.99 |
| integration/byval/slices=1 | 40471 ns | 40350 ns | 1.00 |
| integration/byval/slices=2 | 160112 ns | 159152 ns | 1.01 |
| integration/byval/slices=3 | 238773 ns | 238933 ns | 1.00 |
| integration/volumerhs | 5042401 ns | 5031334 ns | 1.00 |
| kernel/indexing | 75801 ns | 65521 ns | 1.16 |
| kernel/indexing_checked | 66391 ns | 72491 ns | 0.92 |
| kernel/launch | 1290 ns | 1280 ns | 1.01 |
| kernel/rand | 126832 ns | 124252 ns | 1.02 |
| latency/import | 1503115254 ns | 1491816057 ns | 1.01 |
| latency/precompile | 11926533311 ns | 11773992921 ns | 1.01 |
| latency/ttfp | 11012011599 ns | 10954774141 ns | 1.01 |

This comment was automatically generated by workflow using github-action-benchmark.

@luraess luraess marked this pull request as ready for review June 25, 2026 19:33