This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Do NOT add "Co-Authored-By" lines or any other self-attribution to commit messages. Do NOT advertise Claude or Anthropic in commits. Keep commit messages focused on describing the changes only.
# Run all tests (spawns MPI processes automatically via test harness)
julia --project=. -e 'using Pkg; Pkg.test()'
# Run a specific MPI test directly (for debugging)
mpiexec -n 4 julia --project=. test/test_matrix_multiplication.jl
mpiexec -n 4 julia --project=. test/test_transpose.jl
mpiexec -n 4 julia --project=. test/test_addition.jl
mpiexec -n 4 julia --project=. test/test_lazy_transpose.jl
mpiexec -n 4 julia --project=. test/test_vector_multiplication.jl
mpiexec -n 4 julia --project=. test/test_dense_matrix.jl
mpiexec -n 4 julia --project=. test/test_sparse_api.jl
mpiexec -n 4 julia --project=. test/test_blocks.jl
mpiexec -n 4 julia --project=. test/test_utilities.jl
mpiexec -n 4 julia --project=. test/test_local_constructors.jl
mpiexec -n 4 julia --project=. test/test_indexing.jl
mpiexec -n 4 julia --project=. test/test_factorization.jl
# GPU tests run automatically when Metal.jl is available
# Tests are parameterized over (scalar type, backend) configurations
# Precompile the package
julia --project=. -e 'using Pkg; Pkg.precompile()'

By default, MPI.jl uses MPItrampoline_jll. On some Linux clusters, this causes MUMPS to hang during the solve phase. If you experience hangs with multi-rank MUMPS tests, switch to MPICH_jll:

using MPIPreferences
MPIPreferences.use_jll_binary("MPICH_jll")

This creates/updates LocalPreferences.toml (which is gitignored). Restart Julia after changing MPI preferences.
GPU acceleration is supported via Metal.jl (macOS) or CUDA.jl (Linux/Windows) as package extensions.
- `HPCVector{T,AV}` where `AV` is `Vector{T}` (CPU), `MtlVector{T}` (Metal), or `CuVector{T}` (CUDA)
- `HPCMatrix{T,AM}` where `AM` is `Matrix{T}` (CPU), `MtlMatrix{T}` (Metal), or `CuMatrix{T}` (CUDA)
- `HPCSparseMatrix{T,Ti,AV}` where `AV` is `Vector{T}` (CPU), `MtlVector{T}`, or `CuVector{T}` for the `nzval` array
- Type aliases: `HPCVector_CPU{T}`, `HPCMatrix_CPU{T}`, `HPCSparseMatrix_CPU{T,Ti}` for CPU-backed types
Use `Base.zeros` with the full parametric type or a type alias:
# CPU zero arrays
v = zeros(HPCVector{Float64,Vector{Float64}}, 100)
v = zeros(HPCVector_CPU{Float64}, 100) # Equivalent using type alias
A = zeros(HPCMatrix{Float64,Matrix{Float64}}, 50, 30)
A = zeros(HPCMatrix_CPU{Float64}, 50, 30)
S = zeros(HPCSparseMatrix{Float64,Int,Vector{Float64}}, 100, 100)
S = zeros(HPCSparseMatrix_CPU{Float64,Int}, 100, 100)
# GPU zero arrays (requires Metal.jl or CUDA.jl loaded)
using Metal
v_gpu = zeros(HPCVector{Float32,MtlVector{Float32}}, 100)
A_gpu = zeros(HPCMatrix{Float32,MtlMatrix{Float32}}, 50, 30)
# Or with CUDA
using CUDA
v_gpu = zeros(HPCVector{Float64,CuVector{Float64}}, 100)
A_gpu = zeros(HPCMatrix{Float64,CuMatrix{Float64}}, 50, 30)

MPI communication always uses CPU buffers (no GPU-aware MPI). GPU data is staged through CPU (see the sketch after this list):
- GPU vector data copied to CPU staging buffer
- MPI communication on CPU buffers
- Results copied back to GPU
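A minimal sketch of this staging pattern, assuming Metal and an initialized MPI session; the function and buffer names are illustrative, not the library's internals:

```julia
using MPI, Metal

# Illustrative only: exchange a GPU vector with another rank by staging
# through CPU buffers, so MPI never touches GPU memory.
function staged_exchange(v_gpu::MtlVector{Float32}, dest::Int, src::Int, comm::MPI.Comm)
    send_cpu = Vector{Float32}(undef, length(v_gpu))
    copyto!(send_cpu, v_gpu)                            # 1. GPU -> CPU staging buffer
    recv_cpu = similar(send_cpu)
    sreq = MPI.Isend(send_cpu, comm; dest=dest, tag=0)  # 2. MPI on CPU buffers
    rreq = MPI.Irecv!(recv_cpu, comm; source=src, tag=0)
    MPI.Wait(sreq); MPI.Wait(rreq)
    recv_gpu = MtlVector{Float32}(undef, length(recv_cpu))
    copyto!(recv_gpu, recv_cpu)                         # 3. CPU -> GPU result
    return recv_gpu
end
```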
Plans (VectorPlan, DenseMatrixVectorPlan, DenseTransposeVectorPlan) include:
- `gathered::AV` - buffer matching the input type
- `gathered_cpu::Vector{T}` - CPU staging buffer
- `send_bufs`, `recv_bufs` - always CPU for MPI
Sparse matrices remain on CPU (Julia's `SparseMatrixCSC` doesn't support GPU arrays). For `A * x` where `x` is on GPU:
- Gather `x` elements via CPU staging
- Compute the sparse multiply on CPU
- Copy the result to GPU via `_create_output_like()`
- `ext/HPCSparseArraysMetalExt.jl` - Metal extension with DeviceMetal backend support
- `ext/HPCSparseArraysCUDAExt.jl` - CUDA extension with DeviceCUDA backend support and cuDSS multi-GPU solver
- Loaded automatically when `using Metal` or `using CUDA` appears before `using HPCSparseArrays`
- Use `to_backend(obj, target_backend)` to convert between backends
The CUDA extension includes CuDSSFactorizationMPI for distributed sparse direct solves using NVIDIA's cuDSS library with NCCL inter-GPU communication:
using CUDA, MPI
MPI.Init()
using HPCSparseArrays
# Each MPI rank should use a different GPU
CUDA.device!(MPI.Comm_rank(MPI.COMM_WORLD) % length(CUDA.devices()))
# Create factorization (LDLT for symmetric, LU for general)
F = cudss_ldlt(A) # or cudss_lu(A)
x = F \ b
finalize!(F) # Required: clean up cuDSS resources

Important cuDSS notes:
- Requires cuDSS 0.4+ with MGMN (Multi-GPU Multi-Node) support
- NCCL communicator is bootstrapped automatically from MPI
- `finalize!(F)` must be called to avoid MPI desync during cleanup
- Known issue: tridiagonal matrices with 3+ rows per rank may fail (cuDSS bug reported to NVIDIA)
IMPORTANT: Never write separate CPU and GPU code paths with `if AV <: Vector` or `if A.nzval isa Vector` branches. Use unified helper functions instead.
No mixed CPU/GPU operations: Operations between CPU and GPU arrays are forbidden. Both operands must be on the same backend. Functions should error on mixed backends:
if (A_is_cpu != B_is_cpu)
error("Mixed CPU/GPU operations not supported")
end

Helper functions for unified code (a sketch follows this list):
- `_values_to_backend(cpu_values::Vector, template)` - Convert CPU values to the template's backend:
  - CPU template: returns `cpu_values` directly (no copy)
  - GPU template: `copyto!(similar(template, T, length(cpu_values)), cpu_values)`
- `_to_target_backend(v::Vector, ::Type{AV})` - Convert a CPU index vector to the target type:
  - `Type{Vector{T}}`: returns `v` directly (no copy)
  - `Type{MtlVector{T}}` or `Type{CuVector{T}}`: returns a GPU copy
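A hedged sketch of what such a unified helper could look like (the actual implementation in the package may differ in details):

```julia
# Sketch only: one code path serves both backends; the CPU case is a no-op,
# the GPU case allocates on the template's backend and copies once.
function _values_to_backend(cpu_values::Vector{T}, template::AbstractVector) where {T}
    template isa Vector && return cpu_values            # CPU template: reuse directly
    out = similar(template, T, length(cpu_values))      # GPU template: allocate on device
    copyto!(out, cpu_values)                            # single host->device copy
    return out
end
```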
Pattern for result construction (unified):
# Values: use helper (no-op for CPU, copy for GPU)
nzval = _values_to_backend(nzval_cpu, A.nzval)
# Structure arrays: cache once, reuse forever
# For CPU: caches reference to original (no allocation)
# For GPU: caches GPU copy (allocated once)
if plan.cached_rowptr_target === nothing
plan.cached_rowptr_target = _to_target_backend(plan.rowptr, AV)
end
if plan.cached_colval_target === nothing
plan.cached_colval_target = _to_target_backend(plan.colval, AV)
end

Pattern for values that change each call (e.g., gathered B values in A*B):
# First call: create cache (reference for CPU, GPU buffer for GPU)
if plan.cached_values === nothing
plan.cached_values = _values_to_backend(plan.cpu_values, A.nzval)
end
# Sync: no-op for CPU (cache === source), copy for GPU
if !(plan.cached_values === plan.cpu_values)
copyto!(plan.cached_values, plan.cpu_values)
end

HPCSparseArrays implements distributed sparse and dense matrix operations using MPI for parallel computing across multiple ranks. Supports both Float64 and ComplexF64 element types.
Distributed matrices (sparse and dense) and vectors should not be allgathered in the main library, although gathering constructors are available to the user. It is normal to have to move data from one rank to another, and for this, point-to-point communication (e.g. Isend/Irecv) should be favored. For communication that is performance-sensitive, like algebraic operations between matrices, a plan should be generated and cached based upon the hash of the relevant structures.
Many operations in this module are collective and must not be run on a subset of the ranks. The programming pattern `if rank == 0 println(...)` is almost always wrong; use `println(io0(), ...)` instead, without the `if rank == 0` guard, or MPI desynchronization is almost guaranteed.
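For example (usage sketch; `io0()` is the stream helper referenced above, and `x` is a placeholder value):

```julia
# WRONG: rank-0-only branch; if any collective call ends up inside the branch,
# the other ranks never reach it and the program desynchronizes.
# if MPI.Comm_rank(MPI.COMM_WORLD) == 0
#     println("result: ", x)
# end

# RIGHT: every rank executes the same line; io0() only emits output on rank 0.
println(io0(), "result: ", x)
```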
SparseMatrixCSR{T,Ti} (Type Alias)
- `SparseMatrixCSR{T,Ti} = Transpose{T, SparseMatrixCSC{T,Ti}}` - type alias for CSR storage
- In Julia, `Transpose{SparseMatrixCSC}` has a dual life:
  - Semantic view: a lazy transpose of a CSC matrix (what `transpose(A)` returns)
  - Storage view: row-major (CSR) access to the sparse data
- Use `SparseMatrixCSR` when the intent is CSR storage; use `transpose(A)` for the mathematical transpose
- `SparseMatrixCSR(A::SparseMatrixCSC)` converts CSC to CSR representing the same matrix
- `SparseMatrixCSC(A::SparseMatrixCSR)` converts CSR to CSC representing the same matrix
- For `B = SparseMatrixCSR(A)`, `B[i,j] == A[i,j]` (same matrix, different storage)
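A short usage sketch of the conversions listed above (plain SparseArrays plus the constructors documented here):

```julia
using SparseArrays

A = sparse([1, 2, 3], [1, 2, 3], [1.0, 2.0, 3.0])   # 3x3 diagonal, CSC storage
B = SparseMatrixCSR(A)                              # same matrix, CSR (lazy transpose of a CSC)
@assert B[2, 2] == A[2, 2]                          # identical entries, different layout
@assert SparseMatrixCSC(B) == A                     # round-trip back to CSC
```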
HPCSparseMatrix{T,Ti,AV}
- Rows are partitioned across MPI ranks
- Type parameters: `T` = element type, `Ti` = index type, `AV` = array type for values (`Vector{T}` or `MtlVector{T}`)
- Local rows stored in CSR-like format with separate arrays:
  - `rowptr::Vector{Ti}` - row pointers (always CPU)
  - `colval::Vector{Ti}` - LOCAL column indices (1:ncols_compressed, always CPU)
  - `nzval::AV` - nonzero values (can be CPU or GPU)
  - `nrows_local::Int` - number of local rows
  - `ncols_compressed::Int` - equals `length(col_indices)`
- `row_partition`: array of size `nranks + 1` defining which rows each rank owns (1-indexed boundaries)
- `col_partition`: array of size `nranks + 1` defining the column partition (used for transpose operations)
- `col_indices`: sorted global column indices that appear in the local part (local→global mapping; see the snippet after this list)
- `structural_hash`: optional Blake3 hash of the matrix structure (computed lazily via `_ensure_hash`)
- `cached_transpose`: cached materialized transpose (invalidated on modification)
- `cached_symmetric`: cached result of the `issymmetric` check
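Based on the field descriptions above, the local→global column mapping for the k-th stored nonzero looks like this (illustrative snippet, not a library API):

```julia
# colval holds LOCAL (compressed) column indices; col_indices maps them back
# to sorted GLOBAL column indices.
global_col = A.col_indices[A.colval[k]]
```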
HPCMatrix{T,AM}
- Distributed dense matrix partitioned by rows across MPI ranks
- Type parameter `AM <: AbstractMatrix{T}`: `Matrix{T}` (CPU) or `MtlMatrix{T}` (GPU)
- `A::AM`: local rows stored directly (NOT transposed), size = `(local_nrows, ncols)`
- `row_partition`: array of size `nranks + 1` defining which rows each rank owns (always CPU)
- `col_partition`: array of size `nranks + 1` defining the column partition (always CPU)
- `structural_hash`: optional Blake3 hash (computed lazily)
HPCVector{T,AV}
- Distributed dense vector partitioned across MPI ranks
- Type parameter `AV <: AbstractVector{T}`: `Vector{T}` (CPU) or `MtlVector{T}` (GPU)
- `partition`: array of size `nranks + 1` defining which elements each rank owns (always CPU)
- `v::AV`: local vector elements owned by this rank
- `structural_hash`: Blake3 hash of the partition (computed on construction)
- Can have any partition (not required to match matrix partitions)
MatrixPlan{T}
- Communication plan for gathering rows from another HPCSparseMatrix
- Memoized based on structural hashes of A and B (see `_plan_cache`)
- Pre-allocates all send/receive buffers for allocation-free `execute_plan!` calls
TransposePlan{T}
- Communication plan for computing matrix transpose
- Redistributes nonzeros based on `col_partition` becoming `row_partition`
- Pre-allocates buffers for reusable execution
VectorPlan{T,AV}
- Communication plan for gathering the vector elements needed for `A * x`
- Type parameter `AV`: matches the input vector type for GPU support
- Gathers `x[A.col_indices]` from the appropriate ranks based on `x.partition`
- Memoized based on structural hashes of A and x plus the array type (see `_vector_plan_cache` and the sketch after this list)
- `gathered::AV`: buffer matching the input type; `gathered_cpu::Vector{T}`: CPU staging
- `send_bufs`, `recv_bufs`: always CPU for MPI communication
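A hedged sketch of the memoization pattern described above; the cache layout and lookup helper are illustrative and may not match the package's internals:

```julia
# Sketch only: one VectorPlan per (matrix structure, vector partition, array
# type) combination, built on first use and reused afterwards.
const _vector_plan_cache = Dict{Tuple{Any,Any,DataType},Any}()

function _get_or_create_vector_plan(A, x)
    key = (A.structural_hash, x.structural_hash, typeof(x.v))
    get!(_vector_plan_cache, key) do
        VectorPlan(A, x)   # expensive: index exchange + buffer allocation, done once
    end
end
```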
Factorization uses MUMPS (MUltifrontal Massively Parallel Solver) with distributed matrix input (ICNTL(18)=3).
MUMPSFactorization{Tin, AVin, Tinternal} (internal type, not exported)
- Wraps a MUMPS object for distributed factorization
- Type parameters: `Tin` = input element type, `AVin` = input array type, `Tinternal` = MUMPS-compatible type (Float64 or ComplexF64)
- Created by `lu(A)` for general matrices or `ldlt(A)` for symmetric matrices
- Automatically converts input to MUMPS-compatible types and back (e.g., Float32 GPU → Float64 CPU → solve → Float32 GPU)
- Stores COO arrays (irn_loc, jcn_loc, a_loc) to prevent GC while MUMPS holds pointers
F = lu(A)
x = F \ b

For efficient construction when data is already distributed:
- `HPCVector_local(v_local)`: Create from the local vector portion
- `HPCSparseMatrix_local(SparseMatrixCSR(local_csc))`: Create from local rows in CSR format
- `HPCSparseMatrix_local(transpose(AT_local))`: Alternative using an explicit transpose wrapper
- `HPCMatrix_local(A_local)`: Create from local dense rows
These infer the global partition via MPI.Allgather of local sizes.
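A sketch of how the partition can be inferred from local sizes (assumed to approximate what the `*_local` constructors do; `v_local` is this rank's local data and variable names are illustrative):

```julia
using MPI

local_n = length(v_local)                          # this rank's element count
counts  = MPI.Allgather([local_n], MPI.COMM_WORLD) # one count per rank
partition = [1; 1 .+ cumsum(counts)]               # nranks + 1 boundaries, 1-indexed
```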
When building from triplets (I, J, V), the efficient pattern is to build M^T directly as CSC by swapping indices, then wrap in lazy transpose for CSR:
AT_local = sparse(local_J, local_I, local_V, ncols, local_nrows) # M^T as CSC
HPCSparseMatrix_local(transpose(AT_local)) # M in CSR format

This avoids an unnecessary physical transpose operation.
- Plan creation (`MatrixPlan` constructor): uses `Alltoall` and `Alltoallv` to exchange row requests and sparse structure (colptr, rowval)
- Value exchange (`execute_plan!`): point-to-point `Isend`/`Irecv` to send the actual matrix values
- Local computation: `C^T = B^T * A^T` using reindexed local sparse matrices (see the sketch after this list)
- Result construction: directly wraps the local result with the row partition inherited from A
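The local step relies on the identity `(A*B)^T = B^T * A^T`: the product is formed in transposed, CSC-friendly form and then wrapped lazily as CSR. A small deterministic illustration with plain SparseArrays (no MPI):

```julia
using SparseArrays, LinearAlgebra

At = sparse([1, 3, 2], [1, 2, 4], [1.0, 2.0, 3.0], 3, 4)   # A^T stored as CSC
Bt = sparse([1, 2, 2], [1, 3, 2], [4.0, 5.0, 6.0], 2, 3)   # B^T stored as CSC
Ct = Bt * At                      # C^T = B^T * A^T, still CSC
C  = transpose(Ct)                # lazy CSR view of C = A * B
@assert Matrix(C) == Matrix(transpose(At)) * Matrix(transpose(Bt))
```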
For y = A * x where A::HPCSparseMatrix and x::HPCVector:
- Plan creation (`VectorPlan` constructor): uses `Alltoall` to exchange element request counts, then point-to-point to exchange indices
- Value exchange (`execute_plan!`): point-to-point `Isend`/`Irecv` to gather `x[A.col_indices]` into a local `gathered` vector
- Local computation: compute `transpose(_get_csc(A)) * gathered`, where `_get_csc(A)` reconstructs a SparseMatrixCSC from the internal CSR arrays
- Result construction: the result vector `y` inherits `A.row_partition`
Note: Uses `transpose()` (not adjoint `'`) to correctly handle complex values without conjugation.
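A small deterministic illustration (plain SparseArrays, no MPI) of why the internal computation must use `transpose` rather than adjoint for complex values:

```julia
using SparseArrays, LinearAlgebra

# adjoint (') conjugates the entries, transpose does not; the internal CSC
# trick needs the unconjugated values.
M = sparse([1, 2], [2, 1], [1.0 + 2.0im, 3.0 - 1.0im], 2, 2)
x = [1.0 + 0.0im, 1.0 + 0.0im]
@assert transpose(M) * x == [3.0 - 1.0im, 1.0 + 2.0im]   # plain swap of entries
@assert M' * x == [3.0 + 1.0im, 1.0 - 2.0im]             # conjugated entries
```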
- `conj(v)` - returns a new HPCVector with conjugated values (materialized)
- `transpose(v)` - returns a lazy `Transpose` wrapper
- `v'` (adjoint) - returns `transpose(conj(v))`, where `conj(v)` is materialized
- `transpose(v) * A` - computes `transpose(transpose(A) * v)`, returns a transposed HPCVector
- `v' * A` - computes `transpose(transpose(A) * conj(v))`, returns a transposed HPCVector
- `u + v`, `u - v` - automatic partition alignment if the partitions differ
transpose(A) returns Transpose{T, HPCSparseMatrix{T,Ti,AV}} (lazy wrapper). Materialization happens automatically when needed:
- `transpose(A) * transpose(B)` → `transpose(B * A)` (stays lazy)
- `transpose(A) * B` or `A * transpose(B)` → materializes via `TransposePlan`
- `HPCSparseMatrix(transpose(A))` → explicitly materialize the transpose (cached bidirectionally)
All distributed types support getindex and setindex! with cross-rank communication:
- `v[i]`, `v[i:j]`, `v[indices::HPCVector]` for HPCVector
- `A[i,j]`, `A[i:j, k:l]`, `A[:, j]`, `A[rows::HPCVector, cols]` for HPCMatrix and HPCSparseMatrix
- Setting values triggers structural modification for sparse matrices when inserting new nonzeros
- `cat(A, B; dims=...)` - concatenate matrices/vectors along the specified dimensions
- `blockdiag(A, B, ...)` - create a block diagonal matrix from multiple matrices
- All ranks must have identical copies of input matrices/vectors when constructing from global data
- Row/element ownership: `searchsortedlast(partition, index) - 1` gives the owning rank (0-indexed); see the snippet after this list
- Communication uses MPI collective operations (`Alltoall`, `Alltoallv`) and point-to-point (`Isend`, `Irecv`)
- Communication tags are organized by operation type (1-3, 10-11, 20-22, 30-35, 40-51, 60-125)
- Tests run under `mpiexec` via the test harness in `test/runtests.jl`
- Test matrices must be deterministic (not random) to ensure consistency across MPI ranks
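The ownership rule above as a snippet (illustrative; `v` is an HPCVector and `i` a global index):

```julia
# partition has nranks + 1 one-indexed boundaries; rank r owns indices with
# partition[r+1] <= i < partition[r+2], so the owning rank is:
owner_rank = searchsortedlast(v.partition, i) - 1
```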