CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Commit Messages

Do NOT add "Co-Authored-By" lines or any other self-attribution to commit messages. Do NOT advertise Claude or Anthropic in commits. Keep commit messages focused on describing the changes only.

Build and Test Commands

# Run all tests (spawns MPI processes automatically via test harness)
julia --project=. -e 'using Pkg; Pkg.test()'

# Run a specific MPI test directly (for debugging)
mpiexec -n 4 julia --project=. test/test_matrix_multiplication.jl
mpiexec -n 4 julia --project=. test/test_transpose.jl
mpiexec -n 4 julia --project=. test/test_addition.jl
mpiexec -n 4 julia --project=. test/test_lazy_transpose.jl
mpiexec -n 4 julia --project=. test/test_vector_multiplication.jl
mpiexec -n 4 julia --project=. test/test_dense_matrix.jl
mpiexec -n 4 julia --project=. test/test_sparse_api.jl
mpiexec -n 4 julia --project=. test/test_blocks.jl
mpiexec -n 4 julia --project=. test/test_utilities.jl
mpiexec -n 4 julia --project=. test/test_local_constructors.jl
mpiexec -n 4 julia --project=. test/test_indexing.jl
mpiexec -n 4 julia --project=. test/test_factorization.jl

# GPU tests run automatically when Metal.jl is available
# Tests are parameterized over (scalar type, backend) configurations

# Precompile the package
julia --project=. -e 'using Pkg; Pkg.precompile()'

MPI Configuration

By default, MPI.jl uses MPItrampoline_jll. On some Linux clusters, this causes MUMPS to hang during the solve phase. If you experience hangs with multi-rank MUMPS tests, switch to MPICH_jll:

using MPIPreferences
MPIPreferences.use_jll_binary("MPICH_jll")

This creates/updates LocalPreferences.toml (which is gitignored). Restart Julia after changing MPI preferences.

GPU Support

GPU acceleration is supported via Metal.jl (macOS) or CUDA.jl (Linux/Windows) as package extensions.

Type Parameters

HPCVector{T,AV} where AV is Vector{T} (CPU), MtlVector{T} (Metal), or CuVector{T} (CUDA)
HPCMatrix{T,AM} where AM is Matrix{T} (CPU), MtlMatrix{T} (Metal), or CuMatrix{T} (CUDA)
HPCSparseMatrix{T,Ti,AV} where AV is Vector{T} (CPU), MtlVector{T}, or CuVector{T} for the nzval array
Type aliases: HPCVector_CPU{T}, HPCMatrix_CPU{T}, HPCSparseMatrix_CPU{T,Ti} for CPU-backed types

Creating Zero Arrays

Use Base.zeros with the full parametric type or type alias:

# CPU zero arrays
v = zeros(HPCVector{Float64,Vector{Float64}}, 100)
v = zeros(HPCVector_CPU{Float64}, 100)  # Equivalent using type alias

A = zeros(HPCMatrix{Float64,Matrix{Float64}}, 50, 30)
A = zeros(HPCMatrix_CPU{Float64}, 50, 30)

S = zeros(HPCSparseMatrix{Float64,Int,Vector{Float64}}, 100, 100)
S = zeros(HPCSparseMatrix_CPU{Float64,Int}, 100, 100)

# GPU zero arrays (requires Metal.jl or CUDA.jl loaded)
using Metal
v_gpu = zeros(HPCVector{Float32,MtlVector{Float32}}, 100)
A_gpu = zeros(HPCMatrix{Float32,MtlMatrix{Float32}}, 50, 30)

# Or with CUDA
using CUDA
v_gpu = zeros(HPCVector{Float64,CuVector{Float64}}, 100)
A_gpu = zeros(HPCMatrix{Float64,CuMatrix{Float64}}, 50, 30)

CPU Staging

MPI communication always uses CPU buffers (no GPU-aware MPI). GPU data is staged through CPU:

GPU vector data copied to CPU staging buffer
MPI communication on CPU buffers
Results copied back to GPU

Plans (VectorPlan, DenseMatrixVectorPlan, DenseTransposeVectorPlan) include:

gathered::AV - buffer matching input type
gathered_cpu::Vector{T} - CPU staging buffer
send_bufs, recv_bufs - always CPU for MPI

Sparse Operations with GPU Vectors

Sparse matrices remain on CPU (Julia's SparseMatrixCSC doesn't support GPU arrays). For A * x where x is GPU:

Gather x elements via CPU staging
Compute sparse multiply on CPU
Copy result to GPU via _create_output_like()

Extension Files

ext/HPCSparseArraysMetalExt.jl - Metal extension with DeviceMetal backend support
ext/HPCSparseArraysCUDAExt.jl - CUDA extension with DeviceCUDA backend support and cuDSS multi-GPU solver
Loaded automatically when using Metal or using CUDA before using HPCSparseArrays
Use to_backend(obj, target_backend) to convert between backends

CUDA-Specific: cuDSS Multi-GPU Solver

The CUDA extension includes CuDSSFactorizationMPI for distributed sparse direct solves using NVIDIA's cuDSS library with NCCL inter-GPU communication:

using CUDA, MPI
MPI.Init()
using HPCSparseArrays

# Each MPI rank should use a different GPU
CUDA.device!(MPI.Comm_rank(MPI.COMM_WORLD) % length(CUDA.devices()))

# Create factorization (LDLT for symmetric, LU for general)
F = cudss_ldlt(A)  # or cudss_lu(A)
x = F \ b
finalize!(F)  # Required: clean up cuDSS resources

Important cuDSS notes:

Requires cuDSS 0.4+ with MGMN (Multi-GPU Multi-Node) support
NCCL communicator is bootstrapped automatically from MPI
finalize!(F) must be called to avoid MPI desync during cleanup
Known issue: tridiagonal matrices with 3+ rows per rank may fail (cuDSS bug reported to NVIDIA)

Writing Unified CPU/GPU Functions

IMPORTANT: Never write separate CPU and GPU code paths with if AV <: Vector or if A.nzval isa Vector branches. Use unified helper functions instead.

No mixed CPU/GPU operations: Operations between CPU and GPU arrays are forbidden. Both operands must be on the same backend. Functions should error on mixed backends:

if (A_is_cpu != B_is_cpu)
    error("Mixed CPU/GPU operations not supported")
end

Helper functions for unified code:

_values_to_backend(cpu_values::Vector, template) - Convert CPU values to template's backend:
- CPU template: returns cpu_values directly (no copy)
- GPU template: copyto!(similar(template, T, length(cpu_values)), cpu_values)
_to_target_backend(v::Vector, ::Type{AV}) - Convert CPU index vector to target type:
- Type{Vector{T}}: returns v directly (no copy)
- Type{MtlVector{T}} or Type{CuVector{T}}: returns GPU copy

Pattern for result construction (unified):

# Values: use helper (no-op for CPU, copy for GPU)
nzval = _values_to_backend(nzval_cpu, A.nzval)

# Structure arrays: cache once, reuse forever
# For CPU: caches reference to original (no allocation)
# For GPU: caches GPU copy (allocated once)
if plan.cached_rowptr_target === nothing
    plan.cached_rowptr_target = _to_target_backend(plan.rowptr, AV)
end
if plan.cached_colval_target === nothing
    plan.cached_colval_target = _to_target_backend(plan.colval, AV)
end

Pattern for values that change each call (e.g., gathered B values in A*B):

# First call: create cache (reference for CPU, GPU buffer for GPU)
if plan.cached_values === nothing
    plan.cached_values = _values_to_backend(plan.cpu_values, A.nzval)
end
# Sync: no-op for CPU (cache === source), copy for GPU
if !(plan.cached_values === plan.cpu_values)
    copyto!(plan.cached_values, plan.cpu_values)
end

Architecture

HPCSparseArrays implements distributed sparse and dense matrix operations using MPI for parallel computing across multiple ranks. Supports both Float64 and ComplexF64 element types.

Core design principle

Distributed matrices (sparse and dense) and vectors should not be allgathered in the main library, although gathering constructors are available to the user. It is normal to have to move data from one rank to another, and for this, point-to-point communication (e.g. Isend/Irecv) should be favored. For communication that is performance-sensitive, like algebraic operations between matrices, a plan should be generated and cached based upon the hash of the relevant structures.

MPI programmiing pitfalls.

Many operations in this module are collective and should not be run on a subset of all the ranks. The programming pattern if rank == 0 println(...) is almost always wrong and one should use instead println(io0(),...), without the if rank == 0, or else MPI desynchronization is almost guaranteed.

Core Data Structures

SparseMatrixCSR{T,Ti} (Type Alias)

SparseMatrixCSR{T,Ti} = Transpose{T, SparseMatrixCSC{T,Ti}} - type alias for CSR storage
In Julia, Transpose{SparseMatrixCSC} has a dual life:
- Semantic view: A lazy transpose of a CSC matrix (what transpose(A) returns)
- Storage view: Row-major (CSR) access to sparse data
Use SparseMatrixCSR when intent is CSR storage, use transpose(A) for mathematical transpose
SparseMatrixCSR(A::SparseMatrixCSC) converts CSC to CSR representing the same matrix
SparseMatrixCSC(A::SparseMatrixCSR) converts CSR to CSC representing the same matrix
For B = SparseMatrixCSR(A), B[i,j] == A[i,j] (same matrix, different storage)

HPCSparseMatrix{T,Ti,AV}

Rows are partitioned across MPI ranks
Type parameters: T = element type, Ti = index type, AV = array type for values (Vector{T} or MtlVector{T})
Local rows stored in CSR-like format with separate arrays:
- rowptr::Vector{Ti} - row pointers (always CPU)
- colval::Vector{Ti} - LOCAL column indices (1:ncols_compressed, always CPU)
- nzval::AV - nonzero values (can be CPU or GPU)
- nrows_local::Int - number of local rows
- ncols_compressed::Int - = length(col_indices)
row_partition: Array of size nranks + 1 defining which rows each rank owns (1-indexed boundaries)
col_partition: Array of size nranks + 1 defining column partition (used for transpose operations)
col_indices: Sorted global column indices that appear in the local part (local→global mapping)
structural_hash: Optional Blake3 hash of the matrix structure (computed lazily via _ensure_hash)
cached_transpose: Cached materialized transpose (invalidated on modification)
cached_symmetric: Cached result of issymmetric check

HPCMatrix{T,AM}

Distributed dense matrix partitioned by rows across MPI ranks
Type parameter AM<:AbstractMatrix{T}: Matrix{T} (CPU) or MtlMatrix{T} (GPU)
A::AM: Local rows stored directly (NOT transposed), size = (local_nrows, ncols)
row_partition: Array of size nranks + 1 defining which rows each rank owns (always CPU)
col_partition: Array of size nranks + 1 defining column partition (always CPU)
structural_hash: Optional Blake3 hash (computed lazily)

HPCVector{T,AV}

Distributed dense vector partitioned across MPI ranks
Type parameter AV<:AbstractVector{T}: Vector{T} (CPU) or MtlVector{T} (GPU)
partition: Array of size nranks + 1 defining which elements each rank owns (always CPU)
v::AV: Local vector elements owned by this rank
structural_hash: Blake3 hash of the partition (computed on construction)
Can have any partition (not required to match matrix partitions)

MatrixPlan{T}

Communication plan for gathering rows from another HPCSparseMatrix
Memoized based on structural hashes of A and B (see _plan_cache)
Pre-allocates all send/receive buffers for allocation-free execute_plan! calls

TransposePlan{T}

Communication plan for computing matrix transpose
Redistributes nonzeros based on col_partition becoming row_partition
Pre-allocates buffers for reusable execution

VectorPlan{T,AV}

Communication plan for gathering vector elements needed for A * x
Type parameter AV: matches input vector type for GPU support
Gathers x[A.col_indices] from appropriate ranks based on x.partition
Memoized based on structural hashes of A and x plus array type (see _vector_plan_cache)
gathered::AV: buffer matching input type, gathered_cpu::Vector{T}: CPU staging
send_bufs, recv_bufs: always CPU for MPI communication

Factorization

Factorization uses MUMPS (MUltifrontal Massively Parallel Solver) with distributed matrix input (ICNTL(18)=3).

MUMPSFactorization{Tin, AVin, Tinternal} (internal type, not exported)

Wraps a MUMPS object for distributed factorization
Type parameters: Tin = input element type, AVin = input array type, Tinternal = MUMPS-compatible type (Float64 or ComplexF64)
Created by lu(A) for general matrices or ldlt(A) for symmetric matrices
Automatically converts input to MUMPS-compatible types and back (e.g., Float32 GPU → Float64 CPU → solve → Float32 GPU)
Stores COO arrays (irn_loc, jcn_loc, a_loc) to prevent GC while MUMPS holds pointers

F = lu(A)
x = F \ b

Local Constructors

For efficient construction when data is already distributed:

HPCVector_local(v_local): Create from local vector portion
HPCSparseMatrix_local(SparseMatrixCSR(local_csc)): Create from local rows in CSR format
HPCSparseMatrix_local(transpose(AT_local)): Alternative using explicit transpose wrapper
HPCMatrix_local(A_local): Create from local dense rows

These infer the global partition via MPI.Allgather of local sizes.

When building from triplets (I, J, V), the efficient pattern is to build M^T directly as CSC by swapping indices, then wrap in lazy transpose for CSR:

AT_local = sparse(local_J, local_I, local_V, ncols, local_nrows)  # M^T as CSC
HPCSparseMatrix_local(transpose(AT_local))  # M in CSR format

This avoids an unnecessary physical transpose operation.

Matrix Multiplication Flow

Plan creation (MatrixPlan constructor): Uses Alltoall and Alltoallv to exchange row requests and sparse structure (colptr, rowval)
Value exchange (execute_plan!): Point-to-point Isend/Irecv to send actual matrix values
Local computation: C^T = B^T * A^T using reindexed local sparse matrices
Result construction: Directly wraps local result with inherited row partition from A

Matrix-Vector Multiplication Flow

For y = A * x where A::HPCSparseMatrix and x::HPCVector:

Plan creation (VectorPlan constructor): Uses Alltoall to exchange element request counts, then point-to-point to exchange indices
Value exchange (execute_plan!): Point-to-point Isend/Irecv to gather x[A.col_indices] into a local gathered vector
Local computation: Compute transpose(_get_csc(A)) * gathered where _get_csc(A) reconstructs a SparseMatrixCSC from the internal CSR arrays
Result construction: Result vector y inherits A.row_partition

Note: Uses transpose() (not adjoint ') to correctly handle complex values without conjugation.

Vector Operations

conj(v) - Returns new HPCVector with conjugated values (materialized)
transpose(v) - Returns lazy Transpose wrapper
v' (adjoint) - Returns transpose(conj(v)) where conj(v) is materialized
transpose(v) * A - Computes transpose(transpose(A) * v), returns transposed HPCVector
v' * A - Computes transpose(transpose(A) * conj(v)), returns transposed HPCVector
u + v, u - v - Automatic partition alignment if partitions differ

Lazy Transpose

transpose(A) returns Transpose{T, HPCSparseMatrix{T,Ti,AV}} (lazy wrapper). Materialization happens automatically when needed:

transpose(A) * transpose(B) → transpose(B * A) (stays lazy)
transpose(A) * B or A * transpose(B) → materializes via TransposePlan
HPCSparseMatrix(transpose(A)) → explicitly materialize the transpose (cached bidirectionally)

Indexing Operations

All distributed types support getindex and setindex! with cross-rank communication:

v[i], v[i:j], v[indices::HPCVector] for HPCVector
A[i,j], A[i:j, k:l], A[:, j], A[rows::HPCVector, cols] for HPCMatrix and HPCSparseMatrix
Setting values triggers structural modification for sparse matrices when inserting new nonzeros

Block Operations

cat(A, B; dims=...) - Concatenate matrices/vectors along specified dimensions
blockdiag(A, B, ...) - Create block diagonal matrix from multiple matrices

Key Design Patterns

All ranks must have identical copies of input matrices/vectors when constructing from global data
Row/element ownership: searchsortedlast(partition, index) - 1 gives the owning rank (0-indexed)
Communication uses MPI collective operations (Alltoall, Alltoallv) and point-to-point (Isend, Irecv)
Communication tags are organized by operation type (1-3, 10-11, 20-22, 30-35, 40-51, 60-125)
Tests run under mpiexec via the test harness in test/runtests.jl
Test matrices must be deterministic (not random) to ensure consistency across MPI ranks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLAUDE.md

Commit Messages

Build and Test Commands

MPI Configuration

GPU Support

Type Parameters

Creating Zero Arrays

CPU Staging

Sparse Operations with GPU Vectors

Extension Files

CUDA-Specific: cuDSS Multi-GPU Solver

Writing Unified CPU/GPU Functions

Architecture

Core design principle

MPI programmiing pitfalls.

Core Data Structures

Factorization

Local Constructors

Matrix Multiplication Flow

Matrix-Vector Multiplication Flow

Vector Operations

Lazy Transpose

Indexing Operations

Block Operations

Key Design Patterns

FilesExpand file tree

CLAUDE.md

Latest commit

History

CLAUDE.md

File metadata and controls

CLAUDE.md

Commit Messages

Build and Test Commands

MPI Configuration

GPU Support

Type Parameters

Creating Zero Arrays

CPU Staging

Sparse Operations with GPU Vectors

Extension Files

CUDA-Specific: cuDSS Multi-GPU Solver

Writing Unified CPU/GPU Functions

Architecture

Core design principle

MPI programmiing pitfalls.

Core Data Structures

Factorization

Local Constructors

Matrix Multiplication Flow

Matrix-Vector Multiplication Flow

Vector Operations

Lazy Transpose

Indexing Operations

Block Operations

Key Design Patterns