Add CUDA extension with cuDSS multi-GPU solver; document known bug
- Add LinearAlgebraMPICUDAExt.jl with cu()/cpu() conversions and
CuDSSFactorizationMPI for distributed sparse direct solves via NCCL
- Add codecov.yml to exclude GPU extensions from coverage (no CI GPUs)
- Document cuDSS MGMN bug (status=5 on narrow-bandwidth matrices)
with workaround in docs/src/guide.md
- Update CLAUDE.md with CUDA extension documentation
- Various source improvements for GPU support
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

CLAUDE.md (49 additions, 8 deletions)

@@ -29,15 +29,26 @@ mpiexec -n 4 julia --project=. test/test_factorization.jl
 julia --project=. -e 'using Pkg; Pkg.precompile()'
 ```
 
+## MPI Configuration
+
+By default, MPI.jl uses MPItrampoline_jll. On some Linux clusters, this causes MUMPS to hang during the solve phase. If you experience hangs with multi-rank MUMPS tests, switch to MPICH_jll:
+
+```julia
+using MPIPreferences
+MPIPreferences.use_jll_binary("MPICH_jll")
+```
+
+This creates/updates `LocalPreferences.toml` (which is gitignored). Restart Julia after changing MPI preferences.
+
 ## GPU Support
 
-GPU acceleration is supported via Metal.jl (macOS) as a package extension.
+GPU acceleration is supported via Metal.jl (macOS) or CUDA.jl (Linux/Windows) as package extensions.
 
 ### Type Parameters
 
-- `VectorMPI{T,AV}` where `AV` is `Vector{T}` (CPU) or `MtlVector{T}` (GPU)
-- `MatrixMPI{T,AM}` where `AM` is `Matrix{T}` (CPU) or `MtlMatrix{T}` (GPU)
-- `SparseMatrixMPI{T,Ti,AV}` where `AV` is `Vector{T}` (CPU) or `MtlVector{T}` (GPU) for the `nzval` array
+- `VectorMPI{T,AV}` where `AV` is `Vector{T}` (CPU), `MtlVector{T}` (Metal), or `CuVector{T}` (CUDA)
+- `MatrixMPI{T,AM}` where `AM` is `Matrix{T}` (CPU), `MtlMatrix{T}` (Metal), or `CuMatrix{T}` (CUDA)
+- `SparseMatrixMPI{T,Ti,AV}` where `AV` is `Vector{T}` (CPU), `MtlVector{T}`, or `CuVector{T}` for the `nzval` array
 - Type aliases: `VectorMPI_CPU{T}`, `MatrixMPI_CPU{T}`, `SparseMatrixMPI_CPU{T,Ti}` for CPU-backed types
 
 ### Creating Zero Arrays
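A short sketch of these type parameters in use, assuming a `zeros` constructor for `VectorMPI_CPU` analogous to the ones shown in the next hunk and using the `cu()`/`cpu()` conversions provided by the extensions; the exact signatures are assumptions, not confirmed by this diff:

```julia
using MPI, CUDA, LinearAlgebraMPI   # load CUDA before LinearAlgebraMPI so the extension activates
MPI.Init()

v = zeros(VectorMPI_CPU{Float64}, 100)   # VectorMPI{Float64, Vector{Float64}} (assumed constructor)
v_gpu = cu(v)                            # presumably VectorMPI{Float64, CuVector{Float64}}
v_cpu = cpu(v_gpu)                       # back to a CPU-backed vector
```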
@@ -55,15 +66,20 @@ A = zeros(MatrixMPI_CPU{Float64}, 50, 30)
 S = zeros(SparseMatrixMPI{Float64,Int,Vector{Float64}}, 100, 100)
 S = zeros(SparseMatrixMPI_CPU{Float64,Int}, 100, 100)
 
-# GPU zero arrays (requires Metal.jl loaded)
+# GPU zero arrays (requires Metal.jl or CUDA.jl loaded)
...
-MPI communication always uses CPU buffers (no Metal-aware MPI exists). GPU data is staged through CPU:
+MPI communication always uses CPU buffers (no GPU-aware MPI). GPU data is staged through CPU:
 
 1. GPU vector data copied to CPU staging buffer
 2. MPI communication on CPU buffers
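A hedged guess at the GPU-backed zero constructors, by analogy with the CPU calls above and the type parameters listed in the previous hunk; the exact GPU `zeros` signature is an assumption:

```julia
using MPI, CUDA, LinearAlgebra, LinearAlgebraMPI
MPI.Init()

# Hypothetical GPU-backed zero arrays, mirroring the CPU constructors above:
x = zeros(VectorMPI{Float64, CuVector{Float64}}, 100)
A = zeros(MatrixMPI{Float64, CuMatrix{Float64}}, 50, 30)

# Norms and reductions are listed as supported vector operations; any MPI traffic
# they trigger is staged through CPU buffers as described in steps 1-2 above.
nx = norm(x)
```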
@@ -84,7 +100,32 @@ Sparse matrices remain on CPU (Julia's `SparseMatrixCSC` doesn't support GPU arr
 ### Extension Files
 
 - `ext/LinearAlgebraMPIMetalExt.jl` - Metal extension with `mtl()` and `cpu()` functions
-- Loaded automatically when `using Metal` before `using LinearAlgebraMPI`
+- `ext/LinearAlgebraMPICUDAExt.jl` - CUDA extension with `cu()` and `cpu()` functions, plus cuDSS multi-GPU solver
+- Loaded automatically when `using Metal` or `using CUDA` before `using LinearAlgebraMPI`
+
+### CUDA-Specific: cuDSS Multi-GPU Solver
+
+The CUDA extension includes `CuDSSFactorizationMPI` for distributed sparse direct solves using NVIDIA's cuDSS library with NCCL inter-GPU communication:
...
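A minimal sketch of how this solver might be driven, assuming a `CuDSSFactorizationMPI(A)` constructor and the `solve(F, b)` interface used elsewhere in the package; neither signature is confirmed by this excerpt:

```julia
using MPI, CUDA, LinearAlgebraMPI
MPI.Init()

# Assume A is a distributed sparse matrix (SparseMatrixMPI) and b a matching VectorMPI,
# with one CUDA device per MPI rank; both names are placeholders.
F = CuDSSFactorizationMPI(A)   # hypothetical constructor: cuDSS factorization, ranks linked via NCCL
x = solve(F, b)                # solve A * x = b with the distributed factorization
```

The commit also documents a known cuDSS MGMN bug (status=5 on narrow-bandwidth matrices); the workaround is described in docs/src/guide.md.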
 - **Vector operations**: norms, reductions, arithmetic with automatic partition alignment
 - Support for both `Float64` and `ComplexF64` element types
-- **GPU acceleration** via Metal.jl (macOS) with automatic CPU staging for MPI
+- **GPU acceleration** via Metal.jl (macOS) or CUDA.jl (Linux/Windows) with automatic CPU staging for MPI
+- **Multi-GPU sparse direct solver** via cuDSS with NCCL communication (CUDA only)
 
 ## Installation
 
@@ -66,11 +67,11 @@ F = ldlt(A_sym_dist) # LDLT factorization
 x_sol = solve(F, y)  # Solve A_sym * x_sol = y
 ```
 
-## GPU Support (Metal)
+## GPU Support
 
-LinearAlgebraMPI supports GPU acceleration on macOS via Metal.jl. GPU support is optional - Metal.jl is loaded as a weak dependency.
+LinearAlgebraMPI supports GPU acceleration via Metal.jl (macOS) or CUDA.jl (Linux/Windows). GPU support is optional - extensions are loaded as weak dependencies.
 
-### Converting between CPU and GPU
+### Metal (macOS)
 
 ```julia
 using Metal  # Load Metal BEFORE MPI for GPU detection