Restore automatic synchronized finalization for MUMPS factorizations

Sébastien Loisel · Sébastien Loisel · commit e48f38e735e4 · 2025-12-16T20:04:56.000+01:00
Re-adds the automatic cleanup system:
- Each factorization gets a unique ID tracked in a global registry
- Julia finalizers queue IDs to a thread-safe destroy list (no MPI)
- _process_finalizers() gathers pending IDs from all ranks, merges,
  and finalizes in deterministic order
- Registry check prevents double-finalization

Manual finalize!(F) remains available for explicit control.
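
The queue-then-merge scheme described above can be sketched without MPI or MUMPS. This is a minimal illustration, not the code in this commit: `SimRank`, `queue_for_destruction!`, `drain!`, and `process_finalizers!` are hypothetical names, and a plain `vcat` over simulated ranks stands in for the collective gather.

```julia
# Hypothetical stand-ins: no MPI and no MUMPS are involved in this sketch.
# Each simulated rank owns a registry of live IDs and a lock-protected queue.
struct SimRank
    registry::Dict{Int,String}   # id => placeholder for a factorization
    destroy_list::Vector{Int}    # IDs queued by (simulated) finalizers
    lock::ReentrantLock
end

SimRank(ids) = SimRank(Dict(id => "factorization $id" for id in ids), Int[], ReentrantLock())

# What a finalizer would do: record the ID under the lock, nothing else.
queue_for_destruction!(r::SimRank, id::Int) = lock(() -> push!(r.destroy_list, id), r.lock)

# Detach the pending list under the lock, leaving an empty queue behind.
function drain!(r::SimRank)
    lock(r.lock) do
        pending = copy(r.destroy_list)
        empty!(r.destroy_list)
        pending
    end
end

# Stand-in for the collective step: every rank works from the same merged list.
function process_finalizers!(ranks::Vector{SimRank})
    gathered = vcat((drain!(r) for r in ranks)...)
    dead_list = sort!(unique(gathered))      # same order on every rank
    finalized = [Int[] for _ in ranks]
    for (k, r) in enumerate(ranks), id in dead_list
        if haskey(r.registry, id)            # registry check prevents double-finalization
            delete!(r.registry, id)
            push!(finalized[k], id)
        end
    end
    return finalized
end

ranks = [SimRank(1:3) for _ in 1:2]
queue_for_destruction!(ranks[1], 3)   # first rank's GC queued ID 3, then ID 1
queue_for_destruction!(ranks[1], 1)
queue_for_destruction!(ranks[2], 1)   # second rank has only queued ID 1 so far
order = process_finalizers!(ranks)    # order == [[1, 3], [1, 3]]
again = process_finalizers!(ranks)    # nothing pending, nothing finalized
```

Both simulated ranks finalize IDs 1 and 3 in the same order even though their local queues differed, which is the property the collective MUMPS cleanup relies on.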
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -91,17 +91,16 @@ Factorization uses MUMPS (MUltifrontal Massively Parallel Solver) with distribut
 - Created by `lu(A)` for general matrices or `ldlt(A)` for symmetric matrices
 - Stores COO arrays (irn_loc, jcn_loc, a_loc) to prevent GC while MUMPS holds pointers
 
-**Important: Manual cleanup required.** Unlike other types in this library, factorization objects
-require explicit cleanup via `finalize!(F)`. This is because MUMPS cleanup routines call MPI
-functions, and Julia's GC may run finalizers after MPI has shut down (causing crashes). Example:
+**Automatic cleanup:** Factorization objects are cleaned up automatically after they are garbage collected.
+The GC finalizer only queues the object; MUMPS resources are released collectively on all MPI ranks when the next factorization is created. Example:
 
 ```julia
 F = lu(A)
 x = F \ b
-finalize!(F)  # Required!
+# F is cleaned up automatically once it is GC'd and the next factorization is created
 ```
 
-If `finalize!` is not called, the program still works but MUMPS memory leaks until exit.
+Manual `finalize!(F)` is still available for explicit control (must be called on all ranks together).
 
 ### Local Constructors
 
diff --git a/README.md b/README.md
@@ -63,7 +63,7 @@ A_sym = A + transpose(A) + 10I  # Make symmetric positive definite
 A_sym_dist = SparseMatrixMPI{Float64}(A_sym)
 F = ldlt(A_sym_dist)  # LDLT factorization
 x_sol = solve(F, y)   # Solve A_sym * x_sol = y
-finalize!(F)          # Release factorization resources
+# F is automatically cleaned up when garbage collected
 ```
 
 ## Running with MPI
diff --git a/docs/src/api.md b/docs/src/api.md
@@ -428,7 +428,10 @@ solve
 solve!
 ```
 
-### Releasing Factorization Resources
+### Manual Cleanup (Optional)
+
+Factorization objects are automatically cleaned up when garbage collected.
+For explicit control, `finalize!` can be called manually (must be called on all ranks together).
 
 ```@docs
 finalize!
@@ -456,14 +459,14 @@ x = solve(F, b)
 # Or use backslash
 x = F \ b
 
-# Release factorization resources when done
-finalize!(F)
+# F is automatically cleaned up when garbage collected
+# (or call finalize!(F) for immediate cleanup on all ranks)
 
 # For non-symmetric matrices, use LU
 A_nonsym = SparseMatrixMPI{Float64}(sprand(1000, 1000, 0.01) + 10I)
 F_lu = lu(A_nonsym)
 x = F_lu \ b
-finalize!(F_lu)
+# F_lu is automatically cleaned up when garbage collected
 ```
 
 ### Direct Solve Syntax
@@ -481,7 +484,7 @@ x = transpose(b) / A           # solve x*A = transpose(b)
 x = transpose(b) / transpose(A)  # solve x*transpose(A) = transpose(b)
 ```
 
-Note: One-shot solves like `A \ b` automatically clean up the factorization. For repeated solves with the same matrix, compute the factorization once with `lu()` or `ldlt()`, reuse it, then call `finalize!()` when done.
+Note: Factorizations are automatically cleaned up when garbage collected. Cleanup is synchronized across MPI ranks when the next factorization is created.
 
 ## Cache Management
 
diff --git a/src/mumps_factorization.jl b/src/mumps_factorization.jl
@@ -11,6 +11,35 @@ using MUMPS
 using MUMPS: Mumps, set_icntl!, MUMPS_INT, MUMPS_INT8, suppress_printing!
 import MUMPS: invoke_mumps_unsafe!
 
+# ============================================================================
+# MUMPS Automatic Finalization Management
+# ============================================================================
+#
+# MUMPS cleanup requires synchronized MPI calls across all ranks, but Julia's
+# GC runs asynchronously on each rank. This system handles automatic cleanup:
+#
+# 1. Each MUMPS factorization gets a unique integer ID (_mumps_count)
+# 2. Objects are registered in _mumps_registry by ID
+# 3. Julia's GC finalizer queues the ID to _destroy_list (no MPI calls)
+# 4. When creating a new factorization, _process_finalizers() is called:
+#    - All ranks exchange their _destroy_list (Allgather/Allgatherv)
+#    - Lists are merged, sorted, uniqued
+#    - Objects are finalized in deterministic order across all ranks
+#
+# This ensures synchronized cleanup without blocking in finalizers.
+
+# Global counter for unique MUMPS object IDs
+const _mumps_count = Ref{Int}(0)
+
+# Registry mapping ID -> MUMPSFactorizationMPI (prevents GC until removed from registry)
+const _mumps_registry = Dict{Int, Any}()
+
+# List of MUMPS IDs queued for destruction by this rank's GC
+const _destroy_list = Int[]
+
+# Lock for thread-safe access to _destroy_list (finalizers may run from GC thread)
+const _destroy_list_lock = ReentrantLock()
+
 # ============================================================================
 # MUMPS Factorization Type
 # ============================================================================
@@ -20,9 +49,12 @@ import MUMPS: invoke_mumps_unsafe!
 
 Distributed MUMPS factorization result. Can be reused for multiple solves.
 
-**Important:** Call `finalize!(F)` when done to release MUMPS resources.
+Factorization objects are automatically cleaned up when garbage collected,
+with synchronized finalization across MPI ranks. Manual `finalize!(F)` is
+still available for explicit control (must be called on all ranks together).
 """
 mutable struct MUMPSFactorizationMPI{T}
+    id::Int  # Unique ID for finalization tracking
     mumps::Any  # Mumps{T,R} where R is the real type (Float64 for both real and complex)
     irn_loc::Vector{MUMPS_INT}
     jcn_loc::Vector{MUMPS_INT}
@@ -36,6 +68,72 @@ end
 Base.size(F::MUMPSFactorizationMPI) = (F.n, F.n)
 Base.eltype(::MUMPSFactorizationMPI{T}) where T = T
 
+# ============================================================================
+# Automatic Finalization Functions
+# ============================================================================
+
+"""
+    _queue_for_destruction(F::MUMPSFactorizationMPI)
+
+Julia finalizer callback. Queues the factorization ID for later synchronized
+destruction. Does NOT call MPI (unsafe from GC thread).
+"""
+function _queue_for_destruction(F::MUMPSFactorizationMPI)
+    lock(_destroy_list_lock) do
+        push!(_destroy_list, F.id)
+    end
+    return nothing
+end
+
+"""
+    _process_finalizers()
+
+Process pending MUMPS finalizations in a synchronized manner across all ranks.
+This is a **collective operation** - all ranks must call it together.
+
+Called automatically when creating new factorizations. Gathers pending
+destruction requests from all ranks, merges them, and finalizes in
+deterministic order.
+"""
+function _process_finalizers()
+    comm = MPI.COMM_WORLD
+    nranks = MPI.Comm_size(comm)
+
+    # Thread-safe: detach current destroy list, replace with empty
+    local_list = lock(_destroy_list_lock) do
+        list = copy(_destroy_list)
+        empty!(_destroy_list)
+        list
+    end
+
+    # Allgather counts of how many IDs each rank has
+    local_count = Int32(length(local_list))
+    all_counts = MPI.Allgather(local_count, comm)
+
+    # Allgatherv to collect all IDs from all ranks
+    total_count = sum(all_counts)
+    if total_count == 0
+        return  # Nothing to finalize
+    end
+
+    all_ids = Vector{Int}(undef, total_count)
+    MPI.Allgatherv!(local_list, MPI.VBuffer(all_ids, all_counts), comm)
+
+    # Sort and unique to get deterministic order across all ranks
+    dead_list = sort!(unique(all_ids))
+
+    # Finalize each in order (check registry to avoid double-finalize)
+    for id in dead_list
+        if haskey(_mumps_registry, id)
+            F = _mumps_registry[id]
+            delete!(_mumps_registry, id)
+            # Actually finalize the MUMPS object
+            F.mumps._finalized = false  # Re-enable MUMPS finalization
+            MUMPS.finalize!(F.mumps)
+        end
+    end
+end
+
 # ============================================================================
 # Extract COO from SparseMatrixMPI
 # ============================================================================
@@ -96,6 +194,13 @@ function _create_mumps_factorization(A::SparseMatrixMPI{T}, symmetric::Bool) whe
     comm = MPI.COMM_WORLD
     rank = MPI.Comm_rank(comm)
 
+    # Process any pending finalizations first (collective operation)
+    _process_finalizers()
+
+    # Assign unique ID for this factorization
+    id = _mumps_count[]
+    _mumps_count[] += 1
+
     m, n = size(A)
     @assert m == n "Matrix must be square for factorization"
 
@@ -107,7 +212,7 @@ function _create_mumps_factorization(A::SparseMatrixMPI{T}, symmetric::Bool) whe
     # sym=0: unsymmetric, sym=1: SPD, sym=2: general symmetric
     mumps_sym = symmetric ? MUMPS.mumps_definite : MUMPS.mumps_unsymmetric
     mumps = Mumps{T}(mumps_sym, MUMPS.default_icntl, MUMPS.default_cntl64)
-    mumps._finalized = true  # Disable GC finalizer to avoid post-MPI crash
+    mumps._finalized = true  # Disable MUMPS GC finalizer to avoid post-MPI crash
 
     # Suppress all MUMPS output
     suppress_printing!(mumps)
@@ -142,10 +247,19 @@ function _create_mumps_factorization(A::SparseMatrixMPI{T}, symmetric::Bool) whe
     # Pre-allocate RHS buffer on rank 0
     rhs_buffer = rank == 0 ? zeros(T, n) : T[]
 
-    return MUMPSFactorizationMPI{T}(
-        mumps, irn_loc, jcn_loc, a_loc,
+    # Create factorization object with ID
+    F = MUMPSFactorizationMPI{T}(
+        id, mumps, irn_loc, jcn_loc, a_loc,
         n, symmetric, copy(A.row_partition), rhs_buffer
     )
+
+    # Register in global registry (prevents GC until removed)
+    _mumps_registry[id] = F
+
+    # Attach Julia finalizer to queue for synchronized destruction
+    finalizer(_queue_for_destruction, F)
+
+    return F
 end
 
 """
@@ -168,7 +282,7 @@ end
 
 Compute LU factorization of a distributed sparse matrix using MUMPS.
 Returns a `MUMPSFactorizationMPI` for use with `\\` or `solve`.
-Call `finalize!(F)` when done.
+The factorization is cleaned up automatically when garbage collected.
 """
 function LinearAlgebra.lu(A::SparseMatrixMPI{T}) where T
     return _create_mumps_factorization(A, false)
@@ -180,7 +294,7 @@ end
 Compute LDLT factorization of a distributed symmetric sparse matrix using MUMPS.
 The matrix must be symmetric; only the lower triangular part is used.
 Returns a `MUMPSFactorizationMPI` for use with `\\` or `solve`.
-Call `finalize!(F)` when done.
+The factorization is cleaned up automatically when garbage collected.
 """
 function LinearAlgebra.ldlt(A::SparseMatrixMPI{T}) where T
     return _create_mumps_factorization(A, true)
@@ -257,9 +371,22 @@ end
 """
     finalize!(F::MUMPSFactorizationMPI)
 
-Release MUMPS resources. Must be called when done with the factorization.
+Manually release MUMPS resources. This is a **collective operation** - all
+ranks must call it together for immediate cleanup.
+
+If the factorization has already been cleaned up (by automatic finalization
+or a previous manual call), this is a no-op, but all ranks must still call it.
 """
 function finalize!(F::MUMPSFactorizationMPI)
+    # Check if already finalized (removed from registry)
+    if !haskey(_mumps_registry, F.id)
+        return F  # Already finalized, no-op
+    end
+
+    # Remove from registry
+    delete!(_mumps_registry, F.id)
+
+    # Actually finalize the MUMPS object
     F.mumps._finalized = false  # Re-enable MUMPS finalization
     MUMPS.finalize!(F.mumps)
     return F