
Refactor MIRK k_discrete from Matrix to Vector{Vector} storage #458

Open

ChrisRackauckas-Claude wants to merge 4 commits into SciML:master from ChrisRackauckas-Claude:cr/mirk-voa-k-discrete

Conversation

@ChrisRackauckas-Claude
Contributor

Summary

Refactors k_discrete storage from Vector{DiffCache{Matrix{T}}} (N×stage matrix per mesh interval) to Vector{Vector{DiffCache{Vector{T}}}} (vector of per-stage DiffCache vectors). This eliminates all SubArray view allocations from K[:, r] column access in the hot collocation inner loop.

Depends on #453 (allocation fixes) — this branch includes those commits.

Before (matrix columns)

K = get_tmp(k_discrete[i], u)     # returns N×stage Matrix
f!(K[:, r], tmp, p, t)            # K[:, r] creates a SubArray view
__maybe_matmul!(tmp, K[:, 1:(r-1)], x[r, 1:(r-1)], h, T(1))  # more views

After (vector of vectors)

K_r = get_tmp(k_discrete[i][r], u)  # returns N-Vector directly
f!(K_r, tmp, p, t)                   # no view needed
# Inline matmul loop — no views
for j in 1:(r-1)
    Kⱼ = get_tmp(k_discrete[i][j], u)
    @simd ivdep for k in 1:N
        @inbounds tmp[k] += Kⱼ[k] * h * x[r, j]
    end
end
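The storage change can be seen in isolation with a minimal, self-contained sketch (plain Vectors stand in for DiffCache; all names here are illustrative, not the package's API):

```julia
# Minimal sketch of the storage change, with plain Vectors standing in
# for DiffCache (hypothetical names, not the package's actual API).
N, stage, n_intervals = 3, 4, 5

# Before: one N×stage matrix per mesh interval; stage access needs a column view.
k_mat = [zeros(N, stage) for _ in 1:n_intervals]
col = @view k_mat[1][:, 2]        # SubArray wrapper — the allocation being removed

# After: per-stage vectors per mesh interval; stage access is a plain Vector.
k_vov = [[zeros(N) for _ in 1:stage] for _ in 1:n_intervals]
K_r = k_vov[1][2]                 # no view, no wrapper
K_r .= 1.0
```

Indexing `k_vov[i][r]` returns an existing `Vector` by reference, so writes through `K_r` mutate the stored stage vector with no intermediate object.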

Changes

  • mirk.jl: k_discrete allocation uses nested vector comprehension
  • collocation.jl: All 5 Φ!/Φ methods rewritten with inline matmul loops
  • adaptivity.jl: interp_setup! and sum_stages! use VoV patterns
  • interpolation.jl: sum_stages! and EvalSol use VoV patterns
  • utils.jl: Add __resize! for Vector{Vector{DiffCache}} and __maybe_matmul! for AbstractVector{<:AbstractVector}
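For the utils.jl bullet, a hypothetical standalone sketch of what a VoV `__maybe_matmul!`-style overload has to compute (following 5-arg `mul!(C, A, B, α, β)` semantics, i.e. `tmp = α * Σⱼ xⱼ*Kⱼ + β*tmp`; the real method lives in utils.jl and may differ):

```julia
# Hypothetical sketch of a VoV "matmul": tmp = α * Σⱼ xⱼ*Kⱼ + β*tmp,
# where Ks is a vector of stage vectors. No matrix, no views.
function vov_matmul!(tmp, Ks, xs, α, β)
    @inbounds for k in eachindex(tmp)
        tmp[k] *= β
    end
    for (Kj, xj) in zip(Ks, xs)
        @simd ivdep for k in eachindex(tmp)
            @inbounds tmp[k] += α * xj * Kj[k]
        end
    end
    return tmp
end

tmp = ones(2)
vov_matmul!(tmp, [[1.0, 2.0], [3.0, 4.0]], [0.5, 2.0], 1.0, 1.0)
```

With `α = h` and `β = one(T)` this matches the accumulation pattern shown in the "After" snippet above.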

Results

  • Loss function: 0 bytes/call (verified via @timed over 100k calls)
  • All MIRK2/4/5/6 solvers pass for StandardBVP, TwoPointBVP, and adaptive modes

Test plan

  • MIRK2/4/5/6 × StandardBVP/TwoPointBVP × adaptive/non-adaptive all pass
  • Solutions match expected values (cos(1.0) ≈ 0.540302)
  • Zero per-step allocations verified
  • CI

🤖 Generated with Claude Code

ChrisRackauckas and others added 4 commits March 25, 2026 05:23
Three key changes to reduce allocations in the hot path (loss function
and Jacobian evaluation called every Newton iteration):

1. Remove Logging.with_logger(NullLogger()) wrapper from get_tmp
   - The wrapper allocated a closure + task-local context on EVERY call
   - get_tmp is called 50+ times per loss evaluation
   - The warning it suppressed only occurs during adaptive cache resize
   - This single change reduced total solve allocations by ~60%

2. Eliminate array comprehensions in MIRK loss functions
   - Replaced `[get_tmp(r, u) for r in residual]` with direct DiffCache
     passing to Φ! and recursive_flatten!
   - Added _maybe_get_tmp helper for Φ! to handle both DiffCache and
     plain array residuals
   - Added recursive_flatten!/recursive_flatten_twopoint! overloads for
     AbstractVector{<:DiffCache}

3. Fix NoDiffCacheNeeded Φ! to reuse fᵢ_cache instead of similar()
   - `similar(fᵢ_cache)` allocated a new vector every call
   - fᵢ_cache is a scratch buffer that can be used directly

Results (dt=0.1, N=2, MIRK4, non-adaptive):
- Total solve: 3,290 → 1,180 allocations (64% reduction)
- Loss per call: 13,552 → 4,080 bytes (70% reduction)
- Jacobian per call: 20,448 → 8,784 bytes (57% reduction)

Adds allocation regression tests to verify loss function allocations
stay bounded.
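The "bytes per call" figures above can be reproduced with a harness along these lines (hypothetical helper; `loss` is a trivial stand-in for the actual loss function):

```julia
# Hypothetical measurement harness: warm up once, then average bytes over
# many calls via @timed, as described in the commit message.
function bytes_per_call(f, x; n = 100_000)
    f(x)                               # force compilation before measuring
    stats = @timed for _ in 1:n
        f(x)
    end
    return stats.bytes / n
end

loss(x) = sum(abs2, x)                 # allocation-free stand-in loss
```

Averaging over many calls washes out one-off GC bookkeeping, which is why a single `@allocated` sample can be misleading here.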

Reference: https://discourse.julialang.org/t/boundaryvaluediffeq-jl-reducing-allocations/136255

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…IRKN

The `EvalSol.u[1:end] .= x` and `EvalSol.cache.k_discrete[1:end] .=`
patterns create unnecessary SubArray views under @views. Using copyto!
avoids the view allocation entirely.

Applied to MIRK, FIRK, and MIRKN loss functions.
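The replaced pattern can be shown with a toy example (hypothetical `dst`/`src` names; only the broadcast-vs-copyto! distinction matters):

```julia
# Hypothetical toy version of the fix: `dst[1:end] .= src` wraps the
# destination in a SubArray before broadcasting; copyto! does not.
src = [1.0, 2.0, 3.0, 4.0]

dst_view = zeros(4)
@views dst_view[1:end] .= src      # goes through a SubArray wrapper

dst_copy = zeros(4)
copyto!(dst_copy, src)             # writes in place, no wrapper
```

Both produce the same result; `copyto!` simply skips the intermediate view object in the hot path.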

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… test

Add y_cache field to MIRKCache that pre-allocates the Vector{Vector{T}}
needed by recursive_unflatten! for the primal (Float64) path. This
eliminates the last 144 bytes/call from the `get_tmp` broadcast allocation.

For the Dual path (ForwardDiff Jacobian computation), falls back to the
existing broadcast since element types differ.

Loss function is now fully allocation-free at runtime (0 bytes/call
verified via @timed over 100k calls).

Add zero-allocation QA test in the qa test group.
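A sketch of the kind of pre-allocated unflatten this enables (hypothetical `unflatten!`; the real code path uses `recursive_unflatten!` with the new `y_cache` field):

```julia
# Hypothetical in-place unflatten into a pre-allocated Vector{Vector{T}},
# avoiding per-call vector construction on the Float64 (primal) path.
function unflatten!(y::Vector{Vector{T}}, flat::AbstractVector{T}) where {T}
    i = 0
    for yi in y
        for k in eachindex(yi)
            yi[k] = flat[i += 1]
        end
    end
    return y
end

y_cache = [zeros(2) for _ in 1:3]   # allocated once, at cache construction
unflatten!(y_cache, collect(1.0:6.0))
```

The signature constrains both arguments to the same element type `T`, which is why the Dual path (differing element types under ForwardDiff) falls back to the existing broadcast.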

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change k_discrete from Vector{DiffCache{Matrix{T}}} (N×stage matrix per
mesh interval) to Vector{Vector{DiffCache{Vector{T}}}} (vector of
per-stage DiffCache vectors per mesh interval).

This eliminates all SubArray view allocations from K[:, r] column access
in the collocation inner loop. Stage vectors are now accessed directly as
k_discrete[i][r] instead of via matrix column views.

Changes:
- mirk.jl: k_discrete allocation uses nested vector comprehension
- collocation.jl: All 5 Φ!/Φ methods rewritten with inline matmul loops
  using per-stage get_tmp(k_discrete[i][r], u) access
- adaptivity.jl: interp_setup! and sum_stages! use VoV patterns
- interpolation.jl: sum_stages! and EvalSol use VoV patterns
- utils.jl: Add __resize! for Vector{Vector{DiffCache}} and
  __maybe_matmul! for AbstractVector{<:AbstractVector}

Loss function is fully allocation-free (0 bytes/call verified via @timed
over 100k calls).

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Benchmark Results

| Benchmark | master | d0092fb... | master / d0092fb... |
|---|---|---|---|
| Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK2() | 0.569 ± 0.028 s | 0.644 ± 0.076 s | 0.883 ± 0.11 |
| Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK3() | 12.1 ± 0.99 ms | 22.2 ± 3.8 ms | 0.546 ± 0.1 |
| Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK4() | 2.41 ± 0.16 ms | 2.6 ± 0.34 ms | 0.928 ± 0.14 |
| Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK5() | 2.88 ± 0.62 ms | 2.85 ± 0.95 ms | 1.01 ± 0.4 |
| Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK6() | 1.18 ± 0.34 ms | 1.14 ± 0.24 ms | 1.04 ± 0.37 |
| Simple Pendulum/IIP/MultipleShooting(10, Tsit5; grid_coarsening = false) | 2.58 ± 0.76 ms | 3.59 ± 2.4 ms | 0.719 ± 0.53 |
| Simple Pendulum/IIP/MultipleShooting(10, Tsit5; grid_coarsening = true) | 4.73 ± 1.2 ms | 5.28 ± 1.4 ms | 0.896 ± 0.33 |
| Simple Pendulum/IIP/MultipleShooting(100, Tsit5; grid_coarsening = false) | 0.067 ± 0.011 s | 0.0948 ± 0.034 s | 0.707 ± 0.28 |
| Simple Pendulum/IIP/MultipleShooting(100, Tsit5; grid_coarsening = true) | 0.102 ± 0.033 s | 0.202 ± 0.11 s | 0.506 ± 0.32 |
| Simple Pendulum/IIP/Shooting(Tsit5()) | 0.314 ± 0.1 ms | 0.315 ± 0.097 ms | 0.996 ± 0.45 |
| Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK2() | 0.758 ± 0.055 s | 0.881 ± 0.14 s | 0.861 ± 0.15 |
| Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK3() | 15.8 ± 7.3 ms | 20.9 ± 12 ms | 0.758 ± 0.55 |
| Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK4() | 2.66 ± 0.22 ms | 4.57 ± 1.3 ms | 0.582 ± 0.18 |
| Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK5() | 3.55 ± 0.42 ms | 7.26 ± 1.3 ms | 0.489 ± 0.1 |
| Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK6() | 1.4 ± 0.29 ms | 2.44 ± 1.4 ms | 0.573 ± 0.36 |
| Simple Pendulum/OOP/MultipleShooting(10, Tsit5; grid_coarsening = false) | 6.24 ± 3.7 ms | 6.02 ± 3.9 ms | 1.04 ± 0.91 |
| Simple Pendulum/OOP/MultipleShooting(10, Tsit5; grid_coarsening = true) | 14.9 ± 15 ms | 8.43 ± 8 ms | 1.76 ± 2.5 |
| Simple Pendulum/OOP/MultipleShooting(100, Tsit5; grid_coarsening = false) | 0.0838 ± 0.0067 s | 0.14 ± 0.024 s | 0.597 ± 0.11 |
| Simple Pendulum/OOP/MultipleShooting(100, Tsit5; grid_coarsening = true) | 0.136 ± 0.027 s | 0.189 ± 0.1 s | 0.719 ± 0.42 |
| Simple Pendulum/OOP/Shooting(Tsit5()) | 0.613 ± 0.22 ms | 1.31 ± 0.23 ms | 0.467 ± 0.19 |
| time_to_load | 7.45 ± 0.16 s | 9.83 ± 3.5 s | 0.758 ± 0.27 |
Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR. Go to "Actions" -> "Benchmark a pull request" -> [the most recent run] -> "Artifacts" (at the bottom).
