SysIdentPy is starting to adopt the Array API standard #1001
Thanks for sharing @wilsonrljr! It's good to hear that this was a pretty smooth process for you, and that the performance gains are significant. Regarding recursive or sequential operations: in SciPy those are the ones we've avoided converting until now, mostly because we expect the gains to be much lower or non-existent, or the conversion to be harder. Where we could still get gains on those kinds of algorithms is if they're amenable to being JIT-compiled. Maybe @ev-br or @lucascolley remembers a concrete case where we had success on an iterative algorithm?
I'm the maintainer of SysIdentPy, a Python library for nonlinear system identification and time series forecasting. I wanted to share that I've started adding Array API support, and I thought this community would be a good place to talk about it.
I was not sure how hard it would be. My library is built on top of NumPy and SciPy and relies heavily on matrix operations internally. But using `array-api-compat` and `array-api-extra` made things a lot simpler than I expected. I added the namespace dispatch at the public `fit()`/`predict()` boundary, and most of the internals just followed from there.

The results were encouraging. Some concrete numbers from my benchmarks (FROLS algorithm, polynomial degree 4, RTX 3080 Ti):
All speedups are relative to plain NumPy. And the dispatch layer itself adds less than 5% overhead when you keep NumPy as the backend.
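For anyone curious what dispatching at the public boundary can look like, here is a minimal sketch. The `fit` function and the normal-equations step are illustrative placeholders, not SysIdentPy's actual internals; the key idea is resolving the namespace once with `array_namespace` and writing everything below it against `xp`, so the same code runs on NumPy, PyTorch, CuPy, etc. (The `ImportError` fallback just lets the sketch run with NumPy alone.)

```python
import numpy as np

try:
    from array_api_compat import array_namespace
except ImportError:
    # Fallback so the sketch runs without array-api-compat installed:
    # treat every input as a NumPy array.
    def array_namespace(*arrays):
        return np

def fit(X, y):
    """Resolve the array namespace once at the public boundary,
    then express the internals only in terms of `xp`."""
    xp = array_namespace(X, y)
    # Hypothetical internal step: least squares via the normal equations.
    # Works unchanged on any backend that implements the Array API
    # linalg extension.
    theta = xp.linalg.solve(X.T @ X, X.T @ y)
    return theta
```

With this pattern the caller's array type decides where the computation runs, and keeping NumPy as the backend only pays the cost of the one `array_namespace` call.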
One thing worth mentioning about the predict side: my library supports free-run and n-step-ahead prediction modes, which are inherently recursive (each step depends on the output of the previous one). For that kind of loop, NumPy is already very fast, and putting it on the GPU actually made things worse: every iteration would trigger a GPU kernel launch and a device synchronization, so the overhead from managing those small sequential operations far outweighed any potential gain. The CPU fallback was the natural choice here. Since the per-step work is cheap and NumPy handles it efficiently, the overall cost stays low, and the predictions are numerically identical across all backends (differences within floating-point precision, ~1e-15).
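To make the sequential-dependency point concrete, here is a hypothetical free-run loop for a linear AR model. This is a toy sketch, not SysIdentPy's prediction code: the function name and the lagged-output regressor are illustrative. It shows why the loop cannot be vectorized across steps (step *k* needs the output of step *k−1*) and why converting to NumPy once at the boundary and keeping the cheap per-step work on CPU is attractive.

```python
import numpy as np

def free_run_predict(theta, y_init, n_steps):
    """Free-run (simulation) prediction: each step feeds the model's own
    previous output back in, so the loop is inherently sequential."""
    # Convert once at the boundary; on a GPU backend this is where you'd
    # pull the data back to host instead of launching a kernel per step.
    theta = np.asarray(theta, dtype=float)
    y = list(np.asarray(y_init, dtype=float))
    for _ in range(n_steps):
        # Regressor of lagged outputs, most recent first (linear AR model).
        phi = np.array(y[-theta.shape[0]:][::-1])
        y.append(float(phi @ theta))
    return np.array(y[len(y_init):])
```

For example, an AR(1) model with coefficient 0.5 started from 1.0 just halves the signal each step, and each of those steps is far too small to amortize a GPU kernel launch.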
NumPy and PyTorch (CPU/CUDA) are validated and covered by the test suite. CuPy and JAX are working but still experimental.
Here is a benchmark notebook with all the details if anyone is curious: https://github.com/wilsonrljr/sysidentpy/blob/feat/array_api/examples/array-api-benchmark.ipynb
I'm curious if others have run into the same situation with recursive or sequential operations, where falling back to CPU was the practical answer. Did you find a different approach that worked better? Would love to hear how others handled it.