SysIdentPy is starting to adopt the Array API standard #1001
Thanks for sharing @wilsonrljr! It's good to hear that this was a pretty smooth process for you, and that the performance gains are significant. Regarding recursive or sequential operations: in SciPy those are the ones we've avoided converting until now, mostly because we expect the gains to be much lower or non-existent, or the conversion to be harder. Where we could still get gains on those kinds of algorithms is if they're amenable to being JIT-compiled. Maybe @ev-br or @lucascolley remembers a concrete case where we had success on an iterative algorithm?
I'm the maintainer of SysIdentPy, a Python library for nonlinear system identification and time series forecasting. I wanted to share that I've started adding Array API support, and I thought this community would be a good place to talk about it.
I was not sure how hard it would be. My library is built on top of NumPy and SciPy and relies heavily on matrix operations internally. But using `array-api-compat` and `array-api-extra` made things a lot simpler than I expected. I added the namespace dispatch at the public `fit()`/`predict()` boundary, and most of the internals just followed from there.

The results were encouraging. Some concrete numbers from my benchmarks (FROLS algorithm, polynomial degree 4, RTX 3080 Ti):
All speedups are relative to plain NumPy. And the dispatch layer itself adds less than 5% overhead when you keep NumPy as the backend.
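For anyone curious what dispatching at the public boundary can look like, here is a minimal sketch. The `fit` function and the normal-equations step are illustrative placeholders, not SysIdentPy's actual internals; the key idea is resolving the namespace once with `array_namespace` and writing everything below it against `xp`, so the same code runs on NumPy, PyTorch, CuPy, etc. (The `ImportError` fallback just lets the sketch run with NumPy alone.)

```python
import numpy as np

try:
    from array_api_compat import array_namespace
except ImportError:
    # Fallback so the sketch runs without array-api-compat installed:
    # treat every input as a NumPy array.
    def array_namespace(*arrays):
        return np

def fit(X, y):
    """Resolve the array namespace once at the public boundary,
    then express the internals only in terms of `xp`."""
    xp = array_namespace(X, y)
    # Hypothetical internal step: least squares via the normal equations.
    # Works unchanged on any backend that implements the Array API
    # linalg extension.
    theta = xp.linalg.solve(X.T @ X, X.T @ y)
    return theta
```

With this pattern the caller's array type decides where the computation runs, and keeping NumPy as the backend only pays the cost of the one `array_namespace` call.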
One thing worth mentioning about the predict side: my library supports free-run and n-step-ahead prediction modes, which are inherently recursive (each step depends on the output of the previous one). For that kind of loop, NumPy is already very fast, and putting it on the GPU actually made things worse: every iteration would trigger a GPU kernel launch and a device synchronization, so the overhead from managing those small sequential operations far outweighed any potential gain. The CPU fallback was the natural choice here. Since the per-step work is cheap and NumPy handles it efficiently, the overall cost stays low, and the predictions are numerically identical across all backends (differences within floating-point precision, ~1e-15).
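To make the sequential-dependency point concrete, here is a hypothetical free-run loop for a linear AR model. This is a toy sketch, not SysIdentPy's prediction code: the function name and the lagged-output regressor are illustrative. It shows why the loop cannot be vectorized across steps (step *k* needs the output of step *k−1*) and why converting to NumPy once at the boundary and keeping the cheap per-step work on CPU is attractive.

```python
import numpy as np

def free_run_predict(theta, y_init, n_steps):
    """Free-run (simulation) prediction: each step feeds the model's own
    previous output back in, so the loop is inherently sequential."""
    # Convert once at the boundary; on a GPU backend this is where you'd
    # pull the data back to host instead of launching a kernel per step.
    theta = np.asarray(theta, dtype=float)
    y = list(np.asarray(y_init, dtype=float))
    for _ in range(n_steps):
        # Regressor of lagged outputs, most recent first (linear AR model).
        phi = np.array(y[-theta.shape[0]:][::-1])
        y.append(float(phi @ theta))
    return np.array(y[len(y_init):])
```

For example, an AR(1) model with coefficient 0.5 started from 1.0 just halves the signal each step, and each of those steps is far too small to amortize a GPU kernel launch.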
NumPy and PyTorch (CPU/CUDA) are validated and covered by the test suite. CuPy and JAX are working but still experimental.
Here is a benchmark notebook with all the details if anyone is curious: https://github.com/wilsonrljr/sysidentpy/blob/feat/array_api/examples/array-api-benchmark.ipynb
I'm curious if others have run into the same situation with recursive or sequential operations, where falling back to CPU was the practical answer. Did you find a different approach that worked better? Would love to hear how others handled it.