| 1 | +# What's new in StencilStream 4.0.0 |
| 2 | + |
| 3 | +StencilStream 4.0.0 broadens the framework from an FPGA-focused stencil |
| 4 | +accelerator into a portable, SYCL-based 2D stencil framework that targets FPGAs, |
| 5 | +NVIDIA GPUs, and CPUs from the same transition-function code. It also brings a |
| 6 | +significant API cleanup, makes the FPGA backends faster through spatial |
| 7 | +parallelism, and adds an experimental multi-FPGA execution path. |
| 8 | + |
| 9 | +## New backends and performance features |
| 10 | + |
| 11 | +* **GPU backend.** A new CUDA backend, built on |
| 12 | + [Codeplay's oneAPI for NVIDIA GPUs plugin](https://developer.codeplay.com/products/oneapi/nvidia/home/index.html), |
| 13 | + brings StencilStream to NVIDIA GPUs. A transparent |
| 14 | + Array-of-Structs ↔ Struct-of-Arrays transformation lets the very same |
| 15 | + transition function reach high throughput on both GPUs and FPGAs. See the |
| 16 | + IWOCL '26 publication ([doi:10.1145/3811257.3811259](https://doi.org/10.1145/3811257.3811259)) |
| 17 | + for a detailed evaluation. |
| 18 | +* **Spatial parallelism on the FPGA backends.** The Monotile and Tiling |
| 19 | + backends now allow vectorizing the I/O and update kernels, raising the achievable |
| 20 | + throughput substantially over 3.0.0. The highest single-device cell throughput |
| 21 | + measured for the new release is 176.08 GCells/s (1.58 TFLOPS) for Jacobi on |
| 22 | + the Tiling backend; the highest arithmetic throughput is 1.84 TFLOPS (122.67 GCells/s) for HotSpot on the |
| 23 | + Monotile backend (BittWare 520N / Intel Stratix 10 GX 2800). |
| 24 | +* **Experimental multi-FPGA Monotile backend.** Uses the networking |
| 25 | + capabilities of high-end FPGAs to scale a Monotile design beyond a single |
| 26 | + device. |
| 27 | + |
| 28 | +## Breaking API changes |
| 29 | + |
| 30 | +These changes affect every user upgrading from 3.0.0: |
| 31 | + |
| 32 | +* **Index types.** The configurable `stencil::uindex_t` and `stencil::index_t`, |
| 33 | + along with the `STENCIL_INDEX_WIDTH` macro, have been replaced by |
| 34 | + `std::size_t` and `std::ptrdiff_t` to align with the SYCL standard. Index width |
| 35 | + narrowing is now done automatically within the FPGA backends. |
| 36 | + The `StencilStream/Index.hpp` header has been removed. |
| 37 | +* **2D coordinates.** The custom `ID` / `UID` / `GenericID` types have been |
| 38 | + replaced by `sycl::id<2>` and `sycl::range<2>`, again to align with the SYCL standard. |
| 39 | + The `StencilStream/GenericID.hpp` header has been removed. |
| 40 | +* **Index ordering.** The first index of a 2D coordinate is now the row and the |
| 41 | + second is the column, again matching the SYCL standard. Transition functions, |
| 42 | + grid construction, and accessor calls written against 3.0.0 must be updated |
| 43 | + accordingly. |
| 44 | +* **No more Boost dependency.** StencilStream no longer pulls in Boost; |
| 45 | + builds and downstream projects can drop the corresponding find/link lines. |
| 46 | +* **Internal headers reorganized.** Implementation-detail headers (helpers, |
| 47 | + I/O / memory / switch kernels, the per-backend kernel and design classes) |
| 48 | + now live under `StencilStream/internal/` and per-backend `internal/` |
| 49 | + subdirectories. Public concepts in `Concepts.hpp` and `Stencil.hpp` have |
| 50 | + been updated to use the new index types. |
| 51 | + |
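The index-type and index-ordering changes above can be illustrated with a small, SYCL-free sketch. It is not StencilStream code: `Coord` and `Extent` are hypothetical stand-ins for `sycl::id<2>` and `sycl::range<2>` (both of which expose their components through `operator[]`), and the helper names `linearize`, `from_column_row`, and `offset_in_range` are invented for illustration. It also assumes that 3.0.0 code passed the column first, as the index-ordering note implies.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for sycl::id<2>: element 0 is the row, element 1 the column.
using Coord = std::array<std::size_t, 2>;
// Hypothetical stand-in for sycl::range<2>: {number of rows, number of columns}.
using Extent = std::array<std::size_t, 2>;

// Row-major linear index of a cell, using the (row, column) ordering.
inline std::size_t linearize(const Coord &id, const Extent &range) {
    assert(id[0] < range[0] && id[1] < range[1]);
    return id[0] * range[1] + id[1];
}

// Reorder an old-style (column, row) pair into the new (row, column) order.
// Assumption: the old call sites passed the column as the first index.
inline Coord from_column_row(std::size_t column, std::size_t row) {
    return Coord{row, column};
}

// Check whether a signed neighbor offset stays inside the grid.
// Offsets use std::ptrdiff_t, coordinates use std::size_t.
inline bool offset_in_range(const Coord &id, std::ptrdiff_t d_row,
                            std::ptrdiff_t d_col, const Extent &range) {
    const std::ptrdiff_t r = static_cast<std::ptrdiff_t>(id[0]) + d_row;
    const std::ptrdiff_t c = static_cast<std::ptrdiff_t>(id[1]) + d_col;
    return r >= 0 && c >= 0 &&
           static_cast<std::size_t>(r) < range[0] &&
           static_cast<std::size_t>(c) < range[1];
}
```

When porting real code, the same reordering applies wherever a coordinate or grid extent is constructed or indexed: the row goes first, the column second.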
| 52 | +## New example and documentation |
| 53 | + |
| 54 | +* **Jacobi example.** A new example under `examples/jacobi/` provides multiple |
| 55 | + Jacobi-kernel variants with adjustable computational complexity, and serves |
| 56 | + as the primary benchmark in the new performance figures. |
| 57 | +* **Documentation overhaul.** The README has been rewritten with up-to-date |
| 58 | + build, run, and benchmarking instructions covering all backends, and the |
| 59 | + Doxygen documentation now uses the Doxygen Awesome theme with a dark-mode |
| 60 | + toggle. |
| 61 | + |
| 62 | +## Build and tooling |
| 63 | + |
| 64 | +* **Toolchain.** Validated on Intel oneAPI 24.2.1. |
| 65 | +* **Environment setup.** Separate `scripts/env_fpga.sh` and |
| 66 | + `scripts/env_cuda.sh` scripts replace the previous combined setup, so the |
| 67 | + FPGA and CUDA toolchains can be loaded independently on Noctua 2. |
| 68 | +* **Per-backend benchmark scripts.** Each example now ships |
| 69 | + `benchmark_mono.sh`, `benchmark_tiling.sh`, and `benchmark_cuda.sh` driver |
| 70 | + scripts on top of the shared Julia benchmark harness. |
| 71 | +* **Standalone Conway build.** The Conway example provides a |
| 72 | + `CMakeLists.standalone.txt` that can be used to build it outside of the |
| 73 | + StencilStream source tree. |