Commit ca97881

Moving all section headers up on the hierarchy
1 parent b69af93 commit ca97881

1 file changed

Lines changed: 73 additions & 0 deletions

release_notes.md

@@ -0,0 +1,73 @@
# What's new in StencilStream 4.0.0

StencilStream 4.0.0 broadens the framework from an FPGA-focused stencil
accelerator into a portable, SYCL-based 2D stencil framework that targets FPGAs,
NVIDIA GPUs, and CPUs from the same transition-function code. It also brings a
significant API cleanup, makes the FPGA backends faster through spatial
parallelism, and adds an experimental multi-FPGA execution path.

## New backends and performance features

* **GPU backend.** A new CUDA backend, built on
[Codeplay's oneAPI for NVIDIA GPUs plugin](https://developer.codeplay.com/products/oneapi/nvidia/home/index.html),
brings StencilStream to NVIDIA GPUs. A transparent
Array-of-Structs ↔ Struct-of-Arrays transformation lets the very same
transition function reach high throughput on both GPUs and FPGAs. See the
IWOCL '26 publication ([doi:10.1145/3811257.3811259](https://doi.org/10.1145/3811257.3811259))
for a detailed evaluation.
* **Spatial parallelism on the FPGA backends.** The Monotile and Tiling
backends now allow vectorizing the I/O and update kernels, raising the achievable
throughput substantially over 3.0.0. The highest single-device throughputs
measured for the new release are 176.08 GCells/s (1.58 TFLOPS) for Jacobi on
the Tiling backend and 122.67 GCells/s (1.84 TFLOPS) for HotSpot on the
Monotile backend (BittWare 520N / Intel Stratix 10 GX 2800).
* **Experimental multi-FPGA Monotile backend.** Uses the networking
capabilities of high-end FPGAs to scale a Monotile design beyond a single
device.

28+
## Breaking API changes
29+
30+
These changes affect every user upgrading from 3.0.0:
31+
32+
* **Index types.** The configurable `stencil::uindex_t` and `stencil::index_t`,
33+
along with the `STENCIL_INDEX_WIDTH` macro, have been replaced by
34+
`std::size_t` and `std::ptrdiff_t` to align with the SYCL standard. Index width
35+
narrowing is now done automatically within the FPGA backends.
36+
The `StencilStream/Index.hpp` header has been removed.
37+
* **2D coordinates.** The custom `ID` / `UID` / `GenericID` types have been
38+
replaced by `sycl::id<2>` and `sycl::range<2>`, again to align with the SYCL standard.
39+
The `StencilStream/GenericID.hpp` header has been removed.
40+
* **Index ordering.** The first index of a 2D coordinate is now the row and the
41+
second is the column, again matching the SYCL standard. Transition functions,
42+
grid construction, and accessor calls written against 3.0.0 must be updated
43+
accordingly.
44+
* **No more Boost dependency.** StencilStream no longer pulls in Boost;
45+
builds and downstream projects can drop the corresponding find/link lines.
46+
* **Internal headers reorganized.** Implementation-detail headers (helpers,
47+
I/O / memory / switch kernels, the per-backend kernel and design classes)
48+
now live under `StencilStream/internal/` and per-backend `internal/`
49+
subdirectories. Public concepts in `Concepts.hpp` and `Stencil.hpp` have
50+
been updated to use the new index types.
51+
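The index-ordering change is the easiest one to get wrong when porting, so a small sketch may help. It assumes that 3.0.0 code stored the column first, as the note above implies; the `LegacyCoordinate` struct is a hypothetical stand-in for such a coordinate, not the removed StencilStream type. To keep the sketch buildable without a SYCL toolchain, `Id2` and `Range2` are `std::array` aliases standing in for `sycl::id<2>` and `sycl::range<2>`; the row-major linearization shown is the one SYCL defines for two dimensions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Stand-ins for sycl::id<2> and sycl::range<2> (assumption: used here only
// so the sketch compiles with a plain C++ compiler).
using Id2 = std::array<std::size_t, 2>;
using Range2 = std::array<std::size_t, 2>;

// Hypothetical 3.0.0-style coordinate: column first, then row.
struct LegacyCoordinate {
    std::size_t c;
    std::size_t r;
};

// Reorder a column-first coordinate into the row-first convention.
Id2 to_row_first(LegacyCoordinate legacy) {
    return Id2{legacy.r, legacy.c};
}

// Row-major linear index: index 0 is the row, index 1 is the column.
std::size_t linearize(Id2 id, Range2 range) {
    assert(id[0] < range[0] && id[1] < range[1]);
    return id[0] * range[1] + id[1];
}

// Signed row distance between two cells, expressed as std::ptrdiff_t.
std::ptrdiff_t row_offset(Id2 from, Id2 to) {
    return static_cast<std::ptrdiff_t>(to[0]) -
           static_cast<std::ptrdiff_t>(from[0]);
}
```

With real SYCL types, the same `operator[]` accesses apply: `id[0]` is the row and `id[1]` is the column. Any code that previously passed a width before a height, or a column before a row, is a candidate for the same swap.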
## New example and documentation

* **Jacobi example.** A new example under `examples/jacobi/` provides multiple
Jacobi-kernel variants with adjustable computational complexity, and serves
as the primary benchmark in the new performance figures.
* **Documentation overhaul.** The README has been rewritten with up-to-date
build, run, and benchmarking instructions covering all backends, and the
Doxygen documentation now uses the Doxygen Awesome theme with a dark-mode
toggle.

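For readers unfamiliar with the Jacobi stencil, the following is a generic textbook sketch of one four-neighbor Jacobi sweep in plain C++. It is not the code shipped in `examples/jacobi/`, and it does not model the variants or the complexity setting mentioned above; the function name `jacobi_sweep` and the fixed-boundary handling are assumptions made for this illustration. The grid is stored row-major, with the row as the first index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One Jacobi sweep over a rows x cols grid stored in row-major order.
// Every interior cell becomes the average of its four direct neighbors;
// boundary cells keep their values (assumed fixed boundary condition).
std::vector<float> jacobi_sweep(const std::vector<float> &grid,
                                std::size_t rows, std::size_t cols) {
    assert(grid.size() == rows * cols);
    std::vector<float> next = grid; // boundary cells are copied unchanged
    for (std::size_t r = 1; r + 1 < rows; ++r) {
        for (std::size_t c = 1; c + 1 < cols; ++c) {
            const float north = grid[(r - 1) * cols + c];
            const float south = grid[(r + 1) * cols + c];
            const float west = grid[r * cols + (c - 1)];
            const float east = grid[r * cols + (c + 1)];
            // Read only from the old grid, write only to the new one.
            next[r * cols + c] = 0.25f * (north + south + west + east);
        }
    }
    return next;
}
```

Reading from one buffer and writing to another is what makes each sweep a pure function of the previous generation, which is the property a stencil framework relies on when it pipelines or parallelizes the update.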
## Build and tooling

* **Toolchain.** Validated on Intel oneAPI 24.2.1.
* **Environment setup.** Separate `scripts/env_fpga.sh` and
`scripts/env_cuda.sh` scripts replace the previous combined setup, so the
FPGA and CUDA toolchains can be loaded independently on Noctua 2.
* **Per-backend benchmark scripts.** Each example now ships
`benchmark_mono.sh`, `benchmark_tiling.sh`, and `benchmark_cuda.sh` driver
scripts on top of the shared Julia benchmark harness.
* **Standalone Conway build.** The Conway example provides a
`CMakeLists.standalone.txt` that can be used to build it outside of the
StencilStream source tree.
