Commit fd1c5ca

Merge pull request #5705 from OpenMathLib/develop
Merge from develop for 0.3.32 release
2 parents 76f1be4 + 52178f7 commit fd1c5ca

196 files changed

Lines changed: 7336 additions & 8688 deletions

.cirrus.yml

Lines changed: 1 addition & 1 deletion
@@ -151,7 +151,7 @@ FreeBSD_task:
 image_family: freebsd-14-3
 install_script:
 - pkg update -f && pkg upgrade -y && pkg install -y gmake gcc
-- ln -s /usr/local/lib/gcc13/libgfortran.so.5.0.0 /usr/lib/libgfortran.so
+- ln -s /usr/local/lib/gcc14/libgfortran.so.5.0.0 /usr/lib/libgfortran.so
 compile_script:
 - gmake CC=clang FC=gfortran USE_OPENMP=1 CPP_THREAD_SAFETY_TEST=1

.github/workflows/apple_m.yml

Lines changed: 1 addition & 0 deletions
@@ -99,6 +99,7 @@ jobs:
 run: |
 export CPPFLAGS="-I/opt/homebrew/opt/llvm/include"
 export CC="/opt/homebrew/opt/llvm/bin/clang"
+export RANLIB=llvm-ranlib
 case "${{ matrix.build }}" in
 "make")
 make -j$(nproc) DYNAMIC_ARCH=1 USE_OPENMP=${{matrix.openmp}} INTERFACE64=${{matrix.ilp64}} FC="ccache ${{ matrix.fortran }}"

CMakeLists.txt

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ project(OpenBLAS C ASM)

 set(OpenBLAS_MAJOR_VERSION 0)
 set(OpenBLAS_MINOR_VERSION 3)
-set(OpenBLAS_PATCH_VERSION 31)
+set(OpenBLAS_PATCH_VERSION 31.dev)

 set(OpenBLAS_VERSION "${OpenBLAS_MAJOR_VERSION}.${OpenBLAS_MINOR_VERSION}.${OpenBLAS_PATCH_VERSION}")

CONTRIBUTORS.md

Lines changed: 3 additions & 0 deletions
@@ -272,3 +272,6 @@ In chronological order:

 * Anna Mayne <anna.mayne@arm.com>
   * [2025-11-19] Update thread throttling profile for SGEMV on NEOVERSEV1 and NEOVERSEV2
+
+* Fadi Arafeh <fadi.arafeh@arm.com>
+  * [2026-03-05] Accelerate SVE128 SBGEMM/BGEMM

Changelog.txt

Lines changed: 84 additions & 1 deletion
@@ -1,7 +1,90 @@
 OpenBLAS ChangeLog
+====================================================================
+Version 0.3.32
+23-Mar-2026
+
+general:
+  - Moved the preliminary support for a Web Assembly target to its own WASM
+    architecture and WASM128_GENERIC target
+  - Fixed a potential performance difference between dedicated compilation for
+    a target and its representation in DYNAMIC_ARCH builds by making additional
+    cpu-specific parameters available to the DYNAMIC_ARCH configuration
+  - Fixed the reimplementation of LAPACK ?GESV to conform to the reference (i.e.
+    compute the LU factorization even when NRHS is zero)
+  - Improved the error message that is displayed when the compile-time allocation
+    of memory buffers is exceeded
+  - Fixed a problem with non-serialized accesses to parallelized SYRK by concurrent
+    callers
+  - Fixed an ABI mismatch in the internal version of CDOT/ZDOT used by the C fallback
+    versions of the LAPACK source
+  - Improved the f_check script for detecting the Fortran compiler to handle embedded
+    dashes in path names
+  - Fixed several memory access issues in the utests that were detected by Address
+    Sanitizer
+  - Fixed Makefile errors in cases where only a subset of precision types was selected
+  - Fixed missing function errors in Makefile builds without LAPACK or without threads
+  - Fixed a syntax error in the benchmarks Makefile
+  - Fixed compiler warnings in the CBLAS testsuite
+  - Fixed the OpenMP compiler option used with the Intel Ifx compiler
+  - Updated the README sections on supported cpus and operating systems, and added
+    notes pertaining to JAVA
+  - Updated the documentation page for supported BLAS-like extensions
+  - included fixes from the Reference-LAPACK project:
+    - Improved step length selection in the fallback path of ?LAED4
+      (Reference-LAPACK PR 1191)
+    - Rounding up of LWORK and removal of redundant type conversions in the GVD
+      functions (Reference-LAPACK PR 1202)
+    - internal errors were getting ignored in calculation of selected eigenvalues
+      (Reference-LAPACK PR 1204)
+
+arm64:
+  - Fixed a potential miscompilation of the SDOT/DDOT/DSDOT kernels
+  - Fixed DYNAMIC_ARCH compilation with CMake and compilers lacking SVE support
+  - Improved the performance of BGEMM and SBGEMM kernels for Neoverse V2
+  - Added optimized SSUM and DSUM kernels for Neoverse N1
+  - Added preliminary support for Neoverse V3 cpus as NEOVERSEV2
+  - Added cpu autodetection of Cortex A725 and X925 cpus
+  - Fixed a CMake build problem with flang on Mac OS
+  - Fixed build problems with gcc versions 12 and earlier that do not support fp16
+  - Fixed compilation of GEMM kernels for VORTEXM4/ARMV9SME without multithreading
+  - Fixed the optimized CDOT/ZDOT kernel to compile with LLVM under Windows on Arm
+  - Renamed the copy of the DllMain function used in static linking on MS Windows to
+    OpenBLASDllMain to avoid symbol name conflicts with other libraries
+
+loongarch64:
+  - Fixed POTRF returning wrong results on LA464 due to a wrong parameter setting
+
+power:
+  - Fixed compilation problems caused by missing support for half-precision floats (FP16)
+  - Fixed a potential miscompilation of the POWER10 DGEMV kernel by limiting its optimization
+    level
+  - Fixed a SCAL issue on PPCG4/PPC970 running Linux
+  - Worked around a SCAL issue on PPC970 running FreeBSD by switching to the generic C kernels
+
+riscv64:
+  - Optimized the CROT/ZROT kernel for vector length 128 in the non-unit stride path
+  - Improved SBGEMM/SHGEMM and related helper functions for type conversion
+  - Fixed probing for BFLOAT16 support in DYNAMIC_ARCH cpu detection at runtime
+
+x86_64:
+  - Fixed a potential miscompilation (by gcc 15.x) of the AVX512 SGEMM kernel for "small"
+    matrix sizes
+  - Fixed the SROT and DROT kernels for Haswell to have consistent (FMA) rounding
+    in the main loop and tail call
+  - Added automatic detection of Intel Arrow Lake H/U, Panther Lake and Jasper Lake
+  - Added automatic detection of Intel Emerald Rapids and upcoming cpu models
+  - Updated the cache size translation table in the cpu model autodetection code
+  - Improved cpu detection fallback to also include Nehalem as a non-AVX option
+  - Fixed a Makefile build issue with clang and the SkylakeX SGEMM kernel
+  - Renamed the copy of the DllMain function used in static linking on MS Windows to
+    OpenBLASDllMain to avoid symbol name conflicts with other libraries
+
+wasm:
+  - Added optimized intrinsics kernels for SGEMM and DGEMM as well as DOT, ROT and SUM
+
 ====================================================================
 Version 0.3.31
-15-Jan-2025
+15-Jan-2026

 general:
   - reverted a matrix partitioning optimization from 0.3.30 that could lead to
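
The ?GESV entry in the 0.3.32 changelog says the routine must compute the LU factorization even when NRHS is zero. The following is a minimal pure-Python sketch of that contract, not OpenBLAS code. The function name `gesv` and the permutation-vector form of `piv` are choices made for this illustration; LAPACK itself reports 1-based row interchanges in IPIV.

```python
def gesv(a, b):
    """Solve A X = B by LU factorization with partial pivoting.

    a is an n x n matrix and b an n x nrhs matrix, both as lists of rows.
    Returns (lu, piv, x). The factorization is always carried out, even
    when b has zero columns (nrhs == 0).
    """
    n = len(a)
    lu = [list(row) for row in a]  # L (below the diagonal) and U share storage
    piv = list(range(n))           # piv[i] = original index of the row now at i

    for k in range(n):
        # Partial pivoting: move the largest entry of column k onto the diagonal.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0:
            raise ZeroDivisionError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]

    nrhs = len(b[0]) if b else 0
    x = [[0.0] * nrhs for _ in range(n)]
    for c in range(nrhs):          # skipped entirely when nrhs == 0
        y = [b[piv[i]][c] for i in range(n)]
        for i in range(n):         # forward substitution with unit-lower L
            y[i] -= sum(lu[i][j] * y[j] for j in range(i))
        for i in reversed(range(n)):  # back substitution with U
            y[i] = (y[i] - sum(lu[i][j] * y[j] for j in range(i + 1, n))) / lu[i][i]
        for i in range(n):
            x[i][c] = y[i]
    return lu, piv, x


a = [[2.0, 1.0], [4.0, 3.0]]
lu, piv, x = gesv(a, [[], []])     # nrhs == 0: nothing to solve
print(lu, piv, x)                  # the factors are still produced
_, _, x = gesv(a, [[3.0], [7.0]])
print(x)
```

With zero right-hand sides the call still returns the factors and the pivot order, so a caller that only wants the factorization gets the same result as from a dedicated factorization routine.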

Makefile.rule

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 #

 # This library's version
-VERSION = 0.3.31
+VERSION = 0.3.31.dev

 # If you set this prefix, the library name will be lib$(LIBNAMESUFFIX)openblas.a
 # and lib$(LIBNAMESUFFIX)openblas.so, with a matching soname in the shared library

Makefile.wasm

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+CCOMMON_OPT += -msimd128

Makefile.x86_64

Lines changed: 5 additions & 0 deletions
@@ -61,6 +61,9 @@ endif
 ifeq ($(CORE), SKYLAKEX)
 ifndef NO_AVX512
 CCOMMON_OPT += -march=skylake-avx512
+ifeq ($(C_COMPILER), CLANG)
+CCOMMON_OPT += -mllvm -exhaustive-register-search
+endif
 ifneq ($(F_COMPILER), NAG)
 FCOMMON_OPT += -march=skylake-avx512
 endif
@@ -93,6 +96,7 @@ ifeq ($(C_COMPILER), GCC)
 endif
 endif
 else ifeq ($(C_COMPILER), CLANG)
+CCOMMON_OPT += -mllvm -exhaustive-register-search
 # cooperlake support was added in clang 9
 ifeq ($(CLANGVERSIONGTEQ9), 1)
 CCOMMON_OPT += -march=cooperlake
@@ -135,6 +139,7 @@ ifeq ($(C_COMPILER), GCC)
 endif
 endif
 else ifeq ($(C_COMPILER), CLANG)
+CCOMMON_OPT += -mllvm -exhaustive-register-search
 # sapphire rapids support was added in clang 12
 ifeq ($(CLANGVERSIONGTEQ12), 1)
 CCOMMON_OPT += -march=sapphirerapids
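
The three Makefile.x86_64 hunks add the same clang-only option in three places. As a rough reading aid, the selection can be modelled as below in Python rather than Make. The function name `added_clang_flags` is hypothetical. Only the SKYLAKEX block and its `NO_AVX512` guard are visible in the diff context; that the other two hunks sit in COOPERLAKE and SAPPHIRERAPIDS blocks with the same guard is an assumption inferred from the adjacent comments.

```python
# Cores whose blocks appear to gain the option in this diff.
# SKYLAKEX is visible in the hunk context; the other two are inferred.
AFFECTED_CORES = {"SKYLAKEX", "COOPERLAKE", "SAPPHIRERAPIDS"}


def added_clang_flags(core, c_compiler, no_avx512=False):
    """Return the extra C flags this diff appends for a core/compiler pair."""
    if no_avx512 or c_compiler != "CLANG" or core not in AFFECTED_CORES:
        return []
    return ["-mllvm", "-exhaustive-register-search"]


print(added_clang_flags("SKYLAKEX", "CLANG"))
print(added_clang_flags("SKYLAKEX", "GCC"))
```

The model covers only the newly added option; the existing `-march` selection and the clang version checks in the surrounding Makefile are left out.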

README.md

Lines changed: 24 additions & 7 deletions
@@ -148,11 +148,12 @@ Please read `GotoBLAS_01Readme.txt` for older CPU models already supported by th
 - **Intel Haswell**: Optimized Level-3 and Level-2 BLAS with AVX2 and FMA on x86-64.
 - **Intel Skylake-X**: Optimized Level-3 and Level-2 BLAS with AVX512 and FMA on x86-64.
 - **Intel Cooper Lake**: as Skylake-X with improved BFLOAT16 support.
+- **Intel Sapphire Rapids**: as Cooper Lake with improved BFLOAT16 SBGEMM kernel.
 - **AMD Bobcat**: Used GotoBLAS2 Barcelona codes.
 - **AMD Bulldozer**: x86-64 ?GEMM FMA4 kernels. (Thanks to Werner Saar)
 - **AMD PILEDRIVER**: Uses Bulldozer codes with some optimizations.
 - **AMD STEAMROLLER**: Uses Bulldozer codes with some optimizations.
-- **AMD ZEN**: Uses Haswell codes with some optimizations for Zen 2/3 (use SkylakeX for Zen4)
+- **AMD ZEN**: Uses Haswell codes with some optimizations for Zen 2/3, SkylakeX for Zen4, Cooperlake for Zen5

 #### MIPS32
@@ -186,9 +187,13 @@ Please read `GotoBLAS_01Readme.txt` for older CPU models already supported by th
 - **EMAG 8180**: preliminary support based on A57
 - **Neoverse N1**: (AWS Graviton2) preliminary support
 - **Neoverse V1**: (AWS Graviton3) optimized Level-3 BLAS
+- **Neoverse N2**: preliminary support
+- **Neoverse V2**: preliminary support
 - **Apple Vortex**: preliminary support based on ThunderX2/3
+- **Apple VortexM4**: preliminary support based on ThunderX2/3, SME kernels for SGEMM,SSYMM,STRMM,SSYRK,SSYR2K
 - **A64FX**: preliminary support, optimized Level-3 BLAS
 - **ARMV8SVE**: any ARMV8 cpu with SVE extensions
+- **ARMV9SME**: any ARMV9 cpu with SVE and SME extensions

 #### PPC/PPC64
@@ -249,9 +254,15 @@ e.g.:
 ```
 The old-style TARGET=LOONGSON3R5 is still supported

+#### WASM
+Not a cpu target in the strict sense, but portable WebAssembly for browser-based applications and the like. See emscripten.org for the compiler and related information
+
+- **WASM128_GENERIC**: Optimized SGEMM,DGEMM, DAXPY, SSUM/DSUM, SDOT/DDOT and SROT/DROT
+
+
 ### Support for multiple targets in a single library

-OpenBLAS can be built for multiple targets with runtime detection of the target cpu by specifiying `DYNAMIC_ARCH=1` in Makefile.rule, on the gmake command line or as `-DDYNAMIC_ARCH=TRUE` in cmake.
+OpenBLAS can be built for multiple targets with runtime detection of the target cpu by specifying `DYNAMIC_ARCH=1` in Makefile.rule, on the gmake command line or as `-DDYNAMIC_ARCH=TRUE` in cmake.

 For **x86_64**, the list of targets this activates contains Prescott, Core2, Nehalem, Barcelona, Sandybridge, Bulldozer, Piledriver, Steamroller, Excavator, Haswell, Zen, SkylakeX, Cooper Lake, Sapphire Rapids. For cpu generations not included in this list, the corresponding older model is used. If you also specify `DYNAMIC_OLDER=1`, specific support for Penryn, Dunnington, Opteron, Opteron/SSE3, Bobcat, Atom and Nano is added. Finally there is an option `DYNAMIC_LIST` that allows to specify an individual list of targets to include instead of the default.
@@ -277,23 +288,29 @@ Please note that it is not possible to combine support for different architectur
 ### Supported OS

 - **GNU/Linux**
-- **MinGW or Visual Studio (CMake)/Windows**: Please read <https://github.com/xianyi/OpenBLAS/wiki/How-to-use-OpenBLAS-in-Microsoft-Visual-Studio>.
-- **Darwin/macOS/OSX/iOS**: Experimental. Although GotoBLAS2 already supports Darwin, we are not OSX/iOS experts.
+- **MinGW or Visual Studio (CMake)/Windows**: Please read <https://github.com/OpenMathLib/OpenBLAS/docs/install.md#visual-studio-native-windows-abi>.
+- **Darwin/macOS/OSX/iOS**: Already supported on PPC and x86 by the original GotoBLAS, now also on ARM64 but we are not OSX/iOS experts.
 - **FreeBSD**: Supported by the community. We don't actively test the library on this OS.
 - **OpenBSD**: Supported by the community. We don't actively test the library on this OS.
 - **NetBSD**: Supported by the community. We don't actively test the library on this OS.
 - **DragonFly BSD**: Supported by the community. We don't actively test the library on this OS.
-- **Android**: Supported by the community. Please read <https://github.com/xianyi/OpenBLAS/wiki/How-to-build-OpenBLAS-for-Android>.
-- **AIX**: Supported on PPC up to POWER10
+- **Android**: Supported by the community. Please read <https://github.com/OpenMathLib/OpenBLAS/docs/install.md#android>.
+- **AIX**: Supported on PPC up to POWER10 but testing is increasingly problematic due to lack of publicly available systems
 - **Haiku**: Supported by the community. We don't actively test the library on this OS.
 - **SunOS**: Supported by the community. We don't actively test the library on this OS.
-- **Cortex-M**: Supported by the community. Please read <https://github.com/xianyi/OpenBLAS/wiki/How-to-use-OpenBLAS-on-Cortex-M>.
+- **Cortex-M**: Supported by the community. Please read <https://github.com/OpenMathLib/OpenBLAS/docs/install.md#cortex-m>.

 ## Usage

 Statically link with `libopenblas.a` or dynamically link with `-lopenblas` if OpenBLAS was
 compiled as a shared library.

+### Considerations for using the library from Java
+
+The default stack size of only 1MB may be too small, especially if you built OpenBLAS to support larger matrix sizes than provided for by the default settings. Use the -Xss option to request a larger stack size if you encounter problems.
+
+When a Windows build of OpenBLAS was created using the MINGW gfortran (for the LAPACK parts), the Java application may hang on startup due to a deadlock between the gfortran runtime library initialization and any pipes created by a Win11/SBT/Play Framework environment. Use -Djdk.console=jdk.internal.le to work around this.
+
 ### Setting the number of threads using environment variables

 Environment variables are used to specify a maximum number of threads.

TargetList.txt

Lines changed: 4 additions & 0 deletions
@@ -153,3 +153,7 @@ EV6
 14.CSKY
 CSKY
 CK860FV
+
+15. WebAssembly/Emscripten:
+WASM128_GENERIC
+
