OpenMathLib
diff --git a/.cirrus.yml b/.cirrus.yml
Lines changed: 1 addition & 1 deletion
diff --git a/.github/workflows/apple_m.yml b/.github/workflows/apple_m.yml
Lines changed: 1 addition & 0 deletions
diff --git a/CMakeLists.txt b/CMakeLists.txt
Lines changed: 1 addition & 1 deletion
diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md
Lines changed: 3 additions & 0 deletions
diff --git a/Changelog.txt b/Changelog.txt
Lines changed: 84 additions & 1 deletion
diff --git a/Makefile.rule b/Makefile.rule
Lines changed: 1 addition & 1 deletion
diff --git a/Makefile.wasm b/Makefile.wasm
Lines changed: 1 addition & 0 deletions
diff --git a/Makefile.x86_64 b/Makefile.x86_64
Lines changed: 5 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 24 additions & 7 deletions
diff --git a/TargetList.txt b/TargetList.txt
Lines changed: 4 additions & 0 deletions
@@ -151,7 +151,7 @@ FreeBSD_task:
     image_family: freebsd-14-3
   install_script:
   - pkg update -f && pkg upgrade -y && pkg install -y gmake gcc 
-  - ln -s /usr/local/lib/gcc13/libgfortran.so.5.0.0 /usr/lib/libgfortran.so
+  - ln -s /usr/local/lib/gcc14/libgfortran.so.5.0.0 /usr/lib/libgfortran.so
   compile_script:
   - gmake CC=clang FC=gfortran USE_OPENMP=1 CPP_THREAD_SAFETY_TEST=1
 
 
@@ -99,6 +99,7 @@ jobs:
         run: |
           export CPPFLAGS="-I/opt/homebrew/opt/llvm/include"
           export CC="/opt/homebrew/opt/llvm/bin/clang"
+          export RANLIB=llvm-ranlib
           case "${{ matrix.build }}" in
             "make")
               make -j$(nproc) DYNAMIC_ARCH=1 USE_OPENMP=${{matrix.openmp}} INTERFACE64=${{matrix.ilp64}} FC="ccache ${{ matrix.fortran }}"
 
@@ -9,7 +9,7 @@ project(OpenBLAS C ASM)
 
 set(OpenBLAS_MAJOR_VERSION 0)
 set(OpenBLAS_MINOR_VERSION 3)
-set(OpenBLAS_PATCH_VERSION 31)
+set(OpenBLAS_PATCH_VERSION 31.dev)
 
 set(OpenBLAS_VERSION "${OpenBLAS_MAJOR_VERSION}.${OpenBLAS_MINOR_VERSION}.${OpenBLAS_PATCH_VERSION}")
 
 
@@ -272,3 +272,6 @@ In chronological order:
 
 * Anna Mayne <anna.mayne@arm.com>
   * [2025-11-19] Update thread throttling profile for SGEMV on NEOVERSEV1 and NEOVERSEV2
+
+* Fadi Arafeh <fadi.arafeh@arm.com>
+  * [2026-03-05] Accelerate SVE128 SBGEMM/BGEMM
@@ -1,7 +1,90 @@
 OpenBLAS ChangeLog
+====================================================================
+Version 0.3.32
+23-Mar-2026
+
+general:
+ - Moved the preliminary support for a Web Assembly target to its own WASM
+   architecture and WASM128_GENERIC target
+ - Fixed a potential performance difference between dedicated compilation for
+   a target and its representation in DYNAMIC_ARCH builds by making additional
+   cpu-specific parameters available to the DYNAMIC_ARCH configuration
+ - Fixed the reimplementation of LAPACK ?GESV to conform to the reference (i.e.
+   compute the LU factorization even when NRHS is zero)
+ - Improved the error message that is displayed when the compile-time allocation
+   of memory buffers is exceeded
+ - Fixed a problem with non-serialized accesses to parallelized SYRK by concurrent
+   callers
+ - Fixed an ABI mismatch in the internal version of CDOT/ZDOT used by the C fallback
+   versions of the LAPACK source
+ - Improved the f_check script for detecting the Fortran compiler to handle embedded
+   dashes in path names
+ - Fixed several memory access issues in the utests that were detected by Address
+   Sanitizer
+ - Fixed Makefile errors in cases where only a subset of precision types was selected
+ - Fixed missing function errors in Makefile builds without LAPACK or without threads
+ - Fixed a syntax error in the benchmarks Makefile
+ - Fixed compiler warnings in the CBLAS testsuite
+ - Fixed the OpenMP compiler option used with the Intel Ifx compiler
+ - Updated the README sections on supported cpus and operating systems, and added
+   notes pertaining to JAVA
+ - Updated the documentation page for supported BLAS-like extensions
+ - included fixes from the Reference-LAPACK project:
+   - Improved step length selection in the fallback path of ?LAED4 
+     (Reference-LAPACK PR 1191)
+   - Rounding up of LWORK and removal of redundant type conversions in the GVD
+     functions (Reference-LAPACK PR 1202)
+   - internal errors were getting ignored in calculation of selected eigenvalues
+     (Reference-LAPACK PR 1204)
+
+arm64:
+ - Fixed a potential miscompilation of the SDOT/DDOT/DSDOT kernels
+ - Fixed DYNAMIC_ARCH compilation with CMake and compilers lacking SVE support
+ - Improved the performance of BGEMM and SBGEMM kernels for Neoverse V2
+ - Added optimized SSUM and DSUM kernels for Neoverse N1
+ - Added preliminary support for Neoverse V3 cpus as NEOVERSEV2
+ - Added cpu autodetection of Cortex A725 and X925 cpus
+ - Fixed a CMake build problem with flang on Mac OS
+ - Fixed build problems with gcc versions 12 and earlier that do not support fp16
+ - Fixed compilation of GEMM kernels for VORTEXM4/ARMV9SME without multithreading
+ - Fixed the optimized CDOT/ZDOT kernel to compile with LLVM under Windows on Arm
+ - Renamed the copy of the DllMain function used in static linking on MS Windows to
+   OpenBLASDllMain to avoid symbol name conflicts with other libraries
+
+loongarch64:
+ - fixed POTRF returning wrong results on LA464 due to a wrong parameter setting
+
+power:
+ - Fixed compilation problems caused by missing support for half-precision floats (FP16)
+ - Fixed a potential miscompilation of the POWER10 DGEMV kernel by limiting its optimization
+   level
+ - Fixed a SCAL issue on PPCG4/PPC970 running Linux
+ - Worked around a SCAL issue on PPC970 running FreeBSD by switching to the generic C kernels
+
+riscv64:
+ - Optimized the CROT/ZROT kernel for vector length 128 in the non-unit stride path
+ - Improved SBGEMM/SHGEMM and related helper functions for type conversion
+ - Fixed probing for BFLOAT16 support in DYNAMIC_ARCH cpu detection at runtime
+
+x86_64:
+ - Fixed a potential miscompilation (by gcc 15.x) of the AVX512 SGEMM kernel for "small"
+   matrix sizes
+ - Fixed the SROT and DROT kernels for Haswell to have consistent (FMA) rounding
+   in the main loop and tail call
+ - Added automatic detection of Intel Arrow Lake H/U, Panther Lake and Jasper Lake
+ - Added automatic detection of Intel Emerald Rapids and upcoming cpu models
+ - Updated the cache size translation table in the cpu model autodetection code
+ - Improved cpu detection fallback to also include Nehalem as a non-AVX option  
+ - Fixed a Makefile build issue with clang and the SkylakeX SGEMM kernel 
+ - Renamed the copy of the DllMain function used in static linking on MS Windows to
+   OpenBLASDllMain to avoid symbol name conflicts with other libraries
+
+wasm:
+ - Added optimized intrinsics kernels for SGEMM and DGEMM as well as DOT, ROT and SUM
+
 ====================================================================
 Version 0.3.31
-15-Jan-2025
+15-Jan-2026
 
 general:
  - reverted a matrix partitioning optimization from 0.3.30 that could lead to 
 
@@ -3,7 +3,7 @@
 #
 
 # This library's version
-VERSION = 0.3.31
+VERSION = 0.3.31.dev
 
 # If you set this prefix, the library name will be lib$(LIBNAMESUFFIX)openblas.a
 # and lib$(LIBNAMESUFFIX)openblas.so, with a matching soname in the shared library
 
@@ -0,0 +1 @@
+CCOMMON_OPT += -msimd128
@@ -61,6 +61,9 @@ endif
 ifeq ($(CORE), SKYLAKEX)
 ifndef NO_AVX512
 CCOMMON_OPT += -march=skylake-avx512
+ifeq ($(C_COMPILER), CLANG)
+CCOMMON_OPT += -mllvm -exhaustive-register-search
+endif
 ifneq ($(F_COMPILER), NAG)
 FCOMMON_OPT += -march=skylake-avx512
 endif
@@ -93,6 +96,7 @@ ifeq ($(C_COMPILER), GCC)
   endif
  endif
 else ifeq ($(C_COMPILER), CLANG)
+ CCOMMON_OPT += -mllvm -exhaustive-register-search
  # cooperlake support was added in clang 9
  ifeq ($(CLANGVERSIONGTEQ9), 1)
   CCOMMON_OPT += -march=cooperlake
@@ -135,6 +139,7 @@ ifeq ($(C_COMPILER), GCC)
   endif
  endif
 else ifeq ($(C_COMPILER), CLANG)
+  CCOMMON_OPT += -mllvm -exhaustive-register-search
  # sapphire rapids support was added in clang 12
  ifeq ($(CLANGVERSIONGTEQ12), 1)
   CCOMMON_OPT += -march=sapphirerapids
 
@@ -148,11 +148,12 @@ Please read `GotoBLAS_01Readme.txt` for older CPU models already supported by th
 - **Intel Haswell**: Optimized Level-3 and Level-2 BLAS with AVX2 and FMA on x86-64.
 - **Intel Skylake-X**: Optimized Level-3 and Level-2 BLAS with AVX512 and FMA on x86-64.
 - **Intel Cooper Lake**: as Skylake-X with improved BFLOAT16 support.
+- **Intel Sapphire Rapids**: as Cooper Lake with improved BFLOAT16 SBGEMM kernel.
 - **AMD Bobcat**: Used GotoBLAS2 Barcelona codes.
 - **AMD Bulldozer**: x86-64 ?GEMM FMA4 kernels. (Thanks to Werner Saar)
 - **AMD PILEDRIVER**: Uses Bulldozer codes with some optimizations.
 - **AMD STEAMROLLER**: Uses Bulldozer codes with some optimizations.
-- **AMD ZEN**: Uses Haswell codes with some optimizations for Zen 2/3 (use SkylakeX for Zen4)
+- **AMD ZEN**: Uses Haswell codes with some optimizations for Zen 2/3, SkylakeX for Zen4, Cooperlake for Zen5
 
 #### MIPS32
 
@@ -186,9 +187,13 @@ Please read `GotoBLAS_01Readme.txt` for older CPU models already supported by th
 - **EMAG 8180**: preliminary support based on A57
 - **Neoverse N1**: (AWS Graviton2) preliminary support
 - **Neoverse V1**: (AWS Graviton3) optimized Level-3 BLAS
+- **Neoverse N2**: preliminary support
+- **Neoverse V2**: preliminary support
 - **Apple Vortex**: preliminary support based on ThunderX2/3
+- **Apple VortexM4**: preliminary support based on ThunderX2/3, SME kernels for SGEMM,SSYMM,STRMM,SSYRK,SSYR2K
 - **A64FX**:  preliminary support, optimized Level-3 BLAS
 - **ARMV8SVE**: any ARMV8 cpu with SVE extensions 
+- **ARMV9SME**: any ARMV9 cpu with SVE and SME extensions 
 
 #### PPC/PPC64
 
@@ -249,9 +254,15 @@ e.g.:
   ```
   The old-style TARGET=LOONGSON3R5 is still supported
 
+#### WASM
+  Not a cpu target in the strict sense, but portable WebAssembly for browser-based applications and the like. See emscripten.org for the compiler and related information
+
+- **WASM128_GENERIC**: Optimized SGEMM,DGEMM, DAXPY, SSUM/DSUM, SDOT/DDOT and SROT/DROT
+
+
 ### Support for multiple targets in a single library
 
-OpenBLAS can be built for multiple targets with runtime detection of the target cpu by specifiying `DYNAMIC_ARCH=1` in Makefile.rule, on the gmake command line or as `-DDYNAMIC_ARCH=TRUE` in cmake.
+OpenBLAS can be built for multiple targets with runtime detection of the target cpu by specifying `DYNAMIC_ARCH=1` in Makefile.rule, on the gmake command line or as `-DDYNAMIC_ARCH=TRUE` in cmake.
 
 For **x86_64**, the list of targets this activates contains Prescott, Core2, Nehalem, Barcelona, Sandybridge, Bulldozer, Piledriver, Steamroller, Excavator, Haswell, Zen, SkylakeX, Cooper Lake, Sapphire Rapids. For cpu generations not included in this list, the corresponding older model is used. If you also specify `DYNAMIC_OLDER=1`, specific support for Penryn, Dunnington, Opteron, Opteron/SSE3, Bobcat, Atom and Nano is added. Finally there is an option `DYNAMIC_LIST` that allows to specify an individual list of targets to include instead of the default.
 
@@ -277,23 +288,29 @@ Please note that it is not possible to combine support for different architectur
 ### Supported OS
 
 - **GNU/Linux**
-- **MinGW or Visual Studio (CMake)/Windows**: Please read <https://github.com/xianyi/OpenBLAS/wiki/How-to-use-OpenBLAS-in-Microsoft-Visual-Studio>.
-- **Darwin/macOS/OSX/iOS**: Experimental. Although GotoBLAS2 already supports Darwin, we are not OSX/iOS experts.
+- **MinGW or Visual Studio (CMake)/Windows**: Please read <https://github.com/OpenMathLib/OpenBLAS/docs/install.md#visual-studio-native-windows-abi>.
+- **Darwin/macOS/OSX/iOS**: Already supported on PPC and x86 by the original GotoBLAS, now also on ARM64 but we are not OSX/iOS experts.
 - **FreeBSD**: Supported by the community. We don't actively test the library on this OS.
 - **OpenBSD**: Supported by the community. We don't actively test the library on this OS.
 - **NetBSD**: Supported by the community. We don't actively test the library on this OS.
 - **DragonFly BSD**: Supported by the community. We don't actively test the library on this OS.
-- **Android**: Supported by the community. Please read <https://github.com/xianyi/OpenBLAS/wiki/How-to-build-OpenBLAS-for-Android>.
-- **AIX**: Supported on PPC up to POWER10
+- **Android**: Supported by the community. Please read <https://github.com/OpenMathLib/OpenBLAS/docs/install.md#android>.
+- **AIX**: Supported on PPC up to POWER10 but testing is increasingly problematic due to lack of publicly available systems
 - **Haiku**: Supported by the community. We don't actively test the library on this OS.
 - **SunOS**: Supported by the community. We don't actively test the library on this OS.
-- **Cortex-M**: Supported by the community. Please read <https://github.com/xianyi/OpenBLAS/wiki/How-to-use-OpenBLAS-on-Cortex-M>.
+- **Cortex-M**: Supported by the community. Please read <https://github.com/OpenMathLib/OpenBLAS/docs/install.md#cortex-m>.
 
 ## Usage
 
 Statically link with `libopenblas.a` or dynamically link with `-lopenblas` if OpenBLAS was
 compiled as a shared library.
 
+### Considerations for using the library from Java
+
+The default stack size of only 1MB may be too small, especially if you built OpenBLAS to support larger matrix sizes than provided for by the default settings. Use the -Xss option to request a larger stack size if you encounter problems.
+
+When a Windows build of OpenBLAS was created using the MinGW gfortran (for the LAPACK parts), the Java application may hang on startup due to a deadlock between the gfortran runtime library initialization and any pipes created by a Win11/SBT/Play Framework environment. Use -Djdk.console=jdk.internal.le to work around this.
+
 ### Setting the number of threads using environment variables
 
 Environment variables are used to specify a maximum number of threads.
 
@@ -153,3 +153,7 @@ EV6
 14.CSKY
 CSKY
 CK860FV
+
+15. WebAssembly/Emscripten:
+WASM128_GENERIC
+
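
The Changelog entry above on the Haswell SROT/DROT kernels concerns consistent (FMA) rounding between the main loop and the tail. The following is a minimal Python sketch of why mixing the two can matter; it is not the OpenBLAS kernel. The function names `rot_unfused` and `rot_fused` are hypothetical, and the fused multiply-add is emulated with exact rational arithmetic instead of a hardware instruction.

```python
from fractions import Fraction


def rot_unfused(c, s, x, y):
    """One element of a plane rotation, c*x + s*y, with every
    intermediate result rounded to double precision."""
    return c * x + s * y


def rot_fused(c, s, x, y):
    """Same expression, but c*x is kept exact and added to the rounded
    s*y with a single final rounding, which emulates fma(c, x, s*y)."""
    sy = s * y  # rounded once, as in the unfused path
    exact = Fraction(c) * Fraction(x) + Fraction(sy)
    return float(exact)  # single correctly rounded conversion


# Illustrative inputs chosen so that the low-order bits of c*x
# are lost when the product is rounded on its own.
c = x = 1.0 + 2.0 ** -30
s = -1.0
y = 1.0 + 2.0 ** -29

print(rot_unfused(c, s, x, y))  # 0.0
print(rot_fused(c, s, x, y))    # 2**-60, about 8.67e-19
```

If the main loop used the fused form and the tail the unfused form (or the reverse), elements of the same vector would be rounded differently depending on their position, which is the kind of inconsistency the Changelog entry describes.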