Add NumPy optimization guide by vchamarthi · Pull Request #36 · intel/optimization-zone

vchamarthi · 2026-06-23T00:42:09Z

Adds a new tuning guide documenting how to run NumPy with Intel® oneMKL-backed performance (BLAS/LAPACK plus optional FFT/random/umath patching), and links it from the repository’s main README.

Changes:

Add software/numpy/README.md with installation, activation patterns, verification steps, and benchmark summaries for oneMKL-backed NumPy.
Update the root README.md table of contents to include the new NumPy guide.

CC @xaleryb @jharlow-intel @napetrov for additional review

Add intel-numpy mkl extension optimizations readme

david-cortes-intel · 2026-06-24T11:49:14Z

Overall comment: this guide recommends setting the IOMP threading layer for MKL, but pretty much every other PyPI package outside of Intel-distributed NumPy bundles libgomp, which could cause incompatibilities.

Perhaps it could recommend setting MKL_THREADING_LAYER=GNU instead.

david-cortes-intel · 2026-06-25T05:58:38Z

+conda install -y \
+  -c https://software.repos.intel.com/python/conda \
+  -c conda-forge --override-channels \
+  "blas=*=*_intelmkl" \


What about lapack? If this is done on an existing environment, there's no guarantee that the user won't have different backends for blas and lapack.

david-cortes-intel · 2026-06-25T06:00:09Z

+  mkl mkl_fft mkl_random mkl_umath mkl-service
+```
+
+`--override-channels` resolves only from the two named channels, so conda does not mix in an OpenBLAS build from elsewhere. The `blas=*=*_intelmkl` selector requests the Intel channel's MKL-backed BLAS; conda-forge offers an equivalent under the build string `blas=*=*mkl`. Either gives an MKL BLAS backend. The Intel channel is required for the three extensions and Intel's latest oneMKL builds.


I assume this advice might have been copied from other documentation pages.

The reason why it was there was to avoid pulling packages from the Anaconda channel which have higher priority. That's worth mentioning here.

david-cortes-intel · 2026-06-25T06:00:52Z

+  conda activate idp_env
+```
+
+Pin `python=<version>` to match your project if you need a specific interpreter. NumPy comes from conda-forge; the Intel channel supplies the `mkl_fft`/`mkl_random`/`mkl_umath` extensions and Intel's latest oneMKL builds. To add oneMKL to an *existing* environment that already has conda-forge NumPy installed, swap its BLAS to the MKL variant and add the extensions in place (this re-links the NumPy you already have, it does not reinstall NumPy):


I think this part is redundant:

Pin python=<version> to match your project if you need a specific interpreter

Since this is not a general conda guide.

david-cortes-intel · 2026-06-25T06:01:19Z

+  mkl mkl_fft mkl_random mkl_umath mkl-service
+```
+
+`--override-channels` resolves only from the two named channels, so conda does not mix in an OpenBLAS build from elsewhere. The `blas=*=*_intelmkl` selector requests the Intel channel's MKL-backed BLAS; conda-forge offers an equivalent under the build string `blas=*=*mkl`. Either gives an MKL BLAS backend. The Intel channel is required for the three extensions and Intel's latest oneMKL builds.


Suggested change

`--override-channels` resolves only from the two named channels, so conda does not mix in an OpenBLAS build from elsewhere. The `blas=*=*_intelmkl` selector requests the Intel channel's MKL-backed BLAS; conda-forge offers an equivalent under the build string `blas=*=*mkl`. Either gives an MKL BLAS backend. The Intel channel is required for the three extensions and Intel's latest oneMKL builds.

`--override-channels` resolves only from the two named channels, so conda does not mix in an OpenBLAS build from elsewhere. The `blas=*=*_intelmkl` selector requests the Intel channel's MKL-backed BLAS; conda-forge offers an equivalent under the build string `blas=*=*mkl*`. Either gives an MKL BLAS backend. The Intel channel is required for the three extensions and Intel's latest oneMKL builds.

david-cortes-intel · 2026-06-25T06:02:07Z

+
+Use `--index-url`, not `--extra-index-url`: Intel's index is a partial mirror, and with `--extra-index-url` pip would see PyPI's higher-numbered OpenBLAS wheel and install that instead. Packages Intel does not mirror (for example `threadpoolctl`, used for [verification](#verifying-onemkl-is-active)) install normally from PyPI in a separate step. The Intel wheels target Linux and Windows; if `pip` reports no matching distribution, check that your platform and Python version are covered on the index.
+
+Whichever path you take, choose the OpenMP threading layer and set it **before anything imports NumPy or MKL**. The variable is read once at MKL load time, so exporting it after the import has no effect. Which value to pick is explained under [Threads and NUMA](#threads-and-numa); the safe default for a typical pip or mixed environment is:


Also applicable to SciPy.
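For in-process use, the ordering requirement can be made explicit. A minimal sketch, assuming the layer is chosen from Python rather than exported in the shell; the helper name `set_mkl_threading_layer` is hypothetical, and `GNU` follows the recommendation for mixed environments:

```python
import os
import sys


def set_mkl_threading_layer(layer="GNU"):
    """Set MKL_THREADING_LAYER and report whether it was set early enough.

    Returns False when NumPy or SciPy is already imported, because MKL may
    already be loaded by then and would ignore the new value.
    """
    already_loaded = any(name in sys.modules for name in ("numpy", "scipy"))
    os.environ["MKL_THREADING_LAYER"] = layer
    return not already_loaded


in_time = set_mkl_threading_layer("GNU")
if not in_time:
    print("warning: NumPy/SciPy already imported; the threading layer may not apply")

# Import NumPy (or SciPy) only after this point.
```

The check covers SciPy as well as NumPy, since either import can be the one that loads MKL.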

david-cortes-intel · 2026-06-25T06:29:10Z

+
+The `threading_layer` value matches `MKL_THREADING_LAYER` (`gnu`, `intel`, or `sequential`); the field that confirms the backend is `internal_api: mkl`.
+
+`np.show_config()` will show `name: blas, version: 3.9.0` even with oneMKL active. That is expected: it reflects the generic interface NumPy compiled against, not the runtime library. `threadpoolctl` is the reliable check.


These hard-coded version numbers are prone to get outdated over time.
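One option is to have the guide show how to query versions at run time instead of quoting them. A minimal sketch using the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are assumed to match the package names used in the guide:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the guide; any of them may be missing locally.
for name, version in installed_versions(
    ["numpy", "mkl_fft", "mkl_random", "mkl_umath", "threadpoolctl"]
).items():
    print(f"{name}: {version or 'not installed'}")
```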

david-cortes-intel · 2026-06-25T06:30:11Z

+MKL_VERBOSE DGEMM(N,N,4096,4096,4096,...) 2.1s CNT=1
+```
+
+If only the banner appears and no `DGEMM`/`DFFT`/`VML` lines follow, oneMKL loaded but is not being called.


It should mention here that which of those lines appear depends on what the code is doing. Maybe it could provide a sample script with a matrix multiplication that would trigger DGEMM.
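Something along these lines might work as the sample script. It is a sketch, not output-verified on the benchmark system: it assumes NumPy is installed, and the matrix size `n` is an arbitrary choice. A 2-D `float64` product is the case expected to reach DGEMM.

```python
import os

# MKL_VERBOSE must be set before NumPy loads oneMKL.
os.environ["MKL_VERBOSE"] = "1"

import numpy as np

n = 512  # arbitrary size, small enough to run quickly
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n))
b = rng.standard_normal((n, n))

# float64 matrix multiplication: with oneMKL as the backend this should log a
# DGEMM line; with another backend (for example OpenBLAS) nothing is logged.
c = a @ b
print(c.shape)
```

An FFT or a large `np.exp` call would be the analogous trigger for the `DFFT` and `VML` lines, but only once the corresponding extension has been activated.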

david-cortes-intel · 2026-06-25T06:32:03Z

+
+**The extension packages do not activate themselves.** `mkl_fft`, `mkl_random`, and `mkl_umath` do not replace NumPy functions on import. Use the patch function or context manager. Since the 2026.0 release installs the standard conda-forge NumPy rather than a bundled Intel build, there is no longer anything that activates them at build time, so explicit activation is required even in the full Intel® Distribution for Python.
+
+**The activation model is release-specific; this guide targets 2026.0 and later.** The explicit `patch_*` workflow described here matches the package generation in [Benchmark results](#benchmark-results) (NumPy 2.4.3, mkl_fft 2.2.0, mkl_random 1.4.0, mkl_umath 0.4.0). Earlier releases behave differently, verified on `intelpython3_full=2025.3.0`:


This makes it sound as if this were expected to change in the future. Maybe it could mention that it applies to versions starting with 2026.0.

david-cortes-intel · 2026-06-25T06:33:38Z

+conda install -y \
+  -c https://software.repos.intel.com/python/conda \
+  -c conda-forge --override-channels \
+  "blas=*=*_intelmkl" \


'blas' is a development package providing headers, .pc files, and similar, depending in turn on 'libblas'. 'libblas' is the runtime that sets the backend.
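A check along these lines could help readers confirm that the BLAS and LAPACK runtimes ended up on the same backend. It is a sketch under assumptions: the runtime package names (`libblas`, `libcblas`, `liblapack`) and the idea that MKL builds carry `mkl` in their build string follow conda-forge conventions as I understand them, and the helper names are hypothetical.

```python
import json
import subprocess


def backend_builds(packages, names=("libblas", "libcblas", "liblapack")):
    """Return {package name: build string} for the BLAS/LAPACK runtime packages."""
    return {p["name"]: p.get("build_string", "") for p in packages if p["name"] in names}


def all_mkl(builds):
    """True when at least one runtime package is present and every one is an MKL build."""
    return bool(builds) and all("mkl" in build for build in builds.values())


def check_active_env():
    """Inspect the active conda environment (requires conda on PATH)."""
    out = subprocess.run(
        ["conda", "list", "--json"], check=True, capture_output=True, text=True
    ).stdout
    return backend_builds(json.loads(out))
```

Calling `all_mkl(check_active_env())` inside the target environment would then flag a mixed setup, for example an MKL `libblas` next to an OpenBLAS `liblapack`.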

david-cortes-intel · 2026-06-25T06:37:38Z

+conda install -c conda-forge _openmp_mutex=*=*_llvm
+```
+
+On Windows, `_openmp_mutex` offers Intel and LLVM variants but no GNU one, consistent with there being no GNU threading on the platform.


This is not correct:

david-cortes-intel · 2026-06-25T06:39:31Z

+Pin `python=<version>` to match your project if you need a specific interpreter. NumPy comes from conda-forge; the Intel channel supplies the `mkl_fft`/`mkl_random`/`mkl_umath` extensions and Intel's latest oneMKL builds. To add oneMKL to an *existing* environment that already has conda-forge NumPy installed, swap its BLAS to the MKL variant and add the extensions in place (this re-links the NumPy you already have, it does not reinstall NumPy):
+
+```bash
+conda install -y \


Very important to mention here that packages from the Intel channel are meant to be compatible with packages from conda-forge but not with packages from Anaconda, which is the default channel.

david-cortes-intel · 2026-06-25T06:40:36Z

Commenting again: the guide specifically mentions AVX-512 as the highest level of SIMD instructions, but that will become outdated soon as hardware with AVX10.2 gets released.

david-cortes-intel · 2026-06-25T06:42:42Z

+```python
+from threadpoolctl import threadpool_info
+import pprint
+pprint.pprint(threadpool_info())


This should be executed after importing numpy.
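A possible shape for the corrected snippet, with the NumPy import placed first so that its BLAS library is already loaded when `threadpool_info()` scans the process. The helper names `mkl_pools` and `report` are hypothetical; the `internal_api` key is the field the guide already relies on.

```python
def mkl_pools(pools):
    """Keep only the thread pools that threadpoolctl attributes to MKL."""
    return [p for p in pools if p.get("internal_api") == "mkl"]


def report():
    """Collect MKL thread-pool info; NumPy is imported first on purpose."""
    import numpy  # noqa: F401  (loads the BLAS library into the process)
    from threadpoolctl import threadpool_info

    return mkl_pools(threadpool_info())
```

Running `pprint.pprint(report())` would print the MKL entries; an empty list suggests that no MKL library was loaded.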

david-cortes-intel · 2026-06-25T07:20:48Z

+| `MKL_DYNAMIC` | `FALSE` | Disable automatic thread scaling |
+| `KMP_AFFINITY` | `granularity=fine,compact,1,0` | Pin threads to physical cores (Intel OpenMP only) |
+
+`KMP_AFFINITY` is an Intel OpenMP setting, so it applies only when oneMKL is on the Intel runtime (`MKL_THREADING_LAYER=INTEL`); under the GNU layer use `GOMP_CPU_AFFINITY` or `numactl` instead. `KMP_AFFINITY=granularity=fine,compact,1,0` is appropriate for single-socket systems or when running one process per socket. On multi-socket systems without `numactl` it may bind threads across sockets; verify the actual binding with `KMP_AFFINITY=verbose`.


What about OMP_PROC_BIND?
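One way the guide could cover both runtimes is to key the affinity variables off the threading layer. A minimal sketch: the helper `affinity_env` is hypothetical, the `KMP_AFFINITY` value is the one from the guide's table, and `OMP_PROC_BIND=close` with `OMP_PLACES=cores` are the standard OpenMP variables, offered here as a suggestion rather than a tested recommendation.

```python
def affinity_env(threading_layer):
    """Return environment settings for thread pinning under the given MKL layer."""
    layer = threading_layer.upper()
    if layer == "INTEL":
        # Intel OpenMP runtime: use its own affinity interface.
        return {
            "MKL_THREADING_LAYER": "INTEL",
            "KMP_AFFINITY": "granularity=fine,compact,1,0",
        }
    if layer == "GNU":
        # GNU OpenMP runtime: KMP_AFFINITY is ignored, so use the standard variables.
        return {
            "MKL_THREADING_LAYER": "GNU",
            "OMP_PROC_BIND": "close",
            "OMP_PLACES": "cores",
        }
    raise ValueError(f"unsupported threading layer: {threading_layer!r}")


for key, value in affinity_env("GNU").items():
    print(f"export {key}={value}")
```

With `OMP_PLACES=cores` each place is a whole physical core, which avoids depending on how the system numbers its hardware threads.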

david-cortes-intel · 2026-06-25T13:39:12Z

+| Variable | Recommended value | Effect |
+|---|---|---|
+| `MKL_THREADING_LAYER` | `GNU` (mixed env) or `INTEL` (all-Intel) | Select MKL's OpenMP runtime; see note below |
+| `MKL_NUM_THREADS` | physical core count | Cap MKL thread count |


Is this guaranteed to work as intended if you set MKL_NUM_THREADS to number of physical cores, then bind the threads to numbers from the system, but don't specify something like OMP_PLACES=threads? Wouldn't it potentially end up using hyperthreads if the system enumerates them in an interleaved order?
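To make the concern concrete, the sibling lists that Linux exposes in sysfs show whether hyperthreads are numbered adjacently or in a second block. A minimal sketch, assuming a Linux system; the helper names are hypothetical:

```python
import glob


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,32' or '0-1' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists):
    """Keep the lowest-numbered logical CPU from each distinct sibling group."""
    groups = {tuple(parse_cpu_list(s)) for s in sibling_lists}
    return sorted(g[0] for g in groups if g)


def read_sibling_lists():
    """Read every CPU's thread_siblings_list from sysfs (Linux only)."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read())
    return lists
```

On a system that numbers siblings adjacently (`0-1`, `2-3`, ...), `one_cpu_per_core(read_sibling_lists())` returns `[0, 2, ...]`, so binding the first N threads to CPUs `0..N-1` would place two threads on each of the first N/2 cores.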

david-cortes-intel · 2026-06-25T14:43:50Z

+The speedup arrives in two parts that activate differently, and the distinction matters for the rest of this guide:
+
+- **Linear algebra (BLAS and LAPACK)** turns on automatically once oneMKL is the backend. `np.dot`, `np.matmul`, and `np.linalg.*` route to it with no code change.
+- **FFT, random, and vectorized math** come from three separate packages (`mkl_fft`, `mkl_random`, `mkl_umath`). These do not activate on import; you switch them on explicitly in code.


It could link to the github repositories of those packages.

adgubrud · 2026-07-01T16:59:45Z

@@ -0,0 +1,443 @@
+# Intel® Optimized NumPy with oneMKL
+
+This guide describes how to get optimal NumPy performance on Intel® processors, from Xeon® servers to AVX-capable laptops, by using Intel® oneAPI Math Kernel Library (oneMKL) as the backend for linear algebra, FFT, random number generation, and vectorized math. It covers installation, how to activate each optimization with minimal code changes, thread and NUMA tuning, and how to verify that oneMKL is active, along with measured benchmark results.


Suggested change

This guide describes how to get optimal NumPy performance on Intel® processors, from Xeon® servers to AVX-capable laptops, by using Intel® oneAPI Math Kernel Library (oneMKL) as the backend for linear algebra, FFT, random number generation, and vectorized math. It covers installation, how to activate each optimization with minimal code changes, thread and NUMA tuning, and how to verify that oneMKL is active, along with measured benchmark results.

This guide describes how to get optimal NumPy performance on Intel® processors, from servers to laptops, by using Intel® oneAPI Math Kernel Library (oneMKL) as the backend for linear algebra, Fast Fourier Transform (FFT), random number generation, and vectorized math. It covers installation, how to activate each optimization with minimal code changes, thread and NUMA tuning, and how to verify that oneMKL is active, along with measured benchmark results.

adgubrud · 2026-07-01T17:00:29Z

@@ -0,0 +1,443 @@
+# Intel® Optimized NumPy with oneMKL


Suggested change

# Intel® Optimized NumPy with oneMKL

# Intel® Optimized NumPy With oneMKL

adgubrud · 2026-07-01T17:00:40Z

+
+This guide describes how to get optimal NumPy performance on Intel® processors, from Xeon® servers to AVX-capable laptops, by using Intel® oneAPI Math Kernel Library (oneMKL) as the backend for linear algebra, FFT, random number generation, and vectorized math. It covers installation, how to activate each optimization with minimal code changes, thread and NUMA tuning, and how to verify that oneMKL is active, along with measured benchmark results.
+
+## Table of contents


Suggested change

## Table of contents

## Table Of Contents

adgubrud · 2026-07-01T17:02:27Z

+
+---
+
+## Where NumPy performance comes from


Suggested change

## Where NumPy performance comes from

## NumPy Performance Contributors

Please also update the table of contents

adgubrud · 2026-07-01T17:03:18Z

+
+---
+
+## Optimization levers


Suggested change

## Optimization levers

## Optimization Levers

adgubrud · 2026-07-01T17:54:53Z

+
+NumPy runs much of its work in its own compiled code, but its heaviest numerical kernels are handed off to external libraries: linear algebra to a BLAS/LAPACK library, FFTs to an FFT library, and large element-wise transcendental math (`sin`, `exp`, `log`) to vectorized loops. For those kernels, performance is largely decided by *which* native library the call lands in.
+
+That backend is a choice made at install time. PyPI and conda-forge NumPy ship with OpenBLAS, a strong general-purpose implementation that uses AVX-512 on recent Intel CPUs. oneMKL goes further in two ways: its kernels are tuned for Intel hardware, and it accelerates FFT, random number generation, and vectorized math, which a BLAS library does not cover. Both gains apply on Intel® Xeon® servers and on AVX-512-capable Intel client and laptop CPUs.


Suggested change

That backend is a choice made at install time. PyPI and conda-forge NumPy ship with OpenBLAS, a strong general-purpose implementation that uses AVX-512 on recent Intel CPUs. oneMKL goes further in two ways: its kernels are tuned for Intel hardware, and it accelerates FFT, random number generation, and vectorized math, which a BLAS library does not cover. Both gains apply on Intel® Xeon® servers and on AVX-512-capable Intel client and laptop CPUs.

That backend is a choice made at install time. PyPI and conda-forge NumPy ship with OpenBLAS, a strong general-purpose implementation that uses AVX-512 on recent Intel CPUs. oneMKL goes further in two ways: its kernels are tuned for Intel hardware, and it accelerates a number of functionalities not covered by a BLAS library such as FFT, random number generation, and vectorized math. Both gains apply on Intel® Xeon® servers and on AVX-512-capable Intel client and laptop CPUs.

adgubrud · 2026-07-01T17:59:21Z

+
+## Accelerating NumPy with oneMKL
+
+Intel® oneAPI Math Kernel Library (oneMKL) supplies AVX-512 implementations for every one of those backends on 3rd Gen Intel® Xeon® (Ice Lake) and newer: BLAS, LAPACK, FFT, random number generation, and vectorized math. Pointing NumPy at oneMKL is how you turn that hardware capability into wall-clock speedup, with no change to your NumPy code. Across a representative set of NumPy-heavy workloads this is a 3.95x geomean speedup at one socket; the full breakdown is in [Benchmark results](#benchmark-results).


This paragraph seems to hit most of the same points of the previous paragraph. Please consider rewording/trimming it down.

In other parts of this article, you highlight laptop support in addition to servers. Is there a reason why you only specify Xeon support in this paragraph?

adgubrud · 2026-07-01T18:02:35Z

+
+The speedup arrives in two parts that activate differently, and the distinction matters for the rest of this guide:
+
+- **Linear algebra (BLAS and LAPACK)** turns on automatically once oneMKL is the backend. `np.dot`, `np.matmul`, and `np.linalg.*` route to it with no code change.


Suggested change

- **Linear algebra (BLAS and LAPACK)** turns on automatically once oneMKL is the backend. `np.dot`, `np.matmul`, and `np.linalg.*` route to it with no code change.

- **Linear algebra (BLAS and LAPACK)** turns on automatically once oneMKL is the backend (details for this in the [installation](#installation) section). `np.dot`, `np.matmul`, and `np.linalg.*` route to it with no code change.

adgubrud · 2026-07-01T18:19:26Z

+
+### Installation
+
+There are two practical ways to get a oneMKL-backed NumPy. conda is recommended because it also lets you control the OpenMP runtime (see [Threads and NUMA](#threads-and-numa)).


Suggested change

There are two practical ways to get a oneMKL-backed NumPy. conda is recommended because it also lets you control the OpenMP runtime (see [Threads and NUMA](#threads-and-numa)).

There are two practical ways to get a oneMKL-backed NumPy. conda is recommended because it also lets you control the OpenMP runtime (see [Threads and NUMA](#threads-and-numa)). The [Miniforge](https://github.com/conda-forge/miniforge) distribution is recommended.

adgubrud · 2026-07-01T18:21:42Z

+
+There are two practical ways to get a oneMKL-backed NumPy. conda is recommended because it also lets you control the OpenMP runtime (see [Threads and NUMA](#threads-and-numa)).
+
+**conda.** A single command installs NumPy, SciPy, the three extension packages (mkl_fft, mkl_random, mkl_umath), and the runtime libraries. The BLAS/LAPACK backend routes to oneMKL automatically; the extensions are installed but still need explicit activation.


Please provide a link to an example of this explicit activation. Is that in the "Optimization Levers" section?

adgubrud · 2026-07-01T19:38:07Z

+
+The three explicit extensions share an activation model, and the key point is how little code it takes. Activation is a single one-time call: a **context manager** around a block, best when you want oneMKL for one section and stock NumPy elsewhere, or a **patch/restore pair**, best when oneMKL should stay active for the life of the process. That one call is the only addition. It redirects NumPy's internals so your existing `np.fft.*`, `np.random.*`, and `np.sin`/`np.exp`/`np.log` call sites dispatch to oneMKL with their source unchanged.
+
+Concretely, given an existing function, the only edit is the import-and-activate block at the top. The function body is untouched:


This makes it sound like the imports and activation are supposed to be at the top. Does import mkl_fft [...] have to come after the function definition, or can it come before? If it can come before, I think it will be less confusing to the reader to move it to the top of the example source code.
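On the ordering question: in Python an attribute such as `np.fft.fft` is looked up when the function is called, not when it is defined, so in general a runtime patch takes effect regardless of whether activation appears before or after the `def`. The sketch below illustrates that with a stand-in namespace; `lib`, `transform`, and `patched` are hypothetical names, and it does not use or reproduce the mkl_fft API.

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Stand-in for a library namespace such as np.fft (hypothetical, not NumPy itself).
lib = SimpleNamespace(transform=lambda x: ("stock", x))


def analyze(x):
    # The attribute is resolved each time analyze() runs, not when it is defined.
    return lib.transform(x)


@contextmanager
def patched(namespace, name, replacement):
    """Temporarily swap one attribute, restoring the original on exit."""
    original = getattr(namespace, name)
    setattr(namespace, name, replacement)
    try:
        yield
    finally:
        setattr(namespace, name, original)


before = analyze(1)
with patched(lib, "transform", lambda x: ("accelerated", x)):
    during = analyze(1)  # analyze() was defined before the patch and still sees it
after = analyze(1)
print(before, during, after)
```

If the extensions patch NumPy in this general way, the import-and-activate block could sit at the top of the example without changing behavior.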

adgubrud · 2026-07-01T20:01:34Z

+
+### Linear algebra: BLAS and LAPACK
+
+This is the lever you get for free. Once oneMKL is the backend, `np.dot`, `np.matmul`, `np.linalg.*`, and everything built on them (covariances, distances, decompositions) run on oneMKL's BLAS and LAPACK with no code change and nothing to activate. These kernels dispatch at runtime to an optimized code path for the CPU's instruction set (e.g.,  AVX-512 on current Xeons). This is the largest single contributor to the geomean in [Benchmark results](#benchmark-results).


Suggested change

This is the lever you get for free. Once oneMKL is the backend, `np.dot`, `np.matmul`, `np.linalg.*`, and everything built on them (covariances, distances, decompositions) run on oneMKL's BLAS and LAPACK with no code change and nothing to activate. These kernels dispatch at runtime to an optimized code path for the CPU's instruction set (e.g., AVX-512 on current Xeons). This is the largest single contributor to the geomean in [Benchmark results](#benchmark-results).

This is the lever you get for free. Once oneMKL is the backend, `np.dot`, `np.matmul`, `np.linalg.*`, and everything built on them (covariances, distances, decompositions) run on oneMKL's BLAS and LAPACK with no code change and nothing to activate. These kernels dispatch at runtime to an optimized code path for the CPU's instruction set (e.g., AVX-512 on current Xeons). This is the largest single contributor to the geomean speedup in [Benchmark results](#benchmark-results).

adgubrud · 2026-07-01T20:04:08Z

+print(f"speedup         : {stock_ms / mkl_ms:.1f}x")
+```
+
+Measured on AWS, Intel® Xeon® 6975P-C, 16 cores / 32 threads (HT on), 1 socket, Ubuntu 26.04 LTS. Numbers vary by hardware.


Have these numbers been through PDT?

adgubrud · 2026-07-01T20:08:22Z

+
+def analyze(signal):
+    spectrum = np.fft.fft(signal)    # -> numpy.fft, then mkl_fft after activation
+    power = np.abs(spectrum) ** 2    # -> VML after activation (large arrays)


Please provide the full form of the abbreviation VML

adgubrud · 2026-07-01T20:11:14Z

+
+`mkl_random` is a Python interface to oneMKL's Vector Statistics Library (VSL). It samples from the same distributions as `numpy.random` but is not a fixed-seed drop-in: the same seed produces a different sequence. Use it when generating large volumes of random data is a bottleneck and you do not depend on reproducing specific values.
+
+It can be used two ways. The **context manager** is the zero-code-change path, like the other extensions: existing `np.random.*` call sites keep working and route through VSL (shown [below](#random-number-generation-mkl_random)). The **explicit `RandomState` API** is a small code change that lets you pick the generator and the sampling method for the fastest path. The benchmark below uses it with `method='BoxMuller'`, oneMKL's fast normal sampler.


Suggested change

It can be used two ways. The **context manager** is the zero-code-change path, like the other extensions: existing `np.random.*` call sites keep working and route through VSL (shown [below](#random-number-generation-mkl_random)). The **explicit `RandomState` API** is a small code change that lets you pick the generator and the sampling method for the fastest path. The benchmark below uses it with `method='BoxMuller'`, oneMKL's fast normal sampler.

It can be used two ways. The **context manager** is the zero-code-change path, like the other extensions: existing `np.random.*` call sites keep working and route through VSL (shown below). The **explicit `RandomState` API** is a small code change that lets you pick the generator and the sampling method for the fastest path. The benchmark below uses it with `method='BoxMuller'`, oneMKL's fast normal sampler.

That link just takes the user to top of the "Random number generation" section.

adgubrud · 2026-07-01T20:26:21Z

+| `MKL_THREADING_LAYER` | `GNU` (mixed env) or `INTEL` (all-Intel) | Select MKL's OpenMP runtime; see note below |
+| `MKL_NUM_THREADS` | physical core count | Cap MKL thread count |
+| `MKL_DYNAMIC` | `FALSE` | Disable automatic thread scaling |
+| `KMP_AFFINITY` | `granularity=fine,compact,1,0` | Pin threads to physical cores (Intel OpenMP only) |


What is the breakdown of the arguments in "fine,compact,1,0"? Is there a reference we can point to?

adgubrud · 2026-07-01T20:28:15Z

+| `MKL_DYNAMIC` | `FALSE` | Disable automatic thread scaling |
+| `KMP_AFFINITY` | `granularity=fine,compact,1,0` | Pin threads to physical cores (Intel OpenMP only) |
+
+`KMP_AFFINITY` is an Intel OpenMP setting, so it applies only when oneMKL is on the Intel runtime (`MKL_THREADING_LAYER=INTEL`); under the GNU layer use `GOMP_CPU_AFFINITY` or `numactl` instead. `KMP_AFFINITY=granularity=fine,compact,1,0` is appropriate for single-socket systems or when running one process per socket. On multi-socket systems without `numactl` it may bind threads across sockets; verify the actual binding with `KMP_AFFINITY=verbose`.


Is KMP_AFFINITY=verbose supposed to be tacked on to KMP_AFFINITY=granularity=fine,compact,1,0? What does that look like?
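As far as I understand the Intel OpenMP affinity syntax, the value has the form `[modifier,...]type[,permute][,offset]`, so `granularity=fine` is a modifier, `compact` is the type, `1` is the permute, and `0` is the offset. `verbose` is one more modifier in the same value rather than a second variable. A minimal sketch of what that looks like when launching a workload; the child command is a placeholder that only echoes the variable:

```python
import os
import subprocess
import sys

# "verbose" is added as one more modifier in the same comma-separated value.
affinity = "verbose,granularity=fine,compact,1,0"

env = dict(os.environ, MKL_THREADING_LAYER="INTEL", KMP_AFFINITY=affinity)

# Placeholder child: replace the -c snippet with the real workload script.
child = "import os; print(os.environ['KMP_AFFINITY'])"
result = subprocess.run(
    [sys.executable, "-c", child], env=env, check=True, capture_output=True, text=True
)
print(result.stdout.strip())
```

With a real NumPy workload on the Intel runtime, the binding report from `verbose` should appear once the first parallel region starts.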

adgubrud · 2026-07-01T20:34:14Z

+
+---
+
+## Verifying oneMKL is active


Suggested change

## Verifying oneMKL is active

## Verifying oneMKL Is Active

adgubrud · 2026-07-01T20:37:10Z

+
+---
+
+## Benchmark results


Have these numbers all been through PDT?

adgubrud · 2026-07-01T20:42:43Z

+
+**`mkl_random` is not a drop-in for `numpy.random`.** The same seed produces a different sequence. Do not swap it into code that depends on reproducible random values.
+
+**AMX does not apply to standard NumPy operations.** NumPy's `float32` and `float64` operations use oneMKL's AVX-512 code paths. AMX tiles only activate for bfloat16 GEMM, which NumPy does not call natively.


Suggested change

**AMX does not apply to standard NumPy operations.** NumPy's `float32` and `float64` operations use oneMKL's AVX-512 code paths. AMX tiles only activate for bfloat16 GEMM, which NumPy does not call natively.

**AMX does not apply to standard NumPy operations, even when using oneMKL.** NumPy's `float32` and `float64` operations use oneMKL's AVX-512 code paths. AMX tiles only activate for int8/bfloat16 GEMM, which NumPy does not call natively.

Is that correction accurate? My understanding is that AMX accelerates int8/bf16 GEMMs and even with oneMKL, NumPy operations will not use AMX.

vchamarthi and others added 4 commits May 15, 2026 13:01

Add intel-numpy mkl extension optimizations readme

e3c5a1b

Merge remote-tracking branch 'upstream/main' into intel-numpy

b6fa2f4

update the readme with latest release notes.

0f77a3b

Merge pull request #1 from vchamarthi/intel-numpy

3a9bf76

Add intel-numpy mkl extension optimizations readme

jharlow-intel reviewed Jun 23, 2026

jharlow-intel reviewed Jun 23, 2026

david-cortes-intel reviewed Jun 24, 2026

update guide with pr comments and recommendations

0bb6756

david-cortes-intel reviewed Jun 25, 2026

adgubrud requested changes Jul 1, 2026

	`--override-channels` resolves only from the two named channels, so conda does not mix in an OpenBLAS build from elsewhere. The `blas==_intelmkl` selector requests the Intel channel's MKL-backed BLAS; conda-forge offers an equivalent under the build string `blas==mkl`. Either gives an MKL BLAS backend. The Intel channel is required for the three extensions and Intel's latest oneMKL builds.
	`--override-channels` resolves only from the two named channels, so conda does not mix in an OpenBLAS build from elsewhere. The `blas==_intelmkl` selector requests the Intel channel's MKL-backed BLAS; conda-forge offers an equivalent under the build string `blas==mkl*`. Either gives an MKL BLAS backend. The Intel channel is required for the three extensions and Intel's latest oneMKL builds.


		Use `--index-url`, not `--extra-index-url`: Intel's index is a partial mirror, and with `--extra-index-url` pip would see PyPI's higher-numbered OpenBLAS wheel and install that instead. Packages Intel does not mirror (for example `threadpoolctl`, used for [verification](#verifying-onemkl-is-active)) install normally from PyPI in a separate step. The Intel wheels target Linux and Windows; if `pip` reports no matching distribution, check that your platform and Python version are covered on the index.

		Whichever path you take, choose the OpenMP threading layer and set it before anything imports NumPy or MKL. The variable is read once at MKL load time, so exporting it after the import has no effect. Which value to pick is explained under [Threads and NUMA](#threads-and-numa); the safe default for a typical pip or mixed environment is:


		The `threading_layer` value matches `MKL_THREADING_LAYER` (`gnu`, `intel`, or `sequential`); the field that confirms the backend is `internal_api: mkl`.

		`np.show_config()` will show `name: blas, version: 3.9.0` even with oneMKL active. That is expected: it reflects the generic interface NumPy compiled against, not the runtime library. `threadpoolctl` is the reliable check.


		The extension packages do not activate themselves. `mkl_fft`, `mkl_random`, and `mkl_umath` do not replace NumPy functions on import. Use the patch function or context manager. Since the 2026.0 release installs the standard conda-forge NumPy rather than a bundled Intel build, there is no longer anything that activates them at build time, so explicit activation is required even in the full Intel® Distribution for Python.

		The activation model is release-specific; this guide targets 2026.0 and later. The explicit `patch_*` workflow described here matches the package generation in [Benchmark results](#benchmark-results) (NumPy 2.4.3, mkl_fft 2.2.0, mkl_random 1.4.0, mkl_umath 0.4.0). Earlier releases behave differently, verified on `intelpython3_full=2025.3.0`:

		@@ -0,0 +1,443 @@
		# Intel® Optimized NumPy with oneMKL

		This guide describes how to get optimal NumPy performance on Intel® processors, from Xeon® servers to AVX-capable laptops, by using Intel® oneAPI Math Kernel Library (oneMKL) as the backend for linear algebra, FFT, random number generation, and vectorized math. It covers installation, how to activate each optimization with minimal code changes, thread and NUMA tuning, and how to verify that oneMKL is active, along with measured benchmark results.


		This guide describes how to get optimal NumPy performance on Intel® processors, from Xeon® servers to AVX-capable laptops, by using Intel® oneAPI Math Kernel Library (oneMKL) as the backend for linear algebra, FFT, random number generation, and vectorized math. It covers installation, how to activate each optimization with minimal code changes, thread and NUMA tuning, and how to verify that oneMKL is active, along with measured benchmark results.

		## Table of contents

	## Where NumPy performance comes from
	## NumPy Performance Contributors


		NumPy runs much of its work in its own compiled code, but its heaviest numerical kernels are handed off to external libraries: linear algebra to a BLAS/LAPACK library, FFTs to an FFT library, and large element-wise transcendental math (`sin`, `exp`, `log`) to vectorized loops. For those kernels, performance is largely decided by which native library the call lands in.

		That backend is a choice made at install time. PyPI and conda-forge NumPy ship with OpenBLAS, a strong general-purpose implementation that uses AVX-512 on recent Intel CPUs. oneMKL goes further in two ways: its kernels are tuned for Intel hardware, and it accelerates FFT, random number generation, and vectorized math, which a BLAS library does not cover. Both gains apply on Intel® Xeon® servers and on AVX-512-capable Intel client and laptop CPUs.


		## Accelerating NumPy with oneMKL

		Intel® oneAPI Math Kernel Library (oneMKL) supplies AVX-512 implementations for every one of those backends on 3rd Gen Intel® Xeon® (Ice Lake) and newer: BLAS, LAPACK, FFT, random number generation, and vectorized math. Pointing NumPy at oneMKL is how you turn that hardware capability into wall-clock speedup, with no change to your NumPy code. Across a representative set of NumPy-heavy workloads this is a 3.95x geomean speedup at one socket; the full breakdown is in [Benchmark results](#benchmark-results).


		The speedup arrives in two parts that activate differently, and the distinction matters for the rest of this guide:

		- Linear algebra (BLAS and LAPACK) turns on automatically once oneMKL is the backend. `np.dot`, `np.matmul`, and `np.linalg.*` route to it with no code change.


		### Installation

		There are two practical ways to get a oneMKL-backed NumPy. conda is recommended because it also lets you control the OpenMP runtime (see [Threads and NUMA](#threads-and-numa)).


		There are two practical ways to get a oneMKL-backed NumPy. conda is recommended because it also lets you control the OpenMP runtime (see [Threads and NUMA](#threads-and-numa)).

		conda. A single command installs NumPy, SciPy, the three extension packages (mkl_fft, mkl_random, mkl_umath), and the runtime libraries. The BLAS/LAPACK backend routes to oneMKL automatically; the extensions are installed but still need explicit activation.


		The three explicit extensions share an activation model, and the key point is how little code it takes. Activation is a single one-time call: a context manager around a block, best when you want oneMKL for one section and stock NumPy elsewhere, or a patch/restore pair, best when oneMKL should stay active for the life of the process. That one call is the only addition. It redirects NumPy's internals so your existing `np.fft.*`, `np.random.*`, and `np.sin`/`np.exp`/`np.log` call sites dispatch to oneMKL with their source unchanged.

		Concretely, given an existing function, the only edit is the import-and-activate block at the top. The function body is untouched:


		### Linear algebra: BLAS and LAPACK

		This is the lever you get for free. Once oneMKL is the backend, `np.dot`, `np.matmul`, `np.linalg.*`, and everything built on them (covariances, distances, decompositions) run on oneMKL's BLAS and LAPACK with no code change and nothing to activate. These kernels dispatch at runtime to an optimized code path for the CPU's instruction set (e.g., AVX-512 on current Xeons). This is the largest single contributor to the geomean in [Benchmark results](#benchmark-results).
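One way to see the effect of the backend is to time the same matrix multiply in two environments. The sketch below is a minimal timing harness, not the benchmark behind the reported numbers; the helper name `best_matmul_time`, the matrix size, and the repeat count are arbitrary choices. It uses the usual estimate of about 2n³ floating-point operations for an n×n matrix multiply.

```python
import time

import numpy as np


def best_matmul_time(n, repeats=3, seed=0):
    """Return the best wall-clock time (seconds) for an n x n float64 matmul."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    np.matmul(a, b)  # warm-up so library initialization is not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.matmul(a, b)
        best = min(best, time.perf_counter() - start)
    return best


n = 512
seconds = best_matmul_time(n)
gflops = 2.0 * n**3 / seconds / 1e9  # approximate operation count
print(f"{n}x{n} matmul: {seconds * 1e3:.2f} ms, about {gflops:.1f} GFLOP/s")
```

Running the identical script under an OpenBLAS-backed and a oneMKL-backed NumPy gives a like-for-like comparison, since the script itself does not change.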


		`mkl_random` is a Python interface to oneMKL's Vector Statistics Library (VSL). It samples from the same distributions as `numpy.random` but is not a fixed-seed drop-in: the same seed produces a different sequence. Use it when generating large volumes of random data is a bottleneck and you do not depend on reproducing specific values.

		It can be used two ways. The context manager is the zero-code-change path, like the other extensions: existing `np.random.*` call sites keep working and route through VSL (shown [below](#random-number-generation-mkl_random)). The explicit `RandomState` API is a small code change that lets you pick the generator and the sampling method for the fastest path. The benchmark below uses it with `method='BoxMuller'`, oneMKL's fast normal sampler.
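The explicit path might look like the sketch below. It assumes `mkl_random.RandomState` accepts a seed as its first argument and that `normal` takes the `method` keyword described above; check both against the installed `mkl_random` version. When `mkl_random` is not installed, the helper (an illustrative name, `sample_normals`) falls back to `numpy.random.RandomState` so the script still runs.

```python
import numpy as np

try:
    import mkl_random  # optional: only present in a oneMKL-enabled environment
except ImportError:
    mkl_random = None


def sample_normals(n, seed=0):
    """Draw n standard-normal samples, using mkl_random when it is available."""
    if mkl_random is not None:
        # Assumed signature: RandomState(seed), normal(..., method="BoxMuller").
        rs = mkl_random.RandomState(seed)
        return rs.normal(loc=0.0, scale=1.0, size=n, method="BoxMuller")
    # Fallback: stock NumPy, which has no `method` argument.
    rs = np.random.RandomState(seed)
    return rs.normal(loc=0.0, scale=1.0, size=n)


samples = sample_normals(100_000)
source = "mkl_random" if mkl_random is not None else "numpy.random"
print(f"{source}: mean={samples.mean():.4f}, std={samples.std():.4f}")
```

The two branches draw from the same distribution but, for a given seed, do not return the same values.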


		`mkl_random` is not a drop-in for `numpy.random`. The same seed produces a different sequence. Do not swap it into code that depends on reproducible random values.

		AMX does not apply to standard NumPy operations. NumPy's `float32` and `float64` operations use oneMKL's AVX-512 code paths. AMX tiles activate only for low-precision GEMM such as bfloat16 and int8, which NumPy does not call natively.

vchamarthi commented Jun 23, 2026

david-cortes-intel commented Jun 24, 2026

david-cortes-intel commented Jun 25, 2026
