Add NumPy optimization guide #36
Add intel-numpy mkl extension optimizations readme
|
Overall comment: this guide recommends setting the IOMP threading layer for MKL, but pretty much every other PyPI package outside of Intel-distributed NumPy will bundle LibGOMP and could potentially cause incompatibilities. Perhaps it could recommend setting |
| conda install -y \ | ||
| -c https://software.repos.intel.com/python/conda \ | ||
| -c conda-forge --override-channels \ | ||
| "blas=*=*_intelmkl" \ |
What about lapack? If this is done on an existing environment, there's no guarantee that the user won't have different backends for blas and lapack.
| mkl mkl_fft mkl_random mkl_umath mkl-service | ||
| ``` | ||
|
|
||
| `--override-channels` resolves only from the two named channels, so conda does not mix in an OpenBLAS build from elsewhere. The `blas=*=*_intelmkl` selector requests the Intel channel's MKL-backed BLAS; conda-forge offers an equivalent under the build string `blas=*=*mkl`. Either gives an MKL BLAS backend. The Intel channel is required for the three extensions and Intel's latest oneMKL builds. |
I assume this advice might have been copied from other documentation pages.
The reason why it was there was to avoid pulling packages from the Anaconda channel which have higher priority. That's worth mentioning here.
| conda activate idp_env | ||
| ``` | ||
|
|
||
| Pin `python=<version>` to match your project if you need a specific interpreter. NumPy comes from conda-forge; the Intel channel supplies the `mkl_fft`/`mkl_random`/`mkl_umath` extensions and Intel's latest oneMKL builds. To add oneMKL to an *existing* environment that already has conda-forge NumPy installed, swap its BLAS to the MKL variant and add the extensions in place (this re-links the NumPy you already have, it does not reinstall NumPy): |
I think this part is redundant:
> Pin `python=<version>` to match your project if you need a specific interpreter

Since this is not a general conda guide.
| mkl mkl_fft mkl_random mkl_umath mkl-service | ||
| ``` | ||
|
|
||
| `--override-channels` resolves only from the two named channels, so conda does not mix in an OpenBLAS build from elsewhere. The `blas=*=*_intelmkl` selector requests the Intel channel's MKL-backed BLAS; conda-forge offers an equivalent under the build string `blas=*=*mkl`. Either gives an MKL BLAS backend. The Intel channel is required for the three extensions and Intel's latest oneMKL builds. |
| `--override-channels` resolves only from the two named channels, so conda does not mix in an OpenBLAS build from elsewhere. The `blas=*=*_intelmkl` selector requests the Intel channel's MKL-backed BLAS; conda-forge offers an equivalent under the build string `blas=*=*mkl`. Either gives an MKL BLAS backend. The Intel channel is required for the three extensions and Intel's latest oneMKL builds. | |
| `--override-channels` resolves only from the two named channels, so conda does not mix in an OpenBLAS build from elsewhere. The `blas=*=*_intelmkl` selector requests the Intel channel's MKL-backed BLAS; conda-forge offers an equivalent under the build string `blas=*=*mkl*`. Either gives an MKL BLAS backend. The Intel channel is required for the three extensions and Intel's latest oneMKL builds. |
|
|
||
| Use `--index-url`, not `--extra-index-url`: Intel's index is a partial mirror, and with `--extra-index-url` pip would see PyPI's higher-numbered OpenBLAS wheel and install that instead. Packages Intel does not mirror (for example `threadpoolctl`, used for [verification](#verifying-onemkl-is-active)) install normally from PyPI in a separate step. The Intel wheels target Linux and Windows; if `pip` reports no matching distribution, check that your platform and Python version are covered on the index. | ||
|
|
||
| Whichever path you take, choose the OpenMP threading layer and set it **before anything imports NumPy or MKL**. The variable is read once at MKL load time, so exporting it after the import has no effect. Which value to pick is explained under [Threads and NUMA](#threads-and-numa); the safe default for a typical pip or mixed environment is: |
Also applicable to SciPy.
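A minimal sketch of the ordering at stake, assuming the `GNU` value that the guide's variable table lists for mixed environments. The point is only that the assignment comes before the first import that can load MKL, and the same would hold for a SciPy import:

```python
import os

# Assumption: GNU is the value the guide lists for mixed pip environments.
# The variable must be in the environment before MKL is loaded.
os.environ["MKL_THREADING_LAYER"] = "GNU"

import numpy as np  # first import that can load MKL
# import scipy.linalg  # would also load MKL, so it belongs after the assignment too

print(os.environ["MKL_THREADING_LAYER"], np.__version__)
```

Exporting the variable in the shell before starting Python has the same effect and avoids relying on import order inside the script.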
|
|
||
| The `threading_layer` value matches `MKL_THREADING_LAYER` (`gnu`, `intel`, or `sequential`); the field that confirms the backend is `internal_api: mkl`. | ||
|
|
||
| `np.show_config()` will show `name: blas, version: 3.9.0` even with oneMKL active. That is expected: it reflects the generic interface NumPy compiled against, not the runtime library. `threadpoolctl` is the reliable check. |
These hard-coded version numbers are prone to get outdated over time.
| MKL_VERBOSE DGEMM(N,N,4096,4096,4096,...) 2.1s CNT=1 | ||
| ``` | ||
|
|
||
| If only the banner appears and no `DGEMM`/`DFFT`/`VML` lines follow, oneMKL loaded but is not being called. |
It should mention here that which of those lines show up depends on what the code is doing. Maybe it could provide a sample script with a matrix multiplication that would trigger DGEMM.
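Something along these lines could serve as that sample script. It is a sketch, not a tested addition to the guide: the function name and matrix size are made up for illustration, and it assumes NumPy is linked against oneMKL so that a 2-D `float64` product is dispatched to DGEMM.

```python
import os

# Assumption: MKL picks up MKL_VERBOSE from the environment, so it is set
# here before NumPy (and therefore MKL) is imported.
os.environ["MKL_VERBOSE"] = "1"

import numpy as np


def trigger_dgemm(n=1024):
    """Multiply two float64 matrices; with an MKL backend this calls DGEMM."""
    rng = np.random.default_rng(0)
    a = rng.random((n, n))  # float64 by default
    b = rng.random((n, n))
    return a @ b


c = trigger_dgemm()
print(c.shape)
```

With a oneMKL backend the run should emit an `MKL_VERBOSE DGEMM(...)` line; with OpenBLAS the variable is ignored and only the shape is printed. An FFT or element-wise call would be needed to see the `DFFT` or `VML` lines instead.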
|
|
||
| **The extension packages do not activate themselves.** `mkl_fft`, `mkl_random`, and `mkl_umath` do not replace NumPy functions on import. Use the patch function or context manager. Since the 2026.0 release installs the standard conda-forge NumPy rather than a bundled Intel build, there is no longer anything that activates them at build time, so explicit activation is required even in the full Intel® Distribution for Python. | ||
|
|
||
| **The activation model is release-specific; this guide targets 2026.0 and later.** The explicit `patch_*` workflow described here matches the package generation in [Benchmark results](#benchmark-results) (NumPy 2.4.3, mkl_fft 2.2.0, mkl_random 1.4.0, mkl_umath 0.4.0). Earlier releases behave differently, verified on `intelpython3_full=2025.3.0`: |
This makes it sound as if this were expected to change in the future. Maybe it could mention that it applies to versions starting with 2026.0.
| conda install -y \ | ||
| -c https://software.repos.intel.com/python/conda \ | ||
| -c conda-forge --override-channels \ | ||
| "blas=*=*_intelmkl" \ |
'blas' is a development package providing headers, .pc files, and similar, depending in turn on 'libblas'. 'libblas' is the runtime that sets the backend.
| conda install -c conda-forge _openmp_mutex=*=*_llvm | ||
| ``` | ||
|
|
||
| On Windows, `_openmp_mutex` offers Intel and LLVM variants but no GNU one, consistent with there being no GNU threading on the platform. |
| Pin `python=<version>` to match your project if you need a specific interpreter. NumPy comes from conda-forge; the Intel channel supplies the `mkl_fft`/`mkl_random`/`mkl_umath` extensions and Intel's latest oneMKL builds. To add oneMKL to an *existing* environment that already has conda-forge NumPy installed, swap its BLAS to the MKL variant and add the extensions in place (this re-links the NumPy you already have, it does not reinstall NumPy): | ||
|
|
||
| ```bash | ||
| conda install -y \ |
Very important to mention here that packages from the Intel channel are meant to be compatible with packages from conda-forge but not with packages from Anaconda, which is the default channel.
|
Repeating an earlier comment: the guide specifically mentions AVX-512 as the highest level of SIMD instructions, but that will become outdated soon as hardware with AVX10.2 gets released. |
| ```python | ||
| from threadpoolctl import threadpool_info | ||
| import pprint | ||
| pprint.pprint(threadpool_info()) |
This should be executed after importing numpy.
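For illustration, the snippet with that ordering could look like the sketch below. `blas_entries` is a hypothetical helper, not part of `threadpoolctl`; it only filters the list that `threadpool_info()` returns, on the assumption that each entry carries a `user_api` key.

```python
import pprint

import numpy as np  # import first: this loads NumPy's BLAS/LAPACK library
from threadpoolctl import threadpool_info


def blas_entries(info):
    """Keep only the entries that describe a BLAS library (hypothetical helper)."""
    return [entry for entry in info if entry.get("user_api") == "blas"]


# threadpool_info() only reports libraries already loaded in the process,
# which is why NumPy is imported above.
pprint.pprint(blas_entries(threadpool_info()))
```

Without the NumPy import, no BLAS library has been loaded yet and the list would typically come back empty, which could be misread as oneMKL being inactive.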
| | `MKL_DYNAMIC` | `FALSE` | Disable automatic thread scaling | | ||
| | `KMP_AFFINITY` | `granularity=fine,compact,1,0` | Pin threads to physical cores (Intel OpenMP only) | | ||
|
|
||
| `KMP_AFFINITY` is an Intel OpenMP setting, so it applies only when oneMKL is on the Intel runtime (`MKL_THREADING_LAYER=INTEL`); under the GNU layer use `GOMP_CPU_AFFINITY` or `numactl` instead. `KMP_AFFINITY=granularity=fine,compact,1,0` is appropriate for single-socket systems or when running one process per socket. On multi-socket systems without `numactl` it may bind threads across sockets; verify the actual binding with `KMP_AFFINITY=verbose`. |
What about OMP_PROC_BIND?
| | Variable | Recommended value | Effect | | ||
| |---|---|---| | ||
| | `MKL_THREADING_LAYER` | `GNU` (mixed env) or `INTEL` (all-Intel) | Select MKL's OpenMP runtime; see note below | | ||
| | `MKL_NUM_THREADS` | physical core count | Cap MKL thread count | |
Is this guaranteed to work as intended if you set MKL_NUM_THREADS to number of physical cores, then bind the threads to numbers from the system, but don't specify something like OMP_PLACES=threads? Wouldn't it potentially end up using hyperthreads if the system enumerates them in an interleaved order?
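One way to check this on a given Linux machine is to look at how the kernel groups logical CPUs into cores. The sketch below assumes the sysfs file `thread_siblings_list` is available; the helper names are made up for illustration.

```python
import glob


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as '0,16' or '0-1' into integers."""
    cpus = []
    for part in text.strip().split(","):
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def one_cpu_per_core(sibling_lists):
    """Return the lowest-numbered logical CPU of each physical core."""
    return sorted({min(parse_cpu_list(text)) for text in sibling_lists})


def read_sibling_lists():
    """Read every logical CPU's sibling list (Linux only; empty elsewhere)."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        with open(path) as handle:
            lists.append(handle.read())
    return lists


print(one_cpu_per_core(read_sibling_lists()))
```

If the printed list is not a contiguous run starting at 0 (for example `[0, 2, 4, ...]`), the system enumerates hyperthreads in an interleaved order, and binding to the first N logical CPUs would place two threads on the same core.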
| The speedup arrives in two parts that activate differently, and the distinction matters for the rest of this guide: | ||
|
|
||
| - **Linear algebra (BLAS and LAPACK)** turns on automatically once oneMKL is the backend. `np.dot`, `np.matmul`, and `np.linalg.*` route to it with no code change. | ||
| - **FFT, random, and vectorized math** come from three separate packages (`mkl_fft`, `mkl_random`, `mkl_umath`). These do not activate on import; you switch them on explicitly in code. |
It could link to the github repositories of those packages.
| @@ -0,0 +1,443 @@ | |||
| # Intel® Optimized NumPy with oneMKL | |||
|
|
|||
| This guide describes how to get optimal NumPy performance on Intel® processors, from Xeon® servers to AVX-capable laptops, by using Intel® oneAPI Math Kernel Library (oneMKL) as the backend for linear algebra, FFT, random number generation, and vectorized math. It covers installation, how to activate each optimization with minimal code changes, thread and NUMA tuning, and how to verify that oneMKL is active, along with measured benchmark results. | |||
| This guide describes how to get optimal NumPy performance on Intel® processors, from Xeon® servers to AVX-capable laptops, by using Intel® oneAPI Math Kernel Library (oneMKL) as the backend for linear algebra, FFT, random number generation, and vectorized math. It covers installation, how to activate each optimization with minimal code changes, thread and NUMA tuning, and how to verify that oneMKL is active, along with measured benchmark results. | |
| This guide describes how to get optimal NumPy performance on Intel® processors, from servers to laptops, by using Intel® oneAPI Math Kernel Library (oneMKL) as the backend for linear algebra, Fast Fourier Transform (FFT), random number generation, and vectorized math. It covers installation, how to activate each optimization with minimal code changes, thread and NUMA tuning, and how to verify that oneMKL is active, along with measured benchmark results. |
| @@ -0,0 +1,443 @@ | |||
| # Intel® Optimized NumPy with oneMKL | |||
| # Intel® Optimized NumPy with oneMKL | |
| # Intel® Optimized NumPy With oneMKL |
|
|
||
| This guide describes how to get optimal NumPy performance on Intel® processors, from Xeon® servers to AVX-capable laptops, by using Intel® oneAPI Math Kernel Library (oneMKL) as the backend for linear algebra, FFT, random number generation, and vectorized math. It covers installation, how to activate each optimization with minimal code changes, thread and NUMA tuning, and how to verify that oneMKL is active, along with measured benchmark results. | ||
|
|
||
| ## Table of contents |
| ## Table of contents | |
| ## Table Of Contents |
|
|
||
| --- | ||
|
|
||
| ## Where NumPy performance comes from |
| ## Where NumPy performance comes from | |
| ## NumPy Performance Contributors |
Please also update the table of contents
|
|
||
| --- | ||
|
|
||
| ## Optimization levers |
| ## Optimization levers | |
| ## Optimization Levers |
|
|
||
| NumPy runs much of its work in its own compiled code, but its heaviest numerical kernels are handed off to external libraries: linear algebra to a BLAS/LAPACK library, FFTs to an FFT library, and large element-wise transcendental math (`sin`, `exp`, `log`) to vectorized loops. For those kernels, performance is largely decided by *which* native library the call lands in. | ||
|
|
||
| That backend is a choice made at install time. PyPI and conda-forge NumPy ship with OpenBLAS, a strong general-purpose implementation that uses AVX-512 on recent Intel CPUs. oneMKL goes further in two ways: its kernels are tuned for Intel hardware, and it accelerates FFT, random number generation, and vectorized math, which a BLAS library does not cover. Both gains apply on Intel® Xeon® servers and on AVX-512-capable Intel client and laptop CPUs. |
| That backend is a choice made at install time. PyPI and conda-forge NumPy ship with OpenBLAS, a strong general-purpose implementation that uses AVX-512 on recent Intel CPUs. oneMKL goes further in two ways: its kernels are tuned for Intel hardware, and it accelerates FFT, random number generation, and vectorized math, which a BLAS library does not cover. Both gains apply on Intel® Xeon® servers and on AVX-512-capable Intel client and laptop CPUs. | |
| That backend is a choice made at install time. PyPI and conda-forge NumPy ship with OpenBLAS, a strong general-purpose implementation that uses AVX-512 on recent Intel CPUs. oneMKL goes further in two ways: its kernels are tuned for Intel hardware, and it accelerates a number of functionalities not covered by a BLAS library such as FFT, random number generation, and vectorized math. Both gains apply on Intel® Xeon® servers and on AVX-512-capable Intel client and laptop CPUs. |
|
|
||
| ## Accelerating NumPy with oneMKL | ||
|
|
||
| Intel® oneAPI Math Kernel Library (oneMKL) supplies AVX-512 implementations for every one of those backends on 3rd Gen Intel® Xeon® (Ice Lake) and newer: BLAS, LAPACK, FFT, random number generation, and vectorized math. Pointing NumPy at oneMKL is how you turn that hardware capability into wall-clock speedup, with no change to your NumPy code. Across a representative set of NumPy-heavy workloads this is a 3.95x geomean speedup at one socket; the full breakdown is in [Benchmark results](#benchmark-results). |
This paragraph seems to hit most of the same points of the previous paragraph. Please consider rewording/trimming it down.
In other parts of this article, you highlight laptop support in addition to servers. Is there a reason why you only specify Xeon support in this paragraph?
|
|
||
| The speedup arrives in two parts that activate differently, and the distinction matters for the rest of this guide: | ||
|
|
||
| - **Linear algebra (BLAS and LAPACK)** turns on automatically once oneMKL is the backend. `np.dot`, `np.matmul`, and `np.linalg.*` route to it with no code change. |
| - **Linear algebra (BLAS and LAPACK)** turns on automatically once oneMKL is the backend. `np.dot`, `np.matmul`, and `np.linalg.*` route to it with no code change. | |
| - **Linear algebra (BLAS and LAPACK)** turns on automatically once oneMKL is the backend (details for this in the [installation](#installation) section). `np.dot`, `np.matmul`, and `np.linalg.*` route to it with no code change. |
|
|
||
| ### Installation | ||
|
|
||
| There are two practical ways to get a oneMKL-backed NumPy. conda is recommended because it also lets you control the OpenMP runtime (see [Threads and NUMA](#threads-and-numa)). |
| There are two practical ways to get a oneMKL-backed NumPy. conda is recommended because it also lets you control the OpenMP runtime (see [Threads and NUMA](#threads-and-numa)). | |
| There are two practical ways to get a oneMKL-backed NumPy. conda is recommended because it also lets you control the OpenMP runtime (see [Threads and NUMA](#threads-and-numa)). The [Miniforge](https://github.com/conda-forge/miniforge) distribution is recommended. |
|
|
||
| There are two practical ways to get a oneMKL-backed NumPy. conda is recommended because it also lets you control the OpenMP runtime (see [Threads and NUMA](#threads-and-numa)). | ||
|
|
||
| **conda.** A single command installs NumPy, SciPy, the three extension packages (mkl_fft, mkl_random, mkl_umath), and the runtime libraries. The BLAS/LAPACK backend routes to oneMKL automatically; the extensions are installed but still need explicit activation. |
Please provide a link to an example of this explicit activation. Is that in the "Optimization Levers" section?
|
|
||
| The three explicit extensions share an activation model, and the key point is how little code it takes. Activation is a single one-time call: a **context manager** around a block, best when you want oneMKL for one section and stock NumPy elsewhere, or a **patch/restore pair**, best when oneMKL should stay active for the life of the process. That one call is the only addition. It redirects NumPy's internals so your existing `np.fft.*`, `np.random.*`, and `np.sin`/`np.exp`/`np.log` call sites dispatch to oneMKL with their source unchanged. | ||
|
|
||
| Concretely, given an existing function, the only edit is the import-and-activate block at the top. The function body is untouched: |
This makes it sound like the imports and activation are supposed to be at the top. Does import mkl_fft [...] have to come after the function definition, or can it come before? If it can come before, I think it will be less confusing to the reader to move it to the top of the example source code.
|
|
||
| ### Linear algebra: BLAS and LAPACK | ||
|
|
||
| This is the lever you get for free. Once oneMKL is the backend, `np.dot`, `np.matmul`, `np.linalg.*`, and everything built on them (covariances, distances, decompositions) run on oneMKL's BLAS and LAPACK with no code change and nothing to activate. These kernels dispatch at runtime to an optimized code path for the CPU's instruction set (e.g., AVX-512 on current Xeons). This is the largest single contributor to the geomean in [Benchmark results](#benchmark-results). |
| This is the lever you get for free. Once oneMKL is the backend, `np.dot`, `np.matmul`, `np.linalg.*`, and everything built on them (covariances, distances, decompositions) run on oneMKL's BLAS and LAPACK with no code change and nothing to activate. These kernels dispatch at runtime to an optimized code path for the CPU's instruction set (e.g., AVX-512 on current Xeons). This is the largest single contributor to the geomean in [Benchmark results](#benchmark-results). | |
| This is the lever you get for free. Once oneMKL is the backend, `np.dot`, `np.matmul`, `np.linalg.*`, and everything built on them (covariances, distances, decompositions) run on oneMKL's BLAS and LAPACK with no code change and nothing to activate. These kernels dispatch at runtime to an optimized code path for the CPU's instruction set (e.g., AVX-512 on current Xeons). This is the largest single contributor to the geomean speedup in [Benchmark results](#benchmark-results). |
| print(f"speedup : {stock_ms / mkl_ms:.1f}x") | ||
| ``` | ||
|
|
||
| Measured on AWS, Intel® Xeon® 6975P-C, 16 cores / 32 threads (HT on), 1 socket, Ubuntu 26.04 LTS. Numbers vary by hardware. |
Have these numbers been through PDT?
|
|
||
| def analyze(signal): | ||
| spectrum = np.fft.fft(signal) # -> numpy.fft, then mkl_fft after activation | ||
| power = np.abs(spectrum) ** 2 # -> VML after activation (large arrays) |
Please provide the full form of the abbreviation VML
|
|
||
| `mkl_random` is a Python interface to oneMKL's Vector Statistics Library (VSL). It samples from the same distributions as `numpy.random` but is not a fixed-seed drop-in: the same seed produces a different sequence. Use it when generating large volumes of random data is a bottleneck and you do not depend on reproducing specific values. | ||
|
|
||
| It can be used two ways. The **context manager** is the zero-code-change path, like the other extensions: existing `np.random.*` call sites keep working and route through VSL (shown [below](#random-number-generation-mkl_random)). The **explicit `RandomState` API** is a small code change that lets you pick the generator and the sampling method for the fastest path. The benchmark below uses it with `method='BoxMuller'`, oneMKL's fast normal sampler. |
| It can be used two ways. The **context manager** is the zero-code-change path, like the other extensions: existing `np.random.*` call sites keep working and route through VSL (shown [below](#random-number-generation-mkl_random)). The **explicit `RandomState` API** is a small code change that lets you pick the generator and the sampling method for the fastest path. The benchmark below uses it with `method='BoxMuller'`, oneMKL's fast normal sampler. | |
| It can be used two ways. The **context manager** is the zero-code-change path, like the other extensions: existing `np.random.*` call sites keep working and route through VSL (shown below). The **explicit `RandomState` API** is a small code change that lets you pick the generator and the sampling method for the fastest path. The benchmark below uses it with `method='BoxMuller'`, oneMKL's fast normal sampler. |
That link just takes the user to top of the "Random number generation" section.
| | `MKL_THREADING_LAYER` | `GNU` (mixed env) or `INTEL` (all-Intel) | Select MKL's OpenMP runtime; see note below | | ||
| | `MKL_NUM_THREADS` | physical core count | Cap MKL thread count | | ||
| | `MKL_DYNAMIC` | `FALSE` | Disable automatic thread scaling | | ||
| | `KMP_AFFINITY` | `granularity=fine,compact,1,0` | Pin threads to physical cores (Intel OpenMP only) | |
What is the breakdown of these arguments, `fine,compact,1,0`? Is there a reference we can point to?
| | `MKL_DYNAMIC` | `FALSE` | Disable automatic thread scaling | | ||
| | `KMP_AFFINITY` | `granularity=fine,compact,1,0` | Pin threads to physical cores (Intel OpenMP only) | | ||
|
|
||
| `KMP_AFFINITY` is an Intel OpenMP setting, so it applies only when oneMKL is on the Intel runtime (`MKL_THREADING_LAYER=INTEL`); under the GNU layer use `GOMP_CPU_AFFINITY` or `numactl` instead. `KMP_AFFINITY=granularity=fine,compact,1,0` is appropriate for single-socket systems or when running one process per socket. On multi-socket systems without `numactl` it may bind threads across sockets; verify the actual binding with `KMP_AFFINITY=verbose`. |
Is KMP_AFFINITY=verbose supposed to be tacked on to KMP_AFFINITY=granularity=fine,compact,1,0? What does that look like?
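As far as the Intel OpenMP syntax goes, `verbose` is a modifier that sits in the same comma-separated value, ahead of the binding type, rather than a second assignment. A sketch of what that could look like when set from Python (the constant name is made up for illustration, and the assignment has to happen before anything loads the OpenMP runtime):

```python
import os

# Binding string quoted in the guide: granularity=fine is a modifier,
# compact is the type, and 1 and 0 are the permute and offset values.
BINDING = "granularity=fine,compact,1,0"

# "verbose" is one more modifier in the same value, not a separate variable.
os.environ["KMP_AFFINITY"] = "verbose," + BINDING

print(os.environ["KMP_AFFINITY"])
```

With the Intel OpenMP runtime in use, the thread-to-CPU binding is then reported when the first parallel region starts. Under the GNU threading layer the variable has no effect.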
|
|
||
| --- | ||
|
|
||
| ## Verifying oneMKL is active |
| ## Verifying oneMKL is active | |
| ## Verifying oneMKL Is Active |
|
|
||
| --- | ||
|
|
||
| ## Benchmark results |
Have these numbers all been through PDT?
|
|
||
| **`mkl_random` is not a drop-in for `numpy.random`.** The same seed produces a different sequence. Do not swap it into code that depends on reproducible random values. | ||
|
|
||
| **AMX does not apply to standard NumPy operations.** NumPy's `float32` and `float64` operations use oneMKL's AVX-512 code paths. AMX tiles only activate for bfloat16 GEMM, which NumPy does not call natively. |
| **AMX does not apply to standard NumPy operations.** NumPy's `float32` and `float64` operations use oneMKL's AVX-512 code paths. AMX tiles only activate for bfloat16 GEMM, which NumPy does not call natively. | |
| **AMX does not apply to standard NumPy operations, even when using oneMKL.** NumPy's `float32` and `float64` operations use oneMKL's AVX-512 code paths. AMX tiles only activate for int8/bfloat16 GEMM, which NumPy does not call natively. |
Is that correction accurate? My understanding is that AMX accelerates int8/bf16 GEMMs and even with oneMKL, NumPy operations will not use AMX.

Adds a new tuning guide documenting how to run NumPy with Intel® oneMKL-backed performance (BLAS/LAPACK plus optional FFT/random/umath patching), and links it from the repository’s main README.
Changes:
CC @xaleryb @jharlow-intel @napetrov for additional review