Commit c3b7edc

Merge pull request #3162 from ArmDeveloperEcosystem/main
Prod update
2 parents 3602c51 + 58f839d commit c3b7edc

31 files changed

Lines changed: 882 additions & 98 deletions

content/learning-paths/cross-platform/multiplying-matrices-with-sme2/1-get-started.md

Lines changed: 1 addition & 0 deletions
@@ -319,6 +319,7 @@ These Android phones support SME2 natively.
 |-------------------------------------|--------------|---------------------------|
 | Vivo X300 | 2025 | MediaTek Dimensity 9500 featuring an 8-core Arm C1 CPU cluster and Arm G1-Ultra GPU |
 | OPPO Find X9 | 2025 | MediaTek Dimensity 9500 featuring an 8-core Arm C1 CPU cluster and Arm G1-Ultra GPU |
+| Samsung Galaxy S26 | 2026 | Exynos 2600 variant |

 These Apple devices support SME2 natively.

content/learning-paths/cross-platform/simd-loops/1-about.md

Lines changed: 4 additions & 4 deletions
@@ -6,11 +6,11 @@ weight: 2
 layout: learningpathall
 ---

-## Introduction to SIMD on Arm and why it matters for performance on Arm CPUs
+## Introduction to SIMD on Arm

-Writing high-performance software on Arm often means using single-instruction, multiple-data (SIMD) technologies. Many developers start with Neon, a familiar fixed-width vector extension. As Arm architectures evolve, so do the SIMD capabilities available to you.
+Writing high-performance software on Arm often means using single-instruction, multiple-data (SIMD) technologies. Many developers start with Neon, a familiar fixed-width vector extension. As Arm architectures evolve, the SIMD capabilities available to you also expand.

-This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. They are two powerful, scalable vector extensions designed for modern workloads. Unlike Neon, these architecture extensions are not just wider; they are fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.
+This Learning Path uses the Scalable Vector Extension (SVE) and the Scalable Matrix Extension (SME) to demonstrate modern SIMD patterns. These are two powerful, scalable vector extensions designed for modern workloads. Unlike Neon, these architecture extensions aren't just wider; they're fundamentally different. They introduce predication, vector-length-agnostic (VLA) programming, gather/scatter, streaming modes, and tile-based compute with ZA state. The result is more power and flexibility, but there can be a learning curve to match.

 ## What is the SIMD Loops project?

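The introduction text changed in this hunk names predication and vector-length-agnostic (VLA) programming as features that set SVE apart from Neon. As a rough illustration only, the C sketch below shows both in one loop using ACLE intrinsics from `arm_sve.h`. The function name `scale_add_vla` is hypothetical and is not taken from SIMD Loops; the scalar branch is a fallback added here so the sketch also builds where SVE is not enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper (not from SIMD Loops): y[i] += a * x[i] for i in [0, n). */
void scale_add_vla(float a, const float *x, float *y, size_t n) {
    assert(n == 0 || (x != NULL && y != NULL));
#if defined(__ARM_FEATURE_SVE)
    /* VLA: the number of 32-bit lanes per vector is queried at run time. */
    const uint64_t step = svcntw();
    for (uint64_t i = 0; i < n; i += step) {
        /* Predication: lanes at or beyond index n are inactive. */
        svbool_t pg = svwhilelt_b32_u64(i, (uint64_t)n);
        svfloat32_t vx = svld1_f32(pg, x + i);
        svfloat32_t vy = svld1_f32(pg, y + i);
        vy = svmla_n_f32_m(pg, vy, vx, a);   /* vy + vx * a on active lanes */
        svst1_f32(pg, y + i, vy);
    }
#else
    /* Scalar fallback when the compiler is not targeting SVE. */
    for (size_t i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
#endif
}
```

When built with SVE enabled (for example with `-march=armv8-a+sve` on GCC or Clang), the loop makes no assumption about the hardware vector length, and the predicate produced by `svwhilelt` masks off the tail, so no separate remainder loop is needed.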
@@ -30,6 +30,6 @@ The project includes:
 - A simple command-line runner to execute any loop interactively
 - Optional standalone binaries for bare-metal and simulator use

-You do not need to rely on auto-vectorization or guess at compiler flags. Each loop is handwritten and annotated to make the intended use of SIMD features clear. Study a kernel, modify it, rebuild, and observe the effect - this is the core learning loop.
+You don't need to rely on auto-vectorization or guess at compiler flags. Each loop is handwritten and annotated to make the intended use of SIMD features clear. Study a kernel, modify it, rebuild, and observe the effect - this is the core learning loop.


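The text in this file describes a learning loop of studying a kernel, modifying it, rebuilding, and observing the effect. A minimal sketch of that pattern, assuming a kernel such as an FP32 inner product, pairs a scalar C reference with one optimized variant and checks one against the other. All function names below are hypothetical, and the variant is selected with the compiler-predefined macros `__aarch64__` and `__ARM_NEON` rather than with the project's own build flags.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar C reference used to validate the optimized variant. */
float inner_product_ref(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Optimized variant: Neon on AArch64, otherwise falls back to the reference. */
float inner_product_opt(const float *a, const float *b, size_t n) {
#if defined(__aarch64__) && defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* acc += a[i..i+3] * b[i..i+3], four lanes at a time */
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    float sum = vaddvq_f32(acc);      /* horizontal add of the four lanes */
    for (; i < n; i++) {              /* scalar tail for leftover elements */
        sum += a[i] * b[i];
    }
    return sum;
#else
    return inner_product_ref(a, b, n);
#endif
}

/* Returns 1 when the optimized result is within tol of the reference. */
int inner_product_matches(const float *a, const float *b, size_t n, float tol) {
    assert(tol >= 0.0f);
    float diff = inner_product_opt(a, b, n) - inner_product_ref(a, b, n);
    if (diff < 0.0f) {
        diff = -diff;
    }
    return diff <= tol;
}
```

Because the vector variant accumulates in a different order and uses fused multiply-add, its result can differ from the reference in the last bits for general inputs, which is why the check takes a tolerance instead of testing for exact equality.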
content/learning-paths/cross-platform/simd-loops/2-using.md

Lines changed: 12 additions & 10 deletions
@@ -27,7 +27,7 @@ Expected output on Linux:
 aarch64
 ```

-Expected output on macOS:
+On macOS, the expected output is:

 ```output
 arm64
@@ -86,45 +86,45 @@ Each loop is implemented in several SIMD extension variants. Conditional compila

 The native C implementation is written first, and it can be generated either when building natively with `-DHAVE_NATIVE` or through compiler auto-vectorization with `-DHAVE_AUTOVEC`.

-When SIMD ACLE is supported (SME, SVE, or Neon), the code is compiled using high-level intrinsics. If ACLE support is not available, the build process falls back to handwritten inline assembly targeting one of the available SIMD extensions, such as SME2.1, SME2, SVE2.1, SVE2, and others.
+When SIMD ACLE is supported (SME, SVE, or Neon), the code is compiled using high-level intrinsics. If ACLE support isn't available, the build process falls back to handwritten inline assembly targeting one of the available SIMD extensions, such as SME2.1, SME2, SVE2.1, SVE2, and others.

 The overall code structure also includes setup and cleanup code in the main function, where memory buffers are allocated, the selected loop kernel is executed, and results are verified for correctness.

-At compile time, you can select which loop optimization to compile, whether it is based on SME or SVE intrinsics, or one of the available inline assembly variants.
+At compile time, you can select which loop optimization to compile, whether it's based on SME or SVE intrinsics, or one of the available inline assembly variants.
+
+To compile the project, run make in the project directory:

 ```console
 make
 ```

-With no target specified, the list of targets is printed:
+With no target specified, the output shows the list of available targets:

 ```output
 all fmt clean c-scalar scalar autovec-sve autovec-sve2 neon sve sve2 sme2 sme-ssve sve2p1 sme2p1 sve-intrinsics sme-intrinsics
 ```

-Build all loops for all targets:
+To build all loops for all targets, run:

 ```console
 make all
 ```

-Build all loops for a single target, such as Neon:
+To build all loops for a single target, such as Neon, run:

 ```console
 make neon
 ```

 As a result of the build, two types of binaries are generated.

-The first is a single executable named `simd_loops`, which includes all loop implementations.
-
-Select a specific loop by passing parameters to the program. For example, to run loop 1 for 5 iterations using the Neon target:
+To select a specific loop, pass parameters to the program. For example, to run loop 1 for 5 iterations using the Neon target:

 ```console
 build/neon/bin/simd_loops -k 1 -n 5
 ```

-Example output:
+The expected output is:

 ```output
 Loop 001 - FP32 inner product
@@ -140,6 +140,8 @@ To run loop 1 as a standalone binary:
 build/neon/standalone/bin/loop_001.elf
 ```

+The expected output is
+
 Example output:

 ```output

content/learning-paths/cross-platform/simd-loops/3-example.md

Lines changed: 3 additions & 5 deletions
@@ -229,8 +229,6 @@ For instruction semantics and SME/SME2 optimization guidance, see the [SME Progr

 Beyond the SME2 and SVE implementations, this loop also includes additional optimized versions that leverage architecture-specific features:

-- **Neon**: the Neon version (lines 612–710) uses structure load/store combined with indexed `fmla` to vectorize the computation.
-
-- **SVE2.1**: the SVE2.1 version (lines 355–462) extends the base SVE approach using multi-vector loads and stores.
-
-- **SME2.1**: the SME2.1 version uses `movaz`/`svreadz_hor_za8_u8_vg4` to reinitialize `ZA` tile accumulators while moving data out to registers.
+- **Neon**: the Neon version (lines 612–710) uses structure load/store combined with indexed `fmla` to vectorize the computation
+- **SVE2.1**: the SVE2.1 version (lines 355–462) extends the base SVE approach using multi-vector loads and stores
+- **SME2.1**: the SME2.1 version uses `movaz`/`svreadz_hor_za8_u8_vg4` to reinitialize `ZA` tile accumulators while moving data out to registers

content/learning-paths/cross-platform/simd-loops/4-conclusion.md

Lines changed: 3 additions & 3 deletions
@@ -1,16 +1,16 @@
 ---
-title: How to learn with SIMD Loops
+title: Learning with SIMD Loops
 weight: 5

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-## Bridging the gap between specs and real code
+## Bridging the gap between specifications and real code

 SIMD Loops is a practical way to learn the intricacies of SVE and SME across modern Arm architectures. By providing small, runnable loop kernels with reference code and optimized variants, it closes the gap between architectural specifications and real applications.

-Whether you are moving from Neon or starting directly with SVE and SME, the project offers:
+Whether you're moving from Neon or starting directly with SVE and SME, the project offers:
 - A broad catalog of kernels that highlight specific features (predication, VLA programming, gather/scatter, streaming mode, ZA tiles)
 - Clear, readable implementations in C, ACLE intrinsics, and selected inline assembly
 - Flexible build targets and a simple runner to execute and validate loops

content/learning-paths/cross-platform/simd-loops/_index.md

Lines changed: 14 additions & 5 deletions
@@ -1,20 +1,21 @@
 ---
-title: "Code kata: perfect your SVE and SME skills with SIMD Loops"
+title: Learn SVE and SME programming with SIMD Loops
+
+description: Learn how to write high-performance SIMD code using the SIMD Loops project, with hands-on examples demonstrating SVE, SVE2, and SME2 features on Arm processors.

 minutes_to_complete: 30

 who_is_this_for: This is an advanced topic for software developers who want to learn how to use the full range of features available in SVE, SVE2, and SME2 to improve software performance on Arm processors.

 learning_objectives:
 - Improve SIMD code performance using Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME)
-- Describe what SIMD Loops contains and how kernels are organized across scalar, Neon, SVE,SVE2, and SME2 variants
+- Describe what SIMD Loops contains and how kernels are organized across scalar, Neon, SVE, SVE2, and SME2 variants
 - Build and run a selected kernel with the provided runner and validate correctness against the C reference
 - Choose the appropriate build target to compare Neon, SVE/SVE2, and SME2 implementations

-
 prerequisites:
-- An AArch64 computer running Linux or macOS. You can use cloud instances, refer to [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/) for a list of cloud service providers.
-- Some familiarity with SIMD programming and Neon intrinsics.
+- An AArch64 computer running Linux or macOS. You can use cloud instances, refer to [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/) for a list of cloud service providers
+- Some familiarity with SIMD programming and Neon intrinsics
 - Recent toolchains that support SVE/SME (GCC 13+ or Clang 16+ recommended)

 author:
@@ -48,6 +49,14 @@ further_reading:
 title: SVE Programming Examples
 link: https://developer.arm.com/documentation/dai0548/latest
 type: documentation
+- resource:
+title: SIMD Loops Repository
+link: https://gitlab.arm.com/architecture/simd-loops
+type: documentation
+- resource:
+title: Scalable Vector Extensions Resources
+link: https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions
+type: documentation
 - resource:
 title: Port Code to Arm Scalable Vector Extension (SVE)
 link: /learning-paths/servers-and-cloud-computing/sve

content/learning-paths/cross-platform/simd-on-rust/_index.md

Lines changed: 7 additions & 4 deletions
@@ -1,17 +1,20 @@
 ---
-title: Learn how to write SIMD code on Arm using Rust
+title: Write SIMD code on Arm using Rust

 minutes_to_complete: 30

 description: Learn how to write SIMD code in Rust on Arm platforms using Neon intrinsics, portable SIMD abstractions, and optimize performance with architecture-specific instructions.

-who_is_this_for: This is an advanced topic for software developers who want take advantage of SIMD code on Arm systems using Rust.
+who_is_this_for: This is an advanced topic for software developers who want to take advantage of SIMD code on Arm systems using Rust.

 learning_objectives:
-- Learn how to write SIMD code with Rust on Arm.
+- Write SIMD code with Rust using std::arch and Neon intrinsics on Arm
+- Use portable SIMD abstractions with std::simd for cross-platform code
+- Apply feature detection and target attributes for architecture-specific optimizations
+- Compare C and Rust SIMD implementations and disassembly output

 prerequisites:
-- An Arm-based computer with recent versions of a C compiler (Clang or GCC) and a Rust compiler installed.
+- An Arm-based computer with recent versions of a C compiler (Clang or GCC) and a Rust compiler installed

 author: Konstantinos Margaritis

content/learning-paths/cross-platform/simd-on-rust/conclusion.md

Lines changed: 1 addition & 1 deletion
@@ -8,5 +8,5 @@ layout: learningpathall

 You have now seen a few examples of writing SIMD code on Arm with Rust.

-Performance-wise, there is little difference between C and Rust as Rust is perfectly capable of generating the same assembly code as C in most cases. That said, if you want to program optimal SIMD code using the Arm ASIMD/Neon intrinsics, `std::arch` is the most obvious choice. If, however, your approach needs to be as portable as possible and you don't want to spend time providing multiple implementations for each architecture then `std::simd` is a very viable alternative (even though it's not part of the stable compiler yet).
+Performance-wise, there's little difference between C and Rust as Rust is perfectly capable of generating the same assembly code as C in most cases. That said, if you want to program optimal SIMD code using the Arm ASIMD/Neon intrinsics, `std::arch` is the most obvious choice. If, however, your approach needs to be as portable as possible and you don't want to spend time providing multiple implementations for each architecture then `std::simd` is a very viable alternative (even though it's not part of the stable compiler yet).

content/learning-paths/cross-platform/simd-on-rust/intro-to-rust.md

Lines changed: 11 additions & 11 deletions
@@ -12,10 +12,10 @@ In this Learning Path, you will learn the basics of how to program SIMD code on

 Rust is a safe programming language with some key advantages:

-* It is a modern, strong-typed language.
-* Rust is memory safe by design: it is very difficult to introduce a bug like buffer overflow with Rust.
-* Strict language: the Rust compiler is very strict and does not let you make easy mistakes as you might with C.
-* The usage and support for Rust is expanding to many architectures and operating systems.
+* It's a modern, strong-typed language
+* Rust is memory safe by design: it's very difficult to introduce a bug like buffer overflow with Rust
+* Strict language: the Rust compiler is very strict and doesn't let you make easy mistakes as you might with C
+* The usage and support for Rust is expanding to many architectures and operating systems

 ## SIMD with Rust

@@ -24,19 +24,19 @@ Support for intrinsics in languages such as C and C++ is generally added by the
 Rust is a little different in that regard. While vendors are still very involved in providing the support for SIMD intrinsics in the compiler, there are other alternatives and approaches used to provide SIMD abstraction.

 Currently there are 2 SIMD programming interfaces in Rust:
-* One under `std::arch` which follows the C intrinsics as much as possible.
-* Another, `std::simd`, which provides a portable abstraction to SIMD programming so that code can just be recompiled across different architectures with more or less the same results. While there are similar libraries for C and C++, this is different in that the intent is for it to be merged as an official extension to the Rust standard library under `std::simd`.
+* One under `std::arch` which follows the C intrinsics as much as possible
+* Another, `std::simd`, which provides a portable abstraction to SIMD programming so that code can be recompiled across different architectures with more or less the same results. While there are similar libraries for C and C++, this is different in that the intent is for it to be merged as an official extension to the Rust standard library under `std::simd`

 You will learn how to use both of these interfaces to write code that uses Advanced SIMD/Neon instructions on an Arm CPU.

 Before you start, make sure you have the [Rust compiler installed](/install-guides/rust).

-You can check if you have a working `rustc` compiler installed by running the following command:
+To check if you have a working `rustc` compiler installed, run the following command:

 ```bash
 rustc --version
 ```
-Your output should look similar to the following:
+The output should look similar to:

 ```bash
 rustc 1.79.0 (129f3b996 2024-06-10)
@@ -50,15 +50,15 @@ Switch to the `nightly` version to `rustc` by running the following:
 rustup default nightly
 ```

-Now run the version command again to check if you have the right version:
+To check the version again, run:

 ```bash
 rustc --version
 ```
-Your output should now look similar to the following:
+The output should now look similar to:

 ```bash
 rustc 1.82.0-nightly (92c6c0380 2024-07-21)
 ```

-Now that you have a working Rust compiler with the features supported in the nightly version, you can continue with building and running the examples included in this learning path. Please note that the code examples in this learning path are not optimally written for Rust (to do that you would have to use `cargo`, find the proper `crates` to do specific tasks, for example for 2D arrays, which would increase the complexity of this learning path significantly).
+Now that you have a working Rust compiler with the features supported in the nightly version, you can continue with building and running the examples included in this Learning Path. The code examples in this Learning Path aren't optimally written for Rust (to do that you would have to use `cargo`, find the proper `crates` to do specific tasks, for example for 2D arrays, which would increase the complexity of this Learning Path significantly).
