The programming examples are a number of sample designs that further help explain many of the unique features of AI Engines and the NPU array in Ryzen™ AI.
The passthrough example is the simplest "getting started" example. It copies 4096 bytes from the input to the output using vectorized loads and stores. The example also shows a typical project organization that is easy to reproduce in other designs. Only four files really matter here:

- `passthrough_kernel.py`: The AIE structural design, which includes the shim tile connected to external memory and a single AIE core that performs the copy. It also shows a simple use of the Object FIFOs described in section 2.
- `passThrough.cc`: A C++ file that performs the vectorized copy operation.
- `test.cpp` or `test.py`: A C++ or Python main application that exercises the design and compares the result against a CPU reference.
- `Makefile`: A Makefile documenting (and implementing) the build process for the various artifacts.

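The copy-and-compare flow described above can be sketched on the host in plain Python. This is a minimal sketch, not the code in `passThrough.cc` or `test.py`: the 64-byte chunk width and the helper names `passthrough` and `count_mismatches` are assumptions chosen for illustration, and the chunked slice copy only stands in for the vectorized loads and stores that run on the AIE core.

```python
import os

BUFFER_BYTES = 4096  # size copied by the passthrough example
CHUNK_BYTES = 64     # assumed chunk width, for illustration only


def passthrough(src: bytes, chunk_bytes: int = CHUNK_BYTES) -> bytes:
    """Copy src into a new buffer one fixed-width chunk at a time."""
    if len(src) % chunk_bytes != 0:
        raise ValueError("buffer length must be a multiple of the chunk width")
    dst = bytearray(len(src))
    for offset in range(0, len(src), chunk_bytes):
        chunk = src[offset:offset + chunk_bytes]    # stands in for a vector load
        dst[offset:offset + chunk_bytes] = chunk    # stands in for a vector store
    return bytes(dst)


def count_mismatches(reference: bytes, result: bytes) -> int:
    """Return the number of differing bytes; 0 means the copy is correct."""
    if len(reference) != len(result):
        raise ValueError("buffers must have the same length")
    return sum(1 for a, b in zip(reference, result) if a != b)


if __name__ == "__main__":
    data = os.urandom(BUFFER_BYTES)
    errors = count_mismatches(data, passthrough(data))
    print("PASS" if errors == 0 else f"FAIL: {errors} mismatches")
```

Changing `BUFFER_BYTES` in a sketch like this is the host-side half of copying more or less data; the structural design and kernel would need matching sizes.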
The passthrough DMAs example shows an alternate method of performing a copy without involving the cores, and instead performing a loopback.
| Design name | Data type | Description |
|---|---|---|
| Vector Scalar Add | i32 | Adds 1 to every element in vector |
| Vector Scalar Mul | i32 | Returns a vector multiplied by a scale factor |
| Vector Vector Add | i32 | Returns a vector summed with another vector |
| Vector Vector Modulo | i32 | Returns vector % vector |
| Vector Vector Multiply | i32 | Returns a vector multiplied by a vector |
| Vector Reduce Add | bfloat16 | Returns the sum of all elements in a vector |
| Vector Reduce Max | bfloat16 | Returns the maximum of all elements in a vector |
| Vector Reduce Min | bfloat16 | Returns the minimum of all elements in a vector |
| Vector Exp | bfloat16 | Returns a vector representing $e^x$ of the inputs |
| DMA Transpose | i32 | Transposes a matrix with the Shim DMA using `npu_dma_memcpy_nd` |
| Matrix Scalar Add | i32 | Returns a matrix with a scalar added to every element |
| Single core GEMM | bfloat16 | A single core matrix-matrix multiply |
| Multi core GEMM | bfloat16 | A matrix-matrix multiply using 16 AIEs with operand broadcast. Uses a simple "accumulate in place" strategy |
| GEMV | bfloat16 | A matrix-vector multiply returning a vector |
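
To build intuition for how a DMA can transpose data without involving a core, the sketch below walks a row-major buffer with a generic sizes-and-strides access pattern. It is a host-side illustration only: `strided_offsets` and `transpose_via_strides` are hypothetical helpers, and their arguments are not the signature of `npu_dma_memcpy_nd`.

```python
from itertools import product


def strided_offsets(sizes, strides):
    """Yield linear element offsets for a nested loop, outermost dimension first."""
    for index in product(*(range(size) for size in sizes)):
        yield sum(i * stride for i, stride in zip(index, strides))


def transpose_via_strides(flat, rows, cols):
    """Read a row-major rows x cols buffer in column-major order.

    The outer loop steps across columns (stride 1) and the inner loop steps
    down rows (stride cols), so the output is the transposed matrix stored
    row-major as cols x rows.
    """
    if len(flat) != rows * cols:
        raise ValueError("buffer length does not match rows * cols")
    return [flat[offset] for offset in strided_offsets([cols, rows], [1, cols])]


if __name__ == "__main__":
    matrix = [0, 1, 2,
              3, 4, 5]  # 2 x 3, row-major
    print(transpose_via_strides(matrix, rows=2, cols=3))
```

The point of the sketch is that a transpose needs no arithmetic at all, only a different order of addresses, which is why it can be handled entirely by the data movement hardware.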

| Design name | Data type | Description |
|---|---|---|
| Eltwise Add | bfloat16 | An element by element addition of two vectors |
| Eltwise Mul | i32 | An element by element multiplication of two vectors |
| ReLU | bfloat16 | Rectified linear unit (ReLU) activation function on a vector |
| Softmax | bfloat16 | Softmax operation on a matrix |
| Conv2D | i8 | A single core 2D convolution for CNNs |
| Conv2D+ReLU | i8 | A Conv2D with a ReLU fused at the vector register level |
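
Several of these kernels operate on bfloat16, so a CPU reference used for comparison has to account for the reduced precision. The sketch below is one way to do that on the host in plain Python, under stated assumptions: `to_bfloat16`, `relu_reference`, `eltwise_add_reference`, and `close_enough` are hypothetical helpers rather than functions from the example testbenches, rounding is round-to-nearest-even on the float32 bit pattern, and NaN payloads get no special handling.

```python
import math
import struct


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value and return it as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that bfloat16 discards.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000  # keep sign, exponent, and the top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def relu_reference(values):
    """ReLU on bfloat16-rounded inputs."""
    return [max(0.0, to_bfloat16(v)) for v in values]


def eltwise_add_reference(a, b):
    """Element-by-element addition with inputs and result rounded to bfloat16."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return [to_bfloat16(to_bfloat16(x) + to_bfloat16(y)) for x, y in zip(a, b)]


def close_enough(expected, actual, rel_tol=2.0 ** -7):
    """Compare two vectors with a tolerance of about one bfloat16 mantissa step."""
    return len(expected) == len(actual) and all(
        math.isclose(e, a, rel_tol=rel_tol, abs_tol=1e-6)
        for e, a in zip(expected, actual)
    )


if __name__ == "__main__":
    print(relu_reference([-1.0, 0.5, 2.0]))
    print(to_bfloat16(3.14159))
```

The tolerance in `close_enough` is an assumption; a real testbench may choose a tighter or looser bound depending on how the kernel accumulates.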

- Can you modify the passthrough design to copy more (or less) data?
- Take a look at the testbench in our Vector Exp example `test.cpp`. Take note of the data type and the size of the test vector. What do you notice?
- What is the communication-to-computation ratio in ReLU?
- **HARD** Which basic example is a component in Softmax?