When we program the AIE-array, we need to declare and configure its structural building blocks: compute tiles for vector processing, memory tiles as larger level-2 shared scratchpads, and shim tiles supporting data movement to NPU-external memory (i.e., main memory). In this programming guide, we will utilize the IRON Python library, which allows us to describe our overall NPU design, including selecting which AI Engine tiles we wish to use, what code each tile should run, how to move data between tiles, and how our design can be invoked from the CPU-side. Later on, we will explore vector programming in C/C++, which will be useful for optimizing computation kernels for individual compute tiles.
Let's first look at a basic Python source file (named aie2.py) for an IRON design at the highest level of abstraction:
At the top of this Python source, we include modules that define the IRON libraries aie.iron for high-level abstraction constructs and target architecture aie.iron.device.
from aie.iron import Program, Runtime, Worker, Buffer
from aie.iron.device import NPU1Col1, Tile

Data movement inside the AIE-array is usually also declared at this step; however, that part of the design configuration has its own dedicated section and is not covered in detail here.
# Dataflow configuration
# described in a future section of the guide...

In the AIE array, computational kernels are run on compute tiles, which are represented by Workers.
A Worker takes as input a routine to run, and the list of arguments needed to run it. The Worker class is defined below and can be found in worker.py. The Worker can be explicitly placed on a tile in the AIE array or its placement can be left to the compiler, as is explained further in this section. Finally, the while_true input is set to True by default as Workers typically run continuously once the design is started.
class Worker(ObjectFifoEndpoint):
    def __init__(
        self,
        core_fn: Callable | None,
        fn_args: list = [],
        tile: Tile = AnyComputeTile,
        while_true: bool = True,
    )

In our simple design there is only one Worker, which performs the core_fn routine. The compute routine iterates over a data buffer and initializes each entry to zero. The compute routine in this case has no inputs other than a handle to the buffer. As we will see in the next section of the guide, computational tasks usually run on data that is brought into the AIE array from external memory, and the output produced is sent back out. Note that in this example design the Worker is explicitly placed on a compute tile with coordinates (0,2) in the AIE array.
buffer = Buffer(data_ty, name="buff")
# Task for the worker to perform
def core_fn(buff):
    for i in range_(data_size):
        buff[i] = 0
# Create a worker to perform the task
my_worker = Worker(core_fn, [buffer], tile=Tile(0, 2), while_true=False)

NOTE 1: Did you notice the underscore in range_? Although IRON makes NPU designs look mostly like normal Python programs, it is important to understand that the code you write here is not directly executed on the NPU; instead, the code you write in an IRON design generates other code (metaprogramming), kind of like if you wrote a print-statement with a string of code inside. Our toolchain then compiles this generated code, and it can then run directly on the NPU.

All of this means that if you wrote range instead of range_ in the example above, the resulting generated NPU code would contain many buff[i] = 0 instructions, but no loop at all (the loop is "unrolled", which can lead to a large binary and means the number of loop iterations must be fixed at code-generation time). On the other hand, when you use range_, Python only executes the loop body once (to collect the instructions contained therein), then emits a loop into the NPU code. The NPU then executes the loop. The same applies to other branching constructs like if; using Python's native construct will mean no actual branches are emitted for the NPU code!
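The difference between the two loop forms can be imitated in plain Python. The following is a toy sketch only: emitted, Sym, store, and toy_range_ are hypothetical names invented for this illustration and are not part of IRON. It assumes that "generating code" simply means appending lines of text to a list.

```python
# Toy illustration only: none of these names belong to IRON.
emitted = []  # stands in for the generated NPU program text


class Sym:
    """Symbolic loop index: a name whose value is known only when the generated code runs."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name


def store(index, value):
    """Record a store instruction instead of performing it."""
    emitted.append(f"buff[{index}] = {value}")


def toy_range_(n):
    """Emit one loop header, let Python run the body once, then indent the body."""
    emitted.append(f"for i in 0..{n}:")
    start = len(emitted)
    yield Sym("i")
    # Indent whatever the body emitted so it sits inside the emitted loop.
    emitted[start:] = ["    " + line for line in emitted[start:]]


# Native range: Python runs the body 3 times, so 3 stores are emitted.
for i in range(3):
    store(i, 0)
unrolled = list(emitted)

emitted.clear()

# toy_range_: Python runs the body once; the loop itself is emitted.
for i in toy_range_(3):
    store(i, 0)
looped = list(emitted)

print(unrolled)
print(looped)
```

With the native range, the generated text contains one store per iteration and no loop. With toy_range_, it contains a single loop header followed by one store that uses the symbolic index.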
NOTE 2: The Worker in the code above is instantiated with while_true=False. By default, this attribute is set to True, in which case the kernel code expressed by the task will be wrapped in a for loop that iterates until sys.maxsize with a step of one. This simulates a while(True) with the intention to loop over the code in the Worker infinitely. Depending on the task code, such as when creating a local buffer with a unique name, this can cause compiler issues.
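As a rough picture of what this wrapping means, the sketch below treats a task as a list of text lines and optionally wraps them in a loop bounded by sys.maxsize. The helper wrap_task and the make_buffer call inside the task text are hypothetical and exist only for this illustration; the real wrapping is done by IRON during code generation and may differ in detail.

```python
import sys


# Toy illustration only: wrap_task is a hypothetical helper, not an IRON function.
def wrap_task(body_lines, while_true=True):
    """Return the task text, optionally wrapped in a practically endless for loop."""
    if not while_true:
        return list(body_lines)
    header = f"for _ in range(0, {sys.maxsize}, 1):"
    return [header] + ["    " + line for line in body_lines]


# The task text is made up; it only marks where a buffer creation would sit.
task = ['buff = make_buffer("buff")', "buff[0] = 0"]

once = wrap_task(task, while_true=False)
forever = wrap_task(task)

print("\n".join(once))
print("\n".join(forever))
```

In the wrapped form, every statement of the task, including the one that creates the buffer, ends up inside the loop body. This is one way to picture why some task code is better run with while_true=False.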
As mentioned earlier, the data movement between Workers needs to be configured. This does not include data movement to/from the AIE array, which is handled inside the Runtime sequence. The programming guide has a dedicated section for runtime data movement. In this example, as we do not look in depth at data movement configuration, the runtime sequence will only start the Worker.
# Runtime operations to move data to/from the AIE-array
rt = Runtime()
with rt.sequence(data_ty, data_ty, data_ty) as (_, _, _):
    rt.start(my_worker)

All the components are tied together into a Program, which represents all design information needed to run the design on a device. The program emits aie.logical_tile ops for unplaced tiles, and the --aie-place-tiles compiler pass assigns physical tile coordinates during compilation. Finally, the program is printed to produce the corresponding MLIR definitions from the IRON library and Python language bindings.
# Create the program from the device type and runtime
my_program = Program(NPU1Col1(), rt)
# Generate an MLIR module
module = my_program.resolve_program()
# Print the generated MLIR
print(module)

NOTE: All components described or mentioned above inherit from the resolvable interface, which defers the creation of MLIR operations until their resolve() function is called. That is the task of the resolve_program() function of the Program, which will raise an error if one of the IRON classes does not have enough information to generate its MLIR equivalent.
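The deferred-creation pattern described in this note can be sketched in plain Python. The classes below (ToyResolvable, ToyBuffer, ToyProgram) are hypothetical stand-ins written for this illustration. They are not the IRON classes, and the generated "text" is a placeholder rather than MLIR.

```python
from abc import ABC, abstractmethod


# Toy illustration only: these classes imitate the deferred-resolution idea
# and are not the IRON implementation.
class ToyResolvable(ABC):
    @abstractmethod
    def resolve(self):
        """Return the generated text for this component, or raise if incomplete."""


class ToyBuffer(ToyResolvable):
    def __init__(self, name, size=None):
        self.name = name
        self.size = size  # may be filled in later, before resolve() is called

    def resolve(self):
        if self.size is None:
            raise ValueError(f"{self.name}: size unknown, cannot resolve")
        return f"buffer {self.name}[{self.size}]"


class ToyProgram:
    def __init__(self, components):
        self.components = components

    def resolve_program(self):
        # Nothing is generated until this point, so errors surface here.
        return "\n".join(c.resolve() for c in self.components)


buff = ToyBuffer("buff")  # constructing the object generates nothing
program = ToyProgram([buff])

try:
    program.resolve_program()
    error = None
except ValueError as exc:
    error = str(exc)

buff.size = 48  # supply the missing information
text = program.resolve_program()

print(error)
print(text)
```

Building the objects never fails. Only the call to resolve_program() reports that a component is still missing information.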
IRON also enables users to describe their design at the tile level of granularity where components are explicitly placed on AIE tiles using coordinates. Let's again look through a basic Python source file (named aie2_placed.py) for an IRON design at this level.
At the top of this Python source, we include modules that define the IRON AIE libraries aie.dialects.aie and the mlir-aie context aie.extras.context, which binds to MLIR definitions for AI Engines.
import numpy as np  # used for the tensor type declared later in this section

from aie.dialects.aie import *  # primary mlir-aie dialect definitions
from aie.extras.context import mlir_mod_ctx  # mlir-aie context

Then we declare a structural design function that will expand into MLIR code when it is called from within an mlir-aie context (see the last part of this subsection).
# AI Engine structural design function
def mlir_aie_design():
    <... AI Engine device, blocks, and connections ...>

Let's look at how we declare the AI Engine device, blocks, and connections. We start off by declaring our AIE device via @device(AIEDevice.npu1) or @device(AIEDevice.npu2). The blocks and connections themselves will then be declared inside the def device_body():. Here, we instantiate our AI Engine blocks, which are AIE compute tiles in this first example.
The arguments for the tile declaration are the tile coordinates (column, row). We assign each declared tile to a variable in our Python program.
NOTE: The actual tile coordinates used on the device when the program is run may deviate from the ones declared here. For example, on the NPU on Ryzen™ AI (@device(AIEDevice.npu1)), these coordinates tend to be relative coordinates, as the runtime scheduler may assign the design to a different available column at runtime.
# Device declaration - here using aie2 device NPU
@device(AIEDevice.npu1)
def device_body():
    # Tile declarations
    ComputeTile1 = tile(1, 3)
    ComputeTile2 = tile(2, 3)
    ComputeTile3 = tile(2, 4)

Compute cores can be mapped to compute tiles. They can also be linked to external kernel functions that can then be called from within the body of the core; however, that is beyond the scope of this section and is explained later in the guide. In this example design, the compute core declares a local data tensor, iterates over it, and initializes each entry to zero.
data_size = 48
data_ty = np.ndarray[(data_size,), np.dtype[np.int32]]
# Compute core declarations
@core(ComputeTile1)
def core_body():
    local = buffer(ComputeTile1, data_ty, name="local")
    for i in range_(data_size):
        local[i] = 0

Once we are done declaring our blocks (and connections) within our design function, we move on to the main body of our program, where we call the function and output our design in MLIR. This is done by first declaring the MLIR context via the with mlir_mod_ctx() as ctx: line. This indicates that the subsequent indented Python code is in the MLIR context, and we follow this by calling our previously defined design function mlir_aie_design(). This means all the code within the design function is understood to be in the MLIR context and contains the IRON custom Python binding definitions of the more detailed MLIR block definitions. The final line, print(ctx.module), converts the Python-bound code defined in our MLIR context to its MLIR equivalent and prints it to stdout.
# Declares that subsequent code is in mlir-aie context
with mlir_mod_ctx() as ctx:
    mlir_aie_design()  # Call design function within the mlir-aie context
    print(ctx.module)  # Print the Python-to-MLIR conversion to stdout

In addition to the compute tiles, an AIE-array also contains data movers for accessing L3 memory (also called shim DMAs) and larger L2 scratchpads (called mem tiles), which have been available since the AIE-ML generation (see the introduction of this programming guide). Declaring these other types of structural blocks follows the same syntax but requires physical layout details for the specific target device. Shim DMAs typically occupy row 0, while mem tiles (when available) often reside on row 1. The following code segment declares all the different tile types found in a single NPU column.
# Device declaration - here using aie2 device NPU
@device(AIEDevice.npu1)
def device_body():
    # Tile declarations
    ShimTile = tile(0, 0)
    MemTile = tile(0, 1)
    ComputeTile1 = tile(0, 2)
    ComputeTile2 = tile(0, 3)
    ComputeTile3 = tile(0, 4)
    ComputeTile4 = tile(0, 5)
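
The row layout used in the tile declarations above can be captured in a small lookup, which also makes it easy to see why a coordinate such as (-1, 3) is not a valid location. This is a sketch only: tile_kind and NUM_ROWS are hypothetical names, and the layout assumed here (shim on row 0, mem tile on row 1, compute tiles on rows 2 to 5, one column) is simply the single-column layout shown above, not a general rule for every device.

```python
# Sketch only: tile_kind is a hypothetical helper, and the row layout is the
# single-column layout used in the tile declarations above (rows 0 to 5).
NUM_ROWS = 6


def tile_kind(col, row, num_cols=1):
    """Classify a (column, row) coordinate, or raise if it is out of range."""
    if not (0 <= col < num_cols and 0 <= row < NUM_ROWS):
        raise ValueError(f"tile({col}, {row}) is outside the array")
    if row == 0:
        return "shim"
    if row == 1:
        return "mem"
    return "compute"


kinds = [tile_kind(0, row) for row in range(NUM_ROWS)]
print(kinds)
```

Calling tile_kind(-1, 3) raises a ValueError in this sketch. Whether and when the real toolchain rejects such a coordinate is something the exercises below let you observe.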

- To run our Python program from the command line, we type python3 aie2.py, which converts our Python structural design into MLIR source code. This works from the command line if our design environment already contains the mlir-aie Python-bound dialect module. We included this in the Makefile, so go ahead and run make now. Then take a look at the generated MLIR source under build/aie.mlir.

- Run make clean to remove the generated files. In the worker's code (the core_fn), replace range_ with range (no underscore). What do you expect to happen? Investigate the generated code in build/aie.mlir and observe how the generated code changed.

- Run make clean again. Then introduce an error to the Python source, such as misspelling sequence as sequenc, and then run make again. What messages do you see?

- Run make clean again. Now change the error by renaming sequenc back to sequence, but place the Worker on a tile with coordinates (-1, 3), which is an invalid location. Run make again. What message do you see now?

- Run make clean again. Restore the Worker tile to its original coordinates. Remove the while_true=False attribute from the Worker and run make again. What do you observe?

- Now let's take a look at the placed version of the code. Run make placed and look at the generated MLIR source under build/aie_placed.mlir.

- Run make clean to remove the generated files. Introduce the same error as above by changing the coordinates of ComputeTile1 to (-1,3). Run make placed again. What message do you see now?

- No error is generated, but our code is invalid. Take a look at the generated MLIR code under build/aie_placed.mlir. This generated output is invalid MLIR syntax, and running our mlir-aie tools on this MLIR source will generate an error. We do, however, have some additional Python structural syntax checks that can be enabled if we use the function ctx.module.operation.verify(). This verifies that our Python-bound code has valid operations within the mlir-aie context.

  Qualify the print(ctx.module) call with a check on ctx.module.operation.verify() using a code block like the following:

      res = ctx.module.operation.verify()
      if res == True:
          print(ctx.module)
      else:
          print(res)

  Make this change and run make placed again. What message do you see now?
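
The verification check from the last exercise can also be wrapped in a small helper. The sketch below is hedged in two ways. First, render_if_valid and FakeModule are hypothetical names; FakeModule only stands in for ctx.module so that the example runs without mlir-aie installed. Second, it assumes that verify() either returns True on success or signals failure by returning something else or by raising an exception; which of these happens may depend on the version of the Python bindings in use.

```python
from types import SimpleNamespace


def render_if_valid(module):
    """Return the module text if verification passes, else a diagnostic string."""
    try:
        ok = module.operation.verify()
    except Exception as exc:  # assumption: failure may be reported by raising
        return f"verification failed: {exc}"
    return str(module) if ok is True else f"verification failed: {ok}"


class FakeModule:
    """Hypothetical stand-in so the sketch runs without mlir-aie installed."""

    def __init__(self, verify):
        self.operation = SimpleNamespace(verify=verify)

    def __str__(self):
        return "module { }"


def failing_verify():
    raise RuntimeError("tile(-1, 3) is out of range")


good = render_if_valid(FakeModule(lambda: True))
bad = render_if_valid(FakeModule(failing_verify))
print(good)
print(bad)
```

In a real design, the call would be print(render_if_valid(ctx.module)) inside the with mlir_mod_ctx() as ctx: block, in place of the plain print(ctx.module).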