iree-org · tgymnich · Mar 10, 2026 · Mar 10, 2026 · Mar 12, 2026 · Mar 17, 2026
diff --git a/docs/index.rst b/docs/index.rst
@@ -27,6 +27,15 @@ API reference material
    wave/wave
 
 
+Design documentation
+====================
+
+.. toctree::
+   :maxdepth: 1
+   :caption: IR Design
+
+   ir_design
+
 Project documentation
 =====================
 

diff --git a/docs/ir_design.rst b/docs/ir_design.rst
@@ -0,0 +1,159 @@
+Vector Shapes and Hardware Constraints
+======================================
+
+This document describes the ``vector_shapes`` field on
+``#wave.hardware_constraint`` and how it relates to ``mma_type``,
+``elements_per_thread``, and the constraint system in the Water IR.
+
+
+Overview
+--------
+
+``vector_shapes`` is an optional ``DictionaryAttr`` on
+``#wave.hardware_constraint``.  Each entry maps a dimension name (a string
+matching a ``#wave.symbol``) to an integer specifying how many elements a
+single wave processes along that dimension in one instance of an operation
+after the expansion process has replicated it.
+
+.. code-block:: mlir
+
+   #wave.hardware_constraint<
+       threads_per_wave = 64,
+       waves_per_block = [2, 2, 1],
+       mma_type = #wave.mma_kind<f32_16x16x16_f16>,
+       vector_shapes = {M = 16, N = 16, K = 16},
+       max_bits_per_load = 128>
+
+``vector_shapes`` is the central piece of information the compiler uses to:
+
+* distribute work across threads within a wave,
+* determine how many elements each thread processes (``elements_per_thread``),
+* compute memory access strides, and
+* drive the expansion (unrolling) pass that replicates operations until the
+  wave tile is covered.
+
+
+Where vector_shapes comes from
+-------------------------------
+
+There are two cases, depending on whether ``mma_type`` is present.
+
+**When mma_type is set,** ``vector_shapes`` is derived from the MMA
+instruction geometry.  ``WaveMmaKindAttr::getShape`` returns the ``(M, N, K)``
+tile for the intrinsic and those sizes become the vector shape entries:
+
+.. code-block:: text
+
+   mma_type = f32_16x16x16_f16  →  getShape = (16, 16, 16)
+                                    vector_shapes = {M = 16, N = 16, K = 16}
+
+Additional entries may be provided for dimensions the MMA analysis does not
+cover (e.g. a batch dimension), and in that case both ``mma_type`` and explicit
+``vector_shapes`` coexist.
 if hardware_constraint.vector_shapes: 
     custom.vector_shapes.update(hardware_constraint.vector_shapes) 
+
+**When mma_type is absent,** ``vector_shapes`` is specified directly or derived
+from workgroup / tiling constraint tile sizes.  In either case it must be
+present for the compiler to proceed.
+
+In MLIR, ``vector_shapes`` entries must all be ``IntegerAttr`` values.  The
+verifier in ``WaveDialect.cpp`` enforces this.
+
+
+The special value 0
+^^^^^^^^^^^^^^^^^^^
+
+A vector shape of ``0`` marks a dimension as *scalar* — the wave does not tile
+along it.  This is used for dimensions like batch (``B``) that should not
+contribute to the intra-wave data distribution:
+
+.. code-block:: mlir
+
+   vector_shapes = {B = 0, M = 16, N = 16}
+
+
+Relationship to workgroup and tiling constraints
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``vector_shapes`` and constraint tile sizes serve different purposes:
+
+* **Tile size** (from ``#wave.workgroup_constraint`` or
+  ``#wave.tiling_constraint``) is the total amount of work assigned to one
+  workgroup or one iteration of a reduction loop along a dimension.
+* **Vector shape** is the amount of work one wave handles in a single instance
+  of an operation along that dimension.
+
+**When mma_type is present,** the vector shapes derive from the MMA geometry
+and are typically smaller than the constraint tile sizes.  The expansion pass
+(which runs on the Python/FX side) replicates each
+operation to cover the tile.  For example, with ``BLOCK_M = 64``,
+``waves_per_block = [2, 2, 1]``, and ``mma_type = f32_16x16x16_f16``
+(vector shape 16 for M):
+
+.. code-block:: text
+
+   expansion_count = ceil(64 / (2 × 16)) = 2
+
+The MLIR IR only sees the already-expanded result: two ``wave.mma`` ops along M
+rather than one.  The ``vector_shapes`` remain on the
+``#wave.hardware_constraint`` for verification and for passes that need to
+reason about the per-wave tile.
+
+**When mma_type is absent,** the MLIR verifier enforces that each
+``vector_shapes`` entry **does not exceed** the resolved tile size from the
+most specific constraint for that dimension (wave, then workgroup, then
+tiling).  Unlike MMA operations, which have a fixed size, elementwise
+operations can process any number of elements per thread and therefore do not
+need to be expanded multiple times.  An entry larger than the resolved tile
+size is a verification error:
+
+.. code-block:: mlir
+
+   // ERROR: vector_shapes entry 'M' (64) is greater than the resolved
+   //        tile size (32)
+   #wave.hardware_constraint<threads_per_wave = 64, vector_shapes = {M = 64}>
+
+In the common non-MMA case where ``vector_shapes`` equals the tile size, there
+is no separate expansion step and each operation appears exactly once per
+dimension.
+
+
+MMA kind and intrinsic shapes
+------------------------------
+
+``WaveMmaKindEnum`` enumerates hardware matrix multiply intrinsics.  Each
+variant encodes the output element type, tile shape (M×N×K), and input element
+type.  Examples:
+
+.. code-block:: mlir
+
+   #wave.mma_kind<f32_16x16x16_f16>               // (M=16, N=16, K=16)
+   #wave.mma_kind<f32_32x32x8_f16>                 // (M=32, N=32, K=8)
+   #wave.mma_kind<f32_16x16x128_f8f6f4>            // (M=16, N=16, K=128)
+
+``WaveMmaKindAttr::getShape(ctx, kind)`` returns the ``(M, N, K)`` tuple.
+
+The ``kind`` attribute on ``wave.mma`` may differ from the ``mma_type`` on the
+hardware constraint.  When ``kind`` is absent, the
+``PropagateDefaultsFromConstraints`` pass fills it from the hardware
+constraint's ``mma_type``.  When multiple ``wave.mma`` ops exist in the same
+function, each carries its own ``kind`` and its own effective vector shapes.
+
+
+Relationship to elements_per_thread
+-------------------------------------
+
+``elements_per_thread`` is an optional ``I64Attr`` on ``wave.read`` and
+``wave.write``.  It specifies how many contiguous elements a single thread
+loads or stores in one operation instance:
+
+.. code-block:: mlir
+
+   %0 = wave.read %mem { elements_per_thread = 8 }
+       : (!wave.tensor<[@M, @K] of f16, <global>>)
+       -> !wave.tensor<[@M, @K] of f16, <register>>
+
+``elements_per_thread`` is related to ``vector_shapes`` conceptually: the
+vector shape for a dimension gives the total elements a wave handles, and
+dividing by ``threads_per_wave`` (for a reduction dimension) or accounting for
+thread count per workgroup dimension gives the per-thread count.  The
+``PropagateElementsPerThread`` pass can infer ``elements_per_thread`` from the
+hardware constraint when it is not explicitly provided.
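The derivation and expansion arithmetic described in ``docs/ir_design.rst`` can be sketched in a few lines of Python. This is an illustrative model only, not the Wave implementation: `mma_shape`, `effective_vector_shapes`, and `expansion_count` are hypothetical helpers, the mnemonic regex is an assumption based on the three examples in the document, and the `M`/`N`/`K` dimension names are taken from the document's example. Scalar dimensions (vector shape `0`) are deliberately left out of the expansion helper.

```python
import re

# Hypothetical pattern for mnemonics such as "f32_16x16x16_f16":
# <output type>_<M>x<N>x<K>_<input type>. Assumed from the documented examples.
_MMA_KIND_RE = re.compile(r"^[a-z0-9]+_(\d+)x(\d+)x(\d+)_[a-z0-9]+$")


def mma_shape(kind):
    """Return the (M, N, K) tile encoded in an MMA kind mnemonic."""
    match = _MMA_KIND_RE.match(kind)
    if match is None:
        raise ValueError(f"unrecognized MMA kind: {kind!r}")
    m, n, k = (int(group) for group in match.groups())
    return (m, n, k)


def effective_vector_shapes(kind, explicit=None):
    """Start from the MMA-derived shapes, then merge explicit entries
    (for example a batch dimension the MMA geometry does not cover)."""
    m, n, k = mma_shape(kind)
    shapes = {"M": m, "N": n, "K": k}
    shapes.update(explicit or {})
    return shapes


def expansion_count(tile_size, waves_along_dim, vector_shape):
    """ceil(tile_size / (waves_along_dim * vector_shape)), in integers."""
    if waves_along_dim <= 0 or vector_shape <= 0:
        # Scalar dimensions (vector shape 0) are out of scope for this sketch.
        raise ValueError("waves_along_dim and vector_shape must be positive")
    per_step = waves_along_dim * vector_shape
    return (tile_size + per_step - 1) // per_step


shapes = effective_vector_shapes("f32_16x16x16_f16", {"B": 0})
# BLOCK_M = 64 with 2 waves along M and vector shape 16, as in the document.
print(expansion_count(64, 2, shapes["M"]))  # 2
```

The integer form of the ceiling division avoids floating-point rounding for large tile sizes.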
diff --git a/water/include/water/Dialect/Wave/IR/WaveAttrs.td b/water/include/water/Dialect/Wave/IR/WaveAttrs.td
@@ -237,18 +237,12 @@ def HardwareConstraintAttr : AttrDef<WaveDialect, "HardwareConstraint"> {
     configuration rather than fundamental hardware constraints.
   }];
   let parameters = (ins "unsigned":$threads_per_wave,
-                        OptionalArrayRefParameter<"unsigned">:$waves_per_block,
+                        OptionalParameter<"::mlir::DenseI32ArrayAttr">:$waves_per_block,
                         OptionalParameter<"::wave::WaveMmaKindAttr">:$mma_type,
                         OptionalParameter<"::mlir::DictionaryAttr">:$vector_shapes,
                         DefaultValuedParameter<"unsigned", "128">:$max_bits_per_load);
 
-  let assemblyFormat = [{
-    `<` `threads_per_wave` `=` $threads_per_wave
-    (`,` `waves_per_block` `=` `[` $waves_per_block^ `]`)?
-    (`,` `mma_type` `=` $mma_type^)?
-    (`,` `vector_shapes` `=` $vector_shapes^)?
-    (`,` `max_bits_per_load` `=` $max_bits_per_load^)? `>`
-  }];
+  let assemblyFormat = "`<` struct(params) `>`";
 
   let genVerifyDecl = 1;
 }

diff --git a/water/include/water/Dialect/Wave/IR/WaveInterfaces.h b/water/include/water/Dialect/Wave/IR/WaveInterfaces.h
@@ -558,8 +558,8 @@ class IndexExprsAnalysisInit {
       symbolConstraints;
 
   // Waves-per-block extracted from the hardware constraint or computed from
-  // wave constraints. Always stored here, even if copied from an attribute.
-  llvm::SmallVector<unsigned, 3> wavesPerBlock;
+  // wave constraints.
+  llvm::SmallVector<int32_t, 3> wavesPerBlock;
 };
 
 // Lattice for propagating index expressions across wave dialect operations.

diff --git a/water/include/water/Dialect/Wave/IR/WaveUtils.h b/water/include/water/Dialect/Wave/IR/WaveUtils.h
@@ -62,7 +62,7 @@ llvm::LogicalResult computeWavesPerBlockFromConstraints(
     const llvm::SmallDenseMap<wave::WaveSymbolAttr, wave::WaveConstraintAttr>
         &waveConstraints,
     wave::WaveHyperparameterAttr hyperparams,
-    llvm::SmallVectorImpl<unsigned> &wavesPerBlock);
+    llvm::SmallVectorImpl<int32_t> &wavesPerBlock);
 
 /// Permute the shape according to the mapping.
 void permuteShape(llvm::ArrayRef<wave::WaveSymbolAttr> shape,

diff --git a/water/include/water/c/Dialects.h b/water/include/water/c/Dialects.h
@@ -475,7 +475,7 @@ mlirAttributeIsAHardwareConstraintAttr(MlirAttribute attr);
 /// Creates a new HardwareConstraintAttr
 MLIR_CAPI_EXPORTED MlirAttribute mlirHardwareConstraintAttrGet(
     MlirContext mlirCtx, unsigned threadsPerWave, size_t wavesPerBlockSize,
-    unsigned *wavesPerBlock, MlirAttribute mmaType, MlirAttribute vectorShapes,
+    int32_t *wavesPerBlock, MlirAttribute mmaType, MlirAttribute vectorShapes,
     unsigned maxBitsPerLoad);
 
 /// Returns the typeID of a HardwareConstraintAttr.
@@ -486,7 +486,7 @@ MLIR_CAPI_EXPORTED unsigned
 mlirHardwareConstraintAttrGetThreadsPerWave(MlirAttribute attr);
 MLIR_CAPI_EXPORTED intptr_t
 mlirHardwareConstraintAttrGetNumWavesPerBlock(MlirAttribute attr);
-MLIR_CAPI_EXPORTED unsigned
+MLIR_CAPI_EXPORTED int32_t
 mlirHardwareConstraintAttrGetWavesPerBlockElem(MlirAttribute attr, intptr_t i);
 MLIR_CAPI_EXPORTED MlirAttribute
 mlirHardwareConstraintAttrGetMmaType(MlirAttribute attr);

diff --git a/water/lib/CAPI/Dialects.cpp b/water/lib/CAPI/Dialects.cpp
@@ -516,7 +516,7 @@ bool mlirAttributeIsAHardwareConstraintAttr(MlirAttribute attr) {
 
 MlirAttribute
 mlirHardwareConstraintAttrGet(MlirContext mlirCtx, unsigned threadsPerWave,
-                              size_t wavesPerBlockSize, unsigned *wavesPerBlock,
+                              size_t wavesPerBlockSize, int32_t *wavesPerBlock,
                               MlirAttribute mmaType, MlirAttribute vectorShapes,
                               unsigned maxBitsPerLoad) {
   MLIRContext *ctx = unwrap(mlirCtx);
@@ -525,9 +525,14 @@ mlirHardwareConstraintAttrGet(MlirContext mlirCtx, unsigned threadsPerWave,
   auto vectorShapesAttr =
       llvm::cast_if_present<DictionaryAttr>(unwrap(vectorShapes));
 
+  DenseI32ArrayAttr wavesPerBlockAttr;
+  if (wavesPerBlockSize > 0)
+    wavesPerBlockAttr = DenseI32ArrayAttr::get(
+        ctx, llvm::ArrayRef(wavesPerBlock, wavesPerBlockSize));
+
   return wrap(wave::HardwareConstraintAttr::get(
-      ctx, threadsPerWave, llvm::ArrayRef(wavesPerBlock, wavesPerBlockSize),
-      mmaTypeAttr, vectorShapesAttr, maxBitsPerLoad));
+      ctx, threadsPerWave, wavesPerBlockAttr, mmaTypeAttr, vectorShapesAttr,
+      maxBitsPerLoad));
 }
 
 MlirTypeID mlirWHardwareConstraintAttrGetTypeID() {
@@ -539,14 +544,15 @@ unsigned mlirHardwareConstraintAttrGetThreadsPerWave(MlirAttribute attr) {
       .getThreadsPerWave();
 }
 intptr_t mlirHardwareConstraintAttrGetNumWavesPerBlock(MlirAttribute attr) {
-  return llvm::cast<wave::HardwareConstraintAttr>(unwrap(attr))
-      .getWavesPerBlock()
-      .size();
+  DenseI32ArrayAttr wpb =
+      llvm::cast<wave::HardwareConstraintAttr>(unwrap(attr)).getWavesPerBlock();
+  return wpb ? wpb.size() : 0;
 }
-unsigned mlirHardwareConstraintAttrGetWavesPerBlockElem(MlirAttribute attr,
-                                                        intptr_t i) {
+int32_t mlirHardwareConstraintAttrGetWavesPerBlockElem(MlirAttribute attr,
+                                                       intptr_t i) {
   return llvm::cast<wave::HardwareConstraintAttr>(unwrap(attr))
-      .getWavesPerBlock()[i];
+      .getWavesPerBlock()
+      .asArrayRef()[i];
 }
 MlirAttribute mlirHardwareConstraintAttrGetMmaType(MlirAttribute attr) {
   return wrap(

diff --git a/water/lib/Dialect/Wave/IR/WaveAttrs.cpp b/water/lib/Dialect/Wave/IR/WaveAttrs.cpp
@@ -708,12 +708,11 @@ WaveExprListAttr::verify(function_ref<InFlightDiagnostic()> emitError,
 
 LogicalResult HardwareConstraintAttr::verify(
     function_ref<InFlightDiagnostic()> emitError, unsigned threadsPerWave,
-    ArrayRef<unsigned> wavesPerBlock, WaveMmaKindAttr mmaType,
+    DenseI32ArrayAttr wavesPerBlock, WaveMmaKindAttr mmaType,
     DictionaryAttr vectorShapes, unsigned maxBitsPerLoad) {
 
-  if (!(wavesPerBlock.empty() || wavesPerBlock.size() == 3))
-    return emitError() << "waves_per_block (" << wavesPerBlock
-                       << ") should have 3 elements";
+  if (wavesPerBlock && wavesPerBlock.size() != 3)
+    return emitError() << "waves_per_block should have 3 elements";
 
   if (vectorShapes) {
     for (NamedAttribute attr : vectorShapes) {

diff --git a/water/lib/Dialect/Wave/IR/WaveDialect.cpp b/water/lib/Dialect/Wave/IR/WaveDialect.cpp
@@ -222,6 +222,7 @@ verifyConstraints(ArrayAttr constraints,
   // * The number of workgroups should be greater than or equal to one.
   llvm::SmallDenseMap<wave::WaveSymbolAttr, int64_t> resolvedWorkgroupSizes(
       workgroupConstraints.size());
+  llvm::SmallDenseMap<wave::WaveSymbolAttr, int64_t> resolvedSizes;
   llvm::SmallDenseSet<wave::WaveWorkgroupDimAttr, 4> assignedDims;
   llvm::SmallDenseSet<wave::WaveWorkgroupDimAttr, 4> needsPrimaryDim;
   for (auto &&[symbol, constraint] : workgroupConstraints) {
@@ -250,6 +251,7 @@ verifyConstraints(ArrayAttr constraints,
 
     int64_t workgroupSize = evaluated->front();
     resolvedWorkgroupSizes[symbol] = workgroupSize;
+    resolvedSizes[symbol] = workgroupSize;
 
     std::optional<llvm::SmallVector<int64_t>> resolvedDims =
         wave::resolveSymbolNames(symbol, hyperparams);
@@ -310,15 +312,16 @@ verifyConstraints(ArrayAttr constraints,
                          << workgroupSize << " for dimension: " << symbol;
     }
     resolvedWaveCounts[symbol] = numWaves;
+    resolvedSizes[symbol] = resolvedWaveSize;
   }
 
   // verify consistency between wave constraints and waves_per_block
   // * If both wave constraints and waves_per_block are present, the computed
   // number of waves per dimension should match the waves_per_block attribute.
-  if (hardwareConstraint && !hardwareConstraint.getWavesPerBlock().empty() &&
+  if (hardwareConstraint && hardwareConstraint.getWavesPerBlock() &&
       !waveConstraints.empty()) {
-    llvm::ArrayRef<unsigned> wavesPerBlock =
-        hardwareConstraint.getWavesPerBlock();
+    llvm::ArrayRef<int32_t> wavesPerBlock =
+        hardwareConstraint.getWavesPerBlock().asArrayRef();
     for (auto &&[symbol, waveConstraint] : waveConstraints) {
       wave::WorkgroupConstraintAttr wgConstraint = workgroupConstraints[symbol];
       unsigned wgDim =
@@ -335,6 +338,8 @@ verifyConstraints(ArrayAttr constraints,
 
   // verify TilingConstraint
   // * The number of tiles should be greater than or equal to one.
+  llvm::SmallDenseMap<wave::WaveSymbolAttr, int64_t> resolvedTilingSizes(
+      tilingConstraints.size());
   for (auto &&[symbol, constraint] : tilingConstraints) {
     std::optional<llvm::SmallVector<int64_t>> evaluated =
         wave::evaluateMapWithHyperparams(constraint.getTileSize().getMap(),
@@ -351,6 +356,8 @@ verifyConstraints(ArrayAttr constraints,
            "failed to resolve dimesion symbol");
 
     int64_t resolvedTileSize = evaluated->front();
+    resolvedTilingSizes[symbol] = resolvedTileSize;
+    resolvedSizes[symbol] = resolvedTileSize;
     int64_t resolvedDim = resolvedDims->front();
     int64_t numTiles = resolvedDim / resolvedTileSize;
     if (numTiles < 1) {
@@ -359,6 +366,38 @@ verifyConstraints(ArrayAttr constraints,
     }
   }
 
+  // Verify consistency between constraints and vector_shapes (when mma_type
+  // is absent). Each vector_shapes entry must be less than or equal to the
+  // resolved tile size from the most specific constraint for that dimension:
+  // WaveConstraint > WorkgroupConstraint > TilingConstraint.
+  if (hardwareConstraint && hardwareConstraint.getVectorShapes() &&
+      !hardwareConstraint.getMmaType()) {
+    DictionaryAttr vectorShapes = hardwareConstraint.getVectorShapes();
+    for (NamedAttribute dimension : vectorShapes) {
+      llvm::StringRef symbolName = dimension.getName().getValue();
+      int64_t size = llvm::cast<IntegerAttr>(dimension.getValue()).getInt();
+
+      wave::WaveSymbolAttr symbol =
+          wave::WaveSymbolAttr::get(hyperparams.getContext(), symbolName);
+
+      auto it = resolvedSizes.find(symbol);
+
+      if (it == resolvedSizes.end()) {
+        // Batch dimensions may not be present in the resolved sizes map.
+        continue;
+      }
+
+      int64_t resolvedSize = it->second;
+
+      if (size > resolvedSize) {
+        return emitError() << "vector_shapes entry '" << symbolName << "' ("
+                           << size
+                           << ") is greater than the resolved tile size ("
+                           << resolvedSize << ") for dimension: " << symbol;
+      }
+    }
+  }
+
   return llvm::success();
 }
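
The `vector_shapes` check added to `verifyConstraints` above can be modeled in Python to make its behavior easier to reason about. This is a simplified sketch, not a port: `resolve_sizes` and `check_vector_shapes` are hypothetical names, tile sizes are passed in as plain dictionaries rather than resolved from hyperparameters, and the merge order follows the precedence stated in the code comment (wave over workgroup over tiling). The error text is abbreviated and omits the trailing dimension symbol.

```python
def resolve_sizes(workgroup_tiles, wave_tiles, tiling_tiles):
    """Merge per-dimension tile sizes so the most specific constraint wins:
    wave constraint > workgroup constraint > tiling constraint."""
    resolved = dict(tiling_tiles)
    resolved.update(workgroup_tiles)
    resolved.update(wave_tiles)
    return resolved


def check_vector_shapes(vector_shapes, resolved_sizes, has_mma_type=False):
    """Return the first error message, or None when the check passes."""
    if has_mma_type or not vector_shapes:
        # The check only applies when vector_shapes is set and mma_type is not.
        return None
    for name, size in vector_shapes.items():
        if name not in resolved_sizes:
            # Dimensions without a constraint (e.g. batch) are skipped.
            continue
        resolved = resolved_sizes[name]
        if size > resolved:
            return (
                f"vector_shapes entry '{name}' ({size}) is greater than "
                f"the resolved tile size ({resolved})"
            )
    return None


resolved = resolve_sizes({"M": 64}, {"M": 32}, {"K": 16})
print(check_vector_shapes({"M": 16}, resolved))  # None: 16 <= 32
print(check_vector_shapes({"M": 64}, resolved))  # error: 64 > 32
```

Note that an entry smaller than the resolved size passes: the check is an upper bound, not an equality.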