[Codegen][CPU] Flatten contiguous trailing dims of transfers before unrolling.

bjacob · claude · bjacob · commit cdc74bd9c275 · 2026-05-25T14:11:24.000-04:00
`VectorTransferLoweringPass` runs the MLIR transfer-lowering patterns with `maxTransferRank=1` plus full-unroll, fully unrolling any rank-N>1 transfer to one rank-1 transfer per outer index. For a packed tile whose trailing dim is a tiny contiguous chunk, that turns a single wide load into many narrow ones plus a shuffle chain to rebuild the wide register. Concretely, a bf16xbf16->f32 inner_tiled matmul (N=16, K_inner=2) loads each `<16x2xbf16>` RHS K-step as 16 separate `<2xbf16>` loads + a `vpermt2d`/`vpermt2q` chain -- ~3 cycles of extra work per K-step on top of the 29 dpbf16ps.

Apply `populateFlattenVectorTransferPatterns` *before* rank reduction, gated on the target's natural word size (the pointer size, via `DataLayout`): flatten only when the trailing dim is *sub-word*. Sub-word loads in bulk are pathological; word-and-up trailing dims (`<2xf32>` ... `<16xf32>`) are already good standalone loads, and flattening *them* fuses register-sized rows into an oversized 1-D transfer + a `vector.shape_cast` re-split, regressing whole-model .vmfb size. (Not `native_vector_size`: that is the *widest* useful vector, not the smallest non-pathological load.)

Measured: bf16 4096x4096 inner_tiled matmul on Zen 4, 80.8 -> 67.1 ms per fragment; combined with the m_bcst-fold broadcast routing in a sibling commit, the full matmul reaches ukernel parity (~50 ms). The `sdxl/clip_compstat_cpu` size guard is unchanged at 583k bytes / 2130 dispatches (golden 650k / 2130).

Test fallout: `transpose_mask` in vector_lowering now writes a constant `vector<4x2xi1>` mask as a single flat `vector<8xi1>` store; updated the CHECK lines.

Progress towards #24515.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
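
As a rough illustration of the sub-word gate described above, the decision can be sketched as a standalone predicate. The helper name `isSubWordTrailingDim` and its parameters are hypothetical and appear in neither MLIR nor IREE; the sketch also assumes the bitwidth passed to `populateFlattenVectorTransferPatterns` acts as a strict upper bound on the trailing dim's bitwidth, which is how this message describes the gate.

```cpp
#include <cstdint>

// Hypothetical helper, for illustration only: returns true when the trailing
// vector dim is narrower than the target word, i.e. when this commit's gate
// would allow the transfer to be flattened.
constexpr bool isSubWordTrailingDim(int64_t trailingElems, unsigned elemBits,
                                    unsigned wordBits) {
  // Assumption: the comparison is strict, so a trailing dim that is exactly
  // one word wide (e.g. <2xf32> on a 64-bit target) is left alone.
  return trailingElems * static_cast<int64_t>(elemBits) <
         static_cast<int64_t>(wordBits);
}

// <16x2xbf16>: trailing dim is 2 x 16 = 32 bits, below a 64-bit word.
static_assert(isSubWordTrailingDim(2, 16, 64), "sub-word: flatten");
// <2xf32>: trailing dim is exactly 64 bits, so it is not sub-word.
static_assert(!isSubWordTrailingDim(2, 32, 64), "word-sized: keep");
// <16x16xf32>: trailing dim is 512 bits, far above the word size.
static_assert(!isSubWordTrailingDim(16, 32, 64), "wide: keep");
```

Under these assumptions the `vector<4x2xi1>` mask in `transpose_mask` also qualifies (2 bits in the trailing dim), which is consistent with the updated CHECK lines.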
diff --git a/compiler/src/iree/compiler/Codegen/Common/BUILD.bazel b/compiler/src/iree/compiler/Codegen/Common/BUILD.bazel
@@ -234,6 +234,7 @@ iree_compiler_cc_library(
         "//compiler/src/iree/compiler/Dialect/Util/Transforms",
         "//compiler/src/iree/compiler/Utils",
         "//llvm-external-projects/iree-dialects:IREELinalgTransformDialect",
+        "@llvm-project//llvm:Core",
         "@llvm-project//llvm:Support",
         "@llvm-project//mlir:AMDGPUDialect",
         "@llvm-project//mlir:AMDGPUTransforms",
diff --git a/compiler/src/iree/compiler/Codegen/Common/CMakeLists.txt b/compiler/src/iree/compiler/Codegen/Common/CMakeLists.txt
@@ -180,6 +180,7 @@ iree_cc_library(
     ::PassHeaders
     ::PassesIncGen
     IREELinalgTransformDialect
+    LLVMCore
     LLVMSupport
     MLIRAMDGPUDialect
     MLIRAMDGPUTransforms
diff --git a/compiler/src/iree/compiler/Codegen/Common/VectorTransferLowering.cpp b/compiler/src/iree/compiler/Codegen/Common/VectorTransferLowering.cpp
@@ -5,6 +5,8 @@
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 
 #include "iree/compiler/Codegen/Common/Passes.h"
+#include "iree/compiler/Dialect/HAL/IR/HALTypes.h"
+#include "llvm/IR/DataLayout.h"
 #include "mlir/Conversion/VectorToSCF/VectorToSCF.h"
 #include "mlir/Dialect/Affine/IR/AffineOps.h"
 #include "mlir/Dialect/SCF/IR/SCF.h"
@@ -38,6 +40,32 @@ void VectorTransferLoweringPass::runOnOperation() {
   MLIRContext *ctx = &getContext();
   mlir::FunctionOpInterface funcOp = getOperation();
 
+  // Flatten contiguous trailing dims of multi-dim transfers when the trailing
+  // dim is narrower than the target's natural word (the pointer size), so a
+  // packed `<16x2xbf16>` (32-bit innermost) lowers to one wide load instead
+  // of 16 narrow loads the rank reduction below would reassemble with a
+  // chain of shuffles. Sub-word loads in bulk are uniformly pathological;
+  // word-and-up loads (`<2xf32>` ... `<16xf32>`) are already fine and
+  // flattening *them* fuses register-sized rows into an oversized 1-D
+  // transfer + a `vector.shape_cast` re-split (extracts), regressing whole-
+  // model .vmfb size for no benefit. This is *not* `native_vector_size`:
+  // that is the *widest* useful vector, not the smallest non-pathological
+  // load.
+  unsigned pointerBits = 64;
+  if (auto targetAttr = IREE::HAL::ExecutableTargetAttr::lookup(funcOp)) {
+    if (auto attr =
+            targetAttr.getConfiguration().getAs<StringAttr>("data_layout")) {
+      if (!attr.getValue().empty()) {
+        pointerBits = llvm::DataLayout(attr.getValue()).getPointerSizeInBits();
+      }
+    }
+  }
+  {
+    RewritePatternSet patterns(ctx);
+    vector::populateFlattenVectorTransferPatterns(patterns, pointerBits);
+    (void)applyPatternsGreedily(funcOp, std::move(patterns));
+  }
+
   RewritePatternSet patterns(ctx);
   // Explicitly materialize the mask on transfer_read/transfer_write.
   // Assume we don't have 4 GB vectors.
diff --git a/compiler/src/iree/compiler/Codegen/LLVMCPU/test/pipeline_pack_unpack_tests.mlir b/compiler/src/iree/compiler/Codegen/LLVMCPU/test/pipeline_pack_unpack_tests.mlir
@@ -80,6 +80,9 @@ module {
 // CHECK-LABEL:     func.func @aligned_unpack_generic
 // CHECK:             %[[SRC:.+]] = hal.interface.binding.subspan {{.*}} : memref<24x32x16x16xf32, #hal.descriptor_type<storage_buffer>>
 // CHECK:             %[[ASSUMED_SRC:.+]] = memref.assume_alignment %[[SRC]], 64
+// The unpack source tile is `vector<16x16xf32>`: its trailing dim is a full
+// 512-bit `vector<16xf32>`, so transfer flattening leaves it alone and plain
+// rank reduction lowers it to one `vector<16xf32>` load per row.
 // CHECK-COUNT-15:        vector.load %[[ASSUMED_SRC]]
 // CHECK:                 %[[LAST_LOAD:.+]] = vector.load %[[ASSUMED_SRC]]
 // CHECK:                 %[[IN_0:.+]] = vector.broadcast %{{.+}} : vector<16xf32> to vector<16x16xf32>
diff --git a/compiler/src/iree/compiler/Codegen/LLVMCPU/test/vector_lowering.mlir b/compiler/src/iree/compiler/Codegen/LLVMCPU/test/vector_lowering.mlir
@@ -155,7 +155,9 @@ func.func @transpose_mask() {
 //   CHECK-NOT:   vector.shuffle
 //   CHECK-DAG:   %[[MASK:.+]] = arith.constant dense<true>
 //   CHECK-DAG:   %[[OUTPUT:.+]] = hal.interface.binding.subspan
-//       CHECK:   vector.store %[[MASK]], %[[OUTPUT]]
+// VectorTransferLoweringPass flattens the contiguous 4x2 trailing dims of
+// the store into a single `vector<8xi1>` store over the collapsed memref.
+//       CHECK:   vector.store %[[MASK]], %{{.+}}
 
 // -----