dot: size kc for L2 stripe reuse, not L1 per-call stripe
`schedule_dot`'s k-block formula previously used `block_n` in the denominator,
which sized kc so that a per-kernel-call stripe (kc × block_n) fit in the
given cache. The real reuse pattern is different: inside an outer k
iteration the scheduler sweeps all (m, n) tiles, so the cross-m-iteration
stripe (kc × n) is what needs to stay resident in cache.
Changes here:
* schedule.cc — switch the k-block formula from `block_n` to the current
`n`. The `cache_size` argument now has its intended meaning (L2/SLC
budget, not L1 per-call footprint).
* schedule.cc — when k overflows the cache-sized kc_max, split into
`niter = ceil(k/kc_max)` near-equal iterations instead of full kc_max
iterations plus a short tail. Two reasons: small tails amortise kernel-call
overhead poorly, and a kc that nearly fills the cache with no headroom
causes the B stripe to spill into the outer cache, which has much
higher variance than a slightly smaller stripe that fits cleanly.
* schedule.cc — reserve 1/16 headroom from the cache budget before
computing kc_max. At the boundary where the stripe exactly fills the
budget, A/C tiles, TLB traffic, and prefetch interference evict parts
of the resident B stripe back into outer caches, hurting both mean and
run-to-run variance.
* base/arch.{h,cc} — new `get_l2_cache_size()` returning a GEMM cache
budget. When cpuinfo is available, iterates all reported L2 caches and
picks the one with the largest per-thread share (size /
processor_count), so on asymmetric systems (Apple M-series P+E, Arm
big.LITTLE) it deterministically selects the performance cluster. The
budget is max(per_thread * 2, total * 3/4):
- per_thread * 2 absorbs graceful spillover into outer SLC/L3.
- total * 3/4 recognises that on physically-shared L2 clusters
(M-series P-cluster, big.LITTLE P-cluster) a single-threaded GEMM
gets near-full L2 — the per-thread model under-counts. The 1/4
headroom covers A/C tiles, TLB, and prefetch interference near
capacity.
On per-core L2 systems (sharers=1) the first branch always dominates,
so behavior there is unchanged. Tuned for single-threaded latency; see
implementation comment for the multi-thread caveat. Fallback: 1 MiB.
* subgraph/dot.cc — replace the hardcoded `cache_size_l2 = 128 KiB` with
`get_l2_cache_size()`, addressing the TODO in the original code.
* schedule_bench.cc — add an `auto:<cache1>[,<cache2>...]` flag that
drives `schedule_dot` directly with user-specified cache sizes, so this
(and the previous) commit can be exercised from the CLI without
hand-crafting loops.
* schedule_test.cc — targeted tests covering the no-blocking case, kc
sizing from current n, the even-split policy at various k/kc_max
ratios, and the fits-exactly boundary. Expectations account for the
1/16 safety headroom.
Benchmarked on Apple M4 Pro, dot_fp32_sme2 square GEMMs, N=12 samples per
cell with randomised order and cooldowns: geomean +8.2% across n ∈
{64..8192} against master's 128 KiB L1-style schedule, with wins of
+5-20% across n ∈ {1280..3072} and +21%/+51% at n=7168/8192 where the
old schedule's 512-kc stripe no longer fits L2. No regressions outside
the noise floor.
All tests in //ynnpack/kernels/dot/... pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>